Optimize Stage Number

Hi all,

I have already implemented the basic functionality of a P4 program on Tofino2, including IP lookup and port matching with a TCAM + SRAM bucket/bitmap design, and the program works correctly.

My current problem is that the stage usage is too high, and I want to optimize it.

In the SRAM path, the high bits of the port select a bucket and the low 5 bits select one bit in that bucket's 32-bit bitmap. The bit test is currently a long if-else chain choosing among bitmap[0] … bitmap[31], once for the source port and once for the destination port. (The SRAM bitmap check is shown in the figure below.)

code:
```p4
// -------------------------------------------------------------------
// Step 2: SRC Port Lookup (Stage 2)
// -------------------------------------------------------------------

// Try SRC TCAM path
src_tcam_table.apply();

// Try SRC SRAM path (parallel with TCAM)
src_sram_table.apply();

// Bitmap membership test for SRAM path
if (ig_md.src_sram_bucket_hit) {
    // Extract bitmap bit at position src_remainder using compact if-else
    if (ig_md.src_remainder == 5w0) ig_md.src_bitmap_bit = ig_md.src_bitmap[0:0];
    else if (ig_md.src_remainder == 5w1) ig_md.src_bitmap_bit = ig_md.src_bitmap[1:1];
    else if (ig_md.src_remainder == 5w2) ig_md.src_bitmap_bit = ig_md.src_bitmap[2:2];
    else if (ig_md.src_remainder == 5w3) ig_md.src_bitmap_bit = ig_md.src_bitmap[3:3];
    else if (ig_md.src_remainder == 5w4) ig_md.src_bitmap_bit = ig_md.src_bitmap[4:4];
    else if (ig_md.src_remainder == 5w5) ig_md.src_bitmap_bit = ig_md.src_bitmap[5:5];
    else if (ig_md.src_remainder == 5w6) ig_md.src_bitmap_bit = ig_md.src_bitmap[6:6];
    else if (ig_md.src_remainder == 5w7) ig_md.src_bitmap_bit = ig_md.src_bitmap[7:7];
    else if (ig_md.src_remainder == 5w8) ig_md.src_bitmap_bit = ig_md.src_bitmap[8:8];
    else if (ig_md.src_remainder == 5w9) ig_md.src_bitmap_bit = ig_md.src_bitmap[9:9];
    else if (ig_md.src_remainder == 5w10) ig_md.src_bitmap_bit = ig_md.src_bitmap[10:10];
    else if (ig_md.src_remainder == 5w11) ig_md.src_bitmap_bit = ig_md.src_bitmap[11:11];
    else if (ig_md.src_remainder == 5w12) ig_md.src_bitmap_bit = ig_md.src_bitmap[12:12];
    else if (ig_md.src_remainder == 5w13) ig_md.src_bitmap_bit = ig_md.src_bitmap[13:13];
    else if (ig_md.src_remainder == 5w14) ig_md.src_bitmap_bit = ig_md.src_bitmap[14:14];
    else if (ig_md.src_remainder == 5w15) ig_md.src_bitmap_bit = ig_md.src_bitmap[15:15];
    else if (ig_md.src_remainder == 5w16) ig_md.src_bitmap_bit = ig_md.src_bitmap[16:16];
    else if (ig_md.src_remainder == 5w17) ig_md.src_bitmap_bit = ig_md.src_bitmap[17:17];
    else if (ig_md.src_remainder == 5w18) ig_md.src_bitmap_bit = ig_md.src_bitmap[18:18];
    else if (ig_md.src_remainder == 5w19) ig_md.src_bitmap_bit = ig_md.src_bitmap[19:19];
    else if (ig_md.src_remainder == 5w20) ig_md.src_bitmap_bit = ig_md.src_bitmap[20:20];
    else if (ig_md.src_remainder == 5w21) ig_md.src_bitmap_bit = ig_md.src_bitmap[21:21];
    else if (ig_md.src_remainder == 5w22) ig_md.src_bitmap_bit = ig_md.src_bitmap[22:22];
    else if (ig_md.src_remainder == 5w23) ig_md.src_bitmap_bit = ig_md.src_bitmap[23:23];
    else if (ig_md.src_remainder == 5w24) ig_md.src_bitmap_bit = ig_md.src_bitmap[24:24];
    else if (ig_md.src_remainder == 5w25) ig_md.src_bitmap_bit = ig_md.src_bitmap[25:25];
    else if (ig_md.src_remainder == 5w26) ig_md.src_bitmap_bit = ig_md.src_bitmap[26:26];
    else if (ig_md.src_remainder == 5w27) ig_md.src_bitmap_bit = ig_md.src_bitmap[27:27];
    else if (ig_md.src_remainder == 5w28) ig_md.src_bitmap_bit = ig_md.src_bitmap[28:28];
    else if (ig_md.src_remainder == 5w29) ig_md.src_bitmap_bit = ig_md.src_bitmap[29:29];
    else if (ig_md.src_remainder == 5w30) ig_md.src_bitmap_bit = ig_md.src_bitmap[30:30];
    else ig_md.src_bitmap_bit = ig_md.src_bitmap[31:31];

    if (ig_md.src_bitmap_bit == 1) {
        ig_md.src_sram_bitmap_hit = true;
    }
}
```

I would like to ask:

  1. Is this bitmap bit-selection logic likely to be a major reason for high stage usage?
  2. What is the recommended way to optimize this kind of bitmap membership test on Tofino2?
  3. Are there common design patterns to reduce stage count for TCAM + SRAM + bitmap pipelines?

Any suggestions would be very helpful. Thanks!

That if/else-if chain behaves exactly like a P4 table with ig_md.src_remainder as an exact-match key and one table entry per condition. The branch bodies are probably easiest to express as a separately named action per branch, because the only difference between them is the bit position of the slice. I would recommend making that source change and checking whether it reduces the number of stages.
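
A rough sketch of what that table could look like is below. The ig_md field names come from the question. The struct, control, action, and table names (bitmap_md_t, SrcBitSelect, src_pick_bit_N, src_bit_select) are made up for illustration, and the field widths and types are assumptions. Only four of the 32 actions and entries are written out; the rest follow the same pattern. Whether this actually saves stages depends on how the compiler places it, so treat it as something to try rather than a guaranteed fix.

```p4
#include <core.p4>   // provides the `exact` match kind

// Hypothetical stand-in for the metadata struct in the question.
// Field names are from the post; widths and types are assumed.
struct bitmap_md_t {
    bit<5>  src_remainder;
    bit<32> src_bitmap;
    bit<1>  src_bitmap_bit;
    bool    src_sram_bucket_hit;
    bool    src_sram_bitmap_hit;
}

control SrcBitSelect(inout bitmap_md_t ig_md) {
    // One action per bit position: the slice indices are the only difference.
    action src_pick_bit_0()  { ig_md.src_bitmap_bit = ig_md.src_bitmap[0:0]; }
    action src_pick_bit_1()  { ig_md.src_bitmap_bit = ig_md.src_bitmap[1:1]; }
    // ... src_pick_bit_2 through src_pick_bit_29 follow the same pattern ...
    action src_pick_bit_30() { ig_md.src_bitmap_bit = ig_md.src_bitmap[30:30]; }
    action src_pick_bit_31() { ig_md.src_bitmap_bit = ig_md.src_bitmap[31:31]; }

    table src_bit_select {
        key = { ig_md.src_remainder : exact; }
        actions = {
            src_pick_bit_0;
            src_pick_bit_1;
            // ... src_pick_bit_2 through src_pick_bit_29 ...
            src_pick_bit_30;
            src_pick_bit_31;
        }
        // Static entries, one per remainder value, so the control plane
        // does not need to install anything.
        const entries = {
            5w0  : src_pick_bit_0();
            5w1  : src_pick_bit_1();
            // ... 5w2 through 5w29 ...
            5w30 : src_pick_bit_30();
            5w31 : src_pick_bit_31();
        }
        size = 32;
    }

    apply {
        if (ig_md.src_sram_bucket_hit) {
            // Replaces the 32-way if/else-if chain with a single lookup.
            src_bit_select.apply();
            if (ig_md.src_bitmap_bit == 1) {
                ig_md.src_sram_bitmap_hit = true;
            }
        }
    }
}
```

The destination port would need its own copy of the table and actions operating on the corresponding dst fields. In the real program the actions and table would more likely live directly in the ingress control next to src_sram_table; the separate control here just keeps the sketch self-contained.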