Optimize Stage Number

Hi all,

I have already implemented the basic functionality of a P4 program on Tofino2, including IP lookup and port matching with a TCAM + SRAM bucket/bitmap design, and the program works correctly.

My current problem is that the stage usage is too high, and I want to optimize it.

In particular, for the SRAM path I use the high bits of the port to select a bucket and the low 5 bits to test one bit in a 32-bit bitmap. Right now this bit test is implemented as a long if-else chain that selects bitmap[0] … bitmap[31], once for the source port and once for the destination port. (In SRAM, I perform the bitmap check as shown in the figure below.)

code:
```p4
// -------------------------------------------------------------------
// Step 2: SRC Port Lookup (Stage 2)
// -------------------------------------------------------------------

// Try SRC TCAM path
src_tcam_table.apply();

// Try SRC SRAM path (parallel with TCAM)
src_sram_table.apply();

// Bitmap membership test for SRAM path
if (ig_md.src_sram_bucket_hit) {
    // Extract bitmap bit at position src_remainder using compact if-else
    if (ig_md.src_remainder == 5w0) ig_md.src_bitmap_bit = ig_md.src_bitmap[0:0];
    else if (ig_md.src_remainder == 5w1) ig_md.src_bitmap_bit = ig_md.src_bitmap[1:1];
    else if (ig_md.src_remainder == 5w2) ig_md.src_bitmap_bit = ig_md.src_bitmap[2:2];
    else if (ig_md.src_remainder == 5w3) ig_md.src_bitmap_bit = ig_md.src_bitmap[3:3];
    else if (ig_md.src_remainder == 5w4) ig_md.src_bitmap_bit = ig_md.src_bitmap[4:4];
    else if (ig_md.src_remainder == 5w5) ig_md.src_bitmap_bit = ig_md.src_bitmap[5:5];
    else if (ig_md.src_remainder == 5w6) ig_md.src_bitmap_bit = ig_md.src_bitmap[6:6];
    else if (ig_md.src_remainder == 5w7) ig_md.src_bitmap_bit = ig_md.src_bitmap[7:7];
    else if (ig_md.src_remainder == 5w8) ig_md.src_bitmap_bit = ig_md.src_bitmap[8:8];
    else if (ig_md.src_remainder == 5w9) ig_md.src_bitmap_bit = ig_md.src_bitmap[9:9];
    else if (ig_md.src_remainder == 5w10) ig_md.src_bitmap_bit = ig_md.src_bitmap[10:10];
    else if (ig_md.src_remainder == 5w11) ig_md.src_bitmap_bit = ig_md.src_bitmap[11:11];
    else if (ig_md.src_remainder == 5w12) ig_md.src_bitmap_bit = ig_md.src_bitmap[12:12];
    else if (ig_md.src_remainder == 5w13) ig_md.src_bitmap_bit = ig_md.src_bitmap[13:13];
    else if (ig_md.src_remainder == 5w14) ig_md.src_bitmap_bit = ig_md.src_bitmap[14:14];
    else if (ig_md.src_remainder == 5w15) ig_md.src_bitmap_bit = ig_md.src_bitmap[15:15];
    else if (ig_md.src_remainder == 5w16) ig_md.src_bitmap_bit = ig_md.src_bitmap[16:16];
    else if (ig_md.src_remainder == 5w17) ig_md.src_bitmap_bit = ig_md.src_bitmap[17:17];
    else if (ig_md.src_remainder == 5w18) ig_md.src_bitmap_bit = ig_md.src_bitmap[18:18];
    else if (ig_md.src_remainder == 5w19) ig_md.src_bitmap_bit = ig_md.src_bitmap[19:19];
    else if (ig_md.src_remainder == 5w20) ig_md.src_bitmap_bit = ig_md.src_bitmap[20:20];
    else if (ig_md.src_remainder == 5w21) ig_md.src_bitmap_bit = ig_md.src_bitmap[21:21];
    else if (ig_md.src_remainder == 5w22) ig_md.src_bitmap_bit = ig_md.src_bitmap[22:22];
    else if (ig_md.src_remainder == 5w23) ig_md.src_bitmap_bit = ig_md.src_bitmap[23:23];
    else if (ig_md.src_remainder == 5w24) ig_md.src_bitmap_bit = ig_md.src_bitmap[24:24];
    else if (ig_md.src_remainder == 5w25) ig_md.src_bitmap_bit = ig_md.src_bitmap[25:25];
    else if (ig_md.src_remainder == 5w26) ig_md.src_bitmap_bit = ig_md.src_bitmap[26:26];
    else if (ig_md.src_remainder == 5w27) ig_md.src_bitmap_bit = ig_md.src_bitmap[27:27];
    else if (ig_md.src_remainder == 5w28) ig_md.src_bitmap_bit = ig_md.src_bitmap[28:28];
    else if (ig_md.src_remainder == 5w29) ig_md.src_bitmap_bit = ig_md.src_bitmap[29:29];
    else if (ig_md.src_remainder == 5w30) ig_md.src_bitmap_bit = ig_md.src_bitmap[30:30];
    else ig_md.src_bitmap_bit = ig_md.src_bitmap[31:31];

    if (ig_md.src_bitmap_bit == 1) {
        ig_md.src_sram_bitmap_hit = true;
    }
}
```

I would like to ask:

  1. Is this bitmap bit-selection logic likely to be a major reason for high stage usage?
  2. What is the recommended way to optimize this kind of bitmap membership test on Tofino2?
  3. Are there common design patterns to reduce stage count for TCAM + SRAM + bitmap pipelines?

Any suggestions would be very helpful. Thanks!

The behavior of that if-then-else-if chain is exactly the behavior of a P4 table that has ig_md.src_remainder as an exact match key, and one table entry per condition. The actions you have are probably most easily represented as a differently-named action for each branch, because the only difference between them is the bit position of the bit slice. I would recommend trying out such a source code change and seeing whether it reduces the number of stages.
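A rough sketch of that table-based rewrite might look like the fragment below. It is meant to sit inside the same ingress control as the original code. The action and table names (`src_bit_N`, `src_bitmap_select`) are made up for illustration, and only the first and last of the 32 actions and entries are written out. The metadata fields are the ones from the original post. Whether this actually saves stages has to be checked against the compiler output.

```p4
// Hypothetical names: src_bit_N and src_bitmap_select are illustrative only.
// One action per bit position; they differ only in the slice index.
action src_bit_0()  { ig_md.src_bitmap_bit = ig_md.src_bitmap[0:0]; }
action src_bit_1()  { ig_md.src_bitmap_bit = ig_md.src_bitmap[1:1]; }
// ... src_bit_2() through src_bit_30() follow the same pattern ...
action src_bit_31() { ig_md.src_bitmap_bit = ig_md.src_bitmap[31:31]; }

table src_bitmap_select {
    key = { ig_md.src_remainder : exact; }
    actions = {
        src_bit_0; src_bit_1; /* ... src_bit_2 to src_bit_30 ... */ src_bit_31;
    }
    size = 32;
    // One static entry per possible value of the 5-bit remainder.
    const entries = {
        5w0  : src_bit_0();
        5w1  : src_bit_1();
        // ... 5w2 through 5w30 map to src_bit_2 through src_bit_30 ...
        5w31 : src_bit_31();
    }
}

// In the apply block, replacing the if-else chain:
if (ig_md.src_sram_bucket_hit) {
    src_bitmap_select.apply();
    if (ig_md.src_bitmap_bit == 1) {
        ig_md.src_sram_bitmap_hit = true;
    }
}
```

The same pattern would be repeated with a second table for the destination port.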

That's P4 Insight; we call it p4i.

Unfortunately Intel did not include P4 Insight as part of the open-p4studio repository, I would guess due to unwillingness to spend even more on employee salaries dedicated to that task. Note that I am grateful that Intel spent as much as they did to release what is part of open-p4studio today.

@1418915702,

It is quite difficult to say exactly what is going on without seeing the whole program.

The Tofino architecture and the Tofino compiler are pretty good at fitting multiple nested if() statements into a single stage, so I do not think this particular piece of code is the only problem, although it is pretty awkward. Most probably the big if() statement takes 2 stages and the next one takes one more.

There are many design patterns one can utilize to save on the number of stages, but they do require detailed understanding of Tofino architecture. The class recordings are available for purchase at P4ica Archives. I would recommend classes ICA-1141 and ICA-1142 for starters. Some of them are taught in the classes ICA-XFG. You can reach out to academy@p4ica.com for more details.

In addition to that, P4ica LLC offers consulting services, specifically in that area, helping with Tofino program optimization. You can reach out to consulting@p4ica.com for more information.

Given your discrepancy (19 stages vs. 12), I would highly recommend using their services. Very often, program fitting requires not only coding the algorithm in the most Tofino-friendly way, but also choosing the most Tofino-friendly algorithm in the first place. This type of work often requires collaboration. In general, C-like and even BMv2-like coding styles do not produce good results on Tofino.

Happy hacking,
Vladimir

Note also that the second if() statement can be merged into each of the actions in the table @andyfingerhut mentions…
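As a sketch of what that merge could look like, each per-bit action can write the final flag directly instead of going through `ig_md.src_bitmap_bit`. The names `src_test_bit_N` and `src_bitmap_test` are hypothetical. The fragment assumes that `ig_md.src_sram_bitmap_hit` is a `bool` that is false before this point and is not set anywhere else, since the action overwrites it with false when the bit is 0. It also assumes the compiler accepts the `bit<1>` to `bool` cast here, which I have not verified on Tofino2.

```p4
// Hypothetical names; the metadata fields are from the original post.
// Each action sets the final hit flag from its own bit, so no separate
// "if (ig_md.src_bitmap_bit == 1)" step is needed afterwards.
action src_test_bit_0()  { ig_md.src_sram_bitmap_hit = (bool) ig_md.src_bitmap[0:0]; }
action src_test_bit_1()  { ig_md.src_sram_bitmap_hit = (bool) ig_md.src_bitmap[1:1]; }
// ... src_test_bit_2() through src_test_bit_30() follow the same pattern ...
action src_test_bit_31() { ig_md.src_sram_bitmap_hit = (bool) ig_md.src_bitmap[31:31]; }

table src_bitmap_test {
    key = { ig_md.src_remainder : exact; }
    actions = {
        src_test_bit_0; src_test_bit_1; /* ... */ src_test_bit_31;
    }
    size = 32;
    const entries = {
        5w0  : src_test_bit_0();
        5w1  : src_test_bit_1();
        // ... 5w2 through 5w30 ...
        5w31 : src_test_bit_31();
    }
}

// In the apply block:
if (ig_md.src_sram_bucket_hit) {
    src_bitmap_test.apply();
}
```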

Happy hacking,
Vladimir

Dear @uboot,

Please look at my recent post. It explains how to obtain the same information (although in less graphical form) directly from the compiler.

I even wonder if it would be possible to ask an AI to recreate p4insight, since all the info is there :) (or can be recomputed)

Happy hacking,
Vladimir