Hi all, I’m fairly new to P4 and still wrapping my head around certain parts. I understand that the general gist is that the P4 compiler will take advantage of parallelization whenever possible. For example, if multiple lookups within one Match-Action Unit can be parallelized, the P4 compiler will take care of that.
I was wondering what typically happens when there is only one lookup table (either TCAM or SRAM) that is very large. For example, to keep it vendor-anonymous: if a vendor design has x MAUs and y TCAM blocks per MAU, what happens when I create one P4 lookup table that only uses lpm and fill it with more entries than can be stored in y TCAM blocks?
Is that even allowed? Will the P4 compiler somehow “split” the one logical TCAM lookup table across multiple MAUs and parallelize it somehow? In that case, how will the results be unified and decided once all the separate lookups have completed? Lastly, is the pipeline strictly linear, or can the input from the parser travel directly to MAU 2, for example, and do a lookup there without first passing through MAU 0 and MAU 1?
Thank you! Any guidance will be appreciated.
First, not all P4 implementations will necessarily have multiple stages in a pipeline. A P4 implementation could be a parallel array of CPU cores, with each packet going to one CPU core and executing to completion; when that CPU core is finished with its packet, the packet goes out and the core is assigned the next input packet. There is an unlimited number of ways to implement P4 across various target devices.
That said, your questions seem to be asking specifically about a multiple-stage hardware pipeline, with a limited amount of memory accessible in each pipeline stage.
In such cases, one way to split a large table into physically separate stages’ worth of memory is to associate a priority value with every entry.
The first stage with part 1 of the table entries does a lookup, and gets back the highest priority matching entry that is in that part of the table. This might not be the highest priority matching entry in the entire table, though, so its action is NOT executed yet. Instead the hardware remembers the priority value and some kind of index to the matching entry, for possible later execution of its action.
The second stage, containing part 2 of the table entries, does another lookup with the same search key, and gets back the highest priority matching entry that is in part 2 of the entries. If its priority is higher than the one found in the earlier stage, its priority and index replace the one remembered. If it is not higher, then the new search result is ignored.
Continue that way for as many stages as you wish (or have available). When the last stage with part of the entries is done, you now know the highest priority matching entry, and its index, and can look up that index in some memory that contains the desired action and action parameters, and perform the action.
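The merge scheme described above can be sketched in software. This is a toy Python model (not P4, and not any vendor’s actual implementation): entry layouts, table contents, and the use of prefix length as the priority value are all illustrative assumptions.

```python
# Toy model of a large LPM table split into per-stage parts. Each entry
# carries a priority; each "stage" finds its best local match, and the
# highest-priority match across all parts wins before any action runs.

def stage_lookup(part, key):
    """Return (priority, action) for the best match in this part, or None."""
    best = None
    for prio, prefix, plen, action in part:
        # LPM-style match: compare the top plen bits of a 32-bit key.
        if (key >> (32 - plen)) == (prefix >> (32 - plen)):
            if best is None or prio > best[0]:
                best = (prio, action)
    return best

def split_table_lookup(parts, key):
    winner = None                      # (priority, action) remembered so far
    for part in parts:                 # one iteration per pipeline stage
        found = stage_lookup(part, key)
        if found and (winner is None or found[0] > winner[0]):
            winner = found             # later stage found a better match
    return winner[1] if winner else "default_action"

# Entries: (priority, prefix, prefix_len, action). For LPM, the priority
# can simply be the prefix length, so the longest prefix wins.
part1 = [(8,  0x0A000000, 8,  "to_port_1")]   # 10.0.0.0/8
part2 = [(24, 0x0A000100, 24, "to_port_2")]   # 10.0.1.0/24

print(split_table_lookup([part1, part2], 0x0A000105))  # 10.0.1.5 -> to_port_2
print(split_table_lookup([part1, part2], 0x0A020305))  # 10.2.3.5 -> to_port_1
```

Note that part1’s entry matches the first key too, but its remembered result is replaced when the later part finds a higher-priority match, exactly as in the stage-by-stage description above.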
Thank you for your response Andy! It was very helpful. Two minor clarifications:
Is the priority-value splitting you mentioned something the P4 compiler does automatically if the implementation supports it? Or, as the P4 programmer, will I have to create multiple separate lookup tables manually?
Using the example you mentioned, with the first and second stages holding parts 1 and 2 of the table: is it possible to “parallelize” the two MAU-stage lookups so that completing the lookup and resolving the match takes one clock cycle, or does the pipeline have to flow linearly, so that a lookup split across two MAUs takes two clock cycles?
As always, thank you so much!
Response to your point number 1: Tofino’s P4 compiler can split single P4 tables in your source code across multiple stages automatically. No need for you to do it manually yourself.
Response to your point number 2: The two lookups are definitely pipelined, so doing things this way definitely does not lower packet-rate performance. Having them in separate stages does add to the latency of packet processing, but not by much.
Point number 1 makes sense now!
A minor bit of confusion regarding point 2:
When you say the two lookups are definitely pipelined, does that mean they are pipelined within one pipe across multiple stages (horizontally), or pipelined across multiple pipes (vertically)? I think my confusion is that you say it definitely does not lower packet-rate performance, yet it does add to the latency of packet processing. How can both be true? I might be misunderstanding something; apologies!
They are pipelined across (usually) consecutive stages in the same pipeline. If the pipeline is fixed depth, this does not increase the latency, but if there are “early exit points” from the pipeline, as some pipelines have, then using more stages for useful processing work will increase the latency.
In Tofino at least, and in the RMT architecture as described in research papers, because all stages access only on-chip SRAM and TCAM, each stage can sustain a throughput of 1 packet per clock cycle, guaranteed 100% of the time. This is true regardless of the number of stages in the pipeline.
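The throughput-versus-latency distinction above can be made concrete with a toy clock-level model. This is a back-of-the-envelope sketch, not a model of any real chip; the stage and packet counts are arbitrary illustration values.

```python
# Toy model of a fixed-rate pipeline: packet i enters at clock i and
# leaves num_stages clocks later. Adding stages raises per-packet
# latency, but a new packet still completes every clock at steady state.

def run_pipeline(num_stages, num_packets):
    """Return (per_packet_latency, total_clocks) for a full run."""
    per_packet_latency = num_stages
    total_clocks = num_stages + num_packets - 1  # fill + drain overlap
    return per_packet_latency, total_clocks

lat_one, total_one = run_pipeline(num_stages=1, num_packets=1000)
lat_two, total_two = run_pipeline(num_stages=2, num_packets=1000)

# Latency doubles when the table is split over two stages...
print(lat_one, lat_two)      # 1 2
# ...but total time for 1000 packets barely changes: ~1 clock/packet.
print(total_one, total_two)  # 1000 1001
```

That is why both statements hold at once: splitting a table over two stages adds one stage of latency per packet, while the packet rate stays at one packet per clock.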
If I’m understanding correctly, on Tofino for example, the large TCAM table would be split across two stages, say MAU 1 and MAU 2, and one logical lookup into the original TCAM table would become two sequential lookups, first into MAU 1 and then into MAU 2, so the lookup itself would take twice as long?
Is there any way to parallelize this on Tofino, so that one large TCAM-table lookup takes the same amount of time as a normal TCAM lookup in a smaller table that fits in one stage? Or is the added latency of two sequential lookups just the unavoidable cost of using such a large TCAM table? I really appreciate your time and help.
So, regarding your phrase “twice as long”:
What are you trying to optimize or avoid here?
If the throughput for a table small enough to be done in one MAU stage is N billion packets/second, and the throughput when a larger table is split over two MAU stages is also N billion packets/second, so what if it takes “twice as long”?
I’m trying to conceptually map an algorithm I have in mind onto an n-stage pipeline such as Tofino. The basic structure of the algorithm is that it starts with one lookup into a large TCAM table and then spends the rest of the stages doing lookups/matches in SRAM. For example, if the vendor-specific switch has 10 MAUs, the first stage would do the lookup into the large TCAM table and the other 9 subsequent stages would run the rest of the algorithm, so it takes 10 stages total. The problem I’m trying to avoid: if one MAU doesn’t have enough TCAM to hold the lookup table and the table must be split into, say, 3 stages, then I would have to adapt the rest of the algorithm to fit into 7 stages instead of the current 9. Or is there a way for the large TCAM lookup to take up only one stage, so 9 stages remain available for use? Does this help clarify it?
Are the 9 stages of subsequent lookups dependent on the result of the TCAM table lookup at the beginning? If yes, then I agree, having the large TCAM table lookup split over multiple stages will be an issue for you, as it will increase the number of stages required.
If the 9 stages of subsequent lookups are independent of the result of the TCAM table lookup, then it should be possible for the later parts of the TCAM lookup table to be done in parallel with the SRAM table lookups, because each MAU stage can do multiple table lookups in parallel (on different tables).
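The dependent/independent distinction can be illustrated with a toy stage-assignment sketch. This is a simplified Python model, not how Tofino’s compiler actually allocates stages; the table names and dependency sets are invented for illustration.

```python
# Toy stage assignment: each table is placed in the earliest stage after
# everything it depends on, and independent tables may share a stage
# (an MAU stage can run multiple table lookups in parallel).

def assign_stages(deps):
    """deps: {table: [tables it depends on]} -> {table: stage number}."""
    stage = {}
    def place(t):
        if t not in stage:
            stage[t] = 1 + max((place(d) for d in deps[t]), default=0)
        return stage[t]
    for t in deps:
        place(t)
    return stage

# Dependent case: the first SRAM table needs the TCAM result, and the
# TCAM table is split into 3 parts that must run one after another.
dependent = {
    "tcam_p1": [], "tcam_p2": ["tcam_p1"], "tcam_p3": ["tcam_p2"],
    "sram_1": ["tcam_p3"], "sram_2": ["sram_1"],
}
# Independent case: the SRAM chain doesn't use the TCAM result, so the
# later TCAM parts can run alongside it in the same stages.
independent = {
    "tcam_p1": [], "tcam_p2": ["tcam_p1"], "tcam_p3": ["tcam_p2"],
    "sram_1": [], "sram_2": ["sram_1"],
}

print(max(assign_stages(dependent).values()))    # 5 stages needed
print(max(assign_stages(independent).values()))  # 3 stages needed
```

In the independent case, the 3-way TCAM split costs no extra stages overall; in the dependent case, every extra TCAM part pushes the whole SRAM chain back by one stage.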
I see now it would be the dependent case so it will impact the algorithm.
On a related note, pipeline architectures containing MAU units are commonly described as having x TCAM blocks and y SRAM pages per MAU. Do those x and y amounts of TCAM/SRAM refer to the memory available solely for the key portion of a key/action/action-data table? As an example, if an MAU is described as having 100 SRAM pages available and my algorithm uses 100 SRAM pages’ worth of keys plus an additional 20 pages of action data, is that fine? Or does the action-data SRAM also need to be factored in, so that key + action data must not exceed 100 pages total?
For your last question, if you are asking about Tofino details, I would recommend contacting Intel support via one of the methods described in this answer: Intel(r) Tofino(tm) Family, TNA, and P4Studio questions - #2