I was wondering if there is a way to implement and use SRAM bits for range matches in P4.
The approaches for representing range matches with SRAM not in P4 that I could think of are:
- Use a bitmap array and flip bits to 1 for values covered by the range
- Store range endpoints (min, max) and perform a comparison check against them to determine if in the range
- Expand range into separate exact match values
My understanding of P4 is that it does things in a Match Action approach. I’m a bit stuck on how to possibly implement range matches using SRAM here. It seems that Option 3 is the easiest here as it fits perfectly with the match action table approach. One would just expand a range wildcard entry such as 192.0.. into a bunch of exact match entries however this is obviously not very scalable. I’m asking because I’ve found many research papers in academia about new SRAM approaches for classification but I am unsure if they are portable to P4.
P4 defines the semantics of what the different match kinds mean, and how they should behave when you apply a table with an arbitrary set of entries installed by the control plane, but it does not specify how an implementation must perform these tasks.
I would say that any approach that achieves the desired price/power/performance/feature tradeoffs that are acceptable to at least some users is “portable to P4”. I would probably even go so far as to say that aproaches that do NOT achieve tradeoffs acceptable to some users are “portable to P4”, but it is unlikely that anyone would want to use those approaches.
I see. If I’m understanding correctly, are you saying something along the lines of this:
It’s completely dependent on the implementation for how SRAM-based range is actually implemented. As long as it can be represented in some form in P4 and there exists some implementation that at the lower level uses SRAM while achieving the desired specs, it will technically be portable to P4. So using an example of representing a 16 bit src port range value for Ports 0-65535, as long as we are able to write in P4, 0-65535 as a range match table entry and there exists some implementation that will take this P4 range entry and make it work (can possibly use either of the aforementioned 3 SRAM approaches. Is this correct?
As a follow-up then, from your personal experience, is there an SRAM implementation for P4 ranges that you have commonly seen that is used? Or are P4 range types usually/always using TCAM in the implementation.
If you want deterministic packet processing performance with low latency, then the most common techniques I have seen are the following:
- use a plain old regular TCAM, and encode ranges using potentially many TCAM entries as described here: https://github.com/jafingerhut/p4-guide/tree/master/match-range-using-tcam
Advantages: Can be implemented with off-the-shelf TCAM hardware designs, with no customization or changes required.
Disadvantages: The number of TCAM entries required to represent a single rule with ranges on fields can be quite large in the worst case.
-
Use #1 as a fallback, but add extra hardware before the TCAM that for a small finite number R of them, e.g. R=8 or R=16, calculate an R-bit vector indicating whether the desired fields are in that range (then the bit is 1), or outside of that range (then the bit value is 0), and feed this R-bit vector as an additional field into the TCAM. This enables some ranges to be encoded with a single TCAM entry that would require a large number of TCAM entries with approach #1, so helps reduce TCAM usage in the typical/average case, but the worst case is almost as bad as #1.
-
Create a customized TCAM design where a limited set of fields have not only value/match hardware logic, but also min/max comparison logic.
Advantages: Range entries require only one TCAM entry.
Disadvantages: Custom TCAM design that most ASIC foundries don’t include in their standard libraries off the shelf, so you probably have to design it yourself, which can take a significant number of person-years if you want to optimize it for ASIC die area and power.
To clarify and provide some further context, the thought exercise I’m doing is the following:
Nowadays there are chips like Tofino which come with ample SRAM and TCAM. If I’m doing a packet classification match for a IPv6 source address field against ranges in a huge table with lots of entries such that there is not enough TCAM, is there a way to utilize SRAM as well to offload some of these ranges? Is there a way we can recreate/simulate TCAM range matching with SRAM in a scalable way besides outright expanding TCAM wildcard entries into many SRAM exact match entries in a 1-to-many naïve approach?
I feel like I’m missing something or just being dumb because I see so many SRAM packet classification approaches that match against rules that are constructed from ranges but I cannot wrap my head around how to translate/recreate these algorithms in P4 without changing from SRAM to TCAM. I’m sorry if this is making things more confusing.
So high level there are 2 possible approaches you might be wanting to follow:
-
For a particular hardware implementation, e.g. Tofino1, how can I scale range matching to larger tables by using SRAM in the way that Tofino1 lets P4 developers use SRAM, which is most often via P4 tables with all exact match keys.
-
If I was designing new hardware, what algorithms are available that could help scale range matching to larger tables by using a combination of TCAM, SRAM, and/or new custom hardware logic that I pay a team of hardware designers to create?
Which of these approaches are you going for?
For #1, you can’t change an existing hardware design, but must live with the ways to access TCAM and SRAM that the existing chip has.
#2 allows many more possibilities to be considered, but the disadvantage is that unless someone designs a chip that works that way, it is academic research.
1 Like
Note: I would bet a significant amount of money that if you are searching through research papers looking for what they suggest in this area, most of them follow approach #2, without regard for whether their ideas can be implemented in a particular existing chip or not.
If the research paper says “we implemented this in vendor ASIC Whizbang78B …”, then I would bet that they have actually tried to live within Whizbang78B’s existing hardware design and that approach does so. (in case it is not obvious, I am making up the name Whizbang78B – I don’t know of any such ASIC)
1 Like
I see. So using Tofino as the example here and following the approach 1 you detailed above about being limited to/living within the way of the existing chip, if for example Tofino uses SRAM mainly for P4 tables with all exact match keys and our goal is to take approach 1 and not design new hardware, then using SRAM for range matching is simply not possible because of the hardware limitation. So in that case, our primary option would be to force SRAM usage by taking advantage and transforming our TCAM approach into whatever SRAM usage is supported by Tofino. So since it uses SRAM for tables w/ exact matches, we would have to convert TCAM wildcard entries via expansion to many exact matches in this case. Is this better?
I would not say “it is not possible in Tofino”, I would rather say that one might have to get creative with using multiple ternary and/or exact match tables in series, and perhaps fancy control plane code for deciding what entries to install in them, in order to achieve more scalable range matching behavior. I don’t have a particular approach in mind here to give you at this time. I am saying that claiming “it is impossible” is a pretty big claim that I suspect is not actually true, because people can be pretty ingenious when it comes to having 4 or 5 table lookups in a sequence, with data dependencies between them and fancy control plane software.
1 Like
That totally makes sense now. Thanks for the clarification. Much appreciated.