TCAM Block Question

Hi all,

I have a more low-level conceptual question regarding TCAM. It seems that a common building block/unit of granularity of TCAM are TCAM blocks of fixed width x height. My understanding of TCAM is that it performs parallel lookups (like normal CAM) while supporting wildcards too. Assume that you have a Match Action Table that completely/perfectly fills up one TCAM block. My understanding (please confirm) is that in this case, all the TCAM’s rows will be matched against the search key in parallel with the key width limit being the TCAM block width.

With this, I wanted to ask - is TCAM fragmentation a prominent issue?
Take this scenario: We have two large TCAM blocks available each of height y and width x bits. We also have 2 completely independent Match-Action Tables each containing 3 entries. Assuming the TCAM dimensions are large say x=50, y=200, will the 2 Match-Action Tables be mapped to separate TCAM blocks or are they able to coexist on the same TCAM block. In other words, will the TCAM usage be fragmented across 2 TCAM blocks thus wasting space in each block or can they be condensed and stored together on just 1 TCAM block by occupying TCAM rows sequentially. I drew up this diagram to illustrate this situation:

The answer as in most things is “it depends”.

There is a very easy way to enable multiple logical ternary tables to share a single physical TCAM block – introduce a new hidden field that isn’t part of the P4 program that represents a “logical table id”, e.g. 4 bits wide if you want to enable 16 logical tables to share the same physical TCAM, and in every entry for logical table X, make those bits of the entry exact match on all 4 bits of value X.

Other entries from other logical tables can share that same physical TCAM, by having their entries exact match in those 4 bit positions with a different value Y != X.

That makes sense! From your experience, does this usually manifest itself as an invisible feature that P4 programmers don’t need to perform manually (maybe the P4 compiler or environment automatically adds this hidden field behind the scenes) to condense multiple logical table or does this usually require some manual intervention by the P4 programmer where he/she modifies the P4 table definition by adding another logical table id field into the code itself.

Also, to clarify, this may be a more fundamental TCAM question (my understanding of TCAM is very shallow so apologies), can a subset of a single physical TCAM block be selectively queried or are TCAM blocks queried as an entire unit? So assuming the lack of a hidden field or logical table id to distinguish logical tables from one another in which case multiple logical tables are just forcibly stored together on the same physical TCAM block, is this even a possibility? Can multiple logical ternary tables share a physical TCAM block w/o the presence of a logical table id? (referring to the picture from earlier, assuming TCAM block 1 doesn’t have logical ids, can we query just between the first 3 entries?) Thank you!

I have seen many commercially produced systems with TCAMs that use this logical table id field as a hidden feature, so you need not even do anything to make it happen. It is all done for you “under the hood”.

For power saving reasons, many TCAMs have a way to control a subset of “blocks” that are powered on, on a search-by-search basis, but these blocks are usually on the order of 512 or 1K entries, or so, in large TCAMs, so the logical table id technique is still useful for enabling multiple logical tables to share one of those blocks.

Your question: So assuming the lack of a hidden field or logical table id to distinguish logical tables from one another in which case multiple logical tables are just forcibly stored together on the same physical TCAM block, is this even a possibility?

My answer: Of course a system doesn’t have to use this logical table id technique. There might be other ways for the entries of different logical tables to share a single physical TCAM block, but the logical table id is so easy to understand, flexible, and inexpensive, I have a hard time imagining another technique that is better than it.

One last question - is this issue of mapping logical tables to physical units also relevant for SRAM pages? or is this mainly a TCAM block issue? For example, if we have multiple logical tables consisting of only exact matches using SRAM, does this remain an issue? My intuition is this doesn’t matter for SRAM because unlike TCAM which is content addressable, SRAM is queried by address so the content can be stored wherever irrespective of SRAM page boundaries and just be queried via the location address. Is this logical?

For a hash table that you wanted to use only a subset of an SRAM block, shared with other logical tables, it would seem to me straightforward to have a start address and size independently configured for each logical table, and configure those such that they used no SRAM entries in common.

You could allow those entry ranges to overlap if you wanted to for some reason, but then you would need something like the logical table id field inside of the entry to enable the data plane to distinguish whether it hit a matching entry for the correct logical table id, or not.

Note: There is usually more than one way to solve these kinds of problems. Do not take my answers as “this is the only way to do it”.

I know for a fact, that the technique that @andyfingerhut described for TCAMs is used for SRAM-based exact match tables as well.

Ultimately, there are a lot of factors (within the overall device architecture) that will dictate the best solution for a given case.

As of today (June 2023), one really needs to understand their specific target in order to be able to write highly optimized non-trivial programs for it. This is, by the way, no different than how one would write in, C or FORTRAN just 40 years ago. 20-40 years from now things will probably change :slight_smile:

Happy hacking,
Vladimir

read en.wiki / tcam then find the white papers about cooltcams… thats what these chipmakers do these days… not to forget to google bgp rib compression, a kinda hack against the limitations from the control plane:

1.1.1.1/32 —> eth1
2.2.2.2/32 —> eth1


9.9.9.9/32 → eth2

is 2 tcam entries if u do all these properly…

the idea on live bgp in the default-free-zone from a real router’s point of view: https://twitter.com/rare_freerouter/status/1610952335756238848/photo/1

and since then the code just evolved a littlebit

hmmm sram based tcam is also a thing, try/do/thinkabout the following:

0.0.0.0/1 ----> sram block 1 (optionally → tcam block 1 in stage 2)
128.0.0.0/1 ----> sram block 2 (optionally → tcam block 2 in stage 2)
224.0.0.0/16 —> is the atomic bomb in bmv2 experiment on this forum…
224.0.0.0/4 ----> is the atomic bomb in bmv2 experiment on this forum…
240.0.0.0/4 —> even more atomic bombic in bmv2 stuffs go here…

not this, but the logic behind that stuff is this, basically…

recall im under the nda of 4+ big router vendors && the intel/barefoot as jey another p4 chipmaker…

@ThePuriProdigy @vgurevich ^^^^^^^^^^^^^^

Regarding @andyfingerhut 's technique about TCAM logical table id, I fail to see why that is necessary in the context of SRAM exact match tables. If you have multiple logical exact match tables and you are trying to map them to the same physical unit, since SRAM is addressed via addresses instead of the content, can’t the multiple logical tables just be placed onto the shared physical unit by displacing at different addresses? No need for adding another field it seems - I could be way off base here.

As I tried to say, it is definitely possible to configure different exact match tables in SRAM, implemented as hash tables, in completely disjoint ranges of SRAM entries, such that the logical id technique is UNNECESSARY, because any SRAM entry can contain an entry from at most one logical table.

It is also possible (but not necessary) to allow different such tables to overlap in the ranges of SRAM entries that they are allowed to use. IF someone implemented a system like that, THEN they would need some way to distinguish an exact match entry as to which logical table it belonged to.

Why might someone ever want to design a system where a single SRAM word might contain entries from multiple different logical tables? One potential benefit of such an approach is that it allows entries from a small hash table to be spread over a larger number of physical locations, which can improve hash table utilization. For hash tables, bigger ones tend to have higher average utilization that can be achieved because of a “law of large numbers” effect – if you have 1 million buckets, the variances of how many entries you can use is much smaller than if you have 100 buckets, where the variances are HUGE (at least as measured by a percentage of the raw number of entries).