Question about the SRAM lookup process in the two-stage ALPM matching

Hello,

I am studying and trying to reproduce the ALPM (Algorithmic Longest Prefix Match) example in P4 as a baseline algorithm for my research.

While reading the ALPM patent, I encountered a question about how the SRAM lookup stage (the second stage) actually works in hardware.

Background

According to the patent, the ALPM lookup consists of two stages:

  1. TCAM stage (or ATCAM layer)

    • Used to determine which SRAM partition should be accessed based on the high bits of the prefix.
  2. SRAM stage (ALPM layer)

    • Performs the actual longest prefix match within the selected partition.

My question is about the second stage:

How is the lookup performed inside an SRAM partition within a single pipeline stage?

Reasoning

It should not be a linear or sequential search, because:

  • If the partition contained hundreds of entries and the lookup were sequential, it would require hundreds of pipeline stages to compare them all, which is unrealistic.

  • Even in the example shown in the patent, if the process were sequential, the partition would need to be replicated multiple times (e.g., eight times for eight prefixes), which clearly doesn’t match how hardware is designed.

However, in the P4 source code provided by Intel, I only see the following definitions:
Alpm(number_partitions = 1024, subtrees_per_partition = 2) alpm;
Alpm(number_partitions = 1024, subtrees_per_partition = 2,
atcam_subset_width = 18, shift_granularity = 1) alpm_ipv4_subset_key;
Alpm(number_partitions = 1024, subtrees_per_partition = 2,
atcam_subset_width = 32, shift_granularity = 2) alpm_ipv6_subset_key;

My main questions
1.How is the lookup inside each SRAM partition actually implemented?
2.In hardware, how is the SRAM lookup completed within one pipeline stage?

I would really appreciate any clarification or insight about how ALPM performs the SRAM-stage lookup internally.
Any explanation or reference would be extremely helpful.

Dear @1418915702 ,

Patents are notoriously difficult to decipher, but take a look at page 39, column 18 lines 24-45 of the patent.

We also used to discuss this in much more clear way in the class ICA-1142 of Intel Connectivity Academy. The archives are available (but you need to have the valid NDA with Intel still).

You might be able to get some clues from the compiler source code, e.g. here, although it does not directly answer your question WRT to how the lookup of several entries is done in a single stage.

Here is a basic idea. Think of a regular exact match table, stored in SRAM. It can have N=1..8 ways and inside each way M=1..8 entries can be packed. So, a given entry can be located in NxM places inside the table (or not be there at all). So, inside a single stage the HW should be able to retrieve NxM potential candidates and perform NxM equality comparisons. However, these can be easily done in parallel. After that, it will have a bitmap of successful matches. It is also possible to quickly find the right-most (or left-most) bit there and then use the corresponding entry or notice that no bits are set (that’s a miss).

To make this hardware work as a small TCAM one needs to make very few simple adjustments:

  1. Store both the key value and the key mask in the entry
  2. Modify the comparators, so that they perform masked compare instead of exact match (equality)

Happy hacking,
Vladimir