Data Consistency Issues When Using Stateful Data in P4 Programs

Hi, I would like to ask a question about potential data inconsistency caused by parallel packet processing. I searched the forum briefly but couldn’t find a related topic.

While writing a P4 program on BMv2 (v1model), I encountered a conceptual issue. In the data plane, it seems natural to process multiple packets in parallel. However, when using stateful elements (like registers) that are updated based on packet-triggered operations, I wonder if this can lead to concurrent access problems.

For example, suppose I want to count the number of packets belonging to a particular flow. For each incoming packet, I read a register, increment its value by 1, and write it back. If two packets arrive at nearly the same time, they might both read the old value before either write occurs, leading to an incorrect count.
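
For concreteness, here is roughly what I mean, as a minimal v1model sketch (the register name, its size, and the flow index are just illustrative, not from a real program):

```
register<bit<32>>(1024) flow_counters;

action count_flow(bit<32> flow_index) {
    bit<32> current;
    // read-modify-write: two packets could both read the same old value
    flow_counters.read(current, flow_index);
    flow_counters.write(flow_index, current + 1);
}
```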

My question is:

  1. When writing P4 programs based on the P4 language specification, does this kind of issue depend on the underlying architecture?
  2. Do I need to explicitly consider concurrency issues like this when designing such stateful logic?

Thanks in advance!

On the Tofino implementation, packets enter the ingress pipeline in some order, and remain in that same relative order throughout the ingress pipeline until they reach the traffic manager. Similarly for the egress pipeline, although there the order in which packets enter the egress pipeline is determined by the traffic manager’s packet scheduling logic, not by the order in which packets arrived at the device from outside. In that implementation of the P4 register array extern, the relative order in which a register is read, modified, and written is exactly the same as the order in which packets entered the ingress or egress pipeline containing the register instance.

Other implementations might follow a similar design to Tofino, or they might choose something different. For example, consider an FPGA or programmable NIC with large tables located in DRAM. Because there is often an on-chip cache for such DRAM tables, a packet whose lookup hits the cache can finish the lookup with relatively low latency, while a packet whose lookup misses the cache will experience a much higher latency before the lookup result is available. An implementation might very well allow packets that got a cache hit to “jump ahead” of packets that got a cache miss, improving average latency and throughput for the device as a whole. In such a device, the order in which packets access stateful elements before that DRAM table lookup could be different from the order in which they access stateful elements after it, within the same pipeline.

Nothing in the P4 language specification prohibits such a design, and I believe that is a good thing. It does mean that if you care about the relative order of access to stateful elements, you should ask detailed questions of those who support the device you are programming: what its ordering behavior is, and whether they provide any extra features, e.g. P4 annotations, that let you as a developer control or restrict those behaviors.

I should quickly add that it would be a very bad idea if an implementer chose to make a device where the following was possible:

(1) packet #1 reads element X from a P4 register array

(2) packet #2 reads element X from a P4 register array

(3) packet #2 writes a modified value back to element X

(4) packet #1 writes a modified value back to element X, effectively undoing the update of packet #2 completely.

Any good device vendor knows that this behavior would be practically useless for users of their device, and designs their device to avoid it. If they didn’t, users and the implementer alike would likely consider it a bug in the implementation.

Dear @kmftangchaoyang ,

To add to @andyfingerhut’s excellent answer, let me point out that this topic is discussed in Section 18.4.1 (Concurrency model) of the spec, which defines a special @atomic annotation that can be used to guarantee atomic execution of larger blocks of code.

It also states that extern methods are supposed to be executed atomically.
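
In other words, a read-modify-write sequence like the one in your example can be wrapped in an @atomic block to ask the target to execute it without interleaving. A rough sketch, reusing the hypothetical flow_counters register and index from above (whether a given target can actually implement such a block atomically is target-dependent):

```
@atomic {
    bit<32> current;
    flow_counters.read(current, meta.flow_index);
    flow_counters.write(meta.flow_index, current + 1);
}
```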

On Tofino, this annotation has no effect, because everything is already atomic: even if you read, modify, and write back a register value, that sequence is translated into a single RegisterAction() extern with an apply() method.
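
Roughly, on a TNA target the same counter would look something like the following sketch (assuming the tna Register/RegisterAction externs; the names are illustrative):

```
Register<bit<32>, bit<32>>(1024) flow_counters;

RegisterAction<bit<32>, bit<32>, bit<32>>(flow_counters) incr_flow = {
    void apply(inout bit<32> value) {
        value = value + 1;   // the whole apply() body executes atomically per packet
    }
};

// in the ingress or egress control:
// incr_flow.execute(meta.flow_index);
```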

I’m not sure how well BMv2 respects this annotation.

Other hardware platforms might employ different strategies, but as Andy pointed out, they typically provide specialized atomic operations that are usually expressed as extern methods.

Happy hacking,

Vladimir

Thank you very much @andyfingerhut and @p4prof for your detailed answers!
It’s reassuring to know that well-designed hardware platforms avoid such concurrency issues — that really helps reduce the burden on P4 programmers like me.
Really appreciate your insights — this helped a lot!