Unexpected Queue Depth Spikes in BMv2 (v1model) During Congestion Control

Hi!
I am writing a P4 program for BMv2 using the v1model, and my congestion control algorithm needs to monitor queue length. I obtain this information by reading standard_metadata.deq_qdepth in the egress pipeline as each packet is dequeued, which gives me a roughly real-time view of the queue.
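
The relevant part of my egress looks roughly like this (a simplified sketch: the control name, header/metadata types, parser, and the rest of the v1model pipeline are placeholders/omitted, not my exact code):

    control MyEgress(inout headers hdr,
                     inout metadata meta,
                     inout standard_metadata_t standard_metadata) {
        apply {
            // deq_qdepth (bit<19>) is the queue depth observed by the
            // traffic manager at the moment this packet was dequeued.
            log_msg("deq_qdepth={}", {standard_metadata.deq_qdepth});
        }
    }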

My congestion control works as follows: I implemented a real-time token bucket in the ingress (I did not use the built-in meter because I need to dynamically update the rate in the data plane, which the meter extern does not allow). When the queue length increases, I reduce the rate using an algorithm similar to QCN; otherwise, I linearly increase the rate over time (currently 400 bps every 50 ms on a 1 Mbps link).
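
For reference, the token-bucket part of my ingress is roughly like the sketch below. The register names, the Q8 fixed-point encoding of the rate, the bucket size, and the drop action for non-conforming packets are simplifications/placeholders rather than my exact code, and the QCN-style logic that updates r_rate_q8 is omitted:

    control MyIngress(inout headers hdr,
                      inout metadata meta,
                      inout standard_metadata_t standard_metadata) {
        // State carried across packets (single-entry registers).
        register<bit<48>>(1) r_last_ts;   // last refill timestamp (us)
        register<bit<32>>(1) r_tokens;    // current tokens, in bytes
        register<bit<32>>(1) r_rate_q8;   // rate in bytes/us, Q8 fixed point,
                                          // written by the QCN-like update (not shown)

        apply {
            bit<48> now = standard_metadata.ingress_global_timestamp;
            bit<48> last;
            bit<32> tokens;
            bit<32> rate_q8;
            r_last_ts.read(last, 0);
            r_tokens.read(tokens, 0);
            r_rate_q8.read(rate_q8, 0);

            // Refill: elapsed_us * rate, using a shift instead of a divide
            // (BMv2 has no run-time division), then cap at the bucket size.
            bit<32> elapsed_us = (bit<32>)(now - last);
            tokens = tokens + ((elapsed_us * rate_q8) >> 8);
            if (tokens > 3000) { tokens = 3000; }   // bucket size: placeholder value

            if (tokens >= standard_metadata.packet_length) {
                tokens = tokens - standard_metadata.packet_length;
                standard_metadata.egress_spec = 1;  // forward (port 1 is a placeholder)
            } else {
                mark_to_drop(standard_metadata);    // placeholder for the non-conforming action
            }

            r_last_ts.write(0, now);
            r_tokens.write(0, tokens);
        }
    }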

However, when I examined the monitoring data, I noticed sudden increases in queue length. For example, in the following logs (first column: timestamp, second column: standard_metadata.deq_qdepth printed via log_msg):
23:44:09.487,0
23:44:09.488,0
23:44:09.496,0
23:44:09.497,0
23:44:09.505,0
23:44:09.506,0
23:44:09.514,0
23:44:09.645,27
23:44:09.646,26
23:44:09.647,26
23:44:09.647,25
23:44:09.648,24
23:44:09.649,23
23:44:09.650,22
23:44:09.651,21
23:44:09.651,20
23:44:09.652,19
23:44:09.652,18
23:44:09.653,17
23:44:09.653,16
23:44:09.654,15
23:44:09.655,14
23:44:09.655,13
23:44:09.656,12
23:44:09.656,11
23:44:09.657,10
23:44:09.658,9
23:44:09.658,8
23:44:09.659,7
23:44:09.659,6
23:44:09.659,5
23:44:09.659,4
23:44:09.660,3
23:44:09.661,2
23:44:09.661,1
23:44:09.662,0
23:44:09.665,0
23:44:09.675,0
23:44:09.685,0
23:44:09.689,0
23:44:09.706,0
23:44:09.707,0
23:44:09.718,0

Between 23:44:09.514 and 23:44:09.645 there is a gap of over 100 ms, and the queue length suddenly jumps to 27. Prior to that, packets were dequeuing roughly every 8 ms on average. This sudden jump seems odd.

Since my rate control at the sending edge paces packets using a time interval derived from the current rate (it is not window-based), it should not produce a burst of packets all at once.
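
(For illustration only: at 1 Mbps, a 1000-byte packet corresponds to a pacing interval of 8000 bits / 1,000,000 bit/s = 8 ms, which is consistent with the roughly 8 ms spacing between dequeues in the log before the gap. The 1000-byte packet size is just an assumption for this back-of-the-envelope check.)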

My questions are:
  1. What could be the reason for this behavior? Is it possible that BMv2 processes enqueues/dequeues in batches, and therefore the deq_qdepth reading is not guaranteed to be accurate for every individual packet?

  2. If so, is this batching behavior specific to BMv2, or should I assume that commercial switches also exhibit similar batch processing and design my congestion control algorithm accordingly?

This is behavior specific to BMv2, and perhaps some other “software switches”, i.e., switches implemented entirely in software running on a general-purpose CPU.

It would be very unusual for a commercial hardware-based switch to exhibit behavior like this, in normal circumstances.

One example of a circumstance where even a hardware-based Ethernet switch might exhibit behavior like this is if you enabled a feature such as PFC (Priority Flow Control) and the downstream device sent an XOFF message to the switch because it was temporarily congested. The switch receiving the XOFF would stop transmitting packets on that output port for some time, and any packets destined for that port would build up in one or more queues leading to it, until the downstream device sent an XON message to the switch.

Thank you very much for your response.
I have a couple of follow-up questions on which I hope you can share your thoughts:

  1. Why is this batching behavior more common in software switches like BMv2, but considered rare in hardware switches?
    Is it mainly because software switches do not process packets strictly one-by-one, but instead handle them in batches (e.g., due to scheduling or buffer flushing mechanisms in the software runtime)?

  2. Given this behavior in BMv2, does it make sense to rely on standard_metadata.deq_qdepth for real-time queue monitoring?
    In my case, this batching effect makes the queue depth appear to suddenly jump, which misleads my congestion control logic.

    Do you have any suggestions for monitoring queue status more accurately under such circumstances? For example, do you think it’s reasonable to:

    • Sample queue depth based on packet intervals to reduce the impact of batching?

    • Use a smoothed or moving average of the queue depth over time (for example, something like the register-based EWMA sketched below)?
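
    For the moving-average idea, I was thinking of something along these lines (a rough shift-only sketch, not my actual code; the register name and the alpha = 1/8 constant are arbitrary choices):

        control MyEgress(inout headers hdr,
                         inout metadata meta,
                         inout standard_metadata_t standard_metadata) {
            // Smoothed queue depth carried across packets.
            register<bit<32>>(1) r_avg_qdepth;

            apply {
                bit<32> sample = (bit<32>)standard_metadata.deq_qdepth;
                bit<32> avg;
                r_avg_qdepth.read(avg, 0);

                // EWMA with alpha = 1/8, shifts only so it runs on BMv2:
                //   avg <- avg - avg/8 + sample/8
                // (keeping avg in fixed point would reduce rounding error)
                avg = avg - (avg >> 3) + (sample >> 3);
                r_avg_qdepth.write(0, avg);

                log_msg("deq_qdepth={} ewma={}", {sample, avg});
            }
        }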

Your insights and experience would be extremely helpful to me. I’m very much looking forward to your reply and any suggestions you might have!

Thanks again!

  1. Although I have measured and noticed this behavior, I have not taken the time to determine the root cause of why it occurs. I do not think that a software switch must have this behavior merely because it is a software switch, but I also do not know what kind or magnitude of code changes would be required to create a version of BMv2 whose real-time performance behavior more closely matches real switches. There are plenty of network performance simulation environments, e.g. NS3, that focus on performance simulation, but I do not know whether anyone has integrated P4-programmable switches into NS3. This paper [1], which I just found a moment ago via a Google search, might be promising (note: I have never used any of the code mentioned in that paper myself, since I only learned of it today).
  2. In BMv2’s current form, I do not have any advice for running performance-sensitive features on it, other than “be aware of the differences”; that likely means it will not be useful for this purpose, except perhaps for debugging the correctness of your P4 code.

[1] https://dl.acm.org/doi/10.1145/3747204.3747210