Why is bmv2 switch queue buildup not behaving as expected?

Dear Community,

I am running an experiment on Fabric-Platform using bmv2 switches. On every packet I update the minimum, maximum, sum and packet count of enq_qdepth (the sum and count give the average). Every 0.001 second these values are patched into a header and read on the receiver side. The code is shown below:

control MyEgress(inout headers hdr,
                 inout metadata meta,
                 inout standard_metadata_t standard_metadata) {
    // Register slots: 0 = sum of enq_qdepth, 1 = packet count, 2 = min, 3 = max
    register<bit<32>>(512) input_port_pkt_count;

    apply {
        // Accumulate the sum of enq_qdepth
        bit<32> jemi;
        input_port_pkt_count.read(jemi, (bit<32>) 0);
        jemi = jemi + (bit<32>) standard_metadata.enq_qdepth;
        input_port_pkt_count.write((bit<32>) 0, jemi);

        // Count packets
        bit<32> counter1;
        input_port_pkt_count.read(counter1, (bit<32>) 1);
        counter1 = counter1 + 1;
        input_port_pkt_count.write((bit<32>) 1, counter1);

        // Find minimum
        bit<32> jmin;
        input_port_pkt_count.read(jmin, (bit<32>) 2);
        if ((bit<32>) standard_metadata.enq_qdepth < jmin) {
            jmin = (bit<32>) standard_metadata.enq_qdepth;
            input_port_pkt_count.write((bit<32>) 2, jmin);
        }

        // Find maximum
        bit<32> jmax;
        input_port_pkt_count.read(jmax, (bit<32>) 3);
        if ((bit<32>) standard_metadata.enq_qdepth > jmax) {
            jmax = (bit<32>) standard_metadata.enq_qdepth;
            input_port_pkt_count.write((bit<32>) 3, jmax);
        }

        // Probe packet (not ICMP, TCP or UDP): export the values and reset the slots
        if ((hdr.ipv4.protocol != 0x01) && (hdr.ipv4.protocol != 0x6) && (hdr.ipv4.protocol != 0x11)) {
            // Sum is carried in the enq_timestamp field
            input_port_pkt_count.read(jemi, (bit<32>) 0);
            hdr.my_meta.enq_timestamp = jemi;
            jemi = 0;
            input_port_pkt_count.write((bit<32>) 0, jemi);

            // Packet count is carried in the deq_timedelta field
            input_port_pkt_count.read(counter1, (bit<32>) 1);
            hdr.my_meta.deq_timedelta = counter1;
            counter1 = 0;
            input_port_pkt_count.write((bit<32>) 1, counter1);

            // Minimum is carried in the enq_qdepth field; reset to 100
            input_port_pkt_count.read(jmin, (bit<32>) 2);
            hdr.my_meta.enq_qdepth = jmin;
            jmin = 100;
            input_port_pkt_count.write((bit<32>) 2, jmin);

            // Maximum is carried in the deq_qdepth field
            input_port_pkt_count.read(jmax, (bit<32>) 3);
            hdr.my_meta.deq_qdepth = jmax;
            jmax = 0;
            input_port_pkt_count.write((bit<32>) 3, jmax);

            //bit<32> max_enq;
            //input_port_pkt_count.read(max_enq, (bit<32>) 2);
            //if (((bit<32>)standard_metadata.enq_qdepth > (bit<32>)max_enq))     {
            //max_enq = (bit<32>)standard_metadata.enq_qdepth ;}
            //input_port_pkt_count.write((bit<32>) 2, max_enq);

            //hdr.my_meta.deq_qdepth = (bit<32>) max_enq;
        }
    }
}
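
On the receiver side, the exported sum and packet count can be turned back into an average. The following is only a minimal Python sketch. It assumes that my_meta consists of exactly four 32-bit big-endian fields in the order enq_timestamp (sum), deq_timedelta (count), enq_qdepth (min), deq_qdepth (max); the real layout is defined in the P4 program and may differ. The function name decode_my_meta is hypothetical.

```python
import struct

def decode_my_meta(payload: bytes) -> dict:
    """Decode sum/count/min/max from a probe header (assumed layout)."""
    # Assumption: four consecutive 32-bit big-endian fields, in this order.
    total, count, qmin, qmax = struct.unpack("!IIII", payload[:16])
    # Guard against an interval in which no packet was counted.
    average = total / count if count else 0.0
    return {"min": qmin, "max": qmax, "avg": average, "count": count}

# Example interval: 4 packets whose queue depths sum to 100, min 10, max 40.
sample = struct.pack("!IIII", 100, 4, 10, 40)
print(decode_my_meta(sample))
```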

The experiment result is as follows:


I am creating congestion using iperf3:

iperf3 -c -u -b -1 -t 20

After pausing for 10 seconds, I run the same command again.
On the graph, the bmv2 switch behaves as expected up to the 40-second mark, but during the second congestion period the max enq_qdepth always stays below 20, while I expect it to be around 60. What may be the issue?

I compared the deq_timedelta values when calculating only the average with the values when calculating min, max and average together. With only the average, the max deq_timedelta value was 60, but with min, max and average all together I saw a max deq_timedelta value of 1400, which is about 23 times larger. Could the calculation time be affecting the queue buildup?
Below I am attaching the deq_timedelta graph as well, in case it helps to resolve the issue.


The P4 program is on GitHub: https://github.com/nagmat1/Routing_enq_deq_depth/blob/main/switch/min_max.p4

I am limiting the bandwidth on the switch with:

sudo tc qdisc add dev enp7s0 root handle 1:0 netem delay 1ms
sudo tc qdisc add dev enp7s0 parent 1:1 handle 10: tbf rate 1gbit buffer 160000 limit 300000
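
As a rough check of what these tbf parameters allow, the sketch below converts buffer and limit (both given in bytes) into time at the 1 Gbit/s rate. The 1500-byte packet size is an assumption for illustration only, not a value taken from the experiment.

```python
RATE_BPS = 1_000_000_000   # tbf rate 1gbit (10^9 bits per second)
BUFFER_BYTES = 160_000     # tbf buffer: token bucket size in bytes
LIMIT_BYTES = 300_000      # tbf limit: bytes allowed to wait for tokens
PACKET_BYTES = 1500        # assumed packet size (hypothetical)

def bytes_to_ms(n_bytes: int, rate_bps: int) -> float:
    """Time in milliseconds needed to send n_bytes at rate_bps."""
    return n_bytes * 8 / rate_bps * 1000

burst_ms = bytes_to_ms(BUFFER_BYTES, RATE_BPS)   # longest burst at line rate
queue_ms = bytes_to_ms(LIMIT_BYTES, RATE_BPS)    # drain time of a full tbf queue
queue_pkts = LIMIT_BYTES // PACKET_BYTES         # packets that fit in the limit

print(f"burst: {burst_ms:.2f} ms, queue: {queue_ms:.2f} ms, ~{queue_pkts} packets")
```

Under these assumptions the kernel qdisc itself can hold roughly 200 full-size packets (about 2.4 ms of traffic). That queue lives in the kernel, outside BMv2, so it is not counted in enq_qdepth.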

UDP packets are sent at more than 10 Gbits/s, and the receiver side accepts them at 995 Mbits/s.

Kind regards,

Several years ago I did some experiments (unpublished) with BMv2 and some hosts, trying to build up congestion on certain simulated links.

I did not publish my steps to reproduce this, but I recall the following. When you rate limit a virtual Ethernet interface between two entities (e.g. two instances of BMv2, or an instance of BMv2 and a simulated host), you might wish that it behaves like this:

(a) A physical constant-bit-rate link, e.g. a 10 Gbit/sec cable, accepts one packet only when any previous packet sent to it has finished being transmitted. There is no “queue” of packets built up between BMv2 and the cable.

But if I am remembering my earlier results correctly, it behaves more like (b):

(b) A virtual Ethernet interface that has been configured with a low rate limit, e.g. 1 Mbit/sec, accepts packets faster than its configured rate when you try to send faster than that, but puts them into a software FIFO of packets waiting to be transmitted over the veth link. I believe it does this up to some maximum software buffer size determined by some layer of software in the kernel (I am not sure which software in the kernel determines this limit), and then does not accept new packets until the queue drops below that upper limit.

I believe I even saw cases of (b) where after the software FIFO for the veth filled up, no new packets were sent from BMv2 to the veth until the FIFO queue was drained completely, or close to being empty.

Effectively the links had a long term average rate of the configured rate, e.g. 1 Mbit/sec, but it would alternate over hundreds of milliseconds between accepting packets much faster than 1 Mbit/sec, and much slower than 1 Mbit/sec.
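
To illustrate the pattern described in (b), here is a toy Python model, not a measurement of BMv2 or the kernel. All parameters (drain rate, accept rate, FIFO capacity, resume threshold) and the function name simulate_veth are made up for the sketch. The sender always has packets ready; the FIFO accepts them quickly until it is full, then refuses new packets until it has drained.

```python
def simulate_veth(ticks, drain_per_tick=1, accept_per_tick=10,
                  capacity=100, resume_at=0):
    """Return the number of packets accepted from the sender in each tick."""
    fifo = 0            # packets waiting in the software FIFO
    blocked = False     # True while the FIFO refuses new packets
    accepted = []
    for _ in range(ticks):
        # The link transmits at its configured (slow) rate.
        fifo = max(0, fifo - drain_per_tick)
        # Resume accepting only once the FIFO has drained far enough.
        if blocked and fifo <= resume_at:
            blocked = False
        took = 0
        if not blocked:
            # Accept a fast burst, limited by the remaining FIFO space.
            took = min(accept_per_tick, capacity - fifo)
            fifo += took
            if fifo >= capacity:
                blocked = True
        accepted.append(took)
    return accepted

acc = simulate_veth(1100)
print("average per tick:", sum(acc) / len(acc))
print("max per tick:", max(acc), "min per tick:", min(acc))
```

With these made-up parameters the long-term average equals the drain rate of 1 packet per tick, yet the link alternates between short bursts of 10 packets per tick and long stretches where nothing is accepted.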

I believe that BMv2 might also have backpressure from its “traffic manager” queues back to ingress processing, such that ingress processing is paused if the traffic manager queues reach their threshold. Again, it has been a couple of years since my experiments, so I may be misremembering some of the details.

In general, I found that BMv2 plus rate-limited veth links behaves very differently performance-wise from a network of physical switches connected by physical constant-bit-rate Ethernet cables.
