Reading global register value in the ingress

Hello,

I am using the bmv2 software switch. I am trying to access the queue length for a specific port number in the ingress so I can reroute traffic in case that port is congested. I realized I can only access the enq_qdepth value in egress. Therefore, I have defined a global register to share this value between the egress and ingress control blocks where egress writes the enq_qdepth for each port in the corresponding index of the global register and the ingress reads from the register.

Does this approach make sense? Can I read a value related to standard metadata and write it to a global register and then try to read that value from another control block?

Here is the problem:
According to the log file of the switch, the value written to the register is nonzero because a queue is built up (I have also verified this using the simple_switch_CLI program):

[16:28:54.649] [bmv2] [T] [thread 12964] [602.0] [cxt 0] Wrote register 'approx_queue_depth_pkts' at index 1 with value 16
.
.
.
[16:28:54.649] [bmv2] [T] [thread 12964] [603.0] [cxt 0] Wrote register 'approx_queue_depth_pkts' at index 1 with value 15

However, the value read from the register in the ingress is always 0:

[16:28:50.748] [bmv2] [T] [thread 12961] [602.0] [cxt 0] Read register 'approx_queue_depth_pkts' at index 1 read value 0
.
.
.
[16:28:50.760] [bmv2] [T] [thread 12961] [603.0] [cxt 0] Read register 'approx_queue_depth_pkts' at index 1 read value 0

I don’t know why this is happening. I have also taken a look at https://github.com/jafingerhut/p4-guide/blob/master/demo6/demo-global-register.p4_16.p4 for reference. When I simulate this code, it works. But the difference with what I am trying to do is that the value written to the register is not derived from standard metadata in this case.

Here is a snippet of my code that I think is most relevant.

Ingress

apply {
        phy_forward.apply();
        log_msg("Stat in ingress: egress_spec={},  enq_qdepth={}, deq_qdepth={}",
            {stdmeta.egress_spec, stdmeta.enq_qdepth, stdmeta.deq_qdepth});
        approx_queue_depth_pkts.read(tmp_count, (bit<32>) 1);
        log_msg("Register read in ingress: {}",
            {tmp_count});
        if(tmp_count > 7) {
            stdmeta.egress_spec = (bit<9>) 3;
        }
    }

egress

        log_msg("Stat in egress: enq_qdepth={}, deq_qdepth={}",
            {stdmeta.enq_qdepth, stdmeta.deq_qdepth});
        bit<32> value = (bit<32>)stdmeta.deq_qdepth;
        approx_queue_depth_pkts.write((bit<32>) 1, value);

Here is my complete code:

#include <core.p4>
#include <v1model.p4>

typedef bit<9> egressSpec_t;

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

struct metadata_t {
}

struct headers_t {
    ethernet_t ethernet;
}

typedef bit<32> PktCount_t;
register<PktCount_t>(3) approx_queue_depth_pkts;

parser parserImpl(packet_in packet,
                  out headers_t hdr,
                  inout metadata_t meta,
                  inout standard_metadata_t stdmeta)
{
    state start {
        packet.extract(hdr.ethernet);
        transition accept;
    }
}

control ingressImpl(inout headers_t hdr,
                    inout metadata_t meta,
                    inout standard_metadata_t stdmeta)
{
    PktCount_t tmp_count;

    action drop() {
        mark_to_drop(stdmeta);
    }

    action forward(egressSpec_t port) {
        stdmeta.egress_spec = port;
    }

    table phy_forward {
        key = {
            stdmeta.ingress_port: exact;
        }
        actions = {
            forward;
            drop;
        }
        size = 1024;
        default_action = drop();
    }

    apply {
        phy_forward.apply();
        log_msg("Stat in ingress: egress_spec={},  enq_qdepth={}, deq_qdepth={}",
            {stdmeta.egress_spec, stdmeta.enq_qdepth, stdmeta.deq_qdepth});
        approx_queue_depth_pkts.read(tmp_count, (bit<32>) 1);   
        log_msg("Register read in ingress: {}",
            {tmp_count});
        if(tmp_count > 7) {
            stdmeta.egress_spec = (bit<9>) 3;
        }     
    }
}

control egressImpl(inout headers_t hdr,
                   inout metadata_t meta,
                   inout standard_metadata_t stdmeta)
{
    apply {        
        log_msg("Stat in egress: enq_qdepth={}, deq_qdepth={}",
            {stdmeta.enq_qdepth, stdmeta.deq_qdepth});
        bit<32> value = (bit<32>)stdmeta.deq_qdepth;
        approx_queue_depth_pkts.write((bit<32>) 1, value);
    }
}

control deparserImpl(packet_out packet,
                     in headers_t hdr)
{
    apply {
        packet.emit(hdr.ethernet);
    }
}

control verifyChecksum(inout headers_t hdr, inout metadata_t meta) {
    apply { }
}

control updateChecksum(inout headers_t hdr, inout metadata_t meta) {
    apply { }
}

V1Switch(parserImpl(),
         verifyChecksum(),
         ingressImpl(),
         egressImpl(),
         updateChecksum(),
         deparserImpl()) main;

Welcome @amjal,

You should be able to keep values statefully and let the control blocks of subsequent packets read them. That is, if you write something in the egress control, then as I understand it the value is only useful to the following packets, whose ingress or egress control reads it, according to your use case (see the next answer; where you declare the register might matter).

Maybe my comment is not helpful, but when I checked your log output, the ‘reads’ (16:28:50) are happening before the ‘writes’ (16:28:54). Maybe this is not what you wanted to express, but please make sure that you did not check the wrong log outputs.

writes

vs. reads

Let us know if this is not the actual problem.

Cheers,

I do not recall off hand whether a P4 register can be made accessible from both ingress and egress in BMv2.

Even if that is not possible, you can also consider periodically sending packets that read enq_qdepth in egress, recirculate the packet with user-defined metadata that contains a copy of the enq_qdepth value that packet saw, and then back in ingress, write that value to a P4 register in ingress.
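
A minimal sketch of that recirculation idea follows, under several assumptions that are not from this thread: probe packets are marked with a hypothetical EtherType (ETHERTYPE_PROBE), the measured port is hard-coded to 1 and the backup port to 3, the installed v1model.p4 is recent enough to provide recirculate_preserving_field_list and the @field_list annotation, and BMv2 marks recirculated packets with instance_type 4. It has not been tested against the setup described in the question.

```p4
#include <core.p4>
#include <v1model.p4>

// Assumption: BMv2's v1model marks recirculated packets with instance_type 4.
const bit<32> INSTANCE_TYPE_RECIRC = 4;
// Hypothetical marker for probe packets (an experimental EtherType).
const bit<16> ETHERTYPE_PROBE = 0x88B5;

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

struct headers_t {
    ethernet_t ethernet;
}

struct metadata_t {
    // Fields tagged with field list 1 are preserved across recirculation.
    @field_list(1)
    bit<32> seen_qdepth;
    @field_list(1)
    bit<9>  probed_port;
}

parser parserImpl(packet_in packet, out headers_t hdr,
                  inout metadata_t meta,
                  inout standard_metadata_t stdmeta)
{
    state start {
        packet.extract(hdr.ethernet);
        transition accept;
    }
}

control ingressImpl(inout headers_t hdr, inout metadata_t meta,
                    inout standard_metadata_t stdmeta)
{
    // Read and written only in ingress, so no cross-control sharing is needed.
    register<bit<32>>(512) qdepth_by_port;
    bit<32> depth;

    apply {
        if (stdmeta.instance_type == INSTANCE_TYPE_RECIRC) {
            // The probe is back from egress: record what it saw, then drop it.
            qdepth_by_port.write((bit<32>) meta.probed_port, meta.seen_qdepth);
            mark_to_drop(stdmeta);
        } else if (hdr.ethernet.etherType == ETHERTYPE_PROBE) {
            // Fresh probe: send it through the queue being measured.
            stdmeta.egress_spec = 1;  // hypothetical measured port
        } else {
            // Normal traffic: reroute when the last probe saw a deep queue.
            stdmeta.egress_spec = 1;
            qdepth_by_port.read(depth, 1);
            if (depth > 7) {
                stdmeta.egress_spec = 3;  // hypothetical backup port
            }
        }
    }
}

control egressImpl(inout headers_t hdr, inout metadata_t meta,
                   inout standard_metadata_t stdmeta)
{
    apply {
        if (hdr.ethernet.etherType == ETHERTYPE_PROBE) {
            // Copy the queue depth into user metadata and send the probe back.
            meta.seen_qdepth = (bit<32>) stdmeta.enq_qdepth;
            meta.probed_port = stdmeta.egress_port;
            recirculate_preserving_field_list(1);
        }
    }
}

control deparserImpl(packet_out packet, in headers_t hdr) {
    apply { packet.emit(hdr.ethernet); }
}

control verifyChecksum(inout headers_t hdr, inout metadata_t meta) {
    apply { }
}

control updateChecksum(inout headers_t hdr, inout metadata_t meta) {
    apply { }
}

V1Switch(parserImpl(), verifyChecksum(), ingressImpl(),
         egressImpl(), updateChecksum(), deparserImpl()) main;
```

The value that ingress reads is only as fresh as the most recent probe, so the probe rate bounds how quickly a congested port would be noticed. Older v1model.p4 versions expose a different recirculate signature, so the call in egress may need adapting.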

Thank you for your reply.

I have defined the register globally. The data I am trying to store is the size of the queue, so it should be packet-independent? I am not sure if I got your point correctly.

Regarding your second point, when I search the whole log file, I don’t get any read for that register which returns a value other than 0. But there are writes to the same register with values other than 0. Also, near the end of the test, I use simple_switch_CLI to get the value of the register while the switch is running and it returns the nonzero value. So I think the problem is with the ‘read.’

Also, while we are talking about the log file, I have realized that, due to the multi-threaded nature of the bmv2 switch, the timestamps for different control blocks run by different threads do not have “inter-thread” synchronization. Therefore, if I want a chronological order among threads, I need to look at the value of the fifth column, which shows the number 602.0 in this case:

Is my understanding correct? And what is this number?

Thank you for your answer,

I guess I could use recirculation or other workarounds, although I suspect they introduce some overhead. Ideally I would like to be able to have a single register read/write at each control block for congestion detection.

It seems that sharing a register between ingress and egress is possible in general, as shown in this example, which I assume is code from one of your GitHub repositories and which I mentioned in my question:

In this code, for each packet in the ingress, the value of the register is incremented and for each packet in the egress, the value of the register is decremented. The register is the same global register for both ingress and egress.
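
The pattern described here can be sketched as below. This is an illustrative reconstruction and not the code from the linked demo: the register name pkts_in_flight, its single-entry size, and the fixed output port are assumptions made only for the example.

```p4
#include <core.p4>
#include <v1model.p4>

header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

struct headers_t {
    ethernet_t ethernet;
}

struct metadata_t {
}

// Declared at the top level so that both controls can refer to the same
// register instance (hypothetical name, one entry).
register<bit<32>>(1) pkts_in_flight;

parser parserImpl(packet_in packet, out headers_t hdr,
                  inout metadata_t meta,
                  inout standard_metadata_t stdmeta)
{
    state start {
        packet.extract(hdr.ethernet);
        transition accept;
    }
}

control ingressImpl(inout headers_t hdr, inout metadata_t meta,
                    inout standard_metadata_t stdmeta)
{
    bit<32> count;

    apply {
        stdmeta.egress_spec = 1;  // hypothetical fixed output port
        // One more packet has entered the switch.
        pkts_in_flight.read(count, 0);
        pkts_in_flight.write(0, count + 1);
    }
}

control egressImpl(inout headers_t hdr, inout metadata_t meta,
                   inout standard_metadata_t stdmeta)
{
    bit<32> count;

    apply {
        // One packet has left the queue and reached egress.
        pkts_in_flight.read(count, 0);
        pkts_in_flight.write(0, count - 1);
    }
}

control deparserImpl(packet_out packet, in headers_t hdr) {
    apply { packet.emit(hdr.ethernet); }
}

control verifyChecksum(inout headers_t hdr, inout metadata_t meta) {
    apply { }
}

control updateChecksum(inout headers_t hdr, inout metadata_t meta) {
    apply { }
}

V1Switch(parserImpl(), verifyChecksum(), ingressImpl(),
         egressImpl(), updateChecksum(), deparserImpl()) main;
```

Because ingress and egress run in different threads in BMv2, the separate read and write calls in this sketch are not guaranteed to behave as one atomic update, so the count should be treated as approximate.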

However, data from standard metadata that is written to the global register in the egress is not reflected in the ingress read. I mean this part of my code:

In contrast, if I add an increment or decrement operation right after this write, it does take effect in the ingress. This is where I would like to understand how I am mistaken and hopefully resolve my issue.

If you have a complete P4 program that demonstrates the problem you are experiencing, and are willing to publish a link to it, that would enable someone else to see the details and look for problems in it. I cannot tell what might be going wrong from the one sample line of code you have shown above.

I think I found it; I am mentioning it here for the record. Whenever I write from standard metadata to a global register, I can find 4 to 5 second intervals in the log file where the SimpleSwitch::egress_thread seems to be inactive. If the duration of the test is long, there are multiple such intervals. For instance:

[19:08:36.406] [bmv2] [T] [thread 3970] [552.0] [cxt 0] Wrote register 'approx_queue_depth_pkts' at index 1 with value 0
.
.
.
[19:08:40.745] [bmv2] [T] [thread 3970] [553.0] [cxt 0] Wrote register 'approx_queue_depth_pkts' at index 1 with value 63

These are two consecutive egress activities and they are 4 seconds apart. During this interval, the SimpleSwitch::ingress_thread has been active and feeding packets to queues. Also worth mentioning, while the egress thread is processing packet number 552, the ingress thread is processing packet number 660.

Anyway, you can see that after four seconds the egress thread has ‘awakened’ to a full queue and just flushes out the packets until the queue is empty, with no ingress activity in the meantime. This is the last write during this time:

[19:08:40.754] [bmv2] [T] [thread 3970] [616.0] [cxt 0] Wrote register 'approx_queue_depth_pkts' at index 1 with value 0

After transmitting all those packets out, there is another 3 second gap after which the ingress thread wakes up when the queue is empty. Therefore it is always reading value 0:

[19:08:43.806] [bmv2] [T] [thread 3967] [661.0] [cxt 0] Read register 'approx_queue_depth_pkts' at index 1 read value 0

I don’t understand this concurrency behavior. The system I am running this on has 4 cores. However, after setting the queue rate so low that the queue is essentially always full, the ingress thread does read nonzero values. Thanks to @ederollora for highlighting the log timestamps; that made me take a closer look at the log files.

I do not know the reason for the precise behavior you are seeing, but everyone should realize that BMv2 was NOT developed to simulate a switch ASIC in its performance properties. If you are doing things that might be very timing sensitive, you should be using some kind of system that correctly emulates the performance properties of the system you want to run on.

Just an off-topic question: is it possible to log the register value from a P4 switch to a separate file instead of to its own log file?

There is nothing built into the BMv2 / simple_switch / simple_switch_grpc process that enables this now that I know of, but it is a pile of C++ code that can be changed and recompiled, of course :)

There is the capability to use nanomsg IPC messages for some purposes, but I have not used that capability myself, so I do not know what kinds of information are transmitted out from BMv2 through that channel.