CPU idle during stress test

I installed the switch as follows:

./autogen.sh && ./configure --disable-logging-macros --disable-elogger 'CFLAGS=-O3' 'CXXFLAGS=-O3'

Then I ran the benchmark script stress_test_ipv4.py; here is what I got:

Running iperf measurement 1 of 5
924 Mbps
Running iperf measurement 2 of 5
809 Mbps
Running iperf measurement 3 of 5
779 Mbps
Running iperf measurement 4 of 5
907 Mbps
Running iperf measurement 5 of 5
1158 Mbps
Median throughput is 809 Mbps

Although most people achieve about 1 Gbps throughput in this test, that still seems too low to me, since I am using an 80-core CPU (Intel Xeon Gold 6230). I monitored the CPU utilization and found that only a few cores went up to 70% while the rest were completely idle.

The same thing happened when I launched simple_switch_grpc and connected it to real physical interfaces to perform a stress test.

Is such CPU utilization normal? Can the performance be improved?


Getting high CPU utilization on an 80-core system, or any multi-core system, requires running software that takes advantage of parallel processes and/or threads in a productive way.

I took a quick look at stress_test_ipv4.py, and I may have missed something, but from that quick look it appears to use iperf to send traffic as fast as possible from one simulated host to another through a single instance of the simple_switch process.

simple_switch (and its closely related process simple_switch_grpc) can already take advantage of multiple threads, and thus multiple CPU cores, by running separate threads for ingress processing on different input ports and for egress processing on different output ports. For stress_test_ipv4.py, the maximum parallelism attainable that way is 2 cores: one for ingress on the single input port where packets arrive, and one for egress on the single output port where packets leave.
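To make that limit concrete, here is a minimal, hypothetical sketch of such a per-port threading model (my own illustration, not bmv2's actual code): one thread services each input port's ingress queue and one thread services each output port's egress queue, so a single flow through one input port and one output port can keep at most two threads busy, no matter how many cores the machine has.

```cpp
// Hypothetical sketch of per-port worker threads (not bmv2's actual code).
// With traffic arriving on only one input port and leaving on only one
// output port, at most two of these threads ever have work to do.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Packet { int out_port; /* headers and payload omitted */ };

class PacketQueue {
 public:
  void push(Packet p) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(p); }
    cv_.notify_one();
  }
  Packet pop() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !q_.empty(); });
    Packet p = q_.front();
    q_.pop();
    return p;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<Packet> q_;
};

int main() {
  constexpr int kNumPorts = 4;
  std::vector<PacketQueue> ingress_q(kNumPorts), egress_q(kNumPorts);
  std::vector<std::thread> workers;

  // One ingress thread per input port: run the ingress pipeline for that
  // port's packets, then hand each packet to its output port's queue.
  for (int port = 0; port < kNumPorts; ++port) {
    workers.emplace_back([&, port] {
      for (;;) {
        Packet p = ingress_q[port].pop();
        // ... match-action ingress processing would happen here ...
        egress_q[p.out_port].push(p);
      }
    });
  }

  // One egress thread per output port: run the egress pipeline, then
  // transmit the packet.
  for (int port = 0; port < kNumPorts; ++port) {
    workers.emplace_back([&, port] {
      for (;;) {
        Packet p = egress_q[port].pop();
        // ... egress processing and transmit would happen here ...
        (void)p;
      }
    });
  }

  // This sketch runs forever; a real switch would add a shutdown path.
  for (auto &t : workers) t.join();
}
```

In this model, adding cores does not help a single flow at all; only traffic spread across more ports would engage more threads.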

One could imagine changing simple_switch's implementation so that multiple threads process different packets from the same input port in ingress, or packets destined for the same output port in egress, but that is not how it is currently written. While that might be straightforward for some P4 programs, any program that accesses P4 meters and/or register arrays would need a lot of synchronization between those threads to share that state. All of the ideas in this paragraph would require a very significant amount of new code in the behavioral-model C++ implementation.
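As a rough illustration of where that synchronization bites (again a hypothetical sketch, not proposed bmv2 code): if several worker threads shared one input port, every P4-style register access becomes a read-modify-write on shared state that must be locked, and in the worst case, where every packet touches the same register index, the threads simply serialize on that lock and the extra cores buy nothing.

```cpp
// Hypothetical sketch: sharding one port's packets across N worker threads.
// The shared "register array" must be protected, and that protection is
// exactly where the parallelism is lost if every packet touches it.
#include <array>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t kRegisterArraySize = 1024;

std::array<std::uint64_t, kRegisterArraySize> register_array{};
std::mutex register_mutex;  // one coarse lock; finer-grained locking helps
                            // only if packets hit different indices

void process_packet(std::size_t reg_index) {
  // ... stateless match-action lookups could proceed in parallel here ...

  // Stateful access: a read-modify-write on a P4-style register.
  // All threads contend on this lock, serializing the stateful step.
  std::lock_guard<std::mutex> lk(register_mutex);
  register_array[reg_index] += 1;
}

int main() {
  constexpr int kNumWorkers = 8;
  constexpr int kPacketsPerWorker = 100000;
  std::vector<std::thread> workers;
  for (int w = 0; w < kNumWorkers; ++w) {
    workers.emplace_back([] {
      for (int i = 0; i < kPacketsPerWorker; ++i)
        process_packet(/*reg_index=*/0);  // worst case: all hit index 0
    });
  }
  for (auto &t : workers) t.join();
}
```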

Another idea would be to implement a form of pipeline parallelism: divide ingress processing for packets from the same input port into stages, as Tofino and some other switch ASICs do physically, with one CPU core doing the first 1/N of the table lookups, a second core the next 1/N, and so on. Of course, that gives no speedup if the ingress code performs only a single table lookup. It, too, would be a very significant change to the behavioral-model C++ code.
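Here is a hypothetical sketch of that pipeline idea (not existing behavioral-model code): each stage runs on its own thread, does its share of the table lookups, and hands the packet to the next stage over a queue, so up to N packets can be in flight through ingress at once, at the cost of one queue handoff per stage per packet.

```cpp
// Hypothetical sketch of pipeline parallelism (not bmv2's actual design):
// ingress is split into kStages stages, each on its own thread, so up to
// kStages packets can be in flight through ingress at once.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Packet { int id; };  // id < 0 is used as a shutdown signal below

class BlockingQueue {
 public:
  void push(Packet p) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(p); }
    cv_.notify_one();
  }
  Packet pop() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return !q_.empty(); });
    Packet p = q_.front();
    q_.pop();
    return p;
  }
 private:
  std::mutex m_;
  std::condition_variable cv_;
  std::queue<Packet> q_;
};

int main() {
  constexpr int kStages = 4;      // each stage does 1/kStages of the lookups
  constexpr int kNumPackets = 10;
  std::vector<BlockingQueue> handoff(kStages + 1);
  std::vector<std::thread> stages;

  for (int s = 0; s < kStages; ++s) {
    stages.emplace_back([&, s] {
      for (;;) {
        Packet p = handoff[s].pop();
        if (p.id < 0) {           // forward the shutdown signal and stop
          handoff[s + 1].push(p);
          break;
        }
        // ... this stage's 1/kStages share of the table lookups goes here ...
        handoff[s + 1].push(p);   // hand the packet to the next stage
      }
    });
  }

  // Feed packets into the first stage; a real switch would do this from
  // the input port's receive loop.
  for (int id = 0; id < kNumPackets; ++id) handoff[0].push(Packet{id});
  handoff[0].push(Packet{-1});    // shutdown signal

  // Drain the final queue (stand-in for the egress/transmit side).
  for (int id = 0; id <= kNumPackets; ++id) (void)handoff[kStages].pop();

  for (auto &t : stages) t.join();
}
```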

So yes, there are ways, but they are a lot of work, and bmv2 is primarily a software switch for developing and debugging P4 programs. If you want a higher-performance software switch that runs programs written in P4, the DPDK back end for the p4c compiler may some day achieve better packets/sec per CPU core utilized; I do not know exactly where it stands feature-wise and bug-wise as of January 2022. I also do not know its strategy for dividing packet-processing work between CPU cores, but I would guess it is closer to bmv2's current approach than to the more parallel schemes described above.