How can I tune bmv2 parameters to get a larger cache and use the maximum number of processors when running P4 programs?

When I run my P4 program with simple_switch and test it with iperf over UDP, top shows only one process doing any work. That single process quickly reaches 100% CPU, yet total CPU usage stays below 5%. How can I adjust the P4 program or the bmv2 configuration parameters to get more throughput and lower latency?
I read on the forum that the best performance from the bmv2 switch comes from recompiling it with these options:

./configure 'CXXFLAGS=-g -O3' 'CFLAGS=-g -O3' --disable-logging-macros --disable-elogger

If this is the way to go, is there anything else to do after the make and make install commands are executed?

Hi @Duang

You can find more detail in performance.md in the p4lang/behavioral-model repository on GitHub (main branch). That said, I don't think you can improve performance much further, because bmv2 is designed to provide full compatibility with the P4 language, not to achieve high performance.
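
For reference, the rebuild could be sketched roughly as below. The configure line is the one quoted in the question. The surrounding steps (autogen.sh, a parallel make, sudo for the install, and ldconfig to refresh the shared library cache) are my assumptions about a typical Linux build from a git checkout, and the function name rebuild_bmv2 is just for illustration.

```shell
# Sketch only: rebuild bmv2 with optimizations and logging macros disabled.
# Assumes the current directory is the root of a behavioral-model checkout.
rebuild_bmv2() {
    ./autogen.sh &&
    ./configure 'CXXFLAGS=-g -O3' 'CFLAGS=-g -O3' \
        --disable-logging-macros --disable-elogger &&
    make -j"$(nproc)" &&
    sudo make install &&
    sudo ldconfig   # assumption: needed so the newly installed libraries are found
}
```

Nothing runs until you call rebuild_bmv2 yourself. After installing, restart simple_switch so that the new binary is the one actually in use.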

I do not recall if it is the default behavior, or whether it requires command line options to enable, but BMv2 can use separate threads for ingress processing vs. egress processing, and also for ingress processing for packets arriving on different input ports, and for egress processing for packets being transmitted on different output ports.

That will not help total throughput if all of your packets are going from one input port to one output port, and most of the processing is ingress processing, though.
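
To check how the load is actually spread, you can look at the individual threads instead of the whole process. A minimal sketch, assuming a Linux host with procps ps and pgrep available; the helper name show_threads is made up for this example.

```shell
# List every thread of a process: thread id, the CPU it last ran on,
# its CPU share (averaged over the thread's lifetime), and its name.
show_threads() {
    ps -L -o tid,psr,pcpu,comm -p "$1"
}

# Example (assumes simple_switch is already running):
#   show_threads "$(pgrep -o simple_switch)"
# For a live per-thread view, use: top -H -p <pid>
```

If one thread sits near 100% while the others are idle, the traffic is being handled by a single thread, and adding more cores will not raise throughput.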