I am writing a P4 program on an Ubuntu virtual machine in VMware, with 8.5GB of allocated memory and an 8-core CPU. Since my program is quite large, I encountered the error “Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS” during compilation. Apart from increasing the virtual machine’s memory, are there any other ways to resolve this issue?
Have you run a program like top in a separate terminal while trying to compile your P4 program, and verified that some P4 compiler process like p4c-bm2-ss is using multiple gigabytes of resident (RES) memory before this error occurs? Or is it some other process using a lot of memory?
If it is the P4 compiler, you say your P4 program is “quite large”. Could you, at least as an experiment, create a copy of the program and delete or comment out a large chunk of it, to see whether it compiles with the amount of memory your system has now?
What is the output of p4c --version on your system? I ask because during the past 6-12 months several P4 compiler developers have been working on changes that speed up the compiler and let it use less memory than older versions. Trying the most recent version, e.g. one built from source code from today or any time during the last month, might help reduce the memory required.
All of that said, if you are using the latest version of the P4 compiler and you need everything in your large program, I have seen some evidence that the error message you are seeing might be due to the garbage-collection (GC) library used by the P4 compiler needing more than 8 GB of RAM. There appear to be environment variables that can be used to raise this limit, like export GC_MAXIMUM_HEAP_SIZE=16G, which I found mentioned in an Internet search on this page: Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS when allocating many sets · Issue #5610 · crystal-lang/crystal · GitHub

But if you need that option to increase the RAM that the P4 compiler’s GC can use, then you would also need to run it on a system with more RAM available.
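As a minimal sketch, assuming the compiler is linked against the Boehm GC (which is what honors GC_MAXIMUM_HEAP_SIZE), the invocation might look like the following; the 16G value is only an example, and the guard is there so the sketch is harmless on machines where the compiler is not installed:

```shell
# Give the Boehm GC used by the P4 compiler a larger heap ceiling.
# 16G is an example value; keep it below the machine's physical RAM.
export GC_MAXIMUM_HEAP_SIZE=16G

# Invoke the compiler only if it is installed (guard keeps the sketch
# runnable on machines without p4c).
if command -v p4c-bm2-ss >/dev/null 2>&1; then
    p4c-bm2-ss --arch v1model main.p4
fi
```

Note that the environment variable only raises the GC's own limit; the operating system still has to have that much memory available.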
Thank you for your reply. Below are my responses to your questions:

- I ran the top command and found that p4c-bm2-ss is consuming a large amount of memory. There are no other programs competing for resources with p4c on the system.
- After commenting out some parts of the code, my P4 program can be compiled and run successfully.
- The version of the p4c program I am using is indeed quite outdated, as I have not updated it since downloading it last September.
Thanks again for your response. I will try using the latest version of the p4c program. If the issue persists, I will consider allocating more memory.
I have now updated p4c to the latest version (v1.2.5.1). However, since my program uses a large number of macro substitutions and the codebase is quite large, even the latest version of p4c cannot compile my program quickly. Therefore, I would like to know if there are any solutions to speed up the compilation of my program.

The command I use for compilation is as follows:

p4c --target bmv2 --arch v1model main.p4
Is your P4 program something you are able and willing to publish? If so, it might help the people who have worked on improving the performance of p4c to profile the compiler while it compiles your program and see where most of the time is being spent, and/or to look at your source code and suggest alternative ways of achieving similar behavior with lower compilation time (if such a way exists).

If you are willing to publish it, rather than copying and pasting it here, I would recommend putting your code on a public web site, e.g. a public repository on GitHub or a similar site, and putting a link to it in a comment here.
Although this is one of my research projects, I am willing to make my P4 code public. The code repository is available at the following link:
This project may appear somewhat immature, so I sincerely welcome your feedback and suggestions for improvement.
I was able to compile your program with some extra debug output showing which passes were running at the time, and I ran top in a separate terminal window to see when the memory grew.

p4c-bm2-ss -vv --arch v1model main.p4
I could see that most of the processing time, and most of the memory-use increase, occurred while executing two passes, both with SimplifyDefUse in their names. There are some recent issues where p4c developers are profiling the memory use of p4c and have seen large memory use in that pass for some P4 programs, and your program certainly seems to be one of those: Elevated memory usage in def-use · Issue #4872 · p4lang/p4c · GitHub

I will add a comment on that p4c issue linking to this discussion, in case the developers want to use your program as another example to consider.
Ok, thank you for your efforts and contributions. I am looking forward to future improvements of p4c.
Do you have any solutions or suggestions to temporarily address my issue? Currently, I’m unable to make further progress on my project.
I’ve already explored alternative approaches. Initially, I suspected the core problem was that repeatedly using the macro definition (EXECUTE_ONE_INSTRUCTION) caused the code size to grow excessively. Recently, I attempted a new strategy by introducing a control, instruction_execute_3, which executes three instances of EXECUTE_ONE_INSTRUCTION. I then applied instruction_execute_3 repeatedly in the original control MyIngress, aiming to handle and execute more instructions.

However, after examining the compiled JSON code, I noticed that the compiled code size increases with the number of instruction_execute_3 calls. This suggests that p4c does not reuse instruction_execute_3, resulting in behavior similar to the previous repeated use of macro definitions. Unfortunately, this approach hasn’t resolved the issue.
control instruction_execute_3(inout headers hdr,
                              inout metadata meta,
                              inout standard_metadata_t standard_metadata) {
    #include "instruction/address_mode.p4"
    #include "instruction/operator.p4"

    apply {
        EXECUTE_ONE_INSTRUCTION;
        EXECUTE_ONE_INSTRUCTION;
        EXECUTE_ONE_INSTRUCTION;
    }
}

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    // In P4_16 a control type must be instantiated before it can be applied.
    instruction_execute_3() instruction_execute_3_inst;

    apply {
        if (hdr.instruction.isValid()) {
            if (hdr.instruction.command_type == INSTRUCTION_COMMAND_TYPE.PUBLISH_PROGRAM) {
                save_intructions(hdr.instruction.start_pc);
            }
            else if (hdr.instruction.command_type == INSTRUCTION_COMMAND_TYPE.EXECUTE_PROGRAM) {
                instruction_execute_3_inst.apply(hdr, meta, standard_metadata);
                instruction_execute_3_inst.apply(hdr, meta, standard_metadata);
                instruction_execute_3_inst.apply(hdr, meta, standard_metadata);
                instruction_execute_3_inst.apply(hdr, meta, standard_metadata);
                instruction_execute_3_inst.apply(hdr, meta, standard_metadata);
                instruction_execute_3_inst.apply(hdr, meta, standard_metadata);
            }
        }
    }
}
I can understand why the modified program you describe did not improve the situation for you. The P4 compiler expands all sub-controls at compile time, much as inline functions can be expanded in a C program. The resulting inline-expanded program is the same as your original, and so the compile time is no different.
One possibility you may want to consider, although you may find it a bit onerous:
If you can write a P4 program that recirculates the packet after each execution of EXECUTE_ONE_INSTRUCTION, so that the macro is invoked exactly once in your ingress control and the packet is recirculated as many times as needed to execute EXECUTE_ONE_INSTRUCTION the desired number of times, then the compile time for that control should be much smaller.
If you restrict this repeated execution to the ingress control, you could also use the resubmit feature instead of recirculation. You can read more about the details of how to resubmit or recirculate packets in the v1model architecture here: behavioral-model/docs/simple_switch.md at main · p4lang/behavioral-model · GitHub
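As a rough sketch of the recirculation idea, assuming a hypothetical metadata field meta.remaining_instructions that counts the instructions left to execute (the field name, the field-list index 0, and the surrounding structure are all illustrative, not taken from the actual program; the exact rules for preserving metadata across recirculation are in the simple_switch document linked above):

```p4
control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    apply {
        if (hdr.instruction.isValid() &&
            hdr.instruction.command_type == INSTRUCTION_COMMAND_TYPE.EXECUTE_PROGRAM) {
            // The macro appears exactly once, so the inline-expanded
            // program stays small regardless of how many instructions
            // a packet ultimately executes.
            EXECUTE_ONE_INSTRUCTION;
            if (meta.remaining_instructions > 0) {
                meta.remaining_instructions = meta.remaining_instructions - 1;
                // Send the packet back through the pipeline for the next
                // instruction. The preserved metadata fields must be
                // marked with an @field_list annotation matching index 0.
                recirculate_preserving_field_list(0);
            }
        }
    }
}
```

The trade-off is run-time cost: each recirculation (or resubmit) sends the packet through the pipeline again, so per-packet latency grows with the number of instructions executed.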
Thank you very much for your response. Your suggestions have given me some inspiration. I have come up with a solution where a certain number of instructions (e.g., 10) are executed per pass, and additional instructions are processed through two or three resubmit operations. While this is a trade-off, sacrificing performance for generality, I believe that other optimization solutions can be found in the future.