A few people have mentioned the idea of having a fixed-function component in a P4-programmable device that is similar to a Traffic Manager, in that it can store full packets and later send them back out, but different in what events trigger the storing and later reading back of the packets.
For the sake of a name, let us call this new fixed-function component a “random access packet buffer”, or RAPB for short.
Imagine writing a P4 control with some new intrinsic metadata output that says: “When this P4 control finishes executing for this packet, if the flag store_packet is true, the RAPB writes this packet’s data somewhere in its memory that is currently free, i.e. not used by some other packet. It then sends out either the address where the packet was stored, or an error status indicating that there was no room available to store the packet, so it was discarded.”
The address and failed_to_store metadata values are sent as an “event” to another P4-programmable control that is logically after the RAPB, and that P4 code could do whatever you want with those values, but a typical example would be “if failed_to_store is false, record the address in some P4 register array with an appropriate index for my P4 program to read it back later”.
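As a rough illustration of that “typical example”, here is a small Python model of the control that runs after the RAPB. It is only a sketch: the names StoreEvent, PostStoreControl, flow_index, and INVALID_ADDR are hypothetical, and it assumes the event carries some user metadata (flow_index here) that selects the register array entry.

```python
from dataclasses import dataclass

INVALID_ADDR = -1  # hypothetical sentinel: no packet stored for this index


@dataclass
class StoreEvent:
    """Hypothetical event the RAPB sends after a store attempt."""
    address: int
    failed_to_store: bool
    flow_index: int  # assumed user metadata carried along with the event


class PostStoreControl:
    """Stands in for the P4 control that is logically after the RAPB."""

    def __init__(self, num_entries: int):
        # Models a P4 register array; every entry starts out empty.
        self.addr_reg = [INVALID_ADDR] * num_entries

    def on_store_event(self, ev: StoreEvent) -> bool:
        """Record the address on success; return True if it was recorded."""
        if ev.failed_to_store:
            return False  # the packet was discarded, nothing to remember
        self.addr_reg[ev.flow_index] = ev.address
        return True
```

A later packet belonging to the same flow could then look up addr_reg[flow_index] to find the address to hand back to the RAPB.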
At any later time while processing a packet, you could output intrinsic metadata field read_packet (boolean) and a read address, and the RAPB on seeing the metadata read_packet=true would read the packet at the provided read address, and send it out to some P4-programmable control, which could then process it however you like in P4 code.
At yet another later time, you could output intrinsic metadata that could deallocate that packet. As an optimization for the common case, the RAPB could also support both reading and deallocating the packet at the same time, but having an option to read a packet but not yet deallocate would enable additional use cases, too.
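To make the store, read, and deallocate behavior concrete, here is a toy Python model of the RAPB itself. It is a sketch under simplifying assumptions that are mine, not part of any real device: one packet per fixed-size slot, a slot index used as the address, and a free list kept as a stack. The class and method names are hypothetical.

```python
from typing import Optional, Tuple


class RAPB:
    """Toy model of a random access packet buffer with one packet per slot."""

    def __init__(self, num_slots: int):
        self.slots = [None] * num_slots
        # Stack of free addresses, ordered so that address 0 is handed out first.
        self.free = list(range(num_slots - 1, -1, -1))

    def store(self, packet: bytes) -> Tuple[Optional[int], bool]:
        """Return (address, failed_to_store)."""
        if not self.free:
            return None, True  # no room: the packet is discarded
        addr = self.free.pop()
        self.slots[addr] = packet
        return addr, False

    def read(self, addr: int, deallocate: bool = False) -> Optional[bytes]:
        """Read the packet at addr, optionally freeing it in the same operation."""
        packet = self.slots[addr]
        if packet is not None and deallocate:
            self.deallocate(addr)
        return packet

    def deallocate(self, addr: int) -> None:
        """Free the slot; freeing an already-free slot is a no-op."""
        if self.slots[addr] is not None:
            self.slots[addr] = None
            self.free.append(addr)
```

Keeping read and deallocate as separate operations, with the combined form as an option on read, mirrors the point above: a packet can be read several times (for example, to send copies at different times) before its memory is finally released.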
I do not know of any P4-programmable device with a high-performance low-cost RAPB in it. There might be one or more that I have not heard of, though.
If I wanted to create such a P4-programmable device, I would look first to doing it in an FPGA, because then you could implement the RAPB directly in FPGA logic. Failing that, you could of course “implement a RAPB” in software behind a CPU port (or multiple ports), but then its cost and power consumption per unit of performance would be whatever you can get with a general purpose CPU, which tends to be more expensive and more power-hungry for a given level of packet rate performance.