Sanity check: MMF-based controller for BMv2/Mininet with host-side policing

Hi all,

I’m prototyping a small P4Runtime controller that tries to enforce network-wide max–min fairness (progressive filling) over a 6-switch mesh in Mininet/BMv2. I’d really appreciate a sanity check: does this architecture make sense, and what would you improve?

What I’m running

  • Data plane: v1model BMv2 (simple_switch_grpc). P4 has two tables:

    1. MyIngress.ipv4_lpm: match hdr.ipv4.dstAddr (LPM) → action ipv4_forward(dstAddr, port)

    2. MyIngress.ipv4_src: match hdr.ipv4.srcAddr (exact) → direct counter + action count_packet() (no data-plane change; just to attach counters)

  • Control plane: Python + P4Runtime. I push static IPv4 routing to all switches and install the ipv4_src table only on edge switches (those that have directly attached hosts).

  • Topology: 6 switches (s1..s6), 8 hosts (h1..h8). Links are declared in a JSON topology file. I parse:

    • switch–switch capacities (mostly 10 Mb/s in my test)

    • host–switch egress “caps” (currently 10 Mb/s; used to clamp per-host rate)

    • port maps (so I can program the correct out ports)

  • Traffic model: JSON traffic_matrix with flows like:
    {
      "id": "flow_1_to_8_heavy",
      "src": "h1",
      "dst": "10.0.6.8",
      "protocol": "UDP",
      "dst_port": 5201,
      "duration": 60,
      "demand_mbps": 15.0
    }
    I use these demands for the MMF solver; measurement is only to detect “active” senders.

Controller loop (every ~1 s)

  1. Read all direct counters for MyIngress.ipv4_src on every switch via P4Runtime Read().

  2. For each edge switch entry, derive bytes/second per src IP, with some jitter smoothing on dt.

  3. Mark a source as active if rate >= 0.05 Mb/s.

  4. Build paths (static single shortest path) for active flows using the topology graph.

  5. Run progressive filling (max–min fairness) over undirected edges using capacities from the topology and demands from the traffic matrix. This returns per-flow allocations in Mb/s.

  6. Aggregate allocations per source IP (sum over that host’s flows).

  7. Clamp per-source allocations by the host’s egress cap (host→switch link capacity).

  8. Enforce the effective per-host limit using Linux TBF in the host namespace:
    sudo mnexec -a <host_pid> tc qdisc replace dev <host_iface> root tbf rate <alloc>mbit burst 1000kbit latency 50ms
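Steps 5–7 above can be sketched roughly as follows. The data shapes and names here are my assumptions for illustration, not the actual controller code:

```python
# Sketch of steps 5-7: progressive filling (max-min fairness), then per-source
# aggregation and clamping by the host egress cap. Assumed data shapes:
#   flows:      {flow_id: (demand_mbps, [edge, ...], src_host)}
#   capacities: {edge: capacity_mbps}

def progressive_filling(flows, capacities, eps=1e-9):
    """Return {flow_id: allocation_mbps} under max-min fairness."""
    alloc = {f: 0.0 for f in flows}
    residual = dict(capacities)
    active = set(flows)
    while active:
        def sharers(e):
            return sum(1 for g in active if e in flows[g][1])
        # Largest uniform increment before an edge saturates or a demand is met.
        edge_step = min(residual[e] / sharers(e)
                        for f in active for e in flows[f][1])
        demand_step = min(flows[f][0] - alloc[f] for f in active)
        step = min(edge_step, demand_step)
        for e in {e for f in active for e in flows[f][1]}:
            residual[e] -= step * sharers(e)
        for f in active:
            alloc[f] += step
        # Freeze flows that met their demand or traverse a saturated edge.
        active = {f for f in active
                  if flows[f][0] - alloc[f] > eps
                  and all(residual[e] > eps for e in flows[f][1])}
    return alloc

def per_source_limits(flows, alloc, egress_caps):
    """Sum allocations per source host, then clamp by the host egress cap."""
    sums = {}
    for f, (_, _, src) in flows.items():
        sums[src] = sums.get(src, 0.0) + alloc[f]
    return {h: min(r, egress_caps[h]) for h, r in sums.items()}
```

For example, two 15 Mb/s flows sharing one 10 Mb/s edge each get 5 Mb/s, while an 8 Mb/s flow on a disjoint 10 Mb/s edge gets its full demand.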

Example results

  • With two “heavy” cross-mesh flows (h1→h8, h2→h7) each demanding 15 Mb/s and four additional 8–10 Mb/s flows on partly disjoint paths, the MMF solver typically yields something like:
    allocations: { flow_1_to_8_heavy: 4.0, flow_2_to_7_heavy: 4.0, flow_3_to_4_mid: 8.0, flow_5_to_6_mid: 10.0, flow_1_to_5_diag: 10.0, flow_3_to_6_diag: 10.0 }
    per-source sums: { h1: 14.0, h2: 4.0, h3: 18.0, h5: 10.0 }
    effective clamp by host egress cap (10 Mb/s) then pushes: h1→10, h2→4, h3→10, h5→10.

  • The controller logs show the MMF numbers and the applied TBF, and iperf3 server logs line up with the enforced caps.

  • CSV with counter deltas sometimes shows underreported instantaneous Mb/s because Read() can be sparse across iterations; I now clamp dt in the rate calculation to stabilize it.
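The dt clamp mentioned above might look something like this. This is a minimal sketch; the clamp bounds are illustrative assumptions, not the actual values used:

```python
# Convert a byte-counter delta into Mb/s, clamping dt so that sparse or
# bunched-up Read() iterations don't produce wild instantaneous rates.
# The dt_min/dt_max bounds here are illustrative assumptions.

def bytes_to_mbps(byte_delta, dt_seconds, dt_min=0.5, dt_max=5.0):
    dt = min(max(dt_seconds, dt_min), dt_max)
    return (byte_delta * 8) / (dt * 1e6)
```

With a 1 s loop period, a delta of 1,250,000 bytes over a nominal 1 s gap yields 10 Mb/s; a spuriously tiny or huge dt gets pulled back into the clamp window before dividing.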

What I’m unsure about and would love feedback on

  1. Counters and flow granularity. Counting exact srcAddr at the edge is simple but conflates multiple flows from one host. Would you recommend moving to a 5-tuple key for “activity detection” (src/dst/proto/ports)? On v1model BMv2, would you do this with a hash-based table + registers, or send digests to the controller?

  2. Where to police. For Mininet this host-side TBF is easy and predictable, but would you instead try to use data-plane meters (direct/indirect) or an egress policer in P4? Any accuracy/timing caveats with BMv2 meters for this use?

  3. Read path robustness. Is it expected that a P4Runtime table Read() over direct counters occasionally returns a sparse set of entries on BMv2, causing uneven dt per key? Is there a better pattern (e.g., reading by exact key per src, clearing counters, or using idle timeouts)?

  4. Directional capacity. I currently treat link capacity as undirected per edge. Would you model capacities per direction in the solver for a more realistic result?

  5. Routing and fairness. Paths are single static shortest paths; no ECMP. Any recommended teaching-grade way to combine MMF with ECMP on BMv2 without getting into full-blown TE?

  6. Control loop tuning. I only push a new TBF if the change exceeds 5% to avoid churn. Loop period is 1 s. Any better heuristics you’ve used?

  7. Activation threshold. I mark a src as “active” at 0.05 Mb/s. Would you tie this to demand_mbps, or keep a fixed floor?

  8. Validation methods. Beyond iperf3 UDP, I plan to try TCP CUBIC/BBR and link failure tests (drop capacity on one s–s edge and watch allocations shift). Any other quick checks you’d recommend?

Repro details if anyone wants to try something similar

  • BMv2 simple_switch_grpc with gRPC ports 50051–50056, P4 compiled with p4c-bm2-ss.

  • Mininet topology in JSON; controller parses switch ports, link delays, and capacities.

  • iperf3 servers run in the destination namespaces; clients are started from the traffic_matrix.json.

  • Controller builds an IP→hostname map from topology to locate host PIDs for tc.

If this approach is off the rails, please say so. If it’s reasonable, I’d be grateful for pointers to better practices (e.g., per-flow activity tracking, where to place policers/meters in v1model, and more robust counter polling). Happy to share code snippets or a trimmed repo if that helps.

Thanks in advance!

I will limit my responses to those of your questions that I have some knowledge about, and hope that others may respond to questions that I do not.

First, realize that BMv2 is primarily intended for debugging the behavior of P4 programs. It is not intended to give packet processing performance or queueing behavior that matches what a real physical switch would do. For example, people have often asked why, in BMv2 tests, queues of packets leading to a switch output port build up and drain in odd ways. A project called P4sim has been published that integrates BMv2 into the NS3 network simulation system, and may be better at simulating performance-sensitive behavior of a network, but I have not confirmed this with my own experiments. See here for more details on installing P4sim on a system with BMv2: added docs for installing p4sim by Vineet1101 · Pull Request #706 · p4lang/tutorials · GitHub

Directional capacity

Networks with physical copper or fiber optic cables between devices typically have link capacities that are independent in the two directions between two network devices A and B: for a link with a capacity of 100 gigabits/sec, you can simultaneously send 100 gigabits/sec of traffic from A to B and 100 gigabits/sec from B to A. For Wi-Fi or other kinds of radio links, the capacity can be shared between the two directions, or separate (e.g. separate if they use different radio frequencies in the two directions and there are no other transmitters in that frequency range nearby causing interference).
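One way to reflect this in the solver is to key capacities by ordered node pairs instead of undirected edges. A sketch, with made-up switch names and values:

```python
# Expand an undirected capacity map into per-direction entries, so the
# solver can track residual capacity independently in each direction.
# Names and values here are illustrative, not from the original topology.
undirected = {("s1", "s2"): 10.0, ("s2", "s3"): 10.0}

directed = {}
for (a, b), cap in undirected.items():
    directed[(a, b)] = cap  # a -> b direction
    directed[(b, a)] = cap  # b -> a direction

# A flow's path then consumes capacity only on its ordered hops, e.g.
# h1 -> s1 -> s2 -> s3 uses ("s1", "s2") and ("s2", "s3"), not the reverse keys.
```

This leaves traffic in opposite directions independent, matching the wired-link behavior described above, while radio-style shared links would instead keep a single shared key for both directions.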

Any accuracy/timing caveats with BMv2 meters for this use?

I have not done any experiments with BMv2 to determine precisely what its accuracy and “granularity” is for policing rates. I do know that the units used by BMv2 for metering should match what is documented here: behavioral-model/docs/runtime_CLI.md at main · p4lang/behavioral-model · GitHub

I would recommend trying a very simple experiment with a single traffic flow all hitting a single P4 meter configured at different rates, and measure what it actually does, if you want to know how precisely BMv2 lets you control this.

Read path robustness

Again, focused experiments with BMv2 are what I would recommend if you suspect that reading many counters could become a limiting factor for you.

If you suspect, or actually confirm, that the rate of reading multiple counters in BMv2 is too slow for your use case, there is a technique used in multiple network telemetry systems that may be useful to you.

Keep 2 sets of counters, call them set 0 and set 1. Have a table or P4 register called something like counter_set_to_update that your P4 program reads before updating counters; it tells the program whether to update set 0 or set 1.

Have your controller initialize a switch to use set 0 at the beginning. When you want to get a consistent snapshot of many counters, have your controller write counter_set_to_update with the value 1. After that write completes, you know that all of the counters in set 0 will remain at their current values, i.e. the P4 program will stop updating set 0 and only update set 1.

Your controller can now read the counters in set 0 at whatever rate it is capable of, but those values will all accurately represent their values at a single point in time (just before counter_set_to_update was changed from 0 to 1).

Later, when you want to read another consistent set of counter values, your controller should first write counter_set_to_update with 0, then go and read all of the counters in set 1.

Repeat periodically, alternating counter_set_to_update between the values 0 and 1.
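The alternation logic can be modeled controller-side like this. This is a toy model of the switch state to show the snapshot semantics, not real P4Runtime calls:

```python
# Toy model of the double-buffered counter technique: two counter sets
# selected by a counter_set_to_update register, flipped by the controller
# to freeze one set while the data plane keeps updating the other.

class TwoSetCounters:
    def __init__(self, n):
        self.sets = [[0] * n, [0] * n]
        self.counter_set_to_update = 0

    def packet(self, idx, nbytes):
        # Data plane: update only the currently selected set.
        self.sets[self.counter_set_to_update][idx] += nbytes

    def snapshot(self):
        # Controller: flip the selector, then read the now-frozen set.
        frozen = self.counter_set_to_update
        self.counter_set_to_update ^= 1
        values = list(self.sets[frozen])
        # Zero the frozen set so the next epoch starts fresh.
        self.sets[frozen] = [0] * len(values)
        return values
```

Every value returned by snapshot() reflects a single point in time (the instant of the selector flip), regardless of how slowly the frozen set is then read out.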