Hi all,
I’m prototyping a small P4Runtime controller that tries to enforce network-wide max–min fairness (progressive filling) over a 6-switch mesh in Mininet/BMv2. I’d really appreciate a sanity check: does this architecture make sense, and what would you improve?
**What I’m running**

- Data plane: v1model BMv2 (`simple_switch_grpc`). The P4 program has two tables:
  - `MyIngress.ipv4_lpm`: LPM match on `hdr.ipv4.dstAddr` → action `ipv4_forward(dstAddr, port)`
  - `MyIngress.ipv4_src`: exact match on `hdr.ipv4.srcAddr` → direct counter + action `count_packet()` (no data-plane effect; it exists only to attach counters)
- Control plane: Python + P4Runtime. I push static IPv4 routing to all switches and install the `ipv4_src` table only on edge switches (those with directly attached hosts).
- Topology: 6 switches (s1..s6), 8 hosts (h1..h8). Links are declared in a JSON topology file, from which I parse:
  - switch–switch capacities (mostly 10 Mb/s in my test)
  - host–switch egress “caps” (currently 10 Mb/s; used to clamp per-host rates)
  - port maps (so I can program the correct output ports)
- Traffic model: a JSON `traffic_matrix` with flows like:

      {
        "id": "flow_1_to_8_heavy",
        "src": "h1",
        "dst": "10.0.6.8",
        "protocol": "UDP",
        "dst_port": 5201,
        "duration": 60,
        "demand_mbps": 15.0
      }

  I use these demands for the MMF solver; measurement is only used to detect “active” senders.
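For reference, loading the demands is nothing fancy; this is roughly what I do (the top-level-list shape and the `traffic_matrix.json` file name are just how my file happens to look):

```python
import json

def load_demands(path="traffic_matrix.json"):
    """Map flow id -> (src host, dst IP, demanded rate in Mb/s)."""
    with open(path) as fh:
        flows = json.load(fh)  # assumes a top-level list of flow objects
    return {f["id"]: (f["src"], f["dst"], float(f["demand_mbps"]))
            for f in flows}
```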
**Controller loop (every ~1 s)**

1. Read all direct counters for `MyIngress.ipv4_src` on every switch via P4Runtime `Read()`.
2. For each edge-switch entry, derive bytes/second per source IP, with some jitter smoothing on dt.
3. Mark a source as active if its rate is >= 0.05 Mb/s.
4. Build paths (static single shortest path) for active flows from the topology graph.
5. Run progressive filling (max–min fairness) over undirected edges, using capacities from the topology and demands from the traffic matrix; this returns per-flow allocations in Mb/s.
6. Aggregate allocations per source IP (summing over that host’s flows).
7. Clamp each per-source allocation to the host’s egress cap (the host→switch link capacity).
8. Enforce the effective per-host limit with Linux TBF in the host namespace:

       sudo mnexec -a <host_pid> tc qdisc replace dev eth0 root tbf rate <alloc>mbit burst 1000kbit latency 50ms
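If it helps to sanity-check the solver step, here is a minimal progressive-filling sketch equivalent to what I’m running (pure Python; the data shapes and names are illustrative, not my exact code):

```python
def progressive_fill(flows, capacity):
    """Max-min fair allocations by progressive filling.

    flows:    {flow_id: (demand_mbps, path)} with path a list of edge keys
    capacity: {edge_key: capacity_mbps} (undirected, matching my solver)
    returns:  {flow_id: allocation_mbps}
    """
    alloc = {f: 0.0 for f in flows}
    residual = dict(capacity)
    active = set(flows)
    while active:
        # number of still-active flows crossing each in-use edge
        load = {}
        for f in active:
            for e in flows[f][1]:
                load[e] = load.get(e, 0) + 1
        # largest uniform rate increment before an edge or a demand is hit
        inc = min(residual[e] / n for e, n in load.items())
        inc = min(inc, min(flows[f][0] - alloc[f] for f in active))
        for f in active:
            alloc[f] += inc
            for e in flows[f][1]:
                residual[e] -= inc
        # freeze flows that met their demand or crossed a saturated edge
        active = {f for f in active
                  if alloc[f] < flows[f][0] - 1e-9
                  and all(residual[e] > 1e-9 for e in flows[f][1])}
    return alloc
```

For example, two flows sharing a single 10 Mb/s edge with demands of 15 and 4 Mb/s come out as 6 and 4 Mb/s respectively.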
**Example results**

- With two “heavy” cross-mesh flows (h1→h8, h2→h7), each demanding 15 Mb/s, plus four 8–10 Mb/s flows on partly disjoint paths, the MMF solver typically yields something like:

      allocations:     { flow_1_to_8_heavy: 4.0, flow_2_to_7_heavy: 4.0, flow_3_to_4_mid: 8.0, flow_5_to_6_mid: 10.0, flow_1_to_5_diag: 10.0, flow_3_to_6_diag: 10.0 }
      per-source sums: { h1: 14.0, h2: 4.0, h3: 18.0, h5: 10.0 }

  Clamping to the 10 Mb/s host egress cap then pushes: h1→10, h2→4, h3→10, h5→10.
- The controller logs show the MMF numbers and the applied TBF rates, and iperf3 server logs line up with the enforced caps.
- The CSV of counter deltas sometimes shows underreported instantaneous Mb/s because `Read()` can return entries sparsely across iterations; I now clamp dt in the rate calculation to stabilize it.
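The dt clamp is literally just this (the 0.5–5 s bounds are values I picked, nothing principled):

```python
def rate_mbps(prev_bytes, cur_bytes, prev_ts, cur_ts, min_dt=0.5, max_dt=5.0):
    """Counter delta -> Mb/s, clamping dt so a sparse or bursty Read()
    schedule can't blow up (or zero out) the instantaneous rate."""
    dt = min(max(cur_ts - prev_ts, min_dt), max_dt)
    return (cur_bytes - prev_bytes) * 8.0 / dt / 1e6
```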
**What I’m unsure about and would love feedback on**

- Counters and flow granularity. Counting exact `srcAddr` at the edge is simple but conflates multiple flows from one host. Would you recommend moving to a 5-tuple key (src/dst/proto/ports) for activity detection? On v1model BMv2, would you do this with a hash-based table + registers, or by sending digests to the controller?
- Where to police. For Mininet, this host-side TBF is easy and predictable, but would you instead use data-plane meters (direct/indirect) or an egress policer in P4? Any accuracy/timing caveats with BMv2 meters for this use case?
- Read-path robustness. Is it expected that a P4Runtime table `Read()` over direct counters occasionally returns a sparse set of entries on BMv2, causing uneven dt per key? Is there a better pattern (e.g., reading by exact key per source, clearing counters, or using idle timeouts)?
- Directional capacity. I currently treat link capacity as undirected per edge. Would you model capacities per direction in the solver for a more realistic result?
- Routing and fairness. Paths are single static shortest paths; there is no ECMP. Is there a recommended teaching-grade way to combine MMF with ECMP on BMv2 without getting into full-blown TE?
- Control-loop tuning. I only push a new TBF if the change exceeds 5%, to avoid churn, and the loop period is 1 s. Any better heuristics you’ve used?
- Activation threshold. I mark a source as “active” at 0.05 Mb/s. Would you tie this threshold to `demand_mbps`, or keep a fixed floor?
- Validation methods. Beyond iperf3 UDP, I plan to try TCP CUBIC/BBR and link-failure tests (drop capacity on one switch–switch edge and watch allocations shift). Any other quick checks you’d recommend?
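Regarding the control-loop tuning point above, the churn guard is nothing more than this (5% is arbitrary; suggestions welcome):

```python
def should_push(new_mbps, last_mbps, rel=0.05):
    """Reinstall the TBF qdisc only when the allocation moved by more
    than `rel` relative to what was last applied (avoids tc churn)."""
    if last_mbps is None:      # nothing applied yet
        return True
    if last_mbps == 0.0:
        return new_mbps > 0.0
    return abs(new_mbps - last_mbps) / last_mbps > rel
```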
**Repro details if anyone wants to try something similar**

- BMv2 `simple_switch_grpc` with gRPC ports 50051–50056; P4 compiled with `p4c-bm2-ss`.
- Mininet topology in JSON; the controller parses switch ports, link delays, and capacities.
- iperf3 servers run in the destination hosts’ namespaces; clients are started according to `traffic_matrix.json`.
- The controller builds an IP→hostname map from the topology to locate host PIDs for `tc`.
If this approach is off the rails, please say so. If it’s reasonable, I’d be grateful for pointers to better practices (e.g., per-flow activity tracking, where to place policers/meters in v1model, and more robust counter polling). Happy to share code snippets or a trimmed repo if that helps.
Thanks in advance!