Dpdk-TAP-Driver

This project evaluates network performance using DPDK (testpmd), virtual TAP interfaces, and the tcpreplay utility, with LTTng function tracing for the analysis. The workflow for setup and analysis is outlined below:

1. Installing and Building DPDK with Function Tracing Support

Download the Latest DPDK Version

Retrieve the latest release from the official DPDK website.

Extract the Archive

tar xJf dpdk-<version>.tar.xz
cd dpdk-<version>

Configure the Build Environment Using Meson
```
meson setup build \
-Dexamples=all \
-Dlibdir=lib \
-Denable_trace_fp=true \
-Dc_args="-finstrument-functions"
```
The -Dc_args="-finstrument-functions" Meson configuration flag ensures that function entry and exit points are instrumented during compilation. This is essential for LTTng to capture user-space function traces; without it, the resulting trace data may be incomplete or empty.

Build and Install Using Ninja
```
cd build
ninja
meson install
ldconfig
```
Note: Root privileges are required for the last two commands.

The compiled binaries will be located in the build/app directory of the DPDK source tree (dpdk-<version>/build/app).
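
As a quick check that the instrumentation flag took effect, the built binary can be inspected for the hook that `-finstrument-functions` makes every function call. The following is a minimal sketch, not part of the original workflow: the helper name `is_instrumented` is hypothetical, and it assumes `nm` from binutils is installed and that the command is run from the DPDK source tree.

```shell
# Hypothetical helper: succeeds if BINARY references the entry hook that
# -finstrument-functions inserts (__cyg_profile_func_enter).
is_instrumented() {
    nm -D "$1" 2>/dev/null | grep -q '__cyg_profile_func_enter'
}

# The path is an assumption: run from the DPDK source tree after the build.
if is_instrumented build/app/dpdk-testpmd; then
    echo "dpdk-testpmd is instrumented"
else
    echo "no instrumentation hooks found (or binary missing)"
fi
```

If the hook is missing, the LTTng function trace in the later steps is likely to come out empty.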

2. Configure hugepages and mount 1GB pagesize

echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mkdir /mnt/huge
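mount -t hugetlbfs pagesize=1GB /mnt/huge

Note: These commands must be run as root. The first command reserves 1024 hugepages of 2 MB each (2 GB in total). In the mount command, pagesize=1GB is given without -o, so mount treats it as the device name rather than as an option, and the mount uses the default hugepage size. To use 1 GB pages instead, reserve them under /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages and mount with -o pagesize=1G.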

Alternatively, the following method can be used:

sudo sysctl -w vm.nr_hugepages=1024
mount -t hugetlbfs none /dev/hugepages
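
The reservation can be sanity-checked with a little arithmetic and a look at /proc/meminfo. This is a minimal sketch; the helper name `hugepage_total_mib` is hypothetical and only illustrates that total memory is the page count multiplied by the page size.

```shell
# Hypothetical helper: total reserved memory in MiB for PAGES pages of SIZE_KB kB each.
hugepage_total_mib() {
    local pages="$1" size_kb="$2"
    echo $(( pages * size_kb / 1024 ))
}

# 1024 pages of 2048 kB, as reserved in this step.
hugepage_total_mib 1024 2048

# Show what the kernel actually reserved (Linux only; silent elsewhere).
grep -E '^(HugePages_|Hugepagesize)' /proc/meminfo 2>/dev/null || true
```

For the values used here the helper prints 2048, that is, 2 GiB of hugepage memory.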

3. Create two TAP interfaces for DPDK's TAP Poll Mode Driver (PMD)

Within the dpdk-<version>/build directory, execute the testpmd application using the following command:

sudo LD_PRELOAD=/usr/lib/x86_64-linux-gnu/liblttng-ust-cyg-profile.so.1 ./app/dpdk-testpmd -l 0-1 -n 2   --vdev=net_tap0,iface=tap0   --vdev=net_tap1,iface=tap1   --   -i

Functionality:

Creates net_tap0 and net_tap1 virtual devices
Assigns 2 CPU cores (-l 0-1)
Starts in interactive mode (-i)

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/liblttng-ust-cyg-profile.so forces the DPDK application to load LTTng's function tracing library first, enabling detailed profiling of function calls for performance analysis. This allows tracking exact timing and frequency of every function call in DPDK (like packet processing functions) to identify bottlenecks.

Note:

Depending on your distribution and the installed LTTng-UST version, the exact liblttng-ust-cyg-profile.so file name may differ (for example, it may carry a version suffix), even if the liblttng-ust-dev package has been installed. To determine the specific version available on your system, you can use the following command:

ls /usr/lib/x86_64-linux-gnu/ | grep liblttng-ust-cyg-profile.so

For example, the output on one system may appear as follows:

The terminal output should appear as follows after executing the testpmd command:

By executing the command show port stats all, you will be able to view the statistics for all ports.
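
The library lookup described in the note above can also be scripted so that the LD_PRELOAD value does not have to be edited by hand. This is a sketch under assumptions: the helper name `find_cyg_profile_lib` is hypothetical, and the default directory is the one used in the testpmd command above.

```shell
# Hypothetical helper: print the first liblttng-ust-cyg-profile.so* found in DIR.
find_cyg_profile_lib() {
    local dir="${1:-/usr/lib/x86_64-linux-gnu}"
    local f
    for f in "$dir"/liblttng-ust-cyg-profile.so*; do
        if [ -e "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

lib="$(find_cyg_profile_lib)" || echo "liblttng-ust-cyg-profile not found" >&2
echo "LD_PRELOAD candidate: ${lib:-none}"
```

The printed path can then be substituted into the `LD_PRELOAD=` part of the testpmd command line.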

4. Create Additional RX/TX Queues

In the following step, we will add a new queue while operating in TAP mode. However, before proceeding, it is recommended to retrieve relevant configuration details using the show config fwd command.

Forwarding Mode: io -> This means packets received on a port are simply transmitted out through the corresponding transmit queue of another port.
Ports: 2 -> Two physical or virtual ports are active.
Cores: 1 -> One logical core (Core 5 in this case) is being used to process all forwarding tasks.
Streams: 2 -> Two forwarding streams are configured, each representing a receive/transmit (RX/TX) path between ports.
NUMA support: Enabled -> The application is aware of NUMA (Non-Uniform Memory Access) node placement.
MP Allocation Mode: native -> Memory pools are allocated in native DPDK mode.

To add a new queue, we need to perform the following actions in testpmd:

Stop all ports: port stop all
Create the new RX and TX queues: port config all rxq 2 and port config all txq 2
Start the ports again: port start all
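
To avoid retyping these commands on every run, they can be collected in a file. The sketch below writes the steps above to a command file; the file name is an assumption, and the idea relies on testpmd's `--cmdline-file` option, which reads commands from a file at startup.

```shell
# Write the testpmd commands from the steps above to a file (the name is an assumption).
cmd_file="${TMPDIR:-/tmp}/tap_queues.cmds"
cat > "$cmd_file" <<'EOF'
port stop all
port config all rxq 2
port config all txq 2
port start all
show config fwd
EOF
echo "wrote $(wc -l < "$cmd_file" | tr -d ' ') commands to $cmd_file"
```

The file would then be passed after the `--` separator of the testpmd command line, for example as `--cmdline-file=/tmp/tap_queues.cmds` next to `-i`.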

After the queues have been created, the output of show config fwd should appear as follows:

5. Create Flow Filtering Rule in testpmd

To direct specific types of traffic to designated queues, you can create a flow filtering rule in testpmd using the following command.

flow create 0 ingress pattern eth / ipv4 / udp / end actions queue index 0 / end

If the operation completes successfully, you should observe an output similar to the following:

This command creates a rule on port 0 that:

Matches any incoming UDP over IPv4 over Ethernet traffic.
Routes the matched packets to RX queue 0.
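
The rule follows a fixed template in which only the port, the L4 protocol, and the queue index vary. As an illustration, the hypothetical helper `queue_rule` below prints such rules so they can be pasted into the testpmd prompt; the TCP variant is an assumed example that is not part of the original experiment and needs the second queue created in step 4.

```shell
# Hypothetical helper: print a testpmd rule that steers PROTO-over-IPv4 traffic
# arriving on PORT to RX queue QUEUE.
queue_rule() {
    local port="$1" proto="$2" queue="$3"
    printf 'flow create %s ingress pattern eth / ipv4 / %s / end actions queue index %s / end\n' \
        "$port" "$proto" "$queue"
}

queue_rule 0 udp 0   # the rule used in this step
queue_rule 0 tcp 1   # assumed variation: TCP to the second queue
```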

6. Install and Run tcpreplay

Next, obtain tcpreplay to send traffic to one of the TAP interfaces created by testpmd. Download the tcpreplay-4.5.1.tar.gz archive from the tcpreplay repository and extract it. Then open a new terminal session, change into the extracted directory, and build and install it with the following commands:

./configure --disable-tuntap
make
sudo make install

Note:

You may need to install some build tools, such as make or automake, to compile tcpreplay successfully. On Ubuntu, you can typically install them with:

sudo apt update
sudo apt install build-essential automake

After completing the installation, replay the previously generated pcap file onto the tap0 interface using the command shown below.

tcpreplay -i tap0 --loop=1000 ./Capture.pcap

Next, in the testpmd terminal window, enter the command start, followed by show port stats all to observe the packet flow on TAP interface 0.

7. Setting Up an LTTng Trace Session

To automate the LTTng capture, create a shell script that configures the LTTng session. The script creates the session, enables a user-space channel and its events, adds the necessary context fields, starts tracing, sleeps for a specified duration, and then stops and destroys the session.

touch script.sh
chmod +x script.sh
nano script.sh

Paste the following commands into the file:

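#!/bin/bash
# Create a tracing session (named "libpcap" in this project).
lttng create libpcap
# User-space channel with 4 sub-buffers of 40 MB each to limit event loss.
lttng enable-channel --userspace --num-subbuf=4 --subbuf-size=40M channel0
# Alternative: channel with default buffer sizes.
#lttng enable-channel --userspace channel0
# Record all user-space events, including function entry and exit.
lttng enable-event --channel channel0 --userspace --all
# Attach process ID, thread ID, and process name to every event.
lttng add-context --channel channel0 --userspace --type=vpid --type=vtid --type=procname
lttng start
# Capture for 2 seconds.
sleep 2
lttng stop
lttng destroy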

1 Tracing Analysis - UDP Filtering

Executive summary

The 2-second LTTng capture (~11 million events) indicates that most of the additional CPU time observed after installing the software flow rule eth / ipv4 / udp → queue 0 is consumed by the software packet-type (PTYPE) parser on the receive hot-path of the TAP PMD.
Because the rule requires classification of every frame’s L3/L4 headers to decide whether it is UDP, the parser executes entirely in software on the same logical core (Core 5) that hosts the dpdk-worker thread.

▸ The flame-graph slice covering 42 – 54 µs on the timeline displays a wide orange frame (pkt_burst_io_forward) with a dense stack of six short helper functions repeated for each packet, clearly indicating execution inside the burst Rx/Tx loop.

1. What the trace indicates

| Metric | Dominant functions | Evidence within TraceCompass |
| --- | --- | --- |
| Call count | `pmd_rx_burst` (768 k) → `rte_net_get_ptype` (1 881) → helper chain (`rte_pktmbuf_read`, `rte_constant_bswap16`, `rte_ipv4_hdr_len`, `ptype_l3_ip`, `ptype_l4`, …) | Descriptive-Statistics and Function-Duration-Statistics views |
| Self-time | `pmd_rx_burst` ≈ 156 ms of 203 ms | Tooltip in the Flame-Graph |
| Event-density peaks | Peaks dominated by the above helpers (highlighted in red in the Call-graph Analysis table) | Event-Density view in combination with the Statistics table |
| Weighted call tree | Six PTYPE helpers account for ≈ 60 % of weighted wall-time | Weighted Tree Viewer |

▸ Red rows in the Statistics table correspond to functions highlighted after drawing a region in the Event-Density view; the same helper symbols dominate both, corroborating the conclusion that classification logic is responsible for burst-time spikes.

2. Location of the hot-path code in DPDK 25.03

| File | Function | Role in the hot path |
| --- | --- | --- |
| `lib/net/rte_net.c` | `uint32_t rte_net_get_ptype(...)` | Linear walk over L2/L3/L4 headers (≈ lines 560 – 850). |
| `lib/net/rte_ip.h` | `rte_ipv4_hdr_len(...)` | Extracts IPv4 IHL on each IPv4 packet. |
| `lib/eal/include/generic/rte_byteorder.h` | `rte_constant_bswap16(...)` | Endian swap used in every UDP/TCP port check. |
| `lib/mbuf/rte_mbuf.c` | `rte_pktmbuf_read(...)` | Safely copies header bytes into cache. |
| `drivers/net/tap/rte_eth_tap.c` | `tap_trigger_cb()` → `tap_parse_packet()` → `rte_net_get_ptype()` | TAP Rx path; the callback is installed when the rule is accepted. |
| `drivers/net/tap/tap_flow.c` | `tap_flow_validate`, `tap_flow_create` | Determines that L4 protocol match is required and enables the parser. |

▸ For the object address 0x61abdac09a90, the Symbol-Viewer (part of the Call-graph Analysis perspective) in TraceCompass resolved the symbol name to rte_net_get_ptype in lib/net/rte_net.c, which confirms the mapping in the table above.

3. Reasons for the high cost of the rule

TAP is a purely software device; no hardware RSS or flow-director off-load is available.
rte_flow therefore falls back to software steering; tap_trigger_cb() is invoked for every received frame.
rte_net_get_ptype() is branch-heavy and cache-sensitive; its helper chain executes on each packet, producing the latency spread visible in the trace.

▸ The Standard Deviation column for rte_net_get_ptype and its helpers exhibits wide spreads, a hallmark of branch mis-prediction and cache-miss behaviour in a header parser.

4. Bottleneck ranking

| Rank | Dominant cost | Trace symptom | Source location |
| --- | --- | --- | --- |
| 1 | `rte_net_get_ptype()` | 15 ms total / 7.8 ms self for 1 881 calls | `lib/net/rte_net.c:≈560-850` |
| 2 | `rte_pktmbuf_read()` + `rte_pktmbuf_headroom` | 2.36 ms self | `lib/mbuf/rte_mbuf.c:≈500-580` |
| 3 | `rte_constant_bswap16` | 2.45 ms self | `lib/eal/include/generic/rte_byteorder.h` |
| 4 | `rte_ipv4_hdr_len` | 0.76 ms self | `lib/net/rte_ip.h` |
| 5 | Mempool get/put bursts | Visible spikes when mbuf cache empties | `lib/mempool/rte_mempool_generic.c` |
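
The figures for rank 1 can be cross-checked with simple arithmetic: total time divided by call count gives the mean duration per call, and self time divided by total time gives the share spent in the function body itself. The helper names `mean_us` and `percent` below are hypothetical; the input numbers are taken from the table above.

```shell
# mean_us TOTAL_MS CALLS -> mean duration per call in microseconds
mean_us() {
    awk -v total_ms="$1" -v calls="$2" 'BEGIN { printf "%.2f\n", total_ms * 1000 / calls }'
}

# percent PART WHOLE -> PART as a percentage of WHOLE
percent() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", part * 100 / whole }'
}

mean_us 15 1881    # rte_net_get_ptype: total time / number of calls
percent 7.8 15     # rte_net_get_ptype: self time as a share of total time
```

This prints 7.97 and 52.0: a mean of roughly 8 µs per call, which agrees with the Flame Graph tooltip value reported in the detailed observations, and about half of the total time spent in the function itself rather than in its helpers.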

5. Practical mitigation options

| Mitigation | Applicable context | Implementation |
| --- | --- | --- |
| Hardware off-load (`rte_flow` in NIC) | Physical NIC available | Intel E810, Mellanox CX-6, etc. |
| Disable software PTYPE parsing | Remain on TAP | `port config 0 ptype_parse off`; filter in forwarding loop. |
| Vectorised TAP Rx path | Willing to patch | Use AVX2/SSE sample in `drivers/net/ixgbe`. |
| Remove `-finstrument-functions` | Performance runs | Replace with manual UST probes or `perf record`. |
| Pre-slice traffic in generator | Synthetic workload | Send UDP frames to a dedicated TAP. |

▸ The capture employed --subbuf-size=40M, yet some events were lost during the busiest 30 µs sections, indicating that instrumentation overhead itself is non-negligible.

6. Hotspot file references

lib/net/rte_net.c                       : rte_net_get_ptype()        (≈ 560-850)
lib/net/rte_ip.h                        : rte_ipv4_hdr_len()
lib/eal/include/generic/rte_byteorder.h : rte_constant_bswap16()
lib/mbuf/rte_mbuf.c                     : rte_pktmbuf_read()
drivers/net/tap/rte_eth_tap.c           : tap_trigger_cb()
drivers/net/tap/tap_flow.c              : software flow helpers

Detailed observations for each TraceCompass component

| TraceCompass view | Observation | Analytical implication |
| --- | --- | --- |
| Statistics → Event Types | Entry/exit events are exactly 50 / 50 %. | Confirms high volume caused by `-finstrument-functions`; helper timing is inflated. |
| Event Density | Saw-tooth pattern of ~44–50 events per burst; valleys align with mbuf allocations. | Indicates classification and occasional mempool refills dominate burst latency. |
| Flame Graph | Wide `pkt_burst_io_forward`; underneath, repeating stack of six helpers. | Helper set is invoked once per frame, characteristic of software parsing. |
| Flame Graph tooltip | `rte_net_get_ptype` mean ≈ 8 µs; max ≈ 54 µs. | Long tail reflects branch/cold-cache penalties. |
| Function Duration Statistics | Helpers: `rte_pktmbuf_read` 422 ns, `rte_constant_bswap16` 391 ns, etc. | Figures match the ∼400 ns overhead introduced by the rule. |
| Weighted Tree Viewer | Six helpers form ≈ 60 % of weighted time. | Parsing confirmed as the principal hotspot. |
| Call-graph Analysis | Selected peaks consist exclusively of helper symbols; no system calls present. | Processing overhead lies entirely within userspace DPDK code. |

Adding PMU contexts

| PMU counter | Projected change with the flow rule active | Explanation |
| --- | --- | --- |
| cpu-cycles | Increase by ~5 – 10 % | Additional helper instructions execute per frame. |
| instructions | Increase by ~3 – 5 % | Header checks add 50 – 100 instructions per packet. |
| cache-misses | Noticeable rise on the Rx core | `rte_pktmbuf_read` touches payload across cache-line boundaries, incurring extra L1 and occasional LLC misses. |
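
One way these counters could be attached to the trace is through LTTng's per-thread perf contexts. The sketch below only assembles and prints the command; it assumes the `perf:thread:` context type for user-space channels and reuses the channel name `channel0` from the session script in step 7. The helper name `pmu_context_cmd` is hypothetical.

```shell
# Hypothetical helper: build an "lttng add-context" command that attaches the
# given perf counters (per thread) to the user-space channel channel0.
pmu_context_cmd() {
    local cmd="lttng add-context --userspace --channel channel0"
    local counter
    for counter in "$@"; do
        cmd="$cmd --type=perf:thread:$counter"
    done
    echo "$cmd"
}

# Counters from the table above.
pmu_context_cmd cpu-cycles instructions cache-misses
```

The printed command would go into the session script after the channel is enabled and before `lttng start`, so that every recorded event carries the counter values.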

Predicted secondary effects

IPC (instructions / cycle) remains ≈ 1 because the helpers are short and branch-dominated.
LLC-miss percentage rises modestly; most misses are served after the first line fill.
The perf sample set should validate the time-domain analysis by showing higher retired instructions and cache-miss stalls exclusively on the userspace threads.

Test Scenario: Two Queues, Dropping TCP, UDP, or Both Packet Types

In this test, we evaluate the impact of selectively dropping certain types of packets. Initially, only UDP packets were dropped, followed by the dropping of both TCP and UDP packets.
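
The exact rules used for this scenario are not listed here. As an illustration only, rules of the following shape would express the two cases, assuming the same pattern as in step 5 with the `drop` action in place of `queue`; the helper name `drop_rule` is hypothetical.

```shell
# Hypothetical helper: print a testpmd rule that drops PROTO-over-IPv4 traffic on PORT.
drop_rule() {
    local port="$1" proto="$2"
    printf 'flow create %s ingress pattern eth / ipv4 / %s / end actions drop / end\n' \
        "$port" "$proto"
}

drop_rule 0 udp   # first case: drop UDP only
drop_rule 0 tcp   # second case: add this rule to drop TCP as well
```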

We noticed that regardless of the flow rule, the burst_forward function is always called.
This is expected since the driver operates in poll mode — the CPU is always active, and the specified core maintains 100% utilization.

If there is data to transmit, the call stack changes to:

Upon analyzing the function calls, we did not observe any functions explicitly responsible for filtering or dropping packets. This observation appears unusual; therefore, we will conduct a more in-depth investigation.

Note: LTTng captures function calls; however, it does not trace internal control flow structures such as if or else branches.

Let us examine the DPDK source code for further clarity. In tap_flow.c at line 1065, if the flow action is configured to drop, the implementation modifies the packet data in such a way that the kernel subsequently discards it. As a result, no user-space function is explicitly invoked, which explains why LTTng is unable to trace this behavior.

The same principle applies to queue redirection for specific packet types: When a flow rule redirects packets to a particular queue, the necessary configuration is handled at a lower level—typically by modifying kernel-level or driver-level settings. As such, no explicit user-space function is called during the redirection process, and consequently, tools like LTTng are unable to capture this action.

In both cases, the rule operates on the socket buffer (SKB) directly at the kernel level, so no additional user-space function call appears in the trace.

Conclusion

With the current setup, tracing the actual packet filtering logic through function calls is not feasible, as the operations occur at a lower level without explicit invocation of user-space functions. However, it is still possible to gain valuable insights by analyzing low-level performance metrics—such as CPU cycles, cache misses, branch mispredictions, and memory access patterns. These metrics can help us better understand the performance characteristics and efficiency of the filtering mechanism, even in the absence of direct function call traces.
