In recent days, I've been deeply immersed in exploring eBPF, particularly focusing on XDP (eXpress Data Path) and Traffic Control (TC) for some moderately complex projects. If you've been following my eBPF series, you might recall that my last two articles delved into the intricacies of XDP. eBPF is a revolutionary technology, especially in the realm of networking, where it plays a pivotal role.
As my exploration into eBPF has grown, so has my understanding of its applications and the environments needed for effective experimentation. Real-world network scenarios often involve intricate segmentation and isolation – conditions that are essential to replicate for meaningful eBPF experimentation. Recognizing this, I saw an opportunity to create something that would not only aid my learning but could also be a valuable tool for others: a simple program to run TCP/UDP servers in isolated network namespaces. This setup would allow for a more realistic testing ground, crucial for experimenting with various eBPF programs, particularly those that handle network data across different layers.
Building on this need for a practical network environment, I decided to challenge myself with a new project, especially since I am still learning Rust. My goal was to create a simple and user-friendly command-line tool that could launch TCP and UDP echo servers, each in its own network namespace. To achieve this, I used the clone syscall for creating isolated network spaces. I also set up a network bridge and veth pairs to enable communication within these spaces. This setup is particularly great for testing because I can attach different eBPF programs to the virtual interfaces to see how they work. While the project is relatively straightforward, it's been a significant part of my learning journey, allowing me to apply Rust in a practical setting and tackle some real network programming challenges.
Why an Isolated Network?
In the dynamic field of network programming, the ability to launch isolated network spaces is invaluable. This isolation offers a multitude of benefits: it ensures secure and controlled environments for testing and deploying applications, mitigates the risk of system-wide disruptions, and allows for the simulation of complex network configurations. Tools like Docker excel in this area by providing lightweight, standalone containers that replicate production environments without the overhead of full virtual machines. This isolation is especially critical for developing and experimenting with network-related programs, such as those involving eBPF, where precise control over network behavior and interactions is essential.
Why Not Just Use Docker?
Sure, Docker is a great tool for creating isolated networks, but for what I needed, it seemed a bit too much. Docker comes with a lot of features that I wouldn't use for this project. I wanted something simpler and more focused. Plus, I was looking for a good excuse to really dive into Rust programming. Building this tool from scratch gave me the perfect opportunity to learn more about Rust and network programming, while keeping things simple and tailored to my specific needs.
Understanding Process Isolation: The Basics
Process isolation is a key concept in computing where different processes are kept separate from each other, ensuring they don't interfere with or compromise the overall system. Imagine it like having several different workspaces on the same desk, where each task is contained in its own area. Docker, a popular containerization platform, uses process isolation effectively. It creates containers, each acting like a mini-computer within your main computer, running its own applications and using its own isolated portion of the system resources.
Linux namespaces, chroot, and cgroups are foundational elements for achieving isolation in Linux, and they are crucial for Docker's containerization technology. Namespaces in Linux provide a way to isolate and virtualize system resources, allowing processes to run in separate environments as if they were on different machines. For instance, network namespaces isolate network interfaces, ensuring that processes in different namespaces don't interfere with each other's network communications. Chroot, short for 'change root', is a way of isolating process filesystems. It changes the apparent root directory for a process, effectively sandboxing its access to the file system. Lastly, cgroups, or control groups, manage the allocation of resources such as CPU time, system memory, network bandwidth, or combinations of these resources among user-defined groups of tasks. Together, these technologies form the backbone of Linux containerization, providing robust isolation and resource control.
In a highly simplified explanation, when you create a container using a platform like Docker, which internally utilizes containerd and runc, what actually happens is that a new process gets initiated. This process is then moved into its own set of isolated namespaces. These namespaces include network (for isolating network interfaces), PID (for process ID isolation), UTS (for hostname isolation), among others. Alongside this, Docker uses chroot to change the apparent root directory for the container, effectively sandboxing its filesystem. Additionally, cgroups are employed to manage and limit the container's resource usage, such as CPU and memory. This setup is more complex than it sounds, but it's what allows each container to work like it's in its own little world. This means every container is kept separate from the others and from the host machine it's running on.
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers. https://man7.org/linux/man-pages/man7/namespaces.7.html
If you're interested in delving deeper into this topic, I highly recommend checking out a couple of YouTube videos by Liz Rice. Her explanations are fantastic for gaining a more in-depth understanding of containers and how they work from the ground up. Liz Rice - Containers from scratch
For this project, I don't need to use all the capabilities that namespaces, chroot, and cgroups offer. I'm not trying to build a full-blown containerization system like Docker. My aim is simpler: to run a server in its own network space using just network namespaces. This way, I can quickly launch a server with a single command, and it will have its own isolated network area without the complexities of a complete container setup.
The clone Syscall
Like I mentioned earlier, I'm going to use the clone syscall for this project. But it's worth noting that there's another syscall, unshare, that can also do similar things. clone is great for making new processes that are already separated into different network spaces, while unshare can take an existing process and isolate it from the rest of the system. Both of these are pretty handy tools in Linux when you want to create isolated environments, like what I need for my server.
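For a sense of what unshare looks like in practice, here is a minimal sketch (not part of this project) using the nix crate. It simply detaches the calling process into a fresh network namespace and, like everything else in this post, needs root privileges:

use nix::sched::{unshare, CloneFlags};

fn main() -> nix::Result<()> {
    // Move the *calling* process into a brand new, empty network namespace.
    unshare(CloneFlags::CLONE_NEWNET)?;
    // From here on, this process only sees its own loopback interface (down by default).
    Ok(())
}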
clone: By contrast with fork(2), this system call provides more precise control over what pieces of execution context are shared between the calling process and the child process. For example, using this system call, the caller can control whether or not the two processes share the virtual address space, the table of file descriptors, and the table of signal handlers. This system call also allows the new child process to be placed in separate namespaces(7). https://man7.org/linux/man-pages/man2/clone.2.html
The reason I'm opting for clone over unshare is because of the specific network setup I'm planning. After creating a new process with clone, the parent process needs to perform some network configuration. This setup is a bit easier to manage with clone, as it allows the parent process to set up the network right after the new process starts. Essentially, clone fits well with the flow of creating and then immediately configuring isolated network spaces for my servers, simplifying the whole process.
+--------------------------+ +----------------------------------------------+
| Parent Process | | Child Process |
| 1. Executes clone | ------> | 2. Starts in New Net NS |
| with CLONE_NEWNET flag | | Inherits other namespaces (PID, Mount, etc.)|
+--------------------------+ +----------------------------------------------+
                                                    |
                    3. Parent Configures Network -> |
                                                    V
+-------------------------------------------------------------+
| Network Configuration. |
| (e.g., setup veth pair, bridge and more) |
+-------------------------------------------------------------+
Isolated Network NS Needs Communication
Just having a new process in an isolated network namespace isn't enough. I also need a way for this process to talk to the host and for the host to talk back. To do this, I'm going to set up a bridge and a pair of virtual Ethernet (veth) interfaces. This is kind of like what Docker does. The bridge acts like a link between the isolated network and the main network, and the veth pair creates a network tunnel between the isolated process and the rest of the system. It's a simple but effective way to make sure the host can communicate with the isolated servers.
bridge: A bridge is a way to connect two Ethernet segments together in a protocol independent way. Packets are forwarded based on Ethernet address, rather than IP address (like a router). https://wiki.linuxfoundation.org/networking/bridge
veth: Packets transmitted on one device in the pair are immediately received on the other device. When either device is down, the link state of the pair is down. https://man7.org/linux/man-pages/man4/veth.4.html
+----------------------+ +---------------------------------+
| Host System | | Isolated Network Namespace |
| | | (e.g., ns-01) |
| +----------------+ | | |
| | Bridge | | | +--------------------------+ |
| | (br0) | | | | | |
| | | | | | Virtual Ethernet Pair | |
| | +-----------+ | | | | (veth0, veth-peer). | |
| | | veth-peer |---- Network Tunnel -------| | |
| | +-----------+ | | | +--------------------------+ |
| +----------------+ | +---------------------------------+
| |
+----------------------+
Using isolated network namespaces with veth pairs offers significant advantages, particularly in terms of network traffic management and security. Each isolated environment in this setup is connected to the host system through a veth pair, akin to a virtual wire. This configuration allows for precise monitoring and control of the network traffic entering and exiting each isolated network. By attaching eBPF programs to the veth pairs, we can efficiently inspect and manage the network traffic. This includes attaching eBPF programs to the host side of these veth pairs, enabling detailed monitoring and policy enforcement on all traffic flowing through them.
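As a rough illustration of where this leads (not part of isoserver), attaching a TC classifier to the host side of a veth with the aya crate could look something like the sketch below; the object file path and the "tc_monitor" program name are made up for the example:

use aya::programs::{tc, SchedClassifier, TcAttachType};
use aya::{include_bytes_aligned, Bpf};

fn attach_tc_monitor(iface: &str) -> Result<(), anyhow::Error> {
    // Load a pre-compiled eBPF object (hypothetical path and program name).
    let mut bpf = Bpf::load(include_bytes_aligned!(
        "../../target/bpfel-unknown-none/release/tc_monitor"
    ))?;
    // Make sure the clsact qdisc exists on the interface; ignore "already exists" errors.
    let _ = tc::qdisc_add_clsact(iface);
    // Attach the classifier to the ingress hook of the interface.
    let program: &mut SchedClassifier = bpf.program_mut("tc_monitor").unwrap().try_into()?;
    program.load()?;
    program.attach(iface, TcAttachType::Ingress)?;
    Ok(())
}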
Crafting the program in Rust
To keep things simple, the project relies on three main Rust libraries, plus one for the user interface. First, there's nix, which I use for all the Linux syscall API stuff – it's really handy for interacting directly with the operating system. Then I've got rtnetlink for setting up the network, which makes handling network configurations a lot smoother. And for the asynchronous runtime, I'm using tokio, ensuring the program remains efficient and responsive, especially during network operations. Lastly, for creating a user-friendly command-line interface, I'm using clap. It's great for parsing command-line arguments and making the tool easy to use. Together, these libraries form the backbone of this network isolation tool, combining functionality with ease of use.
Let's take a look at the main parts of the code, presented in a straightforward way. I’ll explain each part clearly, focusing on the essentials and leaving out any uninteresting or extra checks. For those eager to dive into all the details, the complete code is waiting in the repo.
Network setup
So, the first step in the program is to set up a bridge. This bridge will let the isolated process talk to the host and other processes. Here's the create_bridge function that does just that:
async fn create_bridge(name: String, bridge_ip: &str, subnet: u8) -> Result<u32, NetworkError> {
    let (connection, handle, _) = new_connection()?;
    tokio::spawn(connection);
    // Create a bridge
    handle.link().add().bridge(name.clone()).execute().await.map_err(...)?;
    let bridge_idx = handle.link().get().match_name(name).execute()
        .try_next().await?
        .ok_or_else(...)?.header.index;
    // add ip address to bridge
    let bridge_addr = std::net::IpAddr::V4(Ipv4Addr::from_str(bridge_ip)?);
    AddressHandle::new(handle.clone())
        .add(bridge_idx, bridge_addr, subnet).execute().await
        .map_err(...)?;
    // set bridge up
    handle.link().set(bridge_idx).up().execute().await.map_err(...)?;
    Ok(bridge_idx)
}
This function is doing what you'd typically do with network setup commands in Linux. For example, creating a bridge, assigning it an IP address, and bringing it up, which you might normally do with commands like:
$ ip link add name br0 type bridge
$ ip addr add 172.18.0.1/16 dev br0
$ ip link set br0 up
When we run these commands, or the function with similar settings, what we get is a network bridge on the host system, labeled br0, with the IP address 172.18.0.1/16 assigned to it. This IP serves as the network identity for the bridge within the host system. Think of it like the main door to a building.
+---------------------------+
| Host System |
| |
| +---------------------+ |
| | Network Bridge (br0)| |
| | IP: 172.18.0.1/16. | |
| +---------------------+ |
| |
+---------------------------+
Next up in the project is creating the veth pair. This step is crucial because the veth pair is what connects our isolated network namespace to the host system, using the bridge we set up.
async fn create_veth_pair(bridge_idx: u32) -> Result<(u32, u32), NetworkError> {
    let (connection, handle, _) = new_connection()?;
    tokio::spawn(connection);
    // create veth interfaces
    let veth: String = format!("veth{}", random_suffix());
    let veth_2: String = format!("{}_peer", veth.clone());
    handle.link().add().veth(veth.clone(), veth_2.clone()).execute()
        .await.map_err(...)?;
    // Get veth pair idxs
    let veth_idx = handle.link().get().match_name(veth.clone())
        .execute().try_next().await?.ok_or_else(...)?.header.index;
    let veth_2_idx = handle.link().get().match_name(veth_2.clone())
        .execute().try_next().await?.ok_or_else(...)?.header.index;
    // set master veth up
    handle.link().set(veth_idx).up().execute().await.map_err(...)?;
    // set master veth to bridge
    handle.link().set(veth_idx).controller(bridge_idx).execute()
        .await.map_err(...)?;
    Ok((veth_idx, veth_2_idx))
}
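One detail glossed over above: random_suffix is just a small helper (not shown in the snippet) that keeps interface names unique, since Linux limits them to 15 characters. A possible version, purely illustrative and likely different from the one in the repo:

fn random_suffix() -> String {
    // Derive a short pseudo-random suffix from the current time, e.g. "48213".
    let nanos = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .expect("clock before UNIX epoch")
        .subsec_nanos();
    format!("{:05}", nanos % 100_000)
}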
In this function, we create a pair of virtual Ethernet (veth) interfaces. One part of this pair will be connected to our isolated network namespace later. The other part stays in the host system and gets connected to our bridge, br0. This is how we create a communication path between the isolated environment and the host network using the bridge.
You could set up something similar manually with Linux ip commands. Here’s how it goes:
$ ip link add veth0 type veth peer name veth0_peer # create the veth pair
$ ip link set veth0 up # activate the veth interface
$ ip link set veth0 master br0 # connect one end of the veth to the bridge
So, what we've achieved with this is something like this setup:
+-----------------------+
| Host System |
| |
| +-----------------+ |
| | Bridge (br0) | |
| | 172.18.0.1/16 | |
| | +---------+ | |
| | | veth | | |
| | +----|----+ | |
| +--------|-------+ |
| | |
| +--------|--------+ |
| | veth-peer | |
| +-----------------+ |
| |
+-----------------------+
After we've connected veth to the bridge in the host system, we need to move veth-peer to a specific isolated network namespace. To do this, we require two key pieces of information: the index of veth-peer (referred to as veth_idx in the code) and the process ID (PID) of the process that owns the namespace we want to use. Here’s the function that handles this:
pub async fn join_veth_to_ns(veth_idx: u32, pid: u32) -> Result<(), NetworkError> {
    let (connection, handle, _) = new_connection()?;
    tokio::spawn(connection);
    // set veth to the process network namespace
    handle.link().set(veth_idx).setns_by_pid(pid).execute().await.map_err(...)?;
    Ok(())
}
We’re assigning veth-peer to the network namespace of a process by its PID (manually, this would be the equivalent of ip link set <veth-peer> netns <pid>). This is crucial for ensuring that veth-peer is part of the desired isolated environment. By executing this function, veth-peer becomes attached to the network namespace of the process with the given PID, allowing it to communicate within that isolated space, while veth remains connected to the host's bridge.
+----------------------+ +--------------------------------+
| Host System | | Isolated Network Namespace |
| | | (e.g., newns) |
| +----------------+ | | |
| | Bridge | | | |
| | (br0) | | | |
| | 172.18.0.1/16 | | | |+-------------------------+ |
| | +-----------+ | | | | | |
| | | veth |---------------------------| veth-peer | |
| | +-----------+ | | | +--------------------------+ |
| +----------------+ | +--------------------------------+
| |
+----------------------+
The final step in setting up our network is configuring veth-peer within the new network namespace. We need to give it an IP address and get it ready to use. It's important to make sure that this IP address is in the same subnet as the bridge's IP, so they can talk to each other properly.
pub async fn setup_veth_peer(
    veth_idx: u32,
    ns_ip: &String,
    subnet: u8,
) -> Result<(), NetworkError> {
    let (connection, handle, _) = new_connection()?;
    tokio::spawn(connection);
    info!("setup veth peer with ip: {}/{}", ns_ip, subnet);
    // set veth peer address
    let veth_2_addr = std::net::IpAddr::V4(Ipv4Addr::from_str(ns_ip)?);
    AddressHandle::new(handle.clone()).add(veth_idx, veth_2_addr, subnet)
        .execute().await.map_err(...)?;
    handle.link().set(veth_idx).up().execute().await.map_err(...)?;
    // set lo interface to up
    let lo_idx = handle.link().get().match_name("lo".to_string()).execute().try_next()
        .await?.ok_or_else(...)?.header.index;
    handle.link().set(lo_idx).up().execute().await.map_err(...)?;
    Ok(())
}
This function is doing something similar to what we've done before, but this time it's inside the new network namespace. Basically, we're giving veth-peer an IP address that matches the subnet of our bridge. This lets them communicate with each other. After assigning the IP, we activate veth-peer by bringing it online. This step is key to making sure that everything in our isolated network environment is connected and ready to go. If you were doing this manually, you'd use ip commands like these:
# Assign IP in the namespace
$ ip netns exec newns ip addr add 172.18.0.2/16 dev veth-peer
# Set veth-peer up in the namespace
$ ip netns exec newns ip link set veth-peer up
So, that wraps up our network setup. Now we should have everything in place.
+----------------------+ +--------------------------------+
| Host System | | Isolated Network Namespace |
| | | (e.g., newns) |
| +----------------+ | | |
| | Bridge | | | |
| | (br0) | | | |
| | 172.18.0.1/16 | | | |+-------------------------+ |
| | +-----------+ | | | | IP: 172.18.0.2 | |
| | | veth |---------------------------| veth-peer | |
| | +-----------+ | | | +--------------------------+ |
| +----------------+ | +--------------------------------+
| |
+----------------------+
At this point, if we take a look at our system's routing setup using the ip route command, we'll see an entry for our bridge. This entry is crucial. It tells our system how to handle traffic to and from the 172.18.0.0/16 network. Essentially, whenever our system needs to send a packet to an address within this range, it knows to use the isobr0 interface, all thanks to this route in the routing table.
$ ip route
...
172.18.0.0/16 dev isobr0 proto kernel scope link src 172.18.0.1
...
The Main Program
Before we dive into the main function where we'll bring all these pieces together, let's take a closer look at the invocation of the clone function provided by the nix crate inside main. Understanding this is key to how we set up our isolated environments.
// prepare child process
let cb = Box::new(|| c_process(&args, veth2_idx));
let mut tmp_stack: [u8; STACK_SIZE] = [0; STACK_SIZE];
let child_pid = unsafe {
    clone(
        cb,
        &mut tmp_stack,
        CloneFlags::CLONE_NEWNET,
        Some(Signal::SIGCHLD as i32),
    )
}
In this part of the code, we're setting up a new child process. We use the clone system call with a specific flag, CLONE_NEWNET, to ensure this child process has its own separate network environment. We also allocate a memory stack for this process and define what it should do using a closure, cb. The clone call returns the child process's ID, which we store in child_pid. This setup is crucial for our project because the parent needs that PID for the network configuration that follows.
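One thing the snippet doesn't show is where STACK_SIZE comes from: it's simply a constant defining how much stack the cloned child gets. The exact value in the repo may differ; something on the order of a megabyte is a typical choice:

// Hypothetical value; the repository may use a different size.
const STACK_SIZE: usize = 1024 * 1024; // 1 MiB stack for the cloned child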
The complete code for the main function is outlined below. A crucial element within it is the c_process function. This function is central to our setup — it's what runs as the child process in the newly created network namespace. What c_process essentially does is: first, it calls setup_veth_peer, which configures the network interface (veth-peer) inside this new namespace. This step is vital for establishing network communication within the isolated environment. Second, c_process executes the execute function. This is where the server's core functionality lies — based on our initial choice, execute launches either a TCP or UDP echo server.
fn main() {
    env_logger::init();
    let args = Args::parse();
    let rt = tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime");
    let (_, _, veth2_idx) = rt
        .block_on(prepare_net(
            args.bridge_name.clone(),
            &args.bridge_ip,
            args.subnet,
        ))
        .expect("Failed to prepare network");

    // prepare child process
    let cb = Box::new(|| c_process(&args, veth2_idx));
    let mut tmp_stack: [u8; STACK_SIZE] = [0; STACK_SIZE];
    let child_pid = unsafe {
        clone(
            cb,
            &mut tmp_stack,
            CloneFlags::CLONE_NEWNET,
            Some(Signal::SIGCHLD as i32),
        )
    }
    .expect("Clone failed");
    info!("Parent pid: {}", nix::unistd::getpid());

    rt.block_on(async {
        join_veth_to_ns(veth2_idx, child_pid.as_raw() as u32)
            .await
            .expect("Failed to join veth to namespace");
    });

    thread::sleep(time::Duration::from_millis(500));

    // wait for the child process to finish
    match waitpid(child_pid, None) {
        Ok(status) => info!("Child exited: {:?}", status),
        Err(e) => info!("waitpid failed: {}", e),
    }
}

fn c_process(args: &Args, veth_peer_idx: u32) -> isize {
    info!("Child process (PID: {}) started", nix::unistd::getpid());
    // create a runtime inside the child and block on the async setup
    let rt = tokio::runtime::Runtime::new().expect("Failed to create Tokio runtime");
    let process = rt.block_on(async {
        setup_veth_peer(veth_peer_idx, &args.ns_ip, args.subnet).await?;
        execute(args.handler.clone(), args.server_addr.clone()).await
    });
    info!("Child process finished");
    0
}
No cleanup process?
In the code, you might wonder about the cleanup process for the network resources. Here’s how it works: The kernel plays a crucial role in resource management, especially with network namespaces. When the child process, which runs in its own network namespace, finishes its task and terminates, the associated network namespace is also destroyed. This is a key point — the destruction of the network namespace triggers the kernel to automatically clean up any network interfaces within it, including our veth pairs. So, when the part of the veth pair inside the namespace is deleted by the kernel, the corresponding part in the bridge becomes inactive and is typically removed as well. This automatic cleanup by the kernel ensures that our system remains efficient and free from unused network resources once the child process completes its job.
All the stuff we've talked about — setting up the network, creating isolated spaces, and all that code — is part of a project I've called isoserver. I chose a simple name because, honestly, naming things isn't my strong suit! It's a no-nonsense program that shows these ideas in action. If you're curious to see the code or maybe want to help out, you can find it all in the isoserver repository.
Running the Server
Now, let's look at how to run the app, similar to what's in the README of the repo. We'll go through the command-line arguments (CLI args) and see how to launch the server. The good news is, most of these arguments have default values, so you might not need to specify them all, depending on your setup.
To run the server, use the following command. Remember, you can skip some arguments if the default values fit your needs:
sudo RUST_LOG=info ./isoserver --server-addr [server address] --handler [handler] \
--bridge-name [bridge name] --bridge-ip [bridge IP] --subnet [subnet mask] \
--ns-ip [namespace IP]
Values
- --server-addr: No default value, must be specified (e.g., "0.0.0.0:8080").
- --handler: Default is "tcp-echo". Options are "tcp-echo" or "udp-echo".
- --bridge-name: Default is "isobr0".
- --bridge-ip: Default is "172.18.0.1".
- --subnet: Default is "16".
- --ns-ip: No default value, must be specified (e.g., "172.18.0.2").
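For reference, these options map onto a clap derive struct along these lines (a hedged sketch built from the defaults above; the real Args struct is in the repo and may differ in details):

use clap::Parser;

#[derive(Parser, Debug)]
struct Args {
    /// Address the echo server binds to, e.g. "0.0.0.0:8080"
    #[arg(long)]
    server_addr: String,
    /// Which server to run: "tcp-echo" or "udp-echo"
    #[arg(long, default_value = "tcp-echo")]
    handler: String,
    /// Name of the bridge created on the host
    #[arg(long, default_value = "isobr0")]
    bridge_name: String,
    /// IP address assigned to the bridge
    #[arg(long, default_value = "172.18.0.1")]
    bridge_ip: String,
    /// Subnet mask length shared by the bridge and the namespace
    #[arg(long, default_value_t = 16)]
    subnet: u8,
    /// IP address for the veth peer inside the namespace, e.g. "172.18.0.2"
    #[arg(long)]
    ns_ip: String,
}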
You can easily test the TCP echo server. First, run it with the default network configuration in one terminal. Then, open another terminal and use telnet 172.18.0.2 8080. This will let you see the program in action.
sudo RUST_LOG=info ./isoserver --server-addr 0.0.0.0:8080 --ns-ip 172.18.0.2
[2023-12-21T21:02:54Z INFO isoserver::net] Interact with bridge isobr0 at cidr 172.18.0.1/16
[2023-12-21T21:02:54Z INFO isoserver::net] bridge isobr0 already exist
[2023-12-21T21:02:54Z INFO isoserver] Parent pid: 30396
[2023-12-21T21:02:54Z INFO isoserver] Child process (PID: 30413) started
[2023-12-21T21:02:54Z INFO isoserver::net] setup veth peer with ip: 172.18.0.2/16
[2023-12-21T21:02:54Z INFO isoserver::handlers::tcp] TCP echo server listening on: 0.0.0.0:8080
[2023-12-21T21:02:54Z INFO isoserver::handlers::tcp] waiting for new client connection
[2023-12-21T21:02:57Z INFO isoserver::handlers::tcp] new client connection
[2023-12-21T21:02:58Z INFO isoserver::handlers::tcp] Read 4 bytes from the socket
[2023-12-21T21:02:58Z INFO isoserver::handlers::tcp] Wrote 4 bytes to the socket
[2023-12-21T21:03:03Z INFO isoserver::handlers::tcp] Read 6 bytes from the socket
[2023-12-21T21:03:03Z INFO isoserver::handlers::tcp] Wrote 6 bytes to the socket
...
[2023-12-21T21:03:10Z INFO isoserver::handlers::tcp] Client disconnected
I opt for 0.0.0.0 to listen on all interfaces within the new network namespace. This choice is strategic because it allows the server to accept connections on any network interface that's available in its isolated environment, including the veth pair connected to the bridge. If we were to use 127.0.0.1, the server would only listen for connections originating from within the same network namespace, essentially limiting its reach to local-only interactions. By choosing 0.0.0.0, we eliminate the need for additional configurations that would be required to make the server accessible beyond the local scope of 127.0.0.1, like setting up specific routing or port forwarding rules.
So there you have it: I've created a simple method to launch isolated servers, each with its own veth. It's set up so I can attach eBPF programs for interaction and monitoring. This might not be the most complex program out there, but for me, it was both fun and incredibly useful to build.
To Conclude
And that wraps up our journey through the isoserver project. We've covered everything from setting up isolated network namespaces to configuring veth pairs, all through straightforward Rust code. Remember, if you're curious about the details or want to experiment with the code yourself, the entire project is available in the repo.
Thank you for reading along. This blog is a part of my learning journey and your feedback is highly valued. There's more to explore and share, so stay tuned for upcoming posts. Your insights and experiences are welcome as we learn and grow together in this domain. Happy coding!