
Networking problems rarely announce themselves clearly. A deployment fails, a pod cannot reach its database, or a service responds intermittently. The logs look clean, yet something feels wrong. Most engineers eventually learn one painful truth: when everything else seems fine, it is usually the network.
From misrouted traffic to invisible firewalls, let me walk you through the most frequent networking issues that DevOps engineers encounter in Linux environments. I also explain how to investigate, diagnose, and fix each class of problem using real commands and reasoning.
All of this comes from experience gained over years of working with Linux systems. The same experience also went into this Linux networking microcourse, which you should definitely check out.

It’s Almost Always the Network
When an application behaves unpredictably, the first instinct is to look at the code. Developers dig through logs, restart containers, or roll back deployments. In many cases, though, the application is not the culprit; the network is.
Early in my career, I used to dread these moments. Application logs would show nothing but retries and timeouts, the developers would swear nothing changed, and the operations team would swear they touched nothing. Yet packets were vanishing into the void. That is how I began to take networking seriously, not because I wanted to, but because I had to.
A good troubleshooting approach begins by proving that connectivity works at every layer. Start simple:
ping -c 4 8.8.8.8
ping -c 4 example.com
If the first command succeeds but the second fails, DNS is the culprit. If both fail, it is a routing or firewall issue. This baseline test should always come before looking into application-level logs.
Then, verify whether the local host can reach its gateway and whether packets are returning:
ip route show
traceroute 8.8.8.8
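If traceroute stalls at the very first hop, ping the default gateway shown by ip route directly and inspect the neighbor table; the gateway address below is only a placeholder for whatever your routing table reports:
ping -c 4 192.168.1.1
ip neigh show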
The Routing Rabbit Hole
Routing problems are deceptively subtle as traffic flows one way but not the other, or only some destinations are reachable. The root cause often hides in Linux’s routing tables or in policies added by container frameworks.
Start by displaying the active routes:
ip route
This shows the kernel’s routing decisions. For more detailed analysis, especially in multi-interface or container setups, check which route a particular destination would take:
ip route get 1.1.1.1
If a host has multiple network interfaces or is part of a VPN or overlay, verify that the correct table is being used. Linux supports multiple routing tables, and policy routing determines which one applies. Check the rules:
ip rule show
Misconfigured rules can cause asymmetric routing, where packets leave through one interface but return on another. Firewalls often drop these replies because they appear invalid. One reliable fix is to assign separate routing tables for each interface and use ip rule add with from or fwmark selectors to control the path.
For example, to route traffic from 192.168.10.0/24 through a specific gateway:
ip route add default via 192.168.10.1 dev eth1 table 10
ip rule add from 192.168.10.0/24 table 10
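You can then confirm that the kernel actually selects table 10 for traffic from that range; 192.168.10.5 here stands in for an address assigned to eth1 on this host:
ip route get 1.1.1.1 from 192.168.10.5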
Always check for reverse path filtering:
sysctl net.ipv4.conf.all.rp_filter
sysctl net.ipv4.conf.eth1.rp_filter
Set it to 2 (loose mode) on multi-homed hosts to prevent dropped packets due to asymmetric routes.
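To switch to loose mode, something like the following works; the eth1 name comes from the example above, so adjust it to your interfaces and persist the values under /etc/sysctl.d/ if they solve the problem:
sysctl -w net.ipv4.conf.all.rp_filter=2
sysctl -w net.ipv4.conf.eth1.rp_filter=2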
Routing issues rarely announce themselves clearly. The key is to map how packets should travel, then prove it with ip route get, traceroute, and tcpdump.
DNS: The Eternal Suspect

No other component gets blamed as frequently or incorrectly as DNS. Even the recent AWS outage that took down half of the internet was reportedly caused by DNS.
When an application cannot reach its dependency, the first guess is always “maybe DNS is broken.” Sometimes it is, but often the problem is caching, misconfiguration, or unexpected resolution order.
Start by checking the configured resolvers:
cat /etc/resolv.conf
Most distros these days use systemd-resolved, so the file may point to a stub resolver at 127.0.0.53. To see the active DNS servers:
resolvectl status
If resolution is inconsistent between services, the problem may be namespace isolation. Containers often have their own /etc/resolv.conf, copied at startup. If the host’s DNS changes later, containers keep using outdated resolvers.
Test resolution directly:
dig example.com
dig @8.8.8.8 example.com
Compare responses from the default resolver and a public one. If only the latter works, the issue lies in internal DNS or local caching.
A subtle but common failure arises from nsswitch.conf. The order of resolution methods (files dns myhostname) determines whether /etc/hosts entries or mDNS override DNS queries. In container-heavy environments, this can lead to confusing name collisions.
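A quick way to see the active order is to read the hosts line directly:
grep '^hosts' /etc/nsswitch.conf
A typical line such as hosts: files mdns4_minimal [NOTFOUND=return] dns means /etc/hosts and mDNS are consulted before the dns module, so a stale hosts entry or an unexpected .local lookup can short-circuit resolution entirely.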
💡
DNS problems are not always network failures, but they produce identical symptoms. That is why verifying DNS resolution early saves hours of debugging.
Even when DNS works, it can still mislead you. I remember spending an hour debugging a connection issue that turned out to be caused by an unexpected IPv6 AAAA record. The application preferred IPv6 but the route to that subnet was broken. The fix was as simple as setting precedence ::ffff:0:0/96 100 in /etc/gai.conf.
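If you suspect the same address-family preference issue, check which addresses the resolver returns and in what order, then add the precedence line mentioned above; example.com is just a stand-in for the real hostname:
getent ahosts example.com
echo 'precedence ::ffff:0:0/96 100' >> /etc/gai.conf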
MTU and Fragmentation Headaches
The Maximum Transmission Unit, or MTU, defines how large a packet can be before it needs fragmentation. When this number mismatches between interfaces, tunnels, or virtual networks, packets vanish without a trace. You get intermittent timeouts, partial uploads, and mysterious hangs in SSH sessions.
To check the MTU on an interface:
ip link show eth0
To test path MTU discovery, use ping with increasing packet sizes:
ping -s 1472 8.8.8.8
Regular ICMP echoes may succeed even when TCP traffic fails. To detect MTU issues, you need to force the “do not fragment” flag:
ping -M do -s 1472 8.8.8.8
If it fails, lower the size until it succeeds. The MTU equals payload plus 28 bytes (ICMP and IP headers).
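A rough sketch of narrowing this down; the destination and the size list are only illustrative:
for size in 1472 1452 1422 1392; do
    ping -M do -c 1 -s "$size" 8.8.8.8 >/dev/null 2>&1 && echo "payload $size OK, path MTU >= $((size + 28))"
done
If 1422 is the largest payload that passes, the path MTU is 1422 + 28 = 1450, a strong hint that 50 bytes of encapsulation overhead are in play.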
In virtualized or overlay environments (VXLAN, WireGuard, GRE, eBPF), encapsulation overhead reduces the effective MTU. For example, VXLAN adds 50 bytes. Setting MTU to 1450 instead of 1500 avoids fragmentation.
Adjust interface MTU safely:
ip link set dev eth0 mtu 1450
Applications sensitive to latency often experience erratic behavior because of hidden fragmentation. Once MTU mismatches are corrected, those mysterious slowdowns vanish.
In container environments, MTU mismatches become especially painful. Overlay networks such as Flannel or Calico encapsulate packets inside UDP tunnels, reducing available space. If the MTU is not adjusted inside the container, performance plummets. A single missing ip link set dev eth0 mtu 1450 can make a cluster look broken.
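To check what a container actually sees, enter its network namespace; the PID placeholder follows the same convention as the other placeholders in this article:
nsenter -t <container-pid> -n ip link show eth0
nsenter -t <container-pid> -n ip link set dev eth0 mtu 1450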
Overlay Networks and Ghost Packets
Modern clusters rely on overlays to connect containers across hosts. VXLAN, WireGuard, and similar technologies encapsulate traffic into tunnels, creating virtual networks. They are convenient but introduce new failure modes that look invisible to traditional tools.
A common symptom is “ghost packets”: traffic that appears to leave one node but never arrives at another. The tunnel endpoint logs nothing, yet connectivity fails.
The first step is to confirm that the tunnel interfaces exist and are up:
ip link show type vxlan
Check if the remote endpoint is reachable outside the tunnel:
ping <remote_host_ip>
If that fails, the problem is not the overlay but the underlay, the physical or cloud network below it.
Next, verify that encapsulated traffic is not filtered. VXLAN uses UDP port 4789 by default, and WireGuard uses 51820. Ensure that firewalls on both ends allow those ports.
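An nftables sketch that opens both ports on a host; the inet filter table and input chain names are assumptions and need to match your actual ruleset:
nft add rule inet filter input udp dport '{ 4789, 51820 }' accept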
To inspect whether encapsulation is functioning:
tcpdump -i eth0 udp port 4789
If packets appear here but not on the remote host, NAT or routing between the nodes is rewriting source addresses in a way that breaks return traffic.
WireGuard adds its own layer of complexity. Its peers are identified by public keys, not IP addresses, so if the endpoint’s IP changes (for example, in cloud autoscaling), you must update its Endpoint in the configuration:
wg set wg0 peer <public-key> endpoint <new-ip>:51820
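To confirm the peer is exchanging traffic again, the handshake timestamps are the quickest signal:
wg show wg0 latest-handshakes
wg show wg0 endpoints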
Overlay debugging requires seeing both worlds at once, the logical (tunnel) and physical (underlay) networks. Always verify that encapsulated packets can travel freely and that MTU accommodates the overhead. Most ghost packets die because of either firewall drops or fragmentation within the tunnel.
When Firewalls and Conntrack Betray You
Firewalls are supposed to protect systems, but when they fail silently, they create some of the hardest problems to diagnose. Linux’s connection tracking layer (conntrack) manages the state of every connection for NAT and stateful inspection. When its table fills or rules conflict, packets disappear with no visible error.
Start by checking the current number of tracked connections:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
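If conntrack-tools is installed, its statistics counters expose drops directly; insert_failed and drop climbing together is the classic sign of a full table:
conntrack -S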
I have debugged a number of microservice clusters where outbound connections failed intermittently, and the culprit was an overloaded conntrack table. Each NAT-ed connection consumes an entry, and the table silently drops new connections once full. The fix is simply to increase the limit:
sysctl -w net.netfilter.nf_conntrack_max=262144
For persistent tuning, add it to /etc/sysctl.conf.
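A minimal way to persist it, using the same value as above (a drop-in under /etc/sysctl.d/ works just as well):
echo 'net.netfilter.nf_conntrack_max = 262144' >> /etc/sysctl.conf
sysctl -p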
State timeouts can also cause intermittent loss: long-lived connections often expire in conntrack while still active on the application side. Adjust the TCP established timeout:
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600
Firewalls configured with nftables or iptables can complicate debugging when NAT or DNAT rules are applied incorrectly. Always inspect the active NAT table:
nft list table nat
Make sure destination NAT and source NAT are paired correctly, because asymmetric NAT produces connection resets or silence.
In high-throughput environments, offloading some rules to nftables sets with maps improves performance and reduces conntrack pressure. This is one of the areas where modern Linux firewalls significantly outperform legacy setups.
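As a sketch of that approach, a named set lets a single rule match many services (maps follow the same pattern); the inet filter table and input chain are again assumptions:
nft add set inet filter allowed_services '{ type inet_service; flags interval; }'
nft add element inet filter allowed_services '{ 80, 443, 8000-8090 }'
nft add rule inet filter input tcp dport @allowed_services accept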
Conntrack issues are often invisible until you look directly into its state tables. Once you learn to monitor them, many “random” connectivity problems turn out to be predictable and fixable.
Lessons I Wish I Learned Earlier
Every engineer eventually learns that networking failures tend to follow recognizable patterns, and identifying those patterns early can save hours of unnecessary panic.
1. Always check the local host first. Half of network incidents begin with something as simple as a down interface, a missing route, or an outdated /etc/resolv.conf.
2. Validate one layer at a time. Use ping for reachability, dig for DNS, traceroute for routing, tcpdump for packet visibility, and nft list ruleset for firewalls. Never skip steps.
3. Document assumptions. When debugging, write down what you believe should happen before testing. Networking surprises often come from assumptions no one verified.
4. Monitor the invisible. Connection tracking, queue lengths, and interface drops are invisible in standard metrics. Expose them to your monitoring system to catch silent failures early; a few starting points are sketched after this list.
5. Learn how Linux really routes. Most complex issues trace back to misunderstood routing tables, policy rules, or namespaces. Understanding these mechanisms transforms troubleshooting from guessing to knowing.
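A few commands worth scraping or at least checking by hand, covering conntrack usage, interface drops and errors, and queue drops respectively; eth0 is a placeholder:
cat /proc/sys/net/netfilter/nf_conntrack_count
ip -s link show eth0
tc -s qdisc show dev eth0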
Wrapping Up
The more you troubleshoot Linux networking, the more you realize it is not about memorizing commands. It is about building mental models of how packets move, how policies influence paths, and how the kernel’s view of the network differs from yours.
For DevOps engineers managing modern infrastructure, from bare metal to Kubernetes, that understanding becomes essential. Once you have fixed enough DNS loops, routing asymmetries, and conntrack overflows, the next logical step is to study how Linux handles these problems at scale: multiple routing tables, virtual routing instances, nftables performance tuning, encrypted overlays, and traffic shaping.
The Linux Networking at Scale course builds directly on these foundations. It goes deeper into policy routing, nftables, overlays, and QoS, the exact skills that turn network troubleshooting into design. I highly recommend checking it out.
