The 8 Fallacies of Distributed Network Systems: a Comprehensive Guide

Kenn Hussey

March 22, 2024

Imagine a popular e-commerce platform that relies on distributed network systems to handle thousands of transactions per second. In their pursuit of a seamless user experience, developers make certain assumptions about the network infrastructure that powers the platform. Left unexamined, these assumptions can undermine the very applications built on top of that network. As the old saying goes, when you assume, you make an “*ss out of u and me,” and in the complex world of network technology, that mess only gets bigger.

These assumptions, known as the Fallacies of Distributed Network Systems (and more widely as the fallacies of distributed computing, a list attributed to L. Peter Deutsch and colleagues at Sun Microsystems), can lead to critical errors and vulnerabilities if not properly addressed. In this guide, we will walk through each fallacy, explore its effects, and offer practical ways to mitigate its impact.

1. The network is reliable

One of the most common fallacies of distributed network systems is assuming that the network is always reliable. This fallacy can have significant consequences for software applications. Applications written with little handling for network errors may stall or wait indefinitely for a response during a network outage. To address this fallacy, it is crucial to implement fault-tolerant design patterns within applications, API gateways (such as Edge Stack API Gateway), and service meshes.

Fault-tolerant design patterns include techniques such as timeouts, retries, bulkheads, and circuit breakers. By incorporating these patterns, applications can gracefully handle distributed network system errors and recover from failures. Timeouts ensure that applications do not wait indefinitely for a response, retries allow for resending requests in case of failures, bulkheads isolate failures to prevent cascading effects, and circuit breakers provide a mechanism to temporarily stop sending requests to a failing service.
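
As a rough illustration of three of these patterns, here is a minimal Python sketch that uses only the standard library. The names `fetch`, `call_with_retries`, and `CircuitBreaker` are invented for this example and are not part of any gateway or service mesh API. In practice these behaviors are often configured in the gateway or mesh rather than hand-written.

```python
import time
import urllib.request


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Stops calling a failing dependency for `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def call_with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry `func` with exponential backoff; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # Do not hammer a dependency the breaker has isolated.
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def fetch(url, timeout=2.0):
    """Timeout: never wait indefinitely for a response."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

A caller might combine them as `call_with_retries(lambda: breaker.call(fetch, url))`, where `breaker` is a shared `CircuitBreaker` instance. Retries are only safe for idempotent requests, so whether to retry remains a per-endpoint decision.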

2. Latency is zero

Another fallacy that developers often overlook is assuming that network latency is zero. Ignoring latency, along with the packet loss and retransmissions that add to it, can result in inaccurate assumptions being coded into applications. It is essential for developers to familiarize themselves with the "Latency Numbers Every Programmer Should Know" to understand the typical latency values for various network operations.

To overcome this fallacy, developers should implement retries and rate limiting, as appropriate, in API gateways and service-to-service communications. Retries allow applications to handle temporary network delays by resending requests, while rate limiting helps prevent a service from being overwhelmed by excessive requests, which would only add queueing delay on top of network delay. By incorporating these strategies, applications can handle network latency gracefully and keep performance predictable.
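
Rate limiting can be sketched with a token bucket, one common algorithm among several. The `TokenBucket` class below is a hypothetical, single-threaded illustration and does not reflect how any particular gateway implements its limits.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service would typically keep one bucket per client or per route and reject or queue a request when `allow()` returns `False`. A multi-threaded server would also need a lock around `allow()`.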

3. Bandwidth is infinite

To infinity and beyond–or at least as far as your mediocre bandwidth will take you! Assuming that bandwidth is infinite is another fallacy that can lead to bottlenecks and dropped packets. It is crucial for developers to collaborate with the platform team, operations, and Site Reliability Engineering (SRE) to understand the network capabilities and limitations. By working closely with these teams, developers can gain insights into the available bandwidth and design their applications accordingly.

Understanding the network capabilities allows developers to optimize data transfer and avoid potential bottlenecks. Techniques such as data compression, efficient serialization formats, and asynchronous processing can help maximize bandwidth utilization and minimize the risk of dropped packets. By considering the limitations of bandwidth, developers can ensure smooth and efficient communication within distributed systems.
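
To make the compression point concrete, here is a small sketch that pairs compact JSON with gzip from the Python standard library. The `encode` and `decode` helpers and the sample record shape are assumptions made for this example. Binary serialization formats may do better still, but that depends on the data.

```python
import gzip
import json


def encode(records):
    """Serialize to compact JSON, then gzip the bytes before sending."""
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def decode(payload):
    """Reverse of encode(): gunzip, then parse the JSON."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical, repetitive inventory data compresses well.
    records = [{"sku": f"item-{i}", "qty": i % 5, "status": "in_stock"}
               for i in range(500)]
    raw_size = len(json.dumps(records).encode("utf-8"))
    print("uncompressed bytes:", raw_size)
    print("compressed bytes:", len(encode(records)))
```

Compression trades CPU time for bandwidth, so it tends to pay off for large, repetitive payloads and can be counterproductive for tiny ones.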

4. The distributed network is secure

If you walk away with anything, let it be this–under absolutely no circumstances should you ever assume 100% security of your network. Complacency regarding network security is a dangerous fallacy that can leave systems vulnerable to malicious users and programs. In today's threat landscape, it is essential to prioritize security measures and not underestimate the adaptability of attackers. Conducting threat modeling exercises can help identify potential vulnerabilities and design appropriate security controls.

Implementing authentication and authorization mechanisms is crucial to ensure that only authorized entities can access network resources. End-to-end Transport Layer Security (TLS) encryption should be enforced via API gateways and service meshes to protect data in transit. By adopting a proactive approach to network security, developers can stay ahead of evolving threats and safeguard their distributed systems.
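
On the client side, enforcing TLS might look like the following sketch built on Python's standard `ssl` module. The `make_client_context` helper and its parameters are hypothetical. The certificate paths would come from your own PKI, and a gateway or service mesh often handles this step on the application's behalf.

```python
import ssl


def make_client_context(ca_file=None, client_cert=None, client_key=None):
    """Build a TLS context that verifies the server and, optionally,
    presents a client certificate for mutual TLS."""
    # create_default_context() enables certificate and hostname checks.
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if client_cert:
        # Paths are supplied by the caller; none are assumed here.
        context.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return context
```

The resulting context can be passed to standard library clients, for example `urllib.request.urlopen(url, context=context)`. The key design choice is to never disable verification to "make it work" in a test environment and then ship that setting.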

Additionally, consider a Zero Trust Security approach to all that you do. Zero Trust is a security approach that challenges the traditional perimeter-based security model, which assumes that everything inside the network is trusted and everything outside is untrusted. In contrast, the Zero Trust model assumes that no user or device should be automatically trusted, regardless of their location or network connection.

The Zero Trust security approach is based on the principle of "never trust, always verify." It requires continuous authentication, authorization, and verification of every user, device, and network resource attempting to access an organization's systems and data. This approach aims to minimize the risk of unauthorized access, lateral movement, and data breaches. We have an excellent eBook that dives into this topic further for those who want to be Zero Trust Security Masters.

5. Topology doesn't change

Assuming that the network topology remains static is a fallacy that can have significant implications for both bandwidth and latency. In reality, network topology changes for many reasons, such as scaling, load balancing, or infrastructure updates. These changes can impact the performance and reliability of distributed systems.

To mitigate the impact of changing network topology, it is crucial to regularly announce and audit network changes. Communication between development teams and operations teams becomes essential to ensure that all stakeholders are aware of the changes and can adapt their applications accordingly. In cloud networking environments, where dynamic scaling and infrastructure changes are common, this fallacy becomes even more critical to address.
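
On the application side, one way to tolerate topology changes is to avoid hard-coding addresses and instead re-resolve a service name on a short interval. The sketch below assumes DNS-based service discovery. The `EndpointResolver` class and the `orders.internal` host name are made up for illustration.

```python
import socket
import time


class EndpointResolver:
    """Caches resolved addresses for `ttl` seconds, then looks them up again."""

    def __init__(self, host, port, ttl=30.0,
                 resolve=socket.getaddrinfo, clock=time.monotonic):
        self.host = host
        self.port = port
        self.ttl = ttl
        self.resolve = resolve  # injectable so it can be faked in tests
        self.clock = clock
        self._cached = None
        self._fetched_at = 0.0

    def addresses(self):
        """Return the current list of IP addresses for the service."""
        now = self.clock()
        if self._cached is None or now - self._fetched_at >= self.ttl:
            infos = self.resolve(self.host, self.port, type=socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr).
            self._cached = sorted({info[4][0] for info in infos})
            self._fetched_at = now
        return self._cached


# Usage (hypothetical service name):
# resolver = EndpointResolver("orders.internal", 8080)
# targets = resolver.addresses()
```

A short TTL means that instances added or removed by autoscaling are picked up without a redeploy, at the cost of more frequent lookups.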

6. There is only one administrator

Assuming that there is only one administrator overseeing the distributed network system can lead to conflicting policies and decisions. Multiple administrators may institute different policies, and senders of network traffic must be aware of these policies to complete their desired paths. This fallacy highlights the importance of collaboration between development teams, platform teams, operations, and SRE.

Working closely with these teams helps developers understand the network capabilities and policies. By fostering effective communication and collaboration, conflicting policies can be minimized, and a unified approach to network administration can be established. This ensures that network traffic flows smoothly and consistently across the distributed system.

7. Transport cost is zero

The fallacy of assuming that transport cost is zero can have significant implications for the design and maintenance of network infrastructure. When developers discount the costs associated with building and maintaining a network, it can lead to inefficient resource allocation, increased operational expenses, and suboptimal performance.

In reality, building and maintaining a network infrastructure incurs various costs, both in terms of time and financial resources. These costs include hardware and software procurement, network equipment installation and configuration, ongoing maintenance and upgrades, and the expertise required to manage and troubleshoot network issues.

By assuming that transport cost is zero, developers may overlook the need to budget time and money for building and maintaining networks, API gateways, and service meshes. This can result in inadequate resources allocated to network infrastructure, leading to performance bottlenecks, increased latency, and potential service disruptions.

Also, developers often discount the serialization costs associated with data transfer. Serialization refers to the process of converting data structures or objects into a format suitable for transmission over a network. This process incurs computational overhead, as data needs to be encoded and decoded, and can consume significant CPU and memory resources.
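
Serialization cost is easy to measure rather than guess. The following sketch times a JSON round trip using only the Python standard library. The `measure_json_round_trip` helper is a hypothetical name, and the numbers it reports will vary by machine and payload.

```python
import json
import time


def measure_json_round_trip(obj, repeats=100):
    """Return payload size and average encode+decode time for `obj`."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        encoded = json.dumps(obj)   # encode: object -> text
        json.loads(encoded)         # decode: text -> object
    elapsed = time.perf_counter() - start
    return {
        "bytes": len(encoded.encode("utf-8")),
        "seconds_per_round_trip": elapsed / repeats,
    }


if __name__ == "__main__":
    # Hypothetical order payload with 1,000 line items.
    order = {"id": 1, "items": [{"sku": f"item-{i}", "qty": 1}
                                for i in range(1000)]}
    print(measure_json_round_trip(order))
```

Multiplying the per-message figure by the expected request rate gives a rough estimate of the CPU budget that serialization alone will consume.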

Most of these woes can be mitigated by properly budgeting time and financial resources for network infrastructure. As a technology leader, ensure your organization allocates resources for network equipment, software licenses, and ongoing maintenance and upgrades. Oh, and as I mentioned above, please do account for the serialization costs associated with data transfer.

8. The network is homogeneous

Last but certainly not least, assuming a homogeneous network can lead to problems similar to those of the first three fallacies we already covered. The distributed network systems fallacy of assuming that the network is homogeneous refers to the misconception that all components and elements within a network are identical and operate in a uniform manner. When developers assume a homogeneous network, they may overlook the variations in capabilities, performance, and behavior of different network components.

To really drive the point home, the solutions for fallacies 1-3 apply here as well: use fault-tolerant design patterns, understand latency, and collaborate with the platform team to learn what the network can actually do.

By recognizing and addressing the distributed network systems fallacy of network homogeneity, developers can design more robust and efficient systems. This ensures that applications can adapt to the diverse network components and deliver reliable, performant, and scalable solutions.

We’ve All Fallen For a Distributed Network Systems Fallacy

Let’s be honest: at least one of these faulty assumptions has probably crossed your mind in your years as a software developer or technology team leader. I know I’ve been guilty of it, too! But now that you know what to look out for, I hope you’ll be able to spot these common misconceptions about network infrastructure and know how to combat them.

And when in doubt, collaborate with platform teams, operations, and SRE to gain a comprehensive understanding of network capabilities and policies. By addressing these fallacies, we can improve the reliability, security, and efficiency of our distributed network systems.