Part 1 of the Torus DHT Blog Series


I’ve spent the last few months building Torus, a distributed hash table implementation based on the Chord protocol. Along the way, I’ve learned that DHTs aren’t just academic curiosities from MIT research papers, they’re the hidden backbone of systems you probably use every day.

But before we dive into the architecture, the finger tables, and the logarithmic routing algorithms, let’s talk about why any of this matters. Because honestly, when I first encountered the term “distributed hash table,” my eyes glazed over too.

The Problem: When Your Server Becomes Your Ceiling

Picture this: You’re building the next great file-sharing application. Maybe it’s for musicians collaborating on tracks, or developers sharing code snippets, or just a modern take on file storage. You start simple — a central server, a database, maybe a nice REST API. Classic architecture. Works great.

Then you get popular.

Suddenly, that single server is handling thousands of requests per second. Your database is groaning under the load. You add more RAM, upgrade your instance type, maybe throw in some read replicas. You’re scaling vertically, and it’s getting expensive fast.

But here’s the real kicker: no matter how beefy your server becomes, you’ve got a fundamental problem. That server is a single point of failure. When it goes down (and it will go down), your entire application becomes a very expensive paperweight.

Netflix knows this problem intimately. With hundreds of millions subscribers worldwide, they can’t afford to have a single server (or even a single data center) be the bottleneck. When you hit “play” on your favorite show, Netflix routes your request through a massively distributed system, content delivery networks spanning the globe, with localized caches, redundant servers, and sophisticated failover mechanisms.

Spotify faces similar challenges. They serve music to millions of simultaneous listeners, and interestingly, their music metadata is distributed across regions rather than centralized. If the service relied on a single database in a single location, the latency alone would make the experience terrible for users on the other side of the planet. Not to mention what would happen if that location went offline.

The traditional solution? Scale horizontally, add more servers, use load balancers, shard your database. This works, but now you’ve got a new problem: coordination. How do your servers agree on what data lives where? How do you handle servers joining and leaving your cluster? How do you ensure consistency when network partitions happen (and they will happen)?

Welcome to distributed systems, where everything is harder than you think.

Why Coordination is Actually Impossible

Here’s something interesting I learned: perfect coordination in distributed systems is mathematically impossible.

There’s this classic thought experiment called the Two Generals Problem that illustrates why. Imagine two armies on opposite hills, planning to attack a city between them. They can only communicate by sending messengers through enemy territory, where the messenger might get captured.

General A sends a message: “Attack at dawn.” But did General B receive it? Better send a confirmation.

General B sends back: “Received, attacking at dawn.” But did General A get the confirmation? Better confirm the confirmation.

General A sends: “Confirmation received.” But did General B… you see where this is going.

No matter how many messages they exchange, neither general can be 100% certain that the other knows they’re on the same page. There’s always uncertainty because the communication channel is unreliable.

In distributed systems, your servers are those generals, and the network between them is that unreliable messenger. You can’t eliminate this uncertainty, you can only mitigate it. This is why distributed systems design is all about trade-offs, not perfect solutions.

The CAP Theorem: Pick Your Battles

Speaking of trade-offs, let’s talk about the CAP theorem. It’s one of those things that sounds intimidating but is actually pretty intuitive once you get it.

CAP stands for Consistency, Availability, and Partition Tolerance. The theorem, conjectured by Eric Brewer in 2000 and formally proven by Gilbert and Lynch in 2002, states that in any distributed system, you can only guarantee two of these three properties at once:

  • Consistency: Every read receives the most recent write. All nodes see the same data at the same time.
  • Availability: Every request gets a response (even if it’s not the latest data). The system keeps working even if some nodes are down.
  • Partition Tolerance: The system keeps operating even when network failures split your cluster into isolated groups.

Here’s the thing: in real-world distributed systems, network partitions will happen. A switch will fail, a cable will get unplugged, AWS will have a bad day in us-east-1. Partition tolerance isn’t optional, it’s mandatory.

So the real choice is between CP (Consistency + Partition Tolerance) and AP (Availability + Partition Tolerance).

MongoDB chose CP. When a network partition happens, MongoDB prioritizes consistency, it’ll refuse to serve potentially stale data, even if that means some requests fail. For applications where showing incorrect data is worse than showing no data (think financial transactions), this makes sense.

Cassandra chose AP(Availability + Partition Tolerance). When partitions occur, Cassandra keeps serving requests and accepts that some nodes might temporarily have different versions of the data. Eventually (when the partition heals), all nodes converge to the same state. For applications where availability matters more than instant consistency (think social media feeds), this is the right trade-off.

There’s no “best” choice, only the right choice for your specific use case.

Enter: Distributed Hash Tables

So we’ve established that:

  1. Centralized systems don’t scale and have single points of failure
  2. Distributed systems are hard because coordination is impossible
  3. You can’t have perfect consistency, availability, and partition tolerance

This is where Distributed Hash Tables (DHTs) enter the picture.

You already know what a hash table is — it’s that data structure you use all the time for O(1) key-value lookups. HashMap in Java, dict in Python, map in Go. You give it a key, it hashes that key to find the right bucket, and boom, instant retrieval.

Now imagine spreading that hash table across hundreds or thousands of different machines. Each machine is responsible for storing some subset of the key-value pairs. There’s no central coordinator, no master node, no single point of failure. The system is truly distributed.

That’s a DHT. But it raises some immediate questions:

  • How do you know which machine stores which keys? You can’t ask a central coordinator because there isn’t one.
  • What happens when machines join or leave? Do you have to rehash and move all the data?
  • How do you find a key’s location without asking every single node? Checking thousands of machines defeats the point.

This is what makes DHTs clever. They solve these problems with elegant algorithms that provide:

  1. Decentralization: No single point of failure, no bottleneck
  2. Scalability: Adding more nodes increases capacity and often improves performance
  3. Fault Tolerance: The system keeps working even when nodes crash or network partitions occur

DHTs in the Wild: You’re Already Using Them

Before you think “this is all theoretical,” let me show you where DHTs are working right now, at massive scale.

BitTorrent: The Internet-Scale DHT

BitTorrent’s Mainline DHT is one of the largest distributed systems on the planet. We’re talking about tens of millions of concurrent nodes at any given moment. That’s not thousands — tens of millions.

When you download a torrent, your BitTorrent client doesn’t need to rely on a central tracker anymore (though it can). Instead, it queries the DHT to find peers who have the file you want. The DHT uses a protocol called Kademlia (a cousin of Chord, which Torus implements) to efficiently route queries through millions of nodes.

Think about that for a second. Your laptop can find a specific peer among millions of nodes, in seconds, without asking a central server. That’s the power of DHT routing algorithms.

IPFS: The Distributed Web

The InterPlanetary File System (IPFS) uses DHTs for content discovery. When you request a file from IPFS, the system uses the DHT to find which peers are storing that content. No central directory, no single organization controlling what’s available.

IPFS’s DHT (also Kademlia-based) stores over billions of records. The vision is a decentralized alternative to HTTP, a web where content is addressed by its cryptographic hash, not by its location on a specific server. Whether that vision fully materializes or not, the DHT technology powering it is very real and very functional.

Cassandra: Distributed Databases

Remember when we talked about Cassandra choosing AP (Availability + Partition Tolerance)? Under the hood, Cassandra uses consistent hashing—a core DHT technique to distribute data across its cluster.

When you write data to Cassandra, the system hashes your partition key to determine which nodes are responsible for storing that data. When nodes join or leave the cluster, only a fraction of the data needs to move. This is massive for scalability, adding a node to a 100-node cluster doesn’t require reshuffling all your data.

Ethereum: Peer Discovery

Ethereum nodes need to find each other without relying on centralized servers. The discv5 (discovery version 5) protocol uses a Kademlia-derived DHT for peer discovery.

When your Ethereum node starts up, it needs to connect to the network. It uses bootstrap nodes to join the DHT, then participates in maintaining the distributed database of all live nodes. This allows the Ethereum network to remain decentralized, there’s no “Ethereum, Inc.” running servers that every node must connect through.

The clever part? The DHT doesn’t just store arbitrary data, it stores and relays “node records,” which are cryptographically signed documents proving a node’s identity. This provides peer discovery and security without central authority.

The Three Superpowers of DHTs

After seeing these real-world applications, you might notice a pattern. DHTs consistently deliver three key properties that are incredibly hard to achieve otherwise:

1. No Single Point of Failure

There’s no master server to attack, no central database to corrupt, no CEO to serve with a takedown notice. The system is fundamentally resilient because it’s fundamentally distributed. You’d have to take down a significant fraction of all nodes simultaneously to kill the network.

BitTorrent has survived legal challenges, ISP throttling, and coordinated attacks precisely because there’s no central target. Take down one node (or a thousand nodes), and the DHT keeps functioning.

2. Scalable Without Bottlenecks

In traditional centralized systems, scaling means beefing up your central server or adding more replicas behind a load balancer. But the coordinator itself becomes the bottleneck.

In a DHT, adding more nodes increases the system’s capacity. More nodes means more storage, more bandwidth, and (with good routing algorithms) often better performance because queries have more paths to their destination.

The Chord protocol (which Torus implements) guarantees O(log N) routing complexity. In a network of 1,000 nodes, finding a key takes about 10 hops. In a network of 1,000,000 nodes? About 20 hops. That logarithmic scaling is what makes massive DHT networks feasible.

3. Fault Tolerant by Design

Nodes crash. Hard drives fail. Network cables get accidentally unplugged. In traditional systems, you handle this with redundancy — backup servers, RAID arrays, failover mechanisms.

DHTs build fault tolerance into the routing protocol itself. Data is replicated across multiple nodes (typically to successor nodes in the hash ring). If a node goes down, queries automatically route around it. When the node comes back up, it resynchronizes its data with its neighbors.

This isn’t perfect eventual consistency as there are still edge cases and failure modes, but it’s remarkably robust for a fully decentralized system.

The Price of Decentralization

I’d be lying if I said DHTs are a silver bullet. They’re not. There are real trade-offs you accept when you choose a DHT architecture:

Eventual consistency, not strong consistency. In most DHT designs (including Chord), you get eventual consistency. When you write data, it might take time to propagate to all replicas. Reads might return stale data. For some applications (financial transactions, collaborative editing), this is unacceptable.

Limited query capabilities. Traditional databases give you SQL, indexes, joins, aggregations. DHTs give you… GET(key) and PUT(key, value). That’s pretty much it. Want to do a range query? You’re in for a world of pain. DHTs are optimized for exact-key lookups, and that’s a significant constraint.

Network overhead. Every operation in a DHT involves network communication, often multiple hops across different nodes. Compare that to a local hash table in memory, and you’re looking at latencies that are orders of magnitude higher. This matters for latency-sensitive applications.

Complex failure modes. While DHTs handle individual node failures well, they can exhibit complex failure modes during network partitions or when many nodes crash simultaneously. Debugging a distributed hash table is way harder than debugging a single-server application. Trust me on this.

But here’s the thing: for certain classes of problems — truly decentralized systems, massive scale, censorship resistance, peer-to-peer applications, these trade-offs are worth it. Often, they’re the only viable option.

Why I Built Torus

When I started learning about distributed systems, I found the concepts fascinating but abstract. Reading papers about Chord and Kademlia was one thing, understanding how they actually work in practice was another.

So I decided to build one.

Torus is my implementation of the Chord DHT protocol, with a modern tech stack (Go backend, Next.js frontend, D3.js visualization) and a focus on being educational. The goal wasn’t to replace Cassandra or compete with IPFS. The goal was to make the abstract concrete, to see finger tables in action, to watch stabilization protocols maintain the ring, to visualize how O(log N) routing actually works.

Building Torus taught me more about distributed systems than reading a dozen textbooks. I encountered race conditions I never imagined, edge cases that only appear when nodes join and leave simultaneously, and the subtle beauty of consistent hashing algorithms.

Over this blog series, I’m going to walk you through that journey. We’ll dive deep into the Chord protocol, explore the challenges of building a production-ready DHT, examine the code that makes it all work, and discuss the trade-offs at every level.

But I wanted to start here, with the problem space, because understanding why DHTs exist is crucial to appreciating how they work. These aren’t abstract academic exercises, they’re practical solutions to real problems that centralized systems can’t solve.

What’s Next

In the next post, we’ll crack open the Chord paper and demystify the protocol that makes Torus possible. We’ll explore:

  • Consistent hashing and why circles are the perfect shape for distributed systems
  • Finger tables and the elegant trick that enables O(log N) routing
  • How nodes join and leave the ring without losing data
  • Why the math works (in plain English, I promise)

If you’ve ever wondered how BitTorrent finds peers so quickly, or how Cassandra decides which node owns which data, or what “consistent hashing” actually means beyond the buzzword, the next post is for you.

Until then, think about the systems you use every day. How many of them depend on centralized servers? What would happen if those servers disappeared tomorrow? And could a DHT architecture solve that problem?

That’s the question that drives this whole field. And it’s the question that led me to build Torus.


If you want to explore the Torus codebase yourself, it’s open source on GitHub. The visualization alone is worth checking out, watching Chord’s routing algorithm work in real-time makes everything click in a way that diagrams never could.

Have questions about DHTs, distributed systems, or anything in this post? Drop a comment below. I’m learning this stuff alongside you, and I’d love to hear what resonates (or what doesn’t make sense yet).


Further Reading:


<
Previous Post
The Invisible Man: How Go’s Standard Library Tricks the Compiler
>
Next Post
The Ring and Consistent Hashing: Chord’s Foundation