Cache Lines & Atomics

In Part 2, we learned how to play “Struct Tetris.” We took bloated, inefficient structs and rearranged their fields to squeeze out every drop of wasted memory. We felt smart. We saved RAM.

But in software engineering, there is rarely a free lunch.

Sometimes, optimizing purely for size can accidentally destroy your speed. Even worse, on certain hardware, getting the alignment wrong doesn’t just make things slow—it makes your program crash instantly.

Today, we are leaving the safe world of RAM and entering the chaotic world of the CPU Cache.

The Invisible 64-Byte Line

We tend to think of variables as individual islands. If I have var A int64 and var B int64, they feel distinct. If Goroutine 1 touches A, it shouldn’t affect Goroutine 2 touching B.

But your CPU doesn’t see islands. It sees continents.

When your CPU pulls data from main memory (RAM), it doesn’t grab just the 8 bytes you asked for. That would be inefficient. Instead, it grabs a “Cache Line”—typically 64 bytes of consecutive memory—and pulls it into the L1 Cache.

Think of it like buying snacks. You don’t drive to the store to buy one single potato chip. You buy a whole bag.

This efficiency mechanism leads to one of the most insidious performance killers in concurrent programming: False Sharing.

The “Noisy Neighbor” Problem (False Sharing)

Imagine we have a struct tracking metrics for a web server. We optimized it perfectly using our skills from Part 2:

type ServerStats struct {
    ReqCount  int64 // Updated by Goroutine A
    ErrCount  int64 // Updated by Goroutine B
}

This struct is tiny (16 bytes). It fits easily inside a single 64-byte cache line.

Now, let’s say:

  • Core 1 runs Goroutine A, incrementing ReqCount.
  • Core 2 runs Goroutine B, incrementing ErrCount.

Both cores want this data in their L1 cache.

  1. Core 1 loads the cache line (containing both ReqCount and ErrCount) to write to ReqCount. It marks the line as “Modified.”
  2. Core 2 tries to read ErrCount. But the hardware sees that Core 1 holds the “exclusive lock” on that chunk of memory. Core 2 is forced to wait for Core 1 to sync data back.
  3. Core 2 finally gets the line, writes to ErrCount, and marks it “Modified.”
  4. Core 1 tries to write ReqCount again. It has to wait for Core 2.

This is called Cache Line Ping-Pong.

Even though the variables are logically different, they live in the same physical “house” (cache line). The cores fight over the cache line lock like two siblings fighting over a single TV remote. This can degrade performance by alot in high-concurrency systems.

The Fix: Manual Padding

To fix this, we have to do the exact opposite of what we did in Part 2. We need to waste memory on purpose.

We need to force ReqCount and ErrCount to live far enough apart that they end up on different cache lines.

type ServerStats struct {
    ReqCount int64
    // Pad with 56 bytes to fill the rest of the 64-byte line
    // (8 bytes used + 56 padding = 64 bytes)
    _ [56]byte 
    
    ErrCount int64
    _ [56]byte
}

Now, ReqCount lives on Cache Line X, and ErrCount lives on Cache Line Y. The cores stop fighting. Peace is restored.

Note: In the Go standard library, you will often see cpu.CacheLinePad used for this exact purpose.

The Atomic Crash (Alignment or Death)

False sharing just makes your code slow. But there is another alignment issue that will make your code panic.

This involves the sync/atomic package.

Atomic operations (like atomic.AddInt64) rely on specific CPU instructions. On 64-bit architectures (like your modern laptop), this is usually fine. But on 32-bit architectures (like ARMv7 or x86-32, often found in IoT devices or older servers), the rules are strict.

The Rule: A 64-bit atomic operation must be performed on a variable that is 64-bit aligned (starts at a memory address divisible by 8).

On a 32-bit system, the natural alignment of a standard variable is only 4 bytes. Look at this struct:

type DangerousStruct struct {
    count int32 // 4 bytes
    total int64 // 8 bytes (but on 32-bit arch, might align to 4!)
}

On a 32-bit machine:

  1. count takes bytes 0-3.
  2. total might start at byte 4. 4 is not divisible by 8.

If you try to call atomic.AddInt64(&d.total, 1), the Go runtime on a 32-bit machine will detect this misalignment and immediately panic.

The Fix: Order Matters Again

To prevent this crash, you must ensure 64-bit fields are the first element in the struct (since the struct itself is usually pointer-aligned), or you have to use manual padding to push it to an 8-byte boundary.

type SafeStruct struct {
    total int64 // Put this first!
    count int32
}

Note: As of Go 1.19+, the atomic.Int64 and atomic.Pointer types handle alignment for you automatically, effectively solving this headache. But if you are using the primitive functions (atomic.AddInt64) or maintaining older codebases, this trap is still waiting for you.

The “Do Not Move” Sign (nocopy)

Before we leave the world of hardware and atomics, there is one more structural hack you need to know. It’s not about speed; it’s about safety.

In Go, structs are copied by value. If I pass a struct to a function, Go clones the entire block of memory.

func doSomething(s MyStruct) { ... } // 's' is a brand new copy

Usually, this is fine. But what if your struct contains a sync.Mutex or a sync.WaitGroup?

These types contain internal state (semaphores and counters). If you copy a Mutex, you copy its “locked” state. The function receives a new lock that might be locked, but unlocking it won’t affect the original lock. This leads to deadlocks that are nightmares to debug.

To prevent this, the Go standard library uses a clever trick called the nocopy idiom.

If you look at the source code for sync.WaitGroup, you’ll see something like this:

type WaitGroup struct {
	noCopy noCopy

	state atomic.Uint64 // high 32 bits are counter, low 32 bits are waiter count.
	sema  uint32
}

What is noCopy? It’s a tiny, empty struct defined inside the sync package. It doesn’t really do anything at runtime (it’s zero-sized), but it triggers the go vet tool.

If you try to pass a WaitGroup by value, go vet checks if the struct implements a Lock() and Unlock() method (which noCopy does, artificially). If it detects a copy, it yells at you:

call of process copies lock value: sync.WaitGroup contains sync.noCopy

Actionable Tip: If you create a struct that must not be copied (e.g., it manages a file handle or a local cache), you can add this protection yourself:

type MySensitiveData struct {
    noCopy struct{} // Just a marker
    data   map[string]string
}

// Add a dummy Lock method to trigger 'go vet'
func (*MySensitiveData) Lock()   {}
func (*MySensitiveData) Unlock() {}

Now, if a junior developer tries to pass this struct by value, the linter will stop them before they break production.


Summary of Part 3

We have now seen the extremes of memory layout:

  1. Too Loose: If we are sloppy with struct fields, we waste RAM (Part 2).
  2. Too Tight: If we pack active variables too close together, we trigger False Sharing and kill CPU performance (Part 3).
  3. Just Wrong: If we ignore hardware alignment rules on 32-bit systems, sync/atomic will crash the app (Part 3).
  4. Too Copied: If we pass lock-containing structs by value, we duplicate internal state and create invisible deadlocks (Part 3).

So far, we have been working with honest data structures. An int64 is 8 bytes. A struct’s size is calculable. Alignment follows predictable rules.

But there are three types in Go that don’t play by these rules. Three types that give limited context about their size. Three types that hide their complexity behind syntactic sugar.

In Part 4, we’re going to dissect them. See you there!


<
Previous Post
Breaking the Glass: A Deep Dive into Go’s Unsafe Package (Part 1)
>
Next Post
Part 4: The Hidden Headers (Strings, Slices, Interfaces, and Maps)