Cache Coherence

In modern computer systems, caches play a vital role in improving performance by reducing the latency of memory accesses. However, in a concurrent environment, caches can introduce challenges related to cache coherence.

Cache coherence ensures that multiple copies of the same data in different caches are kept consistent. When multiple threads access shared data, it’s essential to maintain cache coherence to avoid reading stale or inconsistent data.

To achieve cache coherence, various protocols and mechanisms are employed, such as the MESI (Modified, Exclusive, Shared, Invalid) protocol. These protocols define the states and transitions of cache lines to ensure data consistency across multiple caches.
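
As a rough mental model (deliberately simplified; real hardware adds bus transactions, write-backs, and extra states such as MOESI's Owned), the per-cache-line MESI transitions can be sketched as a small state machine. Here is an illustrative sketch in Java; the class, enum, and parameter names are invented for the example:

// A deliberately simplified model of per-cache-line MESI transitions.
public class MesiSketch {

    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    // otherCopiesExist models the snoop result on a read miss:
    // if another cache already holds the line, we load it as SHARED, else EXCLUSIVE.
    static State next(State s, Event e, boolean otherCopiesExist) {
        switch (e) {
            case LOCAL_READ:
                return s == State.INVALID
                        ? (otherCopiesExist ? State.SHARED : State.EXCLUSIVE)
                        : s;                    // M, E, S satisfy reads locally
            case LOCAL_WRITE:
                return State.MODIFIED;          // other copies get invalidated first
            case REMOTE_READ:
                return (s == State.MODIFIED || s == State.EXCLUSIVE)
                        ? State.SHARED          // M writes back its dirty data, then both share
                        : s;
            case REMOTE_WRITE:
                return State.INVALID;           // another core took ownership of the line
            default:
                throw new IllegalArgumentException("unknown event: " + e);
        }
    }
}

The transitions that matter most for software are the ones out of MODIFIED and into INVALID: they are what make a write by one core visible to, and costly for, the others.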

Here’s an example in Java that illustrates the visibility guarantees needed when threads share data:

public class SharedCounter {
    private volatile int counter = 0;

    public void increment() {
        counter++; // not atomic: ++ is a read-modify-write even on a volatile field
    }

    public int getCounter() {
        return counter;
    }
}

In this example, the volatile keyword guarantees visibility: a write to counter by one thread is visible to subsequent reads by other threads, so no thread keeps operating on a stale value. Note, however, that volatile does not make counter++ atomic; the increment is still a read-modify-write, so concurrent calls to increment() can lose updates. Visibility and atomicity are separate concerns, which the sketch below addresses.
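
A counter that is both visible and safe under concurrent increments typically uses java.util.concurrent.atomic; a minimal sketch:

import java.util.concurrent.atomic.AtomicInteger;

public class AtomicSharedCounter {
    private final AtomicInteger counter = new AtomicInteger();

    public void increment() {
        counter.incrementAndGet(); // one atomic read-modify-write: no lost updates
    }

    public int getCounter() {
        return counter.get(); // also gives volatile-style visibility
    }
}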

False Sharing

False sharing occurs when multiple threads access different variables that happen to reside on the same cache line. Because the coherence protocol tracks ownership at cache-line granularity, a write by one thread invalidates the entire line in the other cores' caches even though no data is actually shared, causing needless invalidation traffic and degraded performance.

To mitigate false sharing, it’s important to ensure that frequently accessed and modified data by different threads are placed on separate cache lines. This can be achieved through techniques such as padding and alignment.

Here’s an example in Golang that demonstrates false sharing:

import "sync/atomic"

// Counter is padded out to 64 bytes (8 for value + 56 of padding),
// a typical cache-line size, so adjacent Counter values do not share a line.
type Counter struct {
    value int64
    _     [56]byte // padding to fill a 64-byte cache line
}

func (c *Counter) Increment() {
    atomic.AddInt64(&c.value, 1)
}

func (c *Counter) Value() int64 {
    return atomic.LoadInt64(&c.value)
}

In this example, the Counter struct is padded out to 64 bytes, a typical cache-line size, so that adjacent Counter values (for example, elements of a slice) fall on separate cache lines, and threads incrementing different counters stop invalidating each other's lines. (For complete isolation the first element must also be cache-line aligned, but size padding alone removes most of the contention.)
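
The same idea can be expressed in Java. As a rough sketch only: the JVM is free to reorder fields, so manual padding is best-effort, and the class and field names below are invented for illustration:

// Best-effort padding around a hot field. Assumes a 64-byte cache line;
// the JVM may reorder fields, so treat this as an approximation.
public class PaddedValue {
    private long p1, p2, p3, p4, p5, p6, p7;   // padding before the hot field
    private volatile long value;               // the field threads hammer on
    private long q1, q2, q3, q4, q5, q6, q7;   // padding after the hot field

    public void set(long v) {
        value = v;
    }

    public long get() {
        return value;
    }
}

In practice, java.util.concurrent.atomic.LongAdder already applies this technique internally, which is often the simpler choice for heavily contended counters.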

Memory Barriers and Fences

Memory barriers and fences are synchronization mechanisms used to enforce ordering and visibility constraints on memory operations. They ensure that memory operations are executed in a specific order and that the effects of those operations are visible to other threads.

Memory barriers prevent the compiler and the processor from reordering memory operations across them. A full barrier guarantees that all memory operations issued before it complete and become visible before any issued after it begin; weaker variants such as load/store or acquire/release barriers constrain only specific combinations and are correspondingly cheaper.

Here’s an example in Java that demonstrates the use of memory barriers:

public class SignalingExample {
    private volatile boolean flag = false;
    private int data = 0;

    public void send(int value) {
        data = value;
        flag = true; // volatile write acts as a release: the write to data above cannot be reordered after it
    }

    public int receive() {
        while (!flag) {
            // Wait for flag to be set
        }
        return data; // the volatile read of flag above acts as an acquire: this read sees the data written in send
    }
}

In this example, the volatile write to flag in send and the volatile read of flag in receive establish a happens-before relationship under the Java Memory Model: the write to data cannot be reordered after the write to flag, and once the receiver observes flag == true it is guaranteed to see the value written to data.
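
Java mostly hides barriers behind volatile and synchronized, but since Java 9 the java.lang.invoke.VarHandle API exposes finer-grained ordering modes. Here is a sketch of the same signaling pattern using acquire/release access modes instead of a volatile field; the class and field names are illustrative:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class FenceSignalingExample {
    private boolean flag;  // accessed only through FLAG, with release/acquire ordering
    private int data;

    private static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findVarHandle(FenceSignalingExample.class, "flag", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public void send(int value) {
        data = value;                // plain write
        FLAG.setRelease(this, true); // release: the data write cannot move after this store
    }

    public int receive() {
        while (!(boolean) FLAG.getAcquire(this)) {
            Thread.onSpinWait();     // busy-wait hint (Java 9+)
        }
        return data;                 // the acquiring read above makes the data write visible
    }
}

setRelease and getAcquire provide the same ordering guarantee as the volatile pair above, without the full sequential-consistency cost of volatile accesses.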

Non-Uniform Memory Access (NUMA)

Non-Uniform Memory Access (NUMA) is a memory architecture in which access time depends on where memory is located relative to the processor. Each processor (or socket) has memory attached directly to it, and accessing that local memory is faster than reaching memory attached to another socket across the interconnect.

When writing concurrent programs on NUMA systems, it’s important to consider data locality and minimize cross-processor memory access. This can be achieved by techniques such as thread pinning, where threads are assigned to specific processors, and data partitioning, where data is distributed among the local memories of different processors.

Here’s an example in Golang that sketches thread pinning. Go’s runtime has no portable CPU-affinity API, so this relies on runtime.LockOSThread combined with the Linux-specific SchedSetaffinity call from golang.org/x/sys/unix:

package main

import (
    "runtime"
    "sync"

    "golang.org/x/sys/unix"
)

func worker(id int, wg *sync.WaitGroup) {
    defer wg.Done()

    // Pin this goroutine to a single OS thread so the affinity below sticks.
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()

    // Restrict the current thread to one CPU (Linux-specific; pid 0 means "this thread").
    var set unix.CPUSet
    set.Zero()
    set.Set(id % runtime.NumCPU())
    _ = unix.SchedSetaffinity(0, &set) // best-effort: continue unpinned if this fails

    // Perform work, ideally on data that stays local to this CPU/NUMA node.
    // ...
}

func main() {
    var wg sync.WaitGroup
    numWorkers := runtime.NumCPU()

    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go worker(i, &wg)
    }

    wg.Wait()
}

In this example, runtime.LockOSThread() binds each worker goroutine to its own OS thread, and unix.SchedSetaffinity restricts that thread to a single CPU. The goroutine then runs on the same processor for its lifetime, which promotes data locality and reduces cross-node memory traffic. Keep in mind that this approach is Linux-specific and that affinity is best treated as an optimization hint.
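
Thread pinning controls where code runs; data partitioning controls which thread touches which data. Java cannot place memory on a specific NUMA node directly (the JVM’s -XX:+UseNUMA flag and the operating system’s placement policies handle that), but giving each worker its own disjoint slice of the data keeps hot memory private to one thread. A minimal sketch in Java, with invented class and method names:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartitionedSum {
    // Each worker sums its own contiguous slice, so no two threads write
    // to the same data (or the same cache lines) inside the hot loop.
    public static long sum(long[] data, int workers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (data.length + workers - 1) / workers;
            List<Future<Long>> partials = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int from = Math.min(data.length, w * chunk);
                final int to = Math.min(data.length, from + chunk);
                partials.add(pool.submit(() -> {
                    long local = 0;            // thread-private accumulator
                    for (int i = from; i < to; i++) {
                        local += data[i];
                    }
                    return local;
                }));
            }
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get();              // merge once, after the hot loops finish
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}

Pinning and partitioning work best together: when each slice is touched by one thread on one CPU, the operating system’s first-touch policy tends to place those pages on that CPU’s local NUMA node.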