【tech blog review】: CPU cache-friendly data structure in GO

TL;DR

CPU Cache-Friendly Data Structures in Go: 10x Speed with Same Algorithm

正文#

看了一篇技术博客，关于 Go 程序性能优化的，受益匪浅，值得记录一下，首先原文如下：

https://skoredin.pro/blog/golang/cpu-cache-friendly-go

前置知识#

L1 Cache: 4 cycles (~1ns) 32KB
L2 Cache: 12 cycles (~3ns) 256KB
L3 Cache: 40 cycles (~10ns) 8MB
RAM: 200+ cycles (~60ns) 32GB

从缓存中读取数据的速度最快比从内存中快 60 倍
在 x86_64 架构上，一个缓存行通常是 64 字节：CPU 缓存不是一个字节一个字节地存储数据的，而是以固定大小的块来存储，这个块就叫做 缓存行(cache line)。当 CPU 从内存中读取数据时，它会把目标数据以及它周围的几十个字节一起读入一个缓存行。
缓存一致性：在多核系统中，为了确保所有 CPU cores 看到的数据都是最新的，多核系统有一个 “缓存一致性协议”。当一个 core 修改了它缓存中的某个数据时，它必须通知其他 core：这个缓存行中的数据已经过期了，你们不能再用旧数据了。

// BROKEN: False sharing destroys performance
type Counters struct {
    requests uint64 // 8 bytes
    errors uint64 // 8 bytes - SAME cache line!
    latency uint64 // 8 bytes - SAME cache line!
}
// Result: 45ns/op under contention

requests, errors,latency 这三个变量大概率会进入同一个 cache line，当其中一个变量被更新时，另外两个同时缓存失效了。这个时候再去获取它们，就得从内存里面去拿，这样显然速度慢了好多倍。

一种方法就是空间换时间：

// FIXED: Padding prevents false sharing
type Counters struct {
    requests uint64
    _ [56]byte // Padding to 64 bytes (cache line)
    errors uint64
    _ [56]byte
    latency uint64
    _ [56]byte
    timeouts uint64
    _ [56]byte
}

// Result: 7ns/op - 6.4x faster!

使用填充技术，将上述每一个变量放到一个 64 字节的内存块中，这样它们就会在不同的 cache line 中，各自的缓存更新不相互影响。这样的优化之后，速度快了 6 倍！虽然内存填充浪费了一些内存，但是在时间上的优化是可观的，所以值得。

结构体数组 VS. 数组结构体#

// 结构体数组 Array of Structs (AoS) - Cache UNFRIENDLY
type Entity struct {
    ID uint64 // 8 bytes
    X, Y, Z float64 // 24 bytes
    Velocity float64 // 8 bytes
    Health int // 8 bytes
    Type string // 16 bytes
    // Total: 64+ bytes per entity
}

type World struct {
    Entities []Entity // Each entity in different cache lines
}

func (w *World) UpdatePositions(dt float64) {
    for i := range w.Entities {
        // Loads 64+ bytes but only uses 32 bytes (X,Y,Z,Velocity)
        // Wastes 50% of cache!
        w.Entities[i].X += w.Entities[i].Velocity * dt
    }
}

// Benchmark: 85ns per entity

// 数组结构体 Struct of Arrays (SoA) - Cache FRIENDLY
type World struct {
    IDs []uint64
    Positions [][3]float64 // X,Y,Z together
    Velocities []float64
    Healths []int
    Types []string
}

func (w *World) UpdatePositions(dt float64) {
    // Positions and Velocities are contiguous in memory
    // CPU loads 8 positions per cache line!
    for i := range w.Positions {
        w.Positions[i][0] += w.Velocities[i] * dt
    }
}

// Benchmark: 12ns per entity - 7x faster!

Aos 是先定义一个复合数据类型组成的结构体，再重复 N 次（数组），SoA 是先将每一种数据类型重复 N 次（数组），再组合为结构体。

在 SoA 中，所有相同类型的属性被存储在各自的连续数组中。例如，所有实体的 Positions 数据都在 w.Positions 这个切片中，并且在内存中是连续排列的。当 CPU 需要访问 w.Positions[i] 时，它会从内存中加载一个缓存行。由于 w.Positions 中的元素是连续的，这个缓存行不仅会包含 w.Positions[i] 的数据，很可能还会包含 w.Positions[i+1]、w.Positions[i+2] 等后续元素的数据（预取，for 循环遍历操作会发生预取操作）。这意味着，CPU 一次从主内存加载数据，就能获得多个接下来可能需要的数据，大大增加了缓存命中率。

另外，在 UpdatePositions 这样的操作中， AoS 每次会加载整个 Entity，而这里面包含很多不需要的数据，而同样的操作，SoA 则只会加载需要的 Positions, Velocities 数据，其他数据不会被加载（冷热数据）。

预取 (Prefetching)#

刚刚提到了在 for 遍历的时候，预取器就会产生作用，将后续会用到的数据提前加载到缓存中，提高缓存命中率。

// Linear access - CPU prefetcher works great
func SumLinear(data []int) int {
    sum := 0
    for i := 0; i < len(data); i++ {
        sum += data[i] // CPU prefetches data[i+1], data[i+2]... }
    return sum
}
// 2.1ns per element

// Random access - Prefetcher can't help
func SumRandom(data []int, indices []int) int {
    sum := 0
    for _, idx := range indices {
        sum += data[idx] // Cache miss likely
    }
    return sum
}
// 45ns per element - 20x slower!

在线性遍历数据的时候，当前循环是 data[i], CPU 会预取 data[i+1], data[i+2]… 到缓存，从而加快数据遍历。

而在第一种案例中，由于从切片 indices 中获取的值不确定，因此 data[idx] 并不一定是线性增长的，CPU 预取器无法工作，也就不会发生数据预取了。可能实际我们不会绕弯写这种遍历程序，但是作为测试，它们已经很明显的展示了两种写法之前的巨大差异，体现了 CPU 预取的威力。

冷热数据分离#

刚才也涉及到了这个知识点，即要用到的热数据才会被加载，不会被用到的冷数据就不加载。这种冷热数据分离的性能差别有多大呢？下面的测试用例可以说明：

// BROKEN: Hot and cold data mixed
type User struct {
    ID uint64 // HOT: accessed frequently
    Score int // HOT: accessed frequently
    Name string // COLD: rarely accessed
    Email string // COLD: rarely accessed
    Address string // COLD: rarely accessed
    Bio string // COLD: rarely accessed
    // One User = 100+ bytes, but we often only need 16 bytes
}

func TopUsers(users []User) []uint64 {
    // Sorts load 100+ bytes per user but only use Score
    sort.Slice(users, func(i, j int) bool {
        return users[i].Score > users[j].Score // Cache thrashing!
    })
}

// FIXED: Separate hot and cold data
type UserHot struct {
    ID uint64
    Score int
    Cold *UserCold // Pointer to cold data
}

type UserCold struct {
    Name string
    Email string
    Address string
    Bio string
}

// Now sorting only touches 24 bytes per user
// 4x fewer cache misses!

NUMA（非统一内容访问）主动分配#

在多核系统中，内存控制器和内存成为多个 CPU core 争夺的核心资源之一，这成了系统性能的瓶颈。为了解决这个问题，将 CPU core 分组，每一组 core 都有自己的内存控制器和内容，这就是 非统一内容访问 (NUMA)，每一组都是一个 NUMA node。当程序从 NUMA 节点的本地内存中分配会更快，因为不用跨 NUMA 节点去访问内存控制器。所以，我们可以主动分配 gorounte 到指定的 core 内，已达到就近使用内存资源的目的。

// Check NUMA topology
// 有两个 numa 节点
// $ numactl --hardware
// node 0 cpus: 0-23
// node 1 cpus: 24-47

// Pin goroutine to specific CPU
// 一个 Go goroutine 绑定到（或 “钉住” 到）一个特定的 CPU 内核上
func PinToCPU(cpuID int) {
    runtime.LockOSThread()

    var cpuSet unix.CPUSet
    cpuSet.Zero()
    cpuSet.Set(cpuID)

    tid := unix.Gettid()
    unix.SchedSetaffinity(tid, &cpuSet)
}

// NUMA-aware worker pool
type NUMAPool struct {
    workers [][]chan Task // workers[numa_node][worker_id]
}

func (p *NUMAPool) Submit(task Task) {
    // Hash task to NUMA node for data locality
    node := hash(task.Key) % len(p.workers)
    worker := rand.Intn(len(p.workers[node]))
    p.workers[node][worker] <- task
}

func (p *NUMAPool) StartWorker(numaNode, workerID int) {
    // Pin to CPU in specific NUMA node
    // numa node 1 中的 CPU 是 24-27
    cpuID := numaNode*24 + workerID
    // 在指定的 numa node 内的 CPU 上面工作，避免跨 numaNode
    PinToCPU(cpuID)

    for task := range p.workers[numaNode][workerID] {
        processTask(task)
    }
}

更友好的分支预测#

下面的例子中，当对 data 排序之后再进行遍历，其中的 if 分支判断会有更效率，不会一会儿判断 false，一会儿判断 true，这样会严重影响分支预测，打乱预测节奏，从而拖慢了整个循环的执行速度。总之一边倒的分支判断更友好，效率更高。

// BROKEN: Unpredictable branches
func CountCondition(data []int) int {
    count := 0
    for _, v := range data {
        if v > 128 { // Random data = 50% misprediction
            count++
        }
    }
    return count
}
// With random data: 8.2ns per element

// FIXED: Sort first to make branches predictable
func CountConditionSorted(data []int) int {
    // 进行排序
    sort.Ints(data) // Now branches are predictable!

    count := 0
    for _, v := range data {
        if v > 128 { // First all false, then all true
            count++
        }
    }
    return count
}
// Even with sort overhead: 3.1ns per element

// BETTER: Branchless version 无分支预测
func CountConditionBranchless(data []int) int {
    count := 0
    for _, v := range data {
        // No branch, just arithmetic
        count += int((uint(v) >> 7) & 1)
    }
    return count
}
// Always fast: 2.3ns per element

利用 SIMD 特性#

当存在大量数据执行相同操作的时候，可以考虑面向 SIMD(Single input, Multiple Data) 优化。

// Enable SIMD processing with proper alignment
type Vec3 struct {
    X, Y, Z float32
    _ float32 // Padding for 16-byte alignment
}

// Process 4 vectors at once with SIMD
func AddVectors(a, b []Vec3, result []Vec3) {
    // Compiler can vectorize this loop
    for i := 0; i < len(a); i++ {
        result[i].X = a[i].X + b[i].X
        result[i].Y = a[i].Y + b[i].Y
        result[i].Z = a[i].Z + b[i].Z
    }
}

这里的关键是对 Vec3 结构进行优化，填充一个空的 float32，使整个 Vec3 成为 16 字节对齐的数据结构。这样在后面的 for 循环遍历的时候，只需要单个 SIMD 指令，就能批量获取 x,y,z 这些数据，然后同时执行加法操作。

什么时候可以考虑 SIMD？#

存在计算密集型 (CPU 能力) 的热点代码（比如循环）
对大量数据执行相同的操作
数据类型相同且连续
编译器可以进行自动向量化（编译器可以识别某些循环模式，并自动将其转换为 SIMD 指令）
其他优化效果有限的时候，比如已经尝试了改进算法、优化数据结构、减少内存分配和垃圾回收压力、利用并发能力、分支预测优化

什么时候不适合考虑 SIMD？#

I/O 密集型或内存密集型
数据量小
逻辑复杂，分支多，SIMD 擅长对数据进行向量加法、减法、乘法、点积、叉积等朴素的计算，不擅长额外的复杂的逻辑判断。

总结#

数据结构对程序性能的影响非常大，当发现程序性能存在瓶颈的时候，需要对关键数据结构进行分析，然后重点考虑缓存友好的设计。理解 CPU 如何与内存交互，并设计适当的数据结构和算法，以最大化 CPU 缓存的利用率

正文#

前置知识#

内存伪共享 (false sharing) 问题#

结构体数组 VS. 数组结构体#

预取 (Prefetching)#

冷热数据分离#

NUMA（非统一内容访问） 主动分配#

更友好的分支预测#

利用 SIMD 特性#

什么时候可以考虑 SIMD？#

什么时候不适合考虑 SIMD？#

总结#

NUMA（非统一内容访问）主动分配#