Go Structs Are Not C++ Classes: Why Similar Modeling Roles Produce Different Hardware Execution Patterns

Go structs and C++ classes occupy similar modeling roles but commonly lead to different hardware execution patterns. From memory layout to CPU cache behavior, language defaults shape what the processor actually executes.

This is a follow-up to "How Multicore CPUs Changed Object-Oriented Programming", which generated significant discussion about whether modern languages truly differ from classical OOP.

“Go structs are basically C++ classes” is usually shorthand for “Go structs play the same modeling role as my classes.”

This post shows why that analogy breaks at the CPU level - especially once indirection and dynamic dispatch enter the picture.

If you only take one thing from this article:

“Structs with methods” is not the differentiator.

The differentiator is execution topology: the pattern of memory loads, branches, and indirection that your code compiles into. Similar abstractions at the modeling level can produce different execution topologies under common idioms—and CPUs care about topology, not abstractions.

Language defaults shape which topologies become the path of least resistance.


Quick Clarifications from the Multicore Thread

Before the hardware details, these are the background assumptions behind the “structs are classes” claim:

“Java/C++ Are Still Used Successfully on Multicore”

The critique: “Enterprise runs on Java. Games run on C++. Multicore didn’t kill anything.”

Correct. Java and C++ adapt through:

  1. Thread pools (limit concurrency, reduce overhead)
  2. Immutable objects (java.lang.String, java.time.*)
  3. Concurrent collections (java.util.concurrent.*)
  4. Modern C++ value types (std::optional, std::variant)
  5. Smart pointers (std::unique_ptr reduces sharing)

These are workarounds retrofitted onto reference-based languages. Go and Rust bake these patterns in as language defaults.

The difference: Java requires discipline. Go makes cache-friendly patterns the default.

“You Can Model Any Semantics in C++”

The critique: “C++ lets you write value types with perfect forwarding, move semantics, RAII. You can model Go’s semantics in C++.”

True, but irrelevant. Yes, expert C++ programmers write:

// C++: value-oriented design
struct Point { int x, y; };  // Value type

std::vector<Point> points;  // Not pointers!
points.push_back({1, 2});   // Value stored inline

for (const auto& p : points) {  // Reference for perf
    sum += p.x + p.y;
}

But this requires:

  • Understanding copy/move semantics
  • Knowing when to use const&
  • Avoiding inheritance (commonly requires pointers for heterogeneous collections)
  • Fighting entrenched codebase habits (std::shared_ptr everywhere)

Go makes this the default. You don’t need expertise to write cache-friendly code.

The question is not what is possible in a language, but what is idiomatic under deadline pressure. Defaults shape systems.

“OOP Is Just Message Passing”

The critique: “Alan Kay said Erlang is true OOP. You’re attacking a strawman definition.”

Fair point. Kay’s vision (isolated objects communicating via messages) describes:

  • Erlang (actor model)
  • Go channels (CSP model)
  • Rust message passing (channels)

Not Java’s shared heap objects.

This article uses “classical OOP” to mean the 1980s-2010s mainstream implementation: Java, C++, Python, Ruby, C#. These languages deviated from Kay’s message-passing vision toward shared mutable heap objects.

So yes, if we define OOP as Kay intended, then Erlang/Go/Rust are OOP. The article’s thesis becomes: “Multicore forced mainstream OOP to return to Kay’s original vision.”


Foundational Terms

Before examining hardware differences, define the key concepts. Note: Latency numbers cited below are order-of-magnitude mental models; actual costs depend on cache level, miss rate, CPU architecture, and system load.

Cache locality: How close data is in memory. CPUs read memory in 64-byte cache lines. Sequential data (addresses 0x1000, 0x1008, 0x1010) fits in one cache line (fast - L1 cache ~0.5ns). Scattered data (addresses 0x1000, 0x5000, 0x9000) requires multiple cache lines. Modern CPUs have multiple cache levels: L1 (~0.5ns), L2 (~3-5ns), L3 (~10-30ns), DRAM (~100ns). Modern systems with NUMA can see cross-socket memory access exceed 150ns. Miss rates and latency depend on access pattern, working set size, and CPU architecture. Value semantics produce sequential layouts. Reference semantics produce scattered layouts.

Static dispatch: Function call where the target address is known at compile time. The CPU knows exactly which function to call before runtime. Enables inlining (compiler replaces call with function body). Cost: ~1ns, often zero after inlining.

Dynamic dispatch: Function call where the target address is determined at runtime through indirection (vtable lookup, interface dispatch). The CPU must load the function pointer from memory before calling. Prevents inlining. Cost: typically a few nanoseconds per call, depending on branch predictability and cache behavior.

Vtable (virtual method table): Compiler-generated table of function pointers used for dynamic dispatch. Each polymorphic object has a hidden vtable pointer (8 bytes overhead). Calling a virtual method: load object pointer → load vtable pointer from object → load function pointer from vtable → indirect call. Three memory accesses before the actual function executes.

Pointer chasing: Following pointers through memory to access data. Each pointer dereference is a memory access. If the target isn’t in cache (common for heap-allocated objects), the access costs ~100ns. Sequential array access: one pointer (array base), then offsets (arithmetic). Scattered object access: load pointer, dereference (cache miss), load next pointer, dereference (cache miss).

Stack allocation: Local variables stored on the call stack. Allocation: move stack pointer (1 CPU cycle, <1ns). Deallocation: automatic when the function returns (effectively free). Memory is contiguous and reused across function calls. Lifetime: deterministic (scope-bound).

Heap allocation: Memory requested from allocator (malloc, new, runtime allocator). Allocation: search free lists, update metadata (typically tens to hundreds of cycles). Deallocation: explicit (free, delete) or garbage collection. Memory is scattered across heap. Lifetime: flexible (can outlive function scope). Note: Exact allocator costs vary by implementation and fast-path optimizations; these are representative ranges, not guarantees.

Contiguous memory: Data stored sequentially in memory. Arrays, slices, value structs. Enables CPU prefetching (hardware loads data before requested). Often achieves high cache hit rates (directionally 70%+ in tight sequential loops, varies by working set).

Scattered memory: Data stored at non-sequential addresses. Pointer arrays, heap-allocated objects, linked lists. Defeats CPU prefetching (unpredictable access pattern). Often causes elevated cache miss rates in large, scattered working sets.
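
To make these terms concrete, here is a minimal Go sketch (illustrative, not a rigorous benchmark; the names are invented for this example) that builds the same data in both topologies - a contiguous value slice and a scattered pointer slice:

// layout.go - the same data, two execution topologies
package main

import "fmt"

type Point struct{ X, Y int }

// Contiguous: one backing array, sequential loads, prefetcher-friendly.
func sumValues(ps []Point) int {
	total := 0
	for i := range ps {
		total += ps[i].X + ps[i].Y
	}
	return total
}

// Scattered: one dereference per element; each pointer may land anywhere on the heap.
func sumPointers(ps []*Point) int {
	total := 0
	for _, p := range ps {
		total += p.X + p.Y
	}
	return total
}

func main() {
	values := make([]Point, 1000) // one contiguous 16 KB block
	pointers := make([]*Point, 1000)
	for i := range pointers {
		values[i] = Point{i, i}
		pointers[i] = &Point{i, i} // 1000 separate heap allocations
	}
	fmt.Println(sumValues(values), sumPointers(pointers))
}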


The Hardware Reality

Philosophical debates aside, here’s what actually executes on the CPU.

The important difference isn’t what either language can do. It’s what their mainstream patterns make easy. Go makes “concrete value types, contiguous memory, static calls” the path of least resistance. C++ makes that possible too - but in inheritance-heavy designs, the path of least resistance often becomes “pointers, scattered objects, virtual dispatch.” That’s what shows up at the hardware level.

About the benchmarks in this article:

These microbenchmarks isolate primitives (pointer chasing, cache misses, indirect branches) that can dominate hot paths in real systems. The exact ratios don’t carry over universally—CPUs overlap latencies, prefetchers help sometimes, and bottlenecks shift with workload. But the mechanisms and direction do: when your hot loop becomes indirection-heavy (pointer chasing + indirect calls), the CPU pays these categories of penalties. The benchmarks show what’s possible when these costs concentrate in tight loops.


1. Memory Layout: Contiguous vs Scattered

C++: Polymorphic Class Hierarchies Push Toward Pointers

// C++: Heterogeneous polymorphism commonly requires pointer storage
#include <vector>

class Shape {
public:
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

class Circle : public Shape {
    int radius;
public:
    explicit Circle(int r) : radius(r) {}
    double area() const override { return 3.14159 * radius * radius; }
};

class Rectangle : public Shape {
    int width, height;
public:
    Rectangle(int w, int h) : width(w), height(h) {}
    double area() const override { return width * height; }
};

// Pattern required for heterogeneous polymorphism:
std::vector<Shape*> shapes;
for (int i = 0; i < 1000; i++) {
    if (i % 2 == 0) {
        shapes.push_back(new Circle{i});  // Each 'new' calls malloc() internally
    } else {
        shapes.push_back(new Rectangle{i, i});
    }
}

// Cannot use vector<Shape> - causes object slicing
// Cannot store inline - different derived types have different sizes

Memory layout (what the hardware sees):

Stack:
std::vector<Shape*> shapes
├─ data: 0x7fff1000 (pointer to array)
├─ size: 1000
└─ capacity: 1024

Heap - Array of pointers (8 KB, contiguous):
0x7fff1000: [ptr 0] → 0x2a4b1000  (Circle)
0x7fff1008: [ptr 1] → 0x2a4b1020  (Rectangle)
0x7fff1010: [ptr 2] → 0x2a4b1040  (Circle)
...

Heap - Shape objects (scattered, different sizes):
0x2a4b1000: Circle{vtable, radius}      (16 bytes: vtable ptr + int + padding)
0x2a4b1020: Rectangle{vtable, w, h}     (16 bytes: vtable ptr + 2 ints)
0x2a4b1040: Circle{vtable, radius}      (16 bytes)
...

What 'new Circle{i}' does internally:
1. Call malloc(16) to allocate heap memory for Circle
2. Call Circle constructor to initialize vtable pointer and radius
3. Return pointer to allocated memory

Why pointers are required:
- Circle and Rectangle have different sizes (heterogeneous)
- vector<Shape> would slice off derived class data
- Storing different types in one container requires indirection

Result: Objects scattered across heap pages (no locality guarantee)

What happens during iteration:

for (auto* s : shapes) {
    area += s->area();  // Pointer dereference + vtable lookup
}

CPU execution:

  1. Read pointer from array: 0x7fff1000 → get 0x2a4b1000 (Circle)
  2. Dereference pointer: Jump to 0x2a4b1000 → load vtable pointer
  3. Follow vtable: Load Circle::area function pointer → indirect call
  4. Next iteration: Read 0x7fff1008 → get 0x2a4b1020 (Rectangle)
  5. Dereference: Jump to 0x2a4b1020 (likely cache miss!) → load vtable
  6. Follow vtable: Load Rectangle::area function pointer → indirect call

Cache behavior:

  • Pointer array is contiguous (cache-friendly)
  • Dereferencing jumps to random heap locations (cache-unfriendly)
  • Each object access risks cache miss (~100ns penalty)

Go: Contiguous Value Array

Go structs use value semantics by default - collections store actual objects, not pointers.

// Go: Collection of values
type Point struct {
    X, Y int
}

points := make([]Point, 1000)
for i := 0; i < 1000; i++ {
    points[i] = Point{i, i}
}

Memory layout (what the hardware sees):

Stack (or heap, decided by escape analysis):
points (slice header, 24 bytes):
├─ array: 0x7fff1000 (pointer to backing array)
├─ len: 1000
└─ cap: 1000

Backing array (16 KB, single allocation, contiguous):
0x7fff1000: Point{0, 0}   (16 bytes: x=8, y=8)
0x7fff1010: Point{1, 1}   (16 bytes)
0x7fff1020: Point{2, 2}   (16 bytes)
0x7fff1030: Point{3, 3}   (16 bytes)
...
0x7fff3e80: Point{999, 999} (16 bytes)

All data in ONE contiguous block

What happens during iteration:

for i := range points {
    sum += points[i].X + points[i].Y  // One memory access
}

CPU execution:

  1. Read Point at 0x7fff1000 (cache miss)
  2. Read Point at 0x7fff1010 (cache hit - same cache line!)
  3. Read Point at 0x7fff1020 (cache hit)
  4. Read Point at 0x7fff1030 (cache hit)
  5. …pattern repeats: each 64-byte cache line holds 4 of these 16-byte Points

Cache behavior:

  • All data sequential (perfect prefetching)
  • CPU loads 64-byte cache lines
  • High cache hit rate on sequential access (prefetcher loads ahead)
  • No pointer dereferencing overhead

The Hardware Impact

The Hardware Cost of Pointer Chasing

Using Jeff Dean’s “Latency Numbers Every Programmer Should Know” as order-of-magnitude reference points:

  • L1 cache reference: ~0.5 ns
  • Main memory reference: ~100 ns (200× slower)

Measured benchmark results (source code):

These numbers are from one machine with a specific working set. Focus on the ratio and direction:

C++ (1M elements, 100 iterations):

Pointer array (scattered heap):
  Measured: ~2 ns per element
  (This includes many cache hits; working set partially fits in cache)

Value array (contiguous memory):
  Measured: ~0.29 ns per element
  (High cache hit rate from sequential access + prefetching)

Measured ratio: 7× slower for pointer chasing

What the numbers mean:

  • The 2 ns average includes cache hits (not pure DRAM latency)
  • As the working set grows beyond cache, miss rates rise and the per-element cost climbs toward the ~100ns DRAM latency
  • Sequential access enables hardware prefetching (CPU loads ahead)
  • Scattered access defeats prefetching (unpredictable pattern)

The key is the ratio (7×) and mechanism (cache locality), not absolute ns values.

The difference isn’t the language. It’s what the CPU executes:

  • Inheritance-heavy C++: Chase pointers through scattered memory
  • Go concrete types: Read contiguous sequential data
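
To reproduce the direction of these numbers yourself, a hedged Go benchmark sketch follows (the file and benchmark names are assumptions; run with go test -bench=.). One caveat: pointers allocated in a tight loop often land near each other in the Go heap, so the sketch shuffles the pointer slice to expose the scattered-access cost:

// layout_test.go - contiguous values vs pointer chasing
package layout

import (
	"math/rand"
	"testing"
)

type Point struct{ X, Y int }

const n = 1 << 20 // ~1M elements

var sink int // prevents the compiler from eliminating the loops

func BenchmarkValueArray(b *testing.B) {
	points := make([]Point, n)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		total := 0
		for j := range points {
			total += points[j].X + points[j].Y
		}
		sink = total
	}
}

func BenchmarkPointerArray(b *testing.B) {
	points := make([]*Point, n)
	for i := range points {
		points[i] = &Point{i, i}
	}
	// Shuffle so iteration order no longer matches allocation order.
	rand.Shuffle(len(points), func(i, j int) { points[i], points[j] = points[j], points[i] })
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		total := 0
		for j := range points {
			total += points[j].X + points[j].Y
		}
		sink = total
	}
}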

2. Virtual Method Dispatch: Vtable vs Static Calls

C++: Virtual Dispatch in Inheritance Hierarchies

class Shape {
public:
    virtual ~Shape() = default;  // Virtual destructor (needed when deleting via base pointer)
    virtual double area() = 0;   // Virtual method
};

class Circle : public Shape {
    int radius;
public:
    explicit Circle(int r) : radius(r) {}
    double area() override {
        return 3.14159 * radius * radius;
    }
};

// Usage
Shape* shapes[1000];
for (int i = 0; i < 1000; i++) {
    shapes[i] = new Circle{i};
}

for (auto* s : shapes) {
    double a = s->area();  // Virtual call
}

What the CPU executes:

Each object has hidden vtable pointer:

Circle object layout (16 bytes):
├─ [0-8]:  vtable pointer → 0x400000
├─ [8-12]: radius
└─ [12-16]: padding

Vtable (at 0x400000):
├─ [0]: &Circle::area
├─ [8]: &Circle::destructor
└─ [16]: RTTI pointer

Virtual call s->area():
1. Load object pointer:        s = 0x2a4b1000
2. Dereference to get vtable:  vtable = *(s+0) = 0x400000
3. Load function pointer:      func = *(vtable+0) = 0x401234
4. Indirect call:              call *func

Cost: 3 dependent memory accesses + indirect branch
Time: typically a few nanoseconds per call, depending on branch predictability

Branch prediction impact:

// Mixed types - unpredictable branches
Shape* shapes[1000];
for (int i = 0; i < 1000; i++) {
    if (i % 2 == 0) {
        shapes[i] = new Circle{i};
    } else {
        shapes[i] = new Rectangle{i, i};
    }
}

// Each call could go to Circle::area OR Rectangle::area
// CPU branch predictor struggles (misprediction penalty: 15-20 cycles)

Go: Compile-Time Static Dispatch

type Circle struct {
    Radius int
}

func (c Circle) Area() float64 {
    return 3.14159 * float64(c.Radius * c.Radius)
}

circles := make([]Circle, 1000)
for i := range circles {
    circles[i] = Circle{Radius: i}
}

var total float64
for i := range circles {
    total += circles[i].Area()  // Static call
}

What the CPU executes:

Circle object layout (8 bytes):
└─ [0-8]: Radius (no vtable pointer!)

Static call circles[i].Area():
1. Load Circle value:     circle = *(circles + i*8)
2. Direct call:           call Circle.Area
   (address known at compile time: 0x401000)

Cost: 1 memory access + direct branch
Time: ~1-2ns per call

Compiler can inline:
for i := range circles {
    total += 3.14159 * float64(circles[i].Radius * circles[i].Radius)
}

Cost: 1 memory access + arithmetic (no function call!)
Time: ~0.5ns per iteration

Branch prediction impact:

// All calls go to same function - perfect prediction
// CPU branch predictor: 100% hit rate
// No indirect branches, no misprediction penalties

When Go Uses Virtual Dispatch (Interfaces)

type Shape interface {
    Area() float64
}

func (c Circle) Area() float64 {
    return 3.14159 * float64(c.Radius * c.Radius)
}

// Interface value (similar to C++ virtual dispatch)
var s Shape = Circle{Radius: 5}
a := s.Area()  // Dynamic dispatch through interface

Interface value layout (16 bytes):

Interface value (two words):
├─ [0-8]:  itab pointer → type + method table
└─ [8-16]: data pointer → actual Circle value

Dynamic call s.Area():
1. Load itab pointer:       itab = *(s+0)
2. Load method pointer:     func = *(itab+24)  // offset to Area
3. Load data pointer:       data = *(s+8)
4. Indirect call:           call func(data)

Cost: 3-4 memory accesses + indirect branch
Similar to C++ virtual dispatch

In Go, static dispatch is the default and dynamic dispatch is opt-in via interfaces. In C++, once you design around inheritance/virtuals, the dynamic-dispatch + indirection costs become pervasive in that slice of the codebase.

The Hardware Impact

Virtual Dispatch Overhead

From Jeff Dean’s latency numbers as order-of-magnitude mental models:

  • Branch mispredict: ~5 ns
  • Indirect call overhead: typically a few ns, varies by predictability

Measured benchmark results (source code; vtable overhead isolated using contiguous arrays):

These numbers are from one machine and workload. Absolute values vary by CPU, compiler version, and access patterns. Focus on ratios and direction:

C++ (10M elements, 10 iterations = 100M calls):

Virtual dispatch (contiguous array, vtable lookup):
  Measured: ~20 ns per call (includes loop overhead + vtable indirection)

Static dispatch (contiguous array, direct calls):
  Measured: ~7 ns per call (includes loop overhead, likely partially inlined)

Measured ratio: ~2.8× slower for virtual dispatch

Go (10M elements, 10 iterations = 100M calls):

Interface dispatch (dynamic):
  Measured: a few ns per call on this system
  (Exact values vary widely and may be optimized away by the compiler)

Concrete type (static dispatch):
  Measured: sub-ns per call (likely fully inlined)

Measured ratio: several× slower for interface dispatch
(Treat only the ratio and mechanism as meaningful, not absolute ns)

Key findings:

  • Both languages show the same pattern: indirect calls are several times slower than direct/inlined calls
  • The ratios (2-5×) are more stable than absolute ns values
  • Go’s measured numbers suggest aggressive inlining; beware of devirtualization in microbenchmarks
  • In real code, vtable overhead combines with pointer chasing (see Benchmark 1)

In Go, static dispatch is the default path. In inheritance-heavy C++, dynamic dispatch often becomes pervasive across hot loops.
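
The same comparison can be reproduced in Go with a benchmark sketch like the following (assumed names; run with go test -bench=.). Note the devirtualization caveat from above: with only one type implementing the interface, the compiler may turn the dynamic call back into a static one, so real measurements should mix implementations:

// dispatch_test.go - static vs interface dispatch
package dispatch

import "testing"

type Circle struct{ Radius int }

func (c Circle) Area() float64 { return 3.14159 * float64(c.Radius*c.Radius) }

type Shape interface{ Area() float64 }

var sink float64 // keeps the calls from being optimized away

func BenchmarkConcrete(b *testing.B) {
	circles := make([]Circle, 1024)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var total float64
		for j := range circles {
			total += circles[j].Area() // static dispatch, usually inlined
		}
		sink = total
	}
}

func BenchmarkInterface(b *testing.B) {
	shapes := make([]Shape, 1024)
	for i := range shapes {
		shapes[i] = Circle{Radius: i}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var total float64
		for j := range shapes {
			total += shapes[j].Area() // itab lookup + indirect call
		}
		sink = total
	}
}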


3. Object Allocation: Stack vs Heap

C++: Polymorphism Often Implies Pointer Lifetimes

// C++ pattern when using polymorphic base pointers:
Shape* s1 = new Circle{5};
auto s2 = std::make_unique<Rectangle>(3, 4);

// Concrete types can use stack allocation:
Point p1{1, 2};  // Stack, no indirection
std::vector<Point> points;  // Values, not pointers

What happens with new Point{1, 2}:

1. Call malloc(8)
   - Search free list for 8-byte chunk
   - Update heap metadata
   - Return pointer: 0x2a4b1000
   Cost: tens of cycles on a fast path, hundreds when free lists must be searched

2. Call Point constructor
   - Initialize x = 1, y = 2
   Cost: ~5 cycles

3. Later: delete p1
   - Call destructor
   - Call free(0x2a4b1000)
   - Update free list
   Cost: ~30-50 cycles

Total cost: ~100-150 cycles per allocation

Garbage collection isn’t the issue here. C++ uses manual memory management (new/delete), not GC. The cost is heap allocation itself - malloc/free overhead.

Why inheritance-heavy C++ commonly uses heap allocation:

  • Runtime polymorphism via base pointers requires indirection
  • Object lifetime beyond scope (return from function)
  • Containers holding heterogeneous derived types store pointers

Go: Stack Allocation Default (Escape Analysis)

1
2
3
4
5
// Go: Looks like heap allocation, but compiler decides
p1 := &Point{1, 2}  // Might be stack!
p2 := Point{3, 4}   // Definitely stack

// Compiler escape analysis determines stack vs heap

What the compiler does:

func createPoint() *Point {
    p := Point{1, 2}  // Does p escape?
    return &p         // Yes! Returns pointer
}
// Compiler: Allocate p on heap

func usePoint() {
    p := Point{1, 2}  // Does p escape?
    process(p)        // No! Stays local
    // Compiler: Allocate p on stack
}
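
You don’t have to guess: the compiler will print its escape-analysis decisions. A sketch of the workflow (the exact message wording varies by Go version):

// Ask the compiler to report escape analysis:
//
//   go build -gcflags='-m' .
//
// Typical output for the two functions above:
//
//   ./main.go:2:2: moved to heap: p        <- createPoint: p escapes via the returned pointer
//   ./main.go:8:2: p does not escape       <- usePoint: p stays on the stack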

Stack allocation (escape analysis says “no escape”):

1. Move stack pointer
   - Current SP: 0x7fffe000
   - Allocate 16 bytes: SP -= 16
   - New SP: 0x7fffeff0
   Cost: ~1 CPU cycle

2. Initialize Point
   - Write x = 1, y = 2 to stack
   Cost: ~2 cycles

3. Function return
   - Stack frame discarded (SP += 16)
   Cost: ~1 cycle

Total cost: ~4 cycles per allocation

Heap allocation (escape analysis says “escapes”):

1. Call runtime.newobject(16)
   - Small object allocation from per-P cache (mcache)
   - Fast path: ~10-20 cycles
   - Slow path (cache miss): ~50-100 cycles
   Cost: ~10-100 cycles (avg ~20)

2. Initialize Point
   Cost: ~2 cycles

3. Garbage collection
   - Mark phase: Scans object (amortized cost)
   - Sweep phase: Reclaims memory (amortized cost)
   Cost: ~5-10 cycles per object (amortized)

Total cost: ~30-50 cycles per allocation
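
These escape decisions are observable from test code as well. A minimal sketch using testing.AllocsPerRun, assuming the createPoint and usePoint functions from the example above:

// escape_test.go - count heap allocations per call
package escape

import "testing"

func TestEscape(t *testing.T) {
	heap := testing.AllocsPerRun(1000, func() {
		_ = createPoint() // returns &p: expect ~1 heap allocation per call
	})
	stack := testing.AllocsPerRun(1000, func() {
		usePoint() // p never escapes: expect 0 heap allocations
	})
	t.Logf("createPoint: %.0f allocs/call, usePoint: %.0f allocs/call", heap, stack)
}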

The Hardware Impact

Allocation Cost Comparison

From Jeff Dean’s latency numbers:

  • Mutex lock/unlock: 25 ns

Heap allocation (malloc/new) typically involves:

  • Free list traversal or allocator lock
  • Metadata updates
  • Typical cost: tens to hundreds of nanoseconds on the fast path, far more under contention or fragmentation

Stack allocation:

  • Adjust stack pointer (SUB instruction)
  • Typical cost: <1 ns (single CPU cycle)

Measured allocation overhead (source code):

C++ (1M allocations, 48-byte objects):

Heap allocation (new + store pointer):
  Total time: 34.9 ms
  Time per allocation: 34 ns

Stack-based storage (vector of values):
  Total time: 6.5 ms
  Time per allocation: 6 ns

Measured speedup: 5.3× faster for stack-based storage

Go (1M allocations, 96-byte objects):

Heap allocation (pointer slice):
  Total time: 97.3 ms
  Time per allocation: 97 ns

Value slice (contiguous storage):
  Total time: 56.4 ms
  Time per allocation: 56 ns

Measured speedup: 1.7× faster for value storage

C++ shows a larger difference because Go’s allocator uses per-P caches (fast small allocations). But both show heap allocation costing more than contiguous value storage.

Important: This benchmark measures allocation + storage topology. Go’s per-P caches make individual allocations fast, but the total cost includes GC scanning of pointer graphs. C++ has malloc overhead but no GC. The trade-offs differ:

  • Go: Fast allocation, pays for GC scanning later
  • C++: Slower malloc, pays for explicit free/delete

Real-world impact: From Discord’s engineering blog, heap allocation pressure caused 2-minute GC pauses in production with millions of long-lived objects. Moving to value semantics (Rust) eliminated both allocation overhead and GC scanning.

Real-World Example

// HTTP handler (typical web service)
func handleRequest(w http.ResponseWriter, r *http.Request) {
    // All these likely stay on stack:
    user := User{ID: 123, Name: "Alice"}
    config := Config{Timeout: 30}
    result := processRequest(user, config)
    writeResponse(w, result)
}

// Goroutine stack: 2KB-8KB
// 10,000 concurrent requests: 20-80 MB total
// Zero heap allocations for short-lived values

Compare to a Java-style C++ pattern where every object is heap-allocated:

// C++ written Java-style: every object heap-allocated
User* user = new User{123, "Alice"};
Config* config = new Config{30};
Result* result = processRequest(user, config);
writeResponse(w, result);
delete result;
delete config;
delete user;

// 10,000 concurrent requests: 30,000 heap allocations
// Plus malloc/free overhead

4. Inheritance: Pointer Indirection Requirement

C++: Polymorphism Commonly Pushes Toward Pointers

class Shape {
public:
    virtual ~Shape() = default;
    virtual double area() = 0;
};

class Circle : public Shape {
    int radius;
public:
    explicit Circle(int r) : radius(r) {}
    double area() override {
        return 3.14159 * radius * radius;
    }
};

class Rectangle : public Shape {
    int width, height;
public:
    Rectangle(int w, int h) : width(w), height(h) {}
    double area() override {
        return width * height;
    }
};

// CANNOT store polymorphic objects contiguously:
// Shape shapes[1000];  // ERROR: Can't instantiate abstract class
// Shape shapes[1000] = {Circle{5}, Rectangle{10, 20}};  // ERROR: Object slicing

// MUST use pointers:
Shape* shapes[1000];
for (int i = 0; i < 1000; i++) {
    if (i % 2 == 0) {
        shapes[i] = new Circle{i};
    } else {
        shapes[i] = new Rectangle{i, i};
    }
}

Memory layout (no choice):

Array of pointers (contiguous):
shapes[0] → 0x2a4b1000 (Circle, 16 bytes)
shapes[1] → 0x2a4b1050 (Rectangle, 16 bytes)
shapes[2] → 0x2a4b10a0 (Circle, 16 bytes)
...

Objects scattered on heap (sizes can differ per derived type!)
Cannot be stored contiguously without indirection or manual layout machinery - different sizes per type

Why heterogeneous storage requires pointers:

  • Circle is 16 bytes (vtable ptr + radius + padding)
  • Rectangle is 16 bytes here (vtable ptr + width + height); other derived types can be larger
  • A fixed-size array slot cannot hold derived objects of varying size
  • Indirection is required to store polymorphic collection

Go: Separate Arrays (Opt-In Polymorphism)

type Circle struct {
    Radius int
}

type Rectangle struct {
    Width, Height int
}

// When you DON'T need polymorphism (most code):
circles := make([]Circle, 500)
rectangles := make([]Rectangle, 500)

// Process separately (cache-friendly):
var total float64
for i := range circles {
    total += 3.14159 * float64(circles[i].Radius * circles[i].Radius)
}

for i := range rectangles {
    total += float64(rectangles[i].Width * rectangles[i].Height)
}

Memory layout (programmer’s choice):

circles array (contiguous, 4 KB):
├─ Circle{0}  (8 bytes)
├─ Circle{1}  (8 bytes)
├─ Circle{2}  (8 bytes)
...

rectangles array (contiguous, 8 KB):
├─ Rectangle{0, 0}  (16 bytes)
├─ Rectangle{1, 1}  (16 bytes)
├─ Rectangle{2, 2}  (16 bytes)
...

Both arrays fully contiguous
CPU prefetches perfectly
No pointer chasing

When you DO need polymorphism:

type Shape interface {
    Area() float64
}

func (c Circle) Area() float64 {
    return 3.14159 * float64(c.Radius * c.Radius)
}

func (r Rectangle) Area() float64 {
    return float64(r.Width * r.Height)
}

// Now you pay the cost (like C++):
shapes := []Shape{
    Circle{5},
    Rectangle{10, 20},
}

// Interface values (16 bytes each):
// [itab ptr + data ptr] [itab ptr + data ptr] ...

The Hardware Impact

Performance Impact (Directional)

Based on cache latency costs (L1: 0.5ns, memory: 100ns as mental models):

C++ (inheritance-based polymorphism):
- Memory layout: Array of pointers → scattered objects
- Each access: Pointer read + dereference (cache miss likely) + vtable lookup
- Combined overhead: In worst cases (unpredictable access + cache misses), 
  can be orders of magnitude slower than sequential access
- In practice: Often multi-× slowdowns when hot loops become indirection-heavy

Go (concrete types, no polymorphism):
- Memory layout: Contiguous arrays
- Each access: Direct read (cache hits via prefetching)
- Static dispatch: Direct calls (often inlined)
- Overhead: Minimal compared to scattered + indirect patterns

Go (explicit polymorphism via interface):
- Memory layout: Interface values (similar to pointers)
- Each access: Similar cache/dispatch behavior to C++
- Combined overhead: Pays similar costs to C++ virtual dispatch when used

Go’s advantage: You choose when to pay the cost. C++ inheritance hierarchies make you pay everywhere.

Real-world example: Discord’s migration to Rust

From Discord’s engineering blog:

“We were reaching the limits of Go’s garbage collector… We had 2-minute latency spikes as the garbage collector was forced to scan the entire heap.”

After migrating their Read States service from Go to Rust (value semantics, no GC):

  • Before (Go): 2-minute latency spikes during GC
  • After (Rust): Microsecond average response times
  • Cache capacity: increased to 8 million entries in a single LRU cache

This confirms the memory layout thesis: scattered heap objects create GC pressure and cache misses. Rust’s value semantics (similar to Go’s, but without GC) eliminated both problems.

Real-world impact: Game engines (ECS)

Why modern game engines moved away from deep inheritance hierarchies:

// Old approach (inheritance-based): Pointer array + virtual calls
GameObject* entities[N];
for (auto* e : entities) {
    e->update();  // Pointer chase + virtual call per entity
}

// Modern approach (data-oriented): Separate component arrays
Position positions[N];   // Contiguous
Velocity velocities[N];  // Contiguous
Health healths[N];       // Contiguous

for (int i = 0; i < N; i++) {
    positions[i].x += velocities[i].x;
    positions[i].y += velocities[i].y;
}

The pattern: moving from pointer graphs with indirect calls to contiguous arrays with direct loops commonly produces multi-× improvements, and in cache-bound workloads can reach order-of-magnitude gains. The speedup isn’t from Go vs C++—it’s from contiguous data vs pointer indirection, regardless of language.


5. Method Receivers: Explicit Mutation Visibility

C++: Implicit this Pointer

class Counter {
    int count;
public:
    void increment() {
        this->count++;  // 'this' is hidden pointer
        // Signature doesn't show mutation
    }
    
    int get() {
        return this->count;
    }
};

// Usage - can't tell if methods mutate:
Counter c;
c.increment();  // Mutates? Maybe?
c.get();        // Mutates? Maybe?

What the CPU executes:

Counter object layout:
├─ [0-4]: count

Call c.increment():
1. Load 'this' pointer:     rdi = &c (calling convention)
2. Load count:              eax = *(rdi+0)
3. Increment:               eax++
4. Store count:             *(rdi+0) = eax

'this' is passed as a pointer, introducing implicit indirection at the call boundary

Concurrency issue:

Counter c;

std::thread t1([&]() { c.increment(); });
std::thread t2([&]() { c.increment(); });
t1.join();
t2.join();

// RACE CONDITION
// No indication in method signature that mutation happens
// No compiler help

Go: Explicit Value vs Pointer Receivers

type Counter struct {
    Count int
}

// Value receiver: Receives COPY (can't mutate original)
func (c Counter) Get() int {
    return c.Count
}

// Pointer receiver: Receives POINTER (can mutate original)
func (c *Counter) Increment() {
    c.Count++
}

// Usage - mutation is VISIBLE in signature:
c := Counter{Count: 0}
c.Get()        // Value receiver - can't mutate
c.Increment()  // Pointer receiver - might mutate

What the CPU executes:

Value receiver (c Counter):

Call c.Get():
1. Copy Counter:            [stack] = c (8 bytes: one int field)
2. Load count:              eax = [stack+0]
3. Return:                  return eax

No pointers, no indirection
Function receives independent copy
Original unchanged (guaranteed)

Pointer receiver (c *Counter):

Call c.Increment():
1. Load pointer:            rdi = &c (calling convention)
2. Load count:              eax = *(rdi+0)
3. Increment:               eax++
4. Store count:             *(rdi+0) = eax

Pointer indirection (like C++ 'this')
Original mutated

Concurrency benefit:

c := Counter{Count: 0}

// Value receiver - SAFE (each goroutine gets copy)
go func() {
    _ = c.Get()  // Independent copy, no race
}()

// Pointer receiver - UNSAFE (shared state)
go func() {
    c.Increment()  // RACE CONDITION (visible in signature!)
}()

The receiver type shows the programmer whether mutation/sharing happens.
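
Go also makes the unsafe case detectable at runtime. A minimal sketch (assumed file name; the exact report text varies) that the race detector flags when run with go run -race:

// race.go - pointer receivers sharing state across goroutines
package main

import "sync"

type Counter struct{ Count int }

func (c *Counter) Increment() { c.Count++ }

func main() {
	c := Counter{}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Increment() // -race reports a data race on c.Count
		}()
	}
	wg.Wait()
}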

Why This Matters

The receiver type makes sharing and mutation intent visible at the call boundary:

type Counter struct {
    Count int
}

// Value receiver: Can't mutate original (receives copy)
func (c Counter) Get() int {
    return c.Count
}

// Pointer receiver: Can mutate original (receives pointer)
func (c *Counter) Increment() {
    c.Count++
}

The impact is semantic and ergonomic, not a direct performance guarantee:

  • Concurrency reasoning: Seeing counter.Get() vs counter.Increment() immediately tells you which ones might mutate shared state
  • Escape behavior: Value receivers can sometimes enable stack allocation, but this depends on escape analysis heuristics, not receiver type alone
  • Aliasing opportunities: Value receivers give the compiler more freedom to reason about aliasing, but whether this translates to optimization depends on many factors

The key win is intent clarity at call boundaries, which shapes what patterns become idiomatic in concurrent code. Performance effects are indirect (escape analysis, aliasing analysis), not a guaranteed inlining switch.


6. Construction: Special Syntax vs Functions

C++: Constructor Special Semantics

class Point {
    int x, y;
public:
    // Constructor (special rules):
    Point(int x, int y) : x(x), y(y) {
        // Member initialization list required for const/reference members
        // Virtual functions can't be called here safely
        // Exception during construction leaves object half-initialized
    }
    
    // Copy constructor (implicitly generated or explicit)
    Point(const Point& other) : x(other.x), y(other.y) {}
    
    // Move constructor (C++11)
    Point(Point&& other) : x(other.x), y(other.y) {
        other.x = 0;
        other.y = 0;
    }
    
    // Destructor (called automatically)
    ~Point() {
        // Cleanup logic
    }
};

// Usage
Point p1(1, 2);                    // Constructor
Point p2 = p1;                     // Copy constructor
Point p3 = std::move(p1);          // Move constructor
// Destructors called automatically at end of scope

What the CPU executes:

Point p1(1, 2):
1. Allocate 8 bytes (stack or heap)
2. Call Point::Point(int, int)
   - Initialize x, y
3. Mark object as constructed

Point p2 = p1:
1. Allocate 8 bytes
2. Call Point::Point(const Point&)
   - Copy x, y
3. Mark object as constructed

End of scope:
1. Call p3.~Point()
2. Call p2.~Point()
3. Call p1.~Point()
4. Deallocate memory

Complex semantics:

  • Initialization order matters (member init list)
  • Copy/move constructors have implicit generation rules
  • Exception safety during construction is subtle
  • Virtual function dispatch doesn’t work in constructors
  • Destructors must be virtual for polymorphic classes

Go: Regular Functions (No Special Semantics)

type Point struct {
    X, Y int
}

// Not a constructor - just a function
func NewPoint(x, y int) Point {
    return Point{X: x, Y: y}
}

// Usage
p1 := Point{1, 2}            // Struct literal
p2 := NewPoint(3, 4)         // Regular function call
p3 := p1                     // Copy (no special constructor)

// No destructors - garbage collector handles cleanup

What the CPU executes:

p1 := Point{1, 2}:
1. Allocate 16 bytes (stack or heap, escape analysis)
2. Write x = 1
3. Write y = 2
That's it. No special semantics.

p2 := NewPoint(3, 4):
1. Call NewPoint (regular function)
2. Return value copies to p2
3. No constructor semantics

p3 := p1:
1. Load p1 (16 bytes)
2. Store to p3 (16 bytes)
3. Memcpy (no special copy constructor)

Simpler semantics:

  • No initialization order complexity (just assignments)
  • No copy/move distinction (always copies bytes)
  • No destructor timing issues (GC handles cleanup)
  • No virtual function restrictions
  • No constructor exception-safety concerns (Go has panics, but no throwing constructors)

The Hardware Impact

Construction overhead comparison:

C++: Create 1 million Points
- Constructor calls: 1 million
- Copy constructor calls: Variable (depends on usage)
- Destructor calls: 1 million
- Time: Depends on constructor complexity

Go: Create 1 million Points
- Struct initialization: 1 million (memcpy)
- No constructor/destructor overhead
- Time: Minimal (just memory writes)

The difference is conceptual complexity, not raw performance. C++ constructors add rules that the programmer must understand. Go treats initialization as simple data copying.


7. Memory Footprint: Hidden Vtable Pointers

C++: Every Polymorphic Object Has Vtable Pointer

class NonVirtual {
    int x, y;
};
// Size: 8 bytes (just data)

class Virtual {
    int x, y;
    virtual void foo() {}
};
// Size: 16 bytes (vtable pointer + data + padding)

// The vtable pointer is hidden but always there

Memory layout:

NonVirtual object (8 bytes):
├─ [0-4]: x
└─ [4-8]: y

Virtual object (16 bytes):
├─ [0-8]:  __vptr (hidden vtable pointer)
├─ [8-12]: x
└─ [12-16]: y (includes padding)

Overhead: 8 bytes per object (the 8-byte payload doubles to 16 bytes - half of each object is metadata!)

Array of 1 million objects:

NonVirtual objects[1000000];
// Memory: 8 MB

Virtual objects[1000000];
// Memory: 16 MB (8 MB is vtable pointers!)

Go: No Hidden Pointers

type Point struct {
    X, Y int
}
// Size: 16 bytes (just data, no hidden pointers)

type PointWithMethod struct {
    X, Y int
}

func (p PointWithMethod) Foo() {}
// Size: Still 16 bytes! Methods don't add memory

Memory layout:

Point object (16 bytes):
├─ [0-8]:  X
└─ [8-16]: Y

No hidden pointers
Methods are not stored in objects
Function pointers resolved at compile time

Array of 1 million objects:

points := make([]Point, 1000000)
// Memory: 16 MB (just data)

pointsWithMethods := make([]PointWithMethod, 1000000)
// Memory: Still 16 MB (methods don't add size)

Interface Values (Explicit Overhead)

type Shape interface {
    Area() float64
}

var s Shape = Circle{Radius: 5}
// Interface value: 16 bytes (itab pointer + data pointer)

Memory layout:

Interface value (16 bytes):
├─ [0-8]:  itab pointer (type + methods)
└─ [8-16]: data pointer (or small value directly)

Overhead: 8-16 bytes per interface value
But this is EXPLICIT - you opt in with interface type

Array comparison:

// Concrete types (no overhead):
circles := make([]Circle, 1000000)
// Memory: 8 MB (8 bytes per Circle)

// Interface types (explicit overhead):
shapes := make([]Shape, 1000000)
// Memory: 16 MB (16 bytes per interface value)
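
These sizes can be verified directly with unsafe.Sizeof. A minimal sketch (64-bit platform assumed):

// sizes.go - concrete struct vs interface value footprint
package main

import (
	"fmt"
	"unsafe"
)

type Circle struct{ Radius int }

func (c Circle) Area() float64 { return 3.14159 * float64(c.Radius*c.Radius) }

type Shape interface{ Area() float64 }

func main() {
	var c Circle
	var s Shape = c
	fmt.Println(unsafe.Sizeof(c)) // 8: just the data - methods add nothing
	fmt.Println(unsafe.Sizeof(s)) // 16: itab pointer + data pointer
}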

The Hardware Impact

Memory overhead:

C++ with virtual methods:
- 1M Point objects: 16 MB (8 MB vtable pointers)
- Cache pollution: Half the cache lines are pointers
- Memory bandwidth: Wasted on metadata

Go concrete types:
- 1M Point objects: 16 MB (pure data)
- Cache efficiency: All cache lines are data
- Memory bandwidth: Fully utilized

Go interface types (explicit):
- 1M Shape interfaces: 16 MB (same as C++)
- But opt-in, not default

Real-world impact:

Game engines processing 100,000 entities:

C++ (virtual methods required):
- Entity size: 64 bytes (vtable + data)
- Total: 6.4 MB
- Effective data: 3.2 MB (50% overhead)

Go/ECS (concrete types):
- Component size: 16-32 bytes (pure data)
- Total: 1.6-3.2 MB
- Effective data: 100% (no overhead)

Cache difference: 2-3× more data fits in cache

The Systemic Difference

These 7 differences aren’t independent. They compound:

Inheritance-Heavy C++ Pattern (Common in OO Designs)

// C++: Idiomatic inheritance-based design
class GameObject {
public:
    virtual void update() = 0;  // Virtual method (vtable pointer)
    virtual ~GameObject() {}    // Virtual destructor
};

class Enemy : public GameObject {
    Vector3 position;
    Vector3 velocity;
    int health;
public:
    void update() override {
        position += velocity;
    }
};

// Must use pointers (polymorphism requirement):
std::vector<GameObject*> entities;
for (int i = 0; i < 100000; i++) {
    entities.push_back(new Enemy{});  // Heap allocation
}

// Processing:
for (auto* e : entities) {
    e->update();  // Pointer chase + virtual dispatch
}

Hardware execution:

  1. Read pointer from vector (cache hit)
  2. Dereference pointer (cache miss - scattered heap)
  3. Load vtable pointer (another memory access)
  4. Load function pointer from vtable (another memory access)
  5. Indirect call (branch misprediction possible)

Result: 4-5 memory accesses per iteration, scattered across RAM

Go Concrete-Type Pattern (Path of Least Resistance)

// Go: Idiomatic data-oriented design
type Position struct { X, Y, Z float64 }
type Velocity struct { X, Y, Z float64 }
type Health struct { HP int }

// Separate arrays (no inheritance, no pointers):
positions := make([]Position, 100000)
velocities := make([]Velocity, 100000)
healths := make([]Health, 100000)

// Processing:
for i := range positions {
    positions[i].X += velocities[i].X
    positions[i].Y += velocities[i].Y
    positions[i].Z += velocities[i].Z
}

Hardware execution:

  1. Read Position from array (cache hit)
  2. Read Velocity from array (cache hit)
  3. Compute sum (CPU registers)
  4. Write back to Position (cache hit)

Result: 2-3 memory accesses per iteration, sequential RAM access

Performance Comparison

Directional Performance Impact:

When hot loops shift from pointer-chasing + virtual calls to contiguous iteration + direct calls:

Inheritance-based (pointer graphs + virtual dispatch):
- More memory accesses per operation (pointer + vtable + data)
- Higher cache miss rates (scattered objects)
- Indirect branches (misprediction penalties)
- Result: Multi-× to orders of magnitude fewer operations per frame (depends on cache)

Data-oriented (contiguous arrays + direct calls):
- Fewer memory accesses per operation (direct array indexing)
- Lower cache miss rates (sequential access + prefetching)
- Direct branches (predictable, often inlined)
- Result: Multi-× to orders of magnitude more operations per frame (depends on cache)

This isn’t “Go is faster than C++.” This is contiguous data is faster than pointer chasing, regardless of language. The question is which pattern your idioms push you toward.

The difference: C++’s heterogeneous polymorphism commonly pushes designs toward pointer indirection. Go’s design makes it optional.


When C++ and Go Are Similar

Go interfaces do use dynamic dispatch (like C++ virtual methods):

type Shape interface {
    Area() float64
}

func processShapes(shapes []Shape) {
    for _, s := range shapes {
        _ = s.Area()  // Dynamic dispatch (like a C++ virtual call)
    }
}

This has the same costs as C++:

  • Indirect calls through interface
  • Scattered memory (interface values hold pointers)
  • Branch misprediction penalties
  • Cache misses

The difference is where you pay the cost:

Go: Interfaces are pervasive in the standard library (io.Reader, io.Writer, error, fmt.Stringer, context.Context). You’re constantly using interface dispatch for I/O, errors, and formatting. But these are glue code where I/O latency dominates anyway (disk/network operations take milliseconds, interface dispatch takes nanoseconds).

Your domain data structures remain concrete values:

// Business logic: Concrete types (cache-friendly)
type Point struct { X, Y int }
type Transaction struct { ID, Amount int }
type User struct { Name string; Age int }

points := make([]Point, 1000000)      // Contiguous values
transactions := make([]Transaction, 1000000)  // Contiguous values

C++: Inheritance-based polymorphism commonly propagates into your domain objects. Your business logic data structures pay the indirection cost:

// Business logic: Commonly designed with inheritance (cache-hostile)
class GameObject { virtual void update() = 0; };
class Transaction { virtual void process() = 0; };

GameObject* entities[1000000];     // Your data is pointers
Transaction* txns[1000000];        // Your data is scattered

The real distinction: Go’s interface cost concentrates in I/O boundaries (already slow). C++ inheritance cost spreads into your hot loops (where every nanosecond matters).

When processing millions of domain objects in tight loops, Go’s concrete types avoid the overhead. When doing I/O operations, both languages pay interface costs - but I/O dominates anyway.


Summary: Hardware-Level Differences

| Aspect | C++ Inheritance Pattern | Go Concrete Types | Measured Impact |
| --- | --- | --- | --- |
| Memory layout | Scattered (heap pointers) | Contiguous (value arrays) | 7.3× speedup (measured) |
| Method dispatch | Virtual (vtable lookup) | Static (compile-time) | 2.8× speedup C++, 4.6× speedup Go (measured) |
| Allocation | Heap (new/delete) | Stack/value storage | 5.3× speedup C++, 1.7× speedup Go (measured) |
| Polymorphism | Pervasive (inheritance) | Optional (interfaces) | Opt-in cost vs pervasive cost |
| Receiver | Implicit this pointer | Explicit value/pointer | Can enable inlining and copy elision (compiler-dependent) |
| Construction | Special semantics | Regular functions | Simpler, fewer edge cases |
| Memory overhead | +8 bytes (vtable ptr) | +0 bytes | 50% space savings per object |

Benchmarks: Source code and methodology

The compounding effect (measured):

Processing 1M Point objects, 100 iterations:

C++ inheritance pattern: 
  Pointer array (2ns/elem) + virtual dispatch (20ns/call)
  = 213.8ms total

C++ value-oriented:
  Value array (0.29ns/elem) + static dispatch (7ns/call)  
  = 29.2ms total

Measured speedup: 7.3× faster

When combined: Memory layout dominates (accounts for ~86% of speedup)

Conclusion

“Structs with methods” is not the differentiator.

The differentiator is execution topology: whether your design trends toward contiguous data + direct calls, or pointer graphs + indirect calls.

Both Go and C++ can express either style. The difference is what becomes idiomatic under pressure:

  • Go makes it easy to keep domain data as concrete values in contiguous slices, and to reserve interfaces and pointers for boundaries.
  • Classic C++ inheritance-centric designs commonly propagate base pointers and virtual calls into hot loops, which brings pointer chasing + indirect branches along for the ride.

The CPU doesn’t care about “objects”—it cares about cache lines, branches, and allocations. This article maps language idioms to execution topologies: the actual pattern of loads, stores, and branches that the hardware executes.

What the benchmarks show:

The measured ratios (7× for memory layout, 3-5× for dispatch) isolate specific mechanisms. Real systems see variable impacts depending on:

  • Whether the working set fits in cache
  • Whether branches are predictable
  • How allocation patterns interact with GC or allocator behavior
  • Whether prefetchers and OoO execution can hide latency

But the direction is consistent: when hot loops concentrate pointer chasing + indirect dispatch, performance degrades. When they use contiguous data + direct calls, CPUs execute them efficiently.

These mechanisms are universal—the languages differ only in how often their idioms lead you into them. Cache misses, indirect branches, and scattered allocations cost the same in both languages. The difference is which patterns become the path of least resistance.

The point: Modern languages didn’t invent new abstractions—they made different patterns the path of least resistance. Go’s concrete-type idioms lower the activation energy for cache-friendly code. C++’s value-oriented subset does the same, but you reach for it deliberately, not by default in inheritance-heavy designs.

Next time someone says “just syntax,” ask them to show you the assembly. Syntax doesn’t cause multi-× slowdowns—memory layout and indirection do.


Further Reading

Related articles: