Go Structs Are Not C++ Classes: Why Similar Modeling Roles Produce Different Hardware Execution Patterns
Go structs and C++ classes occupy similar modeling roles but commonly lead to different hardware execution patterns. From memory layout to CPU cache behavior, language defaults shape what the processor actually executes.
- categories
- Programming Architecture
- published
This is a follow-up to How Multicore CPUs Changed Object-Oriented Programming, which generated significant discussion about whether modern languages truly differ from classical OOP.
“Go structs are basically C++ classes” is usually shorthand for “Go structs play the same modeling role as my classes.”
This post shows why that analogy breaks at the CPU level - especially once indirection and dynamic dispatch enter the picture.
If you only take one thing from this article:
“Structs with methods” is not the differentiator.
The differentiator is execution topology: the pattern of memory loads, branches, and indirection that your code compiles into. Similar abstractions at the modeling level can produce different execution topologies under common idioms—and CPUs care about topology, not abstractions.
Language defaults shape which topologies become the path of least resistance.
Quick Clarifications from the Multicore Thread
Before the hardware details, these are the background assumptions behind the “structs are classes” claim:
“Java/C++ Are Still Used Successfully on Multicore”
The critique: “Enterprise runs on Java. Games run on C++. Multicore didn’t kill anything.”
Correct. Java and C++ adapt through:
- Thread pools (limit concurrency, reduce overhead)
- Immutable objects (java.lang.String, java.time.*)
- Concurrent collections (java.util.concurrent.*)
- Modern C++ value types (std::optional, std::variant)
- Smart pointers (std::unique_ptr reduces sharing)
These are workarounds retrofitted onto reference-based languages. Go/Rust bake these patterns into the language default.
The difference: Java requires discipline. Go makes cache-friendly patterns the default.
“You Can Model Any Semantics in C++”
The critique: “C++ lets you write value types with perfect forwarding, move semantics, RAII. You can model Go’s semantics in C++.”
True, but irrelevant. Expert C++ programmers can and do write exactly this kind of value-oriented, move-aware code.
But this requires:
- Understanding copy/move semantics
- Knowing when to use const&
- Avoiding inheritance (commonly requires pointers for heterogeneous collections)
- Fighting std library defaults (std::shared_ptr everywhere)
Go makes this the default. You don’t need expertise to write cache-friendly code.
The question is not what is possible in a language, but what is idiomatic under deadline pressure. Defaults shape systems.
“OOP Is Just Message Passing”
The critique: “Alan Kay said Erlang is true OOP. You’re attacking a strawman definition.”
Fair point. Kay’s vision (isolated objects communicating via messages) describes:
- Erlang (actor model)
- Go channels (CSP model)
- Rust message passing (channels)
Not Java’s shared heap objects.
This article uses “classical OOP” to mean the 1980s-2010s mainstream implementation: Java, C++, Python, Ruby, C#. These languages deviated from Kay’s message-passing vision toward shared mutable heap objects.
So yes, if we define OOP as Kay intended, then Erlang/Go/Rust are OOP. The article’s thesis becomes: “Multicore forced mainstream OOP to return to Kay’s original vision.”
Foundational Terms
Before examining hardware differences, define the key concepts. Note: Latency numbers cited below are order-of-magnitude mental models; actual costs depend on cache level, miss rate, CPU architecture, and system load.
Cache locality: How close data is in memory. CPUs read memory in 64-byte cache lines. Sequential data (addresses 0x1000, 0x1008, 0x1010) fits in one cache line (fast - L1 cache ~0.5ns). Scattered data (addresses 0x1000, 0x5000, 0x9000) requires multiple cache lines. Modern CPUs have multiple cache levels: L1 (~0.5ns), L2 (~3-5ns), L3 (~10-30ns), DRAM (~100ns). Modern systems with NUMA can see cross-socket memory access exceed 150ns. Miss rates and latency depend on access pattern, working set size, and CPU architecture. Value semantics produce sequential layouts. Reference semantics produce scattered layouts.
Static dispatch: Function call where the target address is known at compile time. The CPU knows exactly which function to call before runtime. Enables inlining (compiler replaces call with function body). Cost: ~1ns, often zero after inlining.
Dynamic dispatch: Function call where the target address is determined at runtime through indirection (vtable lookup, interface dispatch). The CPU must load the function pointer from memory before calling. Prevents inlining. Cost: typically a few nanoseconds per call, depending on branch predictability and cache behavior.
Vtable (virtual method table): Compiler-generated table of function pointers used for dynamic dispatch. Each polymorphic object has a hidden vtable pointer (8 bytes overhead). Calling a virtual method: load object pointer → load vtable pointer from object → load function pointer from vtable → indirect call. Three memory accesses before the actual function executes.
Pointer chasing: Following pointers through memory to access data. Each pointer dereference is a memory access. If the target isn’t in cache (common for heap-allocated objects), costs 100ns. Sequential array access: one pointer (array base), then offsets (arithmetic). Scattered object access: load pointer, dereference (cache miss), load next pointer, dereference (cache miss).
Stack allocation: Local variables stored on the call stack. Allocation: move stack pointer (1 CPU cycle, <1ns). Deallocation: automatic when function returns (free). Memory is contiguous and reused across function calls. Lifetime: deterministic (scope-bound).
Heap allocation: Memory requested from allocator (malloc, new, runtime allocator). Allocation: search free lists, update metadata (typically tens to hundreds of cycles). Deallocation: explicit (free, delete) or garbage collection. Memory is scattered across heap. Lifetime: flexible (can outlive function scope). Note: Exact allocator costs vary by implementation and fast-path optimizations; these are representative ranges, not guarantees.
Contiguous memory: Data stored sequentially in memory. Arrays, slices, value structs. Enables CPU prefetching (hardware loads data before requested). Often achieves high cache hit rates (directionally 70%+ in tight sequential loops, varies by working set).
Scattered memory: Data stored at non-sequential addresses. Pointer arrays, heap-allocated objects, linked lists. Defeats CPU prefetching (unpredictable access pattern). Often causes elevated cache miss rates in large, scattered working sets.
The Hardware Reality
Philosophical debates aside, here’s what actually executes on the CPU.
The important difference isn’t what either language can do. It’s what their mainstream patterns make easy. Go makes “concrete value types, contiguous memory, static calls” the path of least resistance. C++ makes that possible too - but in inheritance-heavy designs, the path of least resistance often becomes “pointers, scattered objects, virtual dispatch.” That’s what shows up at the hardware level.
About the benchmarks in this article:
These microbenchmarks isolate primitives (pointer chasing, cache misses, indirect branches) that can dominate hot paths in real systems. The exact ratios don’t carry over universally—CPUs overlap latencies, prefetchers help sometimes, and bottlenecks shift with workload. But the mechanisms and direction do: when your hot loop becomes indirection-heavy (pointer chasing + indirect calls), the CPU pays these categories of penalties. The benchmarks show what’s possible when these costs concentrate in tight loops.
1. Memory Layout: Contiguous vs Scattered
C++: Polymorphic Class Hierarchies Push Toward Pointers
Memory layout (what the hardware sees):
Stack:
std::vector<Shape*> shapes
├─ data: 0x7fff1000 (pointer to array)
├─ size: 1000
└─ capacity: 1024
Heap - Array of pointers (8 KB, contiguous):
0x7fff1000: [ptr 0] → 0x2a4b1000 (Circle)
0x7fff1008: [ptr 1] → 0x2a4b1020 (Rectangle)
0x7fff1010: [ptr 2] → 0x2a4b1040 (Circle)
...
Heap - Shape objects (scattered, different sizes):
0x2a4b1000: Circle{vtable, radius} (16 bytes: vtable ptr + int + padding)
0x2a4b1020: Rectangle{vtable, w, h} (16 bytes: vtable ptr + 2 ints)
0x2a4b1040: Circle{vtable, radius} (16 bytes)
...
What 'new Circle{i}' does internally:
1. Call malloc(16) to allocate heap memory for Circle
2. Call Circle constructor to initialize vtable pointer and radius
3. Return pointer to allocated memory
Why pointers are required:
- Circle and Rectangle have different sizes (heterogeneous)
- vector<Shape> would slice off derived class data
- Storing different types in one container requires indirection
Result: Objects scattered across heap pages (no locality guarantee)
What happens during iteration:
CPU execution:
- Read pointer from array: 0x7fff1000 → get 0x2a4b1000 (Circle)
- Dereference pointer: jump to 0x2a4b1000 → load vtable pointer
- Follow vtable: load Circle::area function pointer → indirect call
- Next iteration: read 0x7fff1008 → get 0x2a4b1020 (Rectangle)
- Dereference: jump to 0x2a4b1020 (likely cache miss!) → load vtable
- Follow vtable: load Rectangle::area function pointer → indirect call
Cache behavior:
- Pointer array is contiguous (cache-friendly)
- Dereferencing jumps to random heap locations (cache-unfriendly)
- Each object access risks cache miss (~100ns penalty)
Go: Contiguous Value Array
Go structs use value semantics by default - collections store actual objects, not pointers.
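A minimal Go sketch of this pattern (type and function names are illustrative, not from a specific benchmark):

```go
package main

import "fmt"

// Point is a plain value type: two 8-byte fields, 16 bytes, no hidden header.
type Point struct {
	X, Y int64
}

// sumX walks the slice's contiguous backing array sequentially.
func sumX(points []Point) int64 {
	var total int64
	for i := range points {
		total += points[i].X
	}
	return total
}

func main() {
	// A single allocation: 1000 Points in one 16 KB backing array.
	points := make([]Point, 1000)
	for i := range points {
		points[i] = Point{X: int64(i), Y: int64(i)}
	}
	fmt.Println(sumX(points)) // 499500
}
```

The slice header holds one pointer to the backing array; the elements themselves are values, stored back to back.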
Memory layout (what the hardware sees):
Stack (or heap, decided by escape analysis):
points (slice header, 24 bytes):
├─ array: 0x7fff1000 (pointer to backing array)
├─ len: 1000
└─ cap: 1000
Backing array (16 KB, single allocation, contiguous):
0x7fff1000: Point{0, 0} (16 bytes: x=8, y=8)
0x7fff1010: Point{1, 1} (16 bytes)
0x7fff1020: Point{2, 2} (16 bytes)
0x7fff1030: Point{3, 3} (16 bytes)
...
0x7fff3e80: Point{999, 999} (16 bytes)
All data in ONE contiguous block
What happens during iteration:
CPU execution:
- Read Point at 0x7fff1000 (cache miss)
- Read Point at 0x7fff1010 (cache hit - same cache line!)
- Read Point at 0x7fff1020 (cache hit)
- Read Point at 0x7fff1030 (cache hit)
- ...4 Points fit per 64-byte cache line, and the prefetcher loads the next lines ahead of time
Cache behavior:
- All data sequential (perfect prefetching)
- CPU loads 64-byte cache lines
- High cache hit rate on sequential access (prefetcher loads ahead)
- No pointer dereferencing overhead
The Hardware Impact
The Hardware Cost of Pointer Chasing
From Jeff Dean’s “Latency Numbers Every Programmer Should Know,” used here as order-of-magnitude reference points:
- L1 cache reference: ~0.5 ns
- Main memory reference: ~100 ns (200× slower)
Measured benchmark results (source code):
These numbers are from one machine with a specific working set. Focus on the ratio and direction:
C++ (1M elements, 100 iterations):
Pointer array (scattered heap):
Measured: ~2 ns per element
(This includes many cache hits; working set partially fits in cache)
Value array (contiguous memory):
Measured: ~0.29 ns per element
(High cache hit rate from sequential access + prefetching)
Measured ratio: 7× slower for pointer chasing
What the numbers mean:
- The 2 ns average includes cache hits (not pure DRAM latency)
- As the working set grows beyond cache, costs rise toward the ~100ns DRAM latency
- Sequential access enables hardware prefetching (CPU loads ahead)
- Scattered access defeats prefetching (unpredictable pattern)
The key is the ratio (7×) and mechanism (cache locality), not absolute ns values.
The difference isn’t the language. It’s what the CPU executes:
- Inheritance-heavy C++: Chase pointers through scattered memory
- Go concrete types: Read contiguous sequential data
2. Virtual Method Dispatch: Vtable vs Static Calls
C++: Virtual Dispatch in Inheritance Hierarchies
What the CPU executes:
Each object has hidden vtable pointer:
Circle object layout (16 bytes):
├─ [0-8]: vtable pointer → 0x400000
├─ [8-12]: radius
└─ [12-16]: padding
Vtable (at 0x400000):
├─ [0]: &Circle::area
├─ [8]: &Circle::destructor
└─ [16]: RTTI pointer
Virtual call s->area():
1. Load object pointer: s = 0x2a4b1000
2. Dereference to get vtable: vtable = *(s+0) = 0x400000
3. Load function pointer: func = *(vtable+0) = 0x401234
4. Indirect call: call *func
Cost: 3 memory accesses + indirect branch
Time: typically a few nanoseconds per call, depending on branch predictability
Branch prediction impact: with mixed types in one array, the indirect call target keeps changing, so the branch predictor frequently mispredicts (~5 ns penalty per miss).
Go: Compile-Time Static Dispatch
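A minimal Go sketch of the concrete-type version described below (the `Radius`/`Area` names mirror the annotations that follow; the exact code is illustrative):

```go
package main

import "fmt"

// Circle is a concrete type: one 8-byte Radius field, no hidden vtable pointer.
type Circle struct {
	Radius int64
}

// Area on a concrete receiver is resolved at compile time; a method this
// small is typically inlined by the Go compiler.
func (c Circle) Area() float64 {
	return 3.14159 * float64(c.Radius*c.Radius)
}

func main() {
	circles := make([]Circle, 1000)
	for i := range circles {
		circles[i] = Circle{Radius: int64(i)}
	}
	var sum float64
	for i := range circles {
		sum += circles[i].Area() // direct call, no indirection
	}
	fmt.Println(sum > 0) // true
}
```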
What the CPU executes:
Circle object layout (8 bytes):
└─ [0-8]: Radius (no vtable pointer!)
Static call circles[i].Area():
1. Load Circle value: circle = *(circles + i*8)
2. Direct call: call Circle.Area
(address known at compile time: 0x401000)
Cost: 1 memory access + direct branch
Time: ~1-2ns per call
Compiler can inline:
var sum float64
for i := range circles {
    sum += 3.14159 * float64(circles[i].Radius*circles[i].Radius)
}
Cost: 1 memory access + arithmetic (no function call!)
Time: ~0.5ns per iteration
Branch prediction impact: a direct call always targets the same address, so there is nothing for the predictor to mispredict.
When Go Uses Virtual Dispatch (Interfaces)
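A short sketch of the opt-in: assigning a concrete value to an interface variable produces the two-word interface value described below (names are illustrative):

```go
package main

import "fmt"

type Shape interface {
	Area() float64
}

type Circle struct {
	Radius float64
}

func (c Circle) Area() float64 {
	return 3.14159 * c.Radius * c.Radius
}

func main() {
	// Interface value: itab pointer + data pointer (two words).
	var s Shape = Circle{Radius: 2}
	fmt.Println(s.Area()) // dynamic dispatch through the itab
}
```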
Interface value layout (16 bytes):
Interface value (two words):
├─ [0-8]: itab pointer → type + method table
└─ [8-16]: data pointer → actual Circle value
Dynamic call s.Area():
1. Load itab pointer: itab = *(s+0)
2. Load method pointer: func = *(itab+24) // offset to Area
3. Load data pointer: data = *(s+8)
4. Indirect call: call func(data)
Cost: 3-4 memory accesses + indirect branch
Similar to C++ virtual dispatch
In Go, static dispatch is the default and dynamic dispatch is opt-in via interfaces. In C++, once you design around inheritance/virtuals, the dynamic-dispatch + indirection costs become pervasive in that slice of the codebase.
The Hardware Impact
Virtual Dispatch Overhead
From Jeff Dean’s latency numbers as order-of-magnitude mental models:
- Branch mispredict: ~5 ns
- Indirect call overhead: typically a few ns, varies by predictability
Measured benchmark results (source code, isolated vtable overhead using contiguous arrays):
These numbers are from one machine and workload. Absolute values vary by CPU, compiler version, and access patterns. Focus on ratios and direction:
C++ (10M elements, 10 iterations = 100M calls):
Virtual dispatch (contiguous array, vtable lookup):
Measured: ~20 ns per call (includes loop overhead + vtable indirection)
Static dispatch (contiguous array, direct calls):
Measured: ~7 ns per call (includes loop overhead, likely partially inlined)
Measured ratio: ~2.8× slower for virtual dispatch
Go (10M elements, 10 iterations = 100M calls):
Interface dispatch (dynamic):
Measured: a few ns per call on this system
(Exact values vary widely and may be optimized away by the compiler)
Concrete type (static dispatch):
Measured: sub-ns per call (likely fully inlined)
Measured ratio: several× slower for interface dispatch
(Treat only the ratio and mechanism as meaningful, not absolute ns)
Key findings:
- Both languages show the same pattern: indirect calls are several times slower than direct/inlined calls
- The ratios (2-5×) are more stable than absolute ns values
- Go’s measured numbers suggest aggressive inlining; beware of devirtualization in microbenchmarks
- In real code, vtable overhead combines with pointer chasing (see Benchmark 1)
In Go, static dispatch is the default path. In inheritance-heavy C++, dynamic dispatch often becomes pervasive across hot loops.
3. Object Allocation: Stack vs Heap
C++: Polymorphism Often Implies Pointer Lifetimes
What happens with new Point{1, 2}:
1. Call malloc(8)
- Search free list for 8-byte chunk
- Update heap metadata
- Return pointer: 0x2a4b1000
Cost: often tens to hundreds of CPU cycles on the fast path
2. Call Point constructor
- Initialize x = 1, y = 2
Cost: ~5 cycles
3. Later: delete p1
- Call destructor
- Call free(0x2a4b1000)
- Update free list
Cost: ~30-50 cycles
Total cost: ~100-150 cycles per allocation
Garbage collection isn’t the issue here. C++ uses manual memory management (new/delete), not GC. The cost is heap allocation itself - malloc/free overhead.
Why inheritance-heavy C++ commonly uses heap allocation:
- Runtime polymorphism via base pointers requires indirection
- Object lifetime beyond scope (return from function)
- Containers holding heterogeneous derived types store pointers
Go: Stack Allocation Default (Escape Analysis)
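A hedged sketch of the two cases the cost breakdowns below describe (function names are illustrative; run `go build -gcflags=-m` to see the compiler's actual escape decisions):

```go
package main

import "fmt"

type Point struct {
	X, Y int64
}

// sumLocal's Point never escapes the function,
// so the compiler keeps it on the stack.
func sumLocal() int64 {
	p := Point{X: 1, Y: 2} // stack allocation: a stack-pointer bump
	return p.X + p.Y
}

// newPoint returns a pointer to its local, so the Point escapes
// and is heap-allocated via runtime.newobject.
func newPoint(x, y int64) *Point {
	p := Point{X: x, Y: y} // escapes to heap
	return &p
}

func main() {
	fmt.Println(sumLocal(), newPoint(3, 4).X) // 3 3
}
```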
Stack allocation (escape analysis says “no escape”):
1. Move stack pointer
- Current SP: 0x7fffe000
- Allocate 16 bytes: SP -= 16
- New SP: 0x7fffdff0
Cost: ~1 CPU cycle
2. Initialize Point
- Write x = 1, y = 2 to stack
Cost: ~2 cycles
3. Function return
- Stack frame discarded (SP += 16)
Cost: ~1 cycle
Total cost: ~4 cycles per allocation
Heap allocation (escape analysis says “escapes”):
1. Call runtime.newobject(16)
- Small object allocation from per-P cache (mcache)
- Fast path: ~10-20 cycles
- Slow path (cache miss): ~50-100 cycles
Cost: ~10-100 cycles (avg ~20)
2. Initialize Point
Cost: ~2 cycles
3. Garbage collection
- Mark phase: Scans object (amortized cost)
- Sweep phase: Reclaims memory (amortized cost)
Cost: ~5-10 cycles per object (amortized)
Total cost: ~30-50 cycles per allocation
The Hardware Impact
Allocation Cost Comparison
From Jeff Dean’s latency numbers:
- Mutex lock/unlock: 25 ns
Heap allocation (malloc/new) typically involves:
- Free list traversal or allocator lock
- Metadata updates
- Typical cost: tens to hundreds of nanoseconds on the fast path, far more under contention or fragmentation
Stack allocation:
- Adjust stack pointer (SUB instruction)
- Typical cost: <1 ns (single CPU cycle)
Measured allocation overhead (source code):
C++ (1M allocations, 48-byte objects):
Heap allocation (new + store pointer):
Total time: 34.9 ms
Time per allocation: 34 ns
Stack-based storage (vector of values):
Total time: 6.5 ms
Time per allocation: 6.5 ns
Measured speedup: 5.3× faster for stack-based storage
Go (1M allocations, 96-byte objects):
Heap allocation (pointer slice):
Total time: 97.3 ms
Time per allocation: 97 ns
Value slice (contiguous storage):
Total time: 56.4 ms
Time per allocation: 56 ns
Measured speedup: 1.7× faster for value storage
C++ shows the larger difference because Go’s allocator has per-goroutine caches (fast small allocations). But both languages show that heap allocation costs more than contiguous value storage.
Important: This benchmark measures allocation + storage topology. Go’s per-P caches make individual allocations fast, but the total cost includes GC scanning of pointer graphs. C++ has malloc overhead but no GC. The trade-offs differ:
- Go: Fast allocation, pays for GC scanning later
- C++: Slower malloc, pays for explicit free/delete
Real-world impact: From Discord’s engineering blog, heap allocation pressure caused 2-minute GC pauses in production with millions of long-lived objects. Moving to value semantics (Rust) eliminated both allocation overhead and GC scanning.
Real-World Example
In Go, domain data such as request or entity records can live in contiguous value slices. Compare this to C++/Java-style designs, where every object is typically heap-allocated and reached through a pointer.
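A hedged Go sketch of the two storage shapes (the `Order` type and function names are illustrative, not from the original post):

```go
package main

import "fmt"

type Order struct {
	ID     int64
	Amount int64
}

// totalValue iterates a contiguous value slice: sequential reads.
func totalValue(orders []Order) int64 {
	var sum int64
	for i := range orders {
		sum += orders[i].Amount
	}
	return sum
}

// totalPtr iterates a pointer slice: one extra dereference per element,
// the C++/Java-style layout where each Order is its own heap allocation.
func totalPtr(orders []*Order) int64 {
	var sum int64
	for _, o := range orders {
		sum += o.Amount
	}
	return sum
}

func main() {
	values := make([]Order, 3)
	pointers := make([]*Order, 3)
	for i := range values {
		values[i] = Order{ID: int64(i), Amount: 100}
		pointers[i] = &Order{ID: int64(i), Amount: 100}
	}
	fmt.Println(totalValue(values), totalPtr(pointers)) // 300 300
}
```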
4. Inheritance: Pointer Indirection Requirement
C++: Polymorphism Commonly Pushes Toward Pointers
Memory layout (no choice):
Array of pointers (contiguous):
shapes[0] → 0x2a4b1000 (Circle, 16 bytes)
shapes[1] → 0x2a4b1050 (Rectangle, 24 bytes)
shapes[2] → 0x2a4b10a0 (Circle, 16 bytes)
...
Objects scattered on heap (different sizes!)
Cannot be stored contiguously without indirection or manual layout machinery - different sizes per type
Why heterogeneous storage requires pointers:
- Circle is 16 bytes (vtable ptr + 8-byte radius)
- Rectangle is 24 bytes (vtable ptr + 8-byte width + 8-byte height)
- Cannot fit different-sized objects in fixed-size array
- Indirection is required to store polymorphic collection
Go: Separate Arrays (Opt-In Polymorphism)
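A minimal Go sketch matching the layout diagram below (field choices are illustrative: one 8-byte field for Circle, two for Rectangle):

```go
package main

import "fmt"

type Circle struct {
	Radius float64 // 8 bytes
}

type Rectangle struct {
	Width, Height float64 // 16 bytes
}

func main() {
	// Two homogeneous slices, each a single contiguous allocation.
	circles := make([]Circle, 500)       // 4 KB backing array
	rectangles := make([]Rectangle, 500) // 8 KB backing array

	for i := range circles {
		circles[i] = Circle{Radius: float64(i)}
	}
	for i := range rectangles {
		rectangles[i] = Rectangle{Width: float64(i), Height: float64(i)}
	}
	fmt.Println(len(circles), len(rectangles)) // 500 500
}
```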
Memory layout (programmer’s choice):
circles array (contiguous, 4 KB):
├─ Circle{0} (8 bytes)
├─ Circle{1} (8 bytes)
├─ Circle{2} (8 bytes)
...
rectangles array (contiguous, 8 KB):
├─ Rectangle{0, 0} (16 bytes)
├─ Rectangle{1, 1} (16 bytes)
├─ Rectangle{2, 2} (16 bytes)
...
Both arrays fully contiguous
CPU prefetches perfectly
No pointer chasing
When you DO need polymorphism, Go provides it as an explicit opt-in via interfaces.
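A hedged sketch of that opt-in: a heterogeneous `[]Shape` pays the interface indirection only where you choose to use it (names are illustrative):

```go
package main

import "fmt"

type Shape interface{ Area() float64 }

type Circle struct{ Radius float64 }
type Rectangle struct{ Width, Height float64 }

func (c Circle) Area() float64    { return 3.14159 * c.Radius * c.Radius }
func (r Rectangle) Area() float64 { return r.Width * r.Height }

// totalArea pays the dynamic-dispatch cost: each element is an
// interface value (itab ptr + data ptr) and each call is indirect.
func totalArea(shapes []Shape) float64 {
	var total float64
	for _, s := range shapes {
		total += s.Area()
	}
	return total
}

func main() {
	shapes := []Shape{Circle{1}, Rectangle{2, 3}}
	fmt.Println(totalArea(shapes) > 9) // 3.14159 + 6 > 9
}
```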
The Hardware Impact
Performance Impact (Directional)
Based on cache latency costs (L1: 0.5ns, memory: 100ns as mental models):
C++ (inheritance-based polymorphism):
- Memory layout: Array of pointers → scattered objects
- Each access: Pointer read + dereference (cache miss likely) + vtable lookup
- Combined overhead: In worst cases (unpredictable access + cache misses),
can be orders of magnitude slower than sequential access
- In practice: Often multi-× slowdowns when hot loops become indirection-heavy
Go (concrete types, no polymorphism):
- Memory layout: Contiguous arrays
- Each access: Direct read (cache hits via prefetching)
- Static dispatch: Direct calls (often inlined)
- Overhead: Minimal compared to scattered + indirect patterns
Go (explicit polymorphism via interface):
- Memory layout: Interface values (similar to pointers)
- Each access: Similar cache/dispatch behavior to C++
- Combined overhead: Pays similar costs to C++ virtual dispatch when used
Go’s advantage: You choose when to pay the cost. C++ inheritance hierarchies make you pay everywhere.
Real-world example: Discord’s migration to Rust
From Discord’s engineering blog:
“We were reaching the limits of Go’s garbage collector… We had 2-minute latency spikes as the garbage collector was forced to scan the entire heap.”
After migrating their Read States service from Go to Rust (value semantics, no GC):
- Before (Go): 2-minute latency spikes during GC
- After (Rust): Microsecond average response times
- Cache capacity: Grew to 8 million items in a single LRU cache
This confirms the memory layout thesis: scattered heap objects create GC pressure and cache misses. Rust’s value semantics (similar to Go’s, but without GC) eliminated both problems.
Real-world impact: Game engines (ECS)
Why modern game engines moved away from deep inheritance hierarchies:
The pattern: moving from pointer graphs with indirect calls to contiguous arrays with direct loops commonly produces multi-× improvements, and in cache-bound workloads can reach order-of-magnitude gains. The speedup isn’t from Go vs C++—it’s from contiguous data vs pointer indirection, regardless of language.
5. Method Receivers: Explicit Mutation Visibility
C++: Implicit this Pointer
What the CPU executes:
Counter object layout:
├─ [0-4]: count
Call c.increment():
1. Load 'this' pointer: rdi = &c (calling convention)
2. Load count: eax = *(rdi+0)
3. Increment: eax++
4. Store count: *(rdi+0) = eax
'this' is passed as a pointer, introducing implicit indirection at the call boundary
Concurrency issue: because 'this' is always a mutable pointer, any method may mutate shared state, and nothing at the call site tells you which ones do.
Go: Explicit Value vs Pointer Receivers
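A minimal sketch of the two receiver forms the annotations below refer to (the `Counter` name follows the surrounding text; exact code is illustrative):

```go
package main

import "fmt"

type Counter struct {
	count int
}

// Get has a value receiver: it operates on an independent copy,
// so the caller's Counter cannot be mutated.
func (c Counter) Get() int {
	return c.count
}

// Increment has a pointer receiver: mutation of the shared
// Counter is visible right in the signature.
func (c *Counter) Increment() {
	c.count++
}

func main() {
	c := Counter{}
	c.Increment()
	c.Increment()
	fmt.Println(c.Get()) // 2
}
```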
What the CPU executes:
Value receiver (c Counter):
Call c.Get():
1. Copy Counter: callee receives its own copy (a small struct, often passed in registers)
2. Load count: eax = [stack+0]
3. Return: return eax
No pointers, no indirection
Function receives independent copy
Original unchanged (guaranteed)
Pointer receiver (c *Counter):
Call c.Increment():
1. Load pointer: rdi = &c (calling convention)
2. Load count: eax = *(rdi+0)
3. Increment: eax++
4. Store count: *(rdi+0) = eax
Pointer indirection (like C++ 'this')
Original mutated
Concurrency benefit: value-receiver methods cannot mutate the caller's state, so only pointer-receiver methods need synchronization scrutiny.
The receiver type shows the programmer whether mutation/sharing happens.
Why This Matters
The receiver type makes sharing and mutation intent visible at the call boundary.
The impact is semantic and ergonomic, not a direct performance guarantee:
- Concurrency reasoning: Seeing counter.Get() vs counter.Increment() immediately tells you which calls might mutate shared state
- Escape behavior: Value receivers can sometimes enable stack allocation, but this depends on escape analysis heuristics, not receiver type alone
- Aliasing opportunities: Value receivers give the compiler more freedom to reason about aliasing, but whether this translates to optimization depends on many factors
The key win is intent clarity at call boundaries, which shapes what patterns become idiomatic in concurrent code. Performance effects are indirect (escape analysis, aliasing analysis), not a guaranteed inlining switch.
6. Construction: Special Syntax vs Functions
C++: Constructor Special Semantics
What the CPU executes:
Point p1(1, 2):
1. Allocate 8 bytes (stack or heap)
2. Call Point::Point(int, int)
- Initialize x, y
3. Mark object as constructed
Point p2 = p1:
1. Allocate 8 bytes
2. Call Point::Point(const Point&)
- Copy x, y
3. Mark object as constructed
End of scope:
1. Call p3.~Point()
2. Call p2.~Point()
3. Call p1.~Point()
4. Deallocate memory
Complex semantics:
- Initialization order matters (member init list)
- Copy/move constructors have implicit generation rules
- Exception safety during construction is subtle
- Virtual function dispatch doesn’t work in constructors
- Destructors must be virtual for polymorphic classes
Go: Regular Functions (No Special Semantics)
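A minimal sketch of the convention the annotations below describe (`Point` and `NewPoint` mirror the names used there; the code itself is illustrative):

```go
package main

import "fmt"

type Point struct {
	X, Y int64
}

// NewPoint is an ordinary function by convention, not a language
// construct: no initializer lists, no implicit copy/move rules.
func NewPoint(x, y int64) Point {
	return Point{X: x, Y: y}
}

func main() {
	p1 := Point{1, 2}    // plain composite literal
	p2 := NewPoint(3, 4) // regular function call
	p3 := p1             // plain 16-byte copy, no copy constructor
	fmt.Println(p1, p2, p3) // {1 2} {3 4} {1 2}
}
```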
What the CPU executes:
p1 := Point{1, 2}:
1. Allocate 16 bytes (stack or heap, escape analysis)
2. Write x = 1
3. Write y = 2
That's it. No special semantics.
p2 := NewPoint(3, 4):
1. Call NewPoint (regular function)
2. Return value copies to p2
3. No constructor semantics
p3 := p1:
1. Load p1 (16 bytes)
2. Store to p3 (16 bytes)
3. Memcpy (no special copy constructor)
Simpler semantics:
- No initialization order complexity (just assignments)
- No copy/move distinction (always copies bytes)
- No destructor timing issues (GC handles cleanup)
- No virtual function restrictions
- No exception safety concerns (no exceptions in Go)
The Hardware Impact
Construction overhead comparison:
C++: Create 1 million Points
- Constructor calls: 1 million
- Copy constructor calls: Variable (depends on usage)
- Destructor calls: 1 million
- Time: Depends on constructor complexity
Go: Create 1 million Points
- Struct initialization: 1 million (memcpy)
- No constructor/destructor overhead
- Time: Minimal (just memory writes)
The difference is conceptual complexity, not raw performance. C++ constructors add rules that the programmer must understand. Go treats initialization as simple data copying.
7. Memory Footprint: Hidden Vtable Pointers
C++: Every Polymorphic Object Has Vtable Pointer
Memory layout:
NonVirtual object (8 bytes):
├─ [0-4]: x
└─ [4-8]: y
Virtual object (16 bytes):
├─ [0-8]: __vptr (hidden vtable pointer)
├─ [8-12]: x
└─ [12-16]: y (includes padding)
Overhead: 8 bytes per object (the object doubles in size; half of it is now metadata)
Array of 1 million objects: 16 MB, half of it vtable pointers.
Go: No Hidden Pointers
Memory layout:
Point object (16 bytes):
├─ [0-8]: X
└─ [8-16]: Y
No hidden pointers
Methods are not stored in objects
Function pointers resolved at compile time
Array of 1 million Points: 16 MB of pure data, no hidden metadata.
Interface Values (Explicit Overhead)
Memory layout:
Interface value (16 bytes):
├─ [0-8]: itab pointer (type + methods)
└─ [8-16]: data pointer (or small value directly)
Overhead: 8-16 bytes per interface value
But this is EXPLICIT - you opt in with interface type
Array comparison: 1M interface values also occupy 16 MB (itab + data pointers) - but only when you opt in.
The Hardware Impact
Memory overhead:
C++ with virtual methods:
- 1M Point objects: 16 MB (8 MB vtable pointers)
- Cache pollution: Half the cache lines are pointers
- Memory bandwidth: Wasted on metadata
Go concrete types:
- 1M Point objects: 16 MB (pure data)
- Cache efficiency: All cache lines are data
- Memory bandwidth: Fully utilized
Go interface types (explicit):
- 1M Shape interfaces: 16 MB (same as C++)
- But opt-in, not default
Real-world impact:
Game engines processing 100,000 entities:
C++ (virtual methods, padded object layouts):
- Entity size: 64 bytes (vtable ptr + data + padding)
- Total: 6.4 MB
Go/ECS (concrete types):
- Component size: 16-32 bytes (pure data)
- Total: 1.6-3.2 MB
Cache difference: 2-4× more entities fit in cache
The Systemic Difference
These 7 differences aren’t independent. They compound:
Inheritance-Heavy C++ Pattern (Common in OO Designs)
Hardware execution:
- Read pointer from vector (cache hit)
- Dereference pointer (cache miss - scattered heap)
- Load vtable pointer (another memory access)
- Load function pointer from vtable (another memory access)
- Indirect call (branch misprediction possible)
Result: 4-5 memory accesses per iteration, scattered across RAM
Go Concrete-Type Pattern (Path of Least Resistance)
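A minimal Go sketch of the data-oriented (ECS-style) pattern the annotations below trace (the `Vec2`/`step` names are illustrative):

```go
package main

import "fmt"

type Vec2 struct {
	X, Y float64
}

// step advances every position by its velocity scaled by dt.
// Both slices are contiguous backing arrays: the hot loop is sequential
// loads, adds, and stores - no pointer chasing, no virtual calls.
func step(positions, velocities []Vec2, dt float64) {
	for i := range positions {
		positions[i].X += velocities[i].X * dt
		positions[i].Y += velocities[i].Y * dt
	}
}

func main() {
	const n = 1000
	positions := make([]Vec2, n)
	velocities := make([]Vec2, n)
	for i := range velocities {
		velocities[i] = Vec2{X: 1, Y: 2}
	}
	step(positions, velocities, 0.5)
	fmt.Println(positions[0]) // {0.5 1}
}
```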
Hardware execution:
- Read Position from array (cache hit)
- Read Velocity from array (cache hit)
- Compute sum (CPU registers)
- Write back to Position (cache hit)
Result: 2-3 memory accesses per iteration, sequential RAM access
Performance Comparison
Directional Performance Impact:
When hot loops shift from pointer-chasing + virtual calls to contiguous iteration + direct calls:
Inheritance-based (pointer graphs + virtual dispatch):
- More memory accesses per operation (pointer + vtable + data)
- Higher cache miss rates (scattered objects)
- Indirect branches (misprediction penalties)
- Result: Multi-× to orders of magnitude fewer operations per frame (depends on cache)
Data-oriented (contiguous arrays + direct calls):
- Fewer memory accesses per operation (direct array indexing)
- Lower cache miss rates (sequential access + prefetching)
- Direct branches (predictable, often inlined)
- Result: Multi-× to orders of magnitude more operations per frame (depends on cache)
This isn’t “Go is faster than C++.” This is contiguous data is faster than pointer chasing, regardless of language. The question is which pattern your idioms push you toward.
The difference: C++’s heterogeneous polymorphism commonly pushes designs toward pointer indirection. Go’s design makes it optional.
When C++ and Go Are Similar
Go interfaces do use dynamic dispatch, just like C++ virtual methods.
This has the same costs as C++:
- Indirect calls through interface
- Scattered memory (interface values hold pointers)
- Branch misprediction penalties
- Cache misses
The difference is where you pay the cost:
Go: Interfaces are pervasive in the standard library (io.Reader, io.Writer, error, fmt.Stringer, context.Context). You’re constantly using interface dispatch for I/O, errors, and formatting. But these are glue code where I/O latency dominates anyway (disk/network operations take milliseconds, interface dispatch takes nanoseconds).
Your domain data structures remain concrete values.
C++: Inheritance-based polymorphism commonly propagates into your domain objects, so your business logic data structures pay the indirection cost.
The real distinction: Go’s interface cost concentrates in I/O boundaries (already slow). C++ inheritance cost spreads into your hot loops (where every nanosecond matters).
When processing millions of domain objects in tight loops, Go’s concrete types avoid the overhead. When doing I/O operations, both languages pay interface costs - but I/O dominates anyway.
Summary: Hardware-Level Differences
| Aspect | C++ Inheritance Pattern | Go Concrete Types | Measured Impact |
|---|---|---|---|
| Memory layout | Scattered (heap pointers) | Contiguous (value arrays) | 7.3× speedup (measured) |
| Method dispatch | Virtual (vtable lookup) | Static (compile-time) | 2.8× speedup C++, 4.6× speedup Go (measured) |
| Allocation | Heap (new/delete) | Stack/value storage | 5.3× speedup C++, 1.7× speedup Go (measured) |
| Polymorphism | Pervasive (inheritance) | Optional (interfaces) | Opt-in cost vs pervasive cost |
| Receiver | Implicit this pointer | Explicit value/pointer | Can enable inlining and copy elision (compiler-dependent) |
| Construction | Special semantics | Regular functions | Simpler, fewer edge cases |
| Memory overhead | +8 bytes (vtable ptr) | +0 bytes | 50% space savings per object |
Benchmarks: Source code and methodology
The compounding effect (measured):
Processing 1M Point objects, 100 iterations:
C++ inheritance pattern:
Pointer array (2ns/elem) + virtual dispatch (20ns/call)
= 213.8ms total
C++ value-oriented:
Value array (0.29ns/elem) + static dispatch (7ns/call)
= 29.2ms total
Measured speedup: 7.3× faster
When combined: Memory layout dominates (accounts for ~86% of speedup)
Conclusion
“Structs with methods” is not the differentiator.
The differentiator is execution topology: whether your design trends toward contiguous data + direct calls, or pointer graphs + indirect calls.
Both Go and C++ can express either style. The difference is what becomes idiomatic under pressure:
- Go makes it easy to keep domain data as concrete values in contiguous slices, and to reserve interfaces and pointers for boundaries.
- Classic C++ inheritance-centric designs commonly propagate base pointers and virtual calls into hot loops, which brings pointer chasing + indirect branches along for the ride.
The CPU doesn’t care about “objects”—it cares about cache lines, branches, and allocations. This article maps language idioms to execution topologies: the actual pattern of loads, stores, and branches that the hardware executes.
What the benchmarks show:
The measured ratios (7× for memory layout, 3-5× for dispatch) isolate specific mechanisms. Real systems see variable impacts depending on:
- Whether the working set fits in cache
- Whether branches are predictable
- How allocation patterns interact with GC or allocator behavior
- Whether prefetchers and OoO execution can hide latency
But the direction is consistent: when hot loops concentrate pointer chasing + indirect dispatch, performance degrades. When they use contiguous data + direct calls, CPUs execute them efficiently.
These mechanisms are universal—the languages differ only in how often their idioms lead you into them. Cache misses, indirect branches, and scattered allocations cost the same in both languages. The difference is which patterns become the path of least resistance.
The point: Modern languages didn’t invent new abstractions—they made different patterns the path of least resistance. Go’s concrete-type idioms lower the activation energy for cache-friendly code. C++’s value-oriented subset does the same, but you reach for it deliberately, not by default in inheritance-heavy designs.
Next time someone says “just syntax,” ask them to show you the assembly. Syntax doesn’t cause multi-× slowdowns—memory layout and indirection do.
Further Reading
Related articles:
- How Multicore CPUs Changed Object-Oriented Programming - Why reference semantics became problematic
- Go’s Value Philosophy: Part 1 - Why Everything Is a Value - Deep dive into value semantics
- Go’s Value Philosophy: Part 2 - Escape Analysis and Performance - How Go optimizes value allocation