
optimize: reduce heap allocations and memory footprint#23957

Open
aunjgr wants to merge 65 commits into matrixorigin:main from aunjgr:memopt

Conversation

Contributor

@aunjgr aunjgr commented Mar 25, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #23846

What this PR does / why we need it:

Key optimizations to reduce GC pressure and improve memory efficiency:

  1. Primitives (pkg/container/types):

    • Rewrite ParseDateCast to use stack-based accumulators instead of heap-allocated strings/builders.
  2. Storage Layer (pkg/objectio):

    • Implement buffer reuse for compression operations (lz4) in object writers.
    • Reuse writer instances to minimize allocations per block.
  3. Execution Layer (pkg/sql/colexec):

    • Adopt off-heap vectors and batches for aggregation states and join operations.
    • This moves significant temporary data structures out of the Go heap, reducing GC overhead during complex queries.
  4. Dependencies:

    • Upgrade hyperloglog library for improved memory efficiency.

For TPC-H 100 GB data loading, total allocated memory is reduced from 882 GB to 101 GB.
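To illustrate point 1, here is a minimal sketch of the stack-based accumulator approach: digits are folded into integer fields as the input is scanned, so no intermediate strings or builders are heap-allocated. `parseYMD` is a hypothetical simplification, not the actual `ParseDateCast` code.

```go
package main

import "fmt"

// parseYMD accumulates digits directly into stack-allocated integer
// fields while scanning, avoiding heap-allocated substrings/builders.
func parseYMD(s string) (year, month, day int, ok bool) {
	fields := [3]int{} // stack-based accumulators
	idx := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c >= '0' && c <= '9':
			fields[idx] = fields[idx]*10 + int(c-'0')
		case c == '-':
			if idx >= 2 {
				return 0, 0, 0, false
			}
			idx++
		default:
			return 0, 0, 0, false
		}
	}
	if idx != 2 {
		return 0, 0, 0, false
	}
	return fields[0], fields[1], fields[2], true
}

func main() {
	y, m, d, ok := parseYMD("2026-03-25")
	fmt.Println(y, m, d, ok) // 2026 3 25 true
}
```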

aunjgr and others added 30 commits March 27, 2026 19:17
ObjectBuffer.Write uses append to grow the Entries slice from zero
capacity (1.12 GB in profile 046).  A typical object has ~100 entries
(N_blocks × N_cols + BFs + ZMs + header + meta + footer), so 7
append-doublings occur per object.  Pre-sizing to 128 eliminates
most of these intermediate allocations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
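The pre-sizing fix amounts to giving the slice an initial capacity that covers the typical entry count, so `append` never has to reallocate. A minimal sketch with illustrative names (the real `ObjectBuffer` type differs):

```go
package main

import "fmt"

type entry struct{ data []byte }

// newEntries pre-sizes to 128 so a typical object (~100 entries)
// never triggers append-doubling reallocations.
func newEntries() []entry {
	return make([]entry, 0, 128)
}

func main() {
	entries := newEntries()
	for i := 0; i < 100; i++ {
		entries = append(entries, entry{})
	}
	// Capacity is unchanged: no intermediate allocations occurred.
	fmt.Println(len(entries), cap(entries)) // 100 128
}
```

Without the pre-size, growing from zero capacity takes roughly log2(100) ≈ 7 doublings per object, each leaving a garbage intermediate slice behind.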
…bucket

Profile 046 shows 80,965 GetTransferMap calls with 100% pool miss
rate (4.93 GB, all from fresh make). During heavy merge bursts the
pool drains faster than returns, leaving it permanently empty.

Increasing tmMaxFree from 64 to 256 per bucket (8 buckets = 2048
total pooled maps) gives a deeper recycling buffer to absorb bursts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace single-tier arena free list with two tiers:
- ArenaSmall (≤16 MB): flush tasks, sinkers, general I/O
- ArenaLarge (≤128 MB): merge/compaction tasks

Each tier gets GOMAXPROCS/2 slots, matching the merge and flush worker
pool sizes. Small arenas have a sizeLimit that prevents Reset from
growing them beyond 16 MB, so flush arenas never inflate to 128 MB.

Without tiers, all arenas converge to 128 MB because small callers
(flush) share the pool with large callers (merge). On a 32-core
machine this wastes 4 GB permanent RSS; with tiers it drops to ~2.3 GB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
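The tier routing described above can be sketched as a simple size-hint dispatch; the constants match the commit message, but the function and its signature are illustrative, not the actual pool API:

```go
package main

import "fmt"

const (
	arenaSmall = 16 << 20  // 16 MB cap: flush tasks, sinkers, general I/O
	arenaLarge = 128 << 20 // 128 MB cap: merge/compaction tasks
)

// pickTier routes a caller to a tier by its expected working-set size,
// so small callers never inflate a shared arena to the large cap.
func pickTier(sizeHint int) int {
	if sizeHint <= arenaSmall {
		return arenaSmall
	}
	return arenaLarge
}

func main() {
	fmt.Println(pickTier(4<<20) == arenaSmall)  // flush-sized request
	fmt.Println(pickTier(64<<20) == arenaLarge) // merge-sized request
}
```

The sizeLimit on the small tier is what stops Reset from growing a flush arena past 16 MB; without it, all pooled arenas eventually converge to the large cap.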
CN merge tasks allocated TransferMaps via GetTransferMap but never
returned them to the pool, causing ongoing pool misses. Add
CleanTransMapping call in Release() to match TN merge behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove 6 unused symbols:
- MedianDecimal64, MedianDecimal128, MedianNumeric (superseded by
  medianNumericFromState/medianDecimal*FromState in median2.go)
- GetMinAggregatorsChunkSize, getChunkSizeOfAggregator (no callers)
- NewEmptyVectors (no callers)
- SpecialAggValuesString (no callers, also removes fmt/strings imports)

178 lines deleted, tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Sinker accumulates data from multiple WriteBatch calls before
syncing. The CN S3 writer path goes through the Sinker and processes
large data volumes that exceed the 16 MB ArenaSmall cap, causing
constant make() overflow in WriteArena.Alloc (51+ GB in profile 056).

Flush jobs already create their own ArenaSmall directly via
NewBlockWriterWithArena, so this change only affects the Sinker path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ArenaLarge is now shared between TN merge workers and CN S3 writers.
The previous GOMAXPROCS/2 capacity was sized for merge workers only,
causing CN S3 writer arenas to be discarded and re-created cold.
Profile 057 showed 150 arena warmup allocations (5.2 GB of Reset).
Doubling the pool capacity keeps more warm arenas available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… MB)

Fresh arenas previously started at size 0, requiring 7-8 geometric
growth steps (0→1→2→4→8→16→32→64→128 MB) to reach steady state.
Each step triggers a Reset allocation.  Pre-warming skips most of
these steps, reducing Reset+Alloc overhead (~1.88 GB in profile 058).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
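The saving can be checked with a little arithmetic on the growth chain above. Assuming a 16 MB pre-warm size (illustrative; the truncated subject hides the exact value), counting doubling steps to the 128 MB steady state:

```go
package main

import "fmt"

// growthSteps counts the geometric growth steps (each one a Reset
// allocation) from a starting size to the target steady-state size.
func growthSteps(startMB, targetMB int) int {
	steps := 0
	size := startMB
	if size == 0 {
		size = 1 // 0 -> 1 MB is the first step
		steps++
	}
	for size < targetMB {
		size *= 2
		steps++
	}
	return steps
}

func main() {
	fmt.Println(growthSteps(0, 128))  // 8 Resets from a cold arena
	fmt.Println(growthSteps(16, 128)) // 3 Resets when pre-warmed to 16 MB
}
```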
FetchWithSchema previously called NewWithSchema which allocates ALL
column vectors upfront, then immediately replaced some with pooled
vectors — wasting the freshly allocated ones.  Now we create the
batch shell with nil vectors, fill from the pool first, and only
allocate fresh vectors for unfilled positions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
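The fill-order fix can be sketched as follows, with hypothetical types standing in for the real vector/batch machinery: start from a shell of nil vectors, take pooled vectors first, and only allocate for positions the pool could not fill.

```go
package main

import "fmt"

type vec struct{ fresh bool }

// fetchBatch builds a batch shell with nil vectors, fills from the
// pool first, and allocates fresh vectors only for unfilled slots.
func fetchBatch(ncols int, pool []*vec) []*vec {
	vecs := make([]*vec, ncols) // shell: all nil, nothing allocated yet
	for i := 0; i < ncols && i < len(pool); i++ {
		vecs[i] = pool[i] // reuse a pooled vector
	}
	for i := range vecs {
		if vecs[i] == nil {
			vecs[i] = &vec{fresh: true} // allocate only what is missing
		}
	}
	return vecs
}

func main() {
	pool := []*vec{{}, {}}
	fresh := 0
	for _, v := range fetchBatch(4, pool) {
		if v.fresh {
			fresh++
		}
	}
	fmt.Println(fresh) // 2 fresh allocations instead of 4
}
```

The previous code did the opposite: allocate all `ncols` vectors upfront, then overwrite some with pooled ones, wasting the overwritten allocations.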
…emove dead code

Merge objects with many blocks easily exceed 128 entries
(entries = columns × blocks + 2×blocks + 3), causing append
doubling.  Increase initial capacity to 256 to cover most cases.

Also remove the unused 'buf' field (bytes.Buffer) and its sole
accessor Length() which was never called.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add DrainArenaPools() which atomically empties both arena free lists,
letting the GC reclaim the backing memory.  Called from the
post-checkpoint callback since checkpoints follow flush/merge cycles,
making it a natural idle point.

On a 16-core machine holding GOMAXPROCS large arenas of 128 MB each,
this can reclaim ~2 GB of RSS during idle periods.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sfer maps

Replace per-block TransferMap allocations with a single contiguous slab
allocation pooled via sync.Pool. Introduce TransferTable abstraction that
supports both slab-based (merger) and legacy maps-based (flush/debug) formats.

Key changes:
- Add TransferTable struct with GetBlockMap/Len/Release accessors
- Merger allocates flat slab from pool, uses blockIdx*stride+rowIdx indexing
- Reshaper creates TransferTable with Maps format (unchanged allocation)
- Replace InitTransferMaps/GetTransferMaps/SetTransferSlab with single
  SetTransferTable on MergeTaskHost interface
- Update all consumers (TN merge, CN merge, debug RPC) to use TransferTable
- Pool returns slab after each merge task completes (effective reuse)
- Remove ~125 lines of per-block pool infrastructure (tmNode, tmFreeList, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
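The slab layout replaces one map per block with a single flat array addressed as `blockIdx*stride+rowIdx`. A minimal sketch with illustrative types (the real TransferTable carries more state):

```go
package main

import "fmt"

type destPos struct{ blk, row uint32 }

// transferSlab is one contiguous allocation covering all blocks,
// replacing per-block map allocations.
type transferSlab struct {
	stride int // rows per block
	cells  []destPos
}

func newSlab(nblocks, stride int) *transferSlab {
	return &transferSlab{stride: stride, cells: make([]destPos, nblocks*stride)}
}

// at indexes the flat slab: blockIdx*stride+rowIdx.
func (s *transferSlab) at(blockIdx, rowIdx int) *destPos {
	return &s.cells[blockIdx*s.stride+rowIdx]
}

func main() {
	s := newSlab(3, 8192)
	*s.at(2, 10) = destPos{blk: 7, row: 42}
	fmt.Println(s.at(2, 10).row) // 42
}
```

Besides cutting the allocation count, the flat layout is friendlier to the GC (one pointer-free backing array instead of thousands of map buckets) and to the CPU cache in the hot transfer loop.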
The Execute defer released the TransferTable before PrepareCommit ran
phase-2 transfer, causing a panic on GetBlockMap with nil Maps/BlockActive.

Move release to mergeObjectsEntry.PrepareCommit (after both transfer phases)
and PrepareRollback (on abort), where the TransferTable is no longer needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert reshaper from per-block GetTransferMap allocations to a single
contiguous slab from the pool. Eliminates per-block nil check and
GetTransferMap call from the hot loop.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Large slabs (>4 MB) from big merges were being retained in sync.Pool,
causing 899 MB inuse_space (profile 071). Now putTransferSlab drops
slabs exceeding maxPoolSlabEntries, letting GC reclaim them promptly.
Small slabs (<= 4 MB) are still pooled for reuse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
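The capped put is a standard guard on `sync.Pool`: oversized items are dropped instead of pooled, so one huge merge cannot pin memory indefinitely. A sketch under assumed names (the real cap and entry type differ):

```go
package main

import (
	"fmt"
	"sync"
)

// maxPoolSlabEntries is an illustrative cap: ~4 MB of 8-byte entries.
const maxPoolSlabEntries = (4 << 20) / 8

var slabPool = sync.Pool{New: func() any { return []uint64(nil) }}

// putSlab pools small slabs for reuse but drops slabs over the cap,
// letting the GC reclaim them promptly.
func putSlab(s []uint64) bool {
	if cap(s) > maxPoolSlabEntries {
		return false // dropped: GC reclaims it
	}
	slabPool.Put(s[:0])
	return true
}

func main() {
	fmt.Println(putSlab(make([]uint64, 100)))                  // true: pooled
	fmt.Println(putSlab(make([]uint64, maxPoolSlabEntries+1))) // false: dropped
}
```

Without the cap, `sync.Pool` happily retains the largest slab ever produced, which is exactly the 899 MB retention the profile showed.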
Profile 072 showed 680 MB inuse_space from NewArena, with 94% from
FSinkerImpl.Sink. The GOMAXPROCS pool couldn't keep up with concurrent
merge + sinker + flush tasks. Doubling the pool reduces fresh allocations
during burst workloads.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e list

Replace sync.Pool-based transfer slab caching with off-heap allocation
via mpool.MakeSlice (C.calloc). This removes the slab from Go's heap
entirely, eliminating its contribution to pprof inuse_space and reducing
GC pressure.

Key changes:
- Allocate slabs off-heap via mpool.MakeSlice[TransferDestPos](offHeap=true)
- Maintain a capped free list (max 32 entries, max 4 MB each) with
  best-fit selection to reuse off-heap slabs without re-allocating
- Large slabs (>4 MB) are freed immediately via mpool.FreeSlice (C.free)
- Add DrainTransferSlabPool() called after checkpoint to reclaim idle RSS

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move WriteArena data and compressBuf allocations off the Go heap using
mpool.Alloc(offHeap=true), which routes through C.calloc/C.free. This
removes these buffers from Go's pprof inuse_space and eliminates their
GC scanning overhead.

Key changes:
- NewArena: allocate data via arenaMPool.Alloc(size, true)
- Reset: free old data via arenaMPool.Free before growing
- CompressBuf: free old buffer via arenaMPool.Free before growing
- Add FreeBuffers() to explicitly release off-heap allocations
- PutArena: call FreeBuffers() when dropping (pool full)
- DrainArenaPools: walk freed list and FreeBuffers() each arena

The Alloc() fallback (make when arena overflows) stays on Go heap —
these are small, temporary, and GC-collected naturally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…6MB)

Replace flat free list + mutex with two-tier lock-free CAS stacks,
matching the arena pool design. Slabs are quantized to tier capacity
for O(1) routing. Slabs exceeding 16 MB are freed immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oint)

Replace immediate DrainArenaPools() after every checkpoint with a
debounced timer.  During active operation (checkpoints every ~60s),
the timer resets before firing so pools stay warm — eliminating ~2000
needless alloc/free cycles per run.  Once activity ceases for 2 minutes,
the timer fires and off-heap RSS is reclaimed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Flush reads from appendable objects that will be merged away shortly.
Without SkipCacheReads, these reads pollute the memory/disk cache and
evict actively-queried data.  The merge path already had this policy
(mergeobjects.go:290); now flush matches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace raw unsafe.Pointer CAS operations with typed atomic.Pointer[T]
in both transferSlabBucket and arenaFreeList. Same lock-free semantics,
better type safety, no more unsafe import.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
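For reference, the lock-free free list is a Treiber stack; with Go 1.19+'s generic `atomic.Pointer[T]` the CAS loop needs no `unsafe` at all. A minimal sketch (node payload and names are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type node struct {
	next atomic.Pointer[node]
	val  int
}

// stack is a lock-free LIFO free list using typed atomic pointers
// instead of raw unsafe.Pointer CAS.
type stack struct{ head atomic.Pointer[node] }

func (s *stack) push(n *node) {
	for {
		old := s.head.Load()
		n.next.Store(old)
		if s.head.CompareAndSwap(old, n) {
			return
		}
	}
}

func (s *stack) pop() *node {
	for {
		old := s.head.Load()
		if old == nil {
			return nil // free list empty
		}
		if s.head.CompareAndSwap(old, old.next.Load()) {
			return old
		}
	}
}

func main() {
	var s stack
	s.push(&node{val: 1})
	s.push(&node{val: 2})
	fmt.Println(s.pop().val, s.pop().val, s.pop() == nil) // 2 1 true
}
```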
page.Marshal() returns *bytes.Buffer, not []byte. Use Len() and
Bytes() methods instead of len() builtin and direct assignment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NewNoSparse() uses probabilistic registers from the start, giving
~2024 instead of exact 2048 for small cardinalities. Revert to
New() which uses sparse representation for exact counts when the
cardinality is below the sparse threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tables with fake PK (__mo_fake_pk_col) had approximate NDV from HLL
estimation instead of exact totalRows. This is because fake PK has
SortKey=false, so HasPK() returns false and SetPKNdv was never called.

Add SetFakePK() to BlockWriter that marks the fake PK column index.
In Sync(), SetPKNdv is called for the fake PK column with totalRows.
Add HasFakePK() to Schema for explicit fake PK detection.
Update flush (flushTableTail, flushobj) and merge (mergeobjects) paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

do-not-merge/wip kind/enhancement kind/refactor Code refactor size/XL Denotes a PR that changes [1000, 1999] lines
