
optimize: reduce heap allocations and memory footprint#23957

Open
aunjgr wants to merge 65 commits into matrixorigin:main from aunjgr:memopt

Conversation

Contributor

@aunjgr aunjgr commented Mar 25, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #23846

What this PR does / why we need it:

Key optimizations to reduce GC pressure and improve memory efficiency:

  1. Primitives (pkg/container/types):

    • Rewrite ParseDateCast to use stack-based accumulators instead of heap-allocated strings/builders.
  2. Storage Layer (pkg/objectio):

    • Implement buffer reuse for compression operations (lz4) in object writers.
    • Reuse writer instances to minimize allocations per block.
  3. Execution Layer (pkg/sql/colexec):

    • Adopt off-heap vectors and batches for aggregation states and join operations.
    • This moves significant temporary data structures out of the Go heap, reducing GC overhead during complex queries.
  4. Dependencies:

    • Upgrade hyperloglog library for improved memory efficiency.

For TPC-H 100 GB data loading, total allocated memory is reduced from 882 GB to 101 GB.
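To illustrate point 1, here is a minimal sketch of the stack-based accumulator approach: digits are folded into integer fields as the input is scanned, so no intermediate strings or builders are heap-allocated. `parseYMD` is a hypothetical simplification, not the actual `ParseDateCast` code.

```go
package main

import "fmt"

// parseYMD accumulates digits directly into stack-allocated integer
// fields while scanning, avoiding heap-allocated substrings/builders.
func parseYMD(s string) (year, month, day int, ok bool) {
	fields := [3]int{} // stack-based accumulators
	idx := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		switch {
		case c >= '0' && c <= '9':
			fields[idx] = fields[idx]*10 + int(c-'0')
		case c == '-':
			if idx >= 2 {
				return 0, 0, 0, false
			}
			idx++
		default:
			return 0, 0, 0, false
		}
	}
	if idx != 2 {
		return 0, 0, 0, false
	}
	return fields[0], fields[1], fields[2], true
}

func main() {
	y, m, d, ok := parseYMD("2026-03-25")
	fmt.Println(y, m, d, ok) // 2026 3 25 true
}
```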

aunjgr and others added 30 commits March 27, 2026 19:17
ObjectBuffer.Write uses append to grow the Entries slice from zero
capacity (1.12 GB in profile 046).  A typical object has ~100 entries
(N_blocks × N_cols + BFs + ZMs + header + meta + footer), so 7
append-doublings occur per object.  Pre-sizing to 128 eliminates
most of these intermediate allocations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
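The pre-sizing fix amounts to giving the slice an initial capacity that covers the typical entry count, so `append` never has to reallocate. A minimal sketch with illustrative names (the real `ObjectBuffer` type differs):

```go
package main

import "fmt"

type entry struct{ data []byte }

// newEntries pre-sizes to 128 so a typical object (~100 entries)
// never triggers append-doubling reallocations.
func newEntries() []entry {
	return make([]entry, 0, 128)
}

func main() {
	entries := newEntries()
	for i := 0; i < 100; i++ {
		entries = append(entries, entry{})
	}
	// Capacity is unchanged: no intermediate allocations occurred.
	fmt.Println(len(entries), cap(entries)) // 100 128
}
```

Without the pre-size, growing from zero capacity takes roughly log2(100) ≈ 7 doublings per object, each leaving a garbage intermediate slice behind.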
…bucket

Profile 046 shows 80,965 GetTransferMap calls with 100% pool miss
rate (4.93 GB, all from fresh make). During heavy merge bursts the
pool drains faster than returns, leaving it permanently empty.

Increasing tmMaxFree from 64 to 256 per bucket (8 buckets = 2048
total pooled maps) gives a deeper recycling buffer to absorb bursts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace single-tier arena free list with two tiers:
- ArenaSmall (≤16 MB): flush tasks, sinkers, general I/O
- ArenaLarge (≤128 MB): merge/compaction tasks

Each tier gets GOMAXPROCS/2 slots, matching the merge and flush worker
pool sizes. Small arenas have a sizeLimit that prevents Reset from
growing them beyond 16 MB, so flush arenas never inflate to 128 MB.

Without tiers, all arenas converge to 128 MB because small callers
(flush) share the pool with large callers (merge). On a 32-core
machine this wastes 4 GB permanent RSS; with tiers it drops to ~2.3 GB.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
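The tier routing described above can be sketched as a simple size-hint dispatch; the constants match the commit message, but the function and its signature are illustrative, not the actual pool API:

```go
package main

import "fmt"

const (
	arenaSmall = 16 << 20  // 16 MB cap: flush tasks, sinkers, general I/O
	arenaLarge = 128 << 20 // 128 MB cap: merge/compaction tasks
)

// pickTier routes a caller to a tier by its expected working-set size,
// so small callers never inflate a shared arena to the large cap.
func pickTier(sizeHint int) int {
	if sizeHint <= arenaSmall {
		return arenaSmall
	}
	return arenaLarge
}

func main() {
	fmt.Println(pickTier(4<<20) == arenaSmall)  // flush-sized request
	fmt.Println(pickTier(64<<20) == arenaLarge) // merge-sized request
}
```

The sizeLimit on the small tier is what stops Reset from growing a flush arena past 16 MB; without it, all pooled arenas eventually converge to the large cap.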
CN merge tasks allocated TransferMaps via GetTransferMap but never
returned them to the pool, causing ongoing pool misses. Add
CleanTransMapping call in Release() to match TN merge behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove 6 unused symbols:
- MedianDecimal64, MedianDecimal128, MedianNumeric (superseded by
  medianNumericFromState/medianDecimal*FromState in median2.go)
- GetMinAggregatorsChunkSize, getChunkSizeOfAggregator (no callers)
- NewEmptyVectors (no callers)
- SpecialAggValuesString (no callers, also removes fmt/strings imports)

178 lines deleted, tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Sinker accumulates data from multiple WriteBatch calls before
syncing. The CN S3 writer path goes through the Sinker and processes
large data volumes that exceed the 16 MB ArenaSmall cap, causing
constant make() overflow in WriteArena.Alloc (51+ GB in profile 056).

Flush jobs already create their own ArenaSmall directly via
NewBlockWriterWithArena, so this change only affects the Sinker path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ArenaLarge is now shared between TN merge workers and CN S3 writers.
The previous GOMAXPROCS/2 capacity was sized for merge workers only,
causing CN S3 writer arenas to be discarded and re-created cold.
Profile 057 showed 150 arena warmup allocations (5.2 GB of Reset).
Doubling the pool capacity keeps more warm arenas available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… MB)

Fresh arenas previously started at size 0, requiring 7-8 geometric
growth steps (0→1→2→4→8→16→32→64→128 MB) to reach steady state.
Each step triggers a Reset allocation.  Pre-warming skips most of
these steps, reducing Reset+Alloc overhead (~1.88 GB in profile 058).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
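The saving can be checked with a little arithmetic on the growth chain above. Assuming a 16 MB pre-warm size (illustrative; the truncated subject hides the exact value), counting doubling steps to the 128 MB steady state:

```go
package main

import "fmt"

// growthSteps counts the geometric growth steps (each one a Reset
// allocation) from a starting size to the target steady-state size.
func growthSteps(startMB, targetMB int) int {
	steps := 0
	size := startMB
	if size == 0 {
		size = 1 // 0 -> 1 MB is the first step
		steps++
	}
	for size < targetMB {
		size *= 2
		steps++
	}
	return steps
}

func main() {
	fmt.Println(growthSteps(0, 128))  // 8 Resets from a cold arena
	fmt.Println(growthSteps(16, 128)) // 3 Resets when pre-warmed to 16 MB
}
```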
FetchWithSchema previously called NewWithSchema which allocates ALL
column vectors upfront, then immediately replaced some with pooled
vectors — wasting the freshly allocated ones.  Now we create the
batch shell with nil vectors, fill from the pool first, and only
allocate fresh vectors for unfilled positions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
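The fill-order fix can be sketched as follows, with hypothetical types standing in for the real vector/batch machinery: start from a shell of nil vectors, take pooled vectors first, and only allocate for positions the pool could not fill.

```go
package main

import "fmt"

type vec struct{ fresh bool }

// fetchBatch builds a batch shell with nil vectors, fills from the
// pool first, and allocates fresh vectors only for unfilled slots.
func fetchBatch(ncols int, pool []*vec) []*vec {
	vecs := make([]*vec, ncols) // shell: all nil, nothing allocated yet
	for i := 0; i < ncols && i < len(pool); i++ {
		vecs[i] = pool[i] // reuse a pooled vector
	}
	for i := range vecs {
		if vecs[i] == nil {
			vecs[i] = &vec{fresh: true} // allocate only what is missing
		}
	}
	return vecs
}

func main() {
	pool := []*vec{{}, {}}
	fresh := 0
	for _, v := range fetchBatch(4, pool) {
		if v.fresh {
			fresh++
		}
	}
	fmt.Println(fresh) // 2 fresh allocations instead of 4
}
```

The previous code did the opposite: allocate all `ncols` vectors upfront, then overwrite some with pooled ones, wasting the overwritten allocations.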
…emove dead code

Merge objects with many blocks easily exceed 128 entries
(entries = columns × blocks + 2×blocks + 3), causing append
doubling.  Increase initial capacity to 256 to cover most cases.

Also remove the unused 'buf' field (bytes.Buffer) and its sole
accessor Length() which was never called.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add DrainArenaPools() which atomically empties both arena free lists,
letting the GC reclaim the backing memory.  Called from the
post-checkpoint callback since checkpoints follow flush/merge cycles,
making it a natural idle point.

On a 16-core machine holding GOMAXPROCS large arenas of 128 MB each,
this can reclaim ~2 GB of RSS during idle periods.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sfer maps

Replace per-block TransferMap allocations with a single contiguous slab
allocation pooled via sync.Pool. Introduce TransferTable abstraction that
supports both slab-based (merger) and legacy maps-based (flush/debug) formats.

Key changes:
- Add TransferTable struct with GetBlockMap/Len/Release accessors
- Merger allocates flat slab from pool, uses blockIdx*stride+rowIdx indexing
- Reshaper creates TransferTable with Maps format (unchanged allocation)
- Replace InitTransferMaps/GetTransferMaps/SetTransferSlab with single
  SetTransferTable on MergeTaskHost interface
- Update all consumers (TN merge, CN merge, debug RPC) to use TransferTable
- Pool returns slab after each merge task completes (effective reuse)
- Remove ~125 lines of per-block pool infrastructure (tmNode, tmFreeList, etc.)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
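The slab layout replaces one map per block with a single flat array addressed as `blockIdx*stride+rowIdx`. A minimal sketch with illustrative types (the real TransferTable carries more state):

```go
package main

import "fmt"

type destPos struct{ blk, row uint32 }

// transferSlab is one contiguous allocation covering all blocks,
// replacing per-block map allocations.
type transferSlab struct {
	stride int // rows per block
	cells  []destPos
}

func newSlab(nblocks, stride int) *transferSlab {
	return &transferSlab{stride: stride, cells: make([]destPos, nblocks*stride)}
}

// at indexes the flat slab: blockIdx*stride+rowIdx.
func (s *transferSlab) at(blockIdx, rowIdx int) *destPos {
	return &s.cells[blockIdx*s.stride+rowIdx]
}

func main() {
	s := newSlab(3, 8192)
	*s.at(2, 10) = destPos{blk: 7, row: 42}
	fmt.Println(s.at(2, 10).row) // 42
}
```

Besides cutting the allocation count, the flat layout is friendlier to the GC (one pointer-free backing array instead of thousands of map buckets) and to the CPU cache in the hot transfer loop.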
The Execute defer released the TransferTable before PrepareCommit ran
phase-2 transfer, causing a panic on GetBlockMap with nil Maps/BlockActive.

Move release to mergeObjectsEntry.PrepareCommit (after both transfer phases)
and PrepareRollback (on abort), where the TransferTable is no longer needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Convert reshaper from per-block GetTransferMap allocations to a single
contiguous slab from the pool. Eliminates per-block nil check and
GetTransferMap call from the hot loop.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Large slabs (>4 MB) from big merges were being retained in sync.Pool,
causing 899 MB inuse_space (profile 071). Now putTransferSlab drops
slabs exceeding maxPoolSlabEntries, letting GC reclaim them promptly.
Small slabs (<= 4 MB) are still pooled for reuse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
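The capped put is a standard guard on `sync.Pool`: oversized items are dropped instead of pooled, so one huge merge cannot pin memory indefinitely. A sketch under assumed names (the real cap and entry type differ):

```go
package main

import (
	"fmt"
	"sync"
)

// maxPoolSlabEntries is an illustrative cap: ~4 MB of 8-byte entries.
const maxPoolSlabEntries = (4 << 20) / 8

var slabPool = sync.Pool{New: func() any { return []uint64(nil) }}

// putSlab pools small slabs for reuse but drops slabs over the cap,
// letting the GC reclaim them promptly.
func putSlab(s []uint64) bool {
	if cap(s) > maxPoolSlabEntries {
		return false // dropped: GC reclaims it
	}
	slabPool.Put(s[:0])
	return true
}

func main() {
	fmt.Println(putSlab(make([]uint64, 100)))                  // true: pooled
	fmt.Println(putSlab(make([]uint64, maxPoolSlabEntries+1))) // false: dropped
}
```

Without the cap, `sync.Pool` happily retains the largest slab ever produced, which is exactly the 899 MB retention the profile showed.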
Profile 072 showed 680 MB inuse_space from NewArena, with 94% from
FSinkerImpl.Sink. The GOMAXPROCS pool couldn't keep up with concurrent
merge + sinker + flush tasks. Doubling the pool reduces fresh allocations
during burst workloads.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e list

Replace sync.Pool-based transfer slab caching with off-heap allocation
via mpool.MakeSlice (C.calloc). This removes the slab from Go's heap
entirely, eliminating its contribution to pprof inuse_space and reducing
GC pressure.

Key changes:
- Allocate slabs off-heap via mpool.MakeSlice[TransferDestPos](offHeap=true)
- Maintain a capped free list (max 32 entries, max 4 MB each) with
  best-fit selection to reuse off-heap slabs without re-allocating
- Large slabs (>4 MB) are freed immediately via mpool.FreeSlice (C.free)
- Add DrainTransferSlabPool() called after checkpoint to reclaim idle RSS

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move WriteArena data and compressBuf allocations off the Go heap using
mpool.Alloc(offHeap=true), which routes through C.calloc/C.free. This
removes these buffers from Go's pprof inuse_space and eliminates their
GC scanning overhead.

Key changes:
- NewArena: allocate data via arenaMPool.Alloc(size, true)
- Reset: free old data via arenaMPool.Free before growing
- CompressBuf: free old buffer via arenaMPool.Free before growing
- Add FreeBuffers() to explicitly release off-heap allocations
- PutArena: call FreeBuffers() when dropping (pool full)
- DrainArenaPools: walk freed list and FreeBuffers() each arena

The Alloc() fallback (make when arena overflows) stays on Go heap —
these are small, temporary, and GC-collected naturally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…6MB)

Replace flat free list + mutex with two-tier lock-free CAS stacks,
matching the arena pool design. Slabs are quantized to tier capacity
for O(1) routing. Slabs exceeding 16 MB are freed immediately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oint)

Replace immediate DrainArenaPools() after every checkpoint with a
debounced timer.  During active operation (checkpoints every ~60s),
the timer resets before firing so pools stay warm — eliminating ~2000
needless alloc/free cycles per run.  Once activity ceases for 2 minutes,
the timer fires and off-heap RSS is reclaimed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Flush reads from appendable objects that will be merged away shortly.
Without SkipCacheReads, these reads pollute the memory/disk cache and
evict actively-queried data.  The merge path already had this policy
(mergeobjects.go:290); now flush matches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace raw unsafe.Pointer CAS operations with typed atomic.Pointer[T]
in both transferSlabBucket and arenaFreeList. Same lock-free semantics,
better type safety, no more unsafe import.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
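For reference, the lock-free free list is a Treiber stack; with Go 1.19+'s generic `atomic.Pointer[T]` the CAS loop needs no `unsafe` at all. A minimal sketch (node payload and names are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type node struct {
	next atomic.Pointer[node]
	val  int
}

// stack is a lock-free LIFO free list using typed atomic pointers
// instead of raw unsafe.Pointer CAS.
type stack struct{ head atomic.Pointer[node] }

func (s *stack) push(n *node) {
	for {
		old := s.head.Load()
		n.next.Store(old)
		if s.head.CompareAndSwap(old, n) {
			return
		}
	}
}

func (s *stack) pop() *node {
	for {
		old := s.head.Load()
		if old == nil {
			return nil // free list empty
		}
		if s.head.CompareAndSwap(old, old.next.Load()) {
			return old
		}
	}
}

func main() {
	var s stack
	s.push(&node{val: 1})
	s.push(&node{val: 2})
	fmt.Println(s.pop().val, s.pop().val, s.pop() == nil) // 2 1 true
}
```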
page.Marshal() returns *bytes.Buffer, not []byte. Use Len() and
Bytes() methods instead of len() builtin and direct assignment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
NewNoSparse() uses probabilistic registers from the start, giving
~2024 instead of exact 2048 for small cardinalities. Revert to
New() which uses sparse representation for exact counts when the
cardinality is below the sparse threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tables with fake PK (__mo_fake_pk_col) had approximate NDV from HLL
estimation instead of exact totalRows. This is because fake PK has
SortKey=false, so HasPK() returns false and SetPKNdv was never called.

Add SetFakePK() to BlockWriter that marks the fake PK column index.
In Sync(), SetPKNdv is called for the fake PK column with totalRows.
Add HasFakePK() to Schema for explicit fake PK detection.
Update flush (flushTableTail, flushobj) and merge (mergeobjects) paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Labels

do-not-merge/wip kind/enhancement kind/refactor Code refactor size/XL Denotes a PR that changes [1000, 1999] lines
