Skip to content

feat(python): add target_partition_size param to train_ivf, deprecate num_partitions#6287

Open
hushengquan wants to merge 1 commit intolance-format:mainfrom
hushengquan:dev
Open

feat(python): add target_partition_size param to train_ivf, deprecate num_partitions#6287
hushengquan wants to merge 1 commit intolance-format:mainfrom
hushengquan:dev

Conversation

@hushengquan
Copy link
Copy Markdown
Contributor

close #6286

  • Add target_partition_size as the preferred way to specify IVF partitioning,
    consistent with the Rust recommended_num_partitions logic (clamped to [1, 4096]).
  • Mark num_partitions as deprecated with a DeprecationWarning.
  • Propagate the new param through prepare_global_ivf_pq.
  • Make num_partitions and target_partition_size mutually exclusive.

@github-actions github-actions bot added enhancement New feature or request python labels Mar 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review: feat(python): add target_partition_size param to train_ivf

P0: Missing tests

Per project standards, all features must have corresponding tests. This PR needs tests for at least:

  • target_partition_size computing the correct num_partitions
  • Mutual exclusivity error when both params are set
  • target_partition_size=0 (see bug below)
  • Edge cases: very small datasets (fewer rows than target size), very large datasets (hitting the 4096 cap)

P1: Division by zero when target_partition_size=0

In _determine_num_partitions, num_rows // target_partition_size will raise ZeroDivisionError if target_partition_size=0. Add an early validation:

if target_partition_size is not None and target_partition_size <= 0:
    raise ValueError("target_partition_size must be a positive integer")

P1: Duplicate logic — reuse existing utility

python/python/lance/util.py:251 already has _target_partition_size_to_num_partitions which does the same thing. Rather than duplicating the clamping logic inline, reuse that utility.

However, the existing utility has a bug on line 257:

return max(1, num_partitions, 4096)  # bug: always returns >= 4096

This should be max(1, min(num_partitions, 4096)) — which is what your inline version correctly does. Consider fixing the utility and reusing it from both call sites.

Minor

  • The num_partitions parameter in prepare_global_ivf_pq was changed from positional to keyword-only default (Optional[int] = None). This is a breaking change for callers passing it positionally. Verify no downstream code calls prepare_global_ivf_pq(ds, 128, 16, ...).

🤖 Generated with Claude Code

@hushengquan hushengquan force-pushed the dev branch 2 times, most recently from 3cc9d97 to 4b0b9cb Compare March 25, 2026 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

python: train_ivf() should accept target_partition_size instead of deprecated num_partitions

2 participants