feat: support hamming clustering by jackye1995 · Pull Request #6265 · lance-format/lance

jackye1995 · 2026-03-24T04:31:08Z

Add support for SIMD-accelerated pairwise Hamming distance computation, and the ability to compute clusters of binary vectors that are within a given Hamming distance threshold of each other. Vectors in the same cluster are considered similar, or potentially duplicated, vectors of the original representation.

Also expose the feature in Python for easy consumption.
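To make the description above concrete, here is a minimal scalar sketch of Hamming distance and threshold-based grouping. It is not the PR's implementation: the SIMD kernels are not shown, and the function names `hamming` and `cluster`, the greedy "first unassigned vector becomes the representative" strategy, and the choice to emit singleton clusters are all assumptions made for illustration.

```rust
/// Hamming distance between two equal-length binary vectors packed as bytes.
/// Scalar version: XOR each byte pair and count the set bits.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Hypothetical greedy clustering (an assumption, not the PR's algorithm):
/// each not-yet-assigned vector becomes a representative and absorbs every
/// later unassigned vector whose distance to it is <= `threshold`.
/// Returns (representative index, duplicate indices) pairs.
fn cluster(vectors: &[Vec<u8>], threshold: u32) -> Vec<(usize, Vec<usize>)> {
    let mut assigned = vec![false; vectors.len()];
    let mut clusters = Vec::new();
    for i in 0..vectors.len() {
        if assigned[i] {
            continue;
        }
        assigned[i] = true;
        let mut duplicates = Vec::new();
        for j in (i + 1)..vectors.len() {
            if !assigned[j] && hamming(&vectors[i], &vectors[j]) <= threshold {
                assigned[j] = true;
                duplicates.push(j);
            }
        }
        clusters.push((i, duplicates));
    }
    clusters
}

fn main() {
    let vectors = vec![
        vec![0b1111_0000u8, 0b0000_1111],
        vec![0b1111_0001u8, 0b0000_1111], // differs from vector 0 in 1 bit
        vec![0b0000_1111u8, 0b1111_0000], // differs from vector 0 in 16 bits
    ];
    for (representative, duplicates) in cluster(&vectors, 2) {
        println!("representative {representative}: duplicates {duplicates:?}");
    }
}
```

With a threshold of 2, vector 1 is grouped under vector 0, while vector 2 forms its own cluster.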

github-actions · 2026-03-24T04:32:39Z

PR Review: feat: support hamming clustering

P1: `total_rows` silently drops fragments

let total_rows: usize = dataset
    .get_fragments()
    .iter()
    .filter_map(|f| f.metadata().physical_rows)  // silently skips None
    .sum();

If any fragment has `physical_rows = None`, the total is undercounted. This can lead to incorrect sampling behavior (sampling more than intended, or `use_sampling` being wrong). Consider using `dataset.count_rows(None).await?` instead.
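The undercount described in this review comment can be reproduced without Lance at all. The sketch below models each fragment's `physical_rows` as a plain `Option<usize>`; the function names `total_rows_lenient` and `total_rows_strict` are hypothetical. The strict variant relies on the standard library's `Sum` implementation for `Option`, which yields `None` as soon as any element is `None`.

```rust
/// Lenient total, mirroring the reviewed code: fragments with an unknown
/// row count are skipped, so the result can silently undercount.
fn total_rows_lenient(physical_rows: &[Option<usize>]) -> usize {
    physical_rows.iter().filter_map(|rows| *rows).sum()
}

/// Strict total: returns None if any fragment's row count is unknown,
/// so the caller is forced to fall back to an exact count.
fn total_rows_strict(physical_rows: &[Option<usize>]) -> Option<usize> {
    physical_rows.iter().copied().sum()
}

fn main() {
    let fragments: [Option<usize>; 3] = [Some(1_000), None, Some(500)];
    println!("lenient total = {}", total_rows_lenient(&fragments));
    match total_rows_strict(&fragments) {
        Some(total) => println!("strict total = {total}"),
        None => println!("a fragment has no row count; use an exact count instead"),
    }
}
```

Here the lenient total reports 1500 even though one fragment's size is unknown, which is the situation the reviewer suggests avoiding by calling `dataset.count_rows(None).await?`.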

codecov · 2026-03-24T05:02:27Z

Codecov Report

❌ Patch coverage is 80.92158% with 236 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-linalg/src/distance/hamming.rs	85.44%	117 Missing and 6 partials ⚠️
rust/lance/src/index/vector/hamming.rs	71.17%	94 Missing and 19 partials ⚠️


- Change return type from dict/struct to Box<dyn RecordBatchReader + Send>
- Output schema: representative (uint64), duplicates (list<uint64>)
- ClusteringResult::into_reader() yields batches of 10k clusters
- Rename hamming_cluster_hashes -> hamming_clustering_from_hashes
- Log timing info via tracing instead of returning in struct
- Python bindings return pa.RecordBatchReader

Generated with [Claude Code](https://claude.ai/code) via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
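The batching behaviour mentioned in this commit message (a reader that yields clusters in batches of 10k) can be sketched with plain vectors. This is only an illustration: the `Cluster` struct and `into_batches` function are hypothetical stand-ins, and a real implementation would build Arrow record batches with `representative` and `duplicates` columns rather than `Vec<Cluster>`.

```rust
/// Stand-in for one output row: a representative row ID and its duplicates.
#[derive(Debug, Clone, PartialEq)]
struct Cluster {
    representative: u64,
    duplicates: Vec<u64>,
}

/// Split clusters into consecutive batches of at most `batch_size` entries.
/// The last batch may be smaller.
fn into_batches(clusters: Vec<Cluster>, batch_size: usize) -> Vec<Vec<Cluster>> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut batches = Vec::new();
    let mut iter = clusters.into_iter().peekable();
    while iter.peek().is_some() {
        // Take up to `batch_size` clusters without consuming the iterator itself.
        batches.push(iter.by_ref().take(batch_size).collect());
    }
    batches
}

fn main() {
    let clusters: Vec<Cluster> = (0..25u64)
        .map(|i| Cluster { representative: i, duplicates: vec![i + 1_000] })
        .collect();
    // A batch size of 10 is used here only to keep the example small.
    for (index, batch) in into_batches(clusters, 10).iter().enumerate() {
        println!("batch {index}: {} clusters", batch.len());
    }
}
```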

Use take_rows(), which returns the _rowid column, instead of using positional indices from sample() as row IDs. This ensures the cluster results contain actual row IDs that can be used for downstream operations like deleting duplicates.

- Fix sampling path to request _rowid column explicitly in take_rows projection
- Add integration tests for IVF partition clustering and sampled clustering
- Remove .unwrap() in Python binding closures, use ? operator
- Change to_record_batch to into_record_batch to avoid cloning


take() accepts positional indices (0 to num_rows-1), while take_rows() expects internal row IDs. The sample() function returns positional indices.
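The distinction in this commit message is easy to blur, so here is a small sketch of why a positional index is not interchangeable with a row ID. It assumes a row-ID layout with the fragment ID in the upper 32 bits and the row offset within the fragment in the lower 32 bits, and it ignores deleted rows; both are simplifying assumptions. The helper names `row_id` and `position_to_row_id` are hypothetical.

```rust
/// Assumed row-ID layout: fragment ID in the upper 32 bits,
/// row offset within the fragment in the lower 32 bits.
fn row_id(fragment_id: u32, offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | offset as u64
}

/// Map a dataset-wide positional index to a row ID, given the fragments in
/// dataset order as (fragment_id, num_rows) pairs. Deletions are ignored.
fn position_to_row_id(fragments: &[(u32, u32)], mut position: u64) -> Option<u64> {
    for &(fragment_id, num_rows) in fragments {
        if position < num_rows as u64 {
            return Some(row_id(fragment_id, position as u32));
        }
        position -= num_rows as u64;
    }
    None // position is past the end of the dataset
}

fn main() {
    // Two fragments: ID 0 with 100 rows, ID 3 with 50 rows.
    let fragments = [(0u32, 100u32), (3u32, 50u32)];
    let position = 120u64;
    match position_to_row_id(&fragments, position) {
        Some(id) => println!("position {position} -> row ID {id}"),
        None => println!("position {position} is out of range"),
    }
}
```

Position 120 falls at offset 20 of fragment 3, so under this layout its row ID is far larger than 120. Passing the position itself where a row ID is expected would address a different row, or none at all.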

- Rename hamming_clustering_sampled to hamming_clustering_for_sample for consistency with hamming_clustering_for_ivf_partition naming convention
- Add hamming_clustering_for_range function that reads a contiguous range of rows from a specific fragment and performs hamming clustering
- This is useful for distributed processing where each worker handles a specific range of a fragment
- Add Python bindings and type stubs for both functions
- Add Rust tests for the new hamming_clustering_for_range function
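For the distributed use case described in this commit message, a coordinator needs to carve a fragment into contiguous row ranges, one per worker. The sketch below shows one way to do that. The `split_ranges` helper is hypothetical and is not part of the PR; how the resulting ranges are passed to `hamming_clustering_for_range` is not shown because its signature does not appear in the source.

```rust
use std::ops::Range;

/// Split `num_rows` rows of one fragment into at most `num_workers`
/// contiguous, non-overlapping ranges whose sizes differ by at most one row.
fn split_ranges(num_rows: u64, num_workers: u64) -> Vec<Range<u64>> {
    assert!(num_workers > 0, "need at least one worker");
    let base = num_rows / num_workers;
    let extra = num_rows % num_workers; // the first `extra` workers get one more row
    let mut ranges = Vec::new();
    let mut start = 0;
    for worker in 0..num_workers {
        let len = base + if worker < extra { 1 } else { 0 };
        if len == 0 {
            break; // more workers than rows: remaining workers get nothing
        }
        ranges.push(start..start + len);
        start += len;
    }
    ranges
}

fn main() {
    for (worker, range) in split_ranges(10, 3).iter().enumerate() {
        println!("worker {worker}: rows {}..{}", range.start, range.end);
    }
}
```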

feat: support hamming clustering

8c69ce8

github-actions bot added the enhancement and python labels Mar 24, 2026

jackye1995 and others added 6 commits March 23, 2026 22:04

fix: escape angle brackets in rustdoc comments

872ad24


jackye1995 wants to merge 7 commits into lance-format:main from jackye1995:hamming
