feat: support hamming clustering#6265

Open
jackye1995 wants to merge 7 commits intolance-format:mainfrom
jackye1995:hamming
@jackye1995
Contributor

Add support for SIMD-accelerated pairwise Hamming distance computation, and the ability to compute clusters of binary vectors that are within a given Hamming distance threshold. Such vectors are considered similar or potentially duplicated vectors of the original representation.

Also expose the feature in Python for easy consumption.
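To make the description concrete, here is a minimal sketch of Hamming distance over byte-packed binary vectors and a threshold-based grouping. The function names `hamming_distance` and `greedy_cluster` are hypothetical, and the greedy first-fit strategy is an assumption chosen for illustration; it is not necessarily the algorithm or the SIMD code path used in this PR.

```rust
/// Hamming distance between two equal-length binary vectors packed into bytes.
/// XOR leaves a 1 bit wherever the inputs differ; `count_ones` tallies them.
fn hamming_distance(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Hypothetical greedy clustering (an illustration, not the PR's method):
/// each vector joins the first cluster whose representative is within
/// `threshold`; otherwise it becomes the representative of a new cluster.
/// Returns (representative index, duplicate indices) pairs.
fn greedy_cluster(vectors: &[Vec<u8>], threshold: u32) -> Vec<(usize, Vec<usize>)> {
    let mut clusters: Vec<(usize, Vec<usize>)> = Vec::new();
    for (i, v) in vectors.iter().enumerate() {
        // Find the first existing cluster that is close enough.
        let hit = clusters
            .iter()
            .position(|(rep, _)| hamming_distance(&vectors[*rep], v) <= threshold);
        match hit {
            Some(c) => clusters[c].1.push(i),
            None => clusters.push((i, Vec::new())),
        }
    }
    clusters
}
```

With a threshold of 1, the one-byte vectors `0x00`, `0x01`, `0xFF`, `0xFE` would fall into two clusters under this sketch: index 0 with duplicate 1, and index 2 with duplicate 3.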

@github-actions github-actions bot added the enhancement (New feature or request) and python labels Mar 24, 2026
@github-actions
Contributor

github-actions bot commented Mar 24, 2026

PR Review: feat: support hamming clustering

P1: total_rows silently drops fragments

let total_rows: usize = dataset
    .get_fragments()
    .iter()
    .filter_map(|f| f.metadata().physical_rows)  // silently skips None
    .sum();

If any fragment has physical_rows = None, the total is undercounted. This can lead to incorrect sampling behavior (sampling more than intended, or use_sampling being wrong). Consider using dataset.count_rows(None).await? instead.
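The undercount can be reproduced without Lance at all. The sketch below uses a hypothetical `FragmentMeta` stand-in (only the `physical_rows` field is modeled) to contrast the reviewed `filter_map` pattern with a strict sum that reports a missing count instead of hiding it.

```rust
/// Stand-in for fragment metadata; only the field discussed above is modeled.
struct FragmentMeta {
    physical_rows: Option<usize>,
}

/// Mirrors the reviewed pattern: fragments with `None` vanish from the sum.
fn total_rows_lossy(frags: &[FragmentMeta]) -> usize {
    frags.iter().filter_map(|f| f.physical_rows).sum()
}

/// Strict alternative: summing `Option`s yields `None` as soon as any
/// fragment lacks a count, so the caller can fall back to an exact count.
fn total_rows_strict(frags: &[FragmentMeta]) -> Option<usize> {
    frags.iter().map(|f| f.physical_rows).sum()
}
```

For fragments with counts `Some(100)`, `None`, `Some(50)`, the lossy version returns 150 while the strict version returns `None`, which is the signal to call an exact counting routine instead.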

@codecov

codecov bot commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 80.92158% with 236 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-linalg/src/distance/hamming.rs | 85.44% | 117 Missing and 6 partials ⚠️ |
| rust/lance/src/index/vector/hamming.rs | 71.17% | 94 Missing and 19 partials ⚠️ |

jackye1995 and others added 6 commits March 23, 2026 22:04
- Change return type from dict/struct to Box<dyn RecordBatchReader + Send>
- Output schema: representative (uint64), duplicates (list<uint64>)
- ClusteringResult::into_reader() yields batches of 10k clusters
- Rename hamming_cluster_hashes -> hamming_clustering_from_hashes
- Log timing info via tracing instead of returning in struct
- Python bindings return pa.RecordBatchReader
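
As a rough illustration of the output shape listed above, the sketch below models one cluster row and yields clusters in batches of 10,000. The `Cluster` struct and `batches` helper are hypothetical and use only the standard library; the actual change returns Arrow record batches through a `RecordBatchReader`, which is not reproduced here.

```rust
/// One cluster row, mirroring the described schema:
/// representative (uint64), duplicates (list<uint64>).
#[derive(Debug, Clone, PartialEq)]
struct Cluster {
    representative: u64,
    duplicates: Vec<u64>,
}

/// Batch size taken from the commit message ("batches of 10k clusters").
const BATCH_SIZE: usize = 10_000;

/// Lazily yields slices of at most `batch_size` clusters.
/// Note: `chunks` panics if `batch_size` is zero.
fn batches<'a>(
    clusters: &'a [Cluster],
    batch_size: usize,
) -> impl Iterator<Item = &'a [Cluster]> + 'a {
    clusters.chunks(batch_size)
}
```

Under this sketch, 25,000 clusters passed with `BATCH_SIZE` would come back as three batches of 10,000, 10,000, and 5,000 rows.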

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Use take_rows() which returns _rowid column, instead of using
positional indices from sample() as row IDs. This ensures the
cluster results contain actual row IDs that can be used for
downstream operations like deleting duplicates.

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
- Fix sampling path to request _rowid column explicitly in take_rows projection
- Add integration tests for IVF partition clustering and sampled clustering
- Remove .unwrap() in Python binding closures, use ? operator
- Change to_record_batch to into_record_batch to avoid cloning

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
take() accepts positional indices (0 to num_rows-1), while take_rows()
expects internal ROW IDs. The sample() function returns positional indices.
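
The difference between a positional index and a row ID can be sketched as follows. This assumes a row-address layout with the fragment ID in the upper 32 bits and the row offset within the fragment in the lower 32 bits; that layout, and the helper names `row_address` and `position_to_row_address`, are assumptions for illustration and are not taken from this PR.

```rust
/// Assumed layout (not confirmed by this PR): fragment ID in the upper
/// 32 bits, row offset within the fragment in the lower 32 bits.
fn row_address(fragment_id: u32, offset: u32) -> u64 {
    ((fragment_id as u64) << 32) | offset as u64
}

/// Maps a dataset-wide position (0 to num_rows-1) to a row address, given
/// (fragment_id, row_count) pairs in dataset order. Returns `None` when the
/// position is past the end of the dataset.
fn position_to_row_address(fragments: &[(u32, u32)], mut position: u64) -> Option<u64> {
    for &(fragment_id, rows) in fragments {
        if position < rows as u64 {
            return Some(row_address(fragment_id, position as u32));
        }
        // Skip this fragment and keep walking.
        position -= rows as u64;
    }
    None
}
```

Under that assumption, the two numbers coincide only inside a fragment whose ID is 0. Position 120 in a dataset whose fragments are `(0, 100)` and `(3, 50)` lands at offset 20 of fragment 3, whose address is far from 120, so passing positions where row IDs are expected would select the wrong rows.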

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>
- Rename hamming_clustering_sampled to hamming_clustering_for_sample for
  consistency with hamming_clustering_for_ivf_partition naming convention
- Add hamming_clustering_for_range function that reads a contiguous range
  of rows from a specific fragment and performs hamming clustering
- This is useful for distributed processing where each worker handles a
  specific range of a fragment
- Add Python bindings and type stubs for both functions
- Add Rust tests for the new hamming_clustering_for_range function
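
For the distributed use case mentioned above, a coordinator needs to divide a fragment's rows into contiguous ranges, one per worker. The helper below is a hypothetical sketch of that planning step using only the standard library; `split_ranges` is not part of this PR, and how each range is then handed to `hamming_clustering_for_range` is left out because its signature is not shown here.

```rust
use std::ops::Range;

/// Splits `num_rows` rows into `num_workers` contiguous, non-overlapping
/// ranges whose lengths differ by at most one. Earlier workers absorb the
/// remainder. Workers may receive an empty range when rows are scarce.
fn split_ranges(num_rows: u64, num_workers: u64) -> Vec<Range<u64>> {
    assert!(num_workers > 0, "need at least one worker");
    let base = num_rows / num_workers;
    let extra = num_rows % num_workers;
    let mut start = 0u64;
    (0..num_workers)
        .map(|w| {
            // The first `extra` workers each take one additional row.
            let len = base + u64::from(w < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}
```

Splitting 10 rows across 3 workers this way gives `0..4`, `4..7`, and `7..10`. Note that clustering each range independently can only find duplicates that fall within the same range.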

Generated with [Claude Code](https://claude.ai/code)
via [Happy](https://happy.engineering)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Happy <yesreply@happy.engineering>