feat: add storage_options to BasePath#6297

Open
cmccabe wants to merge 7 commits into lance-format:main from cmccabe:colin_multi_base_storage_options

Conversation

@cmccabe
Contributor

@cmccabe cmccabe commented Mar 25, 2026

On Azure, different storage accounts can contain buckets (containers) with the same name. The storage account is also the unit at which rate limits are applied.

Today, a base is matched by its path prefix alone. If a user defines multiple bases that point at same-named buckets in different storage accounts, prefix matching can only ever resolve to one of them and never to the others.

This PR adds the ability to define storage_options as a part of the base definition.
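
To illustrate the ambiguity described above, here is a minimal sketch. The `Base` struct and both matching functions are hypothetical simplifications written for this example; they are not the types or functions used in Lance.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a base path entry.
#[derive(Debug, Clone, PartialEq)]
struct Base {
    id: u32,
    prefix: String,
    storage_options: BTreeMap<String, String>,
}

/// Helper that builds a base tied to one Azure storage account.
fn base(id: u32, prefix: &str, account: &str) -> Base {
    let mut storage_options = BTreeMap::new();
    storage_options.insert(
        "azure_storage_account_name".to_string(),
        account.to_string(),
    );
    Base { id, prefix: prefix.to_string(), storage_options }
}

/// Prefix-only matching: always returns the first base with a matching
/// prefix, so a second base with the same prefix is unreachable.
fn match_by_prefix<'a>(bases: &'a [Base], uri: &str) -> Option<&'a Base> {
    bases.iter().find(|b| uri.starts_with(&b.prefix))
}

/// Identity matching: the prefix and the storage options must both agree.
fn match_by_identity<'a>(
    bases: &'a [Base],
    uri: &str,
    options: &BTreeMap<String, String>,
) -> Option<&'a Base> {
    bases
        .iter()
        .find(|b| uri.starts_with(&b.prefix) && &b.storage_options == options)
}
```

With two bases that share the prefix `az://data/` but name different accounts, `match_by_prefix` can only return the first one, while `match_by_identity` can select either.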

@github-actions
Contributor

PR Review: feat: add storage_options to BasePath

The feature design is sound — storage options as part of base path identity is the right approach for Azure multi-account scenarios. Proto design, Java/Python bindings, conflict resolution, and ambiguity handling all look good.

Two issues to flag:

P1: clone_with_new_initial_options shares the cache Arc — cross-account credential leakage

In storage_options.rs, clone_with_new_initial_options clones the Arc<RwLock<…>> cache:

Self {
    initial_options: new_initial_options,
    provider: self.provider.clone(),
    cache: self.cache.clone(),  // ← shared!
    refresh_offset: self.refresh_offset.clone(),
}

Two accessors created for different storage accounts (different initial_options) will share the same mutable cache. When one refreshes, the other sees stale or wrong credentials. This can cause intermittent auth failures in production when a dynamic StorageOptionsProvider is involved.

Fix: Create a fresh cache seeded from new_initial_options instead of cloning the Arc:

cache: Arc::new(RwLock::new(new_initial_options.as_ref().map(|opts| CachedStorageOptions {
    options: opts.clone(),
    expires_at_millis: opts.get(EXPIRES_AT_MILLIS_KEY).and_then(|s| s.parse().ok()),
}))),
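
The aliasing problem can be reproduced in isolation. The sketch below uses a hypothetical, stripped-down `Accessor` with only the fields needed to show the effect; it omits the provider, the refresh offset, and expiry handling, and it is not the actual type in storage_options.rs.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Options = HashMap<String, String>;

/// Hypothetical, stripped-down accessor used only to show the aliasing.
#[allow(dead_code)]
struct Accessor {
    initial_options: Option<Options>,
    cache: Arc<RwLock<Option<Options>>>,
}

impl Accessor {
    fn new(initial_options: Option<Options>) -> Self {
        // The cache starts out seeded from the initial options.
        let cache = Arc::new(RwLock::new(initial_options.clone()));
        Self { initial_options, cache }
    }

    /// Mirrors the flagged pattern: the clone aliases the original cache.
    fn clone_sharing_cache(&self, new_initial: Option<Options>) -> Self {
        Self { initial_options: new_initial, cache: Arc::clone(&self.cache) }
    }

    /// Mirrors the suggested fix: a fresh cache seeded from the new options.
    fn clone_with_fresh_cache(&self, new_initial: Option<Options>) -> Self {
        Self::new(new_initial)
    }

    /// Simulates a credential refresh writing into the cache.
    fn refresh(&self, options: Options) {
        *self.cache.write().unwrap() = Some(options);
    }

    /// Returns a copy of whatever is currently cached.
    fn cached(&self) -> Option<Options> {
        self.cache.read().unwrap().clone()
    }
}

/// Helper that builds options naming a single storage account.
fn account(name: &str) -> Options {
    HashMap::from([("azure_storage_account_name".to_string(), name.to_string())])
}
```

If an accessor for one account is cloned with `clone_sharing_cache` for another account, a later `refresh` on the first is visible through the second. With `clone_with_fresh_cache`, the second accessor keeps its own options.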

P1: No guardrail against persisting credentials in storage_options

The proto comment correctly says credentials shouldn't go here, but nothing enforces that. A user could pass azure_storage_account_key in storage_options and it would be persisted in plaintext in every manifest version forever (manifests are append-only). Consider an allowlist filter (e.g., only azure_storage_account_name) at the serialization boundary, or at minimum a log warning when known credential keys are detected.
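
One possible shape for the allowlist is sketched below. The function name `filter_persistable` and the `PERSISTABLE_KEYS` constant are hypothetical; only the key `azure_storage_account_name` is taken from the comment above, and a real allowlist would likely need more entries.

```rust
use std::collections::HashMap;

/// Keys assumed safe to persist in a manifest. This list is an assumption
/// for illustration; only this one key is named in the review.
const PERSISTABLE_KEYS: &[&str] = &["azure_storage_account_name"];

/// Splits options into the entries that may be persisted and the names of
/// the keys that were rejected, so the caller can warn or return an error.
fn filter_persistable(
    options: &HashMap<String, String>,
) -> (HashMap<String, String>, Vec<String>) {
    let mut kept = HashMap::new();
    let mut rejected = Vec::new();
    for (key, value) in options {
        if PERSISTABLE_KEYS.contains(&key.as_str()) {
            kept.insert(key.clone(), value.clone());
        } else {
            rejected.push(key.clone());
        }
    }
    // Sort so that warnings and error messages are deterministic.
    rejected.sort();
    (kept, rejected)
}
```

Returning the rejected key names rather than silently dropping them leaves the policy decision (warn or fail) to the serialization boundary.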


Everything else (proto schema, bindings wiring, matches_identity, ambiguity error in validate_and_resolve_target_bases, conflict resolver update, test coverage) looks correct and well-structured.

@codecov

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 95.17544% with 11 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/write.rs | 82.00% | 9 Missing ⚠️ |
| rust/lance/src/dataset/utils.rs | 97.56% | 2 Missing ⚠️ |

Collaborator

@Xuanwo Xuanwo left a comment

Thank you for working on this. I only have a few questions.

input_params: Option<&'a ObjectStoreParams>,
) -> Cow<'a, ObjectStoreParams> {
if let Some(params) = input_params
&& base_path.storage_options.is_empty()
Collaborator

If base_path.storage_options.is_empty(), we return input_params directly, which means base_path_prefix no longer takes effect.

{
return Cow::Borrowed(params);
}
let mut merged_storage_options = base_path.storage_options.clone();
Collaborator

Hi, can you add a comment on object_store_params_for_base_path that declares the expected merge order? It seems input_params can be used to override the base path's params.
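
For concreteness, one merge order consistent with that reading is sketched below. The function `merge_storage_options` is hypothetical, and the precedence shown (base path options first, caller-supplied options last) is an assumption that the PR would need to confirm.

```rust
use std::collections::HashMap;

/// Merges storage options for a base path.
///
/// Assumed order: start from the base path's options, then apply the
/// caller-supplied options, so the caller wins on any conflicting key.
fn merge_storage_options(
    base: &HashMap<String, String>,
    input: Option<&HashMap<String, String>>,
) -> HashMap<String, String> {
    let mut merged = base.clone();
    if let Some(input) = input {
        // `extend` overwrites existing keys, which gives last-write-wins.
        merged.extend(input.iter().map(|(k, v)| (k.clone(), v.clone())));
    }
    merged
}
```

If the intended precedence is the opposite, the two maps would simply swap roles; either way, stating the order in a doc comment removes the ambiguity.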

// Storage options relevant to identifying the base path. For example, azure_storage_account_name
// goes here. Credentials and options which don't affect the identity of the data being stored
// should not be placed here.
map<string, string> storage_options = 5;
Contributor

Inconveniently, this is a spec change, which means it needs a vote.

Contributor

There is a place in this file, around ExternalBaseResolver::resolve_external_uri, that needs to change.

This becomes interesting with this change, because I can now have multiple bases with the same prefix, and the resolution logic cannot determine which one to use.

I think there is a way to resolve this by checking the external link's query parameters. For example, I can write az://container/file.jpg?base=1 to indicate the exact base to use. @Xuanwo curious about your opinion on this.

Another alternative commonly used is az://container@account/... but that feels too Azure/OCI specific.
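
A rough sketch of the query-parameter idea follows. The function `split_base_hint` is hypothetical, it assumes the base is identified by an unsigned integer, and it uses plain string splitting rather than a full URI parser (so percent-encoding and fragments are not handled).

```rust
/// Splits a URI into its path part and an optional `base` hint taken from
/// the query string. Returns `None` for the hint when the parameter is
/// absent or is not a valid unsigned integer.
fn split_base_hint(uri: &str) -> (&str, Option<u32>) {
    match uri.split_once('?') {
        None => (uri, None),
        Some((path, query)) => {
            let hint = query
                .split('&')
                .filter_map(|pair| pair.split_once('='))
                .find(|(key, _)| *key == "base")
                .and_then(|(_, value)| value.parse().ok());
            (path, hint)
        }
    }
}
```

A resolver could use the hint to pick a base directly and fall back to prefix matching, or report an ambiguity error, when no hint is present.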

@cmccabe
Contributor Author

cmccabe commented Mar 26, 2026

#6307

Labels

enhancement New feature or request java python
