feat: make namespace related api more friendly for distributed engines #5547
Conversation
Code Review
This PR introduces the StorageOptionsAccessor abstraction to unify storage options and provider management for distributed engine scenarios. The refactoring consolidates storage_options and storage_options_provider into a single accessor pattern with caching and automatic credential refresh.
P0 Issues
None identified.
P1 Issues
- **Potential race condition in credential refresh retry loop** (`storage_options.rs`)

  In `get_provider_options_with_version()`, when the write lock is busy, the code returns `None` and the caller retries after a 10ms sleep. However, there is no maximum retry limit, which could lead to indefinite spinning in pathological cases:

  ```rust
  loop {
      match self.do_get_provider_options().await? {
          Some((opts, version)) => return Ok((opts, version)),
          None => {
              tokio::time::sleep(Duration::from_millis(10)).await;
          }
      }
  }
  ```

  Consider adding a maximum retry count or exponential backoff with a timeout (a sketch follows this list).
- **Test coverage for `StorageOptionsAccessor`**

  The new `StorageOptionsAccessor` struct (300+ lines of new code) appears to lack dedicated unit tests. Given this is a critical component for credential management across Python, Java, and Rust, I'd recommend adding tests covering:

  - Caching behavior (expiration, refresh before expiry)
  - Merging of initial options with provider options
  - Version tracking on refresh
  - Edge cases (empty options, no provider, etc.)
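A minimal sketch of a bounded retry with exponential backoff for the loop shown in the first item above; the constants and the `timeout_error` helper are illustrative assumptions, not part of the PR:

```rust
use std::time::Duration;

// Illustrative bounds; the PR does not prescribe these values.
const MAX_RETRIES: u32 = 50;
const INITIAL_BACKOFF_MS: u64 = 10; // matches the existing 10ms sleep
const MAX_BACKOFF_MS: u64 = 500;

let mut backoff_ms = INITIAL_BACKOFF_MS;
for _attempt in 0..MAX_RETRIES {
    match self.do_get_provider_options().await? {
        Some((opts, version)) => return Ok((opts, version)),
        None => {
            // Write lock was busy: back off exponentially instead of
            // spinning at a fixed 10ms interval forever.
            tokio::time::sleep(Duration::from_millis(backoff_ms)).await;
            backoff_ms = (backoff_ms * 2).min(MAX_BACKOFF_MS);
        }
    }
}
// Give up with an error instead of looping indefinitely; `timeout_error` is a
// placeholder for whatever error variant the crate uses for timeouts.
Err(timeout_error("timed out waiting for provider options refresh"))
```

Either a retry cap like this or a single `tokio::time::timeout` wrapped around the whole loop would prevent the pathological indefinite spin.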
Observations
- The API surface is well-designed with clear precedence rules (provider options override initial options).
- The migration from `storage_options` + `storage_options_provider` to the unified `storage_options_accessor` is consistently applied across Python, Java, and Rust bindings.
- Documentation and examples in the Java/Python bindings are thorough.
Code Review
This PR refactors how storage options and credential providers are handled by introducing a unified StorageOptionsAccessor that manages both static options and dynamic providers with automatic caching and refresh.
Summary
The changes unify the previous separate storage_options and storage_options_provider into a single StorageOptionsAccessor that handles option merging, caching, and credential refresh. This is a significant refactoring that improves the API for distributed engines.
P1 Issues
- **Blocking `Hash` implementation**

  `StorageOptionsAccessor::hash()` uses `blocking_read()` on a tokio `RwLock`. It is called from the `Hash` trait, which is sync. If this is called from an async context (which is likely given the codebase's async-first architecture), it could deadlock or panic. Consider one of the following (a sketch of the first option follows this list):

  - Using `std::sync::RwLock` for `initial_options` if it's rarely written to
  - Adding documentation warning about calling this from async contexts
  - Adding `#[track_caller]` or panic guards for async context detection
- **Test assertion on concurrent refresh**

  In `test_accessor_concurrent_refresh`, the assertion `call_count >= 1` is very weak for a concurrency test. With 20 concurrent tasks hitting an expired cache, you should validate that the provider isn't called 20 times (which would indicate the caching isn't working). Consider adding an upper bound like `call_count <= 3` to catch a regression where the cache stops working.
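A minimal sketch of the first option above (`std::sync::RwLock` for `initial_options`), assuming a simplified accessor whose initial options are a plain string map; the struct layout is an illustration, not the actual definition in the PR:

```rust
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

// Simplified stand-in for the accessor: only the field needed by Hash is
// shown. Keeping `initial_options` behind std::sync::RwLock means hashing
// never touches the tokio runtime.
struct StorageOptionsAccessor {
    initial_options: RwLock<BTreeMap<String, String>>,
    // cached provider options, expiry, version, ... (omitted)
}

impl Hash for StorageOptionsAccessor {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // A std::sync read lock is safe to take from both sync and async
        // callers, as long as it is held briefly and never across an .await.
        let opts = self.initial_options.read().expect("lock poisoned");
        for (key, value) in opts.iter() {
            key.hash(state);
            value.hash(state);
        }
    }
}
```

Iterating a `BTreeMap` (or sorting keys first) keeps the hash deterministic regardless of insertion order, and the read guard is dropped before returning, so it is never held across an `.await`.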
Minor Observations
- The consolidation of storage options and provider into `StorageOptionsAccessor` is a good API improvement.
- Good test coverage for the new accessor with various scenarios, including expiration and concurrent access.
- Documentation in Python and Java APIs is comprehensive with clear examples.
The architecture change looks solid overall. The main concern is the blocking `Hash` impl, which should be addressed before merge.
In distributed engine scenarios, we have the following problem with vended credentials: we pass in the namespace and table ID to get the table location and to allow dynamic credential refresh. The table is then cached and used to serve multiple queries.

When executing on another worker (e.g. Spark, LanceDB Enterprise, etc.), we essentially have to fetch the credentials again, because we don't know which credentials are currently in use; they may already have been refreshed and differ from the initial input.
This PR adds an API, `dataset.current_storage_options()`, to get the latest storage options in use, so that they can be passed as the initial storage options on the worker node. This ensures we only make a single call to `namespace_client.describe_table`. Note that the engine should configure the credential refresh lead time to be long enough that a single credential is sufficient for the work in most cases.

Another problem is that when distributing to a worker, we already know the table location and the storage options to use, so we should just pass those in and use them. But today the API is an either-or: the user either passes in the URI, or passes in the namespace + table ID, which reloads the URI and storage options. We added documentation and updated the API so that if the user passes in namespace + table_id, we run the automated workflow to get the URI and storage options and set the provider as usual, but we also give the caller the option to set each component manually to match various caching conditions.
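To make the intended handoff concrete, here is a rough Rust-flavored sketch. `dataset.current_storage_options()` is the API this PR adds; `open_from_namespace`, `open_from_uri`, `uri()`, and the exact async/Result shapes are hypothetical stand-ins for the binding-specific open paths, not signatures from this PR:

```rust
// Driver side: open via namespace + table ID. This runs the automated
// workflow: resolve the URI, fetch storage options, install the provider.
// (`open_from_namespace` is a hypothetical helper.)
let dataset = open_from_namespace(namespace, table_id).await?;

// Capture the resolved URI and the storage options currently in use (which
// may already be a refreshed credential) and ship both to the workers.
let uri = dataset.uri().to_string();
let options = dataset.current_storage_options().await?;

// Worker side: the URI and credentials are already known, so open directly
// with them as the initial storage options. No additional
// namespace_client.describe_table call is needed, provided the credential
// refresh lead time is long enough to cover the task.
// (`open_from_uri` is a hypothetical helper.)
let worker_dataset = open_from_uri(&uri, options).await?;
```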