8066efb43ac38af83d16dba3111ebb8a5e3933a6
603 Commits

Author | SHA1 | Message

Zuul | 8066efb43a | Merge "sharder: support rows_per_shard in config file"

Alistair Coles | a87317db6e | sharder: support rows_per_shard in config file

Make rows_per_shard an option that can be configured in the [container-sharder] section of a config file. For auto-sharding, this option was previously hard-coded to shard_container_threshold // 2. The swift-manage-shard-ranges command line tool already supported rows_per_shard on the command line and will now also load it from a config file if specified. Any value given on the command line takes precedence over any value found in a config file.

Change-Id: I820e133a4e24400ed1e6a87ebf357f7dac463e38
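
A minimal sketch of the new option in a sharder config file; 500000 is simply what the previous hard-coded shard_container_threshold // 2 works out to with the default threshold of 1000000, so treat both numbers as illustrative. swift-manage-shard-ranges reads the same value when pointed at such a file, with any value given on the command line taking precedence.

```ini
[container-sharder]
# with the default shard_container_threshold of 1000000, the old hard-coded
# behaviour was equivalent to rows_per_shard = 500000
shard_container_threshold = 1000000
rows_per_shard = 500000
```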

Alistair Coles | 2fd5b87dc5 | reconstructor: make quarantine delay configurable

Previously the reconstructor would quarantine isolated durable fragments that were more than reclaim_age old. This patch adds a quarantine_age option for the reconstructor which defaults to reclaim_age but can be used to configure the age that a fragment must reach before quarantining.

Change-Id: I867f3ea0cf60620c576da0c1f2c65cec2cf19aa0
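
A sketch of how the option might appear in the reconstructor section of an object server config; the one-week value shown is the usual reclaim_age default and both numbers are illustrative:

```ini
[object-reconstructor]
# quarantine_age falls back to reclaim_age (commonly 604800 seconds, one week)
# when it is not set explicitly
reclaim_age = 604800
quarantine_age = 604800
```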

Zuul | 653daf73ed | Merge "relinker: tolerate existing tombstone with same timestamp"

Zuul | 2efd4316a6 | Merge "Make dark data watcher ignore the newly updated objects"

Alistair Coles | 574897ae27 | relinker: tolerate existing tombstone with same timestamp

It is possible for the current and next part power locations to both have existing tombstones with different inodes when the relinker tries to relink. This can be caused, for example, by concurrent reconciler DELETEs that specify the same timestamp. The relinker previously failed to relink and reported an error when encountering this situation. With this patch the relinker will tolerate an existing tombstone with the same filename but different inode in the next part power location.

Since [1] the relinker had special case handling for EEXIST errors caused by a different inode tombstone already existing in the next partition power location: the relinker would check to see if the existing next part power tombstone linked to a tombstone in a previous part power (i.e. < current part power) location, and if so tolerate the EEXIST. This special case handling is no longer necessary because the relinker will now tolerate an EEXIST when linking a tombstone provided the two files have the same timestamp. There is therefore no need to search previous part power locations for a tombstone that does link with the next part power location. The link_check_limit is no longer used but the --link-check-limit command line option is still allowed (although ignored) for backwards compatibility.

[1] Related-Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c

Change-Id: I07ffee3b4ba6c7ff6c206beaf6b8f746fe365c2b
Closes-Bug: #1934142

Pete Zaitcev | 95e0316451 | Make dark data watcher ignore the newly updated objects

When objects are freshly uploaded, they may take a little time to appear in container listings, producing false positives. Because we needed to test this, we also reworked/added the tests and fixed some issues, including adding an EC fragment (thanks to Alistair's code).

Closes-Bug: 1925782
Change-Id: Ieafa72a496328f7a487ca7062da6253994a5a07d
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>

Alistair Coles | 2934818d60 | reconstructor: Delay purging reverted non-durable datafiles

The reconstructor may revert a non-durable datafile on a handoff concurrently with an object server PUT that is about to make the datafile durable. This could previously lead to the reconstructor deleting the recently written datafile before the object-server attempts to rename it to a durable datafile, and consequently a traceback in the object server.

The reconstructor will now only remove reverted nondurable datafiles that are older (according to mtime) than a period set by a new nondurable_purge_delay option (defaults to 60 seconds). More recent nondurable datafiles may be made durable or will remain on the handoff until a subsequent reconstructor cycle.

Change-Id: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
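
A sketch of the new option in the reconstructor section of an object server config, using the 60 second default stated above:

```ini
[object-reconstructor]
# reverted non-durable datafiles younger than this many seconds (by mtime)
# are left in place for a later cycle rather than being purged
nondurable_purge_delay = 60
```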

Zuul | 60dd36cb6b | Merge "Add absolute values for shard shrinking config options"

Zuul | b3def185c6 | Merge "Allow floats for all intervals"

Alistair Coles | 18f20daf38 | Add absolute values for shard shrinking config options

Add two new sharder config options for configuring shrinking behaviour:

- shrink_threshold: the size below which a shard may shrink
- expansion_limit: the maximum size to which an acceptor shard may grow

The new options match the 'swift-manage-shard-ranges' command line options and take absolute values. The new options provide alternatives to the current equivalent options 'shard_shrink_point' and 'shard_shrink_merge_point', which are expressed as percentages of 'shard_container_threshold'. 'shard_shrink_point' and 'shard_shrink_merge_point' are deprecated and will be overridden by the new options if the new options are explicitly set in a config file.

The default values of the new options are the same as the values that would result from the default 'shard_container_threshold', 'shard_shrink_point' and 'shard_shrink_merge_point', i.e.:

- shrink_threshold: 100000
- expansion_limit: 750000

Change-Id: I087eac961c1eab53540fe56be4881e01ded1f60e
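
As a sketch, the new absolute options set alongside the threshold whose percentages they replace; the values are the defaults listed above:

```ini
[container-sharder]
shard_container_threshold = 1000000
# absolute values; if set explicitly these take precedence over the deprecated
# percentage-based shard_shrink_point and shard_shrink_merge_point options
shrink_threshold = 100000
expansion_limit = 750000
```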

Alistair Coles | f7fd99a880 | Use ContainerSharderConf class in sharder and manage-shard-ranges

Change the swift-manage-shard-ranges default expansion-limit to equal the sharder daemon default merge_size, i.e. 750000. The previous default of 500000 had erroneously differed from the sharder default value.

Introduce a ContainerSharderConf class to encapsulate loading of sharder conf and the definition of defaults. ContainerSharder inherits this and swift-manage-shard-ranges instantiates it.

Rename ContainerSharder member vars to match the equivalent vars and cli options in manage_shard_ranges (this direction of renaming is chosen so that the manage_shard_ranges cli options are not changed):

- shrink_size -> shrink_threshold
- merge_size -> expansion_limit
- split_size -> rows_per_shard

Rename ContainerSharder member vars to match the conf file option name:

- scanner_batch_size -> shard_scanner_batch_size

Remove some ContainerSharder member vars that were not used outside of the __init__ method:

- shrink_merge_point
- shard_shrink_point

Change-Id: I8a58a82c08ac3abaddb43c11d26fda9fb45fe6c1

Zuul | 5ec3826246 | Merge "Quarantine stale EC fragments after checking handoffs"

Alistair Coles | 46ea3aeae8 | Quarantine stale EC fragments after checking handoffs

If the reconstructor finds a fragment that appears to be stale then it will now quarantine the fragment. Fragments are considered stale if insufficient fragments at the same timestamp can be found to rebuild missing fragments, and the number found is less than or equal to a new reconstructor 'quarantine_threshold' config option.

Before quarantining a fragment the reconstructor will attempt to fetch fragments from handoff nodes in addition to the usual primary nodes. The handoff requests are limited by a new 'request_node_count' config option.

'quarantine_threshold' defaults to zero, i.e. no fragments will be quarantined. 'request_node_count' defaults to '2 * replicas'.

Closes-Bug: 1655608
Change-Id: I08e1200291833dea3deba32cdb364baa99dc2816
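
A sketch of how the two options might be enabled; the quarantine_threshold value is illustrative, and request_node_count is shown at its stated default:

```ini
[object-reconstructor]
# 0 (the default) disables quarantining of apparently stale fragments
quarantine_threshold = 1
# how many nodes, primaries plus handoffs, may be asked for a fragment
# before concluding that it is stale
request_node_count = 2 * replicas
```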

Matthew Oliver | 4ce907a4ae | relinker: Add /recon/relinker endpoint and drop progress stats

To further benefit the stats capturing for the relinker, drop partition progress into a new relinker.recon recon cache and add a new recon endpoint:

    GET /recon/relinker

To get live relinking progress data:

    $ curl http://127.0.0.3:6030/recon/relinker | python -mjson.tool
    {
      "devices": {
        "sdb3": {
          "parts_done": 523,
          "policies": {
            "1": {
              "next_part_power": 11,
              "start_time": 1618998724.845616,
              "stats": {
                "errors": 0,
                "files": 1630,
                "hash_dirs": 1630,
                "linked": 1630,
                "policies": 1,
                "removed": 0
              },
              "timestamp": 1618998730.24672,
              "total_parts": 1029,
              "total_time": 5.400741815567017
            }
          },
          "start_time": 1618998724.845946,
          "stats": {
            "errors": 0,
            "files": 836,
            "hash_dirs": 836,
            "linked": 836,
            "removed": 0
          },
          "timestamp": 1618998730.24672,
          "total_parts": 523,
          "total_time": 5.400741815567017
        },
        "sdb7": {
          "parts_done": 506,
          "policies": {
            "1": {
              "next_part_power": 11,
              "part_power": 10,
              "parts_done": 506,
              "start_time": 1618998724.845616,
              "stats": {
                "errors": 0,
                "files": 794,
                "hash_dirs": 794,
                "linked": 794,
                "removed": 0
              },
              "step": "relink",
              "timestamp": 1618998730.166175,
              "total_parts": 506,
              "total_time": 5.320528984069824
            }
          },
          "start_time": 1618998724.845616,
          "stats": {
            "errors": 0,
            "files": 794,
            "hash_dirs": 794,
            "linked": 794,
            "removed": 0
          },
          "timestamp": 1618998730.166175,
          "total_parts": 506,
          "total_time": 5.320528984069824
        }
      },
      "workers": {
        "100": {
          "drives": ["sda1"],
          "return_code": 0,
          "timestamp": 1618998730.166175
        }
      }
    }

Also, add a constant DEFAULT_RECON_CACHE_PATH to help fix failing tests by mocking recon_cache_path, so that errors are not logged due to dump_recon_cache exceptions. Mock recon_cache_path more widely and assert no error logs more widely.

Change-Id: I625147dadd44f008a7c48eb5d6ac1c54c4c0ef05

Tim Burke | c374a7a851 | Allow floats for all intervals

Change-Id: I91e9bc02d94fe7ea6e89307305705c383087845a

Zuul | e8580f0346 | Merge "s3api: Add config option to return 429s on ratelimit"

Tim Burke | abfa6bee72 | relinker: Parallelize per disk

Add a new option, workers, that works more or less like the same option from background daemons. Disks will be distributed across N worker sub-processes so we can make the best use of the I/O available.

While we're at it, log final stats at warning if there were errors.

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I039d2b8861f69a64bd9d2cdf68f1f534c236b2ba
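
A sketch of the option in a relinker config file; the [DEFAULT] placement follows the "relinker: Allow conf files for configuration" change further down this log, and the value is illustrative:

```ini
[DEFAULT]
# distribute disks across this many relinker worker sub-processes
workers = 4
```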

Zuul | 7594d97f38 | Merge "relinker: retry links from older part powers"

Alistair Coles | 3bdd01cf4a | relinker: retry links from older part powers

If a previous partition power increase failed to cleanup all files in their old partition locations, then during the next partition power increase the relinker may find the same file to relink in more than one source partition. This currently leads to an error log due to the second relink attempt getting an EEXIST error.

With this patch, when an EEXIST is raised, the relinker will attempt to create/verify a link from older partition power locations to the next part power location, and if such a link is found then suppress the error log.

During the relink step, if an alternative link is verified and if a file is found that is neither linked to the next partition power location nor in the current part power location, then the file is removed during the relink step. That prevents the same EEXIST occurring again during the cleanup step when it may no longer be possible to verify that an alternative link exists.

For example, consider identical filenames in the N+1th, Nth and N-1th partition power locations, with the N+1th being linked to the Nth:

- During relink, the Nth location is visited and its link is verified. Then the N-1th location is visited and an EEXIST error is encountered, but the new check verifies that a link exists to the Nth location, which is OK.
- During cleanup the locations are visited in the same order, but files are removed so that the Nth location file no longer exists when the N-1th location is visited. If the N-1th location still has a conflicting file then existence of an alternative link to the Nth location can no longer be verified, so an error would be raised. Therefore, the N-1th location file must be removed during relink.

The error is only suppressed for tombstones. The number of partition power locations that the relinker will look back over may be configured using the link_check_limit option in a conf file or --link-check-limit on the command line, and defaults to 2.

Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
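
A sketch of the option as it might appear in a relinker config file, shown at its stated default of 2; note that the newer "relinker: tolerate existing tombstone with same timestamp" change near the top of this log leaves the option accepted but ignored:

```ini
[DEFAULT]
# how many older partition power locations to check for an existing link
# when an EEXIST is encountered
link_check_limit = 2
```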

Alistair Coles | 71a4aea31a | Update docs to discourage policy names being numbers

There are times when it is convenient to specify a policy by name or by index (see Related-Change), but policy names can unfortunately collide with indexes. Using a number as a policy name should at least be discouraged.

Change-Id: I0cdd3b86b527d6656b7fb50c699e3c0cc566e732
Related-Change: Icf1517bd930c74e9552b88250a7b4019e0ab413e
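
For illustration, a swift.conf storage policy with a descriptive name rather than a number; the name "gold" is just an example:

```ini
[storage-policy:0]
# prefer a descriptive name; a name like "1" can collide with another policy's index
name = gold
default = yes
```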

Tim Burke | e35365df51 | s3api: Add config option to return 429s on ratelimit

Change-Id: If04c083ccc9f63696b1f53ac13edc932740a0654

Zuul | 310298a948 | Merge "s3api: Allow CORS preflight requests"

Tim Burke | 27a734c78a | s3api: Allow CORS preflight requests

Unfortunately, we can't identify the user, so we can't map to an account, so we can't respect whatever CORS metadata might be set on the container. As a result, the allowed origins must be configured cluster-wide. Add a new config option, cors_preflight_allow_origin, for that; default it to blank (i.e. deny preflights from all origins, preserving existing behavior), but allow either a comma-separated list of origins or * (to allow all origins).

Change-Id: I985143bf03125a05792e79bc5e5f83722d6431b3
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
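
A sketch of the option in the s3api filter section of proxy-server.conf; the origins listed are placeholders:

```ini
[filter:s3api]
# blank (the default) denies all CORS preflights; list allowed origins
# explicitly, or use * to allow any origin
cors_preflight_allow_origin = https://console.example.com,https://admin.example.com
```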

Matthew Oliver | fb186f6710 | Add a config file option to swift-manage-shard-ranges

While working on the shrinking recon drops, we want to display numbers that directly relate to how the tool should behave. But currently all options of the s-m-s-r tool are driven by cli options. This creates a disconnect: defining what should be used in the sharder and in the tool via separate options is bound to fail. It would be much better to define the required default options for your environment in one place that both the sharder and the tool can use.

This patch does some refactoring, adds max_shrinking and max_expanding options to the sharding config, and adds a --config option to the tool. The --config option expects a config with a '[container-sharder]' section. It only supports the shard options:

- max_shrinking
- max_expanding
- shard_container_threshold
- shard_shrink_point
- shard_merge_point

The latter two are used to generate the s-m-s-r's:

- shrink_threshold
- expansion_limit
- rows_per_shard

Cli arguments take precedence over values from the config.

Change-Id: I4d0147ce284a1a318b3cd88975e060956d186aec
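
A sketch of a config file that swift-manage-shard-ranges --config could point at, restricted to the options the commit lists; the values are illustrative rather than documented defaults:

```ini
[container-sharder]
max_shrinking = 1
max_expanding = -1
shard_container_threshold = 1000000
shard_shrink_point = 10
shard_merge_point = 75
```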

Zuul | 5c3eb488f2 | Merge "Report final in_progress when sharding is complete"

Matthew Oliver | 1de9834816 | Report final in_progress when sharding is complete

On every sharder cycle we update in-progress recon stats for each sharding container. However, we tend not to run that update one final time once sharding is complete, because the DB state has changed to SHARDED, and therefore the in_progress stats never get their final update. For anyone monitoring this data, sharding/cleaving of shards appears never to complete.

This patch adds a new option `recon_shared_timeout` which allows sharded containers to continue to be processed by `_record_sharding_progress()` for an amount of time after they have finished sharding.

Change-Id: I5fa39d41f9cd3b211e45d2012fd709f4135f595e
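
A sketch of how the option might be set, using the name exactly as written in the commit message; the 12 hour value is purely illustrative since no default is stated here:

```ini
[container-sharder]
# keep reporting in_progress recon stats for containers for this many seconds
# after they finish sharding
recon_shared_timeout = 43200
```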

Zuul | 0c2cc63b59 | Merge "tempauth: Add .reseller_reader group"

Tim Burke | 53c0fc3403 | relinker: Add option to ratelimit relinking

Sure, you could use stuff like ionice or cgroups to limit relinker I/O, but sometimes a nice simple blunt instrument is handy.

Change-Id: I7fe29c7913a9e09bdf7a787ccad8bba2c77cf995

Tim Burke | cf4f320644 | tempauth: Add .reseller_reader group

Change-Id: I8c5197ed327fbb175c8a2c0e788b1ae14e6dfe23
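
For illustration, a tempauth user granted the new read-only reseller group; the account, user and key shown are placeholders:

```ini
[filter:tempauth]
# user_<account>_<user> = <key> [groups...]
user_audit_reader = readerpass .reseller_reader
```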

Zuul | 0c072e244c | Merge "relinker: Allow conf files for configuration"

Pete Zaitcev | 98a0275a9d | Add a read-only role to keystoneauth

An idea was floated recently of a read-only role that can be used for cluster-wide audits, and is otherwise safe. It was also included in the "Consistent and Secure Default Policies" effort in OpenStack, where it implements "reader" personas in system, domain, and project scopes. This patch implements it for system scope, where it's most useful for operators.

Change-Id: I5f5fff2e61a3e5fb4f4464262a8ea558a6e7d7ef

Tim Burke | 1b7dd34d38 | relinker: Allow conf files for configuration

Swap out the standard logger stuff in place of --logfile. Keep --device as a CLI-only option. Everything else is pretty standard stuff that ought to be in [DEFAULT].

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I32f979f068592eaac39dcc6807b3114caeaaa814
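
A sketch of a relinker config file using the standard logging options the commit alludes to; --device remains CLI-only and is deliberately absent:

```ini
[DEFAULT]
log_name = object-relinker
log_facility = LOG_LOCAL0
log_level = INFO
```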

Zuul | 48b26ba833 | Merge "docs: Clarify that encryption should not be in reconciler pipeline"

Tim Burke | 13c0980e71 | docs: Clarify that encryption should not be in reconciler pipeline

UpgradeImpact
=============
Operators should verify that encryption is not enabled in their reconciler pipelines; having it enabled there may harm data durability. For more information, see https://launchpad.net/bugs/1910804

Change-Id: I1a1d78ed91d940ef0b4eba186dcafd714b4fb808
Closes-Bug: #1910804
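
For reference, a reconciler pipeline of the shape this warning has in mind, with no keymaster or encryption middleware present; treat the exact middleware list as illustrative rather than a documented sample:

```ini
[pipeline:main]
# no keymaster/encryption here: the reconciler should move objects as-is
pipeline = catch_errors proxy-logging cache proxy-server
```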

Alistair Coles | 6896f1f54b | s3api: actually execute check_pipeline in real world

Previously, S3ApiMiddleware.check_pipeline would always exit early because the __file__ attribute of the Config instance passed to check_pipeline was never set. The __file__ key is typically passed to the S3ApiMiddleware constructor in the wsgi config dict, so this dict is now passed to check_pipeline() for it to test for the existence of __file__. Also, the use of a Config object is replaced with a dict where it mimics the wsgi conf object in the unit tests setup.

UpgradeImpact
=============
The bug prevented the pipeline order checks described in proxy-server.conf-sample being made on the proxy-server pipeline when s3api middleware was included. With this change, these checks will now be made and an invalid pipeline configuration will result in a ValueError being raised during proxy-server startup.

A valid pipeline has another middleware (presumed to be an auth middleware) between s3api and the proxy-server app. If keystoneauth is found, then a further check is made that s3token is configured after s3api and before keystoneauth.

The pipeline order checks can be disabled by setting the s3api auth_pipeline_check option to False in proxy-server.conf. This mitigation is recommended if previously operating with what will now be considered an invalid pipeline.

The bug also prevented a check for slo middleware being in the pipeline between s3api and the proxy-server app. If the slo middleware is not found then multipart uploads will now not be supported, regardless of the value of the allow_multipart_uploads option described in proxy-server.conf-sample. In this case a warning will be logged during startup but no exception is raised.

Closes-Bug: #1912391
Change-Id: I357537492733b97e5afab4a7b8e6a5c527c650e4
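
A sketch of a pipeline that satisfies the checks described above: s3api is followed by s3token and then keystoneauth, and slo sits between s3api and the proxy-server app. The surrounding middlewares are illustrative, not a recommended sample:

```ini
[pipeline:main]
pipeline = catch_errors proxy-logging cache s3api s3token keystoneauth slo proxy-logging proxy-server

[filter:s3api]
# set to False to skip the startup pipeline order checks
auth_pipeline_check = True
```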

Tim Burke | 10d9a737d8 | s3api: Make allowable clock skew configurable

While we're at it, make the default match AWS's 15 minute limit (instead of our old 5 minute limit).

UpgradeImpact
=============
This (somewhat) weakens some security protections for requests over the S3 API; operators may want to preserve the prior behavior by setting allowable_clock_skew = 300 in the [filter:s3api] section of their proxy-server.conf

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I0da777fcccf056e537b48af4d3277835b265d5c9
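
The setting described above, spelled out; 900 seconds is the new 15 minute default implied by the commit:

```ini
[filter:s3api]
# new default is 900 (15 minutes); restore the old 5 minute limit if preferred
allowable_clock_skew = 300
```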

Zuul | d5bb644a17 | Merge "Use cached shard ranges for container GETs"

Zuul | 8c611be876 | Merge "Memcached client TLS support"

Grzegorz Grasza | 6930bc24b2 | Memcached client TLS support

This patch specifies a set of configuration options required to build a TLS context, which is used to wrap the client connection socket.

Closes-Bug: #1906846
Change-Id: I03a92168b90508956f367fbb60b7712f95b97f60
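
The commit does not list the option names, so the snippet below is only a sketch of the likely shape of such a setup; treat the option names and certificate paths as assumptions for illustration:

```ini
[memcache]
# assumed option names: enable TLS for connections to memcached and point at
# illustrative certificate paths
tls_enabled = true
tls_cafile = /etc/swift/ca.crt
tls_certfile = /etc/swift/memcache-client.crt
tls_keyfile = /etc/swift/memcache-client.key
```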

Alistair Coles | 077ba77ea6 | Use cached shard ranges for container GETs

This patch makes four significant changes to the handling of GET requests for sharding or sharded containers:

- container server GET requests may now result in the entire list of shard ranges being returned for the 'listing' state regardless of any request parameter constraints.
- the proxy server may cache that list of shard ranges in memcache and the request's environ infocache dict, and subsequently use the cached shard ranges when handling GET requests for the same container.
- the proxy now caches more container metadata so that it can synthesize a complete set of container GET response headers from cache.
- the proxy server now enforces more container GET request validity checks that were previously only enforced by the backend server, e.g. checks for valid request parameter values.

With this change, when the proxy learns from container metadata that the container is sharded then it will cache shard ranges fetched from the backend during a container GET in memcache. On subsequent container GETs the proxy will use the cached shard ranges to gather object listings from shard containers, avoiding further GET requests to the root container until the cached shard ranges expire from cache.

Cached shard ranges are most useful if they cover the entire object name space in the container. The proxy therefore uses a new X-Backend-Override-Shard-Name-Filter header to instruct the container server to ignore any request parameters that would constrain the returned shard range listing, i.e. the 'marker', 'end_marker', 'includes' and 'reverse' parameters. Having obtained the entire shard range listing (either from the server or from cache) the proxy now applies those request parameter constraints itself when constructing the client response.

When using cached shard ranges the proxy will synthesize response headers from the container metadata that is also in cache. To enable the full set of container GET response headers to be synthesized in this way, the set of metadata that the proxy caches when handling a backend container GET response is expanded to include various timestamps.

The X-Newest header may be used to disable looking up shard ranges in cache.

Change-Id: I5fc696625d69d1ee9218ee2a508a1b9be6cf9685

Samuel Merritt | b971280907 | Let developers/operators add watchers to object audit

Swift operators may find it useful to operate on each object in their cluster in some way. This commit provides them a way to hook into the object auditor with a simple, clearly-defined boundary so that they can iterate over their objects without additional disk IO. For example, a cluster operator may want to ensure a semantic consistency with all SLO segments accounted in their manifests, or locate objects that aren't in container listings. Now that Swift has encryption support, this could be used to locate unencrypted objects. The list goes on.

This commit makes the auditor locate, via entry points, the watchers named in its config file. A watcher is a class with at least these four methods:

- __init__(self, conf, logger, **kwargs)
- start(self, audit_type, **kwargs)
- see_object(self, object_metadata, data_file_path, **kwargs)
- end(self, **kwargs)

The auditor will call watcher.start(audit_type) at the start of an audit pass, watcher.see_object(...) for each object audited, and watcher.end() at the end of an audit pass. All method arguments are passed as keyword args.

This version of the API is implemented in the context of the auditor itself, without spawning any additional processes. If the plugins are not working well -- hang, crash, or leak -- it's easier to debug them when there's no additional complication of processes that run by themselves.

In addition, we include a reference implementation of a plugin for the watcher API, as a help to plugin writers.

Change-Id: I1be1faec53b2cdfaabf927598f1460e23c206b0a
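
A sketch of how a watcher might be wired into the auditor's config; the entry point name is based on the dark data watcher mentioned elsewhere in this log, and the per-watcher section and its option are assumptions for illustration:

```ini
[object-auditor]
# comma-separated list of watcher entry points to load
watchers = swift#dark_data

[object-auditor:watcher:swift#dark_data]
# options in a section like this reach the watcher's __init__ via its conf
action = log
```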

Zuul | ebfc3a61fa | Merge "Use socket_timeout kwarg instead of useless eventlet.wsgi.WRITE_TIMEOUT"

Zuul | cd228fafad | Merge "Add a new URL parameter to allow for async cleanup of SLO segments"

Tim Burke | 918ab8543e | Use socket_timeout kwarg instead of useless eventlet.wsgi.WRITE_TIMEOUT

No version of eventlet that I'm aware of has any sort of support for eventlet.wsgi.WRITE_TIMEOUT; I don't know why we've been setting that. On the other hand, the socket_timeout argument for eventlet.wsgi.Server has been supported for a while -- since 0.14 in 2013.

Drive-by: Fix up handling of sub-second client_timeouts.

Change-Id: I1dca3c3a51a83c9d5212ee5a0ad2ba1343c68cf9
Related-Change: I1d4d028ac5e864084a9b7537b140229cb235c7a3
Related-Change: I433c97df99193ec31c863038b9b6fd20bb3705b8

Tim Burke | e78377624a | Add a new URL parameter to allow for async cleanup of SLO segments

Add a new config option to SLO, allow_async_delete, to allow operators to opt-in to this new behavior. If their expirer queues get out of hand, they can always turn it back off. If the option is disabled, handle the delete inline; this matches the behavior of old Swift.

Only allow an async delete if all segments are in the same container and none are nested SLOs, that way we only have two auth checks to make.

Have s3api try to use this new mode if the data seems to have been uploaded via S3 (since it should be safe to assume that the above criteria are met).

Drive-by: Allow the expirer queue and swift-container-deleter to use high-precision timestamps.

Change-Id: I0bbe1ccd06776ef3e23438b40d8fb9a7c2de8921
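
A sketch of opting in to the new behavior in the SLO filter section of proxy-server.conf:

```ini
[filter:slo]
# when enabled, a DELETE with the new URL parameter queues segment deletion
# for the expirer instead of deleting segments inline
allow_async_delete = true
```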

Zuul | 2593f7f264 | Merge "memcache: Make error-limiting values configurable"

Tim Burke | aff65242ff | memcache: Make error-limiting values configurable

Previously these were all hardcoded; let operators tweak them as needed. Significantly, this also allows operators to disable error-limiting entirely, which may be a useful protection in case proxies are configured with a single memcached server.

Use error_suppression_limit and error_suppression_interval to mirror the option names used by the proxy-server to ratelimit backend Swift servers.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Ife005cb8545dd966d7b0e34e5496a0354c003881
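
A sketch of the two options in a memcache client config; the values shown mirror the proxy-server's defaults for the same-named backend error-limiting options and are purely illustrative:

```ini
[memcache]
# how many errors within error_suppression_interval seconds before a
# memcached server is error-limited
error_suppression_limit = 10
error_suppression_interval = 60
```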

Zuul | b9a404b4d1 | Merge "ec: Add an option to write fragments with legacy crc"

Clay Gerrard | b05ad82959 | Add tasks_per_second option to expirer

This allows operators to throttle expirers as needed.

Partial-Bug: #1784753
Change-Id: If75dabb431bddd4ad6100e41395bb6c31a4ce569
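
A sketch of the option in the expirer's config; the value is illustrative:

```ini
[object-expirer]
# cap the rate at which expiration tasks are executed
tasks_per_second = 50
```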