52730e1037563ad8ba0e09da93886c856ed9e875
693 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
Yan Xiao
|
dcd5a265f6 |
proxy-logging: Add real-time transfer bytes counters
Currently we can get one proxy-logging transfer stat emission over the duration of the upload/download. We want another stat coming out of proxy-logging: something that gets emitted periodically as bytes are actually sent/received so we can get reasonably accurate point-in-time breakdowns of bandwidth usage. Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Co-Authored-By: Shreeya Deshpande <shreeyad@nvidia.com> Change-Id: Ideecd0aa58ddf091c9f25f15022a9066088f532b Signed-off-by: Yan Xiao <yanxiao@nvidia.com> |
||
|
Zuul
|
c161aa168f | Merge "relinker: allow clobber-hardlink-collision" | ||
|
Clay Gerrard
|
be62933d00 |
relinker: allow clobber-hardlink-collision
The relinker has already been robust to hardlink collisions on tombstones for some time; this change allows ops to optionally (non-default) enable similar handling of other files when relinking the old=>new partdir.

If your cluster is having a bunch of these kinds of collisions and, after spot checking, you determine the data is in fact duplicate copies of the same data, you'd much rather have the option for the relinker to programmatically handle them non-destructively than force ops to rm a bunch of files manually just to get out of a PPI. Once the PPI is over and your reconstructors are running again, after some validation you can probably clean out your quarantine dirs.

Drive-by: log unknown relink errors at error level to match the expected non-zero return code

Closes-Bug: #2127779 Change-Id: Iaae0d9fb7a1949d1aad9aa77b0daeb249fb471b5 Signed-off-by: Clay Gerrard <clay.gerrard@gmail.com> |
||
|
Tim Burke
|
79feb12b28 |
docs: More proxy-server.conf-sample cleanup
Change-Id: I99dbd9590ff39343422852e4154f98bc194d161d Signed-off-by: Tim Burke <tim.burke@gmail.com> |
||
|
Clay Gerrard
|
389747a8b2 |
doc: specify seconds in proxy-server.conf-sample
Most of swift's timing configuration values should accept units in seconds; make this explicit in the sample config for values that did not already do so. Related-Change-Id: I38c11b7aae8c4112bb3d671fa96012ab0c44d5a2 Change-Id: I5b25b7e830a31f03d11f371adf12289222222eb2 Signed-off-by: Clay Gerrard <clay.gerrard@gmail.com> |
||
|
Jianjian Huo
|
d9883d0834 |
proxy: use cooperative tokens to coalesce updating shard range requests into backend
The cost of memcache misses can be deadly. For example, when the updating shard range cache query misses, PUT requests have to query the backend to figure out which shard to upload objects to. When a lot of requests are sent to the backend at the same time, this can easily overload the root containers and cause a lot of 500/503 errors; and when proxy-servers receive the responses to all those 200 backend shard range queries, they in turn try to write the same shard range data into memcached servers at the same time, which can cause memcached to return OOM failures too. We have frequently seen misses on the updating shard range cache in production, due to memcached out-of-memory errors and cache evictions.

To cope with these kinds of situations, a memcached-based cooperative token mechanism is added to the proxy-server to coalesce many in-flight backend requests into a few: when the updating shard range cache misses, only the first few requests get global cooperative tokens and are able to fetch updating shard ranges from the backend container servers; the following cache-miss requests wait for the cache fill to finish instead of all querying the backend container servers. This prevents a flood of backend requests from overloading both the container servers and the memcached servers.

Drive-by fix: when memcache is not available, the object controller only needs to retrieve the specific shard range from the container server to send the update request to.

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com> Co-Authored-By: Tim Burke <tim.burke@gmail.com> Co-Authored-By: Yan Xiao <yanxiao@nvidia.com> Co-Authored-By: Shreeya Deshpande <shreeyad@nvidia.com> Signed-off-by: Jianjian Huo <jhuo@nvidia.com> Change-Id: I38c11b7aae8c4112bb3d671fa96012ab0c44d5a2 |
||
|
Zuul
|
e10c2bafcb | Merge "proxy-logging: create field for access_user_id" | ||
|
Vitaly Bordyug
|
32eaab20b1 |
proxy-logging: create field for access_user_id
Added the new field to be able to log the access key during s3api calls, while reserving the field to be filled with auth-relevant information in the case of other middlewares. Added respective code to the tempauth and keystone middlewares.

Since s3api creates a copy of the environ dict for the downstream request object when translating via s3req.to_swift_req, the environ dict that is seen/modified in other middleware modules is not the same instance seen in proxy-logging; by using mutable objects, the values do get transferred into swift_req.environ.

Change the assert in test_proxy_logging from "the last field" to index 21 in the interest of maintainability. Also added some regression tests for object, bucket and S3 v4 APIs and updated the documentation with the details about the new field.

Signed-off-by: Vitaly Bordyug <vbordug@gmail.com> Change-Id: I0ce4e92458e2b05a4848cc7675604c1aa2b64d64 |
||
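If the new field is exposed to proxy-logging's log_msg_template like the existing fields, logging it might look like the sketch below; the {access_user_id} placeholder name and the other template fields shown are assumptions, not taken from this log.

```ini
[filter:proxy-logging]
use = egg:swift#proxy_logging
# Hypothetical template including the new field alongside a few standard ones.
log_msg_template = {client_ip} {remote_addr} {method} {path} {status_int} {access_user_id}
```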
|
Tim Burke
|
ae062f8b09 |
ring: Introduce a v2 ring format
There's a bunch of moving pieces here:
- Add a new RingWriter class. Stick it in a new swift.common.ring.io module. You *can* use it like the old gzip file, but you can also define named sections which can be referenced later on read. Section names may be arbitrary strings, but the "swift/" prefix is reserved for upstream use. Sections must contain a single length-value encoded BLOB. If sections are used, an additional BLOB is written at the end containing a JSON section-index, followed by an uncompressed offset for the index. Move RingReader to ring/io.py, too.
- Clean up some ring metadata handling:
  - Drop MD5 tracking in RingReader. It was brittle at best anyway, and nothing uses it. YAGNI
  - Fix size/raw_size attributes when loading only metadata.
- Add the ability to seek within RingReaders, though you need to know what you're doing and only seek to flush points.
- Let RingBuilder objects change how wide their replica2part2dev_id arrays are. Add a dev_id_bytes key to serialized ring metadata. dev_id_bytes may be either 2 or 4, but 4 requires v2 rings. We considered allowing dev_id_bytes of 1, but dropped it as unnecessary complexity for a niche use case.
- swift-ring-builder version subcommand added, which takes a ring. This lets operators see the serialization format of a ring on disk:
  $ swift-ring-builder object.ring.gz version
  object.ring.gz: Serialization version: 2 (2-byte IDs), build version: 54

Signed-off-by: Tim Burke <tim.burke@gmail.com> Change-Id: Ia0ac4ea2006d8965d7fdb6659d355c77386adb70 |
||
|
Tim Burke
|
74030236ad |
tempauth: Support fernet tokens
Tempauth fernet tokens use a secret shared among all proxies to encrypt user group information. Because they are encrypted, clients can neither view nor edit this information; it is an opaque bearer token similar to the existing memcached-backed tokens (just much longer). Note that tokens still expire after the configured token_life.

Add a new set of config options of the form

  fernet_key_<keyid> = <32 url-safe base64-encoded bytes>

Any of the configured keys will be used to attempt to decrypt tokens starting with "ftk" and extract group information. Another new config option

  active_fernet_key_id = <keyid>

dictates which key should be used when minting tokens. Such tokens will start with "ftk" to distinguish them from memcached-backed tokens (which continue to start with "tk"). If active_fernet_key_id is not configured, memcached-backed tokens continue to be used.

Together, these allow seamless transitions from memcached-backed tokens to fernet tokens, as well as transitions from one fernet key to another:
1. Add a new fernet_key_<keyid> entry.
2. Ensure all proxies have the new config with fernet_key_<keyid>.
3. Set active_fernet_key_id = <keyid>.
4. Ensure all proxies have the new config with the new active_fernet_key_id.

This is similar to the key-rotation process for the encryption feature, except that old keys may be pruned following a token_life period.

Additionally, opportunistically compress groups before minting tokens. Compressed tokens will begin with "zftk" but otherwise behave just like "ftk" tokens.

Change-Id: I0bdc98765d05e91f872ef39d4722f91711a5641f |
||
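A minimal sketch of the tempauth setup described above, as it might appear in the proxy config; the key id is arbitrary and the key value is left as the placeholder from the commit message rather than a real secret.

```ini
[filter:tempauth]
use = egg:swift#tempauth
# Tokens (fernet or memcached-backed) still expire after token_life seconds.
token_life = 86400
# Any configured key may be used to decrypt "ftk"/"zftk" tokens.
fernet_key_2024 = <32 url-safe base64-encoded bytes>
# Mint new tokens with this key; leave unset to keep using memcached-backed tokens.
active_fernet_key_id = 2024
```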
|
Clay Gerrard
|
0e2791a88a |
Remove deprecated statsd label_mode
Hopefully, if we never do a release that supports signalfx, no one will ever use it and we won't have to maintain it.

Drive-by: refactor label mode dispatch to fix a weird bug where a config name could be a class attribute and blow up in weird ways.

Change-Id: I2c67b59820c5ca094077bf47628426f4b0445ba0 |
||
|
Tim Burke
|
7e5235894b |
stats: API for native labeled metrics
Introduce a LabeledStatsdClient API; no callers yet. Include three config options:
- statsd_label_mode, which specifies which label format to use
- statsd_emit_legacy, which dictates whether to emit old-style dotted metrics
- statsd_user_label_<name> = <value>, which supports user-defined labels in restricted ASCII characters

Co-Authored-By: yanxiao@nvidia.com Co-Authored-By: alistairncoles@gmail.com Change-Id: I115ffb1dc601652a979895d7944e011b951a91c1 |
||
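The three options named above might be combined as below; the label mode value and the user label name/value are illustrative assumptions, and placing them in the server's [DEFAULT] block is also assumed.

```ini
[DEFAULT]
# Label format used by the new LabeledStatsdClient (value shown is illustrative).
statsd_label_mode = graphite
# Keep emitting the old dotted-style metrics alongside labeled ones.
statsd_emit_legacy = true
# User-defined label applied to emitted metrics; name and value are examples.
statsd_user_label_region = us-east
```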
|
Clay Gerrard
|
b69a2bef45 |
Deprecate expirer options
The following configuration options are deprecated:
* expiring_objects_container_divisor
* expiring_objects_account_name

The upstream maintainers are not aware of any clusters where these have been configured to non-default values.

UpgradeImpact: Operators are encouraged to remove their "container_divisor" setting and use the default value of 86400. If a cluster was deployed with a non-standard "account_name", operators should remove the option from all configs so they are using a supported configuration going forward, but will need to deploy stand-alone expirer processes with a legacy expirer config to clean up old expiration tasks from the previously configured account name.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Co-Authored-By: Jianjian Huo <jhuo@nvidia.com> Change-Id: I5ea9e6dc8b44c8c5f55837debe24dd76be7d6248 |
||
|
Tim Burke
|
ae6300af86 |
wsgi: Reap stale workers (after a timeout) following a reload
Add a new tunable, `stale_worker_timeout`, defaulting to 86400 (i.e. 24 hours). Once this time elapses following a reload, the manager process will issue SIGKILLs to any remaining stale workers. This gives operators a way to configure a limit for how long old code and configs may still be running in their cluster.

To enable this, the temporary reload child (which waits for the reload to complete then closes the accept socket on all the old workers) has grown the ability to send state to the re-exec'ed manager. Currently, this is limited to just the set of pre-re-exec child PIDs and their reload times, though it was designed to be reasonably extensible. This allows the new manager to recognize stale workers as they exit instead of logging

  Ignoring wait() result from unknown PID ...

With the improved knowledge of subprocesses, we can kick the log level for the above message up from info to warning; we no longer expect it to trigger in practice.

Drive-by: Add logging to ServersPerPortStrategy.register_worker_exit that's comparable to what WorkersStrategy does.

Change-Id: I8227939d04fda8db66fb2f131f2c71ce8741c7d9 |
||
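A sketch of the new tunable with the default stated above; putting it in the WSGI server's [DEFAULT] section is an assumption.

```ini
[DEFAULT]
# Seconds after a reload before the manager SIGKILLs any remaining stale
# workers still running old code/config (default shown: 24 hours).
stale_worker_timeout = 86400
```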
|
Zuul
|
94d3a5dee8 | Merge "obj: Add option to tune down etag validation in object-server" | ||
|
Tim Burke
|
3d8fb046cb |
obj: Add option to tune down etag validation in object-server
Historically, the object-server would validate the ETag of an object whenever it was streaming the complete object. This minimizes the possibility of returning corrupted data to clients, but
- Clients that only ever make ranged requests get no benefit and
- MD5 can be rather CPU-intensive; this is especially noticeable in all-flash clusters/policies where Swift is not disk-constrained.

Add a new `etag_validate_pct` option to tune down this validation. This takes values from 100 (default; all whole-object downloads are validated) down to 0 (none are).

Note that even with etag validation turned off, the object-auditor should eventually detect and quarantine corrupted objects. However, transient read errors may cause clients to download corrupted data.

Hat-tip to Jianjian for all the profiling work!

Co-Authored-By: Jianjian Huo <jhuo@nvidia.com> Change-Id: Iae48e8db642f6772114c0ae7c6bdd9c653cd035b |
||
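For instance, an operator of an all-flash policy might dial validation down as in this sketch; 50 is purely an example value, with 100 (the default) validating every whole-object download and 0 validating none.

```ini
[app:object-server]
use = egg:swift#object
# Validate the ETag on roughly half of whole-object downloads.
etag_validate_pct = 50
```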
|
Tim Burke
|
a55a48ffc8 |
docs: Call out that xprofile is not intended for production
Change-Id: I1e9d4d5df403040d69db93a08647cd0abe1b8037 |
||
|
Jianjian Huo
|
ea1d84c1d7 |
Object-server: add periodic greenthread yielding during file write
Currently, when the object-server serves a PUT request and the DiskFile writer writes file chunks to disk, there is no explicit eventlet sleep called. When the network outpaces the slow disk IO, it's possible that one large and slow PUT request could cause the eventlet hub not to schedule any other green threads for a long period of time. To improve this, this patch enables the configurable yield parameter 'cooperative_period' in the object-server controller write path.

Related-Change: I80b04bad0601b6cd6caef35498f89d4ba70a4fd4 Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com> Change-Id: I1c0aba9830433f093d024b4c39cd3a3b2f0d69f1 |
||
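Together with the read-path change further down this log, enabling the yield behaviour might look like the following; the value 5 is only an example (0, the default, disables it), and treating it as "chunks between yields" is an assumption based on the two commit descriptions.

```ini
[app:object-server]
use = egg:swift#object
# Periodically yield to other greenthreads while reading/writing disk file
# chunks; 0 (default) disables the explicit eventlet sleep.
cooperative_period = 5
```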
|
Zuul
|
7662cde704 | Merge "Add oldest failed async pending tracker" | ||
|
Chinemerem
|
0a5348eb48 |
Add oldest failed async pending tracker
In the past we have had some async pendings that repeatedly fail for months at a time. This patch adds an OldestAsyncPendingTracker class which manages the tracking of the oldest async pending updates for each account-container pair. This class maintains timestamps for pending updates associated with account-container pairs. It evicts the newest pairs when max_entries is reached. It supports retrieving the N oldest pending updates or calculating the age of the oldest pending update. Change-Id: I6d9667d555836cfceda52708a57a1d29ebd1a80b |
||
|
Clay Gerrard
|
df22032d79 |
object-expirer: add round_robin_cache_size option
Drive-Bys:
* DRY out redundant configuration examples in the expiring objects overview documentation.
* Add missing delay_reaping man page docs.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Change-Id: I8879dbd13527233c878dff764ec411ce9619ee39 |
||
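The new expirer option might be set as below; the section placement, the value, and even whether a default exists are not stated in this log, so treat all of it as an assumption.

```ini
[object-expirer]
# Hypothetical example value; the commit message gives no default.
round_robin_cache_size = 100000
```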
|
Tim Burke
|
ef8764cb06 |
logging: Add UPDATE to valid http methods
We introduced this a while back, but forgot to add it then. Related-Change: Ia13ee5da3d1b5c536eccaadc7a6fdcd997374443 Change-Id: Ib65ddf50d7f5c3e27475626000943eb18e65c73a |
||
|
Alistair Coles
|
d555755423 |
proxy_logging config: unit tests and doc pointers
Add unit tests to verify the precedence of access_log_ and log_ prefixes to options. Add pointers from proxy_logging sections in other sample config files to the proxy-server.conf-sample file. Change-Id: Id18176d3790fd187e304f0e33e3f74a94dc5305c |
||
|
Thomas Goirand
|
90da23c7d2 |
kms_keymaster: allow specifying barbican_endpoint
Under a multi-region deployment with a single Keystone server, specifying the Keystone auth credentials isn't enough. Indeed, Castellan succeeds when logging in, but may use the wrong Barbican endpoint (if there are 2 Barbicans deployed). This is what happened to us when deploying our 2nd region. The way to fix it would be to tell Castellan which region to use; unfortunately, there's no such option in Castellan. However, we can specify the barbican_endpoint directly, which is what this patch allows. Change-Id: Ib7f4219ef5fdef65e9cfd5701e28b5288741783e |
||
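A hedged sketch of pinning the keymaster to one region's Barbican; the credential option names are the usual Castellan/Keystone ones, shown only for context, and all URLs and values are placeholders.

```ini
[kms_keymaster]
username = swift
password = <password>
project_name = service
auth_endpoint = http://keystone.example.com:5000/v3
# New: point Castellan at the intended region's Barbican explicitly instead of
# relying on whichever endpoint it would otherwise pick.
barbican_endpoint = http://barbican-region2.example.com:9311
```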
|
Zuul
|
d1aa735a37 | Merge "backend ratelimit: support per-method rate limits" | ||
|
Zuul
|
bf206ed2fe | Merge "backend ratelimit: support reloadable config file" | ||
|
Zuul
|
937af35e62 | Merge "object-expirer: add example to delay_reaping sample config" | ||
|
indianwhocodes
|
11eb17d3b2 |
support x-open-expired header for expired objects
If the global configuration option 'enable_open_expired' is set to true, then a client will be able to make a request with the header 'x-open-expired' set to true in order to access an object that has expired, provided it is within its grace period. If this config flag is set to false (the default), the client will not be able to access any expired objects, even with the header.

When a client sets an 'x-open-expired' header to a true value for a GET/HEAD/POST request, the proxy will forward x-backend-open-expired to the storage server. The storage server will allow clients that set x-backend-open-expired to open and read an object that has not yet been reaped by the object-expirer, even after the x-delete-at time has passed.

The header is always ignored when used with temporary URLs.

Co-Authored-By: Anish Kachinthaya <akachinthaya@nvidia.com> Related-Change: I106103438c4162a561486ac73a09436e998ae1f0 Change-Id: Ibe7dde0e3bf587d77e14808b169c02f8fb3dddb3 |
||
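The commit only calls enable_open_expired a global config option, so the section shown below is an assumption; the flag plus the client-side x-open-expired header are the two pieces involved.

```ini
[app:object-server]
use = egg:swift#object
# Allow clients that send x-open-expired to read objects whose x-delete-at has
# passed but which the object-expirer has not yet reaped.
enable_open_expired = true
```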
|
Alistair Coles
|
ce619137db |
object-expirer: add example to delay_reaping sample config
Add an example of a delay_reaping config option with quoted key. Change-Id: I0c7ead6795822ea0fb0e81abc1e4685d7946942c Related-Change: I106103438c4162a561486ac73a09436e998ae1f0 |
||
|
Mandell Degerness
|
5961ba0ca7 |
expirer: account and container level delay_reaping
The object expirer can be configured to delay the reaping of objects from disk after their expiration time using account and container level delay_reaping values. The delay_reaping value of accounts and containers, in seconds, is configured in the object server config. The object expirer references these configured values to only reap objects from specified accounts and containers after their corresponding delays.

The goal of the delay_reaping feature is to prevent accidental or premature data loss if an object marked for deletion with the 'x-delete-at' feature should not be reaped immediately, for whatever reason. Configuring the delay_reaping value at a granular account and container level is beneficial for keeping storage capacity consumption under control while maintaining a desired data recovery window.

This patch also adds a sample configuration, documentation, and tests for bad configurations and grace period functionality.

Co-Authored-By: Anish Kachinthaya <akachinthaya@nvidia.com> Change-Id: I106103438c4162a561486ac73a09436e998ae1f0 |
||
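The exact option-name format is not spelled out in this log, so the per-account and per-container forms below are assumptions, placed in the expirer section that the related sample-config change above touches; the account and container names are made up.

```ini
[object-expirer]
# Assumed forms: hold off reaping this account's expired objects for a day,
# and this one container's for a week.
delay_reaping_AUTH_test = 86400
delay_reaping_AUTH_test/logs = 604800
```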
|
Alistair Coles
|
3517ca453e |
backend ratelimit: support per-method rate limits
Add support for config options such as: head_requests_per_device_per_second = 100 Change-Id: I2936f799b6112155ff01dcd8e1f985849a1af178 |
||
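Only the HEAD variant is quoted above; the siblings below assume the same <method>_requests_per_device_per_second naming pattern, and the values are illustrative.

```ini
[filter:backend_ratelimit]
use = egg:swift#backend_ratelimit
head_requests_per_device_per_second = 100
# Assumed sibling options following the same per-method pattern.
get_requests_per_device_per_second = 100
put_requests_per_device_per_second = 20
```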
|
Alistair Coles
|
e9abfd76ee |
backend ratelimit: support reloadable config file
Add support for a backend_ratelimit_conf_path option in the [filter:backend_ratelimit] config. If specified then the middleware will give precedence to config options from that file over config options from the [filter:backend_ratelimit] section. The path defaults to /etc/swift/backend-ratelimit.conf.

The config file is periodically reloaded and any changed options are applied. The middleware will log a warning the first time it fails to load a config file that had previously been successfully loaded. The middleware also logs at info level when it first successfully loads a config file that had previously failed to be loaded. Otherwise, the middleware will log when a config file is loaded that results in the config being changed.

Change-Id: I6554e37c6ab5b0a260f99b54169cb90ab5718f81 |
||
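A sketch of the reloadable-config arrangement described above, using the stated default path; options in the external file take precedence over this filter section and are periodically reloaded.

```ini
[filter:backend_ratelimit]
use = egg:swift#backend_ratelimit
# Periodically reloaded; options in that file take precedence over this section.
backend_ratelimit_conf_path = /etc/swift/backend-ratelimit.conf
```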
|
Tim Burke
|
6a426f7fa0 |
sharder: Add periodic_warnings_interval to example config
Change-Id: Ie3c64646373580b70557f2720a13a5a0c5ef7097 |
||
|
Zuul
|
07c8e8bcdc | Merge "Object-server: add periodic greenthread yielding during file read." | ||
|
Jianjian Huo
|
d5877179a5 |
Object-server: add periodic greenthread yielding during file read.
Currently, when the object-server serves a GET request and the DiskFile reader iterates over disk file chunks, there is no explicit eventlet sleep called. When the network outpaces the slow disk IO, it's possible that one large and slow GET request could cause the eventlet hub not to schedule any other green threads for a long period of time. To improve this, this patch adds a configurable sleep parameter to the DiskFile reader: 'cooperative_period', with a default value of 0 (disabled).

Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com> Change-Id: I80b04bad0601b6cd6caef35498f89d4ba70a4fd4 |
||
|
Alistair Coles
|
2500fbeea9 |
proxy: don't use recoverable_node_timeout with x-newest
Object GET requests with a truthy X-Newest header are not resumed if a backend request times out. The GetOrHeadHandler therefore uses the regular node_timeout when waiting for a backend connection response, rather than the possibly shorter recoverable_node_timeout. However, previously while reading data from a backend response the recoverable_node_timeout would still be used with X-Newest requests. This patch simplifies GetOrHeadHandler to never use recoverable_node_timeout when X-Newest is truthy. Change-Id: I326278ecb21465f519b281c9f6c2dedbcbb5ff14 |
||
|
Takashi Kajinami
|
bd64748a03 |
Document allowed_digests for formpost middleware
The allowed_digests option was added to the formpost middleware in
addition to the tempurl middleware[1], but the option was not added to
the formpost section in the example proxy config file.
[1]
|
||
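What the missing sample entry presumably looks like; the digest list shown mirrors the tempurl option of the same name and is an assumption.

```ini
[filter:formpost]
use = egg:swift#formpost
# Assumed values, matching the tempurl middleware's allowed_digests option.
allowed_digests = sha1 sha256 sha512
```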
|
Tim Burke
|
0c9b545ea7 |
docs: Clean up proxy logging docs
Change-Id: I6ef909e826d3901f24d3c42a78d2ab1e4e47bb64 |
||
|
Jianjian Huo
|
cb1e584e64 |
Object-server: keep SLO manifest files in page cache.
Currently, SLO manifest files will be evicted from the page cache after being read, which makes hard drives very busy when a user requests a lot of parallel byte-range GETs for a particular SLO object. This patch adds a new config option, 'keep_cache_slo_manifest', and tries to keep the manifest files in the page cache by not evicting them after reading, if the config settings allow it.

Co-Authored-By: Tim Burke <tim.burke@gmail.com> Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com> Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Change-Id: I557bd01643375d7ad68c3031430899b85908a54f |
||
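Enabling the new flag might look like this; its interaction with the existing keep_cache settings isn't described here, so only the named option is shown.

```ini
[app:object-server]
use = egg:swift#object
# Don't evict SLO manifest files from the page cache after reading them.
keep_cache_slo_manifest = true
```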
|
Tim Burke
|
469c38e9fb |
wsgi: Add keepalive_timeout option
Clients sometimes hold open connections "just in case" they might later pipeline requests. This can cause issues for proxies, especially if operators restrict max_clients in an effort to improve response times for the requests that *do* get serviced. Add a new keepalive_timeout option to give proxies a way to drop these established-but-idle connections without impacting active connections (as may happen when reducing client_timeout). Note that this requires eventlet 0.33.4 or later. Change-Id: Ib5bb84fa3f8a4b9c062d58c8d3689e7030d9feb3 |
||
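An example of dropping idle keep-alive connections after 30 seconds (an illustrative value); placing the option in the proxy's [DEFAULT] section is an assumption, and eventlet 0.33.4 or later is required as noted above.

```ini
[DEFAULT]
# Close connections that sit idle between requests for this many seconds;
# unlike client_timeout, this does not affect in-flight requests.
keepalive_timeout = 30
```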
|
Zuul
|
5fae344ef4 | Merge "internal_client: Remove allow_modify_pipeline option" | ||
|
Matthew Oliver
|
e5105ffa09 |
internal_client: Remove allow_modify_pipeline option
The internal client is supposed to be internal to the cluster, and as such we rely on it to not remove any headers we decide to send. However, if the allow_modify_pipeline option is set, the gatekeeper middleware is added to the internal client's proxy pipeline.

So firstly, this patch removes the allow_modify_pipeline option from the internal client constructor, and when calling loadapp, allow_modify_pipeline is always passed as False.

Further, an op could directly put the gatekeeper middleware into the internal client config. The internal client constructor will now check the pipeline and raise a ValueError if one has been placed in the pipeline. To do this, there is now a check_gatekeeper_loaded staticmethod that walks the pipeline and is called from the InternalClient.__init__ method. To enable walking the pipeline, we now stash the wsgi pipeline in each filter so that we don't have to rely on 'app' naming conventions to iterate the pipeline.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Change-Id: Idcca7ac0796935c8883de9084d612d64159d9f92 |
||
|
Tim Burke
|
cbba65ac91 |
quotas: Add account-level per-policy quotas
Reseller admins can set new headers on accounts like X-Account-Quota-Bytes-Policy-<policy-name>: <quota> This may be done to limit consumption of a faster, all-flash policy, for example. This is independent of the existing X-Account-Meta-Quota-Bytes header, which continues to limit the total storage for an account across all policies. Change-Id: Ib25c2f667e5b81301f8c67375644981a13487cfe |
||
|
Zuul
|
0470994a03 | Merge "slo: Default allow_async_delete to true" | ||
|
Jianjian Huo
|
4ed2b89cb7 |
Sharder: warn when sharding appears to have stalled.
This patch adds a configurable timeout after which the sharder will warn if a container DB has not completed sharding. The new config option is container_sharding_timeout, with a default of 172800 seconds (2 days).

Drive-by fix: recording sharding progress will cover the case of shard range shrinking too.

Co-Authored-By: Alistair Coles <alistairncoles@gmail.com> Change-Id: I6ce299b5232a8f394e35f148317f9e08208a0c0f |
||
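The sharder option with its stated default might appear as follows in the container-sharder section:

```ini
[container-sharder]
# Warn if a container DB has still not finished sharding after this long
# (default: 172800 seconds, i.e. 2 days).
container_sharding_timeout = 172800
```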
|
Zuul
|
8ab6af27c5 | Merge "proxy: Add a chance to skip memcache for get_*_info calls" | ||
|
Zuul
|
b05b27c0b6 | Merge "Add note about rsync_bwlimit suffixes" | ||
|
Tim Burke
|
5c6407bf59 |
proxy: Add a chance to skip memcache for get_*_info calls
If you've got thousands of requests per second for objects in a single container, you basically NEVER want that container's info to ever fall out of memcache. If it *does*, all those clients are almost certainly going to overload the container. Avoid this by allowing some small fraction of requests to bypass and refresh the cache, pushing out the TTL as long as there continue to be requests to the container. The likelihood of skipping the cache is configurable, similar to what we did for shard range sets. Change-Id: If9249a42b30e2a2e7c4b0b91f947f24bf891b86f Closes-Bug: #1883324 |
||
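The option controlling the skip likelihood is not named here, only that it is configurable "similar to what we did for shard range sets"; the option names and values below are therefore assumptions patterned on that earlier change.

```ini
[app:proxy-server]
use = egg:swift#proxy
# Assumed option names: let a small percentage of requests bypass memcache and
# refresh cached account/container info, keeping hot entries from expiring.
account_existence_skip_cache_pct = 0.1
container_existence_skip_cache_pct = 0.1
```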
|
Zuul
|
24acc6e56b | Merge "Add backend rate limiting middleware" | ||
|
Tim Burke
|
a9177a4b9d |
Add note about rsync_bwlimit suffixes
Change-Id: I019451e118d3bd7263a52cf4bf354d0d0d2b4607 |