Commit Graph

696 Commits

Author SHA1 Message Date
Tim Burke
5c6407bf59 proxy: Add a chance to skip memcache for get_*_info calls
If you've got thousands of requests per second for objects in a single
container, you basically NEVER want that container's info to ever fall
out of memcache. If it *does*, all those clients are almost certainly
going to overload the container.
Avoid this by allowing some small fraction of requests to bypass and
refresh the cache, pushing out the TTL as long as there continue to be
requests to the container. The likelihood of skipping the cache is
configurable, similar to what we did for shard range sets.
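As an illustrative proxy-server.conf sketch (the option names and units below
are assumptions, not copied from the sample config):
    [app:proxy-server]
    use = egg:swift#proxy
    # let a small share of info lookups bypass and refresh memcache
    account_existence_skip_cache_pct = 0.1
    container_existence_skip_cache_pct = 0.1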
Change-Id: If9249a42b30e2a2e7c4b0b91f947f24bf891b86f
Closes-Bug: #1883324 
2022年08月30日 18:49:48 +10:00
Zuul
24acc6e56b Merge "Add backend rate limiting middleware" 2022年08月30日 07:18:57 +00:00
Tim Burke
a9177a4b9d Add note about rsync_bwlimit suffixes
Change-Id: I019451e118d3bd7263a52cf4bf354d0d0d2b4607
2022年08月26日 08:54:06 -07:00
Tim Burke
f6196b0a22 AUTHORS/CHANGELOG for 2.30.0
Change-Id: If7c9e13fc62f8104ccb70a12b9c839f78e7e6e3e
2022年08月17日 22:21:45 -07:00
Zuul
5ff37a0d5e Merge "DB Replicator: Add handoff_delete option" 2022年07月22日 01:45:31 +00:00
Matthew Oliver
bf4edefce4 DB Replicator: Add handoff_delete option
Currently the object-replicator has an option called `handoff_delete`
which allows us to define the number of replicas which are ensured
in swift. Once a handoff node has received that many successful responses
it can go ahead and delete the handoff partition.
By default it's 'auto' or rather the number of primary nodes. But this
can be reduced. It's useful in draining full disks, but has to be used
carefully.
This patch adds the same option to the DB replicator, where it works the
same way, except that instead of deleting a partition the deletion is done
at the per-DB level. Because it is implemented at the DB replicator level,
the option is now available to both the Account and Container replicators.
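A sketch of what this might look like for the DB replicators (values here
are only illustrative; 'auto' remains the default):
    [container-replicator]
    # delete a handoff DB once two replication attempts have succeeded
    handoff_delete = 2
    [account-replicator]
    handoff_delete = 2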
Change-Id: Ide739a6d805bda20071c7977f5083574a5345a33
2022年07月21日 13:35:24 +10:00
Zuul
73b2730f71 Merge "Add ring_ip option to object services" 2022年06月06日 21:04:48 +00:00
Clay Gerrard
12bc79bf01 Add ring_ip option to object services
This will be used by the object services when finding their own devices in
rings, defaulting to the bind_ip.
Notably, this allows services to be containerized while servers_per_port
is enabled:
* For the object-server, the ring_ip should be set to the host ip and
 will be used to discover which ports need binding. Sockets will still
 be bound to the bind_ip (likely 0.0.0.0), with the assumption that the
 host will publish ports 1:1.
* For the replicator and reconstructor, the ring_ip will be used to
 discover which devices should be replicated. While bind_ip could
 previously be used for this, it would have required a separate config
 from the object-server.
Also rename the object daemons' bind_ip attribute to ring_ip so that it's
more obvious wherever we're using the IP for ring lookups instead of
socket binding.
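A containerized object-server.conf might then look roughly like this (the
host address 192.0.2.10 is a placeholder):
    [DEFAULT]
    # listen on all interfaces inside the container
    bind_ip = 0.0.0.0
    # but find our own devices in the ring by the host's address
    ring_ip = 192.0.2.10
    servers_per_port = 4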
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Change-Id: I1c9bb8086994f7930acd8cda8f56e766938c2218
2022年06月02日 16:31:29 -05:00
Zuul
5398204f22 Merge "tempurl: Deprecate sha1 signatures" 2022年06月01日 15:54:25 +00:00
Zuul
d1f2e82556 Merge "replicator: Log rsync file transfers less" 2022年05月27日 18:32:46 +00:00
Alistair Coles
ccaf49a00c Add backend rate limiting middleware
This is a fairly blunt tool: ratelimiting is per device and
applied independently in each worker, but this at least provides
some limit to disk IO on backend servers.
GET, HEAD, PUT, POST, DELETE, UPDATE and REPLICATE methods may be
rate-limited.
Only requests with a path starting '<device>/<partition>', where
<partition> can be cast to an integer, will be rate-limited. Other
requests, including, for example, recon requests with paths such as
'recon/version', are unconditionally forwarded to the next app in the
pipeline.
OPTIONS and SSYNC methods are not rate-limited. Note that
SSYNC sub-requests are passed directly to the object server app
and will not pass through this middleware.
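A minimal sketch of wiring the filter into an object-server pipeline (the
filter entry point and option name are assumptions):
    [pipeline:main]
    pipeline = healthcheck recon backend_ratelimit object-server
    [filter:backend_ratelimit]
    use = egg:swift#backend_ratelimit
    # 0 disables ratelimiting; a positive value caps requests per disk
    requests_per_device_per_second = 50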
Change-Id: I78b59a081698a6bff0d74cbac7525e28f7b5d7c1
2022年05月20日 14:40:00 +01:00
Zuul
7dfecb332b Merge "Add missing services to sample rsyslog.conf" 2022年05月18日 20:54:57 +00:00
Takashi Kajinami
d2b0c04d33 Add missing services to sample rsyslog.conf
The sample rsyslog.conf file doesn't include some container services
and object services. This change adds these services so that all daemon
services are listed.
Change-Id: Ica45b86d5b4da4e3ffc334c86bd383bebe7e7d5d
2022年05月13日 11:47:46 +09:00
Zuul
bff6e5f8fb Merge "Rip out pickle support in our memcached client" 2022年05月05日 07:03:16 +00:00
Tim Burke
7e69176817 replicator: Log rsync file transfers less
- Drop log level for successful rsyncs to debug; ops don't usually care.
- Add an option to skip "send" lines entirely -- in a large cluster,
 during a meaningful expansion, there's too much information getting
 logged; it's just wasting disk space.
Note that we already have similar filtering for directory creation;
that's been present since the initial commit of Swift code.
Drive-by: make it a little more clear that more than one suffix was
likely replicated when logging about success.
Change-Id: I02ba67e77e3378b2c2c8c682d5d230d31cd1bfa9
2022年04月28日 12:35:00 -07:00
Tim Burke
043e0163ed Clarify that rsync_io_timeout is also used for contimeout
Change-Id: I5e4a270add2a625e6d5cb0ae9468313ddc88a81b
2022年04月28日 10:07:50 -07:00
Tim Burke
11b9761cdf Rip out pickle support in our memcached client
We said this would be going away back in 1.7.0 -- let's actually remove it.
Change-Id: I9742dd907abea86da9259740d913924bb1ce73e7
Related-Change: Id7d6d547b103b4f23ebf5be98b88f09ec6027ce4
2022年04月27日 11:16:16 -07:00
Tim Burke
118cf2ba8a tempurl: Deprecate sha1 signatures
We've known this would eventually be necessary for a while [1], and
way back in 2017 we started seeing SHA-1 collisions [2].
[1] https://www.schneier.com/blog/archives/2012/10/when_will_we_se.html
[2] https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
UpgradeImpact:
==============
"sha1" has been removed from the default set of `allowed_digests` in the
tempurl middleware config. If your cluster still has clients requiring
the use of SHA-1,
- explicitly configure `allowed_digests` to include "sha1" and
- encourage your clients to move to more-secure algorithms.
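For example, a cluster that must keep supporting legacy SHA-1 clients could
opt back in explicitly (sketch only):
    [filter:tempurl]
    use = egg:swift#tempurl
    # sha1 now has to be listed explicitly; prefer the stronger digests
    allowed_digests = sha1 sha256 sha512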
Depends-On: https://review.opendev.org/c/openstack/tempest/+/832771
Change-Id: I6e6fa76671c860191a2ce921cb6caddc859b1066
Related-Change: Ia9dd1a91cc3c9c946f5f029cdefc9e66bcf01046
Closes-Bug: #1733634 
2022年04月22日 20:43:01 +10:00
Zuul
c7774d960c Merge "object-updater: defer ratelimited updates" 2022年02月22日 07:38:37 +00:00
Alistair Coles
51da2543ca object-updater: defer ratelimited updates
Previously, objects updates that could not be sent immediately due to
per-container/bucket ratelimiting [1] would be skipped and re-tried
during the next updater cycle. There could potentially be a period of
time at the end of a cycle when the updater slept, having completed a
sweep of the on-disk async pending files, despite having skipped
updates during the cycle. Skipped updates would then be read from disk
again during the next cycle.
With this change the updater will defer skipped updates to an
in-memory queue (up to a configurable maximum number) until the sweep
of async pending files has completed, and then trickle out deferred
updates until the cycle's interval expires. This increases the useful
work done in the current cycle and reduces the amount of repeated disk
IO during the next cycle.
The deferrals queue is bounded in size and will evict least recently
read updates in order to accept more recently read updates. This
reduces the probability that a deferred update has been made obsolete
by newer on-disk async pending files while waiting in the deferrals
queue.
The deferrals queue is implemented as a collection of per-bucket
queues so that updates can be drained from the queues in the order
that buckets cease to be ratelimited.
[1] Related-Change: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I95e58df9f15c5f9d552b8f4c4989a474f52262f4
2022年02月21日 10:56:23 +00:00
Zuul
36d32b907c Merge "memcache: Add an item_size_warning_threshold option" 2022年02月15日 21:41:40 +00:00
Matthew Oliver
05d83b0a47 memcache: Add an item_size_warning_threshold option
Whenever an item is set that is larger than item_size_warning_threshold,
a warning is logged in the form:
 'Item size larger than warning threshold: 2048576 (2Mi) >= 1000000 (977Ki)'
Setting the value to -1 (default) will turn off the warning.
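A sketch of the setting (assuming it lives alongside the other memcache
options; the value is in bytes):
    [memcache]
    # warn when a cached item approaches memcached's default 1 MB item limit
    item_size_warning_threshold = 1000000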
Change-Id: I1fb50844d6b9571efaab8ac67705b2fc1fe93e25
2022年02月15日 16:54:17 +00:00
Matthew Oliver
f2c279bae9 Trim sensitive information in the logs (CVE-2017-8761)
Several headers and query params were previously revealed in logs but
are now redacted:
 * X-Auth-Token header (previously redacted in the {auth_token} field,
 but not the {headers} field)
 * temp_url_sig query param (used by tempurl middleware)
 * Authorization header and X-Amz-Signature and Signature query
 parameters (used by s3api middleware)
This patch adds some new middleware helper methods to track headers and
query parameters that should be redacted by proxy-logging. While
instantiating the middleware, authors can call either:
 register_sensitive_header('case-insensitive-header-name')
 register_sensitive_param('case-sensitive-query-param-name')
to add items that should be redacted. The redaction uses proxy-logging's
existing reveal_sensitive_prefix config option to determine how much to
reveal.
Note that query params will still be logged in their entirety if
eventlet_debug is enabled.
UpgradeImpact
=============
The reveal_sensitive_prefix config option now applies to more items;
operators should review their currently-configured value to ensure it
is appropriate for these new contexts. In particular, operators should
consider reducing the value if it is more than 20 or so, even if that
previously offered sufficient protection for auth tokens.
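For example, in proxy-server.conf (reveal_sensitive_prefix is the existing
proxy-logging option; 16 is its usual default):
    [filter:proxy-logging]
    use = egg:swift#proxy_logging
    # only the first 16 characters of tokens/signatures appear in logs
    reveal_sensitive_prefix = 16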
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
Closes-Bug: #1685798
Change-Id: I88b8cfd30292325e0870029058da6fb38026ae1a
2022年02月09日 10:53:46 +00:00
Zuul
c1d2e661b1 Merge "s3api: Allow multiple storage domains" 2022年01月28日 20:24:04 +00:00
Zuul
4d48004483 Merge "proxy: Add a chance to skip memcache when looking for shard ranges" 2022年01月27日 21:37:03 +00:00
Tim Burke
8c6ccb5fd4 proxy: Add a chance to skip memcache when looking for shard ranges
By having some small portion of calls skip cache and go straight to
disk, we can ensure the cache is always kept fresh and never expires (at
least, for active containers). Previously, when shard ranges fell out of
cache there would frequently be a thundering herd that could overwhelm
the container server, leading to 503s served to clients or an increase
in async pendings.
Include metrics for hit/miss/skip rates.
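An illustrative proxy-server.conf sketch (the option name and units are
assumptions, not copied from the sample config):
    [app:proxy-server]
    # let a small share of shard-range lookups bypass and refresh memcache
    container_updating_shard_ranges_skip_cache_pct = 0.1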
Change-Id: I6d74719fb41665f787375a08184c1969c86ce2cf
Related-Bug: #1883324 
2022年01月26日 18:15:09 +00:00
Zuul
4606911010 Merge "Modify log_name in internal clients' pipeline configs" 2022年01月26日 12:32:43 +00:00
Tim Burke
11d1022163 s3api: Allow multiple storage domains
Sometimes a cluster might be accessible via more than one set
of domain names. Allow operators to configure them such that
virtual-host style requests work with all names.
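Sketch (assuming multiple domains are given as a comma-separated list):
    [filter:s3api]
    use = egg:swift#s3api
    storage_domain = s3.example.com,s3.alt.example.com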
Change-Id: I83b2fded44000bf04f558e2deb6553565d54fd4a
2022年01月24日 15:39:13 -08:00
Alistair Coles
035d91dce5 Modify log_name in internal clients' pipeline configs
Modify the 'log_name' option in the InternalClient wsgi config for the
following services: container-sharder, container-reconciler,
container-deleter, container-sync and object-expirer.
Previously the 'log_name' value for all internal client instances
sharing a single internal-client.conf file took the value configured
in the conf file, or would default to 'swift'. This resulted in no
distinction between logs from each internal client, and no association
with the service using a particular internal client.
With this change the 'log_name' value will typically be <log_route>-ic
where <log_route> is the service's conf file section name. For
example, 'container-sharder-ic'.
Note: any 'log_name' value configured in an internal client conf file
will now be ignored for these services unless the option key is
preceded by 'set'.
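A sketch of an internal-client.conf that still forces a custom name; the
'set' prefix is what makes paste.deploy keep the value:
    [app:proxy-server]
    use = egg:swift#proxy
    # without 'set', this would now be ignored for the services listed above
    set log_name = custom-ic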
Note: by default, the logger's StatsdClient uses the log_name as its
tail_prefix when composing metrics' names. However, the proxy-logging
middleware overrides the tail_prefix with the hard-coded value
'proxy-server'. This change to log_name therefore does not change the
statsd metric names emitted by the internal client's proxy-logging.
This patch does not change the logging of the services themselves,
just their internal clients.
Change-Id: I844381fb9e1f3462043d27eb93e3fa188b206d05
Related-Change: Ida39ec7eb02a93cf4b2aa68fc07b7f0ae27b5439
2022年01月12日 11:07:25 +00:00
Clay Gerrard
de88862981 Finer grained ratelimit for update
Throw our stream of async_pendings through a hash ring; if the virtual
bucket gets hot just start leaving the updates on the floor and move on.
It's off by default, and if you use it you're probably going to leave a
bunch of async updates pointed at a small set of containers in the queue
for the next sweep, every sweep (so maybe turn it off at some point).
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: Idef25cd6026b02c1b5c10a9816c8c6cbe505e7ed
2022年01月06日 12:47:09 -08:00
Tim Burke
fa1058b6ed slo: Default allow_async_delete to true
We've had this option for a year now, and it seems to help. Let's enable
it for everyone. Note that Swift clients still need to opt into the
async delete via a query param, while S3 clients get it for free.
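Operators who want the old behaviour back can disable it again in
proxy-server.conf (sketch):
    [filter:slo]
    use = egg:swift#slo
    allow_async_delete = false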
Change-Id: Ib4164f877908b855ce354cc722d9cb0be8be9921
2021年12月21日 14:12:34 -08:00
Alistair Coles
8ee631ccee reconstructor: restrict max objects per revert job
Previously the ssync Sender would attempt to revert all objects in a
partition within a single SSYNC request. With this change the
reconstructor daemon option max_objects_per_revert can be used to limit
the number of objects reverted inside a single SSYNC request for revert
type jobs, i.e. when reverting handoff partitions.
If more than max_objects_per_revert are available, the remaining objects
will remain in the sender partition and will not be reverted until the
next call to ssync.Sender, which would currently be the next time the
reconstructor visits that handoff partition.
Note that the option only applies to handoff revert jobs, not to sync
jobs.
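Sketch of the new option (the section name and value are illustrative):
    [object-reconstructor]
    # cap each revert job at 10000 objects per SSYNC request
    max_objects_per_revert = 10000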
Change-Id: If81760c80a4692212e3774e73af5ce37c02e8aff
2021年12月03日 12:43:23 +00:00
Tim Burke
5b9a90b65d sharder: Make stats interval configurable
Change-Id: Ia794a7e21794d2c1212be0e2d163004f85c2ab78
2021年10月01日 14:35:09 -07:00
Pete Zaitcev
6198284839 Add a project scope read-only role to keystoneauth
This patch continues work for more of the "Consistent and
Secure Default Policies". We already have system scope
personas implemented, but the architecture people are asking
for project scope now. At least we don't need domain scope.
Change-Id: If7d39ac0dfbe991d835b76eb79ae978fc2fd3520
2021年08月02日 14:35:32 -05:00
Zuul
4fc567cb29 Merge "container-reconciler: support multiple processes" 2021年07月22日 03:24:12 +00:00
Clay Gerrard
eb969fdeea container-reconciler: support multiple processes
This follows the same pattern of configuration used in the
object-expirer. When the container-reconciler has a configuration value
for processes, it expects that many instances of the reconciler to be
running, each configured with a process value from [0, processes).
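Sketch for one of, say, two cooperating reconciler instances (mirroring the
object-expirer's processes/process options):
    [container-reconciler]
    # total number of reconciler instances sharing the queue
    processes = 2
    # this instance's slot, in [0, processes)
    process = 0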
Change-Id: Ie46bda37ca3f6e692ec31a4ddcd46f343fb1aeca
2021年07月21日 11:45:01 -07:00
Alistair Coles
2696a79f09 reconstructor: retire nondurable_purge_delay option
The nondurable_purge_delay option was introduced in [1] to prevent the
reconstructor removing non-durable data files on handoffs that were
about to be made durable. The DiskFileManager commit_window option has
since been introduced [2] which specifies a similar time window during
which non-durable data files should not be removed. The commit_window
option can be re-used by the reconstructor, making the
nondurable_purge_delay option redundant.
The nondurable_purge_delay option has not been available in any tagged
release and is therefore removed with no backwards compatibility.
[1] Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
[2] Related-Change: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
Change-Id: I1589a7517b7375fcc21472e2d514f26986bf5079
2021年07月19日 21:18:06 +01:00
Alistair Coles
bbaed18e9b diskfile: don't remove recently written non-durables
DiskFileManager will remove any stale files during
cleanup_ondisk_files(): these include tombstones and nondurable EC
data fragments whose timestamps are older than reclaim_age. It can
usually be safely assumed that a non-durable data fragment older than
reclaim_age is not going to become durable. However, if an agent PUTs
objects with specified older X-Timestamps (for example the reconciler
or container-sync) then there is a window of time during which the
object server has written an old non-durable data file but has not yet
committed it to make it durable.
Previously, if another process (for example the reconstructor) called
cleanup_ondisk_files during this window then the non-durable data file
would be removed. The subsequent attempt to commit the data file would
then result in a traceback due to there no longer being a data file to
rename, and of course the data file is lost.
This patch modifies cleanup_ondisk_files to not remove old, otherwise
stale, non-durable data files that were only written to disk in the
preceding 'commit_window' seconds. 'commit_window' is configurable for
the object server and defaults to 60.0 seconds.
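Sketch of the setting (assuming it sits alongside reclaim_age in
object-server.conf):
    [app:object-server]
    use = egg:swift#object
    # don't reclaim non-durable .data files written in the last 60 seconds
    commit_window = 60.0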
Closes-Bug: #1936508
Related-Change: I0d519ebaaade35249fb7b17bd5f419ffdaa616c0
Change-Id: I5f3318a44af64b77a63713e6ff8d0fd3b6144f13
2021年07月19日 21:18:02 +01:00
Clay Gerrard
4e52d946bf Add concurrency to reconciler
Each reconciler process can now reconcile more than one queue entry at a
time, up to the configured concurrency.
By default concurrency is 1. There is no expected change to existing
behavior. Entries are processed serially, one at a time.
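Sketch:
    [container-reconciler]
    # reconcile up to four queue entries at a time (default 1)
    concurrency = 4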
Change-Id: I72e9601b58c2f20bb1294876bb39f2c78827d5f8
2021年07月14日 12:27:26 -07:00
Zuul
a5fc6a8211 Merge "reconciler: PPI aware reconciler" 2021年07月14日 18:53:25 +00:00
Matthew Oliver
e491693e36 reconciler: PPI aware reconciler
This patch makes the reconciler PPI aware. It does this by adding a
helper method `can_reconcile_policy` that is used to check that the
policies used for the source and destination aren't in the middle of a
PPI (their ring doesn't have next_part_power set).
In order to accomplish this, the reconciler now includes the POLICIES
singleton and has grown swift_dir and ring_check_interval config options.
Closes-Bug: #1934314
Change-Id: I78a94dd1be90913a7a75d90850ec5ef4a85be4db
2021年07月13日 13:55:13 +10:00
Zuul
17489ce7bf Merge "sharder: avoid small tail shards" 2021年07月08日 17:00:52 +00:00
Zuul
8066efb43a Merge "sharder: support rows_per_shard in config file" 2021年07月07日 23:06:08 +00:00
Alistair Coles
2a593174a5 sharder: avoid small tail shards
A container is typically sharded when it has grown to have an object
count of shard_container_threshold + N, where N <<
shard_container_threshold. If sharded using the default
rows_per_shard of shard_container_threshold / 2 then this would
previously result in 3 shards: the tail shard would typically be
small, having only N rows. This behaviour caused more shards to be
generated than desirable.
This patch adds a minimum-shard-size option to
swift-manage-shard-ranges, and a corresponding option in the sharder
config, which can be used to avoid small tail shards. If set to
greater than one then the final shard range may be extended to more
than rows_per_shard in order to avoid a further shard range with less
than minimum-shard-size rows. In the example given, if
minimum-shard-size is set to M > N then the container would shard into
two shards having rows_per_shard rows and rows_per_shard + N
respectively.
The default value for minimum-shard-size is rows_per_shard // 5. If
all options have their default values this results in
minimum-shard-size being 100000.
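A container-sharder config sketch tying the options together (the config
spelling minimum_shard_size is an assumption based on the CLI flag; the
values shown are the stated defaults):
    [container-sharder]
    shard_container_threshold = 1000000
    rows_per_shard = 500000
    # avoid a final shard smaller than this; default is rows_per_shard // 5
    minimum_shard_size = 100000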
Closes-Bug: #1928370
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: I3baa278c6eaf488e3f390a936eebbec13f2c3e55
2021年07月07日 13:59:36 +01:00
Alistair Coles
a87317db6e sharder: support rows_per_shard in config file
Make rows_per_shard an option that can be configured
in the [container-sharder] section of a config file.
For auto-sharding, this option was previously hard-coded to
shard_container_threshold // 2.
The swift-manage-shard-ranges command line tool already supported
rows_per_shard on the command line and will now also load it from a
config file if specified. Any value given on the command line takes
precedence over any value found in a config file.
Change-Id: I820e133a4e24400ed1e6a87ebf357f7dac463e38
2021年07月07日 13:59:36 +01:00
Alistair Coles
2fd5b87dc5 reconstructor: make quarantine delay configurable
Previously the reconstructor would quarantine isolated durable
fragments that were more than reclaim_age old. This patch adds a
quarantine_age option for the reconstructor which defaults to
reclaim_age but can be used to configure the age that a fragment must
reach before quarantining.
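Sketch (value in seconds; when unset the option defaults to reclaim_age):
    [object-reconstructor]
    # quarantine isolated durable fragments once they are a week old
    quarantine_age = 604800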
Change-Id: I867f3ea0cf60620c576da0c1f2c65cec2cf19aa0
2021年07月06日 16:41:08 +01:00
Zuul
653daf73ed Merge "relinker: tolerate existing tombstone with same timestamp" 2021年07月02日 22:59:34 +00:00
Zuul
2efd4316a6 Merge "Make dark data watcher ignore the newly updated objects" 2021年07月02日 20:52:55 +00:00
Alistair Coles
574897ae27 relinker: tolerate existing tombstone with same timestamp
It is possible for the current and next part power locations to
both have existing tombstones with different inodes when the
relinker tries to relink. This can be caused, for example, by
concurrent reconciler DELETEs that specify the same timestamp.
The relinker previously failed to relink and reported an error when
encountering this situation. With this patch the relinker will
tolerate an existing tombstone with the same filename but different
inode in the next part power location.
Since [1] the relinker had special case handling for EEXIST errors
caused by a different inode tombstone already existing in the next
partition power location: the relinker would check to see if the
existing next part power tombstone linked to a tombstone in a previous
part power (i.e. < current part power) location, and if so tolerate
the EEXIST.
This special case handling is no longer necessary because the relinker
will now tolerate an EEXIST when linking a tombstone provided the two
files have the same timestamp. There is therefore no need to search
previous part power locations for a tombstone that does link with the
next part power location.
The link_check_limit is no longer used but the --link-check-limit
command line option is still allowed (although ignored) for backwards
compatibility.
[1] Related-Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
Change-Id: I07ffee3b4ba6c7ff6c206beaf6b8f746fe365c2b
Closes-Bug: #1934142 
2021年07月02日 12:14:47 +01:00
Pete Zaitcev
95e0316451 Make dark data watcher ignore the newly updated objects
When objects are freshly uploaded, they may take a little time
to appear in container listings, producing false positives.
Because we needed to test this, we also reworked/added the tests
and fixed some issues, including adding an EC fragment (thanks
to Alistair's code).
Closes-Bug: 1925782
Change-Id: Ieafa72a496328f7a487ca7062da6253994a5a07d
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
2021年06月30日 16:38:57 -05:00