Commit Graph

119 Commits

Author SHA1 Message Date
Tim Burke
c374a7a851 Allow floats for all intervals
Change-Id: I91e9bc02d94fe7ea6e89307305705c383087845a
2021年05月05日 15:30:21 -07:00
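For illustration only (not part of the commit message above): with this change an interval-style option can take a fractional value, e.g.
 [object-replicator]
 # fractional sleep between cycles; 2.5 is an arbitrary example value
 interval = 2.5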
Tim Burke
abfa6bee72 relinker: Parallelize per disk
Add a new option, workers, that works much like the option of the same
name in the background daemons. Disks will be distributed across N worker
sub-processes so we can make the best use of the I/O available.
While we're at it, log final stats at warning if there were errors.
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
Change-Id: I039d2b8861f69a64bd9d2cdf68f1f534c236b2ba
2021年04月05日 12:15:56 -07:00
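A hedged sketch of the new relinker option (the [object-relinker] section comes from the conf-file support added in the commit below; the value is an arbitrary example):
 [object-relinker]
 # distribute disks across 4 relinker worker sub-processes
 workers = 4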
Alistair Coles
3bdd01cf4a relinker: retry links from older part powers
If a previous partition power increase failed to cleanup all files in
their old partition locations, then during the next partition power
increase the relinker may find the same file to relink in more than
one source partition. This currently leads to an error log due to the
second relink attempt getting an EEXIST error.
With this patch, when an EEXIST is raised, the relinker will attempt
to create/verify a link from older partition power locations to the
next part power location, and if such a link is found then suppress
the error log.
During the relink step, if an alternative link is verified and a
file is found that is neither linked to the next partition power
location nor in the current part power location, then that file is
removed. This prevents the same EEXIST occurring
again during the cleanup step, when it may no longer be possible to
verify that an alternative link exists.
For example, consider identical filenames in the N+1th, Nth and N-1th
partition power locations, with the N+1th being linked to the Nth:
 - During relink, the Nth location is visited and its link is
 verified. Then the N-1th location is visited and an EEXIST error
 is encountered, but the new check verifies that a link exists to
 the Nth location, which is OK.
 - During cleanup the locations are visited in the same order, but
 files are removed so that the Nth location file no longer exists
 when the N-1th location is visited. If the N-1th location still
 has a conflicting file then existence of an alternative link to
 the Nth location can no longer be verified, so an error would be
 raised. Therefore, the N-1th location file must be removed during
 relink.
The error is only suppressed for tombstones. The number of partition
power locations that the relinker will look back over may be configured
using the link_check_limit option in a conf file or --link-check-limit
on the command line, and defaults to 2.
Closes-Bug: 1921718
Change-Id: If9beb9efabdad64e81d92708f862146d5fafb16c
2021年04月01日 18:56:57 +01:00
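An illustrative sketch of the option described above, not taken from the commit itself (exact placement and invocation may differ):
 [object-relinker]
 # how many older partition power locations to check when EEXIST is hit
 link_check_limit = 2
or, on the command line:
 swift-object-relinker relink --link-check-limit 2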
Tim Burke
53c0fc3403 relinker: Add option to ratelimit relinking
Sure, you could use stuff like ionice or cgroups to limit relinker I/O,
but sometimes a nice simple blunt instrument is handy.
Change-Id: I7fe29c7913a9e09bdf7a787ccad8bba2c77cf995
2021年02月11日 11:31:39 -08:00
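The commit message above does not name the new option; purely as an illustration, assuming a files_per_second-style setting:
 [object-relinker]
 # hypothetical example: cap relinker work at 100 files per second (0 = unlimited)
 files_per_second = 100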
Tim Burke
1b7dd34d38 relinker: Allow conf files for configuration
Swap in the standard logger options in place of --logfile. Keep --device
as a CLI-only option. Everything else is pretty standard stuff that
ought to be in [DEFAULT].
Co-Authored-By: Alistair Coles <alistairncoles@gmail.com>
Change-Id: I32f979f068592eaac39dcc6807b3114caeaaa814
2021年02月08日 14:39:27 -08:00
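A minimal sketch of such a conf file, assuming the [object-relinker] section name used by the related commits (values are arbitrary examples):
 [DEFAULT]
 log_level = INFO
 [object-relinker]
 workers = 2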
Samuel Merritt
b971280907 Let developers/operators add watchers to object audit
Swift operators may find it useful to operate on each object in their
cluster in some way. This commit provides them a way to hook into the
object auditor with a simple, clearly-defined boundary so that they
can iterate over their objects without additional disk IO.
For example, a cluster operator may want to ensure semantic
consistency, e.g. that all SLO segments are accounted for in their manifests,
or locate objects that aren't in container listings. Now that Swift
has encryption support, this could be used to locate unencrypted
objects. The list goes on.
This commit makes the auditor locate, via entry points, the watchers
named in its config file.
A watcher is a class with at least these four methods:
 __init__(self, conf, logger, **kwargs)
 start(self, audit_type, **kwargs)
 see_object(self, object_metadata, data_file_path, **kwargs)
 end(self, **kwargs)
The auditor will call watcher.start(audit_type) at the start of an
audit pass, watcher.see_object(...) for each object audited, and
watcher.end() at the end of an audit pass. All method arguments are
passed as keyword args.
This version of the API is implemented in the context of the
auditor itself, without spawning any additional processes.
If the plugins are not working well -- hang, crash, or leak --
it's easier to debug them when there's no additional complication
of processes that run by themselves.
In addition, we include a reference implementation of a plugin for
the watcher API, as a help to plugin writers.
Change-Id: I1be1faec53b2cdfaabf927598f1460e23c206b0a
2020年12月26日 17:16:14 -06:00
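A hedged configuration sketch for naming a watcher; the entry point 'mypkg#my_watcher' and the per-watcher section syntax shown here are assumptions for illustration:
 [object-auditor]
 # comma-separated entry points of watcher classes to load
 watchers = mypkg#my_watcher
 [object-auditor:watcher:mypkg#my_watcher]
 # options here are passed to the watcher's __init__ via conf
 some_option = some_value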
Tim Burke
918ab8543e Use socket_timeout kwarg instead of useless eventlet.wsgi.WRITE_TIMEOUT
No version of eventlet that I'm aware of has any sort of support for
eventlet.wsgi.WRITE_TIMEOUT; I don't know why we've been setting that.
On the other hand, the socket_timeout argument for eventlet.wsgi.Server
has been supported for a while -- since 0.14 in 2013.
Drive-by: Fix up handling of sub-second client_timeouts.
Change-Id: I1dca3c3a51a83c9d5212ee5a0ad2ba1343c68cf9
Related-Change: I1d4d028ac5e864084a9b7537b140229cb235c7a3
Related-Change: I433c97df99193ec31c863038b9b6fd20bb3705b8
2020年11月11日 14:23:40 -08:00
Zuul
b9a404b4d1 Merge "ec: Add an option to write fragments with legacy crc" 2020年11月02日 23:03:49 +00:00
Clay Gerrard
b05ad82959 Add tasks_per_second option to expirer
This allows operators to throttle expirers as needed.
Partial-Bug: #1784753
Change-Id: If75dabb431bddd4ad6100e41395bb6c31a4ce569
2020年10月23日 10:24:52 -05:00
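Illustrative only (the value is an arbitrary example):
 [object-expirer]
 # throttle each expirer to roughly 50 expiration tasks per second
 tasks_per_second = 50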
Tim Burke
599f63e762 ec: Add an option to write fragments with legacy crc
When upgrading from liberasurecode<=1.5.0, you may want to continue
writing legacy CRCs until all nodes are upgraded and capable of reading
fragments with zlib CRCs.
Starting in liberasurecode>=1.6.2, we can use the environment variable
LIBERASURECODE_WRITE_LEGACY_CRC to control whether we write zlib or
legacy CRCs, but for many operators it's easier to manage swift configs
than environment variables. Add a new option, write_legacy_ec_crc, to the
proxy-server app and object-reconstructor; if set to true, ensure legacy
frags are written.
Note that more daemons instantiate proxy-server apps than just the
proxy-server. The complete set of impacted daemons should be:
 * proxy-server
 * object-reconstructor
 * container-reconciler
 * any users of internal-client.conf
UpgradeImpact
=============
To ensure a smooth liberasurecode upgrade:
 1. Determine whether your cluster writes legacy or zlib CRCs. Depending
 on the order in which shared libraries are loaded, your servers may
 already be reading and writing zlib CRCs, even with old
 liberasurecode. In that case, no special action is required and
 WRITING LEGACY CRCS DURING THE UPGRADE WILL CAUSE AN OUTAGE.
 Just upgrade liberasurecode normally. See the closed bug for more
 information and a script to determine which CRC is used.
 2. On all nodes, ensure Swift is upgraded to a version that includes
 write_legacy_ec_crc support and write_legacy_ec_crc is enabled on
 all daemons.
 3. On each node, upgrade liberasurecode and restart Swift services.
 Because of (2), they will continue writing legacy CRCs which will
 still be readable by nodes that have not yet upgraded.
 4. Once all nodes are upgraded, remove the write_legacy_ec_crc option
 from all configs across all nodes. After restarting daemons, they
 will write zlib CRCs which will also be readable by all nodes.
Change-Id: Iff71069f808623453c0ff36b798559015e604c7d
Related-Bug: #1666320
Closes-Bug: #1886088
Depends-On: https://review.opendev.org/#/c/738959/ 
2020年09月30日 16:49:59 -07:00
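A hedged sketch of step 2 above; the exact config sections shown are assumptions:
 # proxy-server.conf (similarly for internal-client.conf users)
 [app:proxy-server]
 write_legacy_ec_crc = true
 # object-server.conf
 [object-reconstructor]
 write_legacy_ec_crc = true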
Tim Burke
9eb81f6e69 Allow replication servers to handle all request methods
Previously, the replication_server setting could take one of three
states:
 * If unspecified, the server would handle all available methods.
 * If "true", "yes", "on", etc. it would only handle replication
 methods (REPLICATE, SSYNC).
 * If any other value (including blank), it would only handle
 non-replication methods.
However, because SSYNC tunnels PUTs, POSTs, and DELETEs through
the same object-server app that's responding to SSYNC, setting
`replication_server = true` would break the protocol. This has
been the case ever since ssync was introduced.
Now, get rid of that second state -- operators can still set
`replication_server = false` as a principle-of-least-privilege guard
to ensure proxy-servers can't make replication requests, but replication
servers will be able to serve all traffic. This will allow replication
servers to be used as general internal-to-the-cluster endpoints, leaving
non-replication servers to handle client-driven traffic.
Closes-Bug: #1446873
Change-Id: Ica2b41a52d11cb10c94fa8ad780a201318c4fc87
2020年07月23日 09:11:07 -07:00
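For illustration, a least-privilege layout as described above might look like:
 # object-server.conf on servers that should NOT accept replication traffic
 [DEFAULT]
 replication_server = false
 # replication-dedicated servers simply omit the option and handle all methods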
Clay Gerrard
4601548dab Deprecate per-service auto_create_account_prefix
If we move it to constraints it's more globally accessible in our code,
but more importantly it's more obvious to ops that everything breaks if
you try to mis-configure different values per-service.
Change-Id: Ib8f7d08bc48da12be5671abe91a17ae2b49ecfee
2020年01月05日 09:53:30 -06:00
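A hedged sketch of the move, assuming the global value now lives alongside the other constraints in swift.conf:
 # swift.conf
 [swift-constraints]
 auto_create_account_prefix = .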
Tim Burke
39a54fecdc py3: add swift-dsvm-functional-py3 job
Note that keystone wants to stick some UTF-8 encoded bytes into
memcached, but we want to store it as JSON... or something?
Also, make sure we can hit memcache for containers with invalid UTF-8.
Although maybe it'd be better to catch that before we ever try memcache?
Change-Id: I1fbe133c8ec73ef6644ecfcbb1931ddef94e0400
2019年06月21日 22:31:18 -07:00
Clay Gerrard
34bd4f7fa3 Clarify usage of dequeue_from_legacy option
Change-Id: Iae9aa7a91b9afc19cb8613b5bc31de463b853dde
2019年05月05日 03:20:34 +00:00
Kazuhiro MIYAHARA
443f029a58 Enable to configure object-expirer in object-server.conf
To prepare for the object-expirer's general task queue feature [1],
this patch makes the object-expirer configurable in object-server.conf.
Object-expirer.conf can be used in the same manner as before, but is deprecated.
If a node has both an object-server.conf with an "object-expirer" section and
an object-expirer.conf, only object-server.conf is used.
Object-expirer.conf is used only if no object-server.conf has an
"object-expirer" section.
There are two differences between "object-expirer.conf" style and
"object-server.conf" style.
The first difference is the `dequeue_from_legacy` default value.
`dequeue_from_legacy` defines the task queue mode. In "object-expirer.conf"
style, the default mode is the legacy queue. In "object-server.conf" style,
the default mode is the general queue. But general mode means no-op mode
for now, because the general task queue is not implemented yet.
The second difference is the internal client config. In "object-expirer.conf"
style, the config file of the internal client is the object-expirer.conf itself.
In "object-server.conf" style, the config file of the internal client is
a separate file.
[1]: https://review.openstack.org/#/c/517389/
Co-Authored-By: Matthew Oliver <matt@oliver.net.au>
Change-Id: Ib21568f9b9d8547da87a99d65ae73a550e9c3230
2019年05月04日 15:45:02 +00:00
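A hedged sketch of the new "object-server.conf" style; the internal_client_conf_path option name is an assumption for illustration:
 # object-server.conf
 [object-expirer]
 # general task queue mode is the default here; set to true to keep draining the legacy queue
 dequeue_from_legacy = false
 # in this style the internal client config lives in a separate file
 internal_client_conf_path = /etc/swift/internal-client.conf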
Gilles Biannic
a4cc353375 Make log format for requests configurable
Add the log_msg_template option in proxy-server.conf and log_format in
a/c/o-server.conf. It is a string parsable by Python's format()
function. Some fields containing user data might be anonymized by using
log_anonymization_method and log_anonymization_salt.
Change-Id: I29e30ef45fe3f8a026e7897127ffae08a6a80cd9
2019年05月02日 17:43:25 -06:00
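An illustrative example only; the section placement and the template fields shown are assumptions:
 # proxy-server.conf
 [filter:proxy-logging]
 log_msg_template = {client_ip} {remote_addr} {method} {path} {status_int}
 log_anonymization_method = md5
 log_anonymization_salt = my_salt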
Clay Gerrard
ea8e545a27 Rebuild frags for unmounted disks
Change the behavior of the EC reconstructor to perform a fragment
rebuild to a handoff node when a primary peer responds with 507 to the
REPLICATE request.
Each primary node in an EC ring will sync with exactly three primary
peers; in addition to the left & right nodes we now select a third node
from the far side of the ring. If any of these partners respond as
unmounted, the reconstructor will rebuild its fragments to a handoff
node with the appropriate index.
To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
we must give the remote handoff node the correct backend_index for the
fragments it will receive. In the common case we will use
deterministically different handoffs for each fragment index to prevent
multiple unmounted primary disks from forcing a single handoff node to
hold more than one rebuilt fragment.
Handoff nodes will continue to attempt to revert rebuilt handoff
fragments to the appropriate primary until that primary is remounted or
the ring is rebalanced. After a rebalance of EC rings (potentially removing
unmounted/failed devices), it's most IO efficient to run in
handoffs_only mode to avoid unnecessary rebuilds.
Closes-Bug: #1510342
Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
2019年02月08日 18:04:55 +00:00
FatemaKhalid
cfeb32c66b Adding keep_idle config value to socket
Users can configure the KEEPIDLE time for sockets in TCP connections.
The default value is 600, matching the previously hard-coded value.
Change-Id: Ib7fb166deb8a87ae4e97ba0671048b1ec079a2ef
Closes-Bug:1759606
2018年09月15日 01:30:53 +02:00
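Illustrative only:
 [DEFAULT]
 # TCP keepidle seconds for accepted sockets; 600 matches the previous hard-coded value
 keep_idle = 600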
Samuel Merritt
d5c532a94e object-updater: add concurrent updates
The object updater now supports two configuration settings:
"concurrency" and "updater_workers". The latter controls how many
worker processes are spawned, while the former controls how many
concurrent container updates are performed by each worker
process. This should speed the processing of async_pendings.
There is a change to the semantics of the configuration
options. Previously, "concurrency" controlled the number of worker
processes spawned, and "updater_workers" did not exist. I switched the
meanings for consistency with other configuration options. In the
object reconstructor, object replicator, object server, object
expirer, container replicator, container server, account replicator,
account server, and account reaper, "concurrency" refers to the number
of concurrent tasks performed within one process (for reference, the
container updater and object auditor use "concurrency" to mean number
of processes).
On upgrade, a node configured with concurrency=N will still handle
async updates N-at-a-time, but will do so using only one process
instead of N.
UpgradeImpact:
If you have a config file like this:
 [object-updater]
 concurrency = <N>
and you want to take advantage of faster updates, then do this:
 [object-updater]
 concurrency = 8 # the default; you can omit this line
 updater_workers = <N>
If you want updates to be processed exactly as before, do this:
 [object-updater]
 concurrency = 1
 updater_workers = <N>
Change-Id: I17e18088e61f664e1b9942d66423666d0cae1689
2018年06月13日 17:39:34 -07:00
Thiago da Silva
36dbd38e48 Add s3api headers to allowed_headers by default
Previously, these headers had to be added by operators to their
object-server.conf when enabling swift3 middleware. Since s3api
is now imported into swift we should go ahead and add these headers
by default too.
Change-Id: Ib82e175096716e42aecdab48f01f079e09da6a1d
Signed-off-by: Thiago da Silva <thiago@redhat.com>
2018年05月29日 16:02:50 -04:00
Zuul
3313392462 Merge "Import swift3 into swift repo as s3api middleware" 2018年04月30日 16:00:56 +00:00
Kota Tsuyuzaki
636b922f3b Import swift3 into swift repo as s3api middleware
This attempts to import the openstack/swift3 package into the swift upstream
repository and namespace. This is mostly a straightforward port, except for the following items.
1. Rename swift3 namespace to swift.common.middleware.s3api
1.1 Rename also some conflicted class names (e.g. Request/Response)
2. Port unittests to test/unit/s3api dir to be able to run on the gate.
3. Port functests to test/functional/s3api and setup in-process testing
4. Port docs to doc dir, then address the namespace change.
5. Use get_logger() instead of global logger instance
6. Avoid global conf instance
Plus various minor fixes along those steps (e.g. packages, dependencies,
 deprecated things)
The details and patch references in the work on feature/s3api are listed
at https://trello.com/b/ZloaZ23t/s3api (completed board)
Note that, because this is just a port, no new features have been developed since
the last swift3 release; in future work, Swift upstream may continue
to work on the remaining items for further improvements and the best compatibility
with Amazon S3. Please read the new docs for your deployment and keep track of
what will change in future releases.
Change-Id: Ib803ea89cfee9a53c429606149159dd136c036fd
Co-Authored-By: Thiago da Silva <thiago@redhat.com>
Co-Authored-By: Tim Burke <tim.burke@gmail.com>
2018年04月27日 15:53:57 +09:00
Samuel Merritt
c28004deb0 Multiprocess object replicator
Add a multiprocess mode to the object replicator. Setting the
"replicator_workers" setting to a positive value N will result in the
replicator using up to N worker processes to perform replication
tasks.
At most one worker per disk will be spawned, so one can set
replicator_workers=99999999 to always get one worker per disk
regardless of the number of disks in each node. This is the same
behavior that the object reconstructor has.
Worker process logs will have a bit of information prepended so
operators can tell which messages came from which worker. It looks
like this:
 [worker 1/2 pid=16529] 154/154 (100.00%) partitions replicated in 1.02s (150.87/sec, 0s remaining)
The prefix is "[worker M/N pid=P] ", where M is the worker's index, N
is the total number of workers, and P is the process ID. Every message
from the replicator's logger will have the prefix; this includes
messages from down in diskfile, but does not include things printed to
stdout or stderr.
Drive-by fix: don't dump recon stats when replicating only certain
policies. When running the object replicator with replicator_workers >
0 and "--policies=X,Y,Z", the replicator would update recon stats
after running. Since it only ran on a subset of objects, it should not
update recon, much like it doesn't update recon when run with
--devices or --partitions.
Change-Id: I6802a9ad9f1f9b9dafb99d8b095af0fdbf174dc5
2018年04月24日 04:05:08 +00:00
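Illustrative only (the value is an arbitrary example):
 [object-replicator]
 # up to 4 worker processes; at most one worker per disk is ever used
 replicator_workers = 4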
Samuel Merritt
f64c00b00a Improve object-updater's stats logging
The object updater has five different stats, but its logging only told
you two of them (successes and failures), and it only told you after
finishing all the async_pendings for a device. If you have a cluster
that's been sick and has millions upon millions of async_pendings
laying around, then your object-updaters are frustratingly
silent. I've seen one cluster with around 8 million async_pendings per
disk where the object-updaters only emitted stats every 12 hours.
Yes, if you have StatsD logging set up properly, you can go look at
your graphs and get real-time feedback on what it's doing. If you
don't have that, all you get is a frustrating silence.
Now, the object updater tells you all of its stats (successes,
failures, quarantines due to bad pickles, unlinks, and errors), and it
tells you incremental progress every five minutes. The logging at the
end of a pass remains and has been expanded to also include all stats.
Also included is a small change to what counts as an error: unmounted
drives no longer do. The goal is that only abnormal things count as
errors, like permission problems, malformed filenames, and so
on. These are things that should never happen, but if they do, may
require operator intervention. Drives fail, so logging an error upon
encountering an unmounted drive is not useful.
Change-Id: Idbddd507f0b633d14dffb7a9834fce93a10359ab
2018年01月17日 13:59:23 -08:00
Romain LE DISEZ
e199192cae Replace replication_one_per_device by custom count
This commit replaces boolean replication_one_per_device by an integer
replication_concurrency_per_device. The new configuration parameter is
passed to utils.lock_path() which now accepts as an argument a limit for
the number of locks that can be acquired for a specific path.
Instead of trying to lock path/.lock, utils.lock_path() now tries to lock
files path/.lock-X, where X is in the range (0, N), N being the limit for
the number of locks allowed for the path. The default value of limit is
set to 1.
Change-Id: I3c3193344c7a57a8a4fc7932d1b10e702efd3572
2017年10月24日 16:17:41 +01:00
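A hedged before/after sketch for an object-server config:
 # previously:
 # replication_one_per_device = True
 # now; 1 preserves the old behaviour, larger values allow more concurrent locks per device:
 replication_concurrency_per_device = 1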
shangxiaobj
c93c0c0c6e [Trivialfix]Fix typos in swift
Fix typos that found in swift.
Change-Id: I52fad1a4882cec4456f22174b46d54e42ec66d97
2017年08月04日 07:50:10 +00:00
Clay Gerrard
701a172afa Add multiple worker processes strategy to reconstructor
This change adds a new Strategy concept to the daemon module similar to
how we manage WSGI workers. We need to leverage multiple python
processes to get the concurrency properties we need. More workers will
rebalance much faster on dense chassis with many devices.
Currently the default is still only one process, and no workers. Set
reconstructor_workers in the [object-reconstructor] section to some
whole number <= the number of devices on a node to get that many
reconstructor workers.
Each worker will operate on a different subset of disks.
Run-once mode works as before, but tends to update its recon drops a
little more often.
If you change the rings, the strategy will shutdown workers and spawn
new ones.
You can kill the worker pids and the daemon strategy will respawn them.
New per-disk reconstructor stats are dumped to recon under the
object_reconstruction_per_disk key. To maintain legacy compatibility
and replication monitoring based on cycle times they are aggregated
every stats_interval (default 5 mins).
Change-Id: I28925a37f3985c9082b5a06e76af4dc3ec813abe
2017年07月26日 16:55:10 -07:00
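Illustrative only (the value is an arbitrary example):
 [object-reconstructor]
 # spawn up to 4 worker processes, each operating on a different subset of disks
 reconstructor_workers = 4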
Alistair Coles
9c5628b4f1 Add reconstructor section to deployment guide
Change-Id: I062998e813718828b7adf4e7c3f877b6a31633c0
Closes-Bug: #1626290 
2017年07月20日 11:40:17 +01:00
Jenkins
f1e1dbb80a Merge "Make eventlet.tpool's thread count configurable in object server" 2017年07月04日 11:49:24 +00:00
Samuel Merritt
d9c4913e3b Make eventlet.tpool's thread count configurable in object server
If you're running servers_per_port > 0 and threads_per_disk = 0 (as it
should be with servers_per_port on), each object-server process will
have 20 IO threads waiting around to service eventlet.tpool
calls. This is far too many; with servers_per_port, there's no real
benefit to having so many IO threads.
This commit makes it so that, when servers_per_port > 0, each object
server defaults to having one main thread and one IO thread.
Also, eventlet's tpool size is now configurable via the object-server
config file. If a tpool size is set, that's what we'll use regardless
of servers_per_port. This allows operators with an excess of threads
to remove some regardless of servers_per_port.
Change-Id: I8f8914b7e70f2510393eb7c5e6be9708631ac027
Closes-Bug: 1554233
2017年06月23日 16:16:03 +10:00
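The commit message above does not give the option name; assuming an eventlet_tpool_num_threads-style setting, a sketch:
 [DEFAULT]
 # hypothetical name: explicitly size eventlet's tpool regardless of servers_per_port
 eventlet_tpool_num_threads = 1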
Ondřej Nový
a8bc94c7e3 Replace slowdown option with *_per_second option
The container and object updaters sleep "slowdown" (default 0.01) seconds
after every processed container/object. Because the time.sleep call adds
overhead, use ratelimit_sleep from common.utils instead, as the auditor already does.
Change-Id: I362aa0f13c78ad03ce1f76ee0257b0646f981212
2017年06月16日 19:22:00 +00:00
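The new option names are not spelled out above; purely as an illustration, assuming *_per_second replacements:
 # object-server.conf
 [object-updater]
 # hypothetical example roughly equivalent to slowdown = 0.01
 objects_per_second = 100
 # container-server.conf
 [container-updater]
 containers_per_second = 100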
Clay Gerrard
da557011ec Deprecate broken handoffs_first in favor of handoffs_only
The handoffs_first mode in the replicator has the useful behavior of
processing all handoff parts across all disks until there aren't any
handoffs anymore on the node [1] and then it seemingly tries to drop
back into normal operation. In practice I've only ever heard of
handoffs_first used while rebalancing and turned off as soon as the
rebalance finishes - it's not recommended to run with handoffs_first
mode turned on, and a warning is emitted on startup if the option is enabled.
The handoffs_first mode on the reconstructor doesn't work - it was
prioritizing handoffs *per-part* [2] - which is really unfortunate
because in the reconstructor during a rebalance it's often *much* more
attractive from an efficiency disk/network perspective to revert a
partition from a handoff than it is to rebuild an entire partition from
another primary using the other EC fragments in the cluster.
This change deprecates handoffs_first in favor of handoffs_only in the
reconstructor which is far more useful - and just like handoffs_first
mode in the replicator - it gives the operator the option of forcing the
consistency engine to focus on rebalance. The handoffs_only behavior is
somewhat consistent with the replicator's handoffs_first option (any
error on any handoff in the replicator will make it essentially handoff
only forever) but the option does what you want and is named correctly
in the reconstructor.
For consistency with the replicator the reconstructor will mostly honor
the handoffs_first option, but if you set handoffs_only in the config it
always takes precedence. Having handoffs_first in your config always
results in a warning, but if handoffs_only is not set and handoffs_first
is true the reconstructor will assume you need handoffs_only and behave
as such.
When running in handoffs_only mode the reconstructor will start to log a
warning every cycle if you leave it running in handoffs_only after it
finishes reverting handoffs. However you should be monitoring on-disk
partitions and disable the option as soon as the cluster finishes the
full rebalance cycle.
1. Ia324728d42c606e2f9e7d29b4ab5fcbff6e47aea fixed replicator
handoffs_first "mode"
2. Unlike replication, each partition in an EC policy can have a different
 kind of job per frag_index, but the cardinality of jobs is typically
 only one (either sync or revert) unless there have been a bunch of errors
 during write, in which case handoff partitions may hold a number of
 different fragments.
Known-Issues:
handoffs_only is not documented outside of the example config, see lp
bug #1626290
Closes-Bug: #1653018
Change-Id: Idde4b6cf92fab6c45f2c0c2733277701eb436898
2017年02月13日 21:13:29 -08:00
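Illustrative only, intended to be enabled during a rebalance and removed afterwards:
 [object-reconstructor]
 # revert handoff partitions only; disable again once the rebalance completes
 handoffs_only = True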
Mahati Chamarthy
69f7be99a6 Move documented reclaim_age option to correct location
The reclaim_age is a DiskFile option; it doesn't make sense for two
different object services or nodes to use different values.
As a drive-by, I also clean up the reclaim_age plumbing from get_hashes to
cleanup_ondisk_files since it's a method on the Manager and has access
to the configured reclaim_age. This fixes a bug where finalize_put
wouldn't use the [DEFAULT]/object-server configured reclaim_age - which
is normally benign but leads to weird behavior on DELETE requests with
really small reclaim_age.
There are a couple of places in the replicator and reconstructor that
reach into their manager to borrow the reclaim_age when emptying out
the aborted PUTs that failed to clean up their files in tmp - but that
timeout doesn't really need to be coupled with reclaim_age and that
method could have just as reasonably been implemented on the Manager.
UpgradeImpact: Previously the reclaim_age was documented to be
configurable in various object-* services config sections, but that did
not work correctly unless you also configured the option for the
object-server because of REPLICATE request rehash cleanup. All object
services must use the same reclaim_age. If you require a non-default
reclaim age it should be set in the [DEFAULT] section. If there are
different non-default values, the greater should be used for all object
services and configured only in the [DEFAULT] section.
If you specify a reclaim_age value in any object related config you
should move it to *only* the [DEFAULT] section before you upgrade. If
you configure a reclaim_age less than your consistency window you are
likely to be eaten by a Grue.
Closes-Bug: #1626296
Change-Id: I2b9189941ac29f6e3be69f76ff1c416315270916
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
2017年01月13日 03:10:47 +00:00
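A hedged sketch of the recommended placement (the value shown is the usual one-week default):
 # object-server.conf
 [DEFAULT]
 # one cluster-wide value, larger than your consistency window
 reclaim_age = 604800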
Ondřej Nový
99a13d9386 Fixed rysnc -> rsync typo
Change-Id: I671b4206072c6e22f4ae38033502336ec32e86ad
2016年10月19日 20:17:00 +02:00
Peter Lisák
ed772236c7 Change schedule priority of daemon/server in config
The goal is to allow modifying the schedule priority and I/O scheduling
class and priority of a daemon/server via configuration.
The setting is optional; the default keeps the current behaviour.
Use case:
Prioritize the object-server over the object-auditor, because all users'
requests need to be served in peak hours while auditing can wait.
Co-Authored-By: Clay Gerrard <clay.gerrard@gmail.com>
DocImpact
Change-Id: I1018a18f4706daabdb84574ffd9a58d831e68396
2016年08月10日 23:56:15 +02:00
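The option names are not given above; assuming nice_priority / ionice_class-style settings, an illustrative sketch:
 [object-auditor]
 # hypothetical example: deprioritize auditing relative to the object-server
 nice_priority = 10
 ionice_class = IOPRIO_CLASS_IDLE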
Jenkins
a403faadd4 Merge "Allow fallocate_reserve to be a percentage" 2016年05月12日 08:18:39 +00:00
Jenkins
6a88f27eb0 Merge "Remove threads_per_disk setting" 2016年05月11日 01:36:43 +00:00
Shashirekha Gundur
cf48e75c25 change default ports for servers
Changing the recommended ports for Swift services
from ports 6000-6002 to unused ports 6200-6202,
so they do not conflict with X-Windows or other services.
Updated SAIO docs.
DocImpact
Closes-Bug: #1521339
Change-Id: Ie1c778b159792c8e259e2a54cb86051686ac9d18
2016年04月29日 14:47:38 -04:00
Christian Schwede
9d6a055b31 Remove threads_per_disk setting
This patch removes the threads_per_disk setting. It was already a deprecated
setting and by default set to 0, which effectively meant not using a per-disk
thread pool at all. Users are encouraged to use servers_per_port instead.
DocImpact
Change-Id: Ie76be5c8a74d60a1330627caace19e06d1b9383c
2016年04月28日 12:06:24 -05:00
Andy McCrae
0da9da5131 Allow fallocate_reserve to be a percentage
Add the ability to set the fallocate_reserve value as a percentage.
This happens automatically when adding the '%' at the end of the value.
Having the ability to set a % of free space rather than a byte value is
useful, especially when drive sizes are heterogeneous.
The default for fallocate_reserve has been adjusted to 1%, having the
fallocate_reserve set seems sensible for all deploys and percentages are
far safer to default than byte values (across drives of any size).
Tests added for using fallocate_reserve as a percentage.
Duplicate tests for fallocate_reserve have been removed.
Docs updated to reflect the fallocate_reserve change.
Change-Id: I4aea613a708205c917e81d6b2861396655e73238
2016年04月23日 08:02:00 -05:00
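Illustrative only:
 [DEFAULT]
 # reserve 1% of each drive (the new default); a bare number is still read as bytes
 fallocate_reserve = 1%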
Clay Gerrard
1d03803a85 Auditor will clean up stale rsync tempfiles
DiskFile already fills in the _ondisk_info attribute when it tries to open
a diskfile - even if the DiskFile's fileset is not valid or deleted.
During this process the rsync tempfiles would be discovered and logged,
but no-one would attempt to clean them up - even if they were really old.
Instead of logging and ignoring unexpected files when validating a DiskFile
fileset, we'll add unexpected files to the unexpected key in the
_ondisk_info attribute.
With a little bit of re-organization in the auditor's object_audit method
to get things into a single return path we can add an unconditional check
for unexpected files and remove those that are "old enough".
Since the replicator will kill any rsync processes that are running longer
than the configured rsync_timeout we know that any rsync tempfiles older
than this can be deleted.
Split unlink_older_than in common.utils into two functions to allow an
explicit list of previously discovered paths to be passed in to avoid an
extra listdir. Since the getmtime handling already ignores OSError
there's less concern of a race condition where a previously discovered
unexpected file is reaped by rsync while we're attempting to clean it up.
Update some doc on the new config option.
Closes-Bug: #1554005
Change-Id: Id67681cb77f605e3491b8afcb9c69d769e154283
2016年03月23日 19:34:34 +00:00
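The new option is not named above; purely as an illustration, assuming an rsync_tempfile_timeout-style auditor setting:
 [object-auditor]
 # hypothetical name: age after which stale rsync temp files are unlinked
 rsync_tempfile_timeout = 86400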
Kota Tsuyuzaki
ecbcc94989 Fix ssync related object-server docs
Swift now uses the SSYNC verb instead of the old REPLICATION verb for the
ssync protocol. This patch replaces REPLICATION with SSYNC throughout the
docs and fixes a few words of explanation.
Change-Id: I1253210d4f49749e7d425d6252dd262b650d9548
2016年03月16日 08:58:31 +00:00
gh159m
b5311f63db Removed default value for log_statsd_host
Multiple files and documents showed that log_statsd_host had
a default value, usually localhost. This was incorrect; instead,
setting a value for log_statsd_host is what enables statsd logging.
Removed any reference to log_statsd_host having a default value.
Also changed descriptions to note that setting a value enables logging.
Change-Id: I3ca5c0e8b8e4981de3aa6db0c476072b5a59723d
Closes-Bug: #1542227 
2016年02月10日 10:36:59 -06:00
Clay Gerrard
f27ad34e1d Document use-case for slow option
Change-Id: Iec4087a896a2277179e3720d802cca101fa7ad54
2016年02月02日 16:10:47 -08:00
Christian Schwede
ccdf4a9f30 Document slow option in etc/object-server.conf
Change-Id: Ic9940b0b830a468887878f7b0d7ca42c2cbbebd5
2016年02月02日 09:39:32 +01:00
Ondřej Nový
a4c2fe95ab Allow to change auditor sleep interval in config
Change-Id: Ic451c5e0b686509f8982ed1bf65a223a2d77b9a0
2016年01月14日 12:52:52 +01:00
Bill Huber
0bcd7fd50e Update Erasure Coding Overview doc to remove Beta version
The major functionality of EC has been released for Liberty, and
the beta designation has been removed from the docs since EC is now
in production.
Change-Id: If60712045fb1af803093d6753fcd60434e637772
2015年12月18日 11:43:12 -06:00
Romain LE DISEZ
71f6fd025e Allows to configure the rsync modules where the replicators will send data
Currently, the rsync module where the replicators send data is static. It
prevents administrators from setting the rsync configuration based on their
current deployment or needs.
As an example, the rsyncd configuration example encourages setting a connection
limit for the account, container and object modules. This protects
devices from excessive parallel connections, which would impact
performance.
On a server with many devices, it is tempting to increase this number
proportionally, but nothing guarantees that the distribution of the connections
will be balanced. In the worst scenario, a single device can receive all the
connections, which severely impacts performance.
This commit adds a new option named 'rsync_module' to the *-replicator sections
of the *-server configuration file. This configuration variable can be
extrapolated with device attributes like ip, port, device, zone, ... by using
the format {NAME}. eg:
 rsync_module = {replication_ip}::object_{device}
With this configuration, an administrator can solve the problem of connection
distribution by creating one module per device in the rsyncd configuration.
The default values are backward compatible:
 {replication_ip}::account
 {replication_ip}::container
 {replication_ip}::object
Option vm_test_mode is deprecated by this commit, but backward compatibility is
maintained. The option is only effective when rsync_module is not set. In that
case, {replication_port} is appended to the default value of rsync_module.
Change-Id: Iad91df50dadbe96c921181797799b4444323ce2e
2015年09月07日 08:00:18 +02:00
John Dickinson
2289137164 do container listing updates in another (green)thread
The actual server-side changes are simple. The tests are a different
matter. Many changes were needed to the object server tests to
handle the now-async calls to the container server. In an effort to
test this properly, some drive-by changes were made to improve tests.
I tested this patch by doing zero-byte object writes to one container
as fast as possible. Then I did it again while also saturating 2 of the
container replica's disks. The results are linked below.
https://gist.github.com/notmyname/2bb85acfd8fbc7fc312a
DocImpact
Change-Id: I737bd0af3f124a4ce3e0862a155e97c1f0ac3e52
2015年07月22日 01:19:58 -07:00
Darrell Bishop
df134df901 Allow 1+ object-servers-per-disk deployment
Enabled by a new > 0 integer config value, "servers_per_port" in the
[DEFAULT] config section for object-server and/or replication server
configs. The setting's integer value determines how many different
object-server workers handle requests for any single unique local port
in the ring. In this mode, the parent swift-object-server process
continues to run as the original user (i.e. root if low-port binding
is required), binds to all ports as defined in the ring, and forks off
the specified number of workers per listen socket. The child, per-port
servers drop privileges and behave pretty much how object-server workers
always have, except that because the ring has unique ports per disk, the
object-servers will only be handling requests for a single disk. The
parent process detects dead servers and restarts them (with the correct
listen socket), starts missing servers when an updated ring file is
found with a device on the server with a new port, and kills extraneous
servers when their port is found to no longer be in the ring. The ring
files are stat'ed at most every "ring_check_interval" seconds, as
configured in the object-server config (same default of 15s).
Immediately stopping all swift-object-worker processes still works by
sending the parent a SIGTERM. Likewise, a SIGHUP to the parent process
still causes the parent process to close all listen sockets and exit,
allowing existing children to finish serving their existing requests.
The drop_privileges helper function now has an optional param to
suppress the setsid() call, which otherwise screws up the child workers'
process management.
The class method RingData.load() can be told to only load the ring
metadata (i.e. everything except replica2part2dev_id) with the optional
kwarg, header_only=True. This is used to keep the parent and all
forked off workers from unnecessarily having full copies of all storage
policy rings in memory.
A new helper class, swift.common.storage_policy.BindPortsCache,
provides a method to return a set of all device ports in all rings for
the server on which it is instantiated (identified by its set of IP
addresses). The BindPortsCache instance will track mtimes of ring
files, so they are not opened more frequently than necessary.
This patch includes enhancements to the probe tests and
object-replicator/object-reconstructor config plumbing to allow the
probe tests to work correctly both in the "normal" config (same IP but
unique ports for each SAIO "server") and a server-per-port setup where
each SAIO "server" must have a unique IP address and unique port per
disk within each "server". The main probe tests only work with 4
servers and 4 disks, but you can see the difference in the rings for the
EC probe tests where there are 2 disks per server for a total of 8
disks. Specifically, swift.common.ring.utils.is_local_device() will
ignore the ports when the "my_port" argument is None. Then,
object-replicator and object-reconstructor both set self.bind_port to
None if server_per_port is enabled. Bonus improvement for IPv6
addresses in is_local_device().
This PR for vagrant-swift-all-in-one will aid in testing this patch:
https://github.com/swiftstack/vagrant-swift-all-in-one/pull/16/
Also allow SAIO to answer is_local_device() better; common SAIO setups
have multiple "servers" all on the same host with different ports for
the different "servers" (which happen to match the IPs specified in the
rings for the devices on each of those "servers").
However, you can configure the SAIO to have different localhost IP
addresses (e.g. 127.0.0.1, 127.0.0.2, etc.) in the ring and in the
servers' config files' bind_ip setting.
This new whataremyips() implementation combined with a little plumbing
allows is_local_device() to accurately answer, even on an SAIO.
In the default case (an unspecified bind_ip defaults to '0.0.0.0') as
well as an explicit "bind to everything" like '0.0.0.0' or '::',
whataremyips() behaves as it always has, returning all IP addresses for
the server.
Also updated probe tests to handle each "server" in the SAIO having a
unique IP address.
For some (noisy) benchmarks that show servers_per_port=X is at least as
good as the same number of "normal" workers:
https://gist.github.com/dbishop/c214f89ca708a6b1624a#file-summary-md
Benchmarks showing the benefits of I/O isolation with a small number of
slow disks:
https://gist.github.com/dbishop/fd0ab067babdecfb07ca#file-results-md
If you were wondering what the overhead of threads_per_disk looks like:
https://gist.github.com/dbishop/1d14755fedc86a161718#file-tabular_results-md
DocImpact
Change-Id: I2239a4000b41a7e7cc53465ce794af49d44796c6
2015年06月18日 12:43:50 -07:00
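Illustrative only (the value is an arbitrary example):
 [DEFAULT]
 # fork 4 object-server workers per unique ring port; 0 (the default) disables this mode
 servers_per_port = 4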