Merge "Add backlog spec for Graceful shutodwn of nova services"
specs/backlog/approved/nova-services-graceful-shutdown.rst
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==================================
Graceful Shutdown of Nova Services
==================================
https://blueprints.launchpad.net/nova/+spec/nova-services-graceful-shutdown
This is a backlog spec proposing the design of graceful shutdown.
Nova services do not shut down gracefully. When services are stopped, all
in-progress operations are stopped as well, which not only interrupts those
operations but can leave instances in an unwanted or unrecoverable state. The
idea is to let services stop processing new requests, but complete the
in-progress operations before the service is terminated.
Problem description
===================
Nova services do not have a way to shut down gracefully, meaning they do not
wait for in-progress operations to complete. When shutdown is initiated,
services stop the RPC server and wait for it so that all the existing request
messages (RPC call/cast) are consumed from the queue, but the service does
not complete the resulting operations.
Each Nova compute service has a single worker running and listening on a
single RPC server (topic: compute.<host>). The same RPC server is used for
new requests as well as for in-progress operations where other compute or
conductor services communicate. When shutdown is initiated, the RPC server is
stopped, which means it stops handling new requests; that is fine, but at the
same time it stops the communication needed for in-progress operations. For
example, if a live migration is in progress, the source and destination
computes communicate (in sync and async ways) multiple times with each other.
Once the RPC server on the compute service is stopped, it cannot communicate
with the other compute, and the live migration fails. This can leave the
system as well as the instance in an unwanted or unrecoverable state.
Use Cases
---------
As an operator, I want to be able to gracefully shut down (SIGTERM) the Nova
services so that doing so does not impact the users' in-progress operations
or leave resources in an unusable state.
As an operator, I want to be able to keep instances and other resources in a
usable state even if a service is gracefully terminated (SIGTERM).
As an operator, I want to get the actual benefits of the k8s pod graceful
shutdown when Nova services are running in k8s pods.
As a user, I want in-progress operations to be completed before the service
is gracefully terminated (SIGTERM).
Proposed change
===============
Scope: The proposed solution is to gracefully shut down the services on
receiving the SIGTERM signal.
The graceful shutdown is based on the following design principles:
* When service shutdown is initiated by SIGTERM:
* Do not process any new requests.
* New requests should not be lost. Once the service is restarted, it should
process them.
* Allow in-progress operations to reach their quickest safe termination
point, either completion or abort.
* Proper logging of the state of in-progress operations.
* Keep instances and other resources in a usable state.
* When service shutdown is completed:
* Proper logging of unfinished operations.
Ideally, all the in-progress operations should be completed before the
service is terminated, but if the graceful shutdown times out (due to a
configured timeout; the timeout details are covered in a later section),
then all the unfinished operations should be properly logged. This will
help to recover the system or instances.
* When service is started again:
* Start processing the new requests in the normal way.
* If requests were not processed because shutdown was initiated, they stay
in the message broker queue, and there are multiple possibilities:
* Requests might have been picked up by another worker of that service.
For example, you can run more than one Nova scheduler (or conductor)
worker. If one of the workers is shutting down, another worker will
process the request. This is not the case for Nova compute, which is
always a single worker per compute service on a specific host.
* If a service has a single worker running, the request can be picked up
once the service is up again.
* There is an opportunity for the compute service to clean up or recover
the interrupted operations on instances during init_host(). The action
taken will depend on the task and its status.
* If the service stays in the stopped state for a long time, then, based
on the RPC and message queue timeouts, there is a chance that:
* The RPC client or server will time out the call.
* The message broker queue may drop messages due to timeout.
* The order of requests and messages can be stale.
To reach the graceful shutdown goal, we need to do two things:
#. A way to stop new requests without interrupting in-progress operations.
This is proposed to be done via RPC.
#. Give services enough time to finish their operations. As a first step,
this is proposed to be done via a time-based wait, and later with a proper
tracking mechanism.
This backlog spec proposes achieving the above goals in two steps. Each step
will be proposed as a separate spec for a specific release.
Nova services that already shut down gracefully:
------------------------------------------------
For the services below, graceful shutdown is handled by their deployment
servers or the libraries they use.
* Nova API & Nova metadata API:
Those services are deployed using a server with WSGI support. That server
will ensure that Nova API services shuts down gracefully, meaning it
finishes the in-progress requests and rejects the new requests.
I investigate with uWSGI/mod_proxy_uwsgi (devstack env). On service start,
uWSGI server pre-spawn the number of workers for API service which will
handle the API requests in distributed way. When shutdown is initiated
by SIGTERM, the uWSGI server SIGTERM handler check if there are any
in-progress request on any worker. It wait for all the workers to finish
the request and then terminates each worker. Once all worker are terminated
then it will terminate the Nova API service.
If any new request comes after the shutdown is initiated, it will be rejected
with "503 Service Unavailable" error.
Testing:
I tested two types of requests:
#. Sync request: ``openstack server list``:
* To observe the graceful shutdown, I added 10 seconds of sleep in the
server list API code.
* Start an API request 'request1': ``openstack server list``
* Wait till the server list request reaches the Nova API (you can see
the log on the controller).
* Because of sleep(10), the server list takes time to finish.
* Initiate the Nova API service shutdown.
* Start a new API request 'request2': ``openstack server list``. This new
request came after shutdown was initiated, so it should be denied.
* The Nova API service will wait because 'request1' is not finished.
* 'request1' will get the server list response before the service
is terminated.
* 'request2' is denied and receives the error
"503 Service Unavailable".
#. Async request: ``openstack server pause <server>``:
* To observe the graceful shutdown, I added 10 seconds of sleep in the
server pause API code.
* Start an API request 'request1': ``openstack server pause server1``
* Wait till the pause server request reaches the Nova API (you can see
the log on the controller).
* Because of sleep(10), the pause server takes time to finish.
* Initiate the Nova API service shutdown.
* The service will wait because 'request1' is not finished.
* Nova API will make an RPC cast to the Nova compute service and return.
* 'request1' is completed, and the response is returned to the user.
* The Nova API service is terminated now.
* The Nova compute service is processing the pause server request.
* Check whether the server is paused: ``openstack server list``
* You can see the server is paused.
* Nova console proxy services: nova-novncproxy, nova-serialproxy, and
nova-spicehtml5proxy:
All the console proxy services run as a websockify.websocketproxy_ service.
The websockify_ library handles the SIGTERM signal and the graceful
shutdown, which is enough for the Nova services.
When a user accesses the console, the websockify library starts a new
process in start_service_ and calls Nova's new_websocket_client_. Nova
authorizes the token and creates a socket on the host & port, which is
used to send the data/frames. After that, the user can access the console.
If a shutdown request is initiated, websockify handles the signal. First,
it terminates all the child processes and then raises the terminate
exception, which ends up calling the Nova close_connection_ method. The
Nova close_connection_ method calls shutdown() on the socket first and
then close(), which makes sure the remaining data/frames are sent before
closing the socket.
This way, user console sessions are terminated gracefully, and users get
a "Disconnected" message. Once the service is up again, the user can
refresh the browser, and the console will be back (if the token has not
expired).
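As a minimal illustration of that shutdown-then-close pattern (a simplified
sketch only, not Nova's actual close_connection_ code):

.. code-block:: python

    import socket

    def graceful_close(sock: socket.socket) -> None:
        """Flush pending frames, then release the socket."""
        try:
            # shutdown() sends FIN only after buffered data has been
            # transmitted, so the peer receives the remaining frames.
            sock.shutdown(socket.SHUT_RDWR)
        except OSError:
            # The peer may already be gone; closing is still safe.
            pass
        sock.close()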
Spec 1: Split the new and in-progress requests via RPC:
-------------------------------------------------------
RPC communication is an essential part of how services finish a particular
operation. During shutdown, we need to make sure we keep the required RPC
servers/buses up. If we stop the RPC communication, then it is no different
from service termination.
This spec talks a lot about the RPC server ``start``, ``stop``, and ``wait``
operations that Nova implements, so let's cover them briefly from the
oslo.messaging/RPC resources point of view to make this proposal easier to
understand. Most readers might already know this and can skip this part.
* RPC server:
* creation and start():
* It will create the required resources on the oslo.messaging side, for
example, dispatcher, consumer, listener, and queues.
* It will handle the binding to the required exchanges.
* stop():
* It will disable the listener's ability to pick up any new message
from the queue, but will dispatch the already-picked messages to
the dispatcher.
* It will delete the consumer.
* It will not delete the queues and exchange on the message broker side.
* It will not stop RPC clients from sending new messages to the queue;
however, they will not be picked up because the consumer and listener
are stopped.
* wait():
* It will wait for the thread pool to finish dispatching all the
already-picked messages. Basically, this makes sure the methods are
called on the manager.
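A minimal sketch of this lifecycle using the public oslo.messaging API
(``EchoEndpoint`` and the topic/server names are placeholders for
illustration):

.. code-block:: python

    import oslo_messaging as messaging
    from oslo_config import cfg

    class EchoEndpoint:
        """Placeholder endpoint; real endpoints expose the manager methods."""
        def echo(self, ctxt, message):
            return message

    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='compute', server='host1')
    server = messaging.get_rpc_server(transport, target, [EchoEndpoint()])

    server.start()  # create dispatcher, consumer, listener; bind the queues
    # ... service runs ...
    server.stop()   # stop picking up new messages; the consumer is deleted,
                    # but the queues stay on the broker
    server.wait()   # finish dispatching the already-picked messages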
An analysis per service and the required proposed RPC design changes:
* The services listed below communicate with other Nova services' RPC
servers. Since they do not have their own RPC server, no change is needed:
* Nova API
* Nova metadata API
* nova-novncproxy
* nova-serialproxy
* nova-spicehtml5proxy
* Nova scheduler: No RPC change needed.
* Request handling:
The Nova scheduler service runs as multiple workers, each having its own
RPC server, but all the Nova scheduler workers listen to the same RPC
topic and queue ``scheduler`` in a fanout way.
Currently, nova.service.py->stop() calls stop() and wait() on the RPC
server. Once the RPC server is stopped, it stops listening for any new
messages. But this does not impact anything on the other scheduler
workers; they continue listening to the same queue and processing
requests. If any of the scheduler workers is stopped, then the other
workers will process the request.
* Response handling:
Whenever there is an RPC call, oslo.messaging creates another reply queue
connected with the unique message id. This reply queue is used to send
the RPC call response to the caller. Even if the RPC server is stopped
on this worker, it will not impact the reply queue.
We still need to keep the worker up until all the responses are sent via
the reply queue, and for that, we need to implement in-progress task
tracking in the scheduler service, but that will be handled in step 2.
This way, stopping a Nova scheduler worker will not impact the RPC
communication on the scheduler service.
* Nova conductor: No RPC change needed.
The Nova conductor binary is a stateless service that can spawn multiple
worker threads. Each instance of the Nova conductor has its own RPC
server, but all the Nova conductor instances listen to the same RPC topic
and queue ``conductor``. This allows the conductor instances to act as a
distributed worker pool, such that stopping an individual conductor
instance will not impact the RPC communication for the pool of conductor
instances, allowing other available workers to process the request. Each
cell has its own pool of conductors, meaning that as long as one conductor
is up for any given cell, the RPC communication will continue to function
even when one or more conductors are stopped.
The request and response handling is done in the same way as mentioned for
the scheduler.
.. note::
This spec does not cover the conductor single-worker case. That might
require the RPC redesign for the conductor as well, but it needs more
investigation.
* Nova compute: RPC design change needed.
* Request handling:
Nova compute runs as a single worker per host, and each compute host has
its own RPC server, listener, and separate queues. It handles new
requests as well as the communication needed for in-progress operations
on the same RPC server. To achieve the graceful shutdown, we need to
separate the communication for new requests and in-progress operations.
This will be done by adding a new RPC server in the compute service.
For easy readability, we will be using a different term for each RPC
server:
* 'ops RPC server': This term will be used for the new RPC server, which
will be used to finish the in-progress requests and will stay up during
shutdown.
* 'new request RPC server': This term will be used for the current RPC
server, which is used for the new requests and will be stopped during
shutdown.
* 'new request RPC server' per compute:
No change in this RPC server, but it will be used for all the new
requests, so that we can stop it during shutdown and stop the new
requests on the compute.
* 'ops RPC server' per compute:
* Each compute will have a new 'ops RPC server', which will listen on a
new topic ``compute-ops.<host>``. The ``compute-ops`` name is used
because it is mainly for compute operations, but a better name can be
used if needed.
* It will use the same transport layer/bus and exchange that the
'new request RPC server' uses.
* It will create its own dispatcher, listener, and queue.
* Both RPC servers will be bound to the same endpoints (the same compute
manager), so that requests coming from either server are handled by
the same compute manager.
* This server will mainly be used for compute-to-compute operations and
server external events. The idea is to keep this RPC server up during
shutdown so that the in-progress operations can be finished.
* During shutdown, nova.service will wait for the compute to signal that
it has finished all its tasks, so that it can stop the 'ops RPC server'
and finish the shutdown.
* Response handling:
Irrespective of which RPC server a request comes from, whenever there is
an RPC call, oslo.messaging creates another reply queue connected with
the unique message id. This reply queue is used to send the RPC call
response to the caller. Even if the RPC server is stopped on this worker,
it will not impact the reply queue.
* Compute service workflow:
* The SIGTERM signal is handled by oslo.service, which will call stop on
nova.service.
* nova.service will stop the 'new request RPC server' so that no new
requests are picked up by the compute. The 'ops RPC server' stays up
and running.
* nova.service will wait for the manager to signal once all in-progress
operations are finished.
* Once the compute signals nova.service, it will stop the
'ops RPC server' and proceed with the service shutdown.
* Timeout:
* There is an existing graceful_shutdown_timeout_ config option present
in oslo.service which can be set per service.
* It is honoured to time out the service stop, and it will stop the
service irrespective of whether the compute has finished everything.
* RPC client:
* The RPC client stays as a singleton class, which is created with the
topic ``compute.<host>``, meaning that by default a message will be
sent via the 'new request RPC server'.
* If any RPC cast/call wants to send a message via the 'ops RPC server',
it needs to override the ``topic`` to ``compute-ops.<host>`` during the
client.prepare() call (see the sketch at the end of this section).
* Which RPC casts/calls will use the 'ops RPC server' will be decided
during implementation, so that we can make a better judgment on which
methods are used for the operations we want to finish during shutdown.
A draft list of where we can use the 'ops RPC server':
.. note::
This is a draft list and can be changed during implementation.
* Migrations:
- Live migration:
.. note::
We will be using the 'new request RPC server' for the
check_can_live_migrate_destination and
check_can_live_migrate_source methods, as this is the very initial
phase where the compute service has not started the live
migration. If shutdown is initiated before the live migration
request comes in, the migration should be rejected.
- pre_live_migration()
- live_migration()
- prep_snapshot_based_resize_at_dest()
- remove_volume_connection()
- post_live_migration_at_destination()
- rollback_live_migration_at_destination()
- drop_move_claim_at_destination()
- resize methods
- cold migration methods
* Server external events
* Rebuild instance
* validate_console_port()
This is for the case when the console has already been requested; if a
port validation request is in progress, the compute should finish it
before shutdown so that users can get their requested console.
* Time-based waiting for services to finish the in-progress operations:
.. note::
The time-based waiting is a temporary solution in spec 1. In spec 2,
it will be replaced by proper tracking of in-progress tasks.
* To make the graceful shutdown less complicated, spec 1 proposes a
configurable time-based wait for services to complete their operations.
* The wait time should be less than the global graceful shutdown timeout,
so that the external system or oslo.service does not shut down the
service before the service wait time is over.
* It will be configurable per service.
* Proposal for the default values:
* compute service: 150 sec, considering long-running operations on compute.
* conductor service: 60 sec should be enough.
* scheduler service: 60 sec should be enough.
* PoC:
This PoC demonstrates the spec 1 proposal in action.
* Code change: https://review.opendev.org/c/openstack/nova/+/967261
* PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
* Some specific examples of the shutdown issues which will be solved by this
proposal:
* Migrations:
* Migration operations will use the 'ops RPC server'.
* If a migration is in progress, then the service shutdown will not
terminate the migration; instead, it will be able to wait for the
migration to complete.
* Instance boot:
* Instance boot operations will continue to use the
'new request RPC server'. Otherwise, we would not be able to stop the
new requests.
* If instance boot requests are in progress on compute services, then
shutdown will wait for the compute to boot them successfully.
* If a new instance boot request arrives after the shutdown is initiated,
then it will stay in the queue, and the compute will handle it once it
is started again.
* Any operation which has reached the compute will be completed before the
service is shut down.
.. note::
As per my PoC and manual testing so far, this does not require any
change on the oslo.messaging side.
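To make the proposed design concrete, below is a hedged sketch of the two
compute RPC servers and the client-side topic override. The ``compute-ops``
topic is this spec's draft name, the stub manager and host are placeholders,
the real wiring lives in nova.rpc/nova.service, and a recent oslo.messaging
with ``get_rpc_client`` is assumed:

.. code-block:: python

    import oslo_messaging as messaging
    from oslo_config import cfg

    RPC_TOPIC = 'compute'          # 'new request RPC server' topic
    RPC_TOPIC_OPS = 'compute-ops'  # draft 'ops RPC server' topic
    HOST = 'host1'                 # placeholder compute host

    class ComputeManager:
        """Stub standing in for nova.compute.manager.ComputeManager."""
        def pre_live_migration(self, ctxt, instance):
            pass

    transport = messaging.get_rpc_transport(cfg.CONF)
    endpoints = [ComputeManager()]  # both servers bind the same endpoints

    new_request_server = messaging.get_rpc_server(
        transport, messaging.Target(topic=RPC_TOPIC, server=HOST), endpoints)
    ops_server = messaging.get_rpc_server(
        transport, messaging.Target(topic=RPC_TOPIC_OPS, server=HOST),
        endpoints)
    new_request_server.start()
    ops_server.start()

    # On SIGTERM, only the 'new request RPC server' is stopped first:
    new_request_server.stop()
    new_request_server.wait()
    # ... wait for in-progress operations to finish (time-based in spec 1) ...
    ops_server.stop()
    ops_server.wait()

    # Caller side: the singleton client defaults to RPC_TOPIC; overriding
    # the topic in prepare() routes the message via the 'ops RPC server',
    # e.g. cctxt.cast(ctxt, 'pre_live_migration', instance=instance).
    client = messaging.get_rpc_client(
        transport, messaging.Target(topic=RPC_TOPIC))
    cctxt = client.prepare(topic=RPC_TOPIC_OPS, server=HOST)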
Spec 2: Smartly track and wait for the in-progress operations:
--------------------------------------------------------------
* The graceful shutdown of the services below is handled by their deployment
server or library, so no work is needed for Spec 2:
* Nova API
* Nova metadata API
* nova-novncproxy
* nova-serialproxy
* nova-spicehtml5proxy
* The services below need to implement the tracking system:
* Nova compute
* Nova conductor
* Nova scheduler
This proposal is to base the service wait time on tracking the in-progress
tasks. Once a service finishes its tasks, it can signal nova.service to
proceed with shutting down the service. Basically, this replaces the wait
time approach mentioned above with a tracker-based approach.
* A task tracker will be introduced to track the in-progress tasks.
* It will be a singleton object.
* It maintains a list of 'method names' and ``request-id``. If a task is
related to an instance, then we can also add the instance UUID, which can
help to filter or know which operations on a specific instance are in
progress. The unique ``request-id`` will help to track multiple calls to
the same method.
* Whenever a new request comes to the compute, it will be added to the task
list and removed once the task is completed. Modifications to the tracker
will be done under a lock.
* Once shutdown is initiated:
* The task tracker will either add the new tasks to the tracker list or
reject them. The decision will be made case by case; for example, reject
the tasks if they are not critical to handle during shutdown.
* During shutdown, any new periodic tasks will be denied, but in-progress
periodic tasks will be finished.
* An exact list of tasks which will be rejected and accepted will be decided
during implementation.
* The task tracker will start logging the tasks which are in progress, and
log when they are completed. Basically, log the detailed view of
in-progress things during shutdown.
* nova.service will wait for the task tracker to finish the in-progress
tasks until timeout.
* An example of the flow of RPC server stop, wait, and task tracker wait
will be something like:
* We can signal the task tracker to start logging the in-progress tasks.
* RPCserver1.stop()
* RPCserver1.wait()
* manager.finish_tasks(): wait for the manager to finish the in-progress
tasks.
* RPCserver2.stop()
* RPCserver2.wait()
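A minimal, hypothetical sketch of such a tracker (the ``TaskTracker`` name
and its interface are illustrative only; the real design will be decided
during implementation):

.. code-block:: python

    import logging
    import threading
    import time

    LOG = logging.getLogger(__name__)

    class TaskTracker:
        """Track in-progress tasks; reject new ones once draining starts."""

        def __init__(self):
            self._lock = threading.Lock()
            # request-id -> (method name, instance uuid or None)
            self._tasks = {}
            self._draining = False

        def start_task(self, request_id, method, instance_uuid=None):
            with self._lock:
                if self._draining:
                    # Case-by-case in the real design; non-critical tasks
                    # are rejected once shutdown is initiated.
                    raise RuntimeError('shutting down; rejected: %s' % method)
                self._tasks[request_id] = (method, instance_uuid)

        def finish_task(self, request_id):
            with self._lock:
                task = self._tasks.pop(request_id, None)
                if task and self._draining:
                    LOG.info('Completed during shutdown: %s', task[0])

        def drain(self, timeout):
            """Stop accepting tasks; wait for in-progress ones to finish."""
            with self._lock:
                self._draining = True
                for method, uuid in self._tasks.values():
                    LOG.info('In progress: %s (instance: %s)', method, uuid)
            deadline = time.monotonic() + timeout
            while time.monotonic() < deadline:
                with self._lock:
                    if not self._tasks:
                        return True
                time.sleep(1)
            with self._lock:  # log whatever is still unfinished
                for method, uuid in self._tasks.values():
                    LOG.warning('Unfinished: %s (instance: %s)', method, uuid)
            return False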
Graceful Shutdown Timeouts:
---------------------------
* Nova service timeout:
* oslo.service already has a timeout (graceful_shutdown_timeout_) which is
configurable per service and used to time out the SIGTERM signal handler.
* oslo.service will terminate the Nova service based on
graceful_shutdown_timeout_, even if the Nova service's graceful shutdown
is not finished.
* No new configurable timeout will be added to Nova; instead, it will use
the existing graceful_shutdown_timeout_.
* Its default value is 60 sec, which is too low for Nova services. The
proposal is to override its default value per Nova service:
* compute service: 180 sec (considering the long-running tasks).
* conductor service: 80 sec
* scheduler service: 80 sec
* External system timeout:
Depending on how Nova services are deployed, there might be an external
system timeout for graceful shutdown (for example, Nova running in k8s
pods). That can impact the Nova graceful shutdown, so we need to document
clearly that if there is an external system timeout, then the Nova service
timeout graceful_shutdown_timeout_ should be set accordingly. The external
system timeout should be higher than graceful_shutdown_timeout_;
otherwise, the external system will time out and interrupt the Nova
graceful shutdown.
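For example, overriding the oslo.service default on a compute node could
look like this in ``nova.conf`` (the value follows this spec's proposal and
may change):

.. code-block:: ini

    [DEFAULT]
    # oslo.service option: the SIGTERM handler gives up after this many
    # seconds even if in-progress operations have not finished. Any
    # external timeout (e.g. k8s terminationGracePeriodSeconds) should be
    # set higher than this value.
    graceful_shutdown_timeout = 180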
Alternatives
------------
One alternative for the RPC redesign is to handle two topics per RPC server.
This needs a good amount of changes in the oslo.messaging framework as well
as in the driver implementations. The idea is to allow the oslo.messaging
Target to take more than one topic (take the topic as a list) and ask the
driver to create separate consumers, listeners, dispatchers, and queues for
each topic, and to create each topic's binding to the exchange. This also
requires oslo.messaging to provide a new way to let the RPC server
unsubscribe from a particular topic and continue listening on other topics.
We would also need to redesign how RPC server stop() and wait() work today.
This is too complicated and amounts to almost redesigning the oslo.messaging
RPC concepts.
One more alternative is to track and stop sending requests from the Nova API
or the scheduler service, but that would not be able to stop all the new
requests (compute-to-compute tasks) or let in-progress things complete.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
This should have a positive impact on end users, as the shutdown will not
stop their in-progress operations.
Performance Impact
------------------
No impact on normal operations, but the service shutdown will take more time.
There is a configurable timeout to control the service shutdown wait time.
Other deployer impact
---------------------
None other than a longer shutdown process, but deployers can configure an
appropriate timeout for the service shutdown.
Developer impact
----------------
None
Upgrade impact
--------------
Adding a new RPC server will impact upgrades. An old compute will not have
the new 'ops RPC server' listening on the topic RPC_TOPIC_OPS, so we need to
handle it with RPC versioning. If the RPC client detects an old compute
(based on version_cap), then it will fall back to sending the message to the
original RPC server (listening on RPC_TOPIC).
Implementation
==============
Assignee(s)
-----------
Primary assignee:
gmaan
Other contributors:
None
Feature Liaison
---------------
gmaan
Work Items
----------
* Implement the 'ops RPC server' on the compute service
* Use the 'ops RPC server' for the operations we need to finish during
shutdown, for example, compute-to-compute tasks and server external events.
* RPC versioning due to upgrade impact.
* Implement a task tracker for services to track and report the in-progress
tasks during shutdown.
Dependencies
============
* No dependency as of now, but we will see during implementation if any change
is needed in oslo.messaging.
Testing
=======
* We cannot write Tempest tests for this because Tempest will not be able to
stop the services.
* We can try some testing (with a heavy live migration, which takes time) in
the 'post-run' phase, like it is done for the evacuate tests.
* Unit and functional tests will be added.
Documentation Impact
====================
How graceful shutdown works will be documented along with other
considerations, for example, the timeout or wait time used for the graceful
shutdown.
References
==========
* PoC:
* Code change: https://review.opendev.org/c/openstack/nova/+/967261
* PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
* PTG discussions:
* https://etherpad.opendev.org/p/nova-2026.1-ptg#L860
* https://etherpad.opendev.org/p/nova-2025.1-ptg#L413
* https://etherpad.opendev.org/p/r.3d37f484b24bb0415983f345582508f7#L180
.. _`websockify.websocketproxy`: https://github.com/novnc/websockify/blob/e9bd68cbb81ab9b0c4ee5fa7a62faba824a142d1/websockify/websocketproxy.py#L300
.. _`websockify`: https://github.com/novnc/websockify
.. _`start_service`: https://github.com/novnc/websockify/blob/e9bd68cbb81ab9b0c4ee5fa7a62faba824a142d1/websockify/websockifyserver.py#L861
.. _`new_websocket_client`: https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1e3f/nova/console/websocketproxy.py#L164C9-L164C29
.. _`close_connection`: https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1e3f/nova/console/websocketproxy.py#L150
.. _`graceful_shutdown_timeout`: https://github.com/openstack/oslo.service/blob/8969233a0a45dad06c445fdf4a66920bd5f3eef0/oslo_service/_options.py#L60
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - 2026.1 Gazpacho
- Introduced