Merge "Add backlog spec for Graceful shutodwn of nova services"
specs/backlog/approved/nova-services-graceful-shutdown.rst
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==================================
Graceful Shutdown of Nova Services
==================================
https://blueprints.launchpad.net/nova/+spec/nova-services-graceful-shutdown
This is a backlog spec proposing the design of graceful shutdown.
Nova services do not shut down gracefully. When services are stopped, all
in-progress operations are stopped as well, which not only interrupts those
operations but can leave instances in an unwanted or unrecoverable state. The
idea is to let services stop processing new requests, but complete the
in-progress operations before the service is terminated.
Problem description
===================
Nova services do not have a way to shut down gracefully, meaning they do not
wait for in-progress operations to complete. When shutdown is initiated,
services stop the RPC server and wait for it so that all the existing request
messages (RPC call/cast) are consumed from the queue, but the service does
not complete the resulting operations.
Each Nova compute service has a single worker running and listening on a
single RPC server (topic: compute.<host>). The same RPC server is used for
new requests as well as for in-progress operations where other compute or
conductor services communicate. When shutdown is initiated, the RPC server is
stopped, which means it stops handling new requests; that is fine, but at the
same time it stops the communication needed for in-progress operations. For
example, if a live migration is in progress, the source and destination
computes communicate (in sync and async ways) multiple times with each other.
Once the RPC server on the compute service is stopped, it cannot communicate
with the other compute, and the live migration fails. This can leave the
system as well as the instance in an unwanted or unrecoverable state.
Use Cases
---------
As an operator, I want to be able to gracefully shut down (SIGTERM) the Nova
services so that doing so does not impact the users' in-progress operations
or leave resources in an unusable state.
As an operator, I want to be able to keep instances and other resources in a
usable state even if a service is gracefully terminated (SIGTERM).
As an operator, I want to get the actual benefits of the k8s pod graceful
shutdown when Nova services are running in k8s pods.
As a user, I want in-progress operations to be completed before the service
is gracefully terminated (SIGTERM).
Proposed change
===============
Scope: The proposed solution is to gracefully shut down the services on
receiving the SIGTERM signal.
The graceful shutdown is based on the following design principles:
* When service shutdown is initiated by SIGTERM:
* Do not process any new requests.
* New requests should not be lost. Once the service is restarted, it should
process them.
* Allow in-progress operations to reach their quickest safe termination
point, either completion or abort.
* Proper logging of the state of in-progress operations.
* Keep instances and other resources in a usable state.
* When service shutdown is completed:
* Proper logging of unfinished operations.
Ideally, all the in-progress operations should be completed before the
service is terminated, but if the graceful shutdown times out (due to a
configured timeout; the timeout details are covered in a later section),
then all the unfinished operations should be properly logged. This will
help to recover the system or instances.
* When service is started again:
* Start processing the new requests in the normal way.
* If requests were not processed because shutdown was initiated, they stay
in the message broker queue, and there are multiple possibilities:
* Requests might have been picked up by another worker of that service.
For example, you can run more than one Nova scheduler (or conductor)
worker. If one of the workers is shutting down, another worker will
process the request. This is not the case for Nova compute, which is
always a single worker per compute service on a specific host.
* If a service has a single worker running, the request can be picked up
once the service is up again.
* There is an opportunity for the compute service to clean up or recover
the interrupted operations on instances during init_host(). The action
taken will depend on the task and its status.
* If the service stays in the stopped state for a long time, then, based
on the RPC and message queue timeouts, there is a chance that:
* The RPC client or server will time out the call.
* The message broker queue may drop messages due to timeout.
* The order of requests and messages can be stale.
To reach the graceful shutdown goal, we need to do two things:
#. A way to stop new requests without interrupting in-progress operations.
This is proposed to be done via RPC.
#. Give services enough time to finish their operations. As a first step,
this is proposed to be done via a time-based wait, and later with a proper
tracking mechanism.
This backlog spec proposes achieving the above goals in two steps. Each step
will be proposed as a separate spec for a specific release.
Nova services that already shut down gracefully:
------------------------------------------------
For the services below, graceful shutdown is handled by their deployment
servers or the libraries they use.
* Nova API & Nova metadata API:
Those services are deployed using a server with WSGI support. That server
will ensure that Nova API services shuts down gracefully, meaning it
finishes the in-progress requests and rejects the new requests.
I investigate with uWSGI/mod_proxy_uwsgi (devstack env). On service start,
uWSGI server pre-spawn the number of workers for API service which will
handle the API requests in distributed way. When shutdown is initiated
by SIGTERM, the uWSGI server SIGTERM handler check if there are any
in-progress request on any worker. It wait for all the workers to finish
the request and then terminates each worker. Once all worker are terminated
then it will terminate the Nova API service.
If any new request comes after the shutdown is initiated, it will be rejected
with "503 Service Unavailable" error.
Testing:
I tested two types of requests:
#. Sync request: ``openstack server list``:
* To observe the graceful shutdown, I added 10 seconds of sleep in the
server list API code.
* Start an API request 'request1': ``openstack server list``
* Wait till the server list request reaches the Nova API (you can see
the log on the controller).
* Because of sleep(10), the server list takes time to finish.
* Initiate the Nova API service shutdown.
* Start a new API request 'request2': ``openstack server list``. This new
request came after shutdown was initiated, so it should be denied.
* The Nova API service will wait because 'request1' is not finished.
* 'request1' will get the server list response before the service
is terminated.
* 'request2' is denied and receives the error
"503 Service Unavailable".
#. Async request: ``openstack server pause <server>``:
* To observe the graceful shutdown, I added 10 seconds of sleep in the
server pause API code.
* Start an API request 'request1': ``openstack server pause server1``
* Wait till the pause server request reaches the Nova API (you can see
the log on the controller).
* Because of sleep(10), the pause server takes time to finish.
* Initiate the Nova API service shutdown.
* The service will wait because 'request1' is not finished.
* Nova API will make an RPC cast to the Nova compute service and return.
* 'request1' is completed, and the response is returned to the user.
* The Nova API service is terminated now.
* The Nova compute service is processing the pause server request.
* Check whether the server is paused: ``openstack server list``
* You can see the server is paused.
* Nova console proxy services: nova-novncproxy, nova-serialproxy, and
nova-spicehtml5proxy:
All the console proxy services run as a websockify.websocketproxy_ service.
The websockify_ library handles the SIGTERM signal and the graceful
shutdown, which is enough for the Nova services.
When a user accesses the console, the websockify library starts a new
process in start_service_ and calls Nova's new_websocket_client_. Nova
authorizes the token and creates a socket on the host & port, which is
used to send the data/frames. After that, the user can access the console.
If a shutdown request is initiated, websockify handles the signal. First,
it terminates all the child processes and then raises the terminate
exception, which ends up calling the Nova close_connection_ method. The
Nova close_connection_ method calls shutdown() on the socket first and
then close(), which makes sure the remaining data/frames are sent before
closing the socket.
This way, user console sessions are terminated gracefully, and users get
a "Disconnected" message. Once the service is up again, the user can
refresh the browser, and the console will be back (if the token has not
expired).
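As a minimal illustration of that shutdown-then-close pattern (a simplified
sketch only, not Nova's actual close_connection_ code):

.. code-block:: python

    import socket

    def graceful_close(sock: socket.socket) -> None:
        """Flush pending frames, then release the socket."""
        try:
            # shutdown() sends FIN only after buffered data has been
            # transmitted, so the peer receives the remaining frames.
            sock.shutdown(socket.SHUT_RDWR)
        except OSError:
            # The peer may already be gone; closing is still safe.
            pass
        sock.close()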
Spec 1: Split the new and in-progress requests via RPC:
-------------------------------------------------------
RPC communication is an essential part of how services finish a particular
operation. During shutdown, we need to make sure we keep the required RPC
servers/buses up. If we stop the RPC communication, then it is no different
from service termination.
This spec talks a lot about the RPC server ``start``, ``stop``, and ``wait``
operations that Nova implements, so let's cover them briefly from the
oslo.messaging/RPC resources point of view to make this proposal easier to
understand. Most readers might already know this and can skip this part.
* RPC server:
* creation and start():
* It will create the required resources on the oslo.messaging side, for
example, dispatcher, consumer, listener, and queues.
* It will handle the binding to the required exchanges.
* stop():
* It will disable the listener's ability to pick up any new message
from the queue, but will dispatch the already-picked messages to
the dispatcher.
* It will delete the consumer.
* It will not delete the queues and exchange on the message broker side.
* It will not stop RPC clients from sending new messages to the queue;
however, they will not be picked up because the consumer and listener
are stopped.
* wait():
* It will wait for the thread pool to finish dispatching all the
already-picked messages. Basically, this makes sure the methods are
called on the manager.
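A minimal sketch of this lifecycle using the public oslo.messaging API
(``EchoEndpoint`` and the topic/server names are placeholders for
illustration):

.. code-block:: python

    import oslo_messaging as messaging
    from oslo_config import cfg

    class EchoEndpoint:
        """Placeholder endpoint; real endpoints expose the manager methods."""
        def echo(self, ctxt, message):
            return message

    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='compute', server='host1')
    server = messaging.get_rpc_server(transport, target, [EchoEndpoint()])

    server.start()  # create dispatcher, consumer, listener; bind the queues
    # ... service runs ...
    server.stop()   # stop picking up new messages; the consumer is deleted,
                    # but the queues stay on the broker
    server.wait()   # finish dispatching the already-picked messages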
An analysis per service and the required proposed RPC design changes:
* The services listed below communicate with other Nova services' RPC
servers. Since they do not have their own RPC server, no change is needed:
* Nova API
* Nova metadata API
* nova-novncproxy
* nova-serialproxy
* nova-spicehtml5proxy
* Nova scheduler: No RPC change needed.
* Request handling:
The Nova scheduler service runs as multiple workers, each having its own
RPC server, but all the Nova scheduler workers listen to the same RPC
topic and queue ``scheduler`` in a fanout way.
Currently, nova.service.py->stop() calls stop() and wait() on the RPC
server. Once the RPC server is stopped, it stops listening for any new
messages. But this does not impact anything on the other scheduler
workers; they continue listening to the same queue and processing
requests. If any of the scheduler workers is stopped, then the other
workers will process the request.
* Response handling:
Whenever there is an RPC call, oslo.messaging creates another reply queue
connected with the unique message id. This reply queue is used to send
the RPC call response to the caller. Even if the RPC server is stopped
on this worker, it will not impact the reply queue.
We still need to keep the worker up until all the responses are sent via
the reply queue, and for that, we need to implement in-progress task
tracking in the scheduler service, but that will be handled in step 2.
This way, stopping a Nova scheduler worker will not impact the RPC
communication on the scheduler service.
* Nova conductor: No RPC change needed.
The Nova conductor binary is a stateless service that can spawn multiple
worker threads. Each instance of the Nova conductor has its own RPC
server, but all the Nova conductor instances listen to the same RPC topic
and queue ``conductor``. This allows the conductor instances to act as a
distributed worker pool, such that stopping an individual conductor
instance will not impact the RPC communication for the pool of conductor
instances, allowing other available workers to process the request. Each
cell has its own pool of conductors, meaning that as long as one conductor
is up for any given cell, the RPC communication will continue to function
even when one or more conductors are stopped.
The request and response handling is done in the same way as mentioned for
the scheduler.
.. note::
This spec does not cover the conductor single-worker case. That might
require the RPC redesign for the conductor as well, but it needs more
investigation.
* Nova compute: RPC design change needed.
* Request handling:
Nova compute runs as a single worker per host, and each compute host has
its own RPC server, listener, and separate queues. It handles new
requests as well as the communication needed for in-progress operations
on the same RPC server. To achieve the graceful shutdown, we need to
separate the communication for new requests and in-progress operations.
This will be done by adding a new RPC server in the compute service.
For easy readability, we will be using a different term for each RPC
server:
* 'ops RPC server': This term will be used for the new RPC server, which
will be used to finish the in-progress requests and will stay up during
shutdown.
* 'new request RPC server': This term will be used for the current RPC
server, which is used for the new requests and will be stopped during
shutdown.
* 'new request RPC server' per compute:
No change in this RPC server, but it will be used for all the new
requests, so that we can stop it during shutdown and stop the new
requests on the compute.
* 'ops RPC server' per compute:
* Each compute will have a new 'ops RPC server', which will listen on a
new topic ``compute-ops.<host>``. The ``compute-ops`` name is used
because it is mainly for compute operations, but a better name can be
used if needed.
* It will use the same transport layer/bus and exchange that the
'new request RPC server' uses.
* It will create its own dispatcher, listener, and queue.
* Both RPC servers will be bound to the same endpoints (the same compute
manager), so that requests coming from either server are handled by
the same compute manager.
* This server will mainly be used for compute-to-compute operations and
server external events. The idea is to keep this RPC server up during
shutdown so that the in-progress operations can be finished.
* During shutdown, nova.service will wait for the compute to signal that
it has finished all its tasks, so that it can stop the 'ops RPC server'
and finish the shutdown.
* Response handling:
Irrespective of which RPC server a request comes from, whenever there is
an RPC call, oslo.messaging creates another reply queue connected with
the unique message id. This reply queue is used to send the RPC call
response to the caller. Even if the RPC server is stopped on this worker,
it will not impact the reply queue.
* Compute service workflow:
* The SIGTERM signal is handled by oslo.service, which will call stop on
nova.service.
* nova.service will stop the 'new request RPC server' so that no new
requests are picked up by the compute. The 'ops RPC server' stays up
and running.
* nova.service will wait for the manager to signal once all in-progress
operations are finished.
* Once the compute signals nova.service, it will stop the
'ops RPC server' and proceed with the service shutdown.
* Timeout:
* There is an existing graceful_shutdown_timeout_ config option present
in oslo.service which can be set per service.
* It is honoured to time out the service stop, and it will stop the
service irrespective of whether the compute has finished everything.
* RPC client:
* The RPC client stays as a singleton class, which is created with the
topic ``compute.<host>``, meaning that by default a message will be
sent via the 'new request RPC server'.
* If any RPC cast/call wants to send a message via the 'ops RPC server',
it needs to override the ``topic`` to ``compute-ops.<host>`` during the
client.prepare() call (see the sketch at the end of this section).
* Which RPC casts/calls will use the 'ops RPC server' will be decided
during implementation, so that we can make a better judgment on which
methods are used for the operations we want to finish during shutdown.
A draft list of where we can use the 'ops RPC server':
.. note::
This is a draft list and can be changed during implementation.
* Migrations:
- Live migration:
.. note::
We will be using the 'new request RPC server' for the
check_can_live_migrate_destination and
check_can_live_migrate_source methods, as this is the very initial
phase where the compute service has not started the live
migration. If shutdown is initiated before the live migration
request comes in, the migration should be rejected.
- pre_live_migration()
- live_migration()
- prep_snapshot_based_resize_at_dest()
- remove_volume_connection()
- post_live_migration_at_destination()
- rollback_live_migration_at_destination()
- drop_move_claim_at_destination()
- resize methods
- cold migration methods
* Server external events
* Rebuild instance
* validate_console_port()
This is for the case when the console has already been requested; if a
port validation request is in progress, the compute should finish it
before shutdown so that users can get their requested console.
* Time-based waiting for services to finish the in-progress operations:
.. note::
The time-based waiting is a temporary solution in spec 1. In spec 2,
it will be replaced by proper tracking of in-progress tasks.
* To make the graceful shutdown less complicated, spec 1 proposes a
configurable time-based wait for services to complete their operations.
* The wait time should be less than the global graceful shutdown timeout,
so that the external system or oslo.service does not shut down the
service before the service wait time is over.
* It will be configurable per service.
* Proposal for the default values:
* compute service: 150 sec, considering long-running operations on compute.
* conductor service: 60 sec should be enough.
* scheduler service: 60 sec should be enough.
* PoC:
This PoC demonstrates the spec 1 proposal in action.
* Code change: https://review.opendev.org/c/openstack/nova/+/967261
* PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
* Some specific examples of the shutdown issues which will be solved by this
proposal:
* Migrations:
* Migration operations will use the 'ops RPC server'.
* If a migration is in progress, then the service shutdown will not
terminate the migration; instead, it will be able to wait for the
migration to complete.
* Instance boot:
* Instance boot operations will continue to use the
'new request RPC server'. Otherwise, we would not be able to stop the
new requests.
* If instance boot requests are in progress on compute services, then
shutdown will wait for the compute to boot them successfully.
* If a new instance boot request arrives after the shutdown is initiated,
then it will stay in the queue, and the compute will handle it once it
is started again.
* Any operation which has reached the compute will be completed before the
service is shut down.
.. note::
As per my PoC and manual testing so far, this does not require any
change on the oslo.messaging side.
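To make the proposed design concrete, below is a hedged sketch of the two
compute RPC servers and the client-side topic override. The ``compute-ops``
topic is this spec's draft name, the stub manager and host are placeholders,
the real wiring lives in nova.rpc/nova.service, and a recent oslo.messaging
with ``get_rpc_client`` is assumed:

.. code-block:: python

    import oslo_messaging as messaging
    from oslo_config import cfg

    RPC_TOPIC = 'compute'          # 'new request RPC server' topic
    RPC_TOPIC_OPS = 'compute-ops'  # draft 'ops RPC server' topic
    HOST = 'host1'                 # placeholder compute host

    class ComputeManager:
        """Stub standing in for nova.compute.manager.ComputeManager."""
        def pre_live_migration(self, ctxt, instance):
            pass

    transport = messaging.get_rpc_transport(cfg.CONF)
    endpoints = [ComputeManager()]  # both servers bind the same endpoints

    new_request_server = messaging.get_rpc_server(
        transport, messaging.Target(topic=RPC_TOPIC, server=HOST), endpoints)
    ops_server = messaging.get_rpc_server(
        transport, messaging.Target(topic=RPC_TOPIC_OPS, server=HOST),
        endpoints)
    new_request_server.start()
    ops_server.start()

    # On SIGTERM, only the 'new request RPC server' is stopped first:
    new_request_server.stop()
    new_request_server.wait()
    # ... wait for in-progress operations to finish (time-based in spec 1) ...
    ops_server.stop()
    ops_server.wait()

    # Caller side: the singleton client defaults to RPC_TOPIC; overriding
    # the topic in prepare() routes the message via the 'ops RPC server',
    # e.g. cctxt.cast(ctxt, 'pre_live_migration', instance=instance).
    client = messaging.get_rpc_client(
        transport, messaging.Target(topic=RPC_TOPIC))
    cctxt = client.prepare(topic=RPC_TOPIC_OPS, server=HOST)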
Spec 2: Smartly track and wait for the in-progress operations:
--------------------------------------------------------------
* The graceful shutdown of the services below is handled by their deployment
server or library, so no work is needed for Spec 2:
* Nova API
* Nova metadata API
* nova-novncproxy
* nova-serialproxy
* nova-spicehtml5proxy
* The services below need to implement the tracking system:
* Nova compute
* Nova conductor
* Nova scheduler
This proposal is to base the service wait time on tracking the in-progress
tasks. Once a service finishes its tasks, it can signal nova.service to
proceed with shutting down the service. Basically, this replaces the wait
time approach mentioned above with a tracker-based approach.
* A task tracker will be introduced to track the in-progress tasks.
* It will be a singleton object.
* It maintains a list of 'method names' and ``request-id``. If a task is
related to an instance, then we can also add the instance UUID, which can
help to filter or know which operations on a specific instance are in
progress. The unique ``request-id`` will help to track multiple calls to
the same method.
* Whenever a new request comes to the compute, it will be added to the task
list and removed once the task is completed. Modifications to the tracker
will be done under a lock.
* Once shutdown is initiated:
* The task tracker will either add the new tasks to the tracker list or
reject them. The decision will be made case by case; for example, reject
the tasks if they are not critical to handle during shutdown.
* During shutdown, any new periodic tasks will be denied, but in-progress
periodic tasks will be finished.
* An exact list of tasks which will be rejected and accepted will be decided
during implementation.
* The task tracker will start logging the tasks which are in progress, and
log when they are completed. Basically, log the detailed view of
in-progress things during shutdown.
* nova.service will wait for the task tracker to finish the in-progress
tasks until timeout.
* An example of the flow of RPC server stop, wait, and task tracker wait
will be something like:
* We can signal the task tracker to start logging the in-progress tasks.
* RPCserver1.stop()
* RPCserver1.wait()
* manager.finish_tasks(): wait for the manager to finish the in-progress
tasks.
* RPCserver2.stop()
* RPCserver2.wait()
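A minimal, hypothetical sketch of such a tracker (the ``TaskTracker`` name
and its interface are illustrative only; the real design will be decided
during implementation):

.. code-block:: python

    import logging
    import threading
    import time

    LOG = logging.getLogger(__name__)

    class TaskTracker:
        """Track in-progress tasks; reject new ones once draining starts."""

        def __init__(self):
            self._lock = threading.Lock()
            # request-id -> (method name, instance uuid or None)
            self._tasks = {}
            self._draining = False

        def start_task(self, request_id, method, instance_uuid=None):
            with self._lock:
                if self._draining:
                    # Case-by-case in the real design; non-critical tasks
                    # are rejected once shutdown is initiated.
                    raise RuntimeError('shutting down; rejected: %s' % method)
                self._tasks[request_id] = (method, instance_uuid)

        def finish_task(self, request_id):
            with self._lock:
                task = self._tasks.pop(request_id, None)
                if task and self._draining:
                    LOG.info('Completed during shutdown: %s', task[0])

        def drain(self, timeout):
            """Stop accepting tasks; wait for in-progress ones to finish."""
            with self._lock:
                self._draining = True
                for method, uuid in self._tasks.values():
                    LOG.info('In progress: %s (instance: %s)', method, uuid)
            deadline = time.monotonic() + timeout
            while time.monotonic() < deadline:
                with self._lock:
                    if not self._tasks:
                        return True
                time.sleep(1)
            with self._lock:  # log whatever is still unfinished
                for method, uuid in self._tasks.values():
                    LOG.warning('Unfinished: %s (instance: %s)', method, uuid)
            return False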
Graceful Shutdown Timeouts:
---------------------------
* Nova service timeout:
* oslo.service already has a timeout (graceful_shutdown_timeout_) which is
configurable per service and used to time out the SIGTERM signal handler.
* oslo.service will terminate the Nova service based on
graceful_shutdown_timeout_, even if the Nova service's graceful shutdown
is not finished.
* No new configurable timeout will be added to Nova; instead, it will use
the existing graceful_shutdown_timeout_.
* Its default value is 60 sec, which is too low for Nova services. The
proposal is to override its default value per Nova service:
* compute service: 180 sec (considering the long-running tasks).
* conductor service: 80 sec
* scheduler service: 80 sec
* External system timeout:
Depending on how Nova services are deployed, there might be an external
system timeout for graceful shutdown (for example, Nova running in k8s
pods). That can impact the Nova graceful shutdown, so we need to document
clearly that if there is an external system timeout, then the Nova service
timeout graceful_shutdown_timeout_ should be set accordingly. The external
system timeout should be higher than graceful_shutdown_timeout_;
otherwise, the external system will time out and interrupt the Nova
graceful shutdown.
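For example, overriding the oslo.service default on a compute node could
look like this in ``nova.conf`` (the value follows this spec's proposal and
may change):

.. code-block:: ini

    [DEFAULT]
    # oslo.service option: the SIGTERM handler gives up after this many
    # seconds even if in-progress operations have not finished. Any
    # external timeout (e.g. k8s terminationGracePeriodSeconds) should be
    # set higher than this value.
    graceful_shutdown_timeout = 180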
Alternatives
------------
One alternative for the RPC redesign is to handle two topics per RPC server.
This needs a good amount of changes in the oslo.messaging framework as well
as in the driver implementations. The idea is to allow the oslo.messaging
Target to take more than one topic (take the topic as a list) and ask the
driver to create separate consumers, listeners, dispatchers, and queues for
each topic, and to create each topic's binding to the exchange. This also
requires oslo.messaging to provide a new way to let the RPC server
unsubscribe from a particular topic and continue listening on other topics.
We would also need to redesign how RPC server stop() and wait() work today.
This is too complicated and amounts to almost redesigning the oslo.messaging
RPC concepts.
One more alternative is to track and stop sending requests from the Nova API
or the scheduler service, but that would not be able to stop all the new
requests (compute-to-compute tasks) or let in-progress things complete.
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
This should have a positive impact on end users, as the shutdown will not
stop their in-progress operations.
Performance Impact
------------------
No impact on normal operations, but the service shutdown will take more time.
There is a configurable timeout to control the service shutdown wait time.
Other deployer impact
---------------------
None other than a longer shutdown process, but deployers can configure an
appropriate timeout for the service shutdown.
Developer impact
----------------
None
Upgrade impact
--------------
Adding a new RPC server will impact upgrades. An old compute will not have
the new 'ops RPC server' listening on the topic RPC_TOPIC_OPS, so we need to
handle it with RPC versioning. If the RPC client detects an old compute
(based on version_cap), then it will fall back to sending the message to the
original RPC server (listening on RPC_TOPIC).
Implementation
==============
Assignee(s)
-----------
Primary assignee:
gmaan
Other contributors:
None
Feature Liaison
---------------
gmaan
Work Items
----------
* Implement the 'ops RPC server' on the compute service
* Use the 'ops RPC server' for the operations we need to finish during
shutdown, for example, compute-to-compute tasks and server external events.
* RPC versioning due to upgrade impact.
* Implement a task tracker for services to track and report the in-progress
tasks during shutdown.
Dependencies
============
* No dependency as of now, but we will see during implementation if any change
is needed in oslo.messaging.
Testing
=======
* We cannot write Tempest tests for this because Tempest will not be able to
stop the services.
* We can try some testing (with a heavy live migration, which takes time) in
the 'post-run' phase, like it is done for the evacuate tests.
* Unit and functional tests will be added.
Documentation Impact
====================
How graceful shutdown works will be documented along with other
considerations, for example, the timeout or wait time used for the graceful
shutdown.
References
==========
* PoC:
* Code change: https://review.opendev.org/c/openstack/nova/+/967261
* PoC results: https://docs.google.com/document/d/1wd_VSw4fBYCXgyh5qwnjvjticNa8AnghzRmRH3H8pu4/
* PTG discussions:
* https://etherpad.opendev.org/p/nova-2026.1-ptg#L860
* https://etherpad.opendev.org/p/nova-2025.1-ptg#L413
* https://etherpad.opendev.org/p/r.3d37f484b24bb0415983f345582508f7#L180
.. _`websockify.websocketproxy`: https://github.com/novnc/websockify/blob/e9bd68cbb81ab9b0c4ee5fa7a62faba824a142d1/websockify/websocketproxy.py#L300
.. _`websockify`: https://github.com/novnc/websockify
.. _`start_service`: https://github.com/novnc/websockify/blob/e9bd68cbb81ab9b0c4ee5fa7a62faba824a142d1/websockify/websockifyserver.py#L861
.. _`new_websocket_client`: https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1e3f/nova/console/websocketproxy.py#L164C9-L164C29
.. _`close_connection`: https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1e3f/nova/console/websocketproxy.py#L150
.. _`graceful_shutdown_timeout`: https://github.com/openstack/oslo.service/blob/8969233a0a45dad06c445fdf4a66920bd5f3eef0/oslo_service/_options.py#L60
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - 2026.1 Gazpacho
- Introduced