a22d1fc411639da4a896b5760df34de3ea78fc80

508 Commits

Author SHA1 Message Date
Zuul
80b0a9a132 Merge "Software RAID: Re-add missing devices" 2020-10-12 12:24:24 +00:00
Dmitry Tantsur
420ebc0d73 Do not silently swallow errors in the write_image deploy step
Calling join() does not raise, we need to explicitly check the result.
Change-Id: I81d3d727af220c2b50358edab8139f07874611f0
Story: #2008240
Task: #41083 
2020-10-09 11:24:12 +02:00
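The pitfall named in the commit above can be illustrated with a minimal sketch. It relies only on the standard library: threading.Thread.join() returns None even when the thread's work failed, so the caller has to inspect a stored result. The Worker class, its error attribute and wait_for() are hypothetical names for illustration, not the agent's actual code.

```python
import threading


class Worker(threading.Thread):
    """Hypothetical worker that records its failure instead of losing it."""

    def __init__(self, fail):
        super().__init__()
        self.fail = fail
        self.error = None

    def run(self):
        try:
            if self.fail:
                raise RuntimeError("image write failed")
        except Exception as exc:
            # Store the exception so the caller can inspect it after join().
            self.error = exc


def wait_for(workers):
    """Join all workers, then raise if any of them recorded an error."""
    for worker in workers:
        worker.join()  # returns None even when run() failed
    errors = [w.error for w in workers if w.error is not None]
    if errors:
        raise RuntimeError("; ".join(str(e) for e in errors))
```

Without the explicit check after join(), a failed worker would look exactly like a successful one to the caller.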
Dmitry Tantsur
fc4e0eed6a Don't try to call GRUB when root UUID is not provided
We don't have a really working way to detect root UUID for whole
disk images at the moment, which results in an ignored traceback
every time install_bootloader is called with whole disk images in
UEFI mode. Avoid it by skipping GRUB2 if root UUID is unknown.
Change-Id: I84245538f59c664b72d1cafbca8d61be0978f489
2020-10-07 12:06:42 +02:00
Zuul
abd9f91813 Merge "Add basic retries for inspection" 2020-10-06 17:07:20 +00:00
Arne Wiebalck
253b4887d5 Software RAID: Re-add missing devices
Upon md device creation, component devices are sometimes removed
immediately again due to a "disk failure". The disks seem healthy,
though. This patch re-adds component devices in such cases to
prevent the md device from remaining in a degraded state (which
would cause issues later, e.g. during ESP creation).
Story: #2008164
Task: #40914
Change-Id: I2ac7cb4a546de84686d5c3435e850c14b3f6c1d7
2020-10-06 14:00:57 +02:00
fb45e58d1c Update master for stable/victoria
Add file to the reno documentation build to show release notes for
stable/victoria.
Use pbr instruction to increment the minor version number
automatically so that master versions are higher than the versions on
stable/victoria.
Change-Id: Ia3696da8663c140504924b0a1cd23f9aaa517f0a
Sem-Ver: feature
2020-10-01 18:42:40 +00:00
Zuul
99dee5067e Merge "Software RAID: Get component devices by md UUID" 2020-09-30 18:30:56 +00:00
Zuul
faeb9441d3 Merge "Simplify heartbeating by removing use of select()" 2020-09-29 15:47:08 +00:00
Arne Wiebalck
044c64dbc0 Software RAID: Get component devices by md UUID
Scanning the output of mdadm commands for RAID members will
miss component devices which are currently not part of the
RAID. For proper cleaning it is better to scan block devices
for a signature of the md device for which we would like to
get the components.
Story: #2008186
Task: #40947
Change-Id: Ib46612697851e36a16d272ccaeb0115106253863
2020-09-29 17:08:40 +02:00
Arne Wiebalck
c7aec775ff Software RAID: Don't delete partitions too early
Partitions on the holder disk should only be deleted after
all RAID devices have been deleted. Otherwise, super blocks
on partitions which reside on the same disks cannot be cleaned.
Story: #2008199
Task: #40979
Change-Id: I19293f5b992cd1fa68957d6f306dcec8f3b7a820
2020-09-28 10:35:12 +02:00
Zuul
c7ff931fe6 Merge "Fix: make Intel CNA hardware manager none generic" 2020-09-23 14:57:40 +00:00
Zuul
11a87365fb Merge "Generate a TLS certificate and send it to ironic" 2020-09-23 12:14:38 +00:00
Qianbiao.NG
4b0ef13d08 Fix: make Intel CNA hardware manager none generic
Currently, IntelCnaHardwareManager inherits GenericHardwareManager
which makes it a new "GenericHardwareManager" with "MAINLINE" priority.
This causes all other hardware managers with lower priority than
"MAINLINE" to never be used. To fix this, make IntelCnaHardwareManager
inherit from the basic HardwareManager.
Change-Id: I28b665d8841b0b2e83b132e1f25df95e03e7ba10
Story: 2008142
Task: 40882
2020-09-23 18:24:26 +08:00
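The effect described in the commit above can be reproduced with a simplified model of priority-based dispatch. This is a sketch under stated assumptions, not IPA's real dispatch code: the class names, the numeric priorities and the dispatch() helper are all hypothetical, and the first manager (highest priority first) that implements a method is assumed to win.

```python
GENERIC, CUSTOM, MAINLINE = 10, 15, 20  # illustrative priority levels


class BaseManager:
    """Declares a priority and implements nothing else."""
    priority = 0


class GenericManager(BaseManager):
    priority = GENERIC

    def list_disks(self):
        return "generic disks"

    def list_nics(self):
        return "generic nics"


class CustomDisks(BaseManager):
    """A site-specific manager ranked between GENERIC and MAINLINE."""
    priority = CUSTOM

    def list_disks(self):
        return "custom disks"


class VendorNicBroad(GenericManager):
    """Inherits every generic method while claiming MAINLINE priority."""
    priority = MAINLINE

    def list_nics(self):
        return "vendor nics"


class VendorNicNarrow(BaseManager):
    """Implements only the method it specializes."""
    priority = MAINLINE

    def list_nics(self):
        return "vendor nics"


def dispatch(method, managers):
    """Call `method` on the highest-priority manager that implements it."""
    for manager in sorted(managers, key=lambda m: m.priority, reverse=True):
        handler = getattr(manager, method, None)
        if handler is not None:
            return handler()
    raise LookupError(method)
```

With VendorNicBroad loaded, list_disks is answered by its inherited generic implementation and CustomDisks is never reached; with VendorNicNarrow, the same call falls through to CustomDisks.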
Jay Faulkner
a01646f56b Simplify heartbeating by removing use of select()
Heartbeating in IPA has used select.poll() for years to work around
a bug where changing the time in the ramdisk could cause heartbeats
to stop and never resume.
Now that IPA syncs time at start and exit, this workaround is no
longer needed. So instead, we'll revert to using threading.Event()
in order to make the code simpler and easier to understand.
Since we need this to be an eventlet-event, and not a standard-thread
event, also monkey_patch threading.
Additionally, a few backoff interval values were set but never
applied. To preserve the 5+ year old behavior of not doing error
backoffs, that code was removed instead of being made to work.
Change-Id: Ibcde99de64bb7e95d5df63a42a4ca4999f0c4c9b
2020-09-22 16:59:47 +00:00
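A heartbeat loop built on threading.Event, as the commit above describes, might look like the following minimal sketch. It uses the plain standard library Event (the commit notes that IPA needs the eventlet-patched variant); the Heartbeater class, its send callback and the interval value are hypothetical.

```python
import threading


class Heartbeater(threading.Thread):
    """Hypothetical heartbeat loop that sleeps on an Event."""

    def __init__(self, send, interval):
        super().__init__(daemon=True)
        self.send = send
        self.interval = interval
        self.stop_event = threading.Event()

    def run(self):
        # wait() returns False when the timeout expires (time to heartbeat)
        # and True as soon as stop() sets the event.
        while not self.stop_event.wait(self.interval):
            self.send()

    def stop(self):
        self.stop_event.set()
        self.join()
```

The single wait() call serves as both the sleep between heartbeats and the shutdown signal, which is what makes this simpler than polling a pipe.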
Julia Kreger
bb27badf76 Add basic retries for inspection
A transitory connection failure, such as one caused by
a port being held down for traffic forwarding, can cause
intermittent connectivity loss which results in failed
introspections.
Now the agent retries.
Change-Id: I72c5e3aca000d3854a17f8a461b1a2935e5c0d9b
2020-09-14 22:38:18 +00:00
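A basic retry of the kind mentioned above could be sketched as follows. The helper name, the default attempt count and the delay are assumptions chosen for illustration; the commit message does not state the values the agent uses.

```python
import time


def call_with_retries(func, attempts=3, delay=1.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)  # give the network time to recover
    raise last_error
```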
Dmitry Tantsur
021e0a6a46 Generate a TLS certificate and send it to ironic
Adds a new flag (on by default) that enables generating a TLS
certificate and sending it to ironic via heartbeat. Whether
ironic supports auto-generated certificates is determined by
checking its API version.
Change-Id: I01f83dd04cfec2adc9e2a6b9c531391773ed36e5
Depends-On: https://review.opendev.org/747136
Depends-On: https://review.opendev.org/749975
Story: #2007214
Task: #40604 
2020-09-11 17:46:52 +02:00
Julia Kreger
3426963552 Fix backup node lookup
The node lookup code added in change
I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6
was slightly broken in that we call a method
with a keyword argument which doesn't exist:
uuid versus node_uuid.
It happens, it is a quick fix!
Spotted on a metalsmith job:
[-] Agent is requesting to perform an explicit node cache update.
 This is to pickup any chanages in the cache before deployment.
[-] Failed to update node cache. Error lookup_node() got an
 unexpected keyword argument 'uuid'
Change-Id: I59ecec65707a2f03918b233f1925395ebe59b8c4
2020-09-09 15:19:38 -07:00
Zuul
e73b7220c4 Merge "If listen_tls is true, enable TLS on wsgi server" 2020-09-03 18:59:48 +00:00
Zuul
09f6a4e3da Merge "Update the cache if we don't have a root device hint" 2020-09-03 09:41:58 +00:00
Jay Faulkner
1d11f0b7dd If listen_tls is true, enable TLS on wsgi server
This change enables operators to set [DEFAULT]listen_tls to
true to configure IPA to host its WSGI server over TLS using
existing SSL support in oslo.service.
In addition to configuring this in IPA, a deployer will need to
also set [ssl]cert_file, [ssl]key_file, and optionally
[ssl]ca_file in their ipa config, in addition to embedding those
files into the IPA ramdisk in order for this to be functional.
In order to make this change work, we also need to monkey patch
socket library early, or else oslo.service will end up passing an
unpatched socket to the eventlet wsgi server, which causes
deadlocks.
Change-Id: Ib7decae410915f3c27b045ee08538c94d455b030
2020-09-02 16:07:42 -07:00
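The option names listed above can be shown together in a sample configuration. The sketch below parses it with the standard library configparser purely as a stand-in (IPA itself reads its configuration through oslo.config); the file paths and the tls_settings() helper are hypothetical.

```python
import configparser

# Sample config using the options named in the commit; paths are made up.
SAMPLE = """
[DEFAULT]
listen_tls = true

[ssl]
cert_file = /etc/ironic-python-agent/agent.crt
key_file = /etc/ironic-python-agent/agent.key
"""


def tls_settings(text):
    """Return the TLS file settings, or None when listen_tls is off."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.getboolean("DEFAULT", "listen_tls", fallback=False):
        return None
    ssl = parser["ssl"] if parser.has_section("ssl") else {}
    missing = [name for name in ("cert_file", "key_file") if name not in ssl]
    if missing:
        raise ValueError("listen_tls is set but missing: " + ", ".join(missing))
    return {
        "cert_file": ssl["cert_file"],
        "key_file": ssl["key_file"],
        "ca_file": ssl.get("ca_file"),  # optional, None when absent
    }
```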
Jay Faulkner
7d0ad36ebd Make WSGI server respect listen_* directives
The listen_port and listen_host directives are intended to allow
deployers of IPA to change the port and host IPA listens on. These
configs have not been obeyed since the migration to the oslo.service
wsgi server.
Story: 2008016
Task: 40668
Change-Id: I76235a6e6ffdf80a0f5476f577b055223cdf1585
2020-08-31 14:37:38 +00:00
Julia Kreger
d3c3d4dabe Update the cache if we don't have a root device hint
Or at least try to.
Some deployments just don't use root device hints, and this is okay.
However, other deployments need root device hints, and with fast
track mode in ramdisks, we created a situation where the node cache
could be updated by a human or software between the time the agent
was started, and the deployment was requested.
As a result, the agent has been updated to check if we have a hint
and if we don't, update the cache from the node lookup endpoint.
This is not needed when the inband deploy steps are executed, as
the process of updating the steps does force the node cache to be
updated.
Change-Id: I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6
Story: 2008039
Task: 40701
2020-08-25 19:34:48 +00:00
Zuul
cfede0c5bc Merge "Clarify connection error on heartbeats" 2020-08-24 13:29:27 +00:00
Julia Kreger
f670f704f3 Clarify connection error on heartbeats
Heartbeat connection errors are often a sign of transitory
network failures which may resolve themselves. But an operator
looking at the screen doesn't necessarily know that.
They may not realize that a network failure or a
misconfiguration could have caused the connectivity failure,
and tend to default to "well, it failed" without further
clarification.
As such, this patch adds explicit catching of the requests
ConnectionError exception and raises a new internal error
with a more verbose error message in that event to provide
operators with additional clarity.
Change-Id: I4cb2c0d1f577df1c4451308bd86efa8f94390b0c
Story: 2008046
Task: 40709
2020-08-20 13:45:47 -07:00
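The pattern described above, catching a low-level connection error and re-raising it with an operator-friendly message, can be sketched as follows. To stay self-contained it catches Python's built-in ConnectionError as a stand-in for the requests exception; HeartbeatConnectionError, its message text and the heartbeat() wrapper are hypothetical.

```python
class HeartbeatConnectionError(Exception):
    """Hypothetical error type carrying an operator-friendly message."""

    def __init__(self, detail):
        super().__init__(
            "Transitory connection failure while heartbeating. This may be "
            "a temporary network failure or a misconfiguration. "
            "Underlying error: %s" % detail)


def heartbeat(post):
    """Run the heartbeat request, translating connection failures."""
    try:
        return post()
    except ConnectionError as exc:
        # Keep the original exception chained for debugging.
        raise HeartbeatConnectionError(str(exc)) from exc
```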
Dmitry Tantsur
d50ff06b6b Enable the logs collection by default
It's incredibly helpful when debugging, and most consumers seem
to enable and rely on it.
Change-Id: I33bf58b3eb16b63b70f2a23e8a04449dc88fd94c
2020-08-19 17:25:24 +02:00
Zuul
3e938b6fcc Merge "Support changing the protocol part of callback_url to https" 2020-08-10 14:59:51 +00:00
Zuul
9f88a0cb59 Merge "Fix TypeError on agent lookup failure" 2020-08-07 16:32:30 +00:00
Dmitry Tantsur
353d09c3b0 Support changing the protocol part of callback_url to https
Adds a new kernel parameter for manual configuration and also creates
foundation for automatic TLS support later.
Change-Id: If341c3a8a268fc8cab6bd6be04b12ca32b31c8d8
Story: #2007214
Task: #40619 
2020-08-06 15:14:31 +02:00
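Changing only the protocol part of a URL, as described above, can be done with the standard library urllib.parse. This is a minimal sketch: with_scheme() is a hypothetical helper and the address is an example value, and the name of the kernel parameter is not shown because the commit message does not give it.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(url, scheme):
    """Return `url` with its protocol replaced, leaving the rest intact."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(scheme=scheme))


# Example address only; host, port and path are kept as they were.
print(with_scheme("http://192.0.2.10:9999/status", "https"))
```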
Julia Kreger
5eab9bced6 Fix TypeError on agent lookup failure
Agent lookups can fail as we presently use logging.exception,
better known in our code as LOG.exception, which can also generate
other fun issues on journald based systems where additional errors
could be raised resulting in us being unable to troubleshoot the
actual issue.
Because of the misuse of LOG.exception and the default behavior
of the backoff retry handler, the retry logic was also not
functional, as any error, no matter how small, caused IPA to
just exit.
Change-Id: Ic4608b7c6ff9773d1403926efb3d59869c71343b
Story: 2007968
Task: 40465
2020-08-04 20:43:02 -07:00
Kaifeng Wang
b424fbfa35 Extends pci devices metrics
Collects PCI class, revision, and bus information for the pci-devices
collector. These metrics, together with vendor ID and device ID, are
components which can be used to construct device information like
lspci output, which is how the cyborg agent collects accelerator devices.
Accelerator device based scheduling is possible after ironic has such
information in place.
Change-Id: I6c37c554f37dd5f1d21c8fd4fad2a4f44a3c75d7
Story: 2007971
Task: 40474
2020-08-04 23:32:37 +08:00
Zuul
ad9c54f55c Merge "Return the final RAID configuration from apply_configuration" 2020-07-29 14:00:08 +00:00
Dmitry Tantsur
f03d72019a Return the final RAID configuration from apply_configuration
AgentRAID expects it and fails with TypeError if it's not provided.
Change-Id: Id84ac129bba97540338e25f0027aa0a0f51bde52
Story: #2006963 
2020-07-29 10:10:18 +02:00
Dmitry Tantsur
eb87651496 Allow erase_devices_metadata to be used as a deploy step
Change-Id: I75f156dd76b0e3aaa1592ba24fe42fb2a7057cc8
Story: #2006963 
2020-07-27 17:57:37 +02:00
Zuul
9ca640a1c5 Merge "Prevent un-needed iscsi cleanup" 2020-07-25 13:54:51 +00:00
Zuul
f6bf94fe64 Merge "Fix versions in release notes" 2020-07-23 00:09:02 +00:00
Zuul
daf61f33b0 Merge "Fix bootloader install issue with MDRAID" 2020-07-22 22:13:34 +00:00
Zuul
bfb395837d Merge "Adds poll mode deployment support" 2020-07-22 19:53:31 +00:00
Doug Szumski
5e95b1321d Fix bootloader install issue with MDRAID
When no root_device hint is set, an MDRAID partition can be incorrectly
selected as the root device which causes installation of the bootloader
to the physical disks behind the MDRAID volume to fail. See the notes
in the referenced Story for more detail.
This change adds a little more specificity to the listing of block
devices.
Change-Id: I66db457e71a0586723ee753bef961aec5bf58827
Story: 2007905
Task: 40303
2020-07-22 11:16:13 -07:00
Riccardo Pittau
ab585153c9 Fix versions in release notes
Change-Id: I2ba658d83a15554e135429d464c0a033063d4631
2020-07-22 15:41:38 +02:00
Julia Kreger
2a56ee03b6 Prevent un-needed iscsi cleanup
When we added software raid support, we started calling bootloader
installation. As time went on, we enhanced that code path for
non-RAID cases in order to ensure that UEFI nvram was set up
for the instance to boot properly.
Somewhere in this process, we missed a possible failure case where
the iscsi client tgtadm may return failures. Obviously, the correct
path is to not call iscsi teardown if we don't need to.
Since it was always a semi-opportunistic teardown, we can't blindly
catch any error, and if we started iSCSI and failed to tear the
connection down, we might still want to fail. So this change
moves the logic over to use a flag on the agent object, which
one extension sets and the other reads in order to take
action based upon it.
Change-Id: Id3b1ae5e59282f4109f6246d5614d44c93aefa7c
Story: 2007937
Task: 40395
2020-07-20 14:24:06 -07:00
Dmitry Tantsur
1f3b70c4e9 Ignore devices with size 0 when collecting inventory
delete_configuration still fetches all devices as it needs to clean
ones with broken RAID.
Story: #2007907
Task: #40307
Change-Id: I4b0be2b0755108490f9cd3c4f3b71a5e036761a1
2020-07-09 18:28:20 +02:00
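The two behaviours in the commit above (inventory skips empty devices, RAID cleanup still sees all of them) can be sketched with a single filter. The dictionary layout, the include_empty parameter and the device names are hypothetical and only illustrate the idea.

```python
def usable_devices(devices, include_empty=False):
    """Filter out zero-sized devices unless the caller asks for all."""
    if include_empty:
        return list(devices)
    return [dev for dev in devices if dev["size"] > 0]


# Made-up sample data: one healthy disk and one md device with size 0.
DEVICES = [
    {"name": "/dev/sda", "size": 1000204886016},
    {"name": "/dev/md0", "size": 0},
]
```

An inventory-style caller would use the default, while a cleanup-style caller would pass include_empty=True so that broken RAID devices are still visible.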
Zuul
2e9620a2c0 Merge "Limit Inspection->Lookup->Heartbeat lag" 2020-07-06 18:08:14 +00:00
Zuul
6218725610 Merge "Fix serializing ironic-lib exceptions" 2020-07-06 16:47:58 +00:00
Julia Kreger
c76b8b2c21 Limit Inspection->Lookup->Heartbeat lag
Caches hardware information collected during inspection
so that the initial lookup can occur without any delay.
Also adds logging to track how long inventory collection takes.
Co-Authored-By: Dmitry Tantsur <dtantsur@protonmail.com>
Change-Id: I3e0d237d37219e783d81913fa6cc490492b3f96a
2020-07-03 10:32:26 +02:00
Dmitry Tantsur
ba3caa6c64 Increase the ESP partition size to 550 MiB when using software RAID
This has been a popular guidance, and diskimage-builder has recently
started following it.
Change-Id: I794c846fb191c15b0a30546bf64d624dfbde0fd4
2020-07-02 17:30:33 +02:00
Dmitry Tantsur
a4855c544c Fix serializing ironic-lib exceptions
Change-Id: If1408e4b81d263c56b4bbab618dd0737db5f762e
Story: #2007889
Task: #40268 
2020-07-02 12:18:53 +02:00
Julia Kreger
c77a7df851 Extend retries to 9, 10 seconds apart.
The download retry interval was previously five seconds which is
not long enough to recover after a hard network connectivity break
where we may be reliant upon network port forwarding hold-down
timers or even routing protocol route propagation to recover
communication.
Previously the time value was 5 seconds, with 3 attempts, meaning
15 seconds total ignoring the error detection timeouts.
Now it is 10 seconds, with 10 attempts, meaning 100 seconds before
the error detection timeouts.
Change-Id: I6d11edc9a3156f2bdc21c3d432ecc7625d652699
2020-06-23 20:27:49 +00:00
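The totals quoted in the commit above follow from multiplying the interval by the number of attempts, which is how the message counts them. A tiny sketch of that arithmetic (worst_case_wait is a hypothetical helper, and error detection timeouts are ignored, as in the message):

```python
def worst_case_wait(interval_seconds, attempts):
    """Total time spent waiting between attempts, excluding timeouts."""
    return interval_seconds * attempts


print(worst_case_wait(5, 3))    # previous settings
print(worst_case_wait(10, 10))  # new settings
```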
Julia Kreger
159ab9f0ce Add full download retries
Instead of just trying to get the connection and handler
for the download, let's retry the whole action of
downloading.
Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a
2020-06-23 20:27:41 +00:00
Julia Kreger
c5b97eb781 Add timeout operations to try and prevent hang on read()
Socket read operations can be blocking and may not timeout as
expected when thinking of timeouts at the beginning of a
socket request. This can occur when streaming file contents
down to the agent and there is a hard connectivity break.
In other words, we could be in a situation like:
- read(fd, len) - Gets data
- Select returns context to the program, we do things with data.
** hard connectivity break for next 90 seconds**
- read(fd, len) - We drain the in-memory buffer side of the socket.
- Select returns context, we do things with our remaining data
** Server retransmits **
** Server times out due to no ack **
** Server closes socket and issues a FIN,RST packet to the client **
** Connectivity restored, Client never got FIN,RST **
** Client socket still waiting for more data **
- read(fd, len) - No data returned
- Select returns, yet we have no data to act on as the buffer is
 empty OR the buffered data doesn't meet our required read len value.
 tl;dr noop
- read(fd, len) <-- We continue to try and read until the socket is
 recognized as dead, which could be a long time.
NOTE: The above read()s are Python's read() on the contents being
 streamed. Lower level reads exist, but brains will hurt
 if we try to cover the dynamics at that level.
As such, we need to keep track of when we last received a
packet, and use that to decide whether we have timed out.
Requests periodically yields back even when no data
has been received, in order to allow the caller to wall
clock the progress/status and take appropriate action.
When we exceed the timeout value on our wall clock,
we fail the download.
Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240
2020-06-23 13:25:09 -07:00
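The wall-clock check described above can be sketched over any iterator that yields data chunks and, periodically, empty chunks when nothing has arrived. This is an illustration under stated assumptions rather than the agent's code: read_with_stall_timeout() and its parameters are hypothetical, and the chunk source is assumed to yield empty bytes while idle.

```python
import time


def read_with_stall_timeout(chunks, timeout, clock=time.monotonic):
    """Collect chunks, failing if no data arrives for `timeout` seconds."""
    last_data = clock()
    received = []
    for chunk in chunks:
        if chunk:
            # Data arrived, so restart the stall timer.
            last_data = clock()
            received.append(chunk)
        elif clock() - last_data > timeout:
            # The source yielded back without data for too long.
            raise TimeoutError(
                "no data received for more than %s seconds" % timeout)
    return b"".join(received)
```

The clock parameter exists only so that the timing can be replaced in tests; time.monotonic is used by default because it is unaffected by system clock changes.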
Kaifeng Wang
61c95554ff Adds poll mode deployment support
Adds a new poll extension to provide get_hardware_info and get_node_info
interfaces.
get_hardware_info will be used for node validation by ironic deploy
drivers.
get_node_info will be used for sending lookup data to IPA.
Standalone mode has been assumed to be debug-only, which no longer
holds once poll mode is introduced. This change slightly updates the
description and also prevents the mDNS lookup when standalone is true.
Story: 1526486
Task: 28724
Change-Id: I5ad772a18cc4584585c5a7b6fb127547cece1998
2020-06-21 16:44:00 +08:00