73b76da5fe1b059378407b99ecadb9e30aed2535
1108 Commits

Author SHA1 Message Date
Zuul
73b76da5fe Merge "Add get_service_steps logic to the agent" 2023年09月15日 22:29:59 +00:00
Julia Kreger
eb95273ffb Add get_service_steps logic to the agent
Initial code patches for service steps have merged in
ironic, and it is now time to add support into the
agent which allows service steps to be raised to
the service.
Updates the default hardware manager version to 1.2,
which has *rarely* been incremented due to oversight.
Change-Id: Iabd2c6c551389ec3c24e94b71245b1250345f7a7
2023年08月31日 06:22:22 -07:00
Julia Kreger
b6c263a5dc preserve/handle config drives on 4k block devices
When an underlying block device (or driver) only supports 4KB IO,
this can cause some issues with aspects like using an ISO9660 filesystem
which can only support a maximum of 2KB IO.
The agent will now attempt to mount the filesystem *before* deleting the
supplied file, and should that fail it will mount the configuration drive
file from the ramdisk utilizing a loopback, and then extract the contents
of the ramdisk into a newly created VFAT filesystem which supports 4KB
block IO.
Closes-Bug: #2028002
Change-Id: I336acb8e8eb5a02dde2f5e24c258e23797d200ee
2023年08月24日 08:10:22 -07:00
Julia Kreger
5ed520df89 Handle the node being locked
If the node is locked, a lookup cannot be performed when an agent
token needs to be generated, which tends to error like this:
 ironic_python_agent.ironic_api_client [-] Failed looking up node
 with addresses '00:6f:bb:34:b3:4d,00:6f:bb:34:b3:4b' at
 https://172.22.0.2:6385. Error 409: Node
 c25e451b-d2fb-4168-b690-f15bc8365520 is locked by host 172.22.0.2,
 please retry after the current operation is completed..
 Check if inspection has completed.
The problem is that if we keep pounding on the door, we can actually worsen
the situation, and previously we would just let tenacity
retry.
We will now hold for 30 seconds before proceeding, so we have
hopefully allowed the operation to complete.
Also fixes the error logging to help humans' sanity.
Change-Id: I97d3e27e2adb731794a7746737d3788c6e7977a0
2023年08月22日 16:47:28 -07:00
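The hold-before-retry behaviour described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: `NodeLockedError`, `lookup_with_hold`, and the `max_attempts` parameter are hypothetical names, and only the 30-second hold comes from the commit message.

```python
import time


class NodeLockedError(Exception):
    """Hypothetical stand-in for an HTTP 409 'node is locked' response."""


def lookup_with_hold(lookup, hold_seconds=30, max_attempts=5, sleep=time.sleep):
    """Call lookup(); on a lock conflict, hold before trying again.

    The sleep callable is injectable so the logic can be exercised
    without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return lookup()
        except NodeLockedError:
            if attempt == max_attempts:
                raise
            # Wait instead of immediately hammering a locked node.
            sleep(hold_seconds)
```

Passing a fake `sleep` makes it easy to confirm that the hold happens once per conflict.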
Julia Kreger
b68a4c8a92 minor: fix release notes file path
Change-Id: I458d88bf14b55253179488cb771ae42e7b8c84d7
2023年08月07日 12:57:34 -07:00
Zuul
e493cad02c Merge "Log the number of bytes downloaded" 2023年07月27日 21:39:12 +00:00
Julia Kreger
c65ad42ff1 Log the number of bytes downloaded
When troubleshooting download issues, which may present
as checksum validation failures, it is difficult to understand
if the *entire* file was downloaded due to the way HTTP works.
In that, a download may start with a successful result code,
and the content is streamed out until the socket is closed.
But with HTTP there is no way to know if that socket closed
prematurely and the original server size is *also* an optional
field, so just log the size we got so we don't drive the
humans [more-]insane.
Also now logs the (optional) content-length field if
supplied by the server.
Change-Id: Id71b167f4e330d54b9afddf95f1a2ef9e40398bf
2023年07月19日 16:20:40 +00:00
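The idea in the commit above, counting what actually arrived and logging it next to the optional Content-Length, might look like the sketch below. The function name `count_streamed_bytes` and the log wording are assumptions made for illustration; the real agent code may be structured differently.

```python
import logging

LOG = logging.getLogger(__name__)


def count_streamed_bytes(chunks, content_length=None):
    """Consume an iterable of byte chunks and log how much was received."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
    if content_length is None:
        # The server is not required to send a Content-Length header.
        LOG.info("Downloaded %d bytes; server sent no Content-Length", total)
    else:
        LOG.info("Downloaded %d bytes; server reported %d bytes",
                 total, content_length)
    return total
```

A mismatch between the two numbers in the log is a hint that the socket closed early, which would otherwise only show up later as a checksum failure.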
Zuul
0fb7fec56e Merge "Allow md5 to be disabled from the conductor" 2023年07月12日 03:53:14 +00:00
Zuul
119981a818 Merge "Fix nvidia hardware manager url parser to permit https" 2023年06月26日 10:11:55 +00:00
Zuul
bb156aad6c Merge "Fix Bandit errors" 2023年06月26日 09:25:09 +00:00
Julia Kreger
b83678c968 Fix nvidia hardware manager url parser to permit https
Change-Id: I9a10e543d3256ceaa78c6fbdb01fc0d88c0ee6e6
2023年06月06日 15:35:16 +00:00
Julia Kreger
78c1343a54 Fix Bandit errors
Bandit 1.7.5 released with a timeout check for all requests and
urllib calls.
Fixed those.
In the process, then exposed a bandit b310 issue, which was already
covered by the code, but explicitly marked it as such.
Also, enables bandit checks to be voting for CI.
Change-Id: If0e87790191f5f3648366d571e1d85dd7393a548
2023年06月06日 08:34:55 -07:00
Julia Kreger
e6fd7e753e Allow md5 to be disabled from the conductor
Also fixes my use of set_override, as it is not on the actual
config object. You'd think I'd remember that, since I've done
that before...
Change-Id: I4b578c4319354001cbbd3b3856af96b30fd25555
2023年05月25日 07:59:07 -07:00
Zuul
141c5ff1c3 Merge "Add support for CentOS SUM files" 2023年05月09日 09:03:25 +00:00
Zuul
03e88b579e Merge "Revert disabling MD5 checksums" 2023年05月05日 08:44:37 +00:00
Zuul
44d9c2219f Merge "Add network interface speed to the inventory" 2023年05月04日 09:04:30 +00:00
Dmitry Tantsur
c1c5537ba2 Revert disabling MD5 checksums
This was a significant breaking change that was landed despite explicit
disagreement by some community members (myself included). It has already
resulted in an accidental Ironic CI breakage, has broken Bifrost and has
a potential of breaking Metal3. In case of Metal3, MD5 support is a part
of its public API.
While MD5 is a potential security hazard, I don't see the need to hurry
this change without giving the community time to prepare. This change
reverts the new option md5_enabled to True.
Change-Id: I32b291ea162e8eb22429712c15cb5b225a6daafd
2023年05月04日 09:26:10 +02:00
Harald Jensås
e7a048ecbe Add support for CentOS SUM files
The CentOS Stream SUM files use the format:
 # FILENAME: <size> bytes
 ALGORITHM (FILENAME) = CHECKSUM
Compared to the more common format:
 CHECKSUM *FILE_A
 CHECKSUM FILE_B
Use regular expressions to check for filename both
in the middle with parentheses and at the end.
Similarly, look for valid checksums at the beginning or
end of the line. Also look for known checksum patterns in
case the file only contains the checksum itself.
Change-Id: I9e49c1a6c66e51a7b884485f0bcaf7f1802bda33
2023年05月03日 21:31:23 +02:00
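The two checksum file layouts described in the commit above can be matched with regular expressions along these lines. This is a sketch under stated assumptions, not the patterns the agent actually uses: `find_checksum` and the three pattern names are hypothetical, and the bare-checksum length range (32 to 128 hex characters) is an assumption covering MD5 through SHA-512.

```python
import re

# "ALGORITHM (FILENAME) = CHECKSUM", the CentOS Stream layout.
PAREN_STYLE = re.compile(
    r"^(?P<algo>[A-Za-z0-9-]+) \((?P<name>.+)\) = (?P<sum>[0-9a-fA-F]+)$")
# "CHECKSUM *FILE" or "CHECKSUM FILE", the more common layout.
PREFIX_STYLE = re.compile(r"^(?P<sum>[0-9a-fA-F]+)\s+\*?(?P<name>.+)$")
# A file that contains nothing but the checksum itself.
BARE_CHECKSUM = re.compile(r"^[0-9a-fA-F]{32,128}$")


def find_checksum(text, filename):
    """Return the checksum recorded for filename, or None if not found."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and "# FILENAME: <size> bytes" comments
        for pattern in (PAREN_STYLE, PREFIX_STYLE):
            match = pattern.match(line)
            if match and match.group("name") == filename:
                return match.group("sum")
        if BARE_CHECKSUM.match(line):
            return line
    return None
```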
Dmitry Tantsur
9ed232e77e Add network interface speed to the inventory
This is another fact that Metal3's baremetal-operator is currently
consuming from extra-hardware.
Change-Id: I2ec9d5e9369f5508e7583a4e13c2083f5c8b28ba
2023年05月03日 12:20:35 +02:00
Julia Kreger
c05fdf790c Fix checksum validation logic
The checksum validation logic, which was updated early on in the
whole process of deprecating md5, didn't account for a URL *or* a
longer checksum (i.e. sha256/sha512) which was decided while the
overall approach was being decided.
Fixes the logic, and adds additional tests.
Change-Id: Ic4053776e131fc02ace295a1e69e9f9faab47f42
2023年05月02日 17:24:57 -07:00
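One way to handle the cases named in the commit above (a URL instead of a checksum, and longer sha256/sha512 values) is to infer the algorithm from the length of the hex string. The commit message does not say that this is how the agent does it, so treat the length table and the names `pick_algorithm` and `verify` as assumptions made for illustration.

```python
import hashlib

# Hex digest lengths for the algorithms mentioned in the commit messages.
ALGO_BY_HEX_LENGTH = {32: "md5", 64: "sha256", 128: "sha512"}


def pick_algorithm(checksum, md5_enabled=True):
    """Guess the hash algorithm from the checksum string."""
    if checksum.startswith(("http://", "https://")):
        # A checksum URL has to be fetched and parsed before validation.
        raise ValueError("checksum is a URL, not a digest")
    algo = ALGO_BY_HEX_LENGTH.get(len(checksum))
    if algo is None:
        raise ValueError("unrecognized checksum length")
    if algo == "md5" and not md5_enabled:
        raise ValueError("MD5 checksums are disabled")
    return algo


def verify(data, checksum, md5_enabled=True):
    """Return True if data hashes to checksum under the inferred algorithm."""
    algo = pick_algorithm(checksum, md5_enabled)
    return hashlib.new(algo, data).hexdigest() == checksum.lower()
```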
Zuul
f37ea85a27 Merge "Disable MD5 image checksums" 2023年05月02日 06:41:25 +00:00
Zuul
3cd8c294fb Merge "Deprecate LLDP in inventory in favour of a new collector" 2023年04月27日 12:05:11 +00:00
Dmitry Tantsur
3e05a03f7c Deprecate LLDP in inventory in favour of a new collector
Binary LLDP data is bloating inventory causing us to disable its collection
by default. For other similar low-level information, such as PCI devices
or DMI data, we already use inspection collectors instead. Now that the
inventory format is shared with out-of-band inspection, having LLDP
there makes even less sense.
This change adds a new collector ``lldp`` to replace the now-deprecated
inventory field.
Change-Id: I56be06a7d1db28407e1128c198c12bea0809d3a3
2023年04月26日 19:33:51 +00:00
Julia Kreger
32df26a22a Disable MD5 image checksums
MD5 image checksums have long been superseded by the use of a
``os_hash_algo`` and ``os_hash_value`` field as part of the
properties of an image.
In the process of doing this, we determined that checksum via
URL usage was non-trivial and determined that an appropriate
path was to allow the checksum type to be determined as needed.
Change-Id: I26ba8f8c37d663096f558e83028ff463d31bd4e6
2023年04月24日 16:54:42 -07:00
Julia Kreger
76accfb880 Fix UTF-16 result handling for efibootmgr
The tl;dr is that UEFI NVRAM is encoded
in UTF-16, and when we run the efibootmgr command,
we can get unicode characters back.
Except we previously were forcing everything to be
treated as UTF-8 due to the way oslo.concurrency's
processutils module works.
This could be observed with UTF character 0x00FF
which raises up a nice exception when we try to
decode it.
Anyhow! While fixing handling of this, we discovered
we could get basically the cruft out of the NVRAM,
by getting what was most likely a truncated string
out of our own test VMs. As such, we need to also
permit decoding to be tolerant of failures.
This could be binary data or as simple as flipped
bits which get interpreted as invalid characters.
As such, we have introduced such data into one of our
tests involving UEFI record de-duplication.
Closes-Bug: 2015602
Change-Id: I006535bf124379ed65443c7b283bc99ecc95568b
2023年04月17日 09:14:24 -07:00
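The tolerant decoding described in the commit above can be illustrated with Python's built-in codec error handling. This is a sketch only: the helper name `decode_nvram_text` is hypothetical, and the explicit little-endian codec and the `ignore` error mode are assumptions rather than a statement of what the agent calls.

```python
def decode_nvram_text(raw):
    """Decode raw command output as UTF-16, dropping undecodable bytes.

    errors="ignore" lets truncated strings or flipped bits pass through
    as missing characters instead of raising UnicodeDecodeError.
    """
    return raw.decode("utf-16-le", errors="ignore")
```

For example, a record that was cut off mid-character (an odd number of bytes) still decodes to the text that precedes the damage.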
Dmitry Tantsur
0304c73c0e Report system firmware information in the inventory
Change-Id: I5b6ceb9cdcf4baa97a6f0482d1030d14f3f2ecff
2023年03月31日 14:28:32 +02:00
Arne Wiebalck
b32f6c6d94 [Trivial] Fix typo in efi_utils
Change-Id: I692e045e6bc8683038a2e85a6a132687d2b30f18
2023年03月15日 14:25:42 +01:00
Zuul
088610844a Merge "update NVIDIA NIC firmware images and settings by ironic-python-agent" 2023年01月31日 19:35:53 +00:00
Dmitry Tantsur
c26f498f49 Make logs collection a hardware manager call
This allows hardware managers to collect additional logs.
Change-Id: If082b921d4bf71c4cc41a5a72db6995b08637374
2023年01月25日 15:17:06 +01:00
waleed mousa
2c7f95e3ac update NVIDIA NIC firmware images and settings by ironic-python-agent
Add "update_nvidia_nic_firmware_image" and "update_nvidia_nic_firmware_settings"
clean steps to MellanoxDeviceHardwareManager.
By adding those two steps, we can update the firmware image and
firmware settings of NVIDIA NICs by ironic-python-agent using
the manual cleaning command.
The clean steps require the mstflint package to be installed on the image.
The "update_nvidia_nic_firmware_image" clean step requires passing an
"images" parameter to the clean command.
The "images" parameter is a JSON blob containing
a list of images, where each image contains a map of:
 * url: to firmware image (file://, http://)
 * checksum: checksum of the provided image
 * checksumType: md5/sha512/sha256
 * componentFlavor: PSID of the nic
 * version: version of the FW
The "update_nvidia_nic_firmware_settings" clean step requires passing a
"settings" parameter to the clean command.
The "settings" parameter is a JSON blob containing
a list of settings, where each setting contains a map of:
 * deviceID: device ID
 * globalConfig: global config
 * function0Config: function 0 config
 * function1Config: function 1 config
Change-Id: Icfaffd7c58c3c73c3fa28cfc2a6c954d2c93c16e
Story: 2010228
Task: 46016
2023年01月11日 14:00:07 +00:00
Riccardo Pittau
604c7081db Fix create configuration unit tests
The unit tests for create_configuration give different results if
run on a BIOS- or UEFI-booted machine because they get the
partition table type value based on the utils function
get_node_boot_mode.
Let's mock the boot_mode as we do in other tests to get an
independent result.
Change-Id: Ic0e7daea7ec4ce0806cd126c27166f84690c5d9e
2022年12月15日 11:49:34 +01:00
Zuul
a1670753a2 Merge "Fix failure of bind mount in _install_grub2" 2022年10月17日 23:46:05 +00:00
Rozzii
830fdfa4c6 prioritize lsblk as a source of device serials
The current way of prioritizing ID/DM_SERIAL_SHORT or ID/DM_SERIAL works
in most cases but the udev values seem to be unreliable.
Based on experience it looks like lsblk might be a better
source of truth than udev with regard to serial number
information. This commit makes lsblk the default provider
of block device serial number information.
Story: 2010263
Task: 46161
Change-Id: I16039b46676f1a61b32ee7ca7e6d526e65829113
2022年10月10日 19:31:47 +03:00
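The priority change in the commit above amounts to a simple fallback chain. The sketch below is illustrative only: `pick_serial` is a hypothetical helper, and the exact udev property names and their fallback order are assumptions derived from the "ID/DM_SERIAL_SHORT or ID/DM_SERIAL" wording.

```python
# Assumed udev property names, tried in order when lsblk has no serial.
UDEV_SERIAL_KEYS = ("ID_SERIAL_SHORT", "DM_SERIAL_SHORT", "ID_SERIAL", "DM_SERIAL")


def pick_serial(lsblk_serial, udev_properties):
    """Prefer the serial reported by lsblk, then fall back to udev values."""
    if lsblk_serial:
        return lsblk_serial
    for key in UDEV_SERIAL_KEYS:
        value = udev_properties.get(key)
        if value:
            return value
    return None
```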
Vanou Ishii
0bf579c955 Fix failure of bind mount in _install_grub2
When IPA runs _install_grub2, IPA tries to bind mount /dev, /proc and /run
to <temporary directory path where the root partition is mounted>/{dev,proc,run}.
However, that bind mount fails because those mount point paths do not exist
under the temporary directory.
To fix this failure, this patch adds a mkdir command before the bind mount.
Story: 2010292
Task: 46273
Change-Id: I434ce1bf1863ee0f11c4d09918d6d2d8dc065c02
2022年09月22日 19:34:12 +09:00
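The fix in the commit above, creating each mount point before bind mounting onto it, can be sketched as follows. The helper `prepare_bind_mounts` is hypothetical and only builds the mount commands rather than running them, since executing `mount` needs root; the actual patch may invoke mkdir differently.

```python
import os


def prepare_bind_mounts(root, sources=("/dev", "/proc", "/run")):
    """Create mount points under root and return the bind mount commands."""
    commands = []
    for source in sources:
        target = os.path.join(root, source.lstrip("/"))
        # Without this directory the bind mount fails with "no such path".
        os.makedirs(target, exist_ok=True)
        commands.append(["mount", "-o", "bind", source, target])
    return commands
```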
Jakub Jelinek
a99bf274e4 SoftwareRAID: Enable skipping RAIDS
Extend the ability to skip disks to RAID devices
This allows users to specify the volume name of
a logical device in the skip list which is then not cleaned
or created again during the create/apply configuration phase
The volume name can be specified in target raid config provided
the change https://review.opendev.org/c/openstack/ironic-python-agent/+/853182/
passes
Story: 2010233
Change-Id: Ib9290a97519bc48e585e1bafb0b60cc14e621e0f
2022年09月05日 20:43:51 +00:00
Zuul
ed6a8d28b7 Merge "Create RAIDs with volume name" 2022年09月02日 19:26:57 +00:00
Jakub Jelinek
daa20b01d1 Create RAIDs with volume name
Use 'volume_name' field from 'target_raid_config' to create logical
disks if it is present
Do not allow two logical disks to have the same volume name
Change-Id: If3e4e9f8698ec3e0cb49717f8ed2087d2ba03f2c
2022年09月02日 14:51:42 +00:00
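The uniqueness rule from the commit above can be checked with a single pass over the logical disks. This is a minimal sketch: the function name `check_unique_volume_names` and the use of `ValueError` are assumptions, while `volume_name` and `target_raid_config` come from the commit message.

```python
def check_unique_volume_names(logical_disks):
    """Raise ValueError if two logical disks share a volume_name.

    Disks without a volume_name are allowed and are not compared.
    """
    seen = set()
    for disk in logical_disks:
        name = disk.get("volume_name")
        if name is None:
            continue
        if name in seen:
            raise ValueError("duplicate volume name: %s" % name)
        seen.add(name)
```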
Julia Kreger
f3e3de8097 Fix software raid output poisoning
In the event a device name is set to contain a raid device path,
it is possible for the Name and Events field values of mdadm's
detailed output to contain text which inadvertently gets captured and
mapped as component data for the "holder" devices of the RAID set.
This would cause invalid values to get passed to UEFI methods
which would cause a deployment to fail under these circumstances.
We now ignore the Name and Events fields in mdadm output.
Change-Id: If721dfe1caa5915326482969e55fbf4697538231
2022年08月24日 10:15:27 -07:00
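The problem and fix in the commit above can be illustrated with a small parser that collects device paths from mdadm's detailed output while skipping the Name and Events fields. This is a sketch under assumptions: `component_devices` is a hypothetical helper, and the sample layout in the usage below is simplified rather than copied from real mdadm output.

```python
import re

# Fields whose values may contain text that looks like a device path.
IGNORED_FIELDS = ("Name", "Events")
DEVICE_AT_END = re.compile(r"(/dev/[\w/-]+)$")


def component_devices(mdadm_detail):
    """Return device paths found at the end of lines, skipping ignored fields."""
    devices = []
    for line in mdadm_detail.splitlines():
        stripped = line.strip()
        field = stripped.split(":", 1)[0].strip()
        if field in IGNORED_FIELDS:
            continue  # e.g. "Name : host:/dev/md0" must not count as a member
        match = DEVICE_AT_END.search(stripped)
        if match:
            devices.append(match.group(1))
    return devices
```

Without the `IGNORED_FIELDS` check, a Name value ending in a device path would be reported as if it were a member of the RAID set.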
Zuul
f89d54f4b8 Merge "Improve function list_block_devices_check_skip_list" 2022年08月17日 12:47:45 +00:00
Jakub Jelinek
1ac61e1dbd Improve function list_block_devices_check_skip_list
Fix minor issues suggested by dtantsur
Add an example of skip list specification to the documentation
A follow-up patch to I3bdad3cca8acb3e0a69ebb218216e8c8419e9d65
Change-Id: Ic94a33b7bc0572a1cc8f92b330474ec63a173e81
2022年08月16日 15:17:15 +00:00
Zuul
3a4baa637f Merge "Enable skipping disks for cleaning" 2022年08月16日 11:49:48 +00:00
Jakub Jelinek
0212337bd5 Enable skipping disks for cleaning
Introduce a field skip_block_devices in properties - this is a list of dictionaries
Create a helper function list_block_devices_check_skip_list
Update tests of erase_devices_express to use node when calling _list_erasable_devices
Add tests covering various options of the skip list definition
Use the helper function in get_os_install_device when node is cached
Story: 2009914
Change-Id: I3bdad3cca8acb3e0a69ebb218216e8c8419e9d65
2022年08月11日 09:30:00 +00:00
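A skip list expressed as a list of dictionaries, as introduced in the commit above, can be applied with a filter like the one below. The commit message does not spell out which keys the dictionaries carry, so the `name` and `serial` keys in the usage, and the helper `filter_skipped` itself, are assumptions for illustration.

```python
def filter_skipped(block_devices, skip_list):
    """Return the devices that match no entry in the skip list.

    A device matches an entry when every key in the entry equals the
    corresponding value on the device. Empty entries match nothing.
    """
    def matches(device, hints):
        return bool(hints) and all(
            device.get(key) == value for key, value in hints.items())

    return [device for device in block_devices
            if not any(matches(device, hints) for hints in skip_list)]
```

For instance, with devices described as dictionaries, a skip list of `[{"name": "/dev/sdb"}]` would leave every device except `/dev/sdb` eligible for cleaning.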
Zuul
eb2215090a Merge "Use lsblk json output for safety_check_block_device" 2022年08月03日 23:47:17 +00:00
Jakub Jelinek
e196fdfb62 Remove unused lines of code
The 5 lines of code were extracted from erase_devices_metadata to _list_erasable_devices, but now are duplicated in both functions.
The variable block_devices is not used in erase_devices_metadata.
Change-Id: I89f56c69d90fb0eb61907d6667266fbd57d333af
2022年07月20日 10:00:53 +00:00
Riccardo Pittau
b5fac66bc3 Use lsblk json output for safety_check_block_device
Change-Id: Ibfc2e203287d92e66567c33dc48f59392852b88e
2022年07月20日 11:56:27 +02:00
Zuul
21b21a5f15 Merge "Guard shared device/cluster filesystems" 2022年07月20日 08:23:55 +00:00
Julia Kreger
beb7484858 Guard shared device/cluster filesystems
Certain filesystems are sometimes used in specialty computing
environments where a shared storage infrastructure or fabric exists.
These filesystems allow for multi-host shared concurrent read/write
access to the underlying block device by *not* locking the entire
device for exclusive use. Generally ranges of the disk are reserved
for each interacting node to write to, and locking schemes are used
to prevent collisions.
These filesystems are common for use cases where high availability
is required or ability for individual computers to collaborate on a
given workload is critical, such as a group of hypervisors supporting
virtual machines because it can allow for nearly seamless transfer
of workload from one machine to another.
Similar technologies are also used for cluster quorum and cluster
durable state sharing, however that is not specifically considered
in scope.
Where things get difficult is because the entire device is not
exclusively locked with the storage fabrics, and in some cases locking
is handled by a Distributed Lock Manager on the network, or via special
sector interactions amongst the cluster members which understand
and support the filesystem.
As a result of this IO/Interaction model, an Ironic-Python-Agent
performing cleaning can effectively destroy the cluster just by
attempting to clean storage which it perceives as attached locally.
This is not IPA's fault, often this case occurs when a Storage
Administrator forgot to update LUN masking or volume settings on
a SAN as it relates to an individual host in the overall
computing environment. The net result of one node cleaning the
shared volume may include restoration from snapshot, backup
storage, or may ultimately cause permanent data loss, depending
on the environment and the usage of that environment.
Included in this patch:
- IBM GPFS - Can be used on a shared block device... apparently according
 to IBM's documentation. The standard use of GPFS is more Ceph
 like in design... however GPFS is also a specially licensed
 commercial offering, so it is a red flag if this is
 encountered, and should be investigated by the environment's
 systems operator.
- Red Hat GFS2 - Is used with shared common block devices in clusters.
- VMware VMFS - Is used with shared SAN block devices, as well as
 local block devices. With shared block devices,
 ranges of the disk are locked instead of the whole
 disk, and the ranges are mapped to virtual machine
 disk interfaces.
 It is unknown, due to lack of information, if this
 will detect and prevent erasure of VMFS logical
 extent volumes.
Co-Authored-by: Jay Faulkner <jay@jvf.cc>
Change-Id: Ic8cade008577516e696893fdbdabf70999c06a5b
Story: 2009978
Task: 44985
2022年07月19日 13:24:03 -07:00
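The guard described in the commit above boils down to refusing to erase a device whose detected filesystem is one of the shared or cluster types. The sketch below is illustrative: `ProtectedDeviceError`, `safety_check`, and in particular the filesystem type labels are assumptions, since the commit message does not list the exact strings the tooling reports.

```python
# Assumed filesystem type labels for GPFS, GFS2 and VMFS.
GUARDED_FSTYPES = frozenset({"gpfs", "gfs2", "VMFS_volume_member"})


class ProtectedDeviceError(Exception):
    """Raised when a device appears to hold a shared/cluster filesystem."""


def safety_check(device_name, fstype):
    """Abort before erasing a device that carries a guarded filesystem."""
    if fstype in GUARDED_FSTYPES:
        raise ProtectedDeviceError(
            "%s holds a %s filesystem and may be shared with other hosts"
            % (device_name, fstype))
```

Raising before any write happens is the point of the design: a false positive costs an operator some investigation time, while a false negative can destroy a cluster's shared data.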
Dmitry Tantsur
6a1334a068 Drop support for instance netboot
Change-Id: I2b4c543537dac8904028fdcdb590c1c214238e10
2022年07月07日 16:38:22 +02:00
Zuul
5129eb4933 Merge "Fix passing kwargs in clean steps" 2022年07月04日 13:56:52 +00:00
Zuul
ccf4ee31cf Merge "Gather details about bond interfaces if present" 2022年07月02日 02:56:46 +00:00