Open exchange on fwtar · onekey-sec/unblob · Discussion #757

qkaiser
Feb 11, 2024
Maintainer

We received two excellent bug reports from @AndrewFasano who's working on https://github.com/AndrewFasano/fw2tar which is, according to the README:

[...] an unprivileged utility designed to seamlessly convert firmware images into compressed tar archives, accurately reflecting the root filesystem's original permissions.

They maintain a fork with a few changes applied to unblob to support their permissions preservation: main...AndrewFasano:unblob:main

So, these are a few things I would like to address in this thread:

@AndrewFasano: if you're open to it, can you share some more details about fw2tar ? Is it part of academic research ? What are your plans for it ? do you have a list of filesystems you plan on supporting ?
for unblob maintainers: do we see anything in their fork that would be beneficial to be upstreamed ?
can the unblob maintainer team be of any assistance to fw2tar ?
probably have a discussion on permissions preservation and why we do things the way we do :)

To give everyone a bit of background: our approach - which has not been implemented yet - would be for format handlers / extractors to yield metadata about the files they extract (ownership, permissions, timestamps) and have them saved in the unblob report so it can be used by external tools relying on unblob (either by re-applying these permissions / ownerships, or simply showing a view of it).
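To make the idea concrete, here is a minimal sketch of what metadata reporting plus re-application by an external tool could look like. Nothing here is unblob's actual report schema: the `FileMetadata` record, its field names, and the `save_report` / `load_report` / `reapply` helpers are all assumptions made up for illustration.

```python
import json
import os
import stat
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class FileMetadata:
    # Hypothetical record; the field names are illustrative only.
    path: str   # path relative to the extraction directory
    mode: int   # mode as stored in the original filesystem
    uid: int
    gid: int
    mtime: int


def save_report(entries, report_path):
    """Serialize the metadata entries to a JSON file."""
    Path(report_path).write_text(json.dumps([asdict(e) for e in entries], indent=2))


def load_report(report_path):
    """Read the metadata entries back from the JSON file."""
    return [FileMetadata(**item) for item in json.loads(Path(report_path).read_text())]


def reapply(entries, root, set_owner=False):
    """Re-apply recorded timestamps, modes and (optionally) ownership under root."""
    # Deepest paths first, so a restrictive directory mode is applied
    # only after everything below it has been handled.
    for entry in sorted(entries, key=lambda e: e.path.count("/"), reverse=True):
        target = Path(root) / entry.path
        if target.is_symlink() or not target.exists():
            continue  # the sketch leaves symlinks and missing files alone
        if set_owner:
            os.chown(target, entry.uid, entry.gid)  # needs root or fakeroot
        os.utime(target, (entry.mtime, entry.mtime))
        os.chmod(target, stat.S_IMODE(entry.mode))
```

A consumer would run the extraction with relaxed permissions, then call `reapply(load_report("report.json"), "extracted_dir")` or simply display the recorded values.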

We can't rely on the fact that all our users run unblob under fakeroot, so we must adapt permissions as we recurse through extracted content otherwise we would lose visibility into files and directories that have strict permissions or wrong ownership and would end up raising OSErrors all the time.
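As a rough illustration of that adaptation step, the sketch below walks a tree top-down and ORs in the owner bits needed to keep traversing, while remembering the original mode of everything it touched. The `DIR_MIN` and `FILE_MIN` values and the `relax_permissions` helper are assumptions for this example, not the masks unblob actually uses.

```python
import os
import stat
from pathlib import Path

# Assumed minimum bits; unblob's real masks may differ.
DIR_MIN = stat.S_IRWXU                   # owner must list, enter and write directories
FILE_MIN = stat.S_IRUSR | stat.S_IWUSR   # owner must read and write files


def relax_permissions(root):
    """Add the minimum owner bits below root; return [(path, original_mode)]."""
    changed = []
    stack = [Path(root)]
    while stack:
        current = stack.pop()
        st = current.lstat()
        if stat.S_ISLNK(st.st_mode):
            continue  # never follow or chmod through symlinks
        mode = stat.S_IMODE(st.st_mode)
        is_dir = stat.S_ISDIR(st.st_mode)
        wanted = mode | (DIR_MIN if is_dir else FILE_MIN)
        if wanted != mode:
            os.chmod(current, wanted)
            changed.append((current, mode))
        if is_dir:
            # Safe now: the directory is readable and searchable by its owner.
            stack.extend(current.iterdir())
    return changed
```

Because bits are only added, any permission already present in the original (for example `o+w`) survives, and the returned list is exactly the kind of information that could be saved as metadata.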

Some references:

our table listing filesystems metadata preservation capabilities - https://unblob.org/formats/#filesystems
issue about meta-data reporting - Metadata file #16
draft implementation of meta-data reporting on chunks - feat(reporting): report meta-data information about chunks. #557

Replies: 5 comments 8 replies

AndrewFasano
Feb 12, 2024

Hi there, happy to talk a little about fw2tar - it's a pretty small utility focused on a very specific goal, building off the awesome work you all have done here (and also binwalk, though that project seems to have largely died over the last 3 years).

I'm wrapping up a PhD focused on firmware rehosting and dynamic analysis of firmware in general. Given a linux-based firmware image, I want to get it up and running under emulation with PANDA.re. My expertise (and the bulk of my research) is focused around runtime analysis and modification of an emulated guest. But a critical input to this is an accurate root filesystem for a given firmware image. Previously I've used binwalk/firmadyne's extractor but now it seems like unblob is the best tool around.

There are 3 goals I have with fw2tar that don't seem to overlap with unblob which is why I threw those utility scripts into a stand-alone repo instead of opening up PRs:

Identify linux-based root filesystems within extracted filesystems and package these up nicely
Maintain permissions within the filesystem
Support operation in high-performance-computing environments

Root filesystem detection: This isn't anything too fancy - my scripts search for some standard linux directories and files to identify potential root filesystems. For each, we create a tar archive and try to exclude any recursive extraction artifacts (after prototyping a few approaches around multiple extraction passes with restricted depth, I ended up just excluding directories with _extract in the name). The generated archives are sorted by size from largest to smallest.
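A minimal sketch of this kind of heuristic follows. It is not fw2tar's code: the marker directory names, the `MIN_MARKERS` threshold, and all function names are assumptions, and candidates are ordered by on-disk size rather than by archive size.

```python
import os
import tarfile
from pathlib import Path

# Assumed markers and threshold; the real heuristics may differ.
MARKERS = {"bin", "etc", "lib", "usr", "sbin", "dev", "proc", "var"}
MIN_MARKERS = 4


def tree_size(root):
    """Total size of regular files under root, ignoring *_extract* directories."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if "_extract" not in d]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def find_rootfs_candidates(extract_dir):
    """Directories that contain enough standard Linux directories, largest first."""
    candidates = []
    for dirpath, dirnames, _ in os.walk(extract_dir):
        if len(MARKERS & set(dirnames)) >= MIN_MARKERS:
            candidates.append(Path(dirpath))
    return sorted(candidates, key=tree_size, reverse=True)


def drop_extract_dirs(tarinfo):
    """tarfile filter: returning None skips the member and everything below it."""
    if tarinfo.isdir() and "_extract" in Path(tarinfo.name).name:
        return None
    return tarinfo


def archive(root, out_path):
    """Pack one candidate into a gzip-compressed tar, rooted at '.'."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(root, arcname=".", filter=drop_extract_dirs)
```

Note that the search itself has to descend into `_extract` directories, since that is where the extractor's output lives; only the archiving step filters them out.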

Maintain permissions: If we want to boot a system using an extracted filesystem, it causes all sorts of problems if the permissions are modified. One of my collaborators, @off-by-1-error opened an issue when we noticed no unblob-produced files were executable. That issue got fixed (thanks), but I recently noticed the {FILE,DIR}_PERMISSION_MASKS were also changing permissions so I disabled that and then got a simple permissions test to pass when having unblob extract a tar archive with a variety of permissions.

Support HPC environments: My research is focused on large-scale analyses of firmware - I'm working with thousands of firmware images and running them on a supercomputer where I don't have root access. I'm able to use singularity (basically a very-limited docker alternative) so I can install software and run experiments at scale. After I use fakeroot to run my modified unblob to preserve permissions and build an archive of a root filesystem, I then feed the filesystem into PANDA.re with a custom kernel and emulate the target.

I don't have any particular asks for the unblob team - I'm grateful you all have tackled the hard problem of filesystem extraction and helped with the issues we've opened! If you have any interest in supporting any of the use cases I mentioned above, I'd certainly prefer your implementations over mine. Any analyses you'd like to build for identifying linux root filesystems, or changes adding the ability to produce tar archives with correct permissions, would be awesome. But I'm not sure if those would be broadly useful to other users.

I do think I found an unblob bug around symlink handling and the MaliciousSymlinkRemoved error triggering far more often than it should, but I didn't yet have a chance to minimize a test case and open an issue. I removed that logic in my unblob fork and instead used a fork of the symlinks(8) utility to rewrite symlinks to be relative to a given directory during extraction. Unblob changes here. EDIT: I opened #761 to discuss some of the symlink bugs I'm seeing with unblob, but that's not actually related to the change here that just makes all the symlinks relative.
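For readers unfamiliar with the "make all symlinks relative" idea, a minimal sketch is shown here. It is neither the symlinks(8) fork nor the code in the unblob fork; the `relativize_symlinks` helper is hypothetical and simply treats `/` in a link target as the extraction root.

```python
import os
from pathlib import Path


def relativize_symlinks(root):
    """Rewrite absolute symlink targets as relative paths that stay under root."""
    root = Path(root)
    rewritten = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks to directories show up in dirnames, all other links in filenames.
        for name in dirnames + filenames:
            link = Path(dirpath) / name
            if not link.is_symlink():
                continue
            target = os.readlink(link)
            if not os.path.isabs(target):
                continue  # relative links are left untouched in this sketch
            # normpath drops any leading ".." of an absolute path, so the
            # result cannot climb above the extraction root.
            inside = root / os.path.normpath(target).lstrip("/")
            relative = os.path.relpath(inside, start=link.parent)
            link.unlink()
            link.symlink_to(relative)
            rewritten.append((link, relative))
    return rewritten
```

The target does not have to exist, so dangling absolute links are rewritten as well. Relative links that already escape the root are a separate problem this sketch does not address.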

Let me flip the question around and ask if there's anything I can do to help you - I'm running both unblob and binwalk at scale on thousands of firmware images. I'm creating tar archives for everything that looks like a linux root filesystem and then I'm diffing the archives. I've done some spot checks and think unblob (with my patch to permission handling) is generally extracting files with correct permissions. I'm still investigating differences I see in files extracted between the tools. I haven't done much work to fix issues I've seen with binwalk and I know it's producing more files than it should right now (mostly due to recursive extraction going too deep). On my last test with ~5k FW I saw binwalk find 50,789 files that unblob didn't and unblob find 37,708 files that binwalk didn't. From a few manual spot checks, I believe unblob is producing significantly better output than binwalk.
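A comparison of this sort can be sketched with the standard `tarfile` module. The helper names below are hypothetical and this is not the script used for the numbers above; it only shows one way to split the differences into "only in A", "only in B", and "present in both but with different metadata".

```python
import os
import tarfile


def members(archive_path):
    """Map each normalized member name to (type, mode, size, link target)."""
    with tarfile.open(archive_path) as tar:
        return {
            os.path.normpath(m.name): (m.type, m.mode, m.size, m.linkname)
            for m in tar.getmembers()
        }


def compare(a_path, b_path):
    """Return (only_in_a, only_in_b, differing) as sorted lists of names."""
    a, b = members(a_path), members(b_path)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    differing = sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
    return only_a, only_b, differing
```

Names are normalized so that `./bin/sh` and `bin/sh` count as the same file. Comparing content byte-for-byte would additionally require hashing each member's data.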

I'm currently re-running my extraction at scale with #755 + #756 applied on my fork - I'll report back on how well those work and open new issues if I run into any other errors at runtime. But if there's anything else you'd like me to check at scale, let me know!

I like the direction you're exploring with permission as metadata. If that was working, I think we'd be able to just run unblob (with no permissions), consume the metadata and then build the filesystem archive from there.

1 reply

qkaiser Feb 12, 2024
Maintainer Author

Thanks for the long and detailed answer !

I think the zero-permission mask could be upstreamed since it still provides sufficient permissions to the running process to go through extracted content.

Identification of root filesystem is out-of-scope for unblob. So is HPC, but it would be interesting to know how fast unblob does go since we did our best with Rust, mmapped files, and Hyperscan to be as fast as possible.

You definitely forced us to look at an ugly part of unblob with the symlinks bugs you found. We knew we had it coming but were too busy working on other things. Let's work together on a PR to get this fixed properly. Don't hesitate to open a PR with a branch that breaks every test by the way, we will provide guidance on how to fix them if needed. symlinks can definitely be an inspiration since it's MIT licensed.

Regarding results comparison, it would be interesting for us to know which files unblob can't handle, specifically file formats. Is there a way you could provide us with the "50,789 files that unblob didn't [find]" ? It can simply be a CSV file with full path so we can kind of guess the file format we were supposed to find. We're aware of some limitations related to custom squashfs, especially with old filesystems (e.g. onekey-sec/sasquatch#19), so this could be one of the causes.

Are you using a public dataset of firmwares to run your experiments or is it something you built internally ?

AndrewFasano
Feb 13, 2024

I love projects that have a good scope and stick to it. It makes sense that you wouldn't be interested in those use cases. I can collect some performance data and share it next time I run on the full corpus. I'd guess a median time for unblob is ~2s vs ~10s for binwalk. In one extreme case unblob extracted a filesystem in 10s that binwalk took 300s to do.

These symlinks bugs might be above my pay grade, I opened a PR with some fixes in #763, but I don't love what I built. In testing I found the symlinks utility (with patches to support rewriting links relative to a directory) wasn't working well - dangling symlinks weren't updated and would then point outside the extraction directory. I certainly won't be offended if you want to throw away all my code and build your own fix - a more unified interface + comprehensive tests for it would probably be a better design. If you're going to go that direction let me know and I can share some unit tests and minimal inputs I created while hacking on that PR.

As for the comparison between binwalk and unblob, I'm looking within identified root directories and comparing the files that are present. I don't have a list of input blobs that unblob didn't extract, just files that are/aren't present in the output. With all the changes on my branch (#755, #756 + an additional fix for directories, and #763) unblob is looking quite good - there are a bunch of files only present in the unblob extractions (15,527 in my last run) and the files produced by binwalk that don't map directly to a file in the unblob extraction are either:

Binwalk extracts files into the filesystem root directory while unblob has the file in a directory - spot checking makes me think unblob is correct here
Binwalk extracts files with non-printable characters in the name. Looks like junk.

Without #755, #756 + my additional fix for it, a few extractors were sometimes failing and many files were missing when that happened. Without #763 a large number of valid symlinks were missing.

I have a few non-public firmware corpora, but unfortunately none of them are mine so I can't redistribute them. I'm currently testing with the corpus from Greenhouse. But when I find failures I'm usually able to find the firmware online somewhere and share links to individual files.

2 replies

qkaiser Feb 13, 2024
Maintainer Author

I'll answer in detail but "Binwalk extracts files into the filesystem root directory while unblob has the file in a directory" has yaffshiv written all over it. See third paragraph at #513 (comment)

qkaiser Feb 13, 2024
Maintainer Author

I certainly won't be offended if you want to throw away all my code and build your own fix - a more unified interface + comprehensive tests for it would probably be a better design. If you're going to go that direction let me know and I can share some unit tests and minimal inputs I created while hacking on that PR.

All of this will be handled by one of my teammates in the PR itself. If we open separate PRs we'll make sure to put you as co-author in the commit so you show up in contributors :)

AndrewFasano
Feb 14, 2024

Just wanted to say thanks for the close review of my PR and support for all the issues I'm opening. And sorry again for the confusion in that first big PR I opened, hopefully the smaller PR and issues with PoCs will be more useful for y'all.

0 replies

AndrewFasano
Mar 14, 2024

I know you all have some different goals with extracted file permissions so I wanted to ask about this before I bother cleaning up code and opening PRs: if I'm expanding extractors and the filesystem class to better preserve permissions, would you be interested in getting any of those changes into unblob? Of course the permission bits you're adding would change the final permissions, but if other bits (e.g., o+w) were set in the original filesystem, that would produce a difference if the extractor handled it correctly.

For example, rehosting@2e4f43a expands Filesystem.write_chunks to take a mode arg and rehosting@5a538ed changes the yaffs extractor to create files with the mode as specified in the yaffs filesystem.
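To illustrate the shape of that change without reproducing the commits, here is a standalone sketch of a chunk writer that accepts a mode. It is a stand-in written for this example, not unblob's `Filesystem.write_chunks`; the default mode and the use of `os.fchmod` to sidestep the process umask are assumptions.

```python
import os
from pathlib import Path


def write_chunks(path, chunks, mode=0o644):
    """Write the byte chunks to path, then set the permission bits from mode.

    Standalone illustration only; not the unblob method of the same name.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file owner-writable first so the data can always be written.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        # fchmod is not affected by the umask, so bits such as o+w are kept.
        os.fchmod(f.fileno(), mode & 0o7777)
```

An extractor would then pass the mode it parsed from the filesystem entry, for example `write_chunks(out_path, data_chunks, mode=entry_mode)`, where `entry_mode` is whatever the format stores.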

Unrelated to that, I also wanted to share some results from large-scale comparisons with Binwalk - in my analysis I'm running both a slightly forked Binwalk and Unblob with my changes*, looking for directories that seem to be the root of a linux filesystem (e.g., checking for standard directories and at least a few executable files), then selecting the largest good-looking directory found by each extractor.

With this approach, I get:

19,946 byte-for-byte identical extractions
2,841 no filesystem identified in either extraction
2,358 filesystem only identified in unblob extraction
599 same files, distinct metadata (ownership, permissions, symlink destinations)
96 same files extracted but file sizes are different
95 filesystem only identified in binwalk extraction
32 different number of files between extractions (Unblob finding more)
23 different number of files (Binwalk finding more)

* My unblob changes now include swapping out a few 7z extractors for the format-specific extractors used by Binwalk as 7z unfortunately can't preserve file permissions

0 replies

qkaiser
Mar 15, 2024
Maintainer Author

For example, rehosting@2e4f43a expands Filesystem.write_chunks to take a mode arg and rehosting@5a538ed changes the yaffs extractor to create files with the mode as specified in the yaffs filesystem.

I think both of these changes make sense. What's your take @e3krisztian ? We expose the mode for carve so we might as well for write_chunks.

My unblob changes now include swapping out a few 7z extractors for the format-specific extractors used by Binwalk as 7z unfortunately can't preserve file permissions

Can you expand a bit ? Which format-specific extractors are you using ?

95 filesystem only identified in binwalk extraction

Do you know the kind of filesystems making these 95 entries ? I would suppose it's custom squashfs that our sasquatch fork can't cover but would be happy to know the exact details.

2,841 no filesystem identified in either extraction

Is it because they're encrypted or are you observing some filesystems that are not supported by the general public version of unblob ?

5 replies

qkaiser Mar 15, 2024
Maintainer Author

Also, what's the corpus size ? It'll help me interpret these numbers better :)

AndrewFasano Mar 15, 2024

We expose the mode for carve so we might as well for write_chunks.

I think you might've seen that in my fork, not in the main Unblob repo - that's something else I added (unless I missed something on the main repo).

To preserve permissions I'm using cramfsck for cramfs files over 7z (code) and I added the -k argument to ubireader_extract_files in the ubi extractor (code). I thought I had run into 7z issues in more than one extractor, but I guess it's just with cramfs for now.

My corpus size is just under 25,000 firmware images from 69 vendors. In my analysis I only consider directories that contain executable files (and I turned off your logic for explicitly setting permission bits), so in practice most of the time I have an "Unblob fails but Binwalk succeeds" label it's caused by Unblob not setting permissions while Binwalk does.

AndrewFasano Mar 15, 2024

As for the FW where neither extractor succeeds - we scraped this dataset from the internet and have no guarantees that there are Linux-based filesystems inside them, so I've just been assuming they're out of scope and not caused by flaws in the extractors.

AndrewFasano Mar 15, 2024

And since you asked, here are a few of the filenames where Binwalk seemed to work over Unblob. But again, this is probably caused by files getting extracted with bad permissions due to changes in my fork - so don't consider this a bug report! If I find any specific extractions that fail when I think they shouldn't, I'll test with the main unblob repo and open issues.

Conceptronic_C54APRB_Router_Firmware_1.00B02T02.CT/C04-047_C54APRB_Firmware_Update_v1.00B02T02.CT
isco_RV042_Router_Firmware_4.2.1.02/RV0XX-v4.2.1.02-20120118-code
CenturyLink_ZyXEL_PK5001Z_WiFi_Modem_Firmware_4.3.009.35/CZP003-4.3.009.35
Ubiquiti_EdgeSwitch_ES-48-500W_Switch_Firmware_1.9.3/ESWH.v1.9.3-lite.5434939
MikroTik_RouterOS_SMIPS_Firmware_6.44_RC_61/all_packages-smips-6.44beta61
Buffalo_WHR-600D_Router_Firmware_2.10_JP/whr600d-210
Linksys_WRT54GS_Router_TinyPEAP_Firmware_2.12/TinyPEAP_gs_2.12
TOTOLINK_N3GR_Router_Firmware_1.0/TOTOLINK N3GR_V1.0
Abicom_Freedom_FWA_Rev_3_4_Firmware_04.02.09.69_Experimental/Abicom_V04.02.09.69
NETGEAR_GSM7252Sv1h1_Switch_Firmware_11.0.0.31/M5300_V11.0.0.31

qkaiser Mar 25, 2024
Maintainer Author

Thanks for the details, I'll have a look at this.
