I have a single tar file containing about 19 million files (no folders)
0000107b869682826003b04a40e6394.txt
00029237482s8923789423ud8923892.txt
2c002y8378723887292377a79237649.txt
f598238209237408238742308374038.txt
How do I untar all the files so that they end up in subdirectories named after the first four characters of each file name? For the example above, it would create the directories 0000, 0002, 2c00, and f598, each containing the matching files:
0000\0000107b869682826003b04a40e6394.txt
0002\00029237482s8923789423ud8923892.txt
2c00\2c002y8378723887292377a79237649.txt
f598\f598238209237408238742308374038.txt
I've already tried a script that iterates over the files in the tar, creates the matching directory, and extracts each file from the tar into it. This works for a small number of files, but when the tar has millions, extraction takes a really long time.
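For reference, the per-file loop described above might look like the following sketch (`file.tar` is a placeholder name). It is slow because every `tar -xf file.tar member` call scans the archive from the beginning, so the whole job is roughly quadratic in the number of members:

```shell
# Naive per-member extraction (slow on large archives):
tar -tf file.tar | while read -r f; do
 d=${f:0:4} # directory named after the first four characters
 mkdir -p "$d"
 tar -xf file.tar -C "$d" "$f" # one full archive scan per member
done
```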
2 Answers
This can be done with GNU tar's `--transform` option, which accepts an `s` command with sed syntax. (I switched the delimiter from `s///` to `s|||` so the `/` in the replacement doesn't need escaping.)
tar -xvf file.tar --transform 's|\(....\).*|\1/&|' --show-transformed-names
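To see what the transform does before running it on the archive, the same sed expression can be tried on a sample name; `\1` is the captured first four characters and `&` is the whole match:

```shell
echo '0000107b869682826003b04a40e6394.txt' | sed 's|\(....\).*|\1/&|'
# prints: 0000/0000107b869682826003b04a40e6394.txt
```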
LOL - a more elegant solution indeed :-D But talking about 19 mil files it's still going to take time :-) – Peregrino69, Sep 3, 2021 at 22:02
I created a test tarball, no directories:
pg@TREX:~/test$ tar -tvf test.tar | rev | cut -c -8 | rev
0001.txt
0002.txt
0003.txt
0004.txt
0005.txt
0011.txt
0012.txt
0013.txt
0014.txt
0015.txt
0021.txt
0022.txt
0023.txt
0024.txt
0025.txt
I ran this script (tartest.sh):
#!/bin/bash
tar -xf test.tar
# collect the unique 3-character prefixes of the extracted files
i=$(ls *.txt | cut -c -3 | sort | uniq)
echo "$i" >> directory_list
mkdir $i
# move each file into its prefix directory
while read -r line; do mv "$line"*.txt "$line/"; done < directory_list
Result:
pg@TREX:~/test$ tree
.
├── 000
│ ├── 0001.txt
│ ├── 0002.txt
│ ├── 0003.txt
│ ├── 0004.txt
│ └── 0005.txt
├── 001
│ ├── 0011.txt
│ ├── 0012.txt
│ ├── 0013.txt
│ ├── 0014.txt
│ └── 0015.txt
├── 002
│ ├── 0021.txt
│ ├── 0022.txt
│ ├── 0023.txt
│ ├── 0024.txt
│ └── 0025.txt
├── directory_list
├── tartest.sh
└── test.tar
I'm sure this'll take a bit of time with 19 mil files, and I'm sure more elegant solutions exist... but it seems to do what you asked :-)
This could fail with 19 million files: ls *.txt – Cyrus, Sep 3, 2021 at 22:23
Yeah, if all the files don't have a .txt extension, as described in the OP's question. However, it should work without the extension, creating a directory "tar" and moving the tarfile itself there - along with any other file beginning with those characters :-) Or do you see a problem with using ls in the first place? Anyway, I prefer your one-liner - I didn't even know sed could be tied to tar that way :-) – Peregrino69, Sep 3, 2021 at 22:33
I assume that if bash replaces *.txt with 19 million file names, then bash will output "argument list too long". – Cyrus, Sep 3, 2021 at 22:44
Oh, that didn't occur to me at all. And in keeping with Mr Murphy's legacy, that'd be bound to happen around file 18 699 453 :-D – Peregrino69, Sep 3, 2021 at 22:59
You can just replace it with ls | egrep '\.txt$' | ... – slebetman, Sep 4, 2021 at 6:56
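Following that suggestion, the move loop from the script above could be sketched without the glob at all (a sketch, assuming bash; here the directory is the first four characters of the name, as in the question):

```shell
# Pipe file names instead of expanding a glob, so the argument list
# stays small no matter how many files there are:
ls | grep '\.txt$' | while read -r f; do
 d=${f:0:4} # first four characters of the name
 mkdir -p "$d" # -p: no error if the directory already exists
 mv "$f" "$d/"
done
```

This still assumes file names without embedded newlines, which holds for hex-style names like the ones in the question.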