I have a root folder Products with a bunch of sub-folders inside it, and each of those sub-folders currently contains a bunch of files. Just for simplicity I named the sub-folders folder{number} and the files files{number}.json, but in general they have different names.
In general I have 20 different sub-folders inside the root folder and each sub-folder has around 30 files at most.
(figure 1)
Products
├── folder1
│   ├── files1.json
│   ├── files2.json
│   └── files3.json
├── folder2
│   ├── files4.json
│   ├── files5.json
│   └── files6.json
└── folder3
    ├── files10.json
    ├── files7.json
    ├── files8.json
    └── files9.json
Now I am compressing all of this into a tar.gz file by running the command below:
tar cvzf ./products.tgz Products
Question:
I got a new design, shown below, where each sub-folder inside the Products root folder has three environment folders in it: dev, stage and prod.
(figure 2)
Products
├── folder1
│   ├── dev
│   │   └── files1.json
│   ├── files1.json
│   ├── files2.json
│   ├── files3.json
│   ├── prod
│   │   └── files1.json
│   └── stage
│       └── files1.json
├── folder2
│   ├── dev
│   │   └── files5.json
│   ├── files4.json
│   ├── files5.json
│   ├── files6.json
│   ├── prod
│   │   └── files5.json
│   └── stage
│       └── files5.json
└── folder3
    ├── files10.json
    ├── files7.json
    ├── files8.json
    └── files9.json
For example, inside the folder1 sub-folder there are three more sub-folders, dev, stage and prod, and exactly the same goes for the other sub-folders folder2 and folder3. Each dev, stage and prod sub-folder inside a folder{number} sub-folder holds the files that are overridden for that environment.
I now need to generate three different tar.gz files from the above structure: one each for dev, stage and prod.
- Whatever files I have inside dev, stage and prod override the files of the same name in their parent sub-folder (folder1, folder2 or folder3), if present there too.
- So if files1.json is present in the folder1 sub-folder and the same file is also present inside any of dev, stage or prod, then while packaging I need to use whatever is in that environment folder, overriding the sub-folder's file; otherwise I just use whatever is present in the sub-folder(s).
At the end I will have three different structures like this, one for dev, one for stage and one for prod, where folder1 (or 2 and 3) contains the files from its environment folder as first preference, since those are overridden, plus the other files which are not overridden.
(figure 3)
Products
├── folder1
│   ├── files1.json
│   ├── files2.json
│   └── files3.json
├── folder2
│   ├── files4.json
│   ├── files5.json
│   └── files6.json
└── folder3
    ├── files10.json
    ├── files7.json
    ├── files8.json
    └── files9.json
And I need to generate products-dev.gz, products-stage.gz and products-prod.gz from figure 2, each holding data like figure 3 but specific to one environment. The only difference is that each sub-folder folder1 (2 or 3) contains the files overridden for that environment as first preference, taken from its environment folder, while the rest come from the sub-folder itself.
Is this possible to do with some Linux commands? The only confusion I have is how to override the environment-specific files inside each particular sub-folder and then generate three different tar.gz files from that.
Update:
Also consider cases like the one below:
Products
├── folder1
│   ├── dev
│   │   ├── files1.json
│   │   └── files5.json
│   ├── files1.json
│   ├── files2.json
│   ├── files3.json
│   ├── prod
│   │   ├── files10.json
│   │   └── files1.json
│   └── stage
│       └── files1.json
├── folder2
│   ├── dev
│   ├── prod
│   └── stage
└── folder3
    ├── dev
    ├── prod
    └── stage
As you can see, folder2 and folder3 have environment override folders but no files in them; in that case I want to generate empty folder2 and folder3 directories as well in each environment-specific tar.gz file.
- Do you have to use that structure? Because it seems that having Production, Dev, Stage roots and then, inside each one, the Product hierarchy you had up until now (with just the needed files) would make everything a lot easier to deal with. – Eduardo Trápani, Aug 11, 2020 at 15:56
- Yeah, that could make things simpler, but then I would have to keep 3 different copies. Here I have a concept of default and overridden files for each environment. And also my team wants it this way. – cs98, Aug 11, 2020 at 16:34
3 Answers
There can be plenty of ways, though all require some complexity in order to handle the override case.
As a one-liner, though a bit long, you could do something like this for one iteration, i.e. one "environments" directory:
(r=Products; e=stage; (find -- "$r" -regextype posix-extended -maxdepth 2 \( -regex '^[^/]+(/[^/]+)?' -o ! -type d \) -print0; find -- "$r" -mindepth 1 -path "$r/*/$e/*" -print0) | tar --null --no-recursion -czf "$r-$e.tgz" -T- --transform=s'%^\(\([^/]\{1,\}/\)\{2\}\)[^/]\{1,\}/%1円%')
broken down for easier reading:
(
r=Products; e=stage
(
find -- "$r" -regextype posix-extended -maxdepth 2 \( -regex '^[^/]+(/[^/]+)?' -o ! -type d \) -print0
find -- "$r" -mindepth 1 -path "$r/*/$e/*" -print0
) \
| tar --null --no-recursion -czf "$r-$e.tgz" -T- \
--transform=s'%^\(\([^/]\{1,\}/\)\{2\}\)[^/]\{1,\}/%1円%'
)
Things to note:
- it shows GNU tools' syntax. For BSD find you must replace -regextype posix-extended with just -E, and for BSD tar you must replace --no-recursion with just -n, as well as --transform=s (note the final s) with just -s
- for simplicity of demonstration the snippet assumes it is run from the directory containing Products, and uses the custom $e variable for the name of the "environments" directory to archive, while $r is just a short-named helper variable holding the Products name
- it is enclosed within parentheses, making it a subshell, just so as not to pollute your shell with $r and $e should you run it from the command line
- it does not copy nor link/refer to the original files, it handles any valid filename, it has no memory constraints, and it can handle any number of names; the only assumption is about the first two levels of the directory hierarchy, in that any directory directly below the first level is considered an "environments" directory and thus ignored (except the one indicated in $e)
You could simply enclose that snippet in a for e in dev prod stage; do ...; done shell loop and just go (possibly taking away the outermost parentheses and instead surrounding the entire for loop with them).
The upside is that it is quite short and relatively simple after all.
The downside is that it always also archives all the overridden files (i.e. the base ones); the trick is just that the two find commands feed tar with the to-be-overridden files first, so that during extraction they get overwritten by the overriding files (i.e. the environment-specific ones). This leads to a bigger archive taking more time both during creation and during extraction, which may or may not be acceptable depending on whether such "overhead" is negligible for you.
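A minimal sketch of that overwrite-on-extract behavior, using hypothetical file names in a scratch directory (GNU tar assumed):

```shell
# Archive the base file first, then the stage file renamed over it;
# on plain extraction the later entry overwrites the earlier one.
set -e
d=$(mktemp -d); cd "$d"
mkdir -p Products/folder1/stage
echo base > Products/folder1/files1.json
echo stage-override > Products/folder1/stage/files1.json
printf '%s\n' Products/folder1/files1.json Products/folder1/stage/files1.json \
 | tar --no-recursion -czf demo.tgz -T - \
 --transform='s%^Products/folder1/stage/%Products/folder1/%'
mkdir out && tar -xzf demo.tgz -C out
cat out/Products/folder1/files1.json # the stage content wins
```

The archive genuinely contains two entries under the same name, which is why feeding the base files first matters: listing it with tar -tzf demo.tgz shows files1.json twice.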
That pipeline described in prose:
- (besides the outermost parentheses and the helper variables)
- the first find command produces the list of non-specific files (and leading directories, as per your update) only, while the second find produces the list of all environment-specific files only
- the two find commands are within parentheses by themselves so that both their outputs feed the pipe to tar in sequence
- tar reads that pipe in order to get the names of the files, and puts those files in the archive while also --transform-ing their names by eliminating the "environments" component (if present) from the path-name of each file
- the two find commands are kept separate instead of being just one, and run one after the other, so that the non-specific files are produced (for tar to consume) before the environment-specific files, which enables the trick described earlier
To avoid the overhead of always including all the files, we need additional complexity in order to truly purge the overridden files. One way might be like below:
# still a pipeline, but this time I won't even pretend it to be a one-liner
(
r=Products; e=stage; LC_ALL=C
find -- "$r" -regextype posix-extended \( -path "$r/*/$e/*" -o \( -regex '^([^/]+/){2}[^/]+' ! -type d \) -o -regex '^[^/]+(/[^/]+)?' \) -print0 \
| sed -zE '\%^(([^/]+/){2})([^/]+/)%s%%0/3円1円%;t;s%^%1//%' \
| sort -zt/ -k 3 -k 1,1n \
| sort -zut/ -k 3 \
| sed -zE 's%^[01]/(([^/]+/)|/)(([^/]+/?){2})%3円2円%' \
| tar --null --no-recursion -czf "$r-$e.tgz" -T- \
--transform=s'%^\(\([^/]\{1,\}/\)\{2\}\)[^/]\{1,\}/%1円%'
)
Several things to note:
- everything said earlier regarding GNU and BSD syntaxes for find and tar applies here as well
- like the previous solution, it has no constraints whatsoever besides the assumption about the first two levels of the directory hierarchy
- GNU sed is used here in order to deal with nul-delimited I/O (option -z), but you could easily replace those two sed commands with e.g. a while read ... shell loop (Bash version 3 or greater required) or another language you feel confident with, the only recommendation being that the tool is able to handle nul-delimited I/O (e.g. GNU gawk can do it); see below for a replacement using Bash loops
- one single find is used here, as it does not rely on any implied behavior from tar
- the sed commands manipulate the list of names, paving the way for the sort commands
- specifically, the first sed moves the "environments" name to the beginning of the path, also prefixing it with a helper 0 just to make it sort before the non-environment files, which get prefixed with a leading 1 for the purpose of sorting
- that preparation normalizes the list of names in the "eyes" of the sort commands, so that all names, with or without an "environments" component, have the same number of slash-delimited fields at the beginning, which is important for sort's key definitions
- the first sort sorts first on the files' names, thus putting equal names adjacent to each other, and then by the numeric 0 or 1 marked previously by the sed command, thus guaranteeing that any environment-specific file, when present, comes before its non-specific counterpart
- the second sort coalesces (option -u) on the files' names, leaving only the first of any duplicate names, which due to the previous reordering is always the environment-specific file when present
- finally, the second sed undoes what the first one did, reshaping the file names for tar to archive
If you are curious to explore the intermediate stages of that long pipeline, keep in mind that they all work with nul-delimited names, and hence do not display well on screen. You can pipe any of the intermediate outputs (i.e. taking away at least the tar) through a courtesy tr '0円' '\n' to get a human-friendly output; just remember that filenames containing newlines will span two lines on screen.
Several improvements could be done, certainly by making it a fully parameterized function/script, or for instance by detecting automatically any arbitrary name for "environments" directories, like below:
Important: pay attention to the comments as they may not be well accepted by an interactive shell
(
export r=Products LC_ALL=C
cd -- "$r/.." || exit
# make arguments out of all directories lying at the second level of the hierarchy
set -- "$r"/*/*/
# then expand all such paths found, take their basenames only, uniquify them, and pass them along xargs down to a Bash pipeline the same as above
printf '%s0円' "${@#*/*/}" \
| sort -zu \
| xargs -0I{} sh -c '
e="${1%/}"
echo --- "$e" ---
find -- "$r" -regextype posix-extended \( -path "$r/*/$e/*" -o \( -regex '\''^([^/]+/){2}[^/]+'\'' ! -type d \) -o -regex '\''^[^/]+(/[^/]+)?'\'' \) -print0 \
| sed -zE '\''\%^(([^/]+/){2})([^/]+/)%s%%0/3円1円%;t;s%^%1//%'\'' \
| sort -zt/ -k 3 -k 1,1n \
| sort -zut/ -k 3 \
| sed -zE '\''s%^[01]/(([^/]+/)|/)(([^/]+/?){2})%3円2円%'\'' \
| tar --null --no-recursion -czf "$r-$e.tgz" -T- \
--transform=s'\''%^\(\([^/]\{1,\}/\)\{2\}\)[^/]\{1,\}/%1円%'\''
' packetizer {}
)
Example replacement for the first sed command with a Bash loop:
(IFS=/; while read -r -d '' -a parts; do
  if [ "${#parts[@]}" -gt 3 ]; then
    env="${parts[2]}"; unset 'parts[2]'
    printf '0/%s/%s0円' "$env" "${parts[*]}"
  else
    printf '1//%s0円' "${parts[*]}"
  fi
done)
For the second sed command:
(IFS=/; while read -r -d '' -a parts; do
  printf %s "${parts[*]:2:2}" "/${parts[1]:+${parts[1]}/}" "${parts[*]:4}"
  printf '0円'
done)
Both snippets require the surrounding parentheses in order to be drop-in replacements for their respective sed commands within the pipeline above, and of course the sh -c piece after xargs needs to be turned into bash -c.
- @alecxs Unfortunately there simply is no POSIX tool that can deal with nul-terminated input, nor can BSD's awk and sed. However, I've now turned those shell loops into a couple of sed -z commands. Probably even more cryptic than before, but at least more compact. – LL3, Aug 13, 2020 at 15:25
- @LL3 thanks, it works fine now. One last question: is there any way to print the total number of files in each subfolder, like folder1, folder2, etc.? It doesn't need to be in the same one-liner script you had, so I am OK doing it on a separate line. – cs98, Aug 17, 2020 at 16:22
- @cs98 The extended pipeline already performs the trickiest operations needed to let you count the names going into the archive. It may be a matter of inserting a cut/sed towards a final uniq -c. Try researching U&L about counting names, as there are plenty of excellent answers on that very common task. Else it may make material for another good question. BTW: please consider upvoting and/or accepting answer(s) you found useful; it is a concrete way to say thank you and gives first-glance hints to future readers with similar problems. – LL3, Aug 17, 2020 at 21:13
General solution
- Make a copy of the directory tree. Hardlink the files to save space.
- Modify the copy. (In case of hardlinks, you need to know what you can do safely. See below.)
- Archive the copy.
- Remove the copy.
- Repeat (modifying differently) if needed.
Example
Limitations:
- this example uses non-POSIX options (tested on Debian 10),
- it makes some assumptions about the directory tree,
- it can fail if there are too many files.
Treat it as a proof of concept, adjust it to your needs.
Making a copy
cd to the parent directory of Products. This directory, Products and everything within it should belong to a single filesystem. Make a temporary directory and recreate Products there:
mkdir -p tmp
cp -la Products/ tmp/
Modifying the copy
Files in the two directory trees are hardlinked. If you modify their content then you will alter the original data. Operations that modify information held by directories are safe, they will not alter the original data if performed in the other tree. These are:
- removing files,
- renaming files,
- moving files around (this includes moving a file over another file with mv),
- creating totally independent files.
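A small illustration of that distinction, with hypothetical names in a scratch directory: renaming or removing entries in the hardlinked copy leaves the original alone, while writing through the copy does not.

```shell
set -e
d=$(mktemp -d); cd "$d"
mkdir -p Products/folder1
echo original > Products/folder1/files1.json
mkdir tmp && cp -la Products tmp/ # hardlink copy (GNU cp)
# Directory-level operations in the copy are safe:
mv tmp/Products/folder1/files1.json tmp/Products/folder1/renamed.json
cat Products/folder1/files1.json # still just "original"
# Writing through the copy alters the shared inode, i.e. the original too:
echo changed >> tmp/Products/folder1/renamed.json
wc -l < Products/folder1/files1.json # now 2 lines
```

This is why the recipe below only ever moves and removes files inside tmp, never edits them.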
In your case, for every directory named dev at the right depth, move its contents one level up:
cd tmp/Products
dname=dev
find . -mindepth 2 -maxdepth 2 -type d -name "$dname" -exec sh -c 'cd "1ドル" && mv -f -- * ../' sh {} \;
Notes:
- mv -- * ../ is prone to "argument list too long",
- by default * does not match dotfiles.
Then remove directories:
find . -mindepth 2 -maxdepth 2 -type d -exec rm -rf {} +
Note this removes the now-empty dev and the unneeded prod and stage, and any other directory at this depth.
Archiving the copy
# still in tmp/Products because of the previous step
cd ..
tar cvzf "products-$dname.tgz" Products
Removing the copy
# now in tmp because of the previous step
rm -rf Products
Repeating
Go back to the right directory and start over, this time with dname=stage; and so on.
Example script (quick and dirty)
#!/bin/bash
dir=Products
[ -d "$dir" ] || exit 1
mkdir -p tmp
for dname in dev prod stage; do
(
cp -la "$dir" tmp/
cd "tmp/$dir"
[ "$?" -eq 0 ] || exit 1
find . -mindepth 2 -maxdepth 2 -type d -name "$dname" -exec sh -c 'cd "1ドル" && mv -f -- * ../' sh {} \;
find . -mindepth 2 -maxdepth 2 -type d -exec rm -rf {} +
cd ..
[ "$?" -eq 0 ] || exit 1
tar cvzf "${dir,,}-$dname.tgz" "$dir"
rm -rf "$dir" || exit 1
) || exit "$?"
done
I made that a bit more generic and working on non-trivial file names, without actually changing the source directories.
Products is given as an argument. The keywords dev prod stage are hard-coded inside the script (but can easily be changed).
Note: this is GNU-specific (--transform, -print0 and the -z extension).
Run the script:
./script Products
#!/bin/sh
# environment
subdirs="dev prod stage"
# script requires arguments
[ -n "1ドル" ] || exit 1
# remove trailing /
while [ ${i:-0} -lt $# ]
do
i=$((i+1))
dir="1ドル"
while [ "${dir#"${dir%?}"}" = "/" ]
do
dir="${dir%/}"
done
set -- "$@" "$dir"
shift
done
# search string
for sub in $subdirs
do
[ -n "$search" ] && search="$search -o -name $sub" || search="( -name $sub"
done
search="$search )"
# GNU specific zero terminated handling for non-trivial directory names
excludes="$excludes $(find -L "$@" -type d $search -print0 | sed -z 's,[^/]*/,*/,g' | sort -z | uniq -z | xargs -0 printf '--exclude=%s\n')"
# for each argument
for dir in "$@"
do
# for each environment
[ -e "$dir" ] || continue
for sub in $subdirs
do
# exclude other subdirs
exclude=$(echo "$excludes" | grep -v "$sub")
# # exclude files that exist in subdir (at least stable against newlines and spaces in file names)
# include=$(echo "$excludes" | grep "$sub" | cut -d= -f2)
# [ -n "$include" ] && files=$(find $include -mindepth 1 -maxdepth 1 -print0 | tr '\n[[:space:]]' '?' | sed -z "s,/$sub/,/," | xargs -0 printf '--exclude=%s\n')
# exclude="$exclude $files"
# create tarball archive
archive="${dir##*/}-${sub}.tgz"
[ -f "$archive" ] && echo "WARNING: '$archive' is overwritten"
tar --transform "s,/$sub,,ドル" --transform "s,/$sub/,/," $exclude -czhf "$archive" "$dir"
done
done
You might notice duplicates inside the archive. tar will recursively descend directories; on restore, the deeper files will overwrite the files in the parent directory.
However, that needs some more testing for consistent behavior (not sure about it). The proper way would be to exclude files1.json + files5.json; unfortunately -X doesn't work with --null.
If you don't trust that behavior or don't want duplicate files in the archives, you can add some excludes for simple file names: uncomment the code above tar. Newlines and whitespace are allowed in file names but get replaced with the wildcard ? in the exclude pattern, which could in theory exclude more files than expected (if there are similar files matching that pattern).
You can place an echo before tar and you will see the script generates the following commands:
tar --transform 's,/dev,,ドル' --transform 's,/dev/,/,' --exclude=*/*/prod --exclude=*/*/stage -czhf Products-dev.tgz Products
tar --transform 's,/prod,,ドル' --transform 's,/prod/,/,' --exclude=*/*/dev --exclude=*/*/stage -czhf Products-prod.tgz Products
tar --transform 's,/stage,,ドル' --transform 's,/stage/,/,' --exclude=*/*/dev --exclude=*/*/prod -czhf Products-stage.tgz Products
- If you uncomment the exclude-files block you might get "Argument list too long" for too many files; pass the excludes via a -X index file in that case. – alecxs, Aug 12, 2020 at 20:17