I want to do the following:
- Define an array of globs that specify a base collection of files to include in a process.
- Define an array of globs that specify files to exclude from that process. It doesn't matter to me if this array of globs specifies files not even in the above collection.
- Build an array of files (not globs) that takes all files specified by the include glob array with any file belonging to the exclude glob array removed.
I have been struggling with this. To show some explicit progress and what I have attempted, I have tried something like:
```shell
# List all files to potentially include in the process
files_to_include=(
    'utils/*.txt'
)
# List any files here that should be excluded from the above list
files_to_exclude=(
    '*dont-use.txt'
    'utils/README.md'
)
# Empty array of files
files=()
for file in ${files_to_exclude[@]}; do
    temp=find $files_to_include -type f \( -name '*.txt' -and -not -name $file \)
    files+=$temp
done
# I want this to be the total collection of files that I care about
echo ${files[@]}
```
Obviously, this for loop logic doesn't work, but it at least got me started; I'm still struggling with the appropriate way to do this. (I also get weird "permission denied" messages, but only when trying to assign the output of `find` to `temp`, and I don't know why they occur.)

I like `find` because, from what I understand, its performance will be much better than `grep`'s. That is an actual concern here because there are a lot of files in my real use case. There are likely several different ways to do this, but I would like to have as little "magic" in my script as possible. So please help make the script performant but also very understandable.
As far as I can tell, I need a process that expands all the globs in the include array, expands all the globs in the exclude array, and then subtracts the exclude results from the include results. That's only the high-level view, though, and implementing it has been a challenge for me.
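That subtraction step can be sketched in plain bash (assuming bash 4+ for associative arrays; the sample tree and file names below are made up purely for illustration):

```shell
#!/usr/bin/env bash
shopt -s nullglob                # unmatched globs expand to nothing

# Hypothetical sample tree so the sketch is self-contained.
dir=$(mktemp -d)
mkdir -p "$dir/utils"
touch "$dir/utils/a.txt" "$dir/utils/b-dont-use.txt" "$dir/utils/README.md"
cd "$dir" || exit 1

include_globs=( 'utils/*.txt' )
exclude_globs=( 'utils/*dont-use.txt' 'utils/README.md' )

# Expand the exclude globs into an associative array for O(1) lookup.
declare -A excluded
for g in "${exclude_globs[@]}"; do
    for f in $g; do              # unquoted on purpose: let the shell expand the glob
        excluded[$f]=1
    done
done

# Expand the include globs, keeping only files not marked as excluded.
files=()
for g in "${include_globs[@]}"; do
    for f in $g; do
        [[ -f $f && -z ${excluded[$f]:-} ]] && files+=( "$f" )
    done
done

printf '%s\n' "${files[@]}"      # prints: utils/a.txt
```

The key point is that both arrays hold glob strings, and the expansion happens inside the loops where the variables are deliberately left unquoted.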
Thank you!
2 Answers
Looks like you want `files_to_include` to be globs while `files_to_exclude` should be just patterns, as otherwise, as a glob, `*dont-use.txt` would not generate (filename generation or pathname expansion being other names for globbing) a `utils/whatev-dont-use.txt`, so wouldn't exclude that file; and if `utils/*.txt` was just a pattern, it would also match on `utils/.git/foo/bar/.txt` for instance.
`zsh` has a `~` exclude-by-pattern glob operator, so there you could do:
```shell
set -o extendedglob
globs_to_include=(
    'utils/*.txt'
)
patterns_to_exclude=(
    '*dont-use.txt'
    'utils/README.md'
)
typeset -U files=(
    $~^globs_to_include~(${(j[|])~patterns_to_exclude})(ND.)
)
```
Or, without the need for `extendedglob`, do the filtering of the `patterns_to_exclude` afterwards using the `${array:#pattern}` parameter expansion operator:
```shell
typeset -U files=( $~^globs_to_include(N.) )
files=( ${files:#(${(j[|])~patterns_to_exclude})} )
```
If both arrays were meant to be patterns and you wanted to match them against the paths of every regular file in or below the current working directory, then that could be:
```shell
() {
  files=( ${${(M)@:#(${(j[|])~patterns_to_include})}:#(${(j[|])~patterns_to_exclude})} )
} **/*(ND.)
```
Or in separate steps to make it more legible:
```shell
pattern_to_include="(${(j[|])patterns_to_include})"
pattern_to_exclude="(${(j[|])patterns_to_exclude})"
files=( **/*(ND.) )
files=( ${(M)files:#$~pattern_to_include} )
files=( ${files:#$~pattern_to_exclude} )
```
If they're both meant to be globs, that would just be:
```shell
typeset -U files_to_include=(
    utils/*.txt(ND.)
)
typeset -U files_to_exclude=(
    *dont-use.txt(ND.)
    utils/README.md(ND.)
)
files=( ${files_to_include:|files_to_exclude} )
```
using the `${A:|B}` array subtraction operator.
Explanation of some of the zsh-specific syntax in there:
- `array=( elements )`: array declaration, as copied by a few shells since, including bash when it eventually added array support in 2.0. Similar to the `set -A array -- elements` of the Korn shell.
- `**/`: any level of directories, for recursive globbing.
- `extendedglob` option: needed for the `~` operator.
- `typeset -U array`: makes the array elements unique.
- `$~var`: makes the contents of `$var` be considered as a pattern.
- `$^array/more`: makes it so the expansion becomes `element1/more element2/more`, in csh-style `{element1,element2}/more` fashion.
- `${(...)param}`: those are parameter expansion flags; `j[|]` to join the elements of the array with `|`.
- `(ND.)`: those are glob qualifiers: `N` to enable nullglob for that glob, `D` for dotglob, `.` to restrict to files of type regular.
- `${array:#pattern}`: filters out the elements matching the pattern. With the `(M)` flag, that becomes filter *in*.
- `() { body; } args`: anonymous function being passed some arguments (available in the body in `$@` aka `$argv` and in `1ドル`, `2ドル`... as in regular named functions).
Let the quoting work for you rather than against you. Don't quote globs; let the shell try to expand them. Do double-quote variables to prevent them from being treated as globs. And do remember to put array expansions involving `@` in double quotes:
```shell
includes=( utils/*.txt )
excludes=( *dont-use.txt utils/README.md )

# Convert array to hash so we can easily index it
declare -A excludes_hash
for i in "${excludes[@]}"
do
    excludes_hash["$i"]=1
done

# Build list of files
files=()
for i in "${includes[@]}"
do
    [ -z "${excludes_hash[$i]}" ] && files+=("$i")
done

# Total collection of files that I care about
printf "%s\n" "${files[@]}"
```
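One caveat worth checking when adapting this: the exclude globs expand relative to the current directory, so `*dont-use.txt` only matches files at the top level and would not expand to `utils/whatever-dont-use.txt`. A self-contained check of the hash-lookup approach (the sample tree and the `utils/` prefix on the exclude glob are my additions):

```shell
#!/usr/bin/env bash
shopt -s nullglob

# Hypothetical sample tree to exercise the approach.
dir=$(mktemp -d)
mkdir -p "$dir/utils"
touch "$dir/utils/a.txt" "$dir/utils/b-dont-use.txt" "$dir/utils/README.md"
cd "$dir" || exit 1

includes=( utils/*.txt )
excludes=( utils/*dont-use.txt utils/README.md )   # note the utils/ prefix

# Convert array to hash so we can easily index it
declare -A excludes_hash
for i in "${excludes[@]}"; do
    excludes_hash["$i"]=1
done

# Build list of files
files=()
for i in "${includes[@]}"; do
    [ -z "${excludes_hash[$i]:-}" ] && files+=( "$i" )
done

printf '%s\n' "${files[@]}"    # prints: utils/a.txt
```

Because both arrays expand at assignment time, the hash subtraction then compares literal pathnames, which is exactly what makes the lookup cheap.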
You could also do `find . -type f \( -path "$include1" -o -path "$include2" ... \) ! \( -path "$exclude1" -o -path "$exclude2" \)`, though building the list of args to `find` is a pain (but there are posts on that on the site). Would `grep` really be slow, though? You need to do the comparisons in each case, either in `find` or in `grep`, so something like `find . -type f | grep -e ... -e ... | grep -v -e ... -e ...` shouldn't be too bad (barring issues with newlines in filenames and the fact that `grep` takes regexes instead of globs).
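Assembling that `find` argument list from the two arrays can be scripted; here is one possible sketch in bash (the sample tree is hypothetical, and note that in `-path` patterns `*` also matches `/`, so these behave as patterns rather than globs):

```shell
#!/usr/bin/env bash
# Hypothetical sample tree so the sketch is self-contained.
dir=$(mktemp -d)
mkdir -p "$dir/utils"
touch "$dir/utils/a.txt" "$dir/utils/b-dont-use.txt" "$dir/utils/README.md"
cd "$dir" || exit 1

include_pats=( 'utils/*.txt' )
exclude_pats=( '*dont-use.txt' 'utils/README.md' )

# Assemble: ( -path ./P1 -o -path ./P2 ) ! ( -path ./X1 -o ... )
args=( '(' )
for p in "${include_pats[@]}"; do
    args+=( -path "./$p" -o )
done
args[${#args[@]}-1]=')'          # swap the trailing -o for the closing paren
args+=( '!' '(' )
for p in "${exclude_pats[@]}"; do
    args+=( -path "./$p" -o )    # ./*dont-use.txt matches at any depth: * spans /
done
args[${#args[@]}-1]=')'

find . -type f "${args[@]}"      # prints: ./utils/a.txt
```

Keeping the arguments in an array (rather than a string) is what keeps patterns with spaces or glob characters intact when they reach `find`.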