I'm trying to find a simple way to just test an array for duplicate values. It would be nice, but not completely necessary, to be able to identify the specific lines that have duplicates, but the important point is simply being able to see that there's a duplicate.
I have an array, $key_array, which contains some numbers:
# echo ${key_array[@]}
1 2 3 4 3 3
This array could hold an arbitrary number of values, some of which could be duplicates of others. They will be integers only. (Numbers beginning with a 0, such as 03, should not make it into the array at all, but on the off chance that it happens, catching 3 and 03 as duplicates of each other would be better than treating them as different numbers.)
I need to determine if any of these numbers are duplicates. I was thinking this could be done with an exit code if nothing else. What I was after was something like this:
if $(some command); then
echo "Array contains duplicates."
exit 1
fi
$(commands to run after duplicate check)
The idea being in the end that the script informs the user and exits if there are duplicates (not super important to identify where the duplicates are, just telling the user to check for duplicates is enough), or if there aren't any duplicates, it proceeds and runs a bunch of other stuff.
How would I best accomplish this?
4 Answers
In the zsh shell:
array=(1 2 3 4 3 3)
if (($#array != ${#${(u)array}})); then
print -u2 array contains duplicates
exit 1
fi
Here ${(u)array} expands to the unique elements of the array, so we're just comparing the number of elements with the number of unique elements.
The bash shell doesn't have an equivalent, but as its arrays can't contain NUL bytes anyway, if you're on a GNU system, you could do something like:
readarray -td '' dups < <(
    (( ${#array[@]} == 0 )) ||
        printf '%s\0' "${array[@]}" |
        LC_ALL=C sort -z |
        LC_ALL=C uniq -zd
)
if ((${#dups[@]} > 0)); then
    echo >&2 "array has duplicates:"
    printf >&2 ' - "%s"\n' "${dups[@]}"
    exit 1
fi
In those, elements are considered duplicates if they are byte-for-byte identical, not if their numeric value, if any, is the same (1, 01, 0x1, 1e0, 2-1, $'1\n', ' 1' are all considered different).
Assuming arr contains only integers and that zero-padded numbers should be considered duplicates (e.g., 01 is a duplicate of 1), we can use a second array to keep track of the values already "seen" while iterating over the elements of the first array, arr.
#!/bin/bash
arr=(1 2 3 4 3 3)
seen=()
for i in "${arr[@]}"; do
    # Remove padding zeroes, if any
    i=$((10#$i))
    # If element of arr is not in seen, add it as a key to seen
    if [ -z "${seen[i]}" ]; then
        seen[i]=1
    else
        echo "Array contains a duplicate."
        break
    fi
done
- Note that this assumes array elements are decimal integers (there's also the question of whether 01 and 1 should be deemed duplicates of each other). – Stéphane Chazelas, Aug 25, 2020 at 15:55
- Since Bash 4.0 you may use an associative array to not have subscripts treated as arithmetic expressions. (That wouldn't properly treat them as "numbers" either, of course). – fra-san, Aug 25, 2020 at 16:24
- Note that since bash interprets numbers with leading 0s as octal, 010 will be considered the same as 8, not 10. – Stéphane Chazelas, Aug 25, 2020 at 16:43
- Note that the i=$((10#$i)) workaround doesn't work for negative numbers, where 10#-010 is interpreted as 10#0 - 010, so 0 - 8, so -8. – Stéphane Chazelas, Aug 25, 2020 at 18:17
- @ilkkachu, with the caveat that bash associative arrays don't support empty strings in their keys (which can be worked around by adding a prefix, for instance). – Stéphane Chazelas, Aug 25, 2020 at 18:18
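Pulling the comments above together, here is a minimal sketch (not part of the original answer) that uses a Bash 4.0+ associative array and keeps the sign away from the 10# prefix. The function name has_duplicates and its local variable names are made up for illustration, and it assumes every element is an optionally signed decimal integer. Because the normalized key always contains at least one digit, the empty-key limitation does not come into play here.

```bash
#!/bin/bash
# Sketch only: needs Bash 4.0+ (associative arrays).
# has_duplicates is a hypothetical helper name, not from the thread.
# Assumes every argument is an optionally signed decimal integer.
has_duplicates() {
    local -A seen=()
    local item sign digits value key
    for item in "$@"; do
        # Split off an optional leading sign so that 10# only sees digits.
        sign=
        digits=$item
        case $item in
            -*) sign=-; digits=${item#-} ;;
            +*) digits=${item#+} ;;
        esac
        # Force base 10 so that 010 means ten rather than octal eight.
        value=$((10#$digits))
        # Treat -0 and 0 as the same number.
        if (( value == 0 )); then sign=; fi
        key=$sign$value
        # ${seen[$key]+set} expands to "set" only if the key already exists.
        if [[ -n ${seen[$key]+set} ]]; then
            return 0    # duplicate found
        fi
        seen[$key]=1
    done
    return 1            # no duplicates
}

key_array=(1 2 3 4 3 3)
if has_duplicates "${key_array[@]}"; then
    echo "Array contains duplicates." >&2
fi
```

Returning zero on "duplicate found" lets the function be used directly as the condition of an if statement, which matches the structure asked for in the question.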
If you need it to work in Bash 3.X, you could use uniq:
IFS=$'\n' sort <<<"${key_array[*]}" | uniq -d; unset IFS
This prints every duplicated element of the array, and nothing else.
Description
- IFS=$'\n' sets the internal field separator to a newline character, ensuring that "${key_array[*]}" will expand into a single line per array element.
- <<< is a here-string that feeds the expansion of "${key_array[*]}" into the standard input of sort.
- sort, well, sorts.
- uniq -d outputs "...a single copy of each line that is repeated in the input" (from man uniq).
- unset IFS is just good business, and resets IFS back to its default.
- (1) You should explain exactly why you believe that the Bash version matters. (2) This is very similar to the Bash part of Stéphane Chazelas’s answer. (2b) As his answer foreshadows, sort and uniq may produce unexpected results in some locales. (3) You say "unset IFS is just good business ..." I disagree. This snippet may be used in a large script where IFS has been changed to something non-standard. ... (Cont’d) – G-Man Says 'Reinstate Monica', Apr 14, 2022 at 5:11
- (Cont’d) ... (See also this.) (3b) And worst of all, you don’t need to reset IFS. Your command modifies IFS only for the scope of the sort command. (4) The question asks for a test that can produce a yes-or-no answer in a script. How does this produce a yes-or-no answer in a script? ... ... ... (5) P.S. Thanks for including the explanation. – G-Man Says 'Reinstate Monica', Apr 14, 2022 at 5:11
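One possible way to get the yes-or-no answer asked about in point (4) is to capture the output of sort and uniq -d and test whether it is empty. This is only a sketch: list_dups is a hypothetical helper name, printf is used to emit one element per line so that IFS is left alone, and it assumes no element contains a newline. Nothing here needs Bash 4 features.

```bash
# Sketch only: list_dups is a made-up helper, not from the answer above.
# Prints each duplicated argument once, one per line.
# Assumes no argument contains a newline character.
list_dups() {
    # Nothing to do for an empty list.
    [ "$#" -gt 0 ] || return 0
    # LC_ALL=C makes sort and uniq compare bytes, independent of locale.
    printf '%s\n' "$@" | LC_ALL=C sort | LC_ALL=C uniq -d
}

key_array=(1 2 3 4 3 3)
dups=$(list_dups "${key_array[@]}")

# A non-empty result means at least one value occurred more than once.
if [ -n "$dups" ]; then
    echo "Array contains duplicates:" >&2
    printf '%s\n' "$dups" >&2
fi
```

Like the original one-liner, this compares elements as strings, so 3 and 03 are treated as different values.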
Assuming that your key_array array only ever contains whole numbers (non-negative integers), we may use the fact that ordinary arrays are sparse in the bash shell. The following code loops over the array of keys, instantiating elements in a regular array until we find a key that we have already processed:
key_array=( '09' 1 2 3 4 3 3 '04' '001' '07' )
has_dupes () (
    unset -v a
    for key do
        ${a[10#$key]+'return'}  # execute "return" if a[10#$key] is set
        a[10#$key]=             # set a[10#$key] to empty string
    done
    return 1
)
if has_dupes "${key_array[@]}"; then
    echo 'array has dupes'
else
    echo 'array has no dupes'
fi
This introduces a utility function, has_dupes, that takes a list of whole numbers and returns zero if there is a duplicate value in the list and non-zero if there are no duplicated values.
The standard parameter expansion ${variable+word} is used to insert the word return if a[10#$key] was previously set. When return is substituted, it terminates the function's execution and returns a zero exit status to the caller, signifying that we found a duplicate value. The index 10#$key means "the value $key interpreted as a base 10 integer" and allows us to equate keys like 03 and 3.
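As a small illustration of the ${variable+word} expansion on its own, the following sketch (with a throwaway variable named demo, chosen here for the example) shows that the word is substituted whenever the variable is set, even if it is set to the empty string. That is why assigning an empty value in the function above is enough to mark a key as seen.

```bash
# Sketch only: demo is a throwaway variable used for illustration.
unset -v demo
echo "unset: [${demo+present}]"        # prints: unset: []

demo=
echo "set but empty: [${demo+present}]"   # prints: set but empty: [present]

# The colon form also requires the value to be non-empty.
echo "colon form: [${demo:+present}]"  # prints: colon form: []
```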
The $() in if $(some command); then is not necessary. The if construct takes a list (a sequence of one or more pipelines), executes it, and tests its exit status. Thus, if some command; then is sufficient.
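To make the difference concrete: with $(...), the shell runs the command, captures its output, and then tries to execute that output as a command of its own, which is rarely what is wanted. A minimal sketch of the plain form, using grep -qx purely as a stand-in for "some command" (the variable name result is also just for illustration):

```bash
# Sketch only: grep -qx stands in for "some command".
# if runs the pipeline and branches on its exit status; no $() is involved.
if printf '%s\n' 1 2 3 | grep -qx 3; then
    result="found"
else
    result="not found"
fi
echo "$result"    # prints: found
```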