Testing for presence of duplicate values in array

Question 1

I'm trying to find a simple way to just test an array for duplicate values. It would be nice, but not completely necessary, to be able to identify the specific lines that have duplicates, but the important point is simply being able to see that there's a duplicate.

I have an array, $key_array, which contains some numbers:

# echo ${key_array[@]}
1 2 3 4 3 3

This array could have an arbitrary number of numbers, some of which could be duplicates of others. They will be integer numbers only. (Numbers beginning with a 0, such as 03, should not make it into the array at all, but on the off chance that it happens, catching 3 and 03 as duplicates of each other would be better than treating them as different numbers.)

I need to determine if any of these numbers are duplicates. I was thinking this could be done with an exit code if nothing else. What I was after was something like this:

if $(some command); then
 echo "Array contains duplicates."
 exit 1
fi
$(commands to run after duplicate check)

The idea being in the end that the script informs the user and exits if there are duplicates (not super important to identify where the duplicates are, just telling the user to check for duplicates is enough), or if there aren't any duplicates, it proceeds and runs a bunch of other stuff.

How would I best accomplish this?

Question 2

Should 3 be considered a duplicate of 03? Is the array of integers only?

Question 3

@Quasímodo it's integers only. Numbers beginning with a 0, such as 03, should not make it into the array at all, but on the off chance that it happens, catching them as duplicates would be better than treating them as different numbers.

Question 4

$() in if $(some command); then is not necessary. The if construct by default takes a list (a sequence of one or more pipelines) and executes them, testing their exit status. Thus, if some command; then is sufficient.
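A small sketch of the difference, using true and echo false as stand-ins for "some command" (the variable names direct and substituted are illustrative only):

```bash
# `if` runs the command list itself and tests its exit status.
if true; then
  direct="then-branch"
else
  direct="else-branch"
fi

# With $(...), the command runs first and its *output* is then executed
# as a second command. `echo false` succeeds, but its output, `false`,
# is what `if` ends up testing.
if $(echo false); then
  substituted="then-branch"
else
  substituted="else-branch"
fi

printf '%s\n' "direct: $direct" "substituted: $substituted"
```

So with $(...) the branch taken depends on whatever the command prints, which is rarely what is intended.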

Question 5

In the zsh shell:

array=(1 2 3 4 3 3)
if (($#array != ${#${(u)array}})); then
 print -u2 array contains duplicates
 exit 1
fi

Where ${(u)array} expands to the unique elements of the array, so we're just comparing the number of elements with the number of unique elements.

The bash shell doesn't have an equivalent, but as its arrays can't contain NUL bytes anyway, if you're on a GNU system, you could do something like:

readarray -td '' dups < <(
 (( ${#array[@]} == 0 )) ||
 printf '%s\0' "${array[@]}" |
 LC_ALL=C sort -z |
 LC_ALL=C uniq -zd
)
if ((${#dups[@]} > 0)); then
 echo >&2 "array has duplicates:"
 printf >&2 ' - "%s"\n' "${dups[@]}"
 exit 1
fi

In both approaches, elements are considered duplicates only if they are byte-for-byte identical, not if they merely have the same numeric value (1, 01, 0x1, 1e0, 2-1, $'1\n', and ' 1' are all considered different).
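To get the exit status the question asks for, the GNU pipeline could be wrapped in a function. This is only a sketch: has_byte_dups is a made-up name, and it assumes GNU sort and uniq (for the -z option). The last line also shows that 1 and 01 are not treated as duplicates.

```bash
# Hypothetical helper: succeeds (status 0) if any two arguments are
# byte-for-byte identical. Assumes GNU sort/uniq for the -z option.
has_byte_dups() {
  (( $# > 0 )) || return 1            # no arguments: nothing can be duplicated
  local bytes
  bytes=$(printf '%s\0' "$@" | LC_ALL=C sort -z | LC_ALL=C uniq -zd | wc -c)
  (( bytes > 0 ))                     # uniq -zd printed at least one record
}

array=(1 2 3 4 3 3)
if has_byte_dups "${array[@]}"; then
  echo "array contains duplicates" >&2
fi

# 1 and 01 differ as byte strings, so this reports no duplicates.
has_byte_dups 1 01 || echo "1 and 01 are not byte-identical"
```

Counting output bytes instead of capturing the output avoids putting NUL bytes into a command substitution.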

Question 6

Assuming arr contains only integers and that zero padded numbers should be considered duplicates (e.g., 01 is a duplicate of 1), we can use a second array to keep the values already "seen" when parsing each element of the first array arr.

#!/bin/bash
arr=(1 2 3 4 3 3)
seen=()
for i in "${arr[@]}"; do
  # Remove padding zeroes, if any
  i=$((10#$i))
  # If element of arr is not in seen, add it as a key to seen
  if [ -z "${seen[i]}" ]; then
    seen[i]=1
  else
    echo "Array contains a duplicate."
    break
  fi
done

Question 7

Note that this assumes array elements are decimal integers (there's also the question of whether 01 and 1 should be deemed duplicates of each other).

Question 8

Since Bash 4.0 you may use an associative array to not have subscripts treated as arithmetic expressions. (That wouldn't properly treat them as "numbers" either, of course).

Question 9

Note that since bash interprets numbers with leading 0s as octal, 010 will be considered the same as 8, not 10.

Question 10

Note that the i=$((10#$i)) workaround doesn't work for negative numbers, where 10#-010 is interpreted as 10#0 - 010, that is, 0 - 8, which gives -8.
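One possible way around both pitfalls (octal interpretation and the sign problem) is to split off the sign before applying the 10# prefix. A minimal sketch, where normalize_int is a hypothetical helper and the input is assumed to be an optionally signed string of decimal digits:

```bash
# Hypothetical helper: print the decimal value of an optionally signed,
# possibly zero-padded integer string.
normalize_int() {
  local n=$1 sign=
  case $n in
    -*) sign=- n=${n#-} ;;
    +*) n=${n#+} ;;
  esac
  # 10# forces base 10 for the digits; the sign is applied afterwards.
  printf '%s\n' "$(( ${sign}10#$n ))"
}

echo "$(( 010 ))"        # leading 0 means octal here: prints 8
normalize_int 010        # prints 10
normalize_int -010       # prints -10
normalize_int 09         # prints 9 (a plain $(( 09 )) would be an error)
```

The normalized value could then be used as the array subscript or key in any of the "seen" approaches.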

Question 11

@ilkkachu, with the caveat that bash associative arrays don't support empty strings in their keys (which can be worked around by adding a prefix for instance).
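The associative-array variant mentioned in these comments might look like the following sketch (Bash 4.0 or later). The function name has_dupes_assoc and the x key prefix are illustrative choices; elements are compared as strings, so 1 and 01 remain distinct.

```bash
# Hypothetical helper: succeeds if any two arguments are identical strings.
has_dupes_assoc() {
  local -A seen=()
  local elem key
  for elem in "$@"; do
    key="x$elem"                 # prefix so an empty element gives a non-empty key
    if [[ -n ${seen[$key]+set} ]]; then
      return 0                   # key already recorded: duplicate found
    fi
    seen[$key]=1
  done
  return 1
}

key_array=(1 2 3 4 3 3)
if has_dupes_assoc "${key_array[@]}"; then
  echo "Array contains duplicates." >&2
fi
```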

Question 12

If you need it to work in Bash 3.X, you could use uniq:

IFS=$'\n' sort <<<"${key_array[*]}" | uniq -d; unset IFS

This prints one copy of each duplicated element of the array, and nothing else.

Description

IFS=$'\n' sets the internal field separator to a new line character, ensuring that "${key_array[*]}" will expand into a single line per array element.
<<< is a here string that feeds the output of "${key_array[*]}" into the standard input of sort.
sort well, sorts.
uniq -d outputs "...a single copy of each line that is repeated in the input." (from man uniq).
unset IFS is just good business, and resets IFS back to its default.

Question 13

(1) You should explain exactly why you believe that the Bash version matters. (2) This is very similar to the Bash part of Stéphane Chazelas’s answer. (2b) As his answer foreshadows, sort and uniq may produce unexpected results in some locales. (3) You say "unset IFS is just good business ..." I disagree. This snippet may be used in a large script where IFS has been changed to something non-standard. ... (Cont’d)

Question 14

(Cont’d) ... (See also this.) (3b) And worst of all, you don’t need to reset IFS. Your command modifies IFS only for the scope of the sort command. (4) The question asks for a test that can produce a yes-or-no answer in a script. How does this produce a yes-or-no answer in a script? ... ... ... (5) P.S. Thanks for including the explanation.
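As a possible answer to point (4), the pipeline can be turned into a yes-or-no test by checking whether uniq -d printed anything. A sketch that should also run in Bash 3.x: has_dupes_sorted is a hypothetical name, elements are assumed not to contain newlines, and printf is used instead of IFS so nothing needs to be restored afterwards.

```bash
# Hypothetical helper: succeeds if any two arguments are identical lines.
has_dupes_sorted() {
  [ "$#" -gt 0 ] || return 1
  # grep -q '^' succeeds as soon as uniq -d prints any line at all.
  printf '%s\n' "$@" | LC_ALL=C sort | LC_ALL=C uniq -d | grep -q '^'
}

key_array=(1 2 3 4 3 3)
if has_dupes_sorted "${key_array[@]}"; then
  echo "Array contains duplicates." >&2
fi
```

Setting LC_ALL=C makes sort and uniq compare raw bytes, which sidesteps the locale concern raised in point (2b).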

Question 15

Assuming that your key_array array only ever contains whole numbers (non-negative integers), we may use the fact that ordinary arrays are sparse in the bash shell. The following code loops over the keys, setting elements in a regular array as it goes, until it finds a key that has already been processed:

key_array=( '09' 1 2 3 4 3 3 '04' '001' '07' )
has_dupes () (
  unset -v a
  for key do
    ${a[10#$key]+'return'} # execute "return" if a[10#$key] is set
    a[10#$key]= # set a[10#$key] to empty string
  done
  return 1
)
if has_dupes "${key_array[@]}"; then
 echo 'array has dupes'
else
 echo 'array has no dupes'
fi

This introduces a utility function, has_dupes, that takes a list of whole numbers, and returns zero if there is a duplicate value in the list and non-zero if there are no duplicated values.

The standard parameter expansion ${variable+word} is used to insert the word return if a[10#$key] was previously set. When return is substituted, it terminates the function's execution and returns a zero exit status to the caller signifying that we found a duplicate value. The index 10#$key means "the value $key interpreted as a base 10 integer" and allows us to equate keys like 03 and 3.
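The two expansions used here can be tried in isolation. A short sketch (the variable names are arbitrary):

```bash
unset -v a
a[3]=                              # set, but to the empty string

when_set=${a[3]+'return'}          # element is set: expands to "return"
when_unset=${a[4]+'return'}        # element is unset: expands to nothing

index=$(( 10#03 ))                 # 03 read as base 10: 3

printf '[%s] [%s] [%s]\n' "$when_set" "$when_unset" "$index"
# prints: [return] [] [3]
```

Because the + form tests only whether the element is set, assigning the empty string is enough to mark a key as seen.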

score 5 · Accepted Answer · 2020-08-25 15:46:41Z (the zsh and bash answer shown under "Question 5" above)

