A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (the duplicates are useless in my case, as the file is a CSV-like data table).
What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result, each line is to be unique. If there were 100 equal lines (usually the duplicates are spread across the file and won't be neighbours), only one of them is to be left.
I have written a program in Scala (consider it Java if you don't know Scala) to implement this, but maybe there are faster native tools written in C?
UPDATE: the awk '!seen[0ドル]++' filename
solution seemed to work just fine for me as long as the files were around 2 GiB or smaller, but now that I need to clean up an 8 GiB file it doesn't work any more. It seems to take forever on a Mac with 4 GiB RAM, and a 64-bit Windows 7 PC with 4 GiB RAM and 6 GiB swap just runs out of memory. And I don't feel enthusiastic about trying it on Linux with 4 GiB RAM given this experience.
12 Answers
An awk solution seen on #bash (Freenode):
awk '!seen[0ドル]++' filename
If you want to edit the file in-place, you can use the following command (provided that you use a GNU awk version that implements this extension):
awk -i inplace '!seen[0ドル]++' filename
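For a quick illustration on a tiny sample:
printf 'foo\nbar\nfoo\nbaz\nbar\n' | awk '!seen[0ドル]++'
This prints foo, bar, baz: only the first occurrence of each line is kept, in the original order.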
- Just tried this on a 2G file and it took three minutes on my notebook. Not bad. I also tried uniq filename | awk '!seen[0ドル]++', but it wasn't any faster. – mgjk, Jan 27, 2012 at 19:27
- @HashWizard: this command does not sort, but eliminates every subsequent occurrence of the same line. – enzotib, May 14, 2017 at 15:51
- Wondering how this command works? See here: unix.stackexchange.com/questions/159695/how-does-awk-a0-work – supergra, Oct 24, 2017 at 19:13
- @MaxWilliams yes, it works if they are randomly distributed. – setholopolus, Jan 19, 2018 at 19:58
- To preserve blank lines or lines containing only whitespace: awk '/^\s*?$/||!seen[0ドル]++' – James O'Brien, Mar 13, 2020 at 0:46
sort -u big-csv-file.csv > duplicates-removed.csv
Note that the output file will be sorted.
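To see how many duplicate lines were dropped, you can compare the line counts afterwards, for example:
wc -l big-csv-file.csv duplicates-removed.csv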
- Not as fast as the awk command in other answers, but conceptually simple! – Johann, Mar 31, 2015 at 23:11
- @Johann I am doing this pretty often on files with hundreds of thousands (even millions) of short newline-terminated strings. I get the results pretty quickly for the experiments I am doing. It matters more in scripts that are run again and again; the time savings can be considerable. – Vladislavs Dovgalecs, Mar 31, 2015 at 23:13
- Use sort -u to remove duplicates during the sort, rather than after (and save the memory bandwidth of piping to another program). This is only better than the awk version if you want your output sorted, too. (The OP on this question wants his original ordering preserved, so this is a good answer for a slightly different use-case.) – Peter Cordes, Sep 14, 2015 at 15:39
- Took about a minute, for me, for a 5.5 million line file (1.8 GB in total). Brilliant. – Max Williams, Jan 4, 2018 at 11:23
There's a simple (which is not to say obvious) method using standard utilities which doesn't require a lot of memory except to run sort, which in most implementations has specific optimizations for huge files (a good external sort algorithm). An advantage of this method is that it only loops over all the lines inside special-purpose utilities, never inside interpreted languages.
<input nl -b a -s : | # number the lines
sort -t : -k 2 -u | # sort and uniquify ignoring the line numbers
sort -t : -k 1n | # sort according to the line numbers
cut -d : -f 2- >output # remove the line numbers
If all lines begin with a non-whitespace character, you can dispense with some of the options:
<input nl | sort -k 2 -u | sort -k 1n | cut -f 2- >output
For a large amount of duplication, a method that only requires storing a single copy of each line in memory will perform better. With some interpretation overhead, there's a very concise awk script for that (already posted by enzotib):
<input awk '!seen[0ドル]++'
Less concisely: !seen[0ドル] {print} {seen[0ドル] += 1}, i.e. print the current line if it hasn't been seen yet, then increment the seen counter for this line (uninitialized variables or array elements have the numerical value 0).
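As a complete command, the expanded form would be:
<input awk '!seen[0ドル] {print} {seen[0ドル] += 1}'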
For long lines, you can save memory by keeping only a non-spoofable checksum (e.g. a cryptographic digest) of each line. For example, using SHA-1, you only need 20 bytes plus a constant overhead per line. But computing digests is rather slow; this method will only win if you have a fast CPU (especially one with a hardware accelerator to compute the digests) and not a lot of memory relative to the size of the file and sufficiently long lines. No basic utility lets you compute a checksum for each line; you'd have to bear the interpretation overhead of Perl/Python/Ruby/... or write a dedicated compiled program.
<input perl -MDigest::MD5 -ne '$seen{Digest::MD5::md5($_)}++ or print' >output
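If you would rather use SHA-1 as mentioned above, an equivalent sketch (Digest::SHA has long been a core Perl module) would be:
<input perl -MDigest::SHA -ne '$seen{Digest::SHA::sha1($_)}++ or print' >output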
- @Gilles Based on your explanation of awk '!seen[0ドル]++', does it mean that if awk sees 2 duplicate lines, it will always keep the first one and ignore all subsequent ones? (Or will it keep the last one?) – user779159, May 3, 2017 at 11:12
- @user779159 It keeps the first one: each input line is either printed immediately (first occurrence) or not at all (repeat occurrence). – Gilles 'SO- stop being evil', May 3, 2017 at 11:30
- But how does that compare to sort -u ...? – HashWizard, May 13, 2017 at 21:37
- @HashWizard A plain sort -u changes the order. My answer shows solutions that preserve the order (the order of first occurrences, to be precise). – Gilles 'SO- stop being evil', May 13, 2017 at 21:42
- @Gilles Would you say that it is faster than sort -u for large files (10G) with 50% duplicates? – HashWizard, May 13, 2017 at 21:43
Assuming you can afford to keep the equivalent of the de-duplicated file in memory (if your data is indeed duplicated by a factor of 100, that should be about 20 MiB plus overhead), you can do this very easily with Perl.
$ perl -ne 'print unless $dup{$_}++;' input_file > output_file
This preserves the order too.
You could extract the number of occurrences of each line from the %dup hash if you so wished, as an added free bonus.
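For example, a sketch (not part of the original answer) that writes the de-duplicated lines to stdout and a per-line occurrence count to stderr:
$ perl -ne '$dup{$_}++ or print; END { printf STDERR "%d\t%s", $dup{$_}, $_ for keys %dup }' input_file > output_file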
If you prefer awk, this should do it too (same logic as the perl version, same ordering, same data gathered in the dup variable):
$ awk '{if (++dup[0ドル] == 1) print 0ドル;}' input_file > output_file
- This is too good @Mat, I was about to slurp the file, lol ;-). – Nikhil Mulley, Jan 27, 2012 at 16:10
- Now waiting for @ManAtWork for his sed and awk magic weavery too :-) – Nikhil Mulley, Jan 27, 2012 at 16:11
- awesome again for the awk tip :-) – Nikhil Mulley, Jan 27, 2012 at 16:18
- Is it possible to change the perl script to only remove duplicate adjacent lines? – dumbledad, Mar 10, 2016 at 0:11
- @dumbledad: uniq does that all by itself. – Mat, Mar 10, 2016 at 5:50
The Ultimate Line De-duper
gawk -i inplace 'FNR==1{delete a} !a[0ドル]++' SOME_FILE [OTHER_FILES...]
* The command requires GNU AWK version 4.1 from 2013, or newer.
- ✅ Maintains Order
- ✅ Edits files directly
- ✅ Runs on multiple files
How It Works
The command uses an associative array to track unique lines and prints each line only once.
Here’s the algorithm in pseudocode:
if current_file_line_number is 1:
clear array a
while read line into 0ドル:
if a does not have key 0ドル:
a[0ドル] = 0
if a[0ドル] is 0:
print 0ドル
a[0ドル] = a[0ドル] + 1
Alternative versions
awk '!a[0ドル]++' < FILE_TO_DEDUP > DEDUPED_FILE
SOME_COMMAND | awk '!a[0ドル]++' | SOME_OTHER_COMMAND
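To check whether your gawk is new enough for -i inplace (4.1 or later), you can run:
gawk --version | head -n 1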
- Does this preserve the order? By the way, this did not work for me. My version is: GNU Awk 4.0.2 – Leonid, Feb 16, 2017 at 10:31
- @Leonid yes, it does. It prints the first occurrence of any unique line. The inplace support was first introduced in version 4.1, which was released in 2013. – rindeal, Feb 16, 2017 at 12:49
- This should be the answer. It actually deletes the duplicated lines in the existing file, whereas the top answer and most of the answers here only print the unique lines and we have to create another output file to store the result. – MaXi32, Jun 6, 2020 at 8:33
- How does gawk differ from awk? – alper, Feb 13, 2022 at 23:51
- @alper gawk has way more features, such as the in-place editing support, for example. – rindeal, Aug 30, 2024 at 0:08
You can use uniq (http://www.computerhope.com/unix/uuniq.htm): it reports or filters out repeated lines in a file.
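Note that on its own uniq only collapses adjacent repeats, for example:
printf 'a\na\nb\na\n' | uniq
prints a, b, a: the final a survives because it is not adjacent to the earlier ones (see also the comments below).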
- When giving an answer it is preferable to give some explanation as to WHY your answer is the one. So, how does this answer differ from several of the previous answers? – Stephen Rauch, Mar 24, 2017 at 4:08
- From the uniq man page: Note: 'uniq' does not detect repeated lines unless they are adjacent. So you have to sort first and lose the order of the non-duplicate lines. – Vindolin, Nov 6, 2018 at 7:27
SOLUTION WITHOUT MAINTAINING THE ORIGINAL SEQUENCE ORDER
I did it with the following pipeline.
sort duplicates.txt | uniq > noDuplicates.txt
The sort command sorts the lines alphabetically, and the uniq command removes the duplicates.
NOTE: The reason we sort the lines first is that uniq does not detect duplicate lines unless they are adjacent.
- The question asks for a method which (preferably) maintains the input order; could you edit your answer to address that? Note that there are existing answers using sort which maintain input order, and one answer using sort without maintaining input order but in a more efficient manner than piping to uniq. – Stephen Kitt, Sep 7, 2020 at 14:01
- @StephenKitt Edited. I inspected other answers, but couldn't find anything with only basic commands. Thanks for your feedback. – Caglayan Dokme, Sep 7, 2020 at 14:18
- I gave you a link to an answer with only basic commands, in fact only one command, sort -u (which is part of POSIX) ;-). – Stephen Kitt, Sep 7, 2020 at 14:25
- @StephenKitt I saw that answer. Mine is also a way to handle the problem. What more do you want me to do? Should I delete the answer? – Caglayan Dokme, Sep 7, 2020 at 15:48
- No, don't delete your answer; I just wanted to make sure you were aware of the other answer, given that you said you "couldn't find anything only with basic commands". – Stephen Kitt, Sep 7, 2020 at 15:59
A Python one-liner:
python -c "import sys; lines = sys.stdin.readlines(); print ''.join(sorted(set(lines)))" < InputFile
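As noted in the comments below, this slurps the whole file and does not keep the original order. A sketch that preserves order, relying on the insertion-ordered dicts of Python 3.7+:
python3 -c "import sys; sys.stdout.write(''.join(dict.fromkeys(sys.stdin)))" < InputFile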
- this causes the entire file to be slurped into memory and may not be a good fit for the OP's problem. Also not guaranteed to retain order. – iruvar, Sep 15, 2013 at 14:50
- Thanks for the suggestion, I've just been learning Python... just tried this for learning purposes :) – Rahul Patil, Sep 15, 2013 at 19:52
- Thanks @1_CR, I have something to learn today :) OrderedDict – Rahul Patil, Sep 16, 2013 at 16:39
Note that sort can write its output to any one of the files that it was given as input:
LC_ALL=C sort -u -o input input
That's fine, as sort needs to have read all its input before it can start outputting anything (before it can tell which line sorts first, which could very well be the last line of the input).
sort will (intelligently) use temporary files so as to avoid loading the whole input into memory. You'll need enough space in $TMPDIR (or /tmp if that variable is not set). Some sort implementations can compress the temp files (like with --compress-program=lzop with GNU sort), which can help if you're short on disk space or have slow disks.
With LC_ALL=C, the sort order is by byte value, which should speed things up and also guarantees a total and deterministic order (which you don't always get otherwise, especially on GNU systems).
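For example, combining the above on a GNU system (the temporary directory path here is just an illustration, and --compress-program requires GNU sort as noted above):
TMPDIR=/mnt/scratch LC_ALL=C sort -u --compress-program=lzop -o input input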
- does sort -u maintain input order? – iruvar, Feb 23, 2025 at 20:25
None of the answers here worked for me on my Mac, so I wrote a simple Python script. It ignores leading/trailing whitespace and doesn't worry about memory consumption.
import sys
inputfile = sys.argv[1]
outputfile = sys.argv[2]
with open(inputfile) as f:
content = f.readlines()
content = [x.strip() for x in content]
my_list = list(set(content))
with open(outputfile, 'w') as output:
for item in my_list:
output.write("%s\n" % item)
Save the above to unique.py and run like this:
python unique.py inputfile.txt outputfile.txt
Using Raku (formerly known as Perl_6)
~$ raku -ne 'BEGIN my %dup; .put unless %dup{$_}++;' input_file > output_file
OR:
~$ raku -e '.put for lines.unique;' input_file > output_file
[ Note: compare the first answer here with the excellent Perl answer by @Mat ].
With bash 4, a pure-bash solution that takes advantage of associative arrays can be used. Here is an example:
unset llist; declare -A llist;
while IFS= read -r line; do
if [[ ${llist[$line]} ]]; then
continue
else
printf '%s\n' "$line"
llist[$line]="x"
fi
done < file.txt
- Don't use read loops to process big text files. bash has to read one byte at a time to avoid overshooting a newline. Bash is also not very fast at text processing in general compared to awk. If you do use this, read -ra will avoid eating backslashes in your input. Also, don't forget to unset llist after the loop, if you put this in a shell function or use it interactively. – Peter Cordes, Sep 14, 2015 at 15:44
- sort -u will probably be faster.