I have a huge amount of data in which each (data) line should be unique.
There are a lot of files in one folder in which this is already true. It is about 15GB split into roughly 170 files with 1,000,000 lines each. Let's call that folder foo.
Now there is a second folder (bar) with even more data: in each file, there are no duplicate entries. The intersection of two files in bar is not necessarily empty. There are roughly 15k lines in each of the files there (and there are several thousand files in bar).
Right now I'm using
awk 'NR==FNR{a[0ドル]=0ドル;next}!a[0ドル]' foo/file bar/file > tmp
mv tmp bar/file
inside a loop over all files in foo and a loop over all files in bar. I break the loop over foo if bar/file is empty. I have parallelized this by locking (for use on several nodes) and parallel execution (on each node). But still, this takes a heck of a long time.
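Roughly, the serial part looks like this (a simplified sketch of what I described above, without the locking and parallel execution):
for b in bar/*; do
    for f in foo/*; do
        [ -s "$b" ] || break    # stop looping over foo once this bar file is empty
        awk 'NR==FNR{a[0ドル]=0ドル;next}!a[0ドル]' "$f" "$b" > tmp
        mv tmp "$b"
    done
done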
What are the possibilities for improving performance? What is the ideal size of the files in foo? Of course this is machine-dependent (RAM/CPU/storage), but what is a good rule of thumb here?
tl;dr: foo contains unique data lines; bar contains data lines which can appear multiple times in bar and foo. Eliminate duplicates in bar such that they can be merged with foo.
[Update] There are no empty lines. [/Update]
2 Answers
I'm not sure I understand your question, but your code can be optimised to:
awk '!x{a[0ドル];next}; !(0ドル in a)' foo/file x=1 bar/file > tmp
(I think yours had issues with empty lines, or with lines that evaluate to "0".)
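Spelled out, the same command works roughly like this (just spread out and commented):
awk '
  !x {             # x is unset (false) while reading the first file, foo/file
    a[0ドル]           # record the whole line as an array key; no value needed
    next            # move on to the next input line
  }
  !(0ドル in a)       # in bar/file, print only lines never seen in foo/file
' foo/file x=1 bar/file > tmp    # x=1 is assigned between the two files, flipping the flag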
If the files are sorted, you could do:
comm -13 foo/file bar/file > tmp
If they're not (ksh93, zsh or bash syntax):
comm -13 <(sort foo/file) <(sort bar/file) > tmp
(not necessarily faster than the awk solution)
Also, especially with GNU awk, you may get better performance by setting the locale to C/POSIX:
LC_ALL=C awk ...
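For example, combining the two points above (a sketch only; the .sorted temp file names are placeholders, and comm needs both inputs sorted in the same collation order, which using the C locale throughout guarantees):
# sort both inputs in the C locale so comm sees a consistent collation order
LC_ALL=C sort foo/file > foo.sorted
LC_ALL=C sort bar/file > bar.sorted
# print only the lines unique to bar/file (suppress columns 1 and 3)
LC_ALL=C comm -13 foo.sorted bar.sorted > tmp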
I'm using your comm -13 approach now and there is a noticeable speedup. Thanks! – stefan, Sep 11, 2012 at 12:38
I had multiple files, each a few MB in size, and I tried this, which works for me:
sort *.csv | uniq -d
This gives you the duplicate records from your files; you can redirect the output to a single file to collect the duplicates, and removing -d will give you all the unique records.
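For instance (the output file names here are just examples):
# one copy of every line that occurs more than once, collected in a file
sort *.csv | uniq -d > duplicates.txt
# without -d: every distinct line exactly once, i.e. the deduplicated data
sort *.csv | uniq > deduplicated.txt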
So what is your answer to this question? sort * | uniq? That was suggested in a comment over 2½ years ago. – G-Man Says 'Reinstate Monica', Apr 29, 2015 at 7:00
…uniq? I've no idea if that would be faster or not, but an idea nonetheless.
…cat foo bar | sort | uniq when you can do sort -u foo bar.
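If the end goal is just one merged, duplicate-free data set, and keeping bar's individual files is not required, a sketch along the lines of that last comment (merged.txt is a placeholder name, and this assumes a single sort invocation can cope with the full 15GB+):
# merge everything from foo and bar and drop duplicate lines in one pass;
# sort -u deduplicates by itself, so no separate uniq step is needed
LC_ALL=C sort -u foo/* bar/* > merged.txt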