I have a huge amount of data in which each (data) line should be unique.
There are a lot of files in one folder in which this is already true. It is about 15GB split into roughly 170 files with 1,000,000 lines each. Let's call that folder foo.
Now there is a second folder (bar) with even more data: in each file, there are no duplicate entries. The intersection of two files in bar is not necessarily empty. There are roughly 15k lines in each of the files there (and there are several thousand files in bar).
Right now I'm using
awk 'NR==FNR{a[0ドル]=0ドル;next}!a[0ドル]' foo/file bar/file > tmp
mv tmp bar/file
inside a loop over all files in foo and a loop over all files in bar. I break the loop over foo if bar/file is empty. I have parallelized this by locking (for use on several nodes) and parallel execution (on each node). But still, this takes a heck of a long time.
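Roughly, the serial part looks like this (a simplified sketch of what I described above, without the locking and parallel execution):
for b in bar/*; do
    for f in foo/*; do
        [ -s "$b" ] || break    # stop looping over foo once this bar file is empty
        awk 'NR==FNR{a[0ドル]=0ドル;next}!a[0ドル]' "$f" "$b" > tmp
        mv tmp "$b"
    done
done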
What are the possibilities for improving performance? What is the ideal size of the files in foo? Of course this is machine-dependent (RAM/CPU/storage), but what is a good rule of thumb here?
tl;dr: foo contains unique data lines; bar contains data lines which can appear multiple times in bar and foo. Eliminate duplicates in bar such that they can be merged with foo.
[Update] There are no empty lines. [/Update]
2 Answers
I'm not sure I understand your question, but your code can be optimised to:
awk '!x{a[0ドル];next}; !(0ドル in a)' foo/file x=1 bar/file > tmp
(I think yours had issues with empty lines, or with lines that evaluate to "0".)
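Spelled out, the same command works roughly like this (just spread out and commented):
awk '
  !x {             # x is unset (false) while reading the first file, foo/file
    a[0ドル]           # record the whole line as an array key; no value needed
    next            # move on to the next input line
  }
  !(0ドル in a)       # in bar/file, print only lines never seen in foo/file
' foo/file x=1 bar/file > tmp    # x=1 is assigned between the two files, flipping the flag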
If the files are sorted, you could do:
comm -13 foo/file bar/file > tmp
If they're not (ksh93, zsh or bash syntax):
comm -13 <(sort foo/file) <(sort bar/file) > tmp
(not necessarily faster than the awk solution)
Also, especially with GNU awk, you may get better performance by setting the locale to C/POSIX:
LC_ALL=C awk ...
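For example, combining the two points above (a sketch only; the .sorted temp file names are placeholders, and comm needs both inputs sorted in the same collation order, which using the C locale throughout guarantees):
# sort both inputs in the C locale so comm sees a consistent collation order
LC_ALL=C sort foo/file > foo.sorted
LC_ALL=C sort bar/file > bar.sorted
# print only the lines unique to bar/file (suppress columns 1 and 3)
LC_ALL=C comm -13 foo.sorted bar.sorted > tmp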
I'm using your comm -13 approach now and there is a noticeable speedup. Thanks! – stefan, Sep 11, 2012 at 12:38
I had multiple files, each a few MB in size, and I tried this, which works for me:
sort *.csv | uniq -d
This gives you the duplicate records from your files; you can redirect the output to a single file to collect the duplicates, and removing -d will give you all the unique records.
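For instance (the output file names here are just examples):
# one copy of every line that occurs more than once, collected in a file
sort *.csv | uniq -d > duplicates.txt
# without -d: every distinct line exactly once, i.e. the deduplicated data
sort *.csv | uniq > deduplicated.txt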
So what is your answer to this question? sort * | uniq? That was suggested in a comment over 2½ years ago. – G-Man Says 'Reinstate Monica', Apr 29, 2015 at 7:00
…uniq? I've no idea if that would be faster or not, but an idea nonetheless.
…cat foo bar | sort | uniq when you can do sort -u foo bar.
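If the end goal is just one merged, duplicate-free data set, and keeping bar's individual files is not required, a sketch along the lines of that last comment (merged.txt is a placeholder name, and this assumes a single sort invocation can cope with the full 15GB+):
# merge everything from foo and bar and drop duplicate lines in one pass;
# sort -u deduplicates by itself, so no separate uniq step is needed
LC_ALL=C sort -u foo/* bar/* > merged.txt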