I have a file that's growing by about 200,000 lines a day, and it is all formed of blocks of three lines, as such:
1358726575123 # key
Joseph Muller # name
carpenter # job
9973834728345
Andres Smith
student
7836472098652
Mariah Anthony
dentist
Now, I have another file from which I extract about 10,000 key patterns, such as 1358726575123. Then I run a for loop with these patterns and have to check them against the first file. If the first file doesn't contain the pattern, I save it in a third file for further processing:
for number in $(grep -o '[0-9]\{12\}' file2); do # finds about 10,000 keys
if ! grep -q ^$number$ file1; then # file1 is a huge file
printf "$number\n" >>file3 # we'll process file3 later
fi
done
The example code greps a huge file 10,000 times, and I run this loop about once a minute, during the whole day.
Since the huge file keeps growing, what can I do to make all this faster and save some CPU? I wonder whether sorting the file somehow by its key (if so, how?) or using a db instead of plain text would help...
- See this post: Fastest way to find lines of a file from another larger file in Bash – codeforester, Mar 3, 2018 at 18:43
11 Answers
The problem, of course, is that you run grep on the big file 10,000 times. You should read both files only once. If you want to stay outside scripting languages, you can do it this way:
- Extract all numbers from file 1 and sort them
- Extract all numbers from file 2 and sort them
- Run comm on the sorted lists to get what's only on the second list
Something like this:
$ grep -o '^[0-9]\{12\}$' file1 | sort -u -o file1.sorted
$ grep -o '[0-9]\{12\}' file2 | sort -u -o file2.sorted
$ comm -13 file1.sorted file2.sorted > file3
See man comm.
If you could truncate the big file every day (like a log file) you could keep a cache of sorted numbers and wouldn't need to parse it whole every time.
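For example, a minimal sketch of that caching idea (the file1.sorted cache, the lastline bookkeeping file, and GNU sort/comm behaviour are assumptions of this sketch, not part of the answer above):
# Hypothetical incremental variant: only parse lines appended to file1 since
# the last run, then merge their keys into a sorted cache.
touch file1.sorted lastline # state files assumed by this sketch
last=$(cat lastline) # lines already processed (empty on first run)
total=$(wc -l < file1)
tail -n "+$((last + 1))" file1 | grep -o '^[0-9]\{12\}$' | sort -u > new.sorted
sort -m -u file1.sorted new.sorted -o file1.sorted # merge new keys into the cache
echo "$total" > lastline
grep -o '[0-9]\{12\}' file2 | sort -u | comm -13 file1.sorted - > file3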
- Neat! 2 seconds (on not particularly fast drives) with 200,000 random entries in mainfile (i.e. 600,000 lines) and 143,000 random keys (that's just how my test data ended up)... tested, and it works (but you knew that :) ... I do wonder about the {12}.. OP has used 12, but the example keys are 13 long... – Peter.O, Jan 21, 2012 at 15:20
- Just a little note, you can do it without dealing with temporary files by using <(grep...sort) where the file names are. – Kevin, Jan 21, 2012 at 17:20
- Thank you, but grepping and sorting the files takes much longer than my previous loop (+2 min.). – admirabilis, Jan 21, 2012 at 19:36
- @Teresa e Junior: How big is your main file? ... You have mentioned that it grows at 200,000 lines a day, but not how big it is... To reduce the amount of data you need to process, you can read just the current day's 200,000 lines by taking note of the last line number processed (yesterday) and using tail -n +$linenum to output only the latest data. That way you will be processing only approximately 200,000 lines each day. I just tested it with 6 million lines in the main file and 10 thousand keys... time: real 0m0.016s, user 0m0.008s, sys 0m0.008s – Peter.O, Jan 21, 2012 at 19:56
- I'm really quite puzzled/curious about how you can grep your main file 10,000 times and find it faster than this method, which only greps it once (and once for the much smaller file1)... Even if your sort takes longer than my test, I just can't get my head around the idea that reading a big file that many times doesn't outweigh a single sort (timewise). – Peter.O, Jan 21, 2012 at 20:16
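As a concrete form of Kevin's process-substitution note above (a sketch; <( ) is a bash/ksh feature, so this assumes bash, and the greps are the same as in the answer):
comm -13 <(grep -o '^[0-9]\{12\}$' file1 | sort -u) \
 <(grep -o '[0-9]\{12\}' file2 | sort -u) > file3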
This answer is based on the awk answer posted by potong. It is twice as fast as the comm method (on my system), for the same 6 million lines in the main file and 10 thousand keys... (now updated to use FNR,NR).
Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite, MySQL...
awk '{ if (/^[^0-9]/) { next } # Skip lines which do not hold key values
if (FNR==NR) { main[0ドル]=1 } # Process keys from file "mainfile"
else if (main[0ドル]==0) { keys[0ドル]=1 } # Process keys from file "keys"
} END { for(key in keys) print key }' \
"mainfile" "keys" >"keys.not-in-main"
# For 6 million lines in "mainfile" and 10 thousand keys in "keys"
# The awk method
# time:
# real 0m14.495s
# user 0m14.457s
# sys 0m0.044s
# The comm method
# time:
# real 0m27.976s
# user 0m28.046s
# sys 0m0.104s
- This is fast, but I don't understand much of awk: what should the file names look like? I tried file1 -> mainfile and file2 -> keys with gawk and mawk, and it outputs wrong keys. – admirabilis, Jan 22, 2012 at 6:39
- file1 has keys, names and jobs. – admirabilis, Jan 22, 2012 at 6:41
- 'mainfile' is the big file (with keys, names and jobs). I've just called it 'mainfile' because I kept getting mixed up which file was which (file1 vs file2). 'keys' contains only the 10 thousand, or however many, keys. For your situation do NOT redirect anything... just use file1 EOF file2; they are the names of your files. "EOF" is a 1-line file created by the script to indicate the end of the first file (main data file) and the start of the second file (keys). awk allows you to read in a series of files; in this case that series has 3 files in it. The output goes to stdout. – Peter.O, Jan 22, 2012 at 7:40
- This script will print any keys which are present in mainfile, AND it will also print any keys from the keys file which are NOT in mainfile... That's probably what is happening... (I'll look a bit further into it...) – Peter.O, Jan 22, 2012 at 8:01
- Thank you, @Peter.O! Since the files are confidential, I'm trying to create sample files with $RANDOM for uploading. – admirabilis, Jan 22, 2012 at 8:38
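To summarize the file-order point discussed in these comments (a sketch that restates the updated answer's logic, nothing new): the big data file must be the first argument, because FNR==NR only holds while awk reads its first input file.
# "mainfile" (big data file) first, then "keys"; the order matters for FNR==NR
awk 'FNR==NR { if (/^[0-9]/) seen[0ドル]=1; next } # first file: remember its keys
 /^[0-9]/ && !(0ドル in seen) { miss[0ドル]=1 } # second file: keys not seen above
 END { for (k in miss) print k }' mainfile keys > keys.not-in-main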
Yes, definitely do use a database. They're made exactly for tasks like this.
- Thanks! I don't have much experience with databases. Which database do you recommend? I've got MySQL and the sqlite3 command installed. – admirabilis, Jan 21, 2012 at 9:47
- They are both fine for this. SQLite is simpler because it's basically just a file and an SQL API to access it. With MySQL you need to set up a MySQL server in order to use it. While that's not very difficult either, SQLite might be best to start with. – Mika Fischer, Jan 21, 2012 at 14:07
This might work for you:
awk '/^[0-9]/{a[0ドル]++}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3
EDIT:
Amended script to allow for duplicates and unknown keys in both files, still produces keys from the first file not present in the second:
awk '/^[0-9]/{if(FNR==NR){a[0ドル]=1;next};if(0ドル in a){a[0ドル]=2}}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3
- This will miss new keys which occur more than once in the main file (and, for that matter, which occur more than once in the keys file). It seems to require that the array count incrementing of the main file is not allowed to exceed 1, or some equivalent workaround (+1 because it is pretty close to the mark). – Peter.O, Jan 22, 2012 at 4:40
- I tried with gawk and mawk, and it outputs wrong keys... – admirabilis, Jan 22, 2012 at 6:42
- @Peter.O I assumed the main file had unique keys and that file 2 was a subset of the main file. – potong, Jan 22, 2012 at 15:11
- @potong The second one works well and very fast! Thank you! – admirabilis, Jan 22, 2012 at 16:55
- @Teresa e Junior Are you sure it is working correctly yet?.. Using the test data you provided, which should output 5000 keys, when I run it, it produces 136703 keys, just as I got until I finally understood what your requirements were... @potong Of course! FNR==NR (I've never used it before :) – Peter.O, Jan 22, 2012 at 22:17
With that much data, you should really switch to a database. In the meantime, one thing you must do to get anywhere near decent performance is not to search file1 separately for each key. Run a single grep to extract all the non-excluded keys at once. Since that grep also returns lines that don't contain a key, filter those away.
grep -o '[0-9]\{12\}' file2 |
grep -Fxv -f - file1 |
grep -vx '[0-9]\{12\}' >file3
(-Fx means to search whole lines, literally. -f - means to read a list of patterns from standard input.)
- Unless I'm mistaken, this doesn't address the problem of storing keys that aren't in the big file; it will store the keys that are in it. – Kevin, Jan 21, 2012 at 17:15
- @Kevin exactly, and this has forced me to use the loop. – admirabilis, Jan 21, 2012 at 18:07
- @TeresaeJunior: adding -v (-Fxv) may take care of that. – Dennis Williamson, Jan 21, 2012 at 18:28
- @DennisWilliamson That would pick all the lines in the big file that don't match any in the key file, including names, jobs, etc. – Kevin, Jan 22, 2012 at 3:26
- @Kevin Thanks, I'd misread the question. I've added a filter for non-key lines, though my preference now goes to using comm. – Gilles 'SO- stop being evil', Jan 22, 2012 at 19:12
Permit me to reinforce what others have said, "Get thee to a database!"
There are MySQL binaries freely available for most platforms.
Why not SQLite? It's memory-based, loading a flat-file when you start it, then closing it when you're done. This means that if your computer crashes or the SQLite process goes away, so does all the data.
Your problem looks like just a couple lines of SQL, and will run in milliseconds!
After installing MySQL (which I recommend over other choices), I'd shell out 40ドル for O'Reilly's SQL Cookbook by Anthony Molinaro, which has lots of problem patterns, starting with simple SELECT * FROM table queries, and going through aggregates and multiple joins.
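As a rough illustration of those "couple lines of SQL" (a hedged sketch only: the database name track, the table layout, and credentials kept in ~/.my.cnf are assumptions, not something this answer specifies):
# Hypothetical sketch with the mysql client; load the two key sets however you
# like (LOAD DATA INFILE, INSERTs, ...), then an anti-join finds missing keys.
mysql track <<'EOF'
CREATE TABLE IF NOT EXISTS people (id BIGINT PRIMARY KEY, name VARCHAR(255), job VARCHAR(255));
CREATE TABLE IF NOT EXISTS wanted (id BIGINT PRIMARY KEY);
-- ...load people from the big file and wanted from the key file...
SELECT w.id FROM wanted w LEFT JOIN people p ON p.id = w.id WHERE p.id IS NULL;
EOF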
- Yes, I'll start migrating my data to SQL in a few days, thank you! The awk scripts have been helping me a lot until I get it all done, though! – admirabilis, Jan 27, 2012 at 7:59
I'm not sure if this is the exact output that you are looking for, but probably the easiest way is:
grep -o '[0-9]\{12\}' file2 | sed 's/.*/^&$/' > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep
You could also use:
sed -ne 's/.*\([0-9]\{12\}\).*/^1円$/p' file2 > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep
Each of these creates a temporary pattern file which is used to glean out the numbers from the large file (file1).
- I believe this too finds numbers that are in the big file, not those that aren't. – Kevin, Jan 21, 2012 at 17:22
- Correct, I didn't see the '!' in the OP. Just need to use grep -vf instead of grep -f. – Arcege, Jan 21, 2012 at 17:40
- No @arcege, grep -vf won't display keys not matched, it will display everything including names and jobs. – admirabilis, Jan 21, 2012 at 18:11
I fully agree with you getting a database (MySQL is fairly easy to use). Until you get that running, I like Angus's comm solution, but so many people are trying with grep and getting it wrong that I thought I'd show the (or at least one) correct way to do it with grep.
grep -o '[0-9]\{12\}' keyfile | grep -v -f <(grep -o '^[0-9]\{12\}' bigfile)
The first grep gets the keys. The third grep (in the <(...)) takes all the keys used in the big file, and the <(...) passes it like a file as an argument to -f in the second grep. That causes the second grep to use it as a list of lines to match. It then uses this to match its input (the list of keys) from the pipe (first grep), and prints any keys extracted from the key file and not (-v) the big file.
Of course you can do this with temporary files you have to keep track of and remember to delete:
grep -o '[0-9]\{12\}' keyfile >allkeys
grep -o '^[0-9]\{12\}' bigfile >usedkeys
grep -v -f usedkeys allkeys
This prints all lines in allkeys that don't appear in usedkeys.
- Unfortunately it is slow, and I get a memory error after 40 seconds: grep: Memory exhausted – Peter.O, Jan 22, 2012 at 4:59
- @Peter.O But it's correct. Anyway, that's why I'd suggest a database or comm, in that order. – Kevin, Jan 22, 2012 at 5:43
- Yes that works, but is much slower than the loop. – admirabilis, Jan 22, 2012 at 6:43
The keyfile does not change? Then you should avoid searching the old entries again and again.
With tail -f you can get the output of a growing file.
tail -f growingfile | grep -f keyfile
grep -f reads the patterns from a file, one pattern per line.
- That would be good, but the key file is always different. – admirabilis, Jan 22, 2012 at 17:49
I wasn't going to post my answer because I thought that such an amount of data shouldn't be processed with a shell script, and the right answer, to use a database, was already given. But since now there are 7 other approaches...
This reads the first file into memory, then greps the second file for numbers and checks whether the values are stored in memory. It should be faster than multiple greps, if you have enough memory to load the whole file, that is.
declare -a record # record[key]="name:job" for every key present in file1

while read key
do
 read name
 read job
 record[$key]="$name:$job"
done < file1

for number in $(grep -o '[0-9]\{12\}' file2)
do
 [[ -n ${record[$number]} ]] || echo $number >> file3
done
- I've got enough memory, but I found this one even slower. Thanks though! – admirabilis, Jan 22, 2012 at 20:05
I agree with @jan-steinman that you should use a database for this kind of task. There are lots of ways to hack together a solution with a shell script, as the other answers show, but doing it that way will lead to a lot of misery if you're going to use and maintain the code for anything longer than a one-day throw-away project.
Assuming you're on a Linux box, you most likely have Python installed by default, and it has included the sqlite3 library since Python v2.5. You can check your Python version with:
% python -V
Python 2.7.2+
I recommend using the sqlite3 library because it is a simple file-based solution that exists for all platforms (including inside your web browser!) and it does not require a server to be installed. Essentially zero-configuration and zero-maintenance.
Below is a simple Python script that parses the file format you gave as an example, then does a simple "select all" query and outputs everything it stored in the db.
#!/usr/bin/env python

import sqlite3
import sys

dbname = '/tmp/simple.db'
filename = '/tmp/input.txt'

with sqlite3.connect(dbname) as conn:
 conn.execute('''create table if not exists people (key integer primary key, name text, job text)''')

 with open(filename) as f:
 for key in f:
 key = key.strip()
 name = f.next().strip()
 job = f.next().strip()
 try:
 conn.execute('''insert into people values (?,?,?)''', (key, name, job))
 except sqlite3.IntegrityError:
 sys.stderr.write('record already exists: %s, %s, %s\n' % (key, name, job))

 cur = conn.cursor()

 # get all people
 cur.execute('''select * from people''')
 for row in cur:
 print row

 # get just two specific people
 person_list = [1358726575123, 9973834728345]
 cur.execute('''select * from people where key in (?,?)''', person_list)
 for row in cur:
 print row

 # a more general way to get however many people are in the list
 person_list = [1358726575123, 9973834728345]
 template = ','.join(['?'] * len(person_list))
 cur.execute('''select * from people where key in (%s)''' % (template), person_list)
 for row in cur:
 print row
Yes, this means that you'll need to learn some SQL, but it will be well worth it in the long run. Also, instead of parsing your log files, maybe you could write data directly to your sqlite database.
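For the original task (keys from the key file that are not yet in the database), a hedged follow-up sketch in the thread's shell register, reusing the /tmp/simple.db and the people table created by the script above (the temporary candidates table is my own assumption):
# Sketch: load today's candidate keys into a temp table in one sqlite3 session,
# then report the ones the people table has never seen.
{
 echo 'CREATE TEMP TABLE candidates (key INTEGER PRIMARY KEY);'
 echo 'BEGIN;'
 grep -o '[0-9]\{12\}' file2 | sort -u | sed 's/.*/INSERT INTO candidates VALUES (&);/'
 echo 'COMMIT;'
 echo 'SELECT key FROM candidates WHERE key NOT IN (SELECT key FROM people);'
} | sqlite3 /tmp/simple.db > file3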
- Thank you for the Python script! I think /usr/bin/sqlite3 works the same way for shell scripts (packages.debian.org/squeeze/sqlite3), though I've never used it. – admirabilis, Jan 27, 2012 at 8:04
- Yes, you can use /usr/bin/sqlite3 with shell scripts, however I recommend avoiding shell scripts except for simple throw-away programs and instead using a language like Python, which has better error handling and is easier to maintain and grow. – aculich, Jan 27, 2012 at 16:46