I have a file that's growing by about 200,000 lines a day, and it is all formed of blocks of three lines, as such:
1358726575123 # key
Joseph Muller # name
carpenter # job
9973834728345
Andres Smith
student
7836472098652
Mariah Anthony
dentist
Now, I have another file from which I extract about 10,000 key patterns, such as 1358726575123. Then I run a for loop with these patterns and have to check them against the first file. If the first file doesn't contain the pattern, I save it in a third file for further processing:
for number in $(grep -o '[0-9]\{12\}' file2); do # finds about 10,000 keys
if ! grep -q ^$number$ file1; then # file1 is a huge file
printf "$number\n" >>file3 # we'll process file3 later
fi
done
The example code greps a huge file 10,000 times, and I run this loop about once a minute, during the whole day.
Since the huge file keeps growing, what can I do to make all this faster and save some CPU? I wonder whether sorting the file somehow by its key (if so, how?) or using a db instead of plain text would help...
- See this post: Fastest way to find lines of a file from another larger file in Bash – codeforester, Mar 3, 2018 at 18:43
11 Answers
The problem, of course, is that you run grep on the big file 10,000 times. You should read both files only once. If you want to stay outside scripting languages, you can do it this way:
- Extract all numbers from file 1 and sort them
- Extract all numbers from file 2 and sort them
- Run comm on the sorted lists to get what's only on the second list
Something like this:
$ grep -o '^[0-9]\{12\}$' file1 | sort -u -o file1.sorted
$ grep -o '[0-9]\{12\}' file2 | sort -u -o file2.sorted
$ comm -13 file1.sorted file2.sorted > file3
See man comm.
If you could truncate the big file every day (like a log file) you could keep a cache of sorted numbers and wouldn't need to parse it whole every time.
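For example, a minimal sketch of that caching idea (the file1.sorted cache, the lastline bookkeeping file, and GNU sort/comm behaviour are assumptions of this sketch, not part of the answer above):
# Hypothetical incremental variant: only parse lines appended to file1 since
# the last run, then merge their keys into a sorted cache.
touch file1.sorted lastline # state files assumed by this sketch
last=$(cat lastline) # lines already processed (empty on first run)
total=$(wc -l < file1)
tail -n "+$((last + 1))" file1 | grep -o '^[0-9]\{12\}$' | sort -u > new.sorted
sort -m -u file1.sorted new.sorted -o file1.sorted # merge new keys into the cache
echo "$total" > lastline
grep -o '[0-9]\{12\}' file2 | sort -u | comm -13 file1.sorted - > file3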
- Neat! 2 seconds (on not particularly fast drives) with 200,000 random entries in mainfile (i.e. 600,000 lines) and 143,000 random keys (that's just how my test data ended up)... tested, and it works (but you knew that :) ... I do wonder about the {12}.. OP has used 12, but the example keys are 13 long... – Peter.O, Jan 21, 2012 at 15:20
- Just a little note, you can do it without dealing with temporary files by using <(grep...sort) where the file names are. – Kevin, Jan 21, 2012 at 17:20
- Thank you, but grepping and sorting the files takes much longer than my previous loop (+2 min.). – admirabilis, Jan 21, 2012 at 19:36
- @Teresa e Junior: How big is your main file? ... You have mentioned that it grows at 200,000 lines a day, but not how big it is... To reduce the amount of data you need to process, you can read just the current day's 200,000 lines by taking note of the last line number processed (yesterday) and using tail -n +$linenum to output only the latest data. That way you will be processing only approximately 200,000 lines each day. I just tested it with 6 million lines in the main file and 10 thousand keys... time: real 0m0.016s, user 0m0.008s, sys 0m0.008s – Peter.O, Jan 21, 2012 at 19:56
- I'm really quite puzzled/curious about how you can grep your main file 10,000 times and find it faster than this method, which only greps it once (and once for the much smaller file1)... Even if your sort takes longer than my test, I just can't get my head around the idea that reading a big file that many times doesn't outweigh a single sort (timewise). – Peter.O, Jan 21, 2012 at 20:16
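As a concrete form of Kevin's process-substitution note above (a sketch; <( ) is a bash/ksh feature, so this assumes bash, and the greps are the same as in the answer):
comm -13 <(grep -o '^[0-9]\{12\}$' file1 | sort -u) \
 <(grep -o '[0-9]\{12\}' file2 | sort -u) > file3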
This answer is based on the awk answer posted by potong. It is twice as fast as the comm method (on my system), for the same 6 million lines in the main file and 10 thousand keys... (now updated to use FNR,NR).
Although awk is faster than your current system, and will give you and your computer(s) some breathing space, be aware that when data processing is as intense as you've described, you will get the best overall results by switching to a dedicated database, e.g. SQLite, MySQL...
awk '{ if (/^[^0-9]/) { next } # Skip lines which do not hold key values
if (FNR==NR) { main[0ドル]=1 } # Process keys from file "mainfile"
else if (main[0ドル]==0) { keys[0ドル]=1 } # Process keys from file "keys"
} END { for(key in keys) print key }' \
"mainfile" "keys" >"keys.not-in-main"
# For 6 million lines in "mainfile" and 10 thousand keys in "keys"
# The awk method
# time:
# real 0m14.495s
# user 0m14.457s
# sys 0m0.044s
# The comm method
# time:
# real 0m27.976s
# user 0m28.046s
# sys 0m0.104s
- This is fast, but I don't understand much of awk: what should the file names look like? I tried file1 -> mainfile and file2 -> keys with gawk and mawk, and it outputs wrong keys. – admirabilis, Jan 22, 2012 at 6:39
- file1 has keys, names and jobs. – admirabilis, Jan 22, 2012 at 6:41
- 'mainfile' is the big file (with keys, names and jobs). I've just called it 'mainfile' because I kept getting mixed up which file was which (file1 vs file2). 'keys' contains only the 10 thousand, or however many, keys. For your situation do NOT redirect anything... just use file1 EOF file2; they are the names of your files. "EOF" is a 1-line file created by the script to indicate the end of the first file (main data file) and the start of the second file (keys). awk allows you to read in a series of files; in this case that series has 3 files in it. The output goes to stdout. – Peter.O, Jan 22, 2012 at 7:40
- This script will print any keys which are present in mainfile, AND it will also print any keys from the keys file which are NOT in mainfile... That's probably what is happening... (I'll look a bit further into it...) – Peter.O, Jan 22, 2012 at 8:01
- Thank you, @Peter.O! Since the files are confidential, I'm trying to create sample files with $RANDOM for uploading. – admirabilis, Jan 22, 2012 at 8:38
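To summarize the file-order point discussed in these comments (a sketch that restates the updated answer's logic, nothing new): the big data file must be the first argument, because FNR==NR only holds while awk reads its first input file.
# "mainfile" (big data file) first, then "keys"; the order matters for FNR==NR
awk 'FNR==NR { if (/^[0-9]/) seen[0ドル]=1; next } # first file: remember its keys
 /^[0-9]/ && !(0ドル in seen) { miss[0ドル]=1 } # second file: keys not seen above
 END { for (k in miss) print k }' mainfile keys > keys.not-in-main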
Yes, definitely do use a database. They're made exactly for tasks like this.
- Thanks! I don't have much experience with databases. Which database do you recommend? I've got MySQL and the sqlite3 command installed. – admirabilis, Jan 21, 2012 at 9:47
- They are both fine for this. SQLite is simpler because it's basically just a file and an SQL API to access it. With MySQL you need to set up a MySQL server in order to use it. While that's not very difficult either, SQLite might be best to start with. – Mika Fischer, Jan 21, 2012 at 14:07
This might work for you:
awk '/^[0-9]/{a[0ドル]++}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3
EDIT:
Amended script to allow for duplicates and unknown keys in both files, still produces keys from the first file not present in the second:
awk '/^[0-9]/{if(FNR==NR){a[0ドル]=1;next};if(0ドル in a){a[0ドル]=2}}END{for(x in a)if(a[x]==1)print x}' file{1,2} >file3
- This will miss new keys which occur more than once in the main file (and, for that matter, which occur more than once in the keys file). It seems to require that the array count incrementing of the main file is not allowed to exceed 1, or some equivalent workaround (+1 because it is pretty close to the mark). – Peter.O, Jan 22, 2012 at 4:40
- I tried with gawk and mawk, and it outputs wrong keys... – admirabilis, Jan 22, 2012 at 6:42
- @Peter.O I assumed the main file had unique keys and that file 2 was a subset of the main file. – potong, Jan 22, 2012 at 15:11
- @potong The second one works well and very fast! Thank you! – admirabilis, Jan 22, 2012 at 16:55
- @Teresa e Junior Are you sure it is working correctly yet?.. Using the test data you provided, which should output 5000 keys, when I run it, it produces 136703 keys, just as I got until I finally understood what your requirements were... @potong Of course! FNR==NR (I've never used it before :) – Peter.O, Jan 22, 2012 at 22:17
With that much data, you should really switch to a database. In the meantime, one thing you must do to get anywhere near decent performance is not to search file1 separately for each key. Run a single grep to extract all the non-excluded keys at once. Since that grep also returns lines that don't contain a key, filter those away.
grep -o '[0-9]\{12\}' file2 |
grep -Fxv -f - file1 |
grep -vx '[0-9]\{12\}' >file3
(-Fx means to search whole lines, literally. -f - means to read a list of patterns from standard input.)
- Unless I'm mistaken, this doesn't address the problem of storing keys that aren't in the big file; it will store the keys that are in it. – Kevin, Jan 21, 2012 at 17:15
- @Kevin exactly, and this has forced me to use the loop. – admirabilis, Jan 21, 2012 at 18:07
- @TeresaeJunior: adding -v (-Fxv) may take care of that. – Dennis Williamson, Jan 21, 2012 at 18:28
- @DennisWilliamson That would pick all the lines in the big file that don't match any in the key file, including names, jobs, etc. – Kevin, Jan 22, 2012 at 3:26
- @Kevin Thanks, I'd misread the question. I've added a filter for non-key lines, though my preference now goes to using comm. – Gilles 'SO- stop being evil', Jan 22, 2012 at 19:12
Permit me to reinforce what others have said, "Get thee to a database!"
There are MySQL binaries freely available for most platforms.
Why not SQLite? It's memory-based, loading a flat-file when you start it, then closing it when you're done. This means that if your computer crashes or the SQLite process goes away, so does all the data.
Your problem looks like just a couple lines of SQL, and will run in milliseconds!
After installing MySQL (which I recommend over other choices), I'd shell out 40ドル for O'Reilly's SQL Cookbook by Anthony Molinaro, which has lots of problem patterns, starting with simple SELECT * FROM table queries, and going through aggregates and multiple joins.
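As a rough illustration of those "couple lines of SQL" (a hedged sketch only: the database name track, the table layout, and credentials kept in ~/.my.cnf are assumptions, not something this answer specifies):
# Hypothetical sketch with the mysql client; load the two key sets however you
# like (LOAD DATA INFILE, INSERTs, ...), then an anti-join finds missing keys.
mysql track <<'EOF'
CREATE TABLE IF NOT EXISTS people (id BIGINT PRIMARY KEY, name VARCHAR(255), job VARCHAR(255));
CREATE TABLE IF NOT EXISTS wanted (id BIGINT PRIMARY KEY);
-- ...load people from the big file and wanted from the key file...
SELECT w.id FROM wanted w LEFT JOIN people p ON p.id = w.id WHERE p.id IS NULL;
EOF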
- Yes, I'll start migrating my data to SQL in a few days, thank you! The awk scripts have been helping me a lot until I get it all done, though! – admirabilis, Jan 27, 2012 at 7:59
I'm not sure if this is the exact output that you are looking for, but probably the easiest way is:
grep -o '[0-9]\{12\}' file2 | sed 's/.*/^&$/' > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep
You could also use:
sed -ne 's/.*\([0-9]\{12\}\).*/^1円$/p' file2 > /tmp/numpatterns.grep
grep -vf /tmp/numpatterns.grep file1 > file3
rm -f /tmp/numpatterns.grep
Each of these creates a temporary pattern file which is used to glean out the numbers from the large file (file1).
- I believe this too finds numbers that are in the big file, not those that aren't. – Kevin, Jan 21, 2012 at 17:22
- Correct, I didn't see the '!' in the OP. Just need to use grep -vf instead of grep -f. – Arcege, Jan 21, 2012 at 17:40
- No @arcege, grep -vf won't display keys not matched, it will display everything including names and jobs. – admirabilis, Jan 21, 2012 at 18:11
I fully agree with you getting a database (MySQL is fairly easy to use). Until you get that running, I like Angus's comm solution, but so many people are trying with grep and getting it wrong that I thought I'd show the (or at least one) correct way to do it with grep.
grep -o '[0-9]\{12\}' keyfile | grep -v -f <(grep -o '^[0-9]\{12\}' bigfile)
The first grep gets the keys. The third grep (in the <(...)) takes all the keys used in the big file, and the <(...) passes it like a file as an argument to -f in the second grep. That causes the second grep to use it as a list of lines to match. It then uses this to match its input (the list of keys) from the pipe (first grep), and prints any keys extracted from the key file and not (-v) the big file.
Of course you can do this with temporary files you have to keep track of and remember to delete:
grep -o '[0-9]\{12\}' keyfile >allkeys
grep -o '^[0-9]\{12\}' bigfile >usedkeys
grep -v -f usedkeys allkeys
This prints all lines in allkeys that don't appear in usedkeys.
- Unfortunately it is slow, and I get a memory error after 40 seconds: grep: Memory exhausted – Peter.O, Jan 22, 2012 at 4:59
- @Peter.O But it's correct. Anyway, that's why I'd suggest a database or comm, in that order. – Kevin, Jan 22, 2012 at 5:43
- Yes that works, but is much slower than the loop. – admirabilis, Jan 22, 2012 at 6:43
The keyfile does not change? Then you should avoid searching the old entries again and again.
With tail -f you can get the output of a growing file.
tail -f growingfile | grep -f keyfile
grep -f reads the patterns from a file, one pattern per line.
- That would be good, but the key file is always different. – admirabilis, Jan 22, 2012 at 17:49
I wasn't going to post my answer because I thought that such an amount of data shouldn't be processed with a shell script, and the right answer, to use a database, was already given. But since now there are 7 other approaches...
This reads the first file into memory, then greps the second file for numbers and checks whether the values are stored in memory. It should be faster than multiple greps, if you have enough memory to load the whole file, that is.
declare -a record # record[key]="name:job" for every key present in file1

while read key
do
 read name
 read job
 record[$key]="$name:$job"
done < file1

for number in $(grep -o '[0-9]\{12\}' file2)
do
 [[ -n ${record[$number]} ]] || echo $number >> file3
done
- I've got enough memory, but I found this one even slower. Thanks though! – admirabilis, Jan 22, 2012 at 20:05
I agree with @jan-steinman that you should use a database for this kind of task. There are lots of ways to hack together a solution with a shell script, as the other answers show, but doing it that way will lead to a lot of misery if you're going to use and maintain the code for anything longer than a one-day throw-away project.
Assuming you're on a Linux box, you most likely have Python installed by default, and it has included the sqlite3 library since Python v2.5. You can check your Python version with:
% python -V
Python 2.7.2+
I recommend using the sqlite3 library because it is a simple file-based solution that exists for all platforms (including inside your web browser!) and it does not require a server to be installed. Essentially zero-configuration and zero-maintenance.
Below is a simple Python script that parses the file format you gave as an example, then does a simple "select all" query and outputs everything it stored in the db.
#!/usr/bin/env python

import sqlite3
import sys

dbname = '/tmp/simple.db'
filename = '/tmp/input.txt'

with sqlite3.connect(dbname) as conn:
 conn.execute('''create table if not exists people (key integer primary key, name text, job text)''')

 with open(filename) as f:
 for key in f:
 key = key.strip()
 name = f.next().strip()
 job = f.next().strip()
 try:
 conn.execute('''insert into people values (?,?,?)''', (key, name, job))
 except sqlite3.IntegrityError:
 sys.stderr.write('record already exists: %s, %s, %s\n' % (key, name, job))

 cur = conn.cursor()

 # get all people
 cur.execute('''select * from people''')
 for row in cur:
 print row

 # get just two specific people
 person_list = [1358726575123, 9973834728345]
 cur.execute('''select * from people where key in (?,?)''', person_list)
 for row in cur:
 print row

 # a more general way to get however many people are in the list
 person_list = [1358726575123, 9973834728345]
 template = ','.join(['?'] * len(person_list))
 cur.execute('''select * from people where key in (%s)''' % (template), person_list)
 for row in cur:
 print row
Yes, this means that you'll need to learn some SQL, but it will be well worth it in the long run. Also, instead of parsing your log files, maybe you could write data directly to your sqlite database.
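For the original task (keys from the key file that are not yet in the database), a hedged follow-up sketch in the thread's shell register, reusing the /tmp/simple.db and the people table created by the script above (the temporary candidates table is my own assumption):
# Sketch: load today's candidate keys into a temp table in one sqlite3 session,
# then report the ones the people table has never seen.
{
 echo 'CREATE TEMP TABLE candidates (key INTEGER PRIMARY KEY);'
 echo 'BEGIN;'
 grep -o '[0-9]\{12\}' file2 | sort -u | sed 's/.*/INSERT INTO candidates VALUES (&);/'
 echo 'COMMIT;'
 echo 'SELECT key FROM candidates WHERE key NOT IN (SELECT key FROM people);'
} | sqlite3 /tmp/simple.db > file3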
- Thank you for the Python script! I think /usr/bin/sqlite3 works the same way for shell scripts (packages.debian.org/squeeze/sqlite3), though I've never used it. – admirabilis, Jan 27, 2012 at 8:04
- Yes, you can use /usr/bin/sqlite3 with shell scripts, however I recommend avoiding shell scripts except for simple throw-away programs and instead using a language like Python, which has better error handling and is easier to maintain and grow. – aculich, Jan 27, 2012 at 16:46