I have two CSV files with some log information.
I need to compare them line by line: field1 of file1 against field1 of file2, allowing a difference of up to one second (-00:00:01 < difference < +00:00:01), and field5 of file1 against field3 of file2. If both match, append field4 of file2 to the corresponding line of file1; otherwise append "NOACT".
I tried with awk, join, and python, but couldn't get it to work.
file1 (55000000 lines):
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1
file2 (25000 lines):
19:15:30,187.173.121.63,42347,NOT
19:15:30,187.173.121.63,52364,OK
19:15:30,187.173.121.63,52364,OK
19:15:32,145.246.158,44444,NOT
Expected output:
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT
Example
file1:
A11 A12 A13 A14 A15 A16 A17 A18 A19 A110
A21 A22 A23 A24 A25 A26 A27 A28 A29 A210
A31 A32 A33 A34 A35 A36 A37 A38 A39 A310
file2:
B11 B12 B13
B21 B22 B23
B31 B32 B33
I need to see if B11 matches A11 and, if so, whether B12 matches A15. If both match, the first output line is the first line of A with B13 appended; otherwise move on to the next line of B. If no line of B matches, write the first line of A with "NOACT" appended.
3 Answers
You could use sqlite for joining your data:
$ sqlite3 <<EOT
.mode csv
CREATE TABLE file1 (A11,A12,A13,A14,A15,A16,A17,A18,A19,A110,A111);
CREATE TABLE file2 (B11,B12,B13,B14);
.import file1 file1
.import file2 file2
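-- A LEFT JOIN keeps every row of file1; the julianday() difference is in days,
-- so multiplying by 86400.0 expresses it in seconds and "< 1" is the one-second
-- window, while A15 = B13 matches the port fields. Unmatched rows get "NOACT"
-- through ifnull(), and DISTINCT collapses duplicate matches such as the
-- repeated line in file2.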
SELECT DISTINCT file1.*, ifnull(file2.B14,"NOACT") FROM file1 LEFT JOIN file2 ON abs(julianday(file1.A11) - julianday(file2.B11))*86400.0 < 1 AND A15 = B13;
EOT
which gives:
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT
- You are not comparing times within a range of one second; you are just rounding the numbers down to the nearest second. For this reason your fourth line is wrong: the fourth line of the first file should match the fourth line of the second file (because |19:15:31.049687 - 19:15:32| < 00:00:01). The fields appended to the five lines should be, in order, NOACT OK NOACT NOT NOACT, as mentioned in the comments to the original question. – lgpasquale, May 21, 2015 at 11:32
- @lgpasquale you're correct. Fixed that. – FloHimself, May 21, 2015 at 13:13
The following script should do what you are looking for in python:
#!/usr/bin/env python3
import sys
import csv
import bisect
import re
timeRange = 1 # seconds
# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')
# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
    # We add hours and minutes in case they are missing
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result
if (len(sys.argv) != 4):
print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)
inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]
# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2) as inputFile2:
csvReader = csv.reader(inputFile2)
for row in csvReader:
if len(row) == 4:
if not timestampRegex.match(row[0]):
continue
time = timeToFloat(row[0])
file2Entries.append((time, row))
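# The bisect calls further down assume file2Entries is sorted by time; the
# sample file2 already is, but sorting here makes the script safe for
# unsorted input as well.
file2Entries.sort()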
file1Entries = []
with open(inputFileName1) as inputFile1, open(outputFileName, 'w') as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
if len(row) == 11:
if not timestampRegex.match(row[0]):
# We can't possibly find a match for this line
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
continue
time = timeToFloat(row[0])
# Find the first and last entries of file2 within a range of 1 second
a = bisect.bisect_right(file2Entries, (time - timeRange,))
b = bisect.bisect_left(file2Entries, (time + timeRange,))
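            # The 1-tuples compare on the timestamp alone: (x,) sorts before
            # (x, row) for any row, so bisect never has to look at the row
            # contents themselves.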
# Loop on the possible matches (those within the given range)
for entry in file2Entries[a:b]:
if entry[1][2] == row[4]:
outputFile.write(','.join(row + [entry[1][3]]))
outputFile.write('\n')
break;
else:
# We haven't found a match in file2
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
You can save it as compare.py and run it as:
./compare.py input1.txt input2.txt output.txt
Note that I'm not entirely sure about the correctness of the extracted range, so you should probably verify that it works as expected.
UPDATE: Lines of the second input file that don't contain a valid timestamp are now ignored, while lines of the first input file that don't contain a valid timestamp are copied to the output file with NOACT appended at the end.
- Hi, thanks a lot for the script. When I run it the computer shows this error: root@localhost:~/Python# ./compare.py input1.txt input2.txt output.txt Traceback (most recent call last): File "./compare.py", line 39, in <module> time = timeToFloat(row[0]) File "./compare.py", line 11, in timeToFloat (h, m, s) = time.split(':') ValueError: need more than 2 values to unpack. The output is not complete, and there is another problem: it also says NOACT every time. Thanks. – beginner, May 20, 2015 at 16:31
- Can you provide a minimal example which is not working? The one currently in your question works fine for me. Since the script only parses lines that have the right number of fields, I suspect you have some line in which the time is not formatted as hh:mm:ss. – lgpasquale, May 20, 2015 at 16:40
- I've edited my answer to work if hours and minutes are missing (i.e. mm:ss and ss). – lgpasquale, May 20, 2015 at 20:45
The newly edited script is OK, but it stops at one line of the input.
Here is the error:
root@localhost:~/python# ./compare.py input1.txt input2.txt output.txt
Traceback (most recent call last):
File "./compare.py", line 46, in <module>
for row in csvReader:
File "/usr/lib/python3.2/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 2099: invalid start byte
I looked at the line in input1 where it stops and it's similar to the others, so I tried to solve the problem and found this:
I changed the script to:
#!/usr/bin/env python3
import sys
import csv
import bisect
import re
timeRange = 1 # seconds
# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')
# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
    # We add hours and minutes in case they are missing
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result
if (len(sys.argv) != 4):
print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)
inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]
# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2, encoding="utf8") as inputFile2:
csvReader = csv.reader(inputFile2)
for row in csvReader:
if len(row) == 4:
if not timestampRegex.match(row[0]):
continue
time = timeToFloat(row[0])
file2Entries.append((time, row))
file1Entries = []
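# Note: if the real encoding is not utf8, opening the file with errors="replace"
# (e.g. open(inputFileName1, encoding="utf8", errors="replace")) substitutes the
# undecodable bytes instead of raising UnicodeDecodeError.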
with open(inputFileName1, encoding="utf8") as inputFile1, open(outputFileName, 'w', encoding="utf8") as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
if len(row) == 11:
if not timestampRegex.match(row[0]):
# We can't possibly find a match for this line
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
continue
time = timeToFloat(row[0])
# Find the first and last entries of file2 within a range of 1 second
a = bisect.bisect_right(file2Entries, (time - timeRange,))
b = bisect.bisect_left(file2Entries, (time + timeRange,))
# Loop on the possible matches (those within the given range)
for entry in file2Entries[a:b]:
if entry[1][2] == row[4]:
outputFile.write(','.join(row + [entry[1][3]]))
outputFile.write('\n')
break;
else:
# We haven't found a match in file2
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
but it still gives the same error.
- I think python was already trying to use utf8. Your file, however, must be using another encoding. – lgpasquale, May 21, 2015 at 14:58
- Do you have any idea how I can solve this? I don't understand, because it processes 8668262 lines and then stops. Thank you very much. – TSTeles, May 21, 2015 at 15:07
- You could try python-chardet, which might already be available for your distribution. You can run it on your file as chardetect.py file1 and use the encoding it suggests in place of utf8. – lgpasquale, May 21, 2015 at 15:45
- The script is correct; the problem was a strange line that I deleted. SOLVED. PS: I also tried the sql script, but some lines have more "," so to use sql I would probably have to correct that first. Thanks a lot for the help! – TSTeles, May 22, 2015 at 19:20
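As an aside on that last comment: the lines with extra commas can be counted and filtered out with a quick awk pre-pass before feeding file1 to sqlite (a sketch only; the field count of 11 comes from the question, and file1.clean is just an arbitrary output name):
awk -F, 'NF != 11' file1 | wc -l        # how many malformed lines there are
awk -F, 'NF == 11' file1 > file1.clean  # keep only the well-formed lines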
awk would indeed be suited for the task (although it wouldn't be a one-liner), but to be honest it's hard to wrap one's head around your description of what you want to achieve. For example, is one supposed to match the first three fields of what seems like a time stamp, or are the records running in parallel, so that one could also go by the line number from the start?
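For what it's worth, a rough two-pass sketch of that awk approach could look like the following (untested against the real data; it assumes file2 fits comfortably in memory, that the timestamps never wrap past midnight, and that the fields are laid out exactly as in the question's samples):
awk -F, -v OFS=, '
# convert hh:mm:ss(.ffffff) to seconds since midnight
function tosec(t,  a) { split(t, a, ":"); return a[1] * 3600 + a[2] * 60 + a[3] }
NR == FNR {                        # first file on the command line: file2
    n[$3]++                        # group its rows by the port in field 3
    t2[$3, n[$3]]  = tosec($1)     # timestamp in seconds
    act[$3, n[$3]] = $4            # OK / NOT
    next
}
{                                  # second file: file1
    t = tosec($1); out = "NOACT"
    for (i = 1; i <= n[$5]; i++) { # only look at file2 rows with the same port
        d = t - t2[$5, i]
        if (d > -1 && d < 1) { out = act[$5, i]; break }
    }
    print $0, out
}' file2 file1 > output
Grouping file2 by its port field keeps the inner loop short, which matters when file1 has 55 million lines.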