I have two CSV files with some log information.
I need to compare them line by line: field1 of file1 against field1 of file2, allowing a difference of up to one second (-00:00:01 < difference < +00:00:01), and field5 of file1 against field3 of file2. If both match, append field4 of file2 to the corresponding line of file1; otherwise append "NOACT".
I tried with awk, join, and python, but couldn't get it to work.
file1 (55000000 lines):
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1
file2 (25000 lines):
19:15:30,187.173.121.63,42347,NOT
19:15:30,187.173.121.63,52364,OK
19:15:30,187.173.121.63,52364,OK
19:15:32,145.246.158,44444,NOT
Expected output:
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT
Example
file1:
A11 A12 A13 A14 A15 A16 A17 A18 A19 A110
A21 A22 A23 A24 A25 A26 A27 A28 A29 A210
A31 A32 A33 A34 A35 A36 A37 A38 A39 A310
file2:
B11 B12 B13
B21 B22 B23
B31 B32 B33
I need to see if B11 matches A11 and, if so, whether B12 matches A15. If both match, the first output line is the first line of A with B13 appended; otherwise move on to the next line of B. If no line of B matches, write the first line of A with "NOACT" appended.
3 Answers
You could use sqlite for joining your data:
$ sqlite3 <<EOT
.mode csv
CREATE TABLE file1 (A11,A12,A13,A14,A15,A16,A17,A18,A19,A110,A111);
CREATE TABLE file2 (B11,B12,B13,B14);
.import file1 file1
.import file2 file2
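-- A LEFT JOIN keeps every row of file1; the julianday() difference is in days,
-- so multiplying by 86400.0 expresses it in seconds and "< 1" is the one-second
-- window, while A15 = B13 matches the port fields. Unmatched rows get "NOACT"
-- through ifnull(), and DISTINCT collapses duplicate matches such as the
-- repeated line in file2.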
SELECT DISTINCT file1.*, ifnull(file2.B14,"NOACT") FROM file1 LEFT JOIN file2 ON abs(julianday(file1.A11) - julianday(file2.B11))*86400.0 < 1 AND A15 = B13;
EOT
which gives:
19:15:30.047315,184.125.142.179,87492,174.145.246.178,87961,.,7648919765,145,767942442,896450764,1,NOACT
19:15:30.047578,184.125.142.185,87492,174.145.246.132,52364,.,7648919765,145,767942442,896450764,1,OK
19:15:31.047315,184.125.142.145,87492,174.145.246.158,52364,.,7648919765,145,767942442,896450764,1,NOACT
19:15:31.049687,184.125.142.145,87492,174.145.246.158,44444,.,7648919765,145,767942442,896450764,1,NOT
19:15:32.049687,184.125.142.145,87492,174.145.246.158,44454,.,7648919765,145,767942442,896450764,1,NOACT
- You are not comparing times within a range of one second; you are just rounding the numbers down to the nearest second. For this reason your fourth line is wrong: the fourth line of the first file should match the fourth line of the second file (because |19:15:31.049687 - 19:15:32| < 00:00:01). The fields appended to the five lines should be, in order, NOACT OK NOACT NOT NOACT, as mentioned in the comments to the original question. – lgpasquale, May 21, 2015 at 11:32
- @lgpasquale you're correct. Fixed that. – FloHimself, May 21, 2015 at 13:13
The following script should do what you are looking for in python:
#!/usr/bin/env python3
import sys
import csv
import bisect
import re
timeRange = 1 # seconds
# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')
# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
    # We add hours and minutes in case they are missing
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result
if (len(sys.argv) != 4):
print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)
inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]
# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2) as inputFile2:
csvReader = csv.reader(inputFile2)
for row in csvReader:
if len(row) == 4:
if not timestampRegex.match(row[0]):
continue
time = timeToFloat(row[0])
file2Entries.append((time, row))
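# The bisect calls further down assume file2Entries is sorted by time; the
# sample file2 already is, but sorting here makes the script safe for
# unsorted input as well.
file2Entries.sort()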
file1Entries = []
with open(inputFileName1) as inputFile1, open(outputFileName, 'w') as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
if len(row) == 11:
if not timestampRegex.match(row[0]):
# We can't possibly find a match for this line
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
continue
time = timeToFloat(row[0])
# Find the first and last entries of file2 within a range of 1 second
a = bisect.bisect_right(file2Entries, (time - timeRange,))
b = bisect.bisect_left(file2Entries, (time + timeRange,))
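            # The 1-tuples compare on the timestamp alone: (x,) sorts before
            # (x, row) for any row, so bisect never has to look at the row
            # contents themselves.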
# Loop on the possible matches (those within the given range)
for entry in file2Entries[a:b]:
if entry[1][2] == row[4]:
outputFile.write(','.join(row + [entry[1][3]]))
outputFile.write('\n')
break;
else:
# We haven't found a match in file2
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
You can save it as compare.py and run it as:
./compare.py input1.txt input2.txt output.txt
Note that I'm not entirely sure about the correctness of the extracted range, so you should probably verify that it works as expected.
UPDATE: Lines of the second input file that don't contain a valid timestamp are now ignored, while lines of the first input file that don't contain a valid timestamp are copied to the output file with NOACT appended at the end.
- Hi, thanks a lot for the script. When I run it the computer shows this error: root@localhost:~/Python# ./compare.py input1.txt input2.txt output.txt Traceback (most recent call last): File "./compare.py", line 39, in <module> time = timeToFloat(row[0]) File "./compare.py", line 11, in timeToFloat (h, m, s) = time.split(':') ValueError: need more than 2 values to unpack. The output is not complete, and there is another problem: it also says NOACT every time. Thanks. – beginner, May 20, 2015 at 16:31
- Can you provide a minimal example which is not working? The one currently in your question works fine for me. Since the script only parses lines that have the right number of fields, I suspect you have some line in which the time is not formatted as hh:mm:ss. – lgpasquale, May 20, 2015 at 16:40
- I've edited my answer to work if hours and minutes are missing (i.e. mm:ss and ss). – lgpasquale, May 20, 2015 at 20:45
The newly edited script is OK, but it stops at one line of the input.
Here is the error:
root@localhost:~/python# ./compare.py input1.txt input2.txt output.txt
Traceback (most recent call last):
File "./compare.py", line 46, in <module>
for row in csvReader:
File "/usr/lib/python3.2/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 2099: invalid start byte
I looked at the line in input1 where it stops and it's similar to the others, so I tried to solve the problem and found this:
I changed the script to:
#!/usr/bin/env python3
import sys
import csv
import bisect
import re
timeRange = 1 # seconds
# Timestamps should match the following regex
timestampRegex = re.compile('[0-9]*:*[0-9]*:[0-9][0-9]\.*[0-9]*')
# Convert the time from a string 'hh:mm:ss' to a float
def timeToFloat(time):
    # We add hours and minutes in case they are missing
# We reverse the list
# We only take the first three elements of the list anyway
(s, m, h) = (['00', '00'] + time.split(':'))[::-1][0:3]
result = float(h) * 3600 + float(m) * 60 + float(s)
return result
if (len(sys.argv) != 4):
print('Usage: {} <input file 1> <input file 2> <output file>'.format(sys.argv[0]))
exit(1)
inputFileName1 = sys.argv[1]
inputFileName2 = sys.argv[2]
outputFileName = sys.argv[3]
# Each entry will be a tuple with the time as the first element
# and the row of file2 as the second element
file2Entries = []
with open(inputFileName2, encoding="utf8") as inputFile2:
csvReader = csv.reader(inputFile2)
for row in csvReader:
if len(row) == 4:
if not timestampRegex.match(row[0]):
continue
time = timeToFloat(row[0])
file2Entries.append((time, row))
file1Entries = []
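# Note: if the real encoding is not utf8, opening the file with errors="replace"
# (e.g. open(inputFileName1, encoding="utf8", errors="replace")) substitutes the
# undecodable bytes instead of raising UnicodeDecodeError.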
with open(inputFileName1, encoding="utf8") as inputFile1, open(outputFileName, 'w', encoding="utf8") as outputFile:
csvReader = csv.reader(inputFile1)
# For each row in file1 we look for a match among the entries of file2
for row in csvReader:
if len(row) == 11:
if not timestampRegex.match(row[0]):
# We can't possibly find a match for this line
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
continue
time = timeToFloat(row[0])
# Find the first and last entries of file2 within a range of 1 second
a = bisect.bisect_right(file2Entries, (time - timeRange,))
b = bisect.bisect_left(file2Entries, (time + timeRange,))
# Loop on the possible matches (those within the given range)
for entry in file2Entries[a:b]:
if entry[1][2] == row[4]:
outputFile.write(','.join(row + [entry[1][3]]))
outputFile.write('\n')
break;
else:
# We haven't found a match in file2
outputFile.write(','.join(row + ['NOACT']))
outputFile.write('\n')
but it still gives the same error.
- I think python was already trying to use utf8. Your file, however, must be using another encoding. – lgpasquale, May 21, 2015 at 14:58
- Do you have any idea how I can solve this? I don't understand, because it processes 8668262 lines and then stops. Thank you very much. – TSTeles, May 21, 2015 at 15:07
- You could try python-chardet, which might already be available for your distribution. You can run it on your file as chardetect.py file1 and use the encoding it suggests in place of utf8. – lgpasquale, May 21, 2015 at 15:45
- The script is correct; the problem was a strange line that I deleted. SOLVED. PS: I also tried the sql script, but some lines have more "," so to use sql I would probably have to correct that first. Thanks a lot for the help! – TSTeles, May 22, 2015 at 19:20
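As an aside on that last comment: the lines with extra commas can be counted and filtered out with a quick awk pre-pass before feeding file1 to sqlite (a sketch only; the field count of 11 comes from the question, and file1.clean is just an arbitrary output name):
awk -F, 'NF != 11' file1 | wc -l        # how many malformed lines there are
awk -F, 'NF == 11' file1 > file1.clean  # keep only the well-formed lines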
awk would indeed be suited for the task (although it wouldn't be a one-liner), but to be honest it's hard to wrap one's head around your description of what you want to achieve. For example, is one supposed to match the first three fields of what seems like a time stamp, or are the records running in parallel, so that one could also go by the line number from the start?
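For what it's worth, a rough two-pass sketch of that awk approach could look like the following (untested against the real data; it assumes file2 fits comfortably in memory, that the timestamps never wrap past midnight, and that the fields are laid out exactly as in the question's samples):
awk -F, -v OFS=, '
# convert hh:mm:ss(.ffffff) to seconds since midnight
function tosec(t,  a) { split(t, a, ":"); return a[1] * 3600 + a[2] * 60 + a[3] }
NR == FNR {                        # first file on the command line: file2
    n[$3]++                        # group its rows by the port in field 3
    t2[$3, n[$3]]  = tosec($1)     # timestamp in seconds
    act[$3, n[$3]] = $4            # OK / NOT
    next
}
{                                  # second file: file1
    t = tosec($1); out = "NOACT"
    for (i = 1; i <= n[$5]; i++) { # only look at file2 rows with the same port
        d = t - t2[$5, i]
        if (d > -1 && d < 1) { out = act[$5, i]; break }
    }
    print $0, out
}' file2 file1 > output
Grouping file2 by its port field keeps the inner loop short, which matters when file1 has 55 million lines.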