I want to write a Python script to sort a huge file, say 2 GB in size, which contains logs in the following format:
Jan 1 02:32:40 other strings but may or may not unique in all those lines
Jan 1 02:32:40 other strings but may or may not unique in all those lines
Mar 31 23:31:55 other strings but may or may not unique in all those lines
Mar 31 23:31:55 other strings but may or may not unique in all those lines
Mar 31 23:31:55 other strings but may or may not unique in all those lines
Mar 31 23:31:56 other strings but may or may not unique in all those lines
Mar 31 23:31:56 other strings but may or may not unique in all those lines
Mar 31 23:31:56 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Feb 1 03:52:26 other strings but may or may not unique in all those lines
Feb 1 03:52:26 other strings but may or may not unique in all those lines
Jan 1 02:46:40 other strings but may or may not unique in all those lines
Jan 1 02:44:40 other strings but may or may not unique in all those lines
Jan 1 02:40:40 other strings but may or may not unique in all those lines
Feb 10 03:52:26 other strings but may or may not unique in all those lines
I want to sort them based on the timestamp.
I was able to get this working, but for my code to succeed I need to load the WHOLE file into a list, which is extremely inefficient from a memory utilization point of view.
Can you please suggest a more efficient way, where I can sort this by reading the file line by line, or perhaps some other approach that I am not aware of?
Here is my code:
# convert the log into a list of strings
with open("log.txt", 'r') as f:
    lines = f.read().splitlines()

# writing the method which will be fed as a key for sorting
def convert_time(logline):
    # extracting hour, minute and second from each log entry
    h, m, s = map(int, logline.split()[2].split(':'))
    time_in_seconds = h * 3600 + m * 60 + s
    return time_in_seconds

sorted_log_list = sorted(lines, key=convert_time)
''' sorted_log_list is a "list of lists". Each list within it is a representation of one log entry. We will use print and join to print it out as a readable log entry'''
for lines in sorted_log_list:
    print lines
1 Answer
You're ignoring the date part of the timestamps; it doesn't sound like that's on purpose. (Also, the year is missing altogether, which should make us quite nervous.) Also, let's use explicit datetime utilities and regexes.
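import datetime
import re

timestamp_regex = re.compile(r"[^:]+:\d\d:\d\d")

def convert_time(logline):
    stamp = timestamp_regex.match(logline).group() #this will error if there's no match.
    #the logs carry no year, so a placeholder leap year (2000) is assumed here;
    #%d is used because Python's strptime does not accept %e.
    d = datetime.datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")
    return int(d.timestamp())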
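To show the effect of keying on the full month/day/time stamp rather than the time of day alone, here is a minimal self-contained sketch for Python 3. The names `TIMESTAMP_RE`, `BASE` and `timestamp_key` are hypothetical, and the placeholder year 2000 is purely an assumption made because the log lines carry no year; entries that wrap around New Year would still sort incorrectly.

```python
import datetime
import re

# Hypothetical names; the pattern mirrors the regex used in the answer.
TIMESTAMP_RE = re.compile(r"[^:]+:\d\d:\d\d")
# Assumption: the logs have no year, so a placeholder leap year is used.
BASE = datetime.datetime(2000, 1, 1)


def timestamp_key(logline):
    """Return whole seconds since the start of the placeholder year."""
    match = TIMESTAMP_RE.match(logline)
    if match is None:
        raise ValueError("no timestamp at start of line: %r" % logline)
    parsed = datetime.datetime.strptime(
        "2000 " + match.group(), "%Y %b %d %H:%M:%S"
    )
    # Subtracting naive datetimes avoids any dependence on the local time zone.
    return int((parsed - BASE).total_seconds())


sample = [
    "Mar 31 23:31:55 third",
    "Feb 10 03:52:26 second",
    "Jan 1 02:32:40 first",
]
ordered = sorted(sample, key=timestamp_key)
print("\n".join(ordered))
```

With a time-of-day key the "Mar 31" line would not reliably land last; with this key the month and day take part in the ordering.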
As for the rest, the comments are right that we can't do much unless we know exactly what it would mean for the solution to be improved.
If the concern is just to handle the biggest file with the least RAM, something like this might work:
import os

def save_where_we_can_find_it(line, temp_file):
    #remember the offset at which this line starts in the scratch file
    retval = temp_file.tell()
    temp_file.write(line)
    return retval

def fetch_line(location, temp_file):
    temp_file.seek(location)
    return temp_file.readline()

items = []
with open("log.txt", 'r') as original, open(".temp.log.txt", 'w') as temp:
    for line in original:
        items.append((convert_time(line), save_where_we_can_find_it(line, temp)))

items.sort(key = lambda pair: pair[0]) #sort-in-place isn't necessarily a good idea; whatever.

with open(".temp.log.txt", 'r') as temp:
    for (stamp, location) in items:
        print(fetch_line(location, temp), end="") #the line already ends with a newline

os.remove(".temp.log.txt")
But this is just a really inefficient way of using a scratch file. It would be better to register scratch space with the OS, and then do your file manipulation "in memory".
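A different, commonly used technique for this kind of problem, not taken from the answer above, is an external merge sort: sort the file in bounded chunks, write each sorted chunk to an OS-managed temporary file, and then merge the chunks lazily. The sketch below assumes Python 3.5+ (for the `key` argument of `heapq.merge`); `external_sort`, `timestamp_key`, `chunk_lines` and the placeholder year 2000 are all hypothetical choices for illustration.

```python
import datetime
import heapq
import itertools
import re
import tempfile

TIMESTAMP_RE = re.compile(r"[^:]+:\d\d:\d\d")
# Assumption: the logs carry no year, so a placeholder leap year is used.
BASE = datetime.datetime(2000, 1, 1)


def timestamp_key(logline):
    """Seconds since the start of the placeholder year (hypothetical helper)."""
    stamp = TIMESTAMP_RE.match(logline).group()  # AttributeError if no match
    parsed = datetime.datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")
    return int((parsed - BASE).total_seconds())


def external_sort(in_path, out_path, chunk_lines=100000):
    """Sort in_path by timestamp into out_path, one chunk in RAM at a time."""
    runs = []
    try:
        with open(in_path, "r") as source:
            while True:
                # Read at most chunk_lines lines; the file keeps its position.
                chunk = list(itertools.islice(source, chunk_lines))
                if not chunk:
                    break
                # Make sure a final line without a newline stays separate.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort(key=timestamp_key)
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        with open(out_path, "w") as target:
            # heapq.merge pulls one line at a time from each sorted run.
            target.writelines(heapq.merge(*runs, key=timestamp_key))
    finally:
        for run in runs:
            run.close()  # closing a TemporaryFile also deletes it
```

A call such as `external_sort("log.txt", "log.sorted.txt")` keeps roughly one chunk of lines in memory during the first phase and one line per run during the merge. Every run stays open until the merge finishes, so `chunk_lines` should be large enough that the number of runs stays below the operating system's open-file limit.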
Is sort -M <logfile> not working in this case?