Python script for sorting a huge log file based on timestamps

Question 1

I want to write a Python script to sort a huge file, say 2 GB in size, which contains logs in the following format:

Jan 1 02:32:40 other strings but may or may not unique in all those lines
Jan 1 02:32:40 other strings but may or may not unique in all those lines
Mar 31 23:31:55 other strings but may or may not unique in all those lines
Mar 31 23:31:55 other strings but may or may not unique in all those lines
Mar 31 23:31:55 other strings but may or may not unique in all those lines
Mar 31 23:31:56 other strings but may or may not unique in all those lines
Mar 31 23:31:56 other strings but may or may not unique in all those lines
Mar 31 23:31:56 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Mar 31 23:31:57 other strings but may or may not unique in all those lines
Feb 1 03:52:26 other strings but may or may not unique in all those lines
Feb 1 03:52:26 other strings but may or may not unique in all those lines
Jan 1 02:46:40 other strings but may or may not unique in all those lines
Jan 1 02:44:40 other strings but may or may not unique in all those lines
Jan 1 02:40:40 other strings but may or may not unique in all those lines
Feb 10 03:52:26 other strings but may or may not unique in all those lines

I want to sort them based on the timestamp.

I was able to get this working, but for my code to succeed I need to load the WHOLE file into a list, which is extremely inefficient from a memory utilization point of view.

Can you please suggest a more efficient way to sort this, either by reading the file line by line or by some other approach that I am not aware of?

Here is my code:

# read the log into a list of strings
with open("log.txt", 'r') as f:
    lines = f.read().splitlines()

# the function that will be passed as the key for sorting
def convert_time(logline):
    # extract hour, minute and second from each log entry
    h, m, s = map(int, logline.split()[2].split(':'))
    time_in_seconds = h * 3600 + m * 60 + s
    return time_in_seconds

sorted_log_list = sorted(lines, key=convert_time)
# sorted_log_list is a list of strings, one per log entry, so each can be printed as-is
for line in sorted_log_list:
    print(line)

Question 2

Is sort -M <logfile> not working in this case?

Question 3

Hi @yuri! Sorry, I forgot to mention that I am trying to write a Python script for this. I have updated the title and description.

Question 4

@BadAtGeometry - Won't the list take the same amount of memory as the file data? My code at present is doing what you are suggesting already :/

Question 5

@OhMyGoodness - 2 GB is just an arbitrary size I picked. The problem statement is "how to sort a HUGE file in Python in the most efficient way". It could be something with millions of records or more. I found this article, which so far seems to be the closest match for what I am looking to do. It is written in Python 3 - neopythonic.blogspot.com/2008/10/…

Question 6

The most efficient way is to load the file into memory and sort it there. If that's infeasible, you'll get better answers if you provide concrete details, like how big the log files are, how long the records are, and how much memory is available.
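For the case where the file really does not fit in memory, one standard technique is an external merge sort: sort fixed-size chunks, spill each to a temporary file, then merge the sorted chunks lazily. Nobody in this thread proposes it, so treat the following as a minimal sketch under assumptions rather than a recommended answer. The names `sort_key`, `external_sort` and `chunk_lines` are hypothetical, and because the log has no year, the key assumes an arbitrary leap year (2000).

```python
import heapq
import os
import re
import tempfile
from datetime import datetime
from itertools import islice

# Matches everything up to and including the first HH:MM:SS.
TIMESTAMP_RE = re.compile(r"[^:]+:\d\d:\d\d")


def sort_key(line):
    """Hypothetical key: parse 'Mon D HH:MM:SS' with an assumed year of 2000."""
    stamp = TIMESTAMP_RE.match(line).group()
    return datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")


def external_sort(src_path, dst_path, chunk_lines=100_000):
    """Sort src_path into dst_path, holding at most chunk_lines lines at once."""
    chunk_paths = []
    try:
        # Pass 1: sort each chunk in memory and spill it to its own temp file.
        with open(src_path) as src:
            while True:
                chunk = list(islice(src, chunk_lines))
                if not chunk:
                    break
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"  # the last line of the file may lack a newline
                chunk.sort(key=sort_key)
                fd, path = tempfile.mkstemp(suffix=".chunk")
                with os.fdopen(fd, "w") as out:
                    out.writelines(chunk)
                chunk_paths.append(path)

        # Pass 2: k-way merge; heapq.merge reads the chunk files lazily.
        files = [open(p) for p in chunk_paths]
        try:
            with open(dst_path, "w") as dst:
                dst.writelines(heapq.merge(*files, key=sort_key))
        finally:
            for f in files:
                f.close()
    finally:
        for p in chunk_paths:
            os.remove(p)
```

It could be called as `external_sort("log.txt", "log.sorted.txt")`. Peak memory is governed by `chunk_lines`, not by the file size. Every chunk file stays open during the merge, so a very small chunk size on a very large file can run into the OS limit on open files.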

Question 7

You're ignoring the Date-part of the timestamps; it doesn't sound like that's on purpose. (Also, the year is missing altogether, which should make us quite nervous.) Also, let's use explicit datetime utilities and regexes.

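from datetime import datetime
import re

# Matches everything up to and including the first HH:MM:SS, e.g. "Mar 31 23:31:55".
timestamp_regex = re.compile(r"[^:]+:\d\d:\d\d")

def convert_time(logline):
    stamp = timestamp_regex.match(logline).group()  # this will error if there's no match.
    # Python's strptime does not support %e; %d also accepts single-digit days.
    # The log has no year, so an arbitrary leap year (2000) is assumed so that Feb 29 parses.
    d = datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")
    # Seconds since the start of the assumed year; avoids the local-timezone dependence of timestamp().
    return int((d - datetime(2000, 1, 1)).total_seconds())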

As for the rest, the comments are right that we can't do much unless we know exactly what it would mean for the solution to be improved.

If the concern is just to handle the biggest file with the least RAM, something like this might work:

import os

def save_where_we_can_find_it(line, temp_file):
    # Remember where this line starts in the scratch file, then write it.
    retval = temp_file.tell()
    temp_file.write(line)
    return retval

def fetch_line(location, temp_file):
    temp_file.seek(location)
    return temp_file.readline()

items = []  # one (sort key, scratch-file offset) pair per log line
with open("log.txt", 'r') as original, open(".temp.log.txt", 'w') as temp:
    for line in original:
        items.append((convert_time(line), save_where_we_can_find_it(line, temp)))
items.sort(key=lambda pair: pair[0])  # sort-in-place isn't necessarily a good idea; whatever.
with open(".temp.log.txt", 'r') as temp:
    for (stamp, location) in items:
        print(fetch_line(location, temp), end="")  # the line already ends with a newline
os.remove(".temp.log.txt")

But this is just a really inefficient way of using a scratch file. It would be better to register scratch space with the OS and then do your file manipulation "in memory".
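One possible reading of that suggestion is to memory-map the original file and let the OS page the line data in and out, keeping only a small index in Python. The following is a sketch of that interpretation, not necessarily what the answer had in mind. The names `sort_key` and `sort_with_mmap` are hypothetical, the year 2000 is an assumed placeholder because the log has no year, and the index still holds one small tuple per line.

```python
import mmap
import re
from datetime import datetime

# Bytes pattern: everything up to and including the first HH:MM:SS.
TIMESTAMP_RE = re.compile(rb"[^:]+:\d\d:\d\d")
ASSUMED_YEAR_START = datetime(2000, 1, 1)  # assumption: the log has no year


def sort_key(line):
    """Hypothetical key: whole seconds since the start of the assumed year."""
    stamp = TIMESTAMP_RE.match(line).group().decode("ascii")
    d = datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")
    return int((d - ASSUMED_YEAR_START).total_seconds())


def sort_with_mmap(src_path, dst_path):
    """Write the lines of src_path to dst_path in timestamp order."""
    index = []  # (key, start, end) per line; ties keep file order via start
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = len(mm)
            start = 0
            while start < size:
                newline = mm.find(b"\n", start)
                end = size if newline == -1 else newline + 1
                index.append((sort_key(mm[start:end]), start, end))
                start = end

            index.sort()

            # Copy each line out of the mapping in sorted order.
            for _, start, end in index:
                line = mm[start:end]
                dst.write(line if line.endswith(b"\n") else line + b"\n")
```

Compared with the scratch-file version, this skips the second copy of the log and the per-line `seek`/`readline` calls. The trade-off is that the output pass reads the mapping in random order, so it tends to be slow when the file is much larger than the page cache.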

ShapeOfMatter (4,437 reputation, 7 silver badges, 25 bronze badges) · Answer 1 · 2019-05-13 17:21:18Z

