This is a follow-up to https://stackoverflow.com/questions/71194832/converting-json-based-log-into-column-format-i-e-one-file-per-column .
My task is to optimize this code, which converts the content of a JSON-formatted log file into one column-formatted text file per column. An example log file:
{"timestamp": "2022年01月14日T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022年01月18日T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}
It will generate 5 files:
timestamp.column
Field1.column
Field_Doc.f1.column
Field_Doc.f2.column
Field_Doc.f3.column
The column file format is as follows:
- string fields are separated by a newline character '\n'; assume that no string value contains newline characters, so there is no need to worry about escaping them
- double, integer & boolean fields are represented as a single value per line
- null, undefined & empty strings are represented as an empty line
Example content of timestamp.column:
2022-01-14T00:12:21.000
2022-01-18T00:15:51.000
Note: The fields in the log will be dynamic, do not assume that these are the expected properties.
My current code for this is:
import json
import os


def flatten_dict(data, prefix=""):
    result = {}
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            result.update( flatten_dict(value, key) )
        else:
            if value is None:
                result[key] = "\n"
            elif value is "":
                result[key] = "\n"
            else:
                result[key] = value
    return result


path = input("Enter the path for the log file: ")

# Checking if path exists to the selected log
assert os.path.exists(path), "I did not find the file at, " + str(path)

file_obj = open(path)  # emulate file in memory
for line in file_obj:
    data = json.loads(line)
    data = flatten_dict(data)
    for key, value in data.items():
        with open(key + '.column', "a") as f:
            f.write(str(value) + "\n")
This code is currently very slow: it takes about 45 minutes just to parse 256 MB of data, which I would expect to take roughly 30 seconds to a minute. Can anyone guide me on how to optimize it? Also, how can I print the CPU and RAM usage of the process? The file size to parse may be up to 2 GB.
Any help would be appreciated.
- Why are the columns separated into different files? This is a strange and inconvenient format. Do you control this? – Reinderien, Mar 12, 2022 at 17:31
- I don't control this, I'm just asked to do this. – Sparsh Saxena, Mar 12, 2022 at 17:37
- Is it work, homework, an interview problem, or a programming challenge? – Reinderien, Mar 12, 2022 at 17:38
- Programming challenge. – Sparsh Saxena, Mar 12, 2022 at 18:25
- No, it's not: it's an interview question. – Reinderien, Mar 12, 2022 at 21:11
1 Answer
You're re-opening the column files every line. Open them once and then look the file handles up by key.
file_obj = open(path)  # emulate file in memory
for line in file_obj:
    data = json.loads(line)
    data = flatten_dict(data)
    for key, value in data.items():
        with open(key + '.column', "a") as f:
            f.write(str(value) + "\n")
becomes (not tested):
column_files = {}
with open(path) as jsonl_file:
    for line in jsonl_file:
        data = json.loads(line)
        data = flatten_dict(data)
        for key, value in data.items():
            if key not in column_files:
                column_files[key] = open(key + '.column', 'w')
            column_files[key].write(str(value) + "\n")

for f in column_files: f.close()
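Note, not part of the original answer: the question also asks how to print CPU and RAM usage, which the code above does not cover. A minimal sketch, assuming the third-party psutil package is installed (the 1-second sampling interval is only illustrative):

import psutil  # third-party: pip install psutil

proc = psutil.Process()                   # the running converter process
cpu = proc.cpu_percent(interval=1.0)      # CPU % sampled over 1 second
rss_mib = proc.memory_info().rss / 2**20  # resident memory in MiB
print(f"CPU: {cpu:.1f}%  RAM: {rss_mib:.1f} MiB")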
- Instead of using a dict of file handles, you can use a contextlib.ExitStack. Also, the last line should be for f in column_files.values(): f.close(). – Richard Neumann, Mar 13, 2022 at 10:47
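For illustration, an untested sketch of that ExitStack suggestion, reusing path and flatten_dict from the question; every handle registered with the stack is closed automatically when the with block exits:

import contextlib
import json

column_files = {}
with contextlib.ExitStack() as stack, open(path) as jsonl_file:
    for line in jsonl_file:
        data = flatten_dict(json.loads(line))
        for key, value in data.items():
            if key not in column_files:
                # register the handle so ExitStack closes it on exit
                column_files[key] = stack.enter_context(open(key + '.column', 'w'))
            column_files[key].write(str(value) + "\n")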