This is a follow-up to https://stackoverflow.com/questions/71194832/converting-json-based-log-into-column-format-i-e-one-file-per-column .
My task is to optimize this code, which converts the content of a JSON-formatted log file into one column-formatted text file per column. An example log file:
{"timestamp": "2022年01月14日T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022年01月18日T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}
It will generate 5 files:
timestamp.column
Field1.column
Field_Doc.f1.column
Field_Doc.f2.column
Field_Doc.f3.column
The column file format is as follows:
- string fields are separated by a newline character '\n'; assume that no string value contains newline characters, so there is no need to worry about escaping them
- double, integer & boolean fields are represented as a single value per line
- null, undefined & empty strings are represented as an empty line
Example content of timestamp.column:
2022-01-14T00:12:21.000
2022-01-18T00:15:51.000
Note: The fields in the log will be dynamic, do not assume that these are the expected properties.
My current code for this is:
import json
import os


def flatten_dict(data, prefix=""):
    result = {}
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            result.update( flatten_dict(value, key) )
        else:
            if value is None:
                result[key] = "\n"
            elif value is "":
                result[key] = "\n"
            else:
                result[key] = value
    return result


path = input("Enter the path for the log file: ")

# Checking if path exists to the selected log
assert os.path.exists(path), "I did not find the file at, " + str(path)

file_obj = open(path)  # emulate file in memory
for line in file_obj:
    data = json.loads(line)
    data = flatten_dict(data)
    for key, value in data.items():
        with open(key + '.column', "a") as f:
            f.write(str(value) + "\n")
This code is currently very slow: it takes about 45 minutes just to parse 256 MB of data, which I would expect to take roughly 30 seconds to a minute. Can anyone guide me on how to optimize it? Also, how can I print the CPU and RAM usage of the process? The file size to parse may be up to 2 GB.
Any help would be appreciated.
- Why are the columns separated into different files? This is a strange and inconvenient format. Do you control this? – Reinderien, Mar 12, 2022 at 17:31
- I don't control this, I'm just asked to do this. – Sparsh Saxena, Mar 12, 2022 at 17:37
- Is it work, homework, an interview problem, or a programming challenge? – Reinderien, Mar 12, 2022 at 17:38
- Programming challenge. – Sparsh Saxena, Mar 12, 2022 at 18:25
- No, it's not: it's an interview question. – Reinderien, Mar 12, 2022 at 21:11
1 Answer
You're re-opening the column files every line. Open them once and then look the file handles up by key.
file_obj = open(path)  # emulate file in memory
for line in file_obj:
    data = json.loads(line)
    data = flatten_dict(data)
    for key, value in data.items():
        with open(key + '.column', "a") as f:
            f.write(str(value) + "\n")
becomes (not tested):
column_files = {}
with open(path) as jsonl_file:
    for line in jsonl_file:
        data = json.loads(line)
        data = flatten_dict(data)
        for key, value in data.items():
            if key not in column_files:
                column_files[key] = open(key + '.column', 'w')
            column_files[key].write(str(value) + "\n")

for f in column_files: f.close()
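Note, not part of the original answer: the question also asks how to print CPU and RAM usage, which the code above does not cover. A minimal sketch, assuming the third-party psutil package is installed (the 1-second sampling interval is only illustrative):

import psutil  # third-party: pip install psutil

proc = psutil.Process()                   # the running converter process
cpu = proc.cpu_percent(interval=1.0)      # CPU % sampled over 1 second
rss_mib = proc.memory_info().rss / 2**20  # resident memory in MiB
print(f"CPU: {cpu:.1f}%  RAM: {rss_mib:.1f} MiB")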
- Instead of using a dict of file handles, you can use a contextlib.ExitStack. Also, the last line should be for f in column_files.values(): f.close(). – Richard Neumann, Mar 13, 2022 at 10:47
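For illustration, an untested sketch of that ExitStack suggestion, reusing path and flatten_dict from the question; every handle registered with the stack is closed automatically when the with block exits:

import contextlib
import json

column_files = {}
with contextlib.ExitStack() as stack, open(path) as jsonl_file:
    for line in jsonl_file:
        data = flatten_dict(json.loads(line))
        for key, value in data.items():
            if key not in column_files:
                # register the handle so ExitStack closes it on exit
                column_files[key] = stack.enter_context(open(key + '.column', 'w'))
            column_files[key].write(str(value) + "\n")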