This is part of my code.
- List all files and read them into the list files.
- Now I have a dictionary my_dict. The values are the parquet files. All files must have the same schema.
- I have more than 2000 files in my folder, so files is large.
- For each file, I first gunzip it.
- Next I find all the JSON files.
- The problem is here: my file tick_calculated_2_2020-05-27T01-02-58.json will be read, converted to a dataframe, and appended to 'tick-2.parquet'.
My code works, but the execution time is very slow. How do I get rid of one or more loops?
import glob
import gzip
import os
import shutil

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def gunzip(file_path, output_path):
    with gzip.open(file_path, "rb") as f_in:
        with open(output_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)


if __name__ == '__main__':
    files = os.listdir('/Users/milenko/mario/Json_gzips')
    files = [fi for fi in files if fi.endswith(".gz")]
    my_dict = {'ticr_calculated_2': 'ticr-2.parquet', 'ticr_calculated_3': 'ticr-3.parquet',
               'ticr_calculated_4': 'ticr-4.parquet', 'tick_calculated_2': 'tick-2.parquet',
               'tick_calculated_3': 'tick-3.parquet', 'tick_calculated_4': 'tick-4.parquet'}
    basic = '/Users/milenko/mario/Json_gzips/'
    for file in files:
        gunzip(file, file.replace(".gz", ""))
        json_fi = glob.glob("*.json")
        for key, value in my_dict.items():
            filepath = basic + value
            for f in json_fi:
                if key in f:
                    result_df = pd.DataFrame()
                    with open(f, encoding='utf-8', mode='r') as i:
                        data = pd.read_json(i, lines=True)
                        result_df = result_df.append(data)
                    table_from_pandas = pa.Table.from_pandas(result_df)
                    pq.write_table(table_from_pandas, filepath)
2 Answers
There's an obvious problem: this code has the side effect of filling the filesystem with uncompressed versions of the input files. Why do we write new files with shutil.copyfileobj() rather than simply reading the compressed files directly? And why don't we remove the uncompressed files after processing them?
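For instance, pandas can decompress gzipped JSON on the fly, so a temporary uncompressed copy may not be needed at all (a sketch; the filename here is only illustrative):

import pandas as pd

# compression="gzip" (or the default "infer" with a .gz suffix) lets read_json
# decompress in memory, without writing an uncompressed file to disk
df = pd.read_json("tick_calculated_2_sample.json.gz", lines=True,
                  compression="gzip")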
I'm confused by this:
for file in files:
    gunzip(file, file.replace(".gz", ""))
    json_fi = glob.glob("*.json")
It seems that we decompress each file but ignore the result (unless the file happens to be in the process's working directory). And then we repeat the processing of all *.json files, most of which we've already seen.
That certainly deserves an explanatory comment, as it's not clear to the casual reader why json_fi is within the for file in files loop.
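If the intent is simply to decompress everything once and then process the JSON files, hoisting the glob out of the loop would make that explicit (a sketch of one possible reading of the intent; it also prefixes the directory so it does not depend on the working directory):

for file in files:
    gunzip(basic + file, basic + file.replace(".gz", ""))

# collect the decompressed JSON files once, after all .gz files are unpacked
json_fi = glob.glob(basic + "*.json")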
This inner loop seems suspect:
for key, value in my_dict.items():
    for f in json_fi:
        if key in f:
I'd expect to just look up the relevant part of f in the dict:
for f in json_fi:
    prefix = '_'.join(f.split('_', 3)[:3])
    if prefix not in my_dict:
        continue
    # else use my_dict[prefix]
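Putting that lookup to work, one way to collapse the nested loops might look like this (a rough sketch, assuming all frames fit in memory and that pandas/pyarrow are imported as in the question; frames is a name introduced here):

from collections import defaultdict

frames = defaultdict(list)          # parquet filename -> list of DataFrames
for f in json_fi:
    prefix = '_'.join(f.split('_', 3)[:3])
    if prefix not in my_dict:
        continue
    frames[my_dict[prefix]].append(pd.read_json(f, lines=True))

# one concat and one write per output file, instead of one per matching input
for parquet_name, dfs in frames.items():
    table = pa.Table.from_pandas(pd.concat(dfs, ignore_index=True))
    pq.write_table(table, basic + parquet_name)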
Efficiency
You use two filters for the input files:

- The first to find all .gz files in the directory
- The other to find all .json files after the gzipped files are unzipped

It looks like you don't use the unzipped files unless they are also JSON files. If you have gzipped files which are not JSON files, then you unnecessarily spend time unzipping them.
If all JSON files in the directory are already gzipped, you should instead use a single filter to find only .json.gz files and only unzip those.
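A sketch of that single filter, reusing the question's gunzip() helper and basic path (this assumes the gzipped JSON files all end in .json.gz):

# one pass: only gzipped JSON files are matched, everything else is skipped
for gz_name in glob.glob(basic + "*.json.gz"):
    gunzip(gz_name, gz_name[:-len(".gz")])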
DRY
This directory path appears twice:
"/Users/milenko/mario/Json_gzips/"
You introduced a variable only after its first use, but you should set the variable before using it. The name basic also does not convey much meaning.
I recommend something like:
dir_path = '/Users/milenko/mario/Json_gzips/'
files = os.listdir(dir_path)
You should also consider passing the path into the code as a command-line input to make the code more reusable.
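For example, a minimal sketch using argparse (the argument name dir_path is just a suggestion):

import argparse
import os

parser = argparse.ArgumentParser(description="Build parquet files from gzipped JSON files")
parser.add_argument("dir_path", help="directory containing the .json.gz input files")
args = parser.parse_args()

files = os.listdir(args.dir_path)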
Documentation
Add a docstring at the top of the code to summarize its purpose:
"""
Building DataFrames from gzipped and JSON files
"""
Also add a docstring for the function:
def gunzip(file_path, output_path):
    """ Unzip a file """
Naming
Some of the variables have names that do not convey much meaning, such as my_dict. Its name could have parquet in it somewhere.
It would be better to give more specific names to all your file variables and file list variables. For example, files would be better as json_files.
This:
for file in files:
would be better as:
for json_file in json_files:
These could also use more meaningful names:
- key
- value
- f
- i
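A possible renaming of the inner loop, keeping the question's logic (illustrative only; file_prefix, parquet_name, json_name and json_file are suggested names, not part of the original code):

for file_prefix, parquet_name in my_dict.items():
    parquet_path = basic + parquet_name
    for json_name in json_files:
        if file_prefix in json_name:
            with open(json_name, encoding='utf-8', mode='r') as json_file:
                data = pd.read_json(json_file, lines=True)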