
This is part of my code.

  1. List all the files in the folder and read their names into the list files.
  2. I have a dictionary my_dict whose values are the parquet file names. All files must have the same schema.
  3. There are more than 2000 files in my folder, so files is large.
  4. I first gunzip each file.
  5. Next I find all the JSON files.
  6. The problem is here. A file such as tick_calculated_2_2020-05-27T01-02-58.json is read, converted to a dataframe, and appended to 'tick-2.parquet'.

My code works, but it is very slow. How can I get rid of one or more loops?

import glob
import gzip
import os
import shutil

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def gunzip(file_path, output_path):
    with gzip.open(file_path, "rb") as f_in:
        with open(output_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)


if __name__ == '__main__':
    files = os.listdir('/Users/milenko/mario/Json_gzips')
    files = [fi for fi in files if fi.endswith(".gz")]
    my_dict = {'ticr_calculated_2': 'ticr-2.parquet',
               'ticr_calculated_3': 'ticr-3.parquet',
               'ticr_calculated_4': 'ticr-4.parquet',
               'tick_calculated_2': 'tick-2.parquet',
               'tick_calculated_3': 'tick-3.parquet',
               'tick_calculated_4': 'tick-4.parquet'}
    basic = '/Users/milenko/mario/Json_gzips/'
    for file in files:
        gunzip(file, file.replace(".gz", ""))
        json_fi = glob.glob("*.json")
        for key, value in my_dict.items():
            filepath = basic + value
            for f in json_fi:
                if key in f:
                    result_df = pd.DataFrame()
                    with open(f, encoding='utf-8', mode='r') as i:
                        data = pd.read_json(i, lines=True)
                        result_df = result_df.append(data)
                    table_from_pandas = pa.Table.from_pandas(result_df)
                    pq.write_table(table_from_pandas, filepath)

2 Answers


There's an obvious problem: this code has the side effect of filling the filesystem with uncompressed versions of the input files. Why write new files with shutil.copyfileobj() rather than simply reading the compressed files directly? And why not remove the uncompressed files after processing them?
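
For instance, pandas can read gzip-compressed, line-delimited JSON directly, so no uncompressed copy ever hits the disk (a minimal sketch; the filename is illustrative and assumes the archives are named *.json.gz):

import pandas as pd

# pandas infers gzip compression from the .gz suffix, so the file is
# decompressed in memory rather than written out as a temporary copy.
df = pd.read_json("tick_calculated_2_2020-05-27T01-02-58.json.gz", lines=True)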


I'm confused by this:

for file in files:
    gunzip(file, file.replace(".gz", ""))
    json_fi = glob.glob("*.json")

It seems that we decompress each file, but ignore the results (unless the file happens to be in the process's working directory). And then we repeat the processing of all *.json files, most of which we've already seen.

That certainly deserves an explanatory comment, as it's not clear to the casual reader why json_fi is within the for file in files loop.
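
If the glob really only needs to run once, moving it out of the loop makes the intent explicit (a sketch, keeping the question's working-directory assumption):

# Decompress every archive first...
for file in files:
    gunzip(file, file.replace(".gz", ""))

# ...then glob once, after all the JSON files exist.
json_fi = glob.glob("*.json")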


This inner loop seems suspect:

for key, value in my_dict.items():
    for f in json_fi:
        if key in f:

I'd expect to just look up the relevant part of f in the dict:

for f in json_fi:
    prefix = '_'.join(f.split('_', 3)[:3])
    if prefix not in my_dict:
        continue
    # else use my_dict[prefix]
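
Putting it together, a single pass over the files might look like this (a sketch; it groups the frames by prefix so that each parquet file is written once, instead of being rewritten for every matching input):

import collections

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# json_fi, my_dict and basic are as defined in the question.
frames = collections.defaultdict(list)
for f in json_fi:
    prefix = '_'.join(f.split('_', 3)[:3])
    if prefix not in my_dict:
        continue
    frames[prefix].append(pd.read_json(f, lines=True))

# Write each parquet file exactly once, from all of its input frames.
for prefix, parts in frames.items():
    table = pa.Table.from_pandas(pd.concat(parts, ignore_index=True))
    pq.write_table(table, basic + my_dict[prefix])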

Efficiency

You use two filters for the input files:

  • The first to find all .gz files in the directory
  • The other to find all .json files after the gzipped files are unzipped

It looks like you don't use the unzipped files unless they are also JSON files. If you have gzipped files which are not JSON, then you unnecessarily spend time unzipping them.

If all JSON files in the directory are already gzipped, you should instead use a single filter to find only .json.gz files and only unzip those.
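
For example (a sketch; it assumes the inputs are named *.json.gz, which the question does not show explicitly):

import glob
import os

# One pass: pick up only the gzipped JSON files.
dir_path = '/Users/milenko/mario/Json_gzips/'
gzipped_json_files = glob.glob(os.path.join(dir_path, '*.json.gz'))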

DRY

This directory path appears twice:

"/Users/milenko/mario/Json_gzips/"

You assign the path to a variable only after it has already been used once; set the variable first and use it in both places. Also, the name basic does not convey much meaning. I recommend something like:

dir_path = '/Users/milenko/mario/Json_gzips/'
files = os.listdir(dir_path)

You should also consider passing the path into the code as a command-line input to make the code more reusable.
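
For example, with argparse (a sketch; the argument name is illustrative):

import argparse
import os

parser = argparse.ArgumentParser(
    description='Build parquet files from gzipped JSON inputs.')
parser.add_argument('dir_path',
                    help='directory containing the .json.gz input files')
args = parser.parse_args()

files = os.listdir(args.dir_path)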

Documentation

Add a docstring at the top of the code to summarize its purpose:

"""
Building DataFrames from gzipped and JSON files
"""

Also add a docstring for the function:

def gunzip(file_path, output_path):
    """ Unzip a file """

Naming

Some of the variables have names that do not convey much meaning, such as my_dict. The name could have parquet in it somewhere.

It would be better to give more specific names to all your file variables and file list variables. For example:

files

would be better as:

json_files

This:

for file in files:

would be better as:

for json_file in json_files:

These could also use more meaningful names:

key
value
f
i