This is part of my code.
- List all files and read them into the list files.
- Now I have a dictionary my_dict. The values are the parquet files. All files must have the same schema.
- I have more than 2000 files in my folder, so files is large.
- For each file, I first gunzip it.
- Next I find all the JSON files.
- The problem is here: my file tick_calculated_2_2020-05-27T01-02-58.json will be read, converted to a dataframe, and appended to 'tick-2.parquet'.
My code works, but the execution time is very slow. How do I get rid of one or more loops?
import glob
import gzip
import os
import shutil

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def gunzip(file_path, output_path):
    with gzip.open(file_path, "rb") as f_in:
        with open(output_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)


if __name__ == '__main__':
    files = os.listdir('/Users/milenko/mario/Json_gzips')
    files = [fi for fi in files if fi.endswith(".gz")]
    my_dict = {'ticr_calculated_2': 'ticr-2.parquet', 'ticr_calculated_3': 'ticr-3.parquet',
               'ticr_calculated_4': 'ticr-4.parquet', 'tick_calculated_2': 'tick-2.parquet',
               'tick_calculated_3': 'tick-3.parquet', 'tick_calculated_4': 'tick-4.parquet'}
    basic = '/Users/milenko/mario/Json_gzips/'
    for file in files:
        gunzip(file, file.replace(".gz", ""))
        json_fi = glob.glob("*.json")
        for key, value in my_dict.items():
            filepath = basic + value
            for f in json_fi:
                if key in f:
                    result_df = pd.DataFrame()
                    with open(f, encoding='utf-8', mode='r') as i:
                        data = pd.read_json(i, lines=True)
                        result_df = result_df.append(data)
                    table_from_pandas = pa.Table.from_pandas(result_df)
                    pq.write_table(table_from_pandas, filepath)
2 Answers
There's an obvious problem: this code has the side effect of filling the filesystem with uncompressed versions of the input files. Why do we write new files with shutil.copyfileobj() rather than simply reading the compressed files directly? And why don't we remove the uncompressed files after processing them?
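For instance, pandas can decompress gzipped JSON on the fly, so a temporary uncompressed copy may not be needed at all (a sketch; the filename here is only illustrative):

import pandas as pd

# compression="gzip" (or the default "infer" with a .gz suffix) lets read_json
# decompress in memory, without writing an uncompressed file to disk
df = pd.read_json("tick_calculated_2_sample.json.gz", lines=True,
                  compression="gzip")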
I'm confused by this:
for file in files:
    gunzip(file, file.replace(".gz", ""))
    json_fi = glob.glob("*.json")
It seems that we decompress each file but ignore the result (unless the file happens to be in the process's working directory). And then we repeat the processing of all *.json files, most of which we've already seen.
That certainly deserves an explanatory comment, as it's not clear to the casual reader why json_fi is within the for file in files loop.
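If the intent is simply to decompress everything once and then process the JSON files, hoisting the glob out of the loop would make that explicit (a sketch of one possible reading of the intent; it also prefixes the directory so it does not depend on the working directory):

for file in files:
    gunzip(basic + file, basic + file.replace(".gz", ""))

# collect the decompressed JSON files once, after all .gz files are unpacked
json_fi = glob.glob(basic + "*.json")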
This inner loop seems suspect:
for key, value in my_dict.items():
    for f in json_fi:
        if key in f:
I'd expect to just look up the relevant part of f in the dict:
for f in json_fi:
    prefix = '_'.join(f.split('_', 3)[:3])
    if prefix not in my_dict:
        continue
    # else use my_dict[prefix]
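Putting that lookup to work, one way to collapse the nested loops might look like this (a rough sketch, assuming all frames fit in memory and that pandas/pyarrow are imported as in the question; frames is a name introduced here):

from collections import defaultdict

frames = defaultdict(list)          # parquet filename -> list of DataFrames
for f in json_fi:
    prefix = '_'.join(f.split('_', 3)[:3])
    if prefix not in my_dict:
        continue
    frames[my_dict[prefix]].append(pd.read_json(f, lines=True))

# one concat and one write per output file, instead of one per matching input
for parquet_name, dfs in frames.items():
    table = pa.Table.from_pandas(pd.concat(dfs, ignore_index=True))
    pq.write_table(table, basic + parquet_name)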
Efficiency
You use two filters for the input files:

- The first to find all .gz files in the directory
- The other to find all .json files after the gzipped files are unzipped

It looks like you don't use the unzipped files unless they are also JSON files. If you have gzipped files which are not JSON files, then you unnecessarily spend time unzipping them.
If all JSON files in the directory are already gzipped, you should instead use a single filter to find only .json.gz files and only unzip those.
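A sketch of that single filter, reusing the question's gunzip() helper and basic path (this assumes the gzipped JSON files all end in .json.gz):

# one pass: only gzipped JSON files are matched, everything else is skipped
for gz_name in glob.glob(basic + "*.json.gz"):
    gunzip(gz_name, gz_name[:-len(".gz")])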
DRY
This directory path appears twice:
"/Users/milenko/mario/Json_gzips/"
You introduced a variable only after its first use, but you should set the variable before using it. The name basic also does not convey much meaning.
I recommend something like:
dir_path = '/Users/milenko/mario/Json_gzips/'
files = os.listdir(dir_path)
You should also consider passing the path into the code as a command-line input to make the code more reusable.
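For example, a minimal sketch using argparse (the argument name dir_path is just a suggestion):

import argparse
import os

parser = argparse.ArgumentParser(description="Build parquet files from gzipped JSON files")
parser.add_argument("dir_path", help="directory containing the .json.gz input files")
args = parser.parse_args()

files = os.listdir(args.dir_path)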
Documentation
Add a docstring at the top of the code to summarize its purpose:
"""
Building DataFrames from gzipped and JSON files
"""
Also add a docstring for the function:
def gunzip(file_path, output_path):
    """ Unzip a file """
Naming
Some of the variables have names that do not convey much meaning, such as my_dict. Its name could have parquet in it somewhere.
It would be better to give more specific names to all your file variables and file list variables. For example, files would be better as json_files.
This:
for file in files:
would be better as:
for json_file in json_files:
These could also use more meaningful names:
- key
- value
- f
- i
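A possible renaming of the inner loop, keeping the question's logic (illustrative only; file_prefix, parquet_name, json_name and json_file are suggested names, not part of the original code):

for file_prefix, parquet_name in my_dict.items():
    parquet_path = basic + parquet_name
    for json_name in json_files:
        if file_prefix in json_name:
            with open(json_name, encoding='utf-8', mode='r') as json_file:
                data = pd.read_json(json_file, lines=True)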