How to better process a CSV file with pandas and then work with sets

Question 1

I have the working code below, written with pandas and Python. I'm looking for any improvement or simplification that can be made.

Can we just wrap this up into a function definition?

$ cat getcbk_srvlist_1.py
#!/python/v3.6.1/bin/python3
from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import pandas as pd
import os
##### Python pandas, widen output display to see more columns. ####
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)
##################### END OF THE Display Settings ###################
################# PANDAS Extraction ###########
df_csv = pd.read_csv(input("Please input the CSV File Name: "), usecols=['Platform ID', 'Target system address']).dropna()
hostData = df_csv[df_csv['Platform ID'].str.startswith("CDS-Unix")]['Target system address']
hostData.to_csv('host_file1', header=None, index=None, sep=' ', mode='a')
with open('host_file1') as f1, open('host_file2') as f2:
 dataset1 = set(f1)
 dataset2 = set(f2)
for i, item in enumerate(sorted(dataset2 - dataset1)):
 print(str(item).strip())
os.unlink("host_file1")

The above code just compares two files: host_file1, which is produced through pandas, and host_file2, which already exists.

Answer

main guard

It is common to put the code you want to run behind an if __name__ == "__main__": guard, so that the functions can later be imported and reused from a different module.
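As a minimal sketch of that layout (the function name `missing_hosts` and the sample host names are invented for illustration, they are not from your script):

```python
def missing_hosts(expected, found):
    """Return the entries of `expected` that are absent from `found`, sorted."""
    return sorted(set(expected) - set(found))


def main():
    # Hypothetical sample data, only for illustration.
    expected = ["host-a", "host-b", "host-c"]
    found = ["host-b"]
    for host in missing_hosts(expected, found):
        print(host)


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    main()
```

Another module can then import `missing_hosts` without triggering the printing in `main()`.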

naming

You use both snake_case and CamelCase. Try to stick to one naming convention. PEP 8 advises snake_case for variables and functions, and CamelCase for classes.

functions

Split the code into logical parts.

pandas settings

import pandas as pd

def settings_pandas():
    # "display.height" is left out: later pandas releases no longer have it,
    # and "display.max_rows" covers the same need
    pd.set_option("display.max_rows", None)
    pd.set_option("display.max_columns", None)
    pd.set_option("display.width", None)
    pd.set_option("expand_frame_repr", True)

filename input

The way you ask for the filename is very fragile. A more robust way would be to ask for the filename in a separate function, and then validate it.

from pathlib import Path

def ask_filename(validate=True):
    """
    Asks the user for a filename.
    If `validate` is True, it checks whether the file exists and is a file.
    """
    while True:
        file = Path(input("Please input the CSV File Name (CTRL+C to abort): "))
        # reject anything that does not exist or is not a regular file
        if validate and not (file.exists() and file.is_file()):
            print("Filename is invalid")
            continue
        return file
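If you want to exercise this kind of validation without typing at a prompt, one option is to pass the prompting function in as a parameter. This is only a sketch: the name `ask_existing_file` and the `prompt` parameter are my own additions, and the file names are made up.

```python
from pathlib import Path
import tempfile


def ask_existing_file(prompt=input, message="Please input the CSV File Name: "):
    """Keep asking until the answer names an existing regular file."""
    while True:
        candidate = Path(prompt(message))
        if candidate.is_file():  # False for missing paths and for directories
            return candidate
        print("Filename is invalid:", candidate)


# Usage with canned answers instead of keyboard input.
with tempfile.TemporaryDirectory() as tmp:
    real_file = Path(tmp) / "hosts.csv"
    real_file.write_text("Platform ID,Target system address\n")
    # first a missing file, then a directory, then a valid file
    answers = iter([str(Path(tmp) / "missing.csv"), tmp, str(real_file)])
    chosen = ask_existing_file(prompt=lambda _msg: next(answers))
    chosen_name = chosen.name
```

In normal use you would call `ask_existing_file()` with no arguments, and it falls back to the built-in `input`.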

IO

def read_host_data(filename):
    """reads `filename`, filters the unix platforms, and returns the `Target system address`"""
    df = pd.read_csv(filename, usecols=["Platform ID", "Target system address"]).dropna()
    unix_platforms = df["Platform ID"].str.startswith("CDS-Unix")
    return df.loc[unix_platforms, "Target system address"]

There is no need to save the intermediary data to a file. You could use an io.StringIO instead. If you do need a temporary file, tempfile is an alternative.
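A rough sketch of both options, using only the standard library. The host names are invented sample values, and "host_file1" is reused here purely as a file name inside a throwaway directory.

```python
import io
import tempfile
from pathlib import Path

hosts = ["alpha.example.com", "beta.example.com"]  # made-up sample values

# Option 1: an in-memory text buffer behaves like an open file.
buffer = io.StringIO()
for host in hosts:
    buffer.write(host + "\n")
buffer.seek(0)  # rewind before reading back
from_buffer = {line.strip() for line in buffer}

# Option 2: a real temporary file that is cleaned up automatically.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir) / "host_file1"
    tmp_path.write_text("\n".join(hosts) + "\n")
    with tmp_path.open() as handle:
        from_tempfile = {line.strip() for line in handle}
```

Either way there is nothing left to unlink by hand afterwards.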

But in this case, where you just need the set of the values of a pd.Series, you can simply do set(host_data), without the intermediary file.
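To illustrate going straight from the CSV to a set, here is a small sketch that assumes pandas is installed. The CSV content is invented and fed in through io.StringIO so the example needs no file on disk; only the two column names and the "CDS-Unix" prefix come from your code.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the real export.
sample_csv = io.StringIO(
    "Platform ID,Target system address,Other\n"
    "CDS-Unix-SSH,host1.example.com,x\n"
    "CDS-Windows,host2.example.com,y\n"
    "CDS-Unix-Root,host3.example.com,z\n"
    "CDS-Unix-SSH,,w\n"
)

df = pd.read_csv(sample_csv, usecols=["Platform ID", "Target system address"]).dropna()
is_unix = df["Platform ID"].str.startswith("CDS-Unix")  # boolean mask per row
unix_hosts = set(df.loc[is_unix, "Target system address"])
```

The last data row has no address, so `dropna()` removes it before the filter is applied.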

putting it together:

if __name__ == "__main__":
    settings_pandas()  # needed?
    filename = ask_filename()
    host_data = set(read_host_data(filename))
    with open("host_file2") as hostfile2:
        # strip the newline so the lines compare equal to the values from pandas
        host_data2 = {line.strip() for line in hostfile2}
    for item in sorted(host_data2 - host_data):
        print(item)

Since the i is not used, I dropped the enumerate. Since host_data2 is read directly from a file, every item is already a str, so the conversion to str is dropped too.

Since I don't see any pandas data being printed, the settings_pandas() call can apparently be dropped.
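One thing to watch in the set comparison: lines read from a file keep their trailing newline, while values coming out of pandas do not, so the two only match after stripping. A small sketch with invented host names, using io.StringIO as a stand-in for the existing host file:

```python
import io

# Stand-in for an existing host file; the contents are invented.
host_file2 = io.StringIO("host1\nhost2\nhost3\n")
known_hosts = {"host1", "host3"}  # e.g. values taken from a pandas Series

raw_lines = set(host_file2)  # every item still ends in "\n"
host_file2.seek(0)           # rewind to read the same data again
clean_lines = {line.strip() for line in host_file2}

naive_diff = sorted(raw_lines - known_hosts)    # nothing matches
clean_diff = sorted(clean_lines - known_hosts)  # only the missing host
```

With the raw lines, all three hosts are reported as missing. After stripping, only host2 is.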

Comment

@.Maarten, thanks for the detailed and explicit explanation.

score 2 · Accepted Answer · 2018-12-12 08:52:46Z
