I have the working code below, which uses pandas and Python. I'm looking for any improvement or simplification that can be made.
Can we just wrap this up into a function?
$ cat getcbk_srvlist_1.py
#!/python/v3.6.1/bin/python3
from __future__ import print_function
from signal import signal, SIGPIPE, SIG_DFL
signal(SIGPIPE,SIG_DFL)
import pandas as pd
import os
##### Python pandas, widen output display to see more columns. ####
pd.set_option('display.height', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('expand_frame_repr', True)
##################### END OF THE Display Settings ###################
################# PANDAS Extraction ###########
df_csv = pd.read_csv(input("Please input the CSV File Name: "), usecols=['Platform ID', 'Target system address']).dropna()
hostData = df_csv[df_csv['Platform ID'].str.startswith("CDS-Unix")]['Target system address']
hostData.to_csv('host_file1', header=None, index=None, sep=' ', mode='a')
with open('host_file1') as f1, open('host_file2') as f2:
    dataset1 = set(f1)
    dataset2 = set(f2)
for i, item in enumerate(sorted(dataset2 - dataset1)):
    print(str(item).strip())
os.unlink("host_file1")
The above code just compares two files: host_file1, which is produced through pandas, and host_file2, which already exists.
1 Answer
main guard
It is common to put the code you want to run behind an if __name__ == "__main__": guard, so you can later import the functions that might be reused in a different module.
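A minimal sketch of the pattern follows. The function names compare_hosts and main, and the sample host names, are hypothetical and not taken from the original script:

```python
def compare_hosts(known_hosts, candidate_hosts):
    """Return the sorted hosts in candidate_hosts that are missing from known_hosts."""
    return sorted(set(candidate_hosts) - set(known_hosts))


def main():
    # Hypothetical sample data, standing in for the two host lists.
    missing = compare_hosts(["host-a", "host-b"], ["host-b", "host-c"])
    for host in missing:
        print(host)


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    main()
```

Another module can then import compare_hosts without triggering the prompt or the printing.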
naming
You use both snake_case and CamelCase. Try to stick to one naming convention. PEP 8 advises snake_case for variables and functions, and CamelCase for classes.
functions
Split the code into logical parts.
pandas settings
def settings_pandas():
    pd.set_option("display.height", None)
    pd.set_option("display.max_rows", None)
    pd.set_option("display.max_columns", None)
    pd.set_option("display.width", None)
    pd.set_option("expand_frame_repr", True)
filename input
The way you ask for the filename is very fragile. A more robust way would be to ask for the filename in a separate function, and then validate it.
from pathlib import Path
def ask_filename(validate=True):
    """
    Asks the user for a filename.
    If `validate` is True, it checks whether the file exists and it is a file
    """
    while True:
        file = Path(input("Please input the CSV File Name: (CTRL+C to abort) "))
        if validate:
            # reject the path unless it exists and is a regular file
            if not (file.exists() and file.is_file()):
                print("Filename is invalid")
                continue
        return file
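One thing to watch in the validation condition: not binds more tightly than and, so the existence check and the is-a-file check need to be grouped before negating. Since Path.is_file() already returns False for a path that does not exist, a single call is enough. A small sketch, assuming a hypothetical helper is_valid_file and hypothetical file names in a temporary directory:

```python
import tempfile
from pathlib import Path


def is_valid_file(path):
    """Return True only if `path` exists and is a regular file."""
    # Path.is_file() is already False for paths that do not exist.
    return Path(path).is_file()


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "hosts.csv"  # hypothetical file name
    existing.write_text("Platform ID,Target system address\n")
    missing = Path(tmp) / "missing.csv"  # never created

    # An existing file, a missing path, and a directory.
    results = (is_valid_file(existing), is_valid_file(missing), is_valid_file(tmp))
```

Keeping the check in its own function also makes it easy to test without going through input().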
IO
def read_host_data(filename):
    """reads `filename`, filters the unix platforms, and returns the `Target system address`"""
    df = pd.read_csv(filename, usecols=["Platform ID", 'Target system address']).dropna()
    unix_platforms = df['Platform ID'].str.startswith("CDS-Unix")
    return df.loc[unix_platforms, "Target system address"]
There is no need to save the intermediary data to a file. You could use an io.StringIO. An alternative, if you need a temporary file, is tempfile.
But in this case, where you just need the set of the values of a pd.Series, you can simply do set(host_data), without the intermediary file.
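As an illustration of the io.StringIO option, here is a minimal sketch using only the standard library. The helper lines_as_set and the host names are hypothetical; the buffer stands in for the intermediate file on disk:

```python
import io


def lines_as_set(text_stream):
    """Collect the stripped, non-empty lines of a text stream into a set."""
    return {line.strip() for line in text_stream if line.strip()}


# Hypothetical host names written to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
for host in ["host-a", "host-b"]:
    buffer.write(host + "\n")
buffer.seek(0)  # rewind before reading the buffer back

hosts = lines_as_set(buffer)
```

Nothing is left behind on disk, so no cleanup step such as os.unlink is needed.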
putting it together:
if __name__ == "__main__":
    settings_pandas() # needed?
    filename = ask_filename()
    host_data = set(read_host_data(filename))
    with open("host_file2") as hostfile2:
        # strip the trailing newlines so the lines compare equal to the Series values
        host_data2 = {line.strip() for line in hostfile2}
    for item in sorted(host_data2 - host_data):
        print(item)
Since the i is not used, I dropped the enumerate. Since host_data2 is read directly from a file, there are no conversions and the items are all strs, so the conversion to str is dropped too.
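One detail to watch when comparing these two sets: lines read from a file keep their trailing newline, while the values taken from a pd.Series do not, so the lines need to be stripped before taking the set difference. A small sketch with hypothetical host names, using io.StringIO in place of the real host_file2:

```python
import io

# Hypothetical data: values as they would come out of a Series (no newline) ...
host_data = {"host-a", "host-b"}
# ... and lines as they come out of a file (trailing newline kept).
host_file = io.StringIO("host-b\nhost-c\n")

raw_lines = set(host_file)
host_file.seek(0)  # rewind to read the same lines a second time
clean_lines = {line.strip() for line in host_file}

raw_diff = sorted(raw_lines - host_data)      # newline mismatch: nothing is removed
clean_diff = sorted(clean_lines - host_data)  # only the genuinely missing host remains
```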
Since I don't see any pandas data being printed, the pandas display settings can apparently be dropped.
@Maarten, thanks for the elaborate and explicit explanation. (Karn Kumar, Dec 12, 2018 at 15:15)