Loading .txt file content and filename into pandas dataframe

Question 1

Is there a better way to read both the name and the content of each text file into a DataFrame?

(Or is my implementation even okay?) Can I avoid storing the data in lists?

import glob
import os

import pandas as pd

path = r'.../test_age7'
allFiles = glob.glob(path + "/*.txt")
df_7 = pd.DataFrame()  # create empty DF
stories = []
filenames = []
for file_ in allFiles:
    with open(file_) as f:
        textf = " ".join(line.strip() for line in f)
    stories.append(textf)
    filenames.append(os.path.basename(file_[0:-4]))  # extract filename without .txt
df_7["filename"] = filenames
df_7["stories"] = stories
df_7["age"] = path[-1]

Comment

Another approach is to convert the .txt files to CSV, because pandas has read_csv. Here is an example that converts Excel to CSV.
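
The CSV route suggested here could look roughly like the sketch below. It is only an illustration under some assumptions: the helper names texts_to_csv and load_stories, the column names, and the UTF-8 encoding are made up for this example and do not come from the question. The csv module takes care of quoting, so commas and quotes inside a story survive the round trip.

```python
import csv
import glob
import os

import pandas as pd


def texts_to_csv(folder, csv_path):
    """Write one CSV row (filename, stories) per .txt file in folder.

    Hypothetical helper: the name and column layout are assumptions.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "stories"])
        for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
            with open(path, encoding="utf-8") as f:
                # Same flattening as in the question: one line per story.
                text = " ".join(line.strip() for line in f)
            name = os.path.splitext(os.path.basename(path))[0]
            writer.writerow([name, text])


def load_stories(csv_path):
    """Read the CSV back; keep everything as strings, empty stories included."""
    return pd.read_csv(csv_path, dtype=str, keep_default_na=False)
```

The conversion only has to run once per folder; after that, load_stories(csv_path) is a single read_csv call.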

Answer

As mentioned in the comments, pandas works really well with CSV, so if you are generating the data yourself, consider saving it in CSV format.
allFiles is only used once, so don't define it; call glob directly in the loop instead.
Replace stories and filenames with a single DataFrame, and use pandas.concat().
If you just update the script every time you run it, you can keep the age in a variable.
Never use file_[0:-4] to remove file extensions; use os.path.splitext.
I guess you will run this code for a lot of different ages, so make a function out of it.
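
To illustrate the point about file extensions, here is a small sketch comparing the two approaches. The helper name stem and the example file names are hypothetical and only serve the demonstration.

```python
import os.path


def stem(path):
    """Return the file name without its directory or final extension."""
    return os.path.splitext(os.path.basename(path))[0]


print(stem("test_age7/story.txt"))   # story
print(stem("test_age7/story.text"))  # story
# Slicing assumes the extension is exactly three letters long:
print("story.text"[0:-4])            # story.
```

Slicing works only as long as every file ends in a dot plus three characters, whereas splitext splits at the last dot regardless of the extension's length.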

import glob
from os.path import basename, splitext

import pandas as pd


def getDataByAge(age):
    res = pd.DataFrame()
    for file_ in glob.glob(".../test_age%d/*.txt" % (age)):
        with open(file_) as f:
            textf = " ".join(line.strip() for line in f)
        # append one row per file: name without extension, text, age
        res = pd.concat([res, pd.DataFrame(data={"filename": [splitext(basename(file_))[0]], "stories": [textf], "age": [age]})])
    return res
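
One thing to keep in mind with the function above: pd.concat returns a new DataFrame on every call, so the rows collected so far are copied again for each file. For many files, a possible variant (a sketch, not part of the answer as posted) gathers plain dicts in a list and builds the DataFrame once at the end. The function name read_story_folder, the folder parameter, and the UTF-8 encoding are assumptions for this example; the folder is passed in explicitly because the real path is elided in the question.

```python
import glob
import os

import pandas as pd


def read_story_folder(folder, age):
    """Return a DataFrame with one row per .txt file in folder.

    Hypothetical variant: rows are collected as dicts and the
    DataFrame is built in a single step after the loop.
    """
    records = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, encoding="utf-8") as f:
            text = " ".join(line.strip() for line in f)
        records.append({
            "filename": os.path.splitext(os.path.basename(path))[0],
            "stories": text,
            "age": age,
        })
    # Passing columns keeps the layout stable even when no files are found.
    return pd.DataFrame(records, columns=["filename", "stories", "age"])
```

This does bring a list back, but it holds only one small dict per file, and the resulting frame gets a clean 0..n-1 index without an extra reset_index call.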

baot · Accepted Answer · 2018-06-19 14:17:51Z

