Is there a better way to read both of the following into a DataFrame?
- the name of a text file
- the content of that text file
(Or is my implementation even okay?) Can I avoid storing the data in lists?
import glob
import os

import pandas as pd

path = r'.../test_age7'
allFiles = glob.glob(path + "/*.txt")
df_7 = pd.DataFrame()  # create empty DF
stories = []
filenames = []
for file_ in allFiles:
    with open(file_) as f:
        textf = " ".join(line.strip() for line in f)
    stories.append(textf)
    filenames.append(os.path.basename(file_[0:-4]))  # extract filename without .txt
df_7["filename"] = filenames
df_7["stories"] = stories
df_7["age"] = path[-1]
chicks
Another approach is to convert the .txt files to CSV, because pandas has read_csv. Here is an example of converting from Excel to CSV. (WebOrCode, Jun 19, 2018 at 13:28)
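To make the CSV suggestion concrete, here is a minimal sketch of a round trip through `to_csv` and `read_csv`. The sample rows and the file name `stories_age7.csv` are made up for illustration; they are not part of the original question.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows standing in for the real story files.
df = pd.DataFrame(
    {
        "filename": ["story_a", "story_b"],
        "stories": ["Once upon a time, there was a cat.", "The end."],
        "age": [7, 7],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "stories_age7.csv")  # hypothetical file name
    df.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file
    restored = pd.read_csv(csv_path)  # fields containing commas are quoted on write
```

Whether this helps depends on where the text files come from: if they are produced by someone else, the conversion step still has to read each file once.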
1 Answer
- As mentioned in the comments, pandas works really well with `csv`, so if you are generating the data yourself you might consider saving it in `csv` format.
- `allFiles` is used only once, so don't define it; use `glob` in the loop instead.
- Replace `stories` and `filenames` with just one `DataFrame`, and use `pandas.concat()`.
- If you are just updating the script every time you run it, you can simply have an age variable.
- Never use `file_[0:-4]` to remove file extensions; use `os.path.splitext`.
- I guess you will run this code for a lot of different ages, so make a function out of it.
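As a small illustration of the point about file extensions, the sketch below compares slicing with `os.path.splitext`. The file names are hypothetical and chosen so that only the first one ends in a three-letter extension.

```python
from os.path import basename, splitext

# Hypothetical file names; only the first has a three-letter extension.
paths = ["stories/cat.txt", "stories/dog.text", "stories/notes"]

# Slicing blindly drops the last four characters, whatever they are.
sliced = [basename(p[0:-4]) for p in paths]

# splitext removes only a real extension and leaves other names alone.
stems = [splitext(basename(p))[0] for p in paths]

print(sliced)
print(stems)
```

Slicing only works while every file ends in exactly `.txt`; `splitext` makes no such assumption.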
import glob
from os.path import basename, splitext

import pandas as pd


def getDataByAge(age):
    # Collect every story for one age group into a single DataFrame.
    res = pd.DataFrame()
    for file_ in glob.glob(".../test_age%d/*.txt" % (age)):
        with open(file_) as f:
            textf = " ".join(line.strip() for line in f)
        # Append a one-row frame holding this file's name, text, and age.
        res = pd.concat([res, pd.DataFrame(data={"filename": [splitext(basename(file_))[0]], "stories": [textf], "age": [age]})])
    return res
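One possible variation, offered as a sketch rather than as the answer's method: calling `pd.concat` inside the loop copies the accumulated frame on every iteration, so for many files it may be cheaper to collect one dict per file and build the `DataFrame` once. This does bring back a single list, which the question hoped to avoid. The helper name `load_stories`, its `directory` parameter, and the UTF-8 encoding are assumptions made for this example.

```python
import glob
import os
import tempfile
from os.path import basename, splitext

import pandas as pd


def load_stories(directory, age):
    """Return one row per .txt file in `directory` (hypothetical helper)."""
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        with open(path, encoding="utf-8") as f:  # encoding is an assumption
            text = " ".join(line.strip() for line in f)
        rows.append({"filename": splitext(basename(path))[0], "stories": text, "age": age})
    # Build the frame once; naming the columns keeps them even when no files match.
    return pd.DataFrame(rows, columns=["filename", "stories", "age"])


# Demo on a throwaway directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, body in [("a", "first line\nsecond line\n"), ("b", "only line\n")]:
        with open(os.path.join(tmp_dir, name + ".txt"), "w", encoding="utf-8") as f:
            f.write(body)
    demo = load_stories(tmp_dir, age=7)

print(demo)
```

Passing the directory as an argument also avoids deriving the age from the last character of the path.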