This is like the reverse functionality of Text file to spreadsheet.
One or multiple .xlsx
files in the path of the script get opened and the content is split into multiple .txt
files.
For example, say you have two excel files in the folder:
file1.xlsx
file2.xlsx
The output created:
enter image description here enter image description here enter image description here
enter image description here enter image description here enter image description here
spreadsheet_to_text.py
"""
Reads in .xlsx files from path were the script is located.
Then the data of each column is split into a .txt file
"""
import glob
import openpyxl
from openpyxl.utils import get_column_letter
def get_text_filename(filename: str, column: int)->str:
"""
Creates a text filename based on .xlsx file filename and column
"""
return (filename.rstrip(".xlsx")
+ "_" + get_column_letter(column) + '.txt')
def xlsx_to_txt(filename: str):
"""
Extract data from a .xlsx file in the script folder into
multiple .txt files
"""
workbook = openpyxl.load_workbook(filename)
sheet_names = workbook.sheetnames
sheet = workbook[sheet_names[0]]
for column in range(1, sheet.max_column + 1):
if sheet.cell(row=1, column=column).value:
text_filename = get_text_filename(filename, column)
with open(text_filename, mode='w') as textfile:
for row in range(1, sheet.max_row + 1):
if sheet.cell(column=column, row=row).value:
textfile.writelines(
sheet.cell(column=column, row=row).value + '\n')
def spreadsheet_into_text():
"""main logic for split spreadsheet data into multiple text files"""
for filename in glob.iglob("*.xlsx"):
xlsx_to_txt(filename)
if __name__ == "__main__":
spreadsheet_into_text()
I already incorporated some improvements from Text file to spreadsheet. I wonder how the code can get further improved.
1 Answer 1
This looks quite clean in general. May I suggest a few minor improvements:
- I think you could just use
workbook.active
to get the sheet instead of doing the
rstrip(".xlsx")
which would also right-strip out.sslsx
orsl.xs.ss
and even grab a part of the actual filename:In [1]: "christmas.xlsx".rstrip(".xlsx") Out[1]: 'christma'
use
os
module or the beautifulpathlib
to properly extract a filename without an extension:In [1]: from pathlib import Path In [2]: Path("christmas.xlsx").resolve().stem Out[2]: 'christmas'
calculate what you can before the loop instead of inside it. For instance,
sheet.max_row
is something you could just remember in a variable at the top of your function and re-use inside. It's not a lot of savings, but attribute access still has its cost in Python :max_row = sheet.max_row
something similar is happening when you get the value of a cell twice, instead:
cell_value = sheet.cell(column=column, row=row).value if cell_value: textfile.writelines(cell_value + '\n')
it may be a good idea to keep the nestedness at a minimum ("Flat is better than nested.") and would rather check for a reverse condition and use
continue
to move to the next iteration:for column in range(1, sheet.max_column + 1): if not sheet.cell(row=1, column=column).value: continue text_filename = get_text_filename(filename, column)
Some out-of-the-box ideas:
- feels like if you do it with
pandas.read_excel()
it may just become way more easier and beautiful
-
\$\begingroup\$ thanks for the high quality answer. I really liked how it simplyfies the code. One little thing. I think in the question i already use
writelines
instead ofwrite
? \$\endgroup\$Sandro4912– Sandro49122019年01月07日 19:34:53 +00:00Commented Jan 7, 2019 at 19:34