Migrating data from multiple spreadsheets to multiple text files

Question 1

This is like the reverse functionality of Text file to spreadsheet.

One or multiple .xlsx files in the path of the script get opened and the content is split into multiple .txt files.

For example, say you have two excel files in the folder:

file1.xlsx

enter image description here

file2.xlsx

enter image description here

The output created:

enter image description here enter image description here enter image description here

spreadsheet_to_text.py

"""
Reads in .xlsx files from path were the script is located.
Then the data of each column is split into a .txt file
"""
import glob
import openpyxl
from openpyxl.utils import get_column_letter
def get_text_filename(filename: str, column: int)->str:
 """
 Creates a text filename based on .xlsx file filename and column
 """
 return (filename.rstrip(".xlsx")
 + "_" + get_column_letter(column) + '.txt')
def xlsx_to_txt(filename: str):
 """
 Extract data from a .xlsx file in the script folder into
 multiple .txt files
 """
 workbook = openpyxl.load_workbook(filename)
 sheet_names = workbook.sheetnames
 sheet = workbook[sheet_names[0]]
 for column in range(1, sheet.max_column + 1):
 if sheet.cell(row=1, column=column).value:
 text_filename = get_text_filename(filename, column)
 with open(text_filename, mode='w') as textfile:
 for row in range(1, sheet.max_row + 1):
 if sheet.cell(column=column, row=row).value:
 textfile.writelines(
 sheet.cell(column=column, row=row).value + '\n')
def spreadsheet_into_text():
 """main logic for split spreadsheet data into multiple text files"""
 for filename in glob.iglob("*.xlsx"):
 xlsx_to_txt(filename)
if __name__ == "__main__":
 spreadsheet_into_text()

I already incorporated some improvements from Text file to spreadsheet. I wonder how the code can get further improved.

Question 2

This looks quite clean in general. May I suggest a few minor improvements:

I think you could just use workbook.active to get the sheet
instead of doing the rstrip(".xlsx") which would also right-strip out .sslsx or sl.xs.ss and even grab a part of the actual filename:
```
In [1]: "christmas.xlsx".rstrip(".xlsx")
Out[1]: 'christma'
```
use os module or the beautiful pathlib to properly extract a filename without an extension:
```
In [1]: from pathlib import Path
In [2]: Path("christmas.xlsx").resolve().stem
Out[2]: 'christmas'
```
calculate what you can before the loop instead of inside it. For instance, sheet.max_row is something you could just remember in a variable at the top of your function and re-use inside. It's not a lot of savings, but attribute access still has its cost in Python :
```
max_row = sheet.max_row
```

something similar is happening when you get the value of a cell twice, instead:

cell_value = sheet.cell(column=column, row=row).value
if cell_value:
 textfile.writelines(cell_value + '\n')

it may be a good idea to keep the nestedness at a minimum ("Flat is better than nested.") and would rather check for a reverse condition and use continue to move to the next iteration:
```
for column in range(1, sheet.max_column + 1):
 if not sheet.cell(row=1, column=column).value:
 continue
 text_filename = get_text_filename(filename, column)
```

Some out-of-the-box ideas:

feels like if you do it with pandas.read_excel() it may just become way more easier and beautiful

Question 3

thanks for the high quality answer. I really liked how it simplyfies the code. One little thing. I think in the question i already use writelines instead of write ?

alecxe alecxealecxe 17.5k8 gold badges52 silver badges93 bronze badges · Accepted Answer · 2018-12-20 00:57:23Z

This looks quite clean in general. May I suggest a few minor improvements:

I think you could just use workbook.active to get the sheet
instead of doing the rstrip(".xlsx") which would also right-strip out .sslsx or sl.xs.ss and even grab a part of the actual filename:
```
In [1]: "christmas.xlsx".rstrip(".xlsx")
Out[1]: 'christma'
```
use os module or the beautiful pathlib to properly extract a filename without an extension:
```
In [1]: from pathlib import Path
In [2]: Path("christmas.xlsx").resolve().stem
Out[2]: 'christmas'
```
calculate what you can before the loop instead of inside it. For instance, sheet.max_row is something you could just remember in a variable at the top of your function and re-use inside. It's not a lot of savings, but attribute access still has its cost in Python :
```
max_row = sheet.max_row
```

something similar is happening when you get the value of a cell twice, instead:

cell_value = sheet.cell(column=column, row=row).value
if cell_value:
 textfile.writelines(cell_value + '\n')

it may be a good idea to keep the nestedness at a minimum ("Flat is better than nested.") and would rather check for a reverse condition and use continue to move to the next iteration:
```
for column in range(1, sheet.max_column + 1):
 if not sheet.cell(row=1, column=column).value:
 continue
 text_filename = get_text_filename(filename, column)
```

Some out-of-the-box ideas:

feels like if you do it with pandas.read_excel() it may just become way more easier and beautiful

thanks for the high quality answer. I really liked how it simplyfies the code. One little thing. I think in the question i already use writelines instead of write ?

Stack Exchange Network

Migrating data from multiple spreadsheets to multiple text files

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Migrating data from multiple spreadsheets to multiple text files

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions