I've got a simple Excel Data Manipulation Script to match one of my daily tasks, written in python 3.
Intro
Let's assume that I have 3 excel files: main.xlsx
, 1.xlsx
and 2.xlsx
. In all of them I have a column named serial numbers
. I have to:
- lookup for all serial numbers in
1.xlsx
and2.xlsx
and verify if they are inmain.xlsx
.
If a serial number is find:
- on the last column of
main.xlsx
, on the same row with the serial number that was find, writeOK
+name_of_the_file_in which_it_was_found
. Else, writeNOK
. At the same time, write in1.xlsx
and2.xlsx
ok
ornok
on the last column if the serial number was found or not.
Now, my script is simply creating new files with the last column appended (instead of appending directly to the same file), which is ok (but if you guys have a better method, shout it). What I'm looking for, is a way of making this as optimized as possible. I'm not looking for PEP8
comments as I'm aware of them, but I'll handle this part when I'll have this as optimized / improved as possible.
Code:
import petl
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
non_serial_rows = petl.select(main, lambda rec: rec['serial number'] is None)
serial_rows = petl.select(main, lambda rec: rec['serial number'] is not None)
main_join_one = petl.join(serial_rows, petl.cut(one, ['serial number']), key='serial number')
main_join_one_file = petl.addfield(main_join_one, 'file', 'ok, 1.xlsx')
main_join_two = petl.join(serial_rows, petl.cut(two, ['serial number']), key='serial number')
main_join_two_file = petl.addfield(main_join_two, 'file', 'ok, 2.xlsx')
stacked_joins = petl.stack(main_join_two_file, main_join_one_file)
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
output_main = petl.stack(stacked_joins, non_serial_rows, nok_rows)
main_final = output_main
def main_compare(table):
non_serial_rows = petl.select(table, lambda rec: rec['serial number'] is None)
serial_rows = petl.select(table, lambda rec: rec['serial number'] is not None)
ok_rows = petl.join(serial_rows, petl.cut(main, ['serial number']), key='serial number')
ok_rows = petl.addfield(ok_rows, 'file', 'OK')
nok_rows = petl.antijoin(serial_rows, petl.cut(main, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
return petl.stack(ok_rows, nok_rows, non_serial_rows)
one_final = main_compare(one)
two_final = main_compare(two)
petl.toxlsx(main_final, 'mainNew.xlsx')
petl.toxlsx(one_final, '1New.xlsx')
petl.toxlsx(two_final, '2New.xlsx')
Sample files, can be downloaded from here. (for those who have time to play a lil' bit with the code).
2 Answers 2
From a practical perspective, you should use object-oriented style for calling your transformations. It will reduce the amount of intermediate variables as you will be able to chain calls.
From a conceptual perspective you want to apply the same "transformations" on all 3 files, there is not much differences whether you do it from n.xlsx
to main.xlsx
or the other way around. You need to:
- Extract serial numbers from a file;
- Associate them to a message (filename when used to fill
main.xlsx
or OK when used to filln.xlsx
); - Add a column on an other file whose matching rows contains the messages and others contains NOK.
Instead of that you:
- separate the row into categories (matching, not matching, no serial numbers);
- add a column to these categories depending on their kind;
- concatenate back those categories to get the resulting file out of them.
The problems you run into doing that are that, for one you change the order of the rows by filtering and stacking back (but that doesn't seem to be an issue), and for two your filtering rules are both clumsy and iterating over the input more than once.
The following approach iterates over each file exactly twice (one to extract the serial number/message pairs and one to add the required column) and uses functions to provide a generic approach that can easily be used to handle more files:
import petl
SERIAL_COLUMN = 'serial_number'
def map_serial_to_message(table, message='OK'):
return (table
.selectisnot(SERIAL_COLUMN, None)
.cut(SERIAL_COLUMN)
.addfield('file', message))
def create_new_column(table, allowed_serial, default_message='NOK'):
return table.leftjoin(
allowed_serial,
key=SERIAL_COLUMN,
missing=default_message)
if __name__ == '__main__':
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
files_serial = petl.stack(
map_serial_to_message(one, 'OK, 1.xlsx'),
map_serial_to_message(two, 'OK, 2.xlsx'),
# other files if need be
)
main_serial = map_serial_to_message(main)
petl.toxlsx(create_new_column(main, files_serial), 'mainNew.xlsx')
petl.toxlsx(create_new_column(one, main_serial), '1New.xlsx')
petl.toxlsx(create_new_column(two, main_serial), '2New.xlsx')
As regard to writing to the original file, you could look into creating the files in memory first and then, once they all are iterated over enough time, write them back. But if files are big, you might be limited by the amount of available memory.
This is untested but it could look like:
import petl
SERIAL_COLUMN = 'serial_number'
def map_serial_to_message(table, message='OK'):
return (table
.selectisnot(SERIAL_COLUMN, None)
.cut(SERIAL_COLUMN)
.addfield('file', message))
def create_new_column(table, *args, default_message='NOK'):
try:
serials, = args
except ValueError:
serials = petl.stack(*args)
sink = petl.MemorySource()
(table
.leftjoin(serials, key=SERIAL_COLUMN, missing=default_message)
.toxlsx(sink))
return sink.getvalue()
def write_to_file(data, filename):
with open(filename, 'wb') as f:
f.write(data)
if __name__ == '__main__':
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
main_serial = map_serial_to_message(main)
# Warning, these may require large amount of memory
new_main = create_new_column(main,
map_serial_to_message(one, 'OK, 1.xlsx'),
map_serial_to_message(two, 'OK, 2.xlsx'),
# other files if need be
)
new_one = create_new_column(one, main_serial)
new_two = create_new_column(two, main_serial)
# This is important to write after all transformations
write_to_file(new_main, 'main.xlsx')
write_to_file(new_one, '1.xlsx')
write_to_file(new_two, '2.xlsx')
I also changed the way to handle stacking serial-messages pairs from n.xlsx
to provide a more automatic alternative.
-
\$\begingroup\$ @JoeWallis Yeah, it's just that I had a few extra time to dig into
petl
documentation to use native constructs instead of doing the merge in Python. The library seems interesting to analyze data, a bit less to modify it. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年05月13日 18:09:06 +00:00Commented May 13, 2016 at 18:09 -
\$\begingroup\$ @MathiasEttinger thanks, these are some awesome changes. However, any ideas why do I get:
AttributeError: 'ValuesView' object has no attribute 'addfield'
? \$\endgroup\$Grajdeanu Alex– Grajdeanu Alex2016年05月15日 14:53:58 +00:00Commented May 15, 2016 at 14:53 -
\$\begingroup\$ @MathiasEttinger if I do that modification I get a big traceback \$\endgroup\$Grajdeanu Alex– Grajdeanu Alex2016年05月15日 15:07:18 +00:00Commented May 15, 2016 at 15:07
-
\$\begingroup\$ @Dex'ter The most relevant part is
AttributeError: 'MemorySource' object has no attribute 'write'
. As said, I don't really know how source objects do work. I'll try to dig into it a bit latter, for now, you should stick with the first version, then. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2016年05月15日 15:11:04 +00:00Commented May 15, 2016 at 15:11
I've not seen or used petl
before, but there are some ways to improve your code.
- Move things out of the global scope.
- Make a function for your most common functions.
- Change
petl.select
topetl.selectis
orpetl.selectisnot
.
Since you do:
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number') nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
So many times, you should change it to a function.
This function could take a fn
of petl.antijoin
, a left
of serial_rows
, a right
of stacked_joins
and a field
of 'NOK'
.
And so you can use:
def change_row_file(fn, left, right, field):
rows = fn(left, petl.cut(right, 'serial number'), key='serial number')
return petl.addfield(rows, 'file', field)
change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK')
Adding this, with the changes to petl.select
, to main_compare
can result in something like:
def main_compare(table):
serial_rows = petl.selectisnot(table, 'serial number', None)
return petl.stack(
change_row_file(petl.join, serial_rows, main, 'OK'),
change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
petl.selectis(table, 'serial number', None))
There's not much more that I think could do with improving apart from having nothing in your global scope.
And then adding a global constant for the serial column, say SERIAL_COLUMN = 'serial number'
.
This significantly reduces the amount of magic strings, to ones that aren't repeated (except 'NOK').
This can result in (Not pep8 compliant to remove a scroll bar):
import petl
SERIAL_COLUMN = 'serial number'
def change_row_file(fn, left, right, field):
rows = fn(left, petl.cut(right, SERIAL_COLUMN), key=SERIAL_COLUMN)
return petl.addfield(rows, 'file', field)
def update_main(main, one, two):
serial_rows = petl.selectisnot(main, SERIAL_COLUMN, None)
stacked_joins = petl.stack(
change_row_file(petl.join, serial_rows, one, 'ok, 1.xlsx'),
change_row_file(petl.join, serial_rows, two, 'ok, 2.xlsx'))
return petl.stack(
stacked_joins,
petl.selectis(main, SERIAL_COLUMN, None),
change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK'))
def main_compare(main, table):
serial_rows = petl.selectisnot(table, SERIAL_COLUMN, None)
return petl.stack(
change_row_file(petl.join, serial_rows, main, 'OK'),
change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
petl.selectis(table, SERIAL_COLUMN, None))
def main():
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
petl.toxlsx(update_main(main, one, two), 'mainNew.xlsx')
petl.toxlsx(main_compare(main, one), '1New.xlsx')
petl.toxlsx(main_compare(main, two), '2New.xlsx')
if __name__ == '__main__':
main()
Explore related questions
See similar questions with these tags.