Excel Data Manipulation - parse, match and create

Question 1

I've got a simple Excel Data Manipulation Script to match one of my daily tasks, written in python 3.

Intro

Let's assume that I have 3 excel files: main.xlsx, 1.xlsx and 2.xlsx. In all of them I have a column named serial numbers. I have to:

lookup for all serial numbers in 1.xlsx and 2.xlsx and verify if they are in main.xlsx.

If a serial number is find:

on the last column of main.xlsx, on the same row with the serial number that was find, write OK + name_of_the_file_in which_it_was_found. Else, write NOK. At the same time, write in 1.xlsx and 2.xlsx ok or nok on the last column if the serial number was found or not.

Now, my script is simply creating new files with the last column appended (instead of appending directly to the same file), which is ok (but if you guys have a better method, shout it). What I'm looking for, is a way of making this as optimized as possible. I'm not looking for PEP8 comments as I'm aware of them, but I'll handle this part when I'll have this as optimized / improved as possible.

Code:

import petl
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
non_serial_rows = petl.select(main, lambda rec: rec['serial number'] is None)
serial_rows = petl.select(main, lambda rec: rec['serial number'] is not None)
main_join_one = petl.join(serial_rows, petl.cut(one, ['serial number']), key='serial number')
main_join_one_file = petl.addfield(main_join_one, 'file', 'ok, 1.xlsx')
main_join_two = petl.join(serial_rows, petl.cut(two, ['serial number']), key='serial number')
main_join_two_file = petl.addfield(main_join_two, 'file', 'ok, 2.xlsx')
stacked_joins = petl.stack(main_join_two_file, main_join_one_file)
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
output_main = petl.stack(stacked_joins, non_serial_rows, nok_rows)
main_final = output_main
def main_compare(table):
 non_serial_rows = petl.select(table, lambda rec: rec['serial number'] is None)
 serial_rows = petl.select(table, lambda rec: rec['serial number'] is not None)
 ok_rows = petl.join(serial_rows, petl.cut(main, ['serial number']), key='serial number')
 ok_rows = petl.addfield(ok_rows, 'file', 'OK')
 nok_rows = petl.antijoin(serial_rows, petl.cut(main, ['serial number']), key='serial number')
 nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
 return petl.stack(ok_rows, nok_rows, non_serial_rows)
one_final = main_compare(one)
two_final = main_compare(two)
petl.toxlsx(main_final, 'mainNew.xlsx')
petl.toxlsx(one_final, '1New.xlsx')
petl.toxlsx(two_final, '2New.xlsx')

Sample files, can be downloaded from here. (for those who have time to play a lil' bit with the code).

Question 2

From a practical perspective, you should use object-oriented style for calling your transformations. It will reduce the amount of intermediate variables as you will be able to chain calls.

From a conceptual perspective you want to apply the same "transformations" on all 3 files, there is not much differences whether you do it from n.xlsx to main.xlsx or the other way around. You need to:

Extract serial numbers from a file;
Associate them to a message (filename when used to fill main.xlsx or OK when used to fill n.xlsx);
Add a column on an other file whose matching rows contains the messages and others contains NOK.

Instead of that you:

separate the row into categories (matching, not matching, no serial numbers);
add a column to these categories depending on their kind;
concatenate back those categories to get the resulting file out of them.

The problems you run into doing that are that, for one you change the order of the rows by filtering and stacking back (but that doesn't seem to be an issue), and for two your filtering rules are both clumsy and iterating over the input more than once.

The following approach iterates over each file exactly twice (one to extract the serial number/message pairs and one to add the required column) and uses functions to provide a generic approach that can easily be used to handle more files:

import petl
SERIAL_COLUMN = 'serial_number'
def map_serial_to_message(table, message='OK'):
 return (table
 .selectisnot(SERIAL_COLUMN, None)
 .cut(SERIAL_COLUMN)
 .addfield('file', message))
def create_new_column(table, allowed_serial, default_message='NOK'):
 return table.leftjoin(
 allowed_serial,
 key=SERIAL_COLUMN,
 missing=default_message)
if __name__ == '__main__':
 main = petl.fromxlsx('main.xlsx')
 one = petl.fromxlsx('1.xlsx', row_offset=1)
 two = petl.fromxlsx('2.xlsx')
 files_serial = petl.stack(
 map_serial_to_message(one, 'OK, 1.xlsx'),
 map_serial_to_message(two, 'OK, 2.xlsx'),
 # other files if need be
 )
 main_serial = map_serial_to_message(main)
 petl.toxlsx(create_new_column(main, files_serial), 'mainNew.xlsx')
 petl.toxlsx(create_new_column(one, main_serial), '1New.xlsx')
 petl.toxlsx(create_new_column(two, main_serial), '2New.xlsx')

As regard to writing to the original file, you could look into creating the files in memory first and then, once they all are iterated over enough time, write them back. But if files are big, you might be limited by the amount of available memory.

This is untested but it could look like:

import petl
SERIAL_COLUMN = 'serial_number'
def map_serial_to_message(table, message='OK'):
 return (table
 .selectisnot(SERIAL_COLUMN, None)
 .cut(SERIAL_COLUMN)
 .addfield('file', message))
def create_new_column(table, *args, default_message='NOK'):
 try:
 serials, = args
 except ValueError:
 serials = petl.stack(*args)
 sink = petl.MemorySource()
 (table
 .leftjoin(serials, key=SERIAL_COLUMN, missing=default_message)
 .toxlsx(sink))
 return sink.getvalue()
def write_to_file(data, filename):
 with open(filename, 'wb') as f:
 f.write(data)
if __name__ == '__main__':
 main = petl.fromxlsx('main.xlsx')
 one = petl.fromxlsx('1.xlsx', row_offset=1)
 two = petl.fromxlsx('2.xlsx')
 main_serial = map_serial_to_message(main)
 # Warning, these may require large amount of memory
 new_main = create_new_column(main,
 map_serial_to_message(one, 'OK, 1.xlsx'),
 map_serial_to_message(two, 'OK, 2.xlsx'),
 # other files if need be
 )
 new_one = create_new_column(one, main_serial)
 new_two = create_new_column(two, main_serial)
 # This is important to write after all transformations
 write_to_file(new_main, 'main.xlsx')
 write_to_file(new_one, '1.xlsx')
 write_to_file(new_two, '2.xlsx')

I also changed the way to handle stacking serial-messages pairs from n.xlsx to provide a more automatic alternative.

Question 3

@JoeWallis Yeah, it's just that I had a few extra time to dig into petl documentation to use native constructs instead of doing the merge in Python. The library seems interesting to analyze data, a bit less to modify it.

Question 4

@MathiasEttinger thanks, these are some awesome changes. However, any ideas why do I get: AttributeError: 'ValuesView' object has no attribute 'addfield' ?

Question 5

@MathiasEttinger if I do that modification I get a big traceback

Question 6

@Dex'ter The most relevant part is AttributeError: 'MemorySource' object has no attribute 'write'. As said, I don't really know how source objects do work. I'll try to dig into it a bit latter, for now, you should stick with the first version, then.

Question 7

I've not seen or used petl before, but there are some ways to improve your code.

Move things out of the global scope.
Make a function for your most common functions.
Change petl.select to petl.selectis or petl.selectisnot.

Since you do:

nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')

So many times, you should change it to a function. This function could take a fn of petl.antijoin, a left of serial_rows, a right of stacked_joins and a field of 'NOK'.

And so you can use:

def change_row_file(fn, left, right, field):
 rows = fn(left, petl.cut(right, 'serial number'), key='serial number')
 return petl.addfield(rows, 'file', field)
change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK')

Adding this, with the changes to petl.select, to main_compare can result in something like:

def main_compare(table):
 serial_rows = petl.selectisnot(table, 'serial number', None)
 return petl.stack(
 change_row_file(petl.join, serial_rows, main, 'OK'),
 change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
 petl.selectis(table, 'serial number', None))

There's not much more that I think could do with improving apart from having nothing in your global scope. And then adding a global constant for the serial column, say SERIAL_COLUMN = 'serial number'. This significantly reduces the amount of magic strings, to ones that aren't repeated (except 'NOK').

This can result in (Not pep8 compliant to remove a scroll bar):

import petl
SERIAL_COLUMN = 'serial number'
def change_row_file(fn, left, right, field):
 rows = fn(left, petl.cut(right, SERIAL_COLUMN), key=SERIAL_COLUMN)
 return petl.addfield(rows, 'file', field)
def update_main(main, one, two):
 serial_rows = petl.selectisnot(main, SERIAL_COLUMN, None)
 stacked_joins = petl.stack(
 change_row_file(petl.join, serial_rows, one, 'ok, 1.xlsx'),
 change_row_file(petl.join, serial_rows, two, 'ok, 2.xlsx'))
 return petl.stack(
 stacked_joins,
 petl.selectis(main, SERIAL_COLUMN, None),
 change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK'))
def main_compare(main, table):
 serial_rows = petl.selectisnot(table, SERIAL_COLUMN, None)
 return petl.stack(
 change_row_file(petl.join, serial_rows, main, 'OK'),
 change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
 petl.selectis(table, SERIAL_COLUMN, None))
def main():
 main = petl.fromxlsx('main.xlsx')
 one = petl.fromxlsx('1.xlsx', row_offset=1)
 two = petl.fromxlsx('2.xlsx')
 petl.toxlsx(update_main(main, one, two), 'mainNew.xlsx')
 petl.toxlsx(main_compare(main, one), '1New.xlsx')
 petl.toxlsx(main_compare(main, two), '2New.xlsx')
if __name__ == '__main__':
 main()

score 2 · Accepted Answer · 2016-05-13 11:19:57Z

From a practical perspective, you should use object-oriented style for calling your transformations. It will reduce the amount of intermediate variables as you will be able to chain calls.

From a conceptual perspective you want to apply the same "transformations" on all 3 files, there is not much differences whether you do it from n.xlsx to main.xlsx or the other way around. You need to:

Extract serial numbers from a file;
Associate them to a message (filename when used to fill main.xlsx or OK when used to fill n.xlsx);
Add a column on an other file whose matching rows contains the messages and others contains NOK.

Instead of that you:

separate the row into categories (matching, not matching, no serial numbers);
add a column to these categories depending on their kind;
concatenate back those categories to get the resulting file out of them.

The problems you run into doing that are that, for one you change the order of the rows by filtering and stacking back (but that doesn't seem to be an issue), and for two your filtering rules are both clumsy and iterating over the input more than once.

The following approach iterates over each file exactly twice (one to extract the serial number/message pairs and one to add the required column) and uses functions to provide a generic approach that can easily be used to handle more files:

import petl
SERIAL_COLUMN = 'serial_number'
def map_serial_to_message(table, message='OK'):
 return (table
 .selectisnot(SERIAL_COLUMN, None)
 .cut(SERIAL_COLUMN)
 .addfield('file', message))
def create_new_column(table, allowed_serial, default_message='NOK'):
 return table.leftjoin(
 allowed_serial,
 key=SERIAL_COLUMN,
 missing=default_message)
if __name__ == '__main__':
 main = petl.fromxlsx('main.xlsx')
 one = petl.fromxlsx('1.xlsx', row_offset=1)
 two = petl.fromxlsx('2.xlsx')
 files_serial = petl.stack(
 map_serial_to_message(one, 'OK, 1.xlsx'),
 map_serial_to_message(two, 'OK, 2.xlsx'),
 # other files if need be
 )
 main_serial = map_serial_to_message(main)
 petl.toxlsx(create_new_column(main, files_serial), 'mainNew.xlsx')
 petl.toxlsx(create_new_column(one, main_serial), '1New.xlsx')
 petl.toxlsx(create_new_column(two, main_serial), '2New.xlsx')

As regard to writing to the original file, you could look into creating the files in memory first and then, once they all are iterated over enough time, write them back. But if files are big, you might be limited by the amount of available memory.

This is untested but it could look like:

import petl
SERIAL_COLUMN = 'serial_number'
def map_serial_to_message(table, message='OK'):
 return (table
 .selectisnot(SERIAL_COLUMN, None)
 .cut(SERIAL_COLUMN)
 .addfield('file', message))
def create_new_column(table, *args, default_message='NOK'):
 try:
 serials, = args
 except ValueError:
 serials = petl.stack(*args)
 sink = petl.MemorySource()
 (table
 .leftjoin(serials, key=SERIAL_COLUMN, missing=default_message)
 .toxlsx(sink))
 return sink.getvalue()
def write_to_file(data, filename):
 with open(filename, 'wb') as f:
 f.write(data)
if __name__ == '__main__':
 main = petl.fromxlsx('main.xlsx')
 one = petl.fromxlsx('1.xlsx', row_offset=1)
 two = petl.fromxlsx('2.xlsx')
 main_serial = map_serial_to_message(main)
 # Warning, these may require large amount of memory
 new_main = create_new_column(main,
 map_serial_to_message(one, 'OK, 1.xlsx'),
 map_serial_to_message(two, 'OK, 2.xlsx'),
 # other files if need be
 )
 new_one = create_new_column(one, main_serial)
 new_two = create_new_column(two, main_serial)
 # This is important to write after all transformations
 write_to_file(new_main, 'main.xlsx')
 write_to_file(new_one, '1.xlsx')
 write_to_file(new_two, '2.xlsx')

I also changed the way to handle stacking serial-messages pairs from n.xlsx to provide a more automatic alternative.

@JoeWallis Yeah, it's just that I had a few extra time to dig into petl documentation to use native constructs instead of doing the merge in Python. The library seems interesting to analyze data, a bit less to modify it.
@MathiasEttinger thanks, these are some awesome changes. However, any ideas why do I get: AttributeError: 'ValuesView' object has no attribute 'addfield' ?
@MathiasEttinger if I do that modification I get a big traceback
@Dex'ter The most relevant part is AttributeError: 'MemorySource' object has no attribute 'write'. As said, I don't really know how source objects do work. I'll try to dig into it a bit latter, for now, you should stick with the first version, then.

Stack Exchange Network

Excel Data Manipulation - parse, match and create

Intro

Code:

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Excel Data Manipulation - parse, match and create

Intro

Code:

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions