Map reduce tester ported from bash to Python

Question 1

My MapReduce tester is clearly ported from Shell. Short of giving the mapper an `args=None` parameter and looping with `for line in args or read_input()`, what's a better way of importing and testing the function outside of subprocess?

Or does it not matter, i.e. is my "hack" fine?

test_mapreduce.py

from unittest import TestCase, main as unittest_main
from subprocess import check_output as run
from os import path


class TestMapReduce(TestCase):
    top_path = ''
    map_reduce = lambda self, mapper_name, reducer_name, datafile_name: run(
        ['python', path.join(self.top_path, reducer_name),  # Reduce
         run(['sort',  # Shuffle, could be replaced with python `sorted`
              run(['python', path.join(self.top_path, mapper_name),  # Map
                   path.join(self.top_path, 'data', datafile_name)])])])

    @classmethod
    def setUpClass(cls):
        if not path.isfile('setup.py'):
            cls.top_path = path.join('..', '..')
            if not path.isfile(path.join(cls.top_path, 'setup.py')):
                raise AssertionError("Haven't found right directory to `cd` into")

    def test_with_student_test_posts(self):
        print self.map_reduce('mapper.py', 'reducer.py', 'student_test_posts.csv')


if __name__ == '__main__':
    unittest_main()

mapper.py

#!/usr/bin/env python
from fileinput import input as read_input


def mapper():
    for line in read_input():
        data = line.strip().split('\t')
        if len(data) != 6:
            continue
        date, time, store, item, cost, payment = data
        print "{0}\t{1}".format(store, cost)


if __name__ == '__main__':
    mapper()

PS: Should I refactor to use the map and reduce inbuilt functions?

Question 2

Didn't you forget to include your reducer.py? A sample of student_test_posts.csv would be nice too.

Question 3

@janos: It doesn't actually matter. What I am showing is not the testing of the input/output but rather the process.

Question 4

It's strange to use map_reduce = lambda ... to define a method. This is the same, written the common way:

def map_reduce(self, mapper_name, reducer_name, datafile_name):
    # `return` is needed here; the lambda version returns implicitly
    return run(
        ['python', path.join(self.top_path, reducer_name),  # Reduce
         run(['sort',  # Shuffle, could be replaced with python `sorted`
              run(['python', path.join(self.top_path, mapper_name),  # Map
                   path.join(self.top_path, 'data', datafile_name)])])])

And this hack of calling python -> sort -> python is not fine at all. Python can certainly sort. Then your pipeline would become python -> python -> python, and at that point it's beyond silly to call subprocesses for this. You should do the whole thing in a single Python process, instead of 3 different processes.
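As a rough illustration of the single-process idea, here is a minimal sketch. Since reducer.py was not posted, the reduce step (summing the cost per store) is only an assumption, and the names `map_lines`, `reduce_pairs` and `map_reduce` are hypothetical, not taken from the question.

```python
from itertools import groupby
from operator import itemgetter


def map_lines(lines):
    """Map: yield (store, cost) pairs from tab-separated purchase lines."""
    for line in lines:
        cols = line.strip().split('\t')
        if len(cols) != 6:
            continue  # skip malformed rows, as the posted mapper does
        date, time, store, item, cost, payment = cols
        yield store, float(cost)


def reduce_pairs(pairs):
    """Shuffle and reduce: sort by store, then total the costs per store.

    Summing is an assumed reduce step; the real reducer.py was not shown.
    """
    shuffled = sorted(pairs, key=itemgetter(0))  # replaces the external `sort`
    return [(store, sum(cost for _, cost in group))
            for store, group in groupby(shuffled, key=itemgetter(0))]


def map_reduce(lines):
    """Run map, shuffle and reduce in one process, with no subprocess calls."""
    return reduce_pairs(map_lines(lines))
```

Because `map_reduce` accepts any iterable of lines, a test can pass it a plain list, while a script can pass it an open file.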

Refactoring with `map` and `reduce`

Here's one way to refactor mapper to use Python's map function:

def line2cols(line):
    return line.strip().split('\t')


def has6cols(cols):
    return len(cols) == 6


def cols2out(cols):
    # cols is (date, time, store, item, cost, payment); emit store and cost
    return '{}\t{}'.format(cols[2], cols[4])


def mapper():
    return map(cols2out, filter(has6cols, map(line2cols, read_input())))

And here's an example reducer using Python's reduce:

def reducer(seq):
    def f(a, b):
        if len(a) > len(b):
            return a
        return b
    return reduce(f, seq, '')

This example is deliberately trivial: it just finds the longest string in the sequence.
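For something closer to the task, a reducer could fold the mapper's output lines into per-store totals. This is only a sketch: the "total cost per store" goal is an assumption (reducer.py was not posted), and `add_line` is a hypothetical helper name.

```python
from functools import reduce  # a builtin on Python 2, in functools on Python 3


def add_line(totals, line):
    """Fold one 'store<TAB>cost' line into the running totals dict."""
    store, cost = line.split('\t')
    totals[store] = totals.get(store, 0.0) + float(cost)
    return totals


def reducer(lines):
    """Reduce mapper output lines to a dict mapping store to total cost."""
    return reduce(add_line, lines, {})
```

The dict accumulator means the input does not need to be sorted first, at the price of holding every store's total in memory at once.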

I hope this helps.

UPDATE

It's a bit difficult to understand what you're trying to do.

"My MapReduce tester is clearly ported from Shell. Short of giving the mapper an `args=None` parameter and looping with `for line in args or read_input()`, what's a better way of importing and testing the function outside of subprocess?"

When I read this, I didn't quite get what you meant about shell. In the code, I saw that you're calling Python twice, which is clearly not fine.

Let me try again, to guess what you're trying to do. Maybe you have a Python mapper script, and you have a Python reducer script, which you use in some framework? And you want to write some unit tests to check that these scripts in fact work? I mean the scripts as black boxes, as in, you want to test the complete scripts, rather than the underlying Python functions / classes? I'm really just guessing here, maybe I'm completely wrong.

If this is indeed what you want, then don't. Don't try to test the scripts, test the underlying implementation. If the implementation passes, the scripts should print correct output too. If you want to test the script outputs in addition to the underlying implementation, then you'd be just testing the basic ability to print, which seems rather pointless.
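To make "test the underlying implementation" concrete, here is a minimal sketch. The function name `store_and_cost` is hypothetical; the point is only that the logic is a plain function the test imports and calls directly, with no script or subprocess involved.

```python
import unittest


def store_and_cost(line):
    """Return 'store<TAB>cost' for a valid 6-column line, else None."""
    cols = line.strip().split('\t')
    if len(cols) != 6:
        return None
    return '{}\t{}'.format(cols[2], cols[4])


class TestStoreAndCost(unittest.TestCase):
    def test_valid_line(self):
        line = "2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex\n"
        self.assertEqual(store_and_cost(line), 'San Jose\t214.05')

    def test_malformed_line_is_skipped(self):
        self.assertIsNone(store_and_cost('not\tenough\tcolumns'))
```

Saved as a test module, this can be run with `python -m unittest`. The script itself then only needs to loop over its input and print each non-None result.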

Question 5

I did reference Python's sorted in my code comment. And isn't lambda vs def merely a combination of personal style and one-liners vs multi-liners in Python?

Question 6

If you could show me how to rework my flow to not use subprocess at all that'd be great =)

Question 7

Just because you can do something doesn't mean you should. lambda is useful and ergonomic in some situations, but this is not one of those. If you really prefer this way, you can, but I advise against.
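One practical difference, sketched below with hypothetical class names: a method defined with lambda has no useful name for tracebacks and cannot carry a docstring, while a def gets both.

```python
class WithLambda(object):
    # Works, but the function is anonymous and has no docstring
    double = lambda self, x: 2 * x


class WithDef(object):
    def double(self, x):
        """Return twice x."""
        return 2 * x


# Both behave the same when called...
same_result = WithLambda().double(3) == WithDef().double(3)

# ...but only the def version identifies itself by name
lambda_name = WithLambda.double.__name__
def_name = WithDef.double.__name__
```

Here `lambda_name` is `'<lambda>'` and `def_name` is `'double'`, which is what shows up in tracebacks and test reports.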

Question 8

To rework without subprocess, that would take a rewrite. Code Review is about reviewing code, not rewriting it. This article looks like a good starting point: mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python

Question 9

Here is the solution I touched upon in my initial question - codereview.stackexchange.com/a/60281/13407 - is that as bad a solution as I implied?

Question 10

Forgot all about this problem, only took a look at my project today, and figured out this solution:

test_mapper.py

from unittest import TestCase, main as unittest_main
from StringIO import StringIO

from map_reduce_udacity.mapper import mapper


class TestMapper(TestCase):
    def test_one_line(self):
        # Fields are tab-separated, matching what mapper() splits on
        self.assertEqual(
            mapper(StringIO("2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex")),
            ['San Jose\t214.05']
        )


if __name__ == '__main__':
    unittest_main()

mapper.py

#!/usr/bin/env python
from fileinput import input as read_input
from os.path import abspath, join as path_join, dirname
from StringIO import StringIO
from collections import OrderedDict


def pick(r, ks, headers=None):
    # Parentheses matter: without them, `or` binds only to the first string
    headers = headers or ('date', 'time', 'store', 'item', 'cost', 'payment')
    return filter(lambda v: v is not None,
                  map(lambda t: t[0] in ks and t[1] or None,
                      OrderedDict(zip(headers, r)).iteritems()))


def mapper(args=None):
    out = map(lambda row: '\t'.join(pick(row, ('store', 'cost'))),
              filter(lambda data: len(data) == 6,
                     map(lambda line: line.strip().split('\t'),
                         args or read_input())))
    print 'out =', out
    return out

Just to show the advantage of the function written this way more explicitly:

from os.path import abspath, join as path_join, dirname
from StringIO import StringIO

if __name__ == '__main__':
    # First way of running:
    mapper(StringIO("2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex"))
    # Second way of running:
    with open(abspath(path_join(dirname(__file__), '..', 'data', 'head_50_purchases.txt'))) as f:
        mapper(f.readlines())
    # Third way of running:
    # [from your e.g. cmd.exe or bash or subprocess call]
    # $ python mapper.py ../data/head_50_purchases.txt
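One caveat with `args or read_input()`: an empty list is falsy, so passing `[]` silently falls through to reading stdin or the files named on the command line. The sketch below shows the difference, using a hypothetical `read_stdin_stub` in place of `fileinput.input()` so that it runs without any input.

```python
def read_stdin_stub():
    """Hypothetical stand-in for fileinput.input(); returns a marker line."""
    return ['from stdin']


def mapper_or(args=None):
    # `or` treats an empty list the same as None
    return list(args or read_stdin_stub())


def mapper_is_none(args=None):
    # Only fall back to the default source when no argument was given
    source = read_stdin_stub() if args is None else args
    return list(source)
```

With these definitions, `mapper_or([])` reads from the stub while `mapper_is_none([])` returns an empty list, which is usually what a test passing empty input expects.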

Accepted answer by janos, 2014-08-17 08:15:05Z (the answer above that begins "It's strange to use map_reduce = lambda ...").

