Background
Today at work, I had to bring a component that used to load CSV data into MySQL, and perform calculations as result of SQL queries. I was tasked with replicating that in a programming language.
The general example is like:
There's a 2D array (a table) with columns "Name", "Age", "Grade" and "Total". Produce counts like "Unique Ages", "Unique Ages & Grade", "Unique Ages for which Grade = 3" etc.
My Work
So, I chose Python to do this job. There's a class that takes two parameters: 2D array and columns, and allows for performing some very basic querying capability.
Environment: Python 2.x
class From:
def __init__(self, array, columns):
self.columns = columns
self.array = array
def where(self, column, data):
return From(filter(lambda x: x[self.columns.index(column)] == data, self.array), self.columns)
def whereNot(self, column, data):
return From(filter(lambda x: x[self.columns.index(column)] != data, self.array), self.columns)
def select(self, *columns):
return From([[arrayItem[self.columns.index(col)] for col in columns] for arrayItem in self.array], columns)
def unique(self, *columns):
# This function filters those rows which have a unique combination of the columns
# specified as parameter. If there are 2+ rows with same combination of values
# under the column, then the last one will be chose. Example use case: Extract a
# subset of rows that have a unique combination of "first name" & "second name"
# Invocation: From(<array>,<cols>).unique("first_name","second_name")
return From({tuple(row[self.columns.index(col)] for col in columns): row for row in self.array}.values(), self.columns)
def length(self):
return len(self.array)
def get(self):
if len(self.columns) == 1:
return [x[0] for x in self.array]
return self.array
Example Usage:
>>> array = [
... ["ben", "son", 23, 10, 93],
... ["gun", "dad", 21, 9, 99],
... ["sam", "mom", 19, 11, 92]
... ]
>>> cols = ["name1","name2", "age", "grade", "marks"]
>>> From(array, cols).where("name1","sam").get()
[['sam', 'mom', 19, 11, 92]]
The querying that is being done in the real code is quite complex beyond the scope of this question. The From
class works, for now.
Questions
- Can this be improved, maybe made more pythonic from an overall perspective (Please excuse those ugly filter/lambda expressions. I know they are just not pythonic.)?
- The
From.unique(..)
function looks like a dirty hack. Is there any cleaner way to do it? (Please check my comment under the unique function declaration.) - Is there any non-edge case when the above will not work? Anywhere it might just go wrong?
- Am I reinventing the wheel?
2 Answers 2
I totally agree with liuyu that a proper in-memory database like sqlite3
will be much more efficient and useful than what you have here.
But just reviewing your code as it is:
There's no documentation. How are we supposed to use this class?
There are no test cases. You could use the
doctest
module to make your example code into a runnable test case.You don't follow the Python style guide (PEP8). In particular, if you kept to the maximum line length of 79 columns, then we could read your code here without having to scroll the window.
From
is a poor name: it doesn't clearly communicate the meaning of the class. An instance of theFrom
class represents an table of records, so I would give the class a name likeTable
.Unless you have a particular reason to make an old-style class, you should write:
class From(object):
so that you get a new-style class which is portable to Python 3. See "new-style and classic classes" in the docs.
Instead of providing methods
length
to get the length of the table, andget
to get the table itself, why not make your class into a sequence? That is, instead oflength
you'd provide__len__
and instead ofget
you'd provide__iter__
and__getitem__
. Then callers could just iterate over yourFrom
items in the usual way:for record in From(...):
If you derived your class from the abstract base case
collections.Sequence
, you'd also get__contains__
,__iter__
,__reversed__
,index
, andcount
methods for free.It would be most natural to return each record as a
collections.namedtuple
object, so that the caller can refer to the columns by name:for record in From(...).select('description', 'count'): print('Part: {0.description}. Number: ${0.count}'.format(record))
The special case in
get
seems like a bad idea. It is generally best for functions to have simple and clear specifications. If you want a convenient method for returning a single column as a list, then you should write that as a separate method.The
index
method on a list has to scan along the whole list comparing with each element in turn. You should consider building a dictionary in which you can look up the column name and get its index. (Or if you usenamedtuple
you can writegetattr(row, column)
.)There's very little error checking. For example, in
__init__
, wouldn't it be worth checking that the array has the right number of columns?There are a couple of problems with your
unique
method: (i) The comment should be a docstring; (ii) by using a dictionary you don't guarantee to get the records out in the order they went in. I would usecollections.OrderedDict
to ensure that order of rows is preserved.
So I'd write something like this:
from collections import namedtuple, OrderedDict, Sequence
class Table(Sequence):
"""Table(name, data, columns) represents a 2-dimensional table of data
whose column names are given by the columns argument.
>>> from operator import attrgetter
>>> planets = Table('Planet', [
... ('Mercury', 58, 88),
... ('Venus', 108, 225),
... ('Earth', 150, 365),
... ('Mars', 228, 687),
... ('Jupiter', 779, 4333),
... ('Saturn', 1433, 10759),
... ('Uranus', 2877, 30799),
... ('Neptune', 4503, 60190)],
... ('name', 'semi_major_axis', 'period'))
Iterating over the table yields the rows as namedtuple objects:
>>> max(planets, key=attrgetter('semi_major_axis'))
Planet(name='Neptune', semi_major_axis=4503, period=60190)
You can filter the table to get a subset of the rows:
>>> planets.filter(lambda p: p.period > 10000)
... # doctest: +NORMALIZE_WHITESPACE
Table('Planet',
[Planet(name='Saturn', semi_major_axis=1433, period=10759),
Planet(name='Uranus', semi_major_axis=2877, period=30799),
Planet(name='Neptune', semi_major_axis=4503, period=60190)],
['name', 'semi_major_axis', 'period'])
Or select a subset of the columns:
>>> planets.select('name')
... # doctest: +NORMALIZE_WHITESPACE
Table('Planet',
[Planet(name='Mercury'),
Planet(name='Venus'),
Planet(name='Earth'),
Planet(name='Mars'),
Planet(name='Jupiter'),
Planet(name='Saturn'),
Planet(name='Uranus'),
Planet(name='Neptune')],
['name'])
If you just want the values from a column, use the column method:
>>> ' '.join(planets.column('name'))
'Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune'
"""
def __init__(self, name, table, columns):
self.name = name
self.columns = list(columns)
Row = namedtuple(self.name, self.columns)
self.column_set = set(self.columns)
self.table = [Row(*row) for row in table]
def __len__(self):
return len(self.table)
def __getitem__(self, key):
return self.table[key]
def __iter__(self):
return iter(self.table)
def __repr__(self):
return '{0}({1.name!r}, {2!r}, {1.columns!r})'.format(
self.__class__.__name__, self, list(self))
def column(self, column):
"""Generate the values in the given column."""
assert(column in self.column_set)
for row in self:
yield getattr(row, column)
def filter(self, condition):
"""Return a new Table containing only the rows satisfying
condition.
"""
return Table(self.name, filter(condition, self), self.columns)
def where(self, column, value):
"""Return a new Table containing the rows from this table for which
the given column has the given value.
"""
assert(column in self.columns)
return self.filter(lambda row: getattr(row, column) == value)
def whereNot(self, column, value):
"""Return a new Table containing the rows from this table for which
the given column does not have the given value.
"""
assert(column in self.columns)
return self.filter(lambda row: getattr(row, column) != value)
def select(self, *columns):
"""Return a new Table consisting only of the given columns of this
table.
"""
assert(all(c in self.columns for c in columns))
return Table(self.name,
((getattr(row, c) for c in columns) for row in self),
columns)
def unique(self, *columns):
"""Return a new Table containing one row for each unique combination
of values in the given columns. (If there are multiple rows
with the same combination of values, the last one is
included.)
"""
assert(all(c in self.columns for c in columns))
result = OrderedDict((tuple(getattr(row, c) for c in columns), row)
for row in self)
return Table(self.name, result.values(), self.columns)
Consider loading the 2D list into an in-memory SQLite database (aka. sqlite3
package) created on-the-fly. Given that there is an existing codebase of well-defined SQL queries, you can reuse the query expressions in your python code with the in-memory SQLite database. And for sufficiently large datasets, the efficiency of SQLite queries should out-perform brute-force search via filter()
.
-
\$\begingroup\$ Thanks a lot for your response! I agree, this is the correct way to head. But I am actually a bit curious about my specific question/code here. :) \$\endgroup\$UltraInstinct– UltraInstinct2013年11月22日 05:58:44 +00:00Commented Nov 22, 2013 at 5:58