Basic querying on Python 2D arrays

Question 1

Background

Today at work, I had to bring a component that used to load CSV data into MySQL, and perform calculations as result of SQL queries. I was tasked with replicating that in a programming language.

The general example is like:

There's a 2D array (a table) with columns "Name", "Age", "Grade" and "Total". Produce counts like "Unique Ages", "Unique Ages & Grade", "Unique Ages for which Grade = 3" etc.

My Work

So, I chose Python to do this job. There's a class that takes two parameters: 2D array and columns, and allows for performing some very basic querying capability.

Environment: Python 2.x

class From:
 def __init__(self, array, columns):
 self.columns = columns
 self.array = array
 def where(self, column, data):
 return From(filter(lambda x: x[self.columns.index(column)] == data, self.array), self.columns)
 def whereNot(self, column, data):
 return From(filter(lambda x: x[self.columns.index(column)] != data, self.array), self.columns)
 def select(self, *columns):
 return From([[arrayItem[self.columns.index(col)] for col in columns] for arrayItem in self.array], columns)
 def unique(self, *columns):
 # This function filters those rows which have a unique combination of the columns
 # specified as parameter. If there are 2+ rows with same combination of values
 # under the column, then the last one will be chose. Example use case: Extract a
 # subset of rows that have a unique combination of "first name" & "second name"
 # Invocation: From(<array>,<cols>).unique("first_name","second_name")
 return From({tuple(row[self.columns.index(col)] for col in columns): row for row in self.array}.values(), self.columns)
 def length(self):
 return len(self.array)
 def get(self):
 if len(self.columns) == 1:
 return [x[0] for x in self.array]
 return self.array

Example Usage:

>>> array = [
... ["ben", "son", 23, 10, 93],
... ["gun", "dad", 21, 9, 99],
... ["sam", "mom", 19, 11, 92]
... ]
>>> cols = ["name1","name2", "age", "grade", "marks"]
>>> From(array, cols).where("name1","sam").get()
[['sam', 'mom', 19, 11, 92]]

The querying that is being done in the real code is quite complex beyond the scope of this question. The From class works, for now.

Questions

Can this be improved, maybe made more pythonic from an overall perspective (Please excuse those ugly filter/lambda expressions. I know they are just not pythonic.)?
The From.unique(..) function looks like a dirty hack. Is there any cleaner way to do it? (Please check my comment under the unique function declaration.)
Is there any non-edge case when the above will not work? Anywhere it might just go wrong?
Am I reinventing the wheel?

Question 2

I totally agree with liuyu that a proper in-memory database like sqlite3 will be much more efficient and useful than what you have here.

But just reviewing your code as it is:

There's no documentation. How are we supposed to use this class?
There are no test cases. You could use the doctest module to make your example code into a runnable test case.
You don't follow the Python style guide (PEP8). In particular, if you kept to the maximum line length of 79 columns, then we could read your code here without having to scroll the window.
From is a poor name: it doesn't clearly communicate the meaning of the class. An instance of the From class represents an table of records, so I would give the class a name like Table.
Unless you have a particular reason to make an old-style class, you should write:
```
class From(object):
```
so that you get a new-style class which is portable to Python 3. See "new-style and classic classes" in the docs.
Instead of providing methods length to get the length of the table, and get to get the table itself, why not make your class into a sequence? That is, instead of length you'd provide __len__ and instead of get you'd provide __iter__ and __getitem__. Then callers could just iterate over your From items in the usual way:
```
for record in From(...):
```
If you derived your class from the abstract base case collections.Sequence, you'd also get __contains__, __iter__, __reversed__, index, and count methods for free.
It would be most natural to return each record as a collections.namedtuple object, so that the caller can refer to the columns by name:
```
for record in From(...).select('description', 'count'):
 print('Part: {0.description}. Number: ${0.count}'.format(record))
```
The special case in get seems like a bad idea. It is generally best for functions to have simple and clear specifications. If you want a convenient method for returning a single column as a list, then you should write that as a separate method.
The index method on a list has to scan along the whole list comparing with each element in turn. You should consider building a dictionary in which you can look up the column name and get its index. (Or if you use namedtuple you can write getattr(row, column).)
There's very little error checking. For example, in __init__, wouldn't it be worth checking that the array has the right number of columns?
There are a couple of problems with your unique method: (i) The comment should be a docstring; (ii) by using a dictionary you don't guarantee to get the records out in the order they went in. I would use collections.OrderedDict to ensure that order of rows is preserved.

So I'd write something like this:

from collections import namedtuple, OrderedDict, Sequence
class Table(Sequence):
 """Table(name, data, columns) represents a 2-dimensional table of data
 whose column names are given by the columns argument.
 >>> from operator import attrgetter
 >>> planets = Table('Planet', [
 ... ('Mercury', 58, 88),
 ... ('Venus', 108, 225),
 ... ('Earth', 150, 365),
 ... ('Mars', 228, 687),
 ... ('Jupiter', 779, 4333),
 ... ('Saturn', 1433, 10759),
 ... ('Uranus', 2877, 30799),
 ... ('Neptune', 4503, 60190)],
 ... ('name', 'semi_major_axis', 'period'))
 Iterating over the table yields the rows as namedtuple objects:
 >>> max(planets, key=attrgetter('semi_major_axis'))
 Planet(name='Neptune', semi_major_axis=4503, period=60190)
 You can filter the table to get a subset of the rows:
 >>> planets.filter(lambda p: p.period > 10000)
 ... # doctest: +NORMALIZE_WHITESPACE
 Table('Planet',
 [Planet(name='Saturn', semi_major_axis=1433, period=10759),
 Planet(name='Uranus', semi_major_axis=2877, period=30799),
 Planet(name='Neptune', semi_major_axis=4503, period=60190)],
 ['name', 'semi_major_axis', 'period'])
 Or select a subset of the columns:
 >>> planets.select('name')
 ... # doctest: +NORMALIZE_WHITESPACE
 Table('Planet',
 [Planet(name='Mercury'),
 Planet(name='Venus'),
 Planet(name='Earth'),
 Planet(name='Mars'),
 Planet(name='Jupiter'),
 Planet(name='Saturn'),
 Planet(name='Uranus'),
 Planet(name='Neptune')],
 ['name'])
 If you just want the values from a column, use the column method:
 >>> ' '.join(planets.column('name'))
 'Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune'
 """
 def __init__(self, name, table, columns):
 self.name = name
 self.columns = list(columns)
 Row = namedtuple(self.name, self.columns)
 self.column_set = set(self.columns)
 self.table = [Row(*row) for row in table]
 def __len__(self):
 return len(self.table)
 def __getitem__(self, key):
 return self.table[key]
 def __iter__(self):
 return iter(self.table)
 def __repr__(self):
 return '{0}({1.name!r}, {2!r}, {1.columns!r})'.format(
 self.__class__.__name__, self, list(self))
 def column(self, column):
 """Generate the values in the given column."""
 assert(column in self.column_set)
 for row in self:
 yield getattr(row, column)
 def filter(self, condition):
 """Return a new Table containing only the rows satisfying
 condition.
 """
 return Table(self.name, filter(condition, self), self.columns)
 def where(self, column, value):
 """Return a new Table containing the rows from this table for which
 the given column has the given value.
 """
 assert(column in self.columns)
 return self.filter(lambda row: getattr(row, column) == value)
 def whereNot(self, column, value):
 """Return a new Table containing the rows from this table for which
 the given column does not have the given value.
 """
 assert(column in self.columns)
 return self.filter(lambda row: getattr(row, column) != value)
 def select(self, *columns):
 """Return a new Table consisting only of the given columns of this
 table.
 """
 assert(all(c in self.columns for c in columns))
 return Table(self.name,
 ((getattr(row, c) for c in columns) for row in self),
 columns)
 def unique(self, *columns):
 """Return a new Table containing one row for each unique combination
 of values in the given columns. (If there are multiple rows
 with the same combination of values, the last one is
 included.)
 """
 assert(all(c in self.columns for c in columns))
 result = OrderedDict((tuple(getattr(row, c) for c in columns), row)
 for row in self)
 return Table(self.name, result.values(), self.columns)

Question 3

Consider loading the 2D list into an in-memory SQLite database (aka. sqlite3 package) created on-the-fly. Given that there is an existing codebase of well-defined SQL queries, you can reuse the query expressions in your python code with the in-memory SQLite database. And for sufficiently large datasets, the efficiency of SQLite queries should out-perform brute-force search via filter().

Question 4

Thanks a lot for your response! I agree, this is the correct way to head. But I am actually a bit curious about my specific question/code here. :)

Gareth Rees Gareth Rees 50.1k3 gold badges130 silver badges210 bronze badges · Accepted Answer · 2013-11-22 10:55:12Z

I totally agree with liuyu that a proper in-memory database like sqlite3 will be much more efficient and useful than what you have here.

But just reviewing your code as it is:

There's no documentation. How are we supposed to use this class?
There are no test cases. You could use the doctest module to make your example code into a runnable test case.
You don't follow the Python style guide (PEP8). In particular, if you kept to the maximum line length of 79 columns, then we could read your code here without having to scroll the window.
From is a poor name: it doesn't clearly communicate the meaning of the class. An instance of the From class represents an table of records, so I would give the class a name like Table.
Unless you have a particular reason to make an old-style class, you should write:
```
class From(object):
```
so that you get a new-style class which is portable to Python 3. See "new-style and classic classes" in the docs.
Instead of providing methods length to get the length of the table, and get to get the table itself, why not make your class into a sequence? That is, instead of length you'd provide __len__ and instead of get you'd provide __iter__ and __getitem__. Then callers could just iterate over your From items in the usual way:
```
for record in From(...):
```
If you derived your class from the abstract base case collections.Sequence, you'd also get __contains__, __iter__, __reversed__, index, and count methods for free.
It would be most natural to return each record as a collections.namedtuple object, so that the caller can refer to the columns by name:
```
for record in From(...).select('description', 'count'):
 print('Part: {0.description}. Number: ${0.count}'.format(record))
```
The special case in get seems like a bad idea. It is generally best for functions to have simple and clear specifications. If you want a convenient method for returning a single column as a list, then you should write that as a separate method.
The index method on a list has to scan along the whole list comparing with each element in turn. You should consider building a dictionary in which you can look up the column name and get its index. (Or if you use namedtuple you can write getattr(row, column).)
There's very little error checking. For example, in __init__, wouldn't it be worth checking that the array has the right number of columns?
There are a couple of problems with your unique method: (i) The comment should be a docstring; (ii) by using a dictionary you don't guarantee to get the records out in the order they went in. I would use collections.OrderedDict to ensure that order of rows is preserved.

So I'd write something like this:

from collections import namedtuple, OrderedDict, Sequence
class Table(Sequence):
 """Table(name, data, columns) represents a 2-dimensional table of data
 whose column names are given by the columns argument.
 >>> from operator import attrgetter
 >>> planets = Table('Planet', [
 ... ('Mercury', 58, 88),
 ... ('Venus', 108, 225),
 ... ('Earth', 150, 365),
 ... ('Mars', 228, 687),
 ... ('Jupiter', 779, 4333),
 ... ('Saturn', 1433, 10759),
 ... ('Uranus', 2877, 30799),
 ... ('Neptune', 4503, 60190)],
 ... ('name', 'semi_major_axis', 'period'))
 Iterating over the table yields the rows as namedtuple objects:
 >>> max(planets, key=attrgetter('semi_major_axis'))
 Planet(name='Neptune', semi_major_axis=4503, period=60190)
 You can filter the table to get a subset of the rows:
 >>> planets.filter(lambda p: p.period > 10000)
 ... # doctest: +NORMALIZE_WHITESPACE
 Table('Planet',
 [Planet(name='Saturn', semi_major_axis=1433, period=10759),
 Planet(name='Uranus', semi_major_axis=2877, period=30799),
 Planet(name='Neptune', semi_major_axis=4503, period=60190)],
 ['name', 'semi_major_axis', 'period'])
 Or select a subset of the columns:
 >>> planets.select('name')
 ... # doctest: +NORMALIZE_WHITESPACE
 Table('Planet',
 [Planet(name='Mercury'),
 Planet(name='Venus'),
 Planet(name='Earth'),
 Planet(name='Mars'),
 Planet(name='Jupiter'),
 Planet(name='Saturn'),
 Planet(name='Uranus'),
 Planet(name='Neptune')],
 ['name'])
 If you just want the values from a column, use the column method:
 >>> ' '.join(planets.column('name'))
 'Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune'
 """
 def __init__(self, name, table, columns):
 self.name = name
 self.columns = list(columns)
 Row = namedtuple(self.name, self.columns)
 self.column_set = set(self.columns)
 self.table = [Row(*row) for row in table]
 def __len__(self):
 return len(self.table)
 def __getitem__(self, key):
 return self.table[key]
 def __iter__(self):
 return iter(self.table)
 def __repr__(self):
 return '{0}({1.name!r}, {2!r}, {1.columns!r})'.format(
 self.__class__.__name__, self, list(self))
 def column(self, column):
 """Generate the values in the given column."""
 assert(column in self.column_set)
 for row in self:
 yield getattr(row, column)
 def filter(self, condition):
 """Return a new Table containing only the rows satisfying
 condition.
 """
 return Table(self.name, filter(condition, self), self.columns)
 def where(self, column, value):
 """Return a new Table containing the rows from this table for which
 the given column has the given value.
 """
 assert(column in self.columns)
 return self.filter(lambda row: getattr(row, column) == value)
 def whereNot(self, column, value):
 """Return a new Table containing the rows from this table for which
 the given column does not have the given value.
 """
 assert(column in self.columns)
 return self.filter(lambda row: getattr(row, column) != value)
 def select(self, *columns):
 """Return a new Table consisting only of the given columns of this
 table.
 """
 assert(all(c in self.columns for c in columns))
 return Table(self.name,
 ((getattr(row, c) for c in columns) for row in self),
 columns)
 def unique(self, *columns):
 """Return a new Table containing one row for each unique combination
 of values in the given columns. (If there are multiple rows
 with the same combination of values, the last one is
 included.)
 """
 assert(all(c in self.columns for c in columns))
 result = OrderedDict((tuple(getattr(row, c) for c in columns), row)
 for row in self)
 return Table(self.name, result.values(), self.columns)

Stack Exchange Network

Basic querying on Python 2D arrays

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Basic querying on Python 2D arrays

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions