I'm working on a fun project where I'm trying to implement Random Forest algorithm in pure Python, i.e. without NumPy. But then I'll still be dealing with arrays all the time and so I'm writing my own Array class to use in place of NumPy's ndarray. Although I'm only focusing on 2d arrays as I don't need n-dimensional stuff for this project.
It's all working correctly, although it's not fully covered by tests yet. And for most part I think I have reasonably decent performance, but I'm struggling with methods for addition and multiplication of arrays.
In the normal addition or multiplication of arrays in NumPy the result is element-wise addition or multiplication and my solution is O(n * m) which for large arrays will be a problem.
Is there a faster algorithm to do this (the __add__
and __mul__
dunders)? Any other things I could improve?
from typing import Iterable
from pathlib import Path
from functools import reduce
class Array:
def __init__(self, value: Iterable[Iterable[int | float]]) -> None:
self.value = value
@property
def value(self):
return self.__value
@value.setter
def value(self, val: Iterable[Iterable[int | float]]):
if not isinstance(val, Iterable):
raise ValueError('The array must be an Iterable.')
if not all(isinstance(a, Iterable) for a in val):
raise ValueError("All elements of array must be Iterables. "
"If creating a flat array, all elements must be length 1 iterables.")
if not reduce(lambda x, y: len(x)==len(y), val):
raise ValueError("Can not create array from a ragged sequence. "
"Please ensure all elements have the same length.")
self.__value = val
@property
def shape(self) -> tuple[int, int]:
return (len(self.value), len(self.value[0]))
def __repr__(self):
return f"Array({self.value}, shape={self.shape})"
def __eq__(self, other: 'Array') -> bool:
if self.shape != other.shape:
return False
return all(x == y for x, y in zip(self.value, other.value))
def __neq__(self):
return not self.__eq__
def __getitem__(self, idx: tuple[int, int]) -> int | float:
if idx[0] > self.shape[0] or idx[1] > self.shape[1]:
raise IndexError(f"Index out of bounds for the array of shape {self.shape}")
return self.value[idx[0]][idx[1]]
def __add__(self, other: 'Array') -> 'Array':
v = []
if self.shape != other.shape:
raise ValueError("Can only add arrays of the same shape.")
for i in range(self.shape[0]):
v.append([])
for j in range(self.shape[1]):
v[i].append(self.value[i][j] + other.value[i][j])
return Array(v)
def __mul__(self, other: 'Array') -> 'Array':
v = []
if self.shape != other.shape:
raise ValueError("Can only add arrays of the same shape.")
for i in range(self.shape[0]):
v.append([])
for j in range(self.shape[1]):
v[i].append(self.value[i][j] * other.value[i][j])
return Array(v)
@classmethod
def from_text(cls, input_file: str | Path, skip_rows: int | None = None, delimeter: str = ',') -> 'Array':
inp = Path(input_file)
if not input_file or not inp.exists():
raise IOError("No input file provided or file does not exist.")
with open(inp, 'r') as file:
arr = []
for idx, i in enumerate(file.readlines()):
if skip_rows is None or skip_rows <= idx:
try:
arr.append([float(k) for k in i.strip().split(delimeter)])
except (ValueError, TypeError) as e:
raise TypeError("Input data must be castable to float. "
"Found incompatible data type. Check your inputs.") from e
return cls(arr)
def matmul(self, other: 'Array') -> 'Array':
pass
2 Answers 2
(Array.value
does look a candidate for name mangling - I'm amazed it works as presented.)
Don't write, never commit/publish undocumented/uncommented code.
If you kept matmul()
&& __mul__()
together, the difference in docstring or no docstring was obvious.
Avoid multiple indexing in inner loops.
Give Python iteration and comprehensions a try -
def __mul__(self, other: 'Array') -> 'Array':
""" point-wise product
"""
if self.shape != other.shape:
raise ValueError("Can only mul arrays of the same shape.")
return Array([[svalue * ovalue for svalue, ovalue in zip(srow, orow)]
for srow, orow in zip(self.value, other.value)])
-
\$\begingroup\$ (
[map(prod, zip(*columns)) for columns in zip(self.value, other.value)]
happens to not be my way.) \$\endgroup\$greybeard– greybeard2022年07月29日 06:29:44 +00:00Commented Jul 29, 2022 at 6:29 -
\$\begingroup\$ What is the issue with
Array.value
? Why is name mangling required here? \$\endgroup\$NotAName– NotAName2022年07月29日 07:21:16 +00:00Commented Jul 29, 2022 at 7:21
Bugs
Mutable reference
The constructor takes an iterable of iterables, validates the type & shape, and stores that reference. If the source is mutated, the shape of the array can become invalid.
x = [[1, 2], [3, 4]]
a = Array(x)
x[0].append(5)
Function reference used in boolean context
Every function reference, in boolean context, evaluates to True
. Applying not
to the function reference self.__eq__
will always evaluate to False
. Therefore, the __neq__
method will always return False
.
You probably intended to call the __eq__
method and not
the value that was returned, with something like ...
def __neq__(self, other):
return not self.__eq__(other)
... except ...
__neq__
is not an inequality test
For a custom inequality (!=
) test, you need to redefine the method named __ne__
. The name __neq__
does not have any defined meaning, and (due to double-underscore prefix and suffix) reserved for future use by Python.
Indexing
an_array[1]
will crash with an index error, but an_array[1, 2, 3, 4]
will not, despite too many coordinates are given. I'd recommend unpacking the idx
tuple:
def __getitem__(self, idx: tuple[int, int]) -> int | float:
row, col = idx
if row > self.shape[0] or col > self.shape[1]:
raise IndexError(f"Index out of bounds for the array of shape {self.shape}")
return self.value[row][col]
Name mangling
Use of the identified __value
triggers name mangling, to prevent name clashes with super or subclasses. If this is not your intent, you should simply use a single underscore prefix _value
instead, to indicate it is not part of the public interface.
Cache frequently referenced values
You access the .shape
property many times. Each time, the shape(self)
method is called, and len(self.value)
and len(self.value[0])
is called, the results packed into a tuple and returned.
if self.shape != other.shape:
raise ValueError("Can only add arrays of the same shape.")
for i in range(self.shape[0]):
v.append([])
for j in range(self.shape[1]):
v[i].append(self.value[i][j] + other.value[i][j])
If you have a 1000x50 array (taken from your deleted answer), how many times are you calling the .shape(self)
method? 1002 times on the self object? 2002 calls to len()
?
If instead your constructor stored _shape
tuple, or even a _rows
and _columns
attributes, you could access those directly, for less overhead.
Too much indexing
After validating the shapes match, you can loop over elements in value
directly, instead of using the inefficient for index in range(container):
antipattern:
v = []
for row, other_row in zip(self.value, other.value):
new_row = []
for val, other_val in zip(row, other_row):
new_row.append(val + other_val)
v.append(new_row)
How much indexing is in the above fragment? None! How many times is .value(self)
property access function called? Twice!
Compare with your:
v = []
for i in range(self.shape[0]):
v.append([])
for j in range(self.shape[1]):
v[i].append(self.value[i][j] * other.value[i][j])
Given the 1000x50 shape, how many [i]
indexing operations are there? 150,000! How many [j]
indexing operations? 100,000! How many calls to the .value(self)
property? 100,000! Yikes!
(But see graybeard's answer for the faster and more compact list comprehension. I'm just trying to show you how much unnecessary overhead your code has here.)
__mul__()
looks copy-pasted from__add__()
- a "code smell". (Look at the C&P error...). With n * m result values, what time complexity are you hoping for? \$\endgroup\$__add__
because I couldn't come up with a different solution and figured it should follow the same principle. I'm not looking for any specifc time complexity, just trying to see if there are algorithms that can do it faster than what I have. But it's not really critical, as long as it works. \$\endgroup\$