I am writing a system that runs k-means to determine the most used words in Wikipedia. I do this by building a coreset from streaming data. Wikipedia itself is 50 TB, but after a year of processing I was given a 14 GB dataset, which I need to parse, and I want the parsing to be fast. The client sends the server 10,000 points at a time, which are then processed.
CODE:
import gc
import time


class point:
    def __init__(self, x, y, w):
        self.x = x
        self.y = y
        self.w = w


def parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y, w = [int(v) for v in s]
            obj = point(x, y, w)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return myList


if __name__ == "__main__":
    startTime = time.time()
    L = parse('C:/Users/user/workspace/finalSub/api/data100mb.txt')
    print("--- %s seconds ---" % (time.time() - startTime))
Parsing a 114 MB file takes 130 seconds, but I've been told it should take only a few seconds.
I tried splitting the data into multiple chunks and applying multiprocessing, but reading from multiple files turned out to be worse and made parsing take even longer.
File Sample:
1 1 1
1 2 1
1 3 1
1 4 1
.
.
1 N 1
2 1 1
2 2 1
2 3 1
2 4 1
.
.
2 N 1
How should I correctly parse the file, or access it, in order to make parsing faster?
2 Answers
Avoid reinventing the wheel, especially with common tasks that are likely to have optimized implementations. In this case, I recommend trying pandas.read_csv() to read the file. Then the resulting matrix can be fed to sklearn.cluster.KMeans().
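A rough sketch of how those two pieces could fit together; the column names, whitespace separator, file path and cluster count are illustrative assumptions, not details from the question:

import pandas as pd
from sklearn.cluster import KMeans

# Read the whitespace-delimited "x y w" triples into a DataFrame
# (the path and column names are placeholders).
df = pd.read_csv('data100mb.txt', sep=r'\s+', header=None, names=['x', 'y', 'w'])

# Cluster the (x, y) points, using w as per-point sample weights.
# n_clusters=8 is an arbitrary example value.
km = KMeans(n_clusters=8, n_init=10)
km.fit(df[['x', 'y']].to_numpy(), sample_weight=df['w'].to_numpy())

print(km.cluster_centers_)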
- It is a problem. I must have the points in a list, or the 14 GB will grow to 50 TB. A Wikipedia article contains 5000 words, for example; w is the weight, which is always 1. Each point corresponds to an (x, y) where x is the Wikipedia article and y corresponds to the word. The word "else" might be 20000. If I turn this into a matrix I will have a huge waste of space. Or am I missing something, sir? Each word is assigned a number id, i.e. "Else" --> 20000. – Tony Tannous, Mar 9, 2017 at 9:15
- Why would the resulting matrix from read_csv() be any larger than your list of point objects? Either way, it's the same two-dimensional table containing the same data. – 200_success, Mar 9, 2017 at 9:42
- The bottleneck ended up being the data sent to the server. Sending objects was costly! I moved to binary and it went 40 times faster. :) – Tony Tannous, Mar 10, 2017 at 17:31
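As a side note on that last comment, here is a minimal sketch of what packing the points into a fixed-size binary format (instead of sending Python objects) could look like; the struct layout and helper names are assumptions for illustration, not the poster's actual code:

import struct

def pack_points(points):
    # Pack each (x, y, w) triple as three little-endian 32-bit ints
    # (12 bytes per point). An illustrative guess at "moving to binary".
    buf = bytearray()
    for p in points:
        buf += struct.pack('<iii', p.x, p.y, p.w)
    return bytes(buf)

def unpack_points(data):
    # Recover (x, y, w) tuples from the packed buffer.
    return list(struct.iter_unpack('<iii', data))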
I would try the following things:
- remove the gc manipulation code
- use __slots__ for the Point class - it should result in memory-handling and performance improvements:

      class Point:
          __slots__ = ['x', 'y', 'w']

          def __init__(self, x, y, w):
              self.x = x
              self.y = y
              self.w = w
- use a list comprehension instead of appending (generally faster):

      def parse(path):
          with open(path) as f:
              return [Point(*(int(s) for s in line.split()))
                      for line in f]
- try the PyPy Python implementation (with the latest PyPy on my machine, the 100k-line file is parsed about 4 times faster than with Python 3.6); a small timing sketch follows below
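A small, hypothetical timing harness for comparing parser variants on the same file; the file path and repeat count are placeholders:

import time

def time_parser(parser, path, repeats=3):
    # Run the given parser a few times and report the best wall-clock time.
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        parser(path)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage, comparing the original loop-based parse()
# with the list-comprehension version:
# print(time_parser(parse, 'data100mb.txt'))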
- Up from 130 seconds to 170 seconds. :-( – Tony Tannous, Mar 8, 2017 at 20:52
- @TonyTannous what about PyPy? – alecxe, Mar 8, 2017 at 21:04
- I need to try it, though I'd rather have a software gain than some technique from outside the code. – Tony Tannous, Mar 8, 2017 at 21:07
- By convention, class names are capitalized, so the class should be named Point. Variables' and functions' names should be written in lower case, with underscores if the name consists of more than one word. So your variables should be named my_list, l and start_time, respectively.