I am writing a system that runs k-means to determine the most used words in Wikipedia. I do this by building a coreset from streaming data. Wikipedia itself is 50 TB, but after a year of processing I was given a 14 GB dataset, which I need to parse, and I want the parsing to be fast. The client sends the server 10,000 points at a time, which are then processed.
CODE:
import gc
import time


class point:
    def __init__(self, x, y, w):
        self.x = x
        self.y = y
        self.w = w


def parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y, w = [int(v) for v in s]
            obj = point(x, y, w)
            gc.disable()
            myList.append(obj)
            gc.enable()
    return myList


if __name__ == "__main__":
    startTime = time.time()
    L = parse('C:/Users/user/workspace/finalSub/api/data100mb.txt')
    print("--- %s seconds ---" % (time.time() - startTime))
Parsing a 114 MB file takes 130 seconds, but I've been told it should take only a few seconds.
I tried splitting the data into multiple chunks and applying multiprocessing, but reading from multiple files turned out to be worse and made parsing take even longer.
File Sample:
1 1 1
1 2 1
1 3 1
1 4 1
.
.
1 N 1
2 1 1
2 2 1
2 3 1
2 4 1
.
.
2 N 1
How should I correctly parse the file, or access it, in order to make parsing faster?
2 Answers
Avoid reinventing the wheel, especially with common tasks that are likely to have optimized implementations. In this case, I recommend trying pandas.read_csv() to read the file. Then the resulting matrix can be fed to sklearn.cluster.KMeans().
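A rough sketch of how those two pieces could fit together; the column names, whitespace separator, file path and cluster count are illustrative assumptions, not details from the question:

import pandas as pd
from sklearn.cluster import KMeans

# Read the whitespace-delimited "x y w" triples into a DataFrame
# (the path and column names are placeholders).
df = pd.read_csv('data100mb.txt', sep=r'\s+', header=None, names=['x', 'y', 'w'])

# Cluster the (x, y) points, using w as per-point sample weights.
# n_clusters=8 is an arbitrary example value.
km = KMeans(n_clusters=8, n_init=10)
km.fit(df[['x', 'y']].to_numpy(), sample_weight=df['w'].to_numpy())

print(km.cluster_centers_)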
- It is a problem. I must have the points in a list, or the 14 GB will grow to 50 TB. A Wikipedia article contains 5000 words, for example; w is the weight, which is always 1. Each point corresponds to an (x, y) where x is the Wikipedia article and y corresponds to the word. The word "else" might be 20000. If I turn this into a matrix I will have a huge waste of space. Or am I missing something, sir? Each word is assigned a number id, i.e. "Else" --> 20000. – Tony Tannous, Mar 9, 2017 at 9:15
- Why would the resulting matrix from read_csv() be any larger than your list of point objects? Either way, it's the same two-dimensional table containing the same data. – 200_success, Mar 9, 2017 at 9:42
- The bottleneck ended up being the data sent to the server. Sending objects was costly! I moved to binary and it went 40 times faster. :) – Tony Tannous, Mar 10, 2017 at 17:31
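As a side note on that last comment, here is a minimal sketch of what packing the points into a fixed-size binary format (instead of sending Python objects) could look like; the struct layout and helper names are assumptions for illustration, not the poster's actual code:

import struct

def pack_points(points):
    # Pack each (x, y, w) triple as three little-endian 32-bit ints
    # (12 bytes per point). An illustrative guess at "moving to binary".
    buf = bytearray()
    for p in points:
        buf += struct.pack('<iii', p.x, p.y, p.w)
    return bytes(buf)

def unpack_points(data):
    # Recover (x, y, w) tuples from the packed buffer.
    return list(struct.iter_unpack('<iii', data))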
I would try the following things:
- remove the gc manipulation code
- use __slots__ for the Point class - it should result in memory-handling and performance improvements:

      class Point:
          __slots__ = ['x', 'y', 'w']

          def __init__(self, x, y, w):
              self.x = x
              self.y = y
              self.w = w
- use a list comprehension instead of appending (generally faster):

      def parse(path):
          with open(path) as f:
              return [Point(*(int(s) for s in line.split()))
                      for line in f]
- try the PyPy Python implementation (with the latest PyPy on my machine, the 100k-line file is parsed about 4 times faster than with Python 3.6); a small timing sketch follows below
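A small, hypothetical timing harness for comparing parser variants on the same file; the file path and repeat count are placeholders:

import time

def time_parser(parser, path, repeats=3):
    # Run the given parser a few times and report the best wall-clock time.
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        parser(path)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage, comparing the original loop-based parse()
# with the list-comprehension version:
# print(time_parser(parse, 'data100mb.txt'))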
- Up from 130 seconds to 170 seconds. :-( – Tony Tannous, Mar 8, 2017 at 20:52
- @TonyTannous what about PyPy? – alecxe, Mar 8, 2017 at 21:04
- I need to try it, though I'd rather have a software gain than some technique from outside the code. – Tony Tannous, Mar 8, 2017 at 21:07
- By convention, class names are capitalized, so the class should be named Point. Variables' and functions' names should be written in lower case, with underscores if the name consists of more than one word. So your variables should be named my_list, l and start_time, respectively.