Reading a file representing word frequencies in Wikipedia, for clustering analysis

I am writing a system that runs k-means to determine the most-used words in Wikipedia. I do this by building a coreset from streaming data. The full Wikipedia dump is about 50 TB, but after a year of preprocessing I was given a 14 GB file, which I now need to parse as quickly as possible. The client sends the server batches of 10,000 points, which are then processed.
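To illustrate the batching, here is a minimal sketch of a helper that yields the parsed points in groups of 10,000; the function name batches and the plain list slicing are my own placeholders, and the actual client/server transport is omitted.

def batches(points, batch_size=10000):
    # Yield the parsed points in consecutive groups of batch_size
    # (10,000 here, matching the batch size sent from client to server).
    for i in range(0, len(points), batch_size):
        yield points[i:i + batch_size]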
CODE:
import gc
import time

class point:
    def __init__(self, x, y, w):
        self.x = x
        self.y = y
        self.w = w

def parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y, w = [int(v) for v in s]
            obj = point(x, y, w)
            # the garbage collector is switched off and back on around every single append
            gc.disable()
            myList.append(obj)
            gc.enable()
    return myList

if __name__ == "__main__":
    startTime = time.time()
    L = parse('C:/Users/user/workspace/finalSub/api/data100mb.txt')
    print("--- %s seconds ---" % (time.time() - startTime))
Parsing a 114 MB file takes about 130 seconds, when I've been told it should take only a few seconds.
I tried splitting the data into multiple chunks and applying multiprocessing, but reading from multiple files turned out to be even slower to parse.
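Roughly, the chunked attempt had the shape sketched below; the chunk file names and the pool size of four are hypothetical placeholders rather than my real setup, and the worker collects plain (x, y, w) tuples instead of point instances so it stays self-contained.

import multiprocessing as mp

def parse_chunk(chunk_path):
    # Same per-line split-and-int parsing as parse() above, applied to one
    # chunk file, collecting plain (x, y, w) tuples.
    points = []
    with open(chunk_path) as f:
        for line in f:
            x, y, w = line.split()
            points.append((int(x), int(y), int(w)))
    return points

if __name__ == "__main__":
    # Hypothetical chunk file names; the real file was split beforehand.
    chunk_paths = ['chunk0.txt', 'chunk1.txt', 'chunk2.txt', 'chunk3.txt']
    with mp.Pool(processes=4) as pool:
        per_chunk = pool.map(parse_chunk, chunk_paths)
    allPoints = [p for chunk in per_chunk for p in chunk]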
File Sample:
1 1 1
1 2 1
1 3 1
1 4 1
.
.
1 N 1
2 1 1
2 2 1
2 3 1
2 4 1
.
.
2 N 1
How should I parse the file, or change how I access it, to make it faster?
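For reference, a minimal variant of parse() that toggles the garbage collector once around the whole load, rather than once per append, and stores plain (x, y, w) tuples instead of point instances. This is only a sketch assuming the whitespace-separated integer format shown in the sample; I have not benchmarked it.

import gc

def parse_tuples(pathToFile):
    # Disable the GC once for the entire load instead of per appended point.
    points = []
    gc.disable()
    try:
        with open(pathToFile) as f:
            for line in f:
                x, y, w = line.split()
                points.append((int(x), int(y), int(w)))
    finally:
        gc.enable()
    return points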