Reading a file representing word frequencies in Wikipedia, for clustering analysis

I am writing a system that runs k-means to determine the most-used words in Wikipedia. I do this by building a coreset from streaming data. The full Wikipedia dump is about 50 TB, but after a year of preprocessing I was given a 14 GB file, which I now need to parse as quickly as possible. The client sends the server batches of 10,000 points, which are then processed.
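To illustrate the batching, here is a minimal sketch of a helper that yields the parsed points in groups of 10,000; the function name batches and the plain list slicing are my own placeholders, and the actual client/server transport is omitted.

def batches(points, batch_size=10000):
    # Yield the parsed points in consecutive groups of batch_size
    # (10,000 here, matching the batch size sent from client to server).
    for i in range(0, len(points), batch_size):
        yield points[i:i + batch_size]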
CODE:
import gc
import time

class point:
    def __init__(self, x, y, w):
        self.x = x
        self.y = y
        self.w = w

def parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            s = line.split()
            x, y, w = [int(v) for v in s]
            obj = point(x, y, w)
            # the garbage collector is switched off and back on around every single append
            gc.disable()
            myList.append(obj)
            gc.enable()
    return myList

if __name__ == "__main__":
    startTime = time.time()
    L = parse('C:/Users/user/workspace/finalSub/api/data100mb.txt')
    print("--- %s seconds ---" % (time.time() - startTime))
Parsing a 114 MB file takes about 130 seconds, when I've been told it should take only a few seconds.
I tried splitting the data into multiple chunks and applying multiprocessing, but reading from multiple files turned out to be even slower to parse.
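Roughly, the chunked attempt had the shape sketched below; the chunk file names and the pool size of four are hypothetical placeholders rather than my real setup, and the worker collects plain (x, y, w) tuples instead of point instances so it stays self-contained.

import multiprocessing as mp

def parse_chunk(chunk_path):
    # Same per-line split-and-int parsing as parse() above, applied to one
    # chunk file, collecting plain (x, y, w) tuples.
    points = []
    with open(chunk_path) as f:
        for line in f:
            x, y, w = line.split()
            points.append((int(x), int(y), int(w)))
    return points

if __name__ == "__main__":
    # Hypothetical chunk file names; the real file was split beforehand.
    chunk_paths = ['chunk0.txt', 'chunk1.txt', 'chunk2.txt', 'chunk3.txt']
    with mp.Pool(processes=4) as pool:
        per_chunk = pool.map(parse_chunk, chunk_paths)
    allPoints = [p for chunk in per_chunk for p in chunk]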
File Sample:
1 1 1
1 2 1
1 3 1
1 4 1
.
.
1 N 1
2 1 1
2 2 1
2 3 1
2 4 1
.
.
2 N 1
How should I parse the file, or change how I access it, to make it faster?
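For reference, a minimal variant of parse() that toggles the garbage collector once around the whole load, rather than once per append, and stores plain (x, y, w) tuples instead of point instances. This is only a sketch assuming the whitespace-separated integer format shown in the sample; I have not benchmarked it.

import gc

def parse_tuples(pathToFile):
    # Disable the GC once for the entire load instead of per appended point.
    points = []
    gc.disable()
    try:
        with open(pathToFile) as f:
            for line in f:
                x, y, w = line.split()
                points.append((int(x), int(y), int(w)))
    finally:
        gc.enable()
    return points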