Is there any way to make this Python script for processing three huge files faster?
I have three huge files I need to process in order to rearrange the data contained within.
The first file is a list of English example sentences (468,785 lines long). A typical line from this file looks like this:
120033#eng#Creativity is an important aspect for the development of human.
The second file is a list (7,243,419 lines) which tells me the ID of the equivalent sentence in the third file. A typical line from the second file looks like this:
1#77
This tells me that the English sentence with ID "1" in the first file matches a translation with ID "77" in the third file.
The third file contains a list of the translated sentences (2,945,676 lines). A typical line from this file looks like this:
1#cmn#我們試試看!
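All three files use "#" as the field separator, so each record can be pulled apart with str.split. For illustration, using the sample lines above (the maxsplit argument of 2 protects against a "#" appearing inside the sentence text):

    line = "120033#eng#Creativity is an important aspect for the development of human."
    sentence_id, lang, text = line.split("#", 2)
    # sentence_id == "120033", lang == "eng"

    link = "1#77"
    eng_id, trans_id = link.split("#")
    # eng_id == "1", trans_id == "77"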
Here is the script I am using to take each line from file 1, find the sentence IDs it links to in file 2, and then pull the matching sentences from file 3:
with open("eng_lines.txt") as f:
eng_lines = f.readlines()
with open("all_lines.txt") as f:
all_lines = f.readlines()
with open("all_links.txt") as f:
all_links = f.readlines()
for ln,line in enumerate(eng_lines):
print ln,len(eng_lines)
with open("matches.txt", "a") as matches:
matches.write(line+"\n")
hash = line.index("#")
sentence_idA = line[:hash]
for line2 in all_links:
hash = line2.index("#")
link_a = line2[:hash]
link_b = line2[hash+1:].strip()
if (link_a==sentence_idA):
for line3 in all_lines:
hash = line3.index("#")
sentence_idB = line3[:hash]
if (sentence_idB==link_b):
matches.write(line3+"\n")
This process is going to take a LONG time: at roughly a minute per iteration on my i7 PC, the whole run works out to about a year.
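From what I've read, the fix is to stop rescanning files 2 and 3 for every English sentence and instead index them once in dictionaries keyed by ID, so each lookup is constant-time. Here is an untested sketch of what I mean (it assumes each translation ID appears at most once in all_lines.txt, and it writes lines back unmodified rather than appending an extra "\n"):

    from collections import defaultdict

    # Index file 3 once: translation ID -> full line.
    trans_by_id = {}
    with open("all_lines.txt") as f:
        for line3 in f:
            trans_by_id[line3.split("#", 1)[0]] = line3

    # Index file 2 once: English ID -> list of translation IDs.
    links = defaultdict(list)
    with open("all_links.txt") as f:
        for line2 in f:
            link_a, link_b = line2.strip().split("#")
            links[link_a].append(link_b)

    # Single pass over file 1 with O(1) dictionary lookups.
    with open("eng_lines.txt") as f, open("matches.txt", "w") as matches:
        for line in f:
            matches.write(line)
            sentence_idA = line.split("#", 1)[0]
            for link_b in links.get(sentence_idA, []):
                if link_b in trans_by_id:
                    matches.write(trans_by_id[link_b])

If I have that right, it replaces the nested scans (roughly 468,785 × 7,243,419 comparisons) with three linear passes, at the cost of holding the two indexes in memory, which should be fine given that my current script already loads all three files with readlines(). Is that the right direction?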
Any advice on how to speed up this process would be very much appreciated.
Thanks in advance.