Code Review


Speeding up script for processing three huge files

I have three huge files that I need to process in order to rearrange the data they contain.

The first file is a list of English example sentences (468,785 lines long). A typical line from this file looks like this:

120033#eng#Creativity is an important aspect for the development of human.

The second file is a list (7,243,419 lines) that tells me the ID of the equivalent sentence in the third file. A typical line from the second file looks like this:

1#77

This tells me that the English sentence with ID "1" in the first file matches a translation with ID "77" in the third file.

The third file contains a list of the translated sentences (2,945,676 lines). A typical line from this file looks like this:

1#cmn#我們試試看!
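To make the formats concrete, here is a quick sketch of how each line type splits on the "#" separator (the variable names are just for illustration):

# -*- coding: utf-8 -*-
eng = "120033#eng#Creativity is an important aspect for the development of human."
eng_id, eng_lang, eng_text = eng.split("#", 2)  # maxsplit=2 in case the sentence itself contains "#"

link = "1#77"
eng_ref, trans_ref = link.strip().split("#")  # English ID "1" -> translation ID "77"

trans = "1#cmn#我們試試看!"
trans_id, trans_lang, trans_text = trans.split("#", 2)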

Here is the script I am using to take each line from the first file, find which sentences it links to in the second file, and then get the matching sentences from the third file:

with open("eng_lines.txt") as f:
 eng_lines = f.readlines()
with open("all_lines.txt") as f:
 all_lines = f.readlines()
with open("all_links.txt") as f:
 all_links = f.readlines()
for ln,line in enumerate(eng_lines):
 print ln,len(eng_lines)
 with open("matches.txt", "a") as matches:
 matches.write(line+"\n")
 hash = line.index("#")
 sentence_idA = line[:hash]
 for line2 in all_links:
 hash = line2.index("#")
 link_a = line2[:hash]
 link_b = line2[hash+1:].strip()
 if (link_a==sentence_idA):
 for line3 in all_lines:
 hash = line3.index("#")
 sentence_idB = line3[:hash]
 if (sentence_idB==link_b):
 matches.write(line3+"\n")

This process is going to take a LONG time (about a year, given that each iteration currently takes about a minute to process on my i7 PC).
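That estimate comes straight from the outer loop: one pass per English line, at about a minute each.

iterations = 468785          # lines in eng_lines.txt, one outer-loop pass each
seconds = iterations * 60    # each pass currently takes about a minute
print seconds / 86400.0      # ~325 days, i.e. close to a year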

praine