This code finds duplicate strings within a file.
An example: DUPLICATE_LENGTH set to 6, file contains:
process hash michael honeycomb interrupt system call deadlock scheduling michael
The output will be
michael
as it's a duplicate with a length of 6 or higher.
The following code shows my approach to solving this problem. If there are completely different ideas for how to solve it, I'm open to those too, but primarily I'd like to get some feedback on the code I wrote.
I'm also looking for performance optimizations, as it gets quite slow on large files.
import codecs

# Constants
DUPLICATE_LENGTH = 30;
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True

def showDuplicate(duplicate):
    print("Duplicate found:{0}".format(duplicate))

fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()

substringList = [] # contains all possible duplicates
duplicatesList = [] # contains all duplicates

for i in range(0, len(fileContent) - DUPLICATE_LENGTH):
    end = i + DUPLICATE_LENGTH
    duplicate = fileContent[i:end]
    if duplicate in substringList and '|' not in duplicate:
        duplicatesList.append(duplicate)
    else:
        substringList.append(duplicate)

resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1
for i in range(1, len(duplicatesList)):
    if currentMatch[currentMatchPos:] in duplicatesList[i]:
        currentMatch += duplicatesList[i][-1]
        currentMatchPos += 1
    else:
        if SHOW_RESULTS_WHILE_PROCESSING:
            showDuplicate(currentMatch)
        resultList.append(currentMatch) # this match is closed, add it!
        currentMatch = duplicatesList[i]
        currentMatchPos = 1

if not SHOW_RESULTS_WHILE_PROCESSING:
    for duplicate in resultList:
        showDuplicate(duplicate)
if not resultList:
    print "Awesome, no duplicates were found!"
related: Efficient way to find longest duplicate string for Python (From Programming Pearls) (jfs, Nov 28, 2012)
3 Answers
Firstly, your code doesn't actually work. Run it against your example data with the matching DUPLICATE_LENGTH; it produces no results. So I'm going to look at your code stylistically and ignore the algorithm for now:
import codecs
# Constants
DUPLICATE_LENGTH = 30;
No need for that semicolon
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True
def showDuplicate(duplicate):
The Python style guide (PEP 8) recommends words_separated_by_underscores for function names.
    print("Duplicate found:{0}".format(duplicate))
fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()
The Python style guide also recommends words_separated_by_underscores for variable names.
substringList = [] #contains all possible duplicates.
duplicatesList = [] #contains all duplicates
for i in range(0, len(fileContent) - DUPLICATE_LENGTH):
No need for the 0, range(x) == range(0, x)
    end = i + DUPLICATE_LENGTH
    duplicate = fileContent[i:end]
I'd combine those two lines. I'd also not call it a duplicate, as it's not a duplicate, just a substring.
    if duplicate in substringList and '|' not in duplicate:
I'm not clear what significance the '|' has.
        duplicatesList.append(duplicate)
    else:
        substringList.append(duplicate)
resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1
Any sort of complex logic should really be in a function, not at the module level.
for i in range(1, len(duplicatesList)):
Do you really need the indexes? Maybe you should use for duplicate in duplicatesList[1:]
    if currentMatch[currentMatchPos:] in duplicatesList[i]:
        currentMatch += duplicatesList[i][-1]
        currentMatchPos += 1
    else:
        if SHOW_RESULTS_WHILE_PROCESSING:
            showDuplicate(currentMatch)
        resultList.append(currentMatch) # this match is closed, add it!
        currentMatch = duplicatesList[i]
        currentMatchPos = 1
I dislike the structure of this code. You are manipulating several variables across loop iterations, which makes the code hard to follow. But since the algorithm is simply incorrect for what you are doing, I can't really tell you how to fix it.
if not SHOW_RESULTS_WHILE_PROCESSING:
    for duplicate in resultList:
        showDuplicate(duplicate)
if not resultList:
    print "Awesome, no duplicates were found!"
Ok, now for the actual algorithm. I think wilberforce has misinterpreted what you want. His algorithm requires the duplication to be aligned, which I don't think you want.
The simple brute force strategy is as follows:
# look at every pair of positions where the duplicates could start
for x_idx in range(len(file_content)):
    for y_idx in range(x_idx):
        # count the length of the duplicate
        for length, (x_letter, y_letter) in enumerate(zip(file_content[x_idx:], file_content[y_idx:])):
            if x_letter != y_letter:
                break
        # if it's good enough, print it
        if length > DUPLICATE_LENGTH:
            showDuplicate(file_content[x_idx:x_idx + length])
A more efficient and clever algorithm is to use dynamic programming:
# we create a 2D list (list of lists) where lengths[x][y] will be the length
# of the duplicate which ends on letters x-1 and y-1 in the file content,
# so we will calculate the length of the duplicate for every combination of two positions
file_size = len(file_content) + 1
lengths = [[None] * file_size for x in range(file_size)]

# If we are before the characters, the length of the duplicate string is
# obviously zero
for x in range(file_size):
    lengths[x][0] = 0
    lengths[0][x] = 0

for x in range(1, file_size):
    # we don't check cases where x == y or y > x:
    # x == y is trivial and y > x is just
    # the transpose
    for y in range(1, x):
        if file_content[x-1] == file_content[y-1]:
            # if two letters match, the duplicate length
            # is 1 plus whatever we had before
            duplicate_length = lengths[x][y] = lengths[x-1][y-1] + 1
            if duplicate_length > DUPLICATE_LENGTH:
                showDuplicate(file_content[x-duplicate_length:x])
        else:
            # if the letters don't match, the duplicate ends here
            lengths[x][y] = 0
My two programs produce slightly different results. The first says:
Duplicate found: michael
Duplicate found:michael
The second says:
Duplicate found: michae
Duplicate found: michael
The difference is because the actual match is 7 characters long. As a result, it matches twice. I don't know what results you actually do want, so I haven't looked into it.
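One way to report each duplicate only once, at its full length, is to emit a match only when it cannot be extended by the next character. A hedged sketch of that idea as a standalone function (the name `maximal_duplicates` and the `>= min_len` threshold are mine, not from the code above):

```python
def maximal_duplicates(file_content, min_len):
    """Return each duplicate of at least min_len characters once,
    at its full length, by only reporting matches that cannot be
    extended by the following character."""
    n = len(file_content)
    lengths = [[0] * (n + 1) for _ in range(n + 1)]
    results = set()
    for x in range(1, n + 1):
        for y in range(1, x):
            if file_content[x - 1] == file_content[y - 1]:
                lengths[x][y] = lengths[x - 1][y - 1] + 1
                # extendable: both next characters exist and agree,
                # so a longer duplicate will be reported later
                extendable = (x < n and
                              file_content[x] == file_content[y])
                if lengths[x][y] >= min_len and not extendable:
                    results.add(file_content[x - lengths[x][y]:x])
    return results
```

On `"balwmichaelkajfalsdfamsmichaelasjfal"` with `min_len=6`, this reports only `michael`, because the interior length-6 match `michae` can still be extended by `l` and is therefore skipped.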
That's brilliant! Exactly what I was looking for. A clean code analysis. Can you additionally show how you would merge the output to a consistent form efficiently? "interr interru interrup interrupt" -> "interrupt". I think it's about finding the 'longest' match. (Michael, Dec 29, 2011)
@Michael, the easiest way is just to check whether the next letter matches. If it does, we don't have the longest match, and we shouldn't report that match. (Winston Ewert, Jan 5, 2012)
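The Programming Pearls technique linked in the comment on the question attacks the related longest-duplicate problem by sorting suffixes and comparing neighbours. A hedged sketch (function name mine; note each suffix comparison can itself be O(n), so this is not linear-time):

```python
def longest_duplicate(text):
    """Longest substring that occurs at least twice, found by
    sorting all suffixes: the answer is the longest common prefix
    of some pair of neighbours in the sorted order."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # length of the common prefix of two adjacent suffixes
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best
```

This also handles overlapping repeats, e.g. `longest_duplicate("banana")` returns `"ana"`.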
Loop through the lines of the file, rather than reading it all into memory with fileHandle.read(). You can use line.split() to break the line into a list of words. Keep the words seen as a set, and the duplicates as another set. Each time you do in on a list, it has to scan through each element. Sets can do membership checking much more quickly.
I'd do this:
# Constants
DUPLICATE_LENGTH = 30
FILE_NAME = "data.txt"

seen = set()
duplicates = set()
with open(FILE_NAME) as input_file:
    for line in input_file:
        for word in line.split():
            if len(word) >= DUPLICATE_LENGTH and word in seen:
                duplicates.add(word)
            seen.add(word)
Edit: to look for duplicate sequences of length DUPLICATE_LENGTH, you could use a generator to yield successive chunks:
DUPLICATE_LENGTH = 30
CHUNKS = 100

def read_sequences(stream, size):
    while True:
        chunk = stream.read(size * CHUNKS)
        if not chunk:
            break
        for i in range(len(chunk) - size + 1):
            yield chunk[i:i + size]
Then the loop would become:
seen = set()
duplicates = set()
with open(FILE_NAME) as input_file:
    for word in read_sequences(input_file, DUPLICATE_LENGTH):
        if word in seen:
            duplicates.add(word)
        seen.add(word)
Thanks for your comment. There's one difference in processing the data with my code: it can only recognize single words; probably my example lacked a bit. Another one: what about "balwmichaelkajfalsdfamsmichaelasjfal" - find "michael". (Michael, Dec 28, 2011)
Ah, ok - so you're looking for repeated sequences of length DUPLICATE_LENGTH? I'll update with more info. (wilberforce, Dec 28, 2011)
To find a duplicate of length 6 or more in a string, you could use regular expressions:
>>> import re
>>> data = "balwmichaelkajfalsdfamsmichaelasjfal"
>>> m = re.search(r"(.{6,}).*?\1", data)
>>> m and m.group(1)
'michael'
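To generalize this to an arbitrary DUPLICATE_LENGTH and collect every repeat rather than only the first, the pattern can be built at runtime. A hedged sketch (function name mine; the greedy group keeps the longest repeat at each position, and re.DOTALL is my addition so repeats may span newlines):

```python
import re

def find_repeats(data, min_len):
    """Collect substrings of at least min_len characters that occur
    again later in data, using the same backreference idea: the
    group captures a candidate and \1 demands a second occurrence."""
    pattern = re.compile(r"(.{%d,}).*?\1" % min_len, re.DOTALL)
    repeats = []
    pos = 0
    while True:
        m = pattern.search(data, pos)
        if not m:
            break
        repeats.append(m.group(1))
        pos = m.end(1)  # resume after the first occurrence
    return repeats
```

The same backtracking performance caveats as the bare regex apply, so an upper bound on the group (e.g. `{6,1000}`) is still a sensible guard on large inputs.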
nice! Do you know how efficient it is? (Winston Ewert, Jan 5, 2012)
@Winston Ewert: it depends on the re implementation and the input data. It can be very inefficient, e.g., "abcdefgh"*100 takes 3 ms and "abcdefgh"*1000 takes 2 seconds (10x input leads to 1000x time). It can be fixed in this case by specifying an upper limit on the duplicate length, r"(.{6,1000}).*?\1", or by using the regex module, which has a different implementation. (jfs, Jan 5, 2012)