This code finds duplicate strings within a file.
An example: DUPLICATE_LENGTH set to 6, file contains:
process hash michael honeycomb interrupt system call deadlock scheduling michael
The output will be
michael
as it's a duplicate with a length of 6 or higher.
The following code shows my approach to solving this problem. If there are completely different ideas for how to solve it, I'm open to those too, but primarily I'd like to get some feedback on the code I wrote.
I'm also looking for performance optimizations, as it gets quite slow on large files.
import codecs

# Constants
DUPLICATE_LENGTH = 30;
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True

def showDuplicate(duplicate):
    print("Duplicate found:{0}".format(duplicate))

fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()

substringList = [] # contains all possible duplicates
duplicatesList = [] # contains all duplicates

for i in range(0, len(fileContent) - DUPLICATE_LENGTH):
    end = i + DUPLICATE_LENGTH
    duplicate = fileContent[i:end]
    if duplicate in substringList and '|' not in duplicate:
        duplicatesList.append(duplicate)
    else:
        substringList.append(duplicate)

resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1
for i in range(1, len(duplicatesList)):
    if currentMatch[currentMatchPos:] in duplicatesList[i]:
        currentMatch += duplicatesList[i][-1]
        currentMatchPos += 1
    else:
        if SHOW_RESULTS_WHILE_PROCESSING:
            showDuplicate(currentMatch)
        resultList.append(currentMatch) # this match is closed, add it!
        currentMatch = duplicatesList[i]
        currentMatchPos = 1

if not SHOW_RESULTS_WHILE_PROCESSING:
    for duplicate in resultList:
        showDuplicate(duplicate)
if not resultList:
    print "Awesome, no duplicates were found!"
related: Efficient way to find longest duplicate string for Python (From Programming Pearls) (jfs, Nov 28, 2012)
3 Answers
Firstly, your code doesn't actually work. Run it against your example data with the matching DUPLICATE_LENGTH; it produces no results. So I'm going to look at your code stylistically and ignore the algorithm for now:
import codecs
# Constants
DUPLICATE_LENGTH = 30;
No need for that semicolon
FILE_NAME = "data.txt"
SHOW_RESULTS_WHILE_PROCESSING = True
def showDuplicate(duplicate):
The Python style guide (PEP 8) recommends words_separated_by_underscores for function names.
    print("Duplicate found:{0}".format(duplicate))
fileHandle = codecs.open(FILE_NAME, "r", "utf-8-sig")
fileContent = fileHandle.read()
fileHandle.close()
The Python style guide also recommends words_separated_by_underscores for variable names.
substringList = [] #contains all possible duplicates.
duplicatesList = [] #contains all duplicates
for i in range(0, len(fileContent) - DUPLICATE_LENGTH):
No need for the 0, range(x) == range(0, x)
    end = i + DUPLICATE_LENGTH
    duplicate = fileContent[i:end]
I'd combine those two lines. I'd also not call it a duplicate, as it's not a duplicate, just a substring.
    if duplicate in substringList and '|' not in duplicate:
I'm not clear what significance the '|' has.
        duplicatesList.append(duplicate)
    else:
        substringList.append(duplicate)
resultList = []
currentMatch = duplicatesList[0]
currentMatchPos = 1
Any sort of complex logic should really be in a function, not at the module level.
for i in range(1, len(duplicatesList)):
Do you really need the indexes? Maybe you should use for duplicate in duplicatesList[1:]
    if currentMatch[currentMatchPos:] in duplicatesList[i]:
        currentMatch += duplicatesList[i][-1]
        currentMatchPos += 1
    else:
        if SHOW_RESULTS_WHILE_PROCESSING:
            showDuplicate(currentMatch)
        resultList.append(currentMatch) # this match is closed, add it!
        currentMatch = duplicatesList[i]
        currentMatchPos = 1
I dislike the structure of this code. You are manipulating several variables across loop iterations, which makes the code hard to follow. But since the algorithm is simply incorrect for what you are doing, I can't really tell you how to fix it.
if not SHOW_RESULTS_WHILE_PROCESSING:
    for duplicate in resultList:
        showDuplicate(duplicate)
if not resultList:
    print "Awesome, no duplicates were found!"
Ok, now for the actual algorithm. I think wilberforce has misinterpreted what you want. His algorithm requires the duplication to be aligned, which I don't think you want.
The simple brute force strategy is as follows:
# look at every pair of positions where the duplicates could start
for x_idx in range(len(file_content)):
    for y_idx in range(x_idx):
        # count the length of the duplicate
        for length, (x_letter, y_letter) in enumerate(zip(file_content[x_idx:], file_content[y_idx:])):
            if x_letter != y_letter:
                break
        # if it's good enough, print it
        if length > DUPLICATE_LENGTH:
            showDuplicate(file_content[x_idx:x_idx + length])
A more efficient and clever algorithm is to use dynamic programming:
# we create a 2D list (list of lists) where lengths[x][y] will be the length
# of the duplicate which ends on letters x-1 and y-1 in the file content,
# so we will calculate the length of the duplicate for every combination of two positions
file_size = len(file_content) + 1
lengths = [[None] * file_size for x in range(file_size)]

# If we are before the characters, the length of the duplicate string is
# obviously zero
for x in range(file_size):
    lengths[x][0] = 0
    lengths[0][x] = 0

for x in range(1, file_size):
    # we don't check cases where x == y or y > x:
    # x == y is trivial and y > x is just
    # the transpose
    for y in range(1, x):
        if file_content[x-1] == file_content[y-1]:
            # if two letters match, the duplicate length
            # is 1 plus whatever we had before
            duplicate_length = lengths[x][y] = lengths[x-1][y-1] + 1
            if duplicate_length > DUPLICATE_LENGTH:
                showDuplicate(file_content[x-duplicate_length:x])
        else:
            # if the letters don't match, the duplicate ends here
            lengths[x][y] = 0
My two programs produce slightly different results. The first says:
Duplicate found: michael
Duplicate found:michael
The second says:
Duplicate found: michae
Duplicate found: michael
The difference is because the actual match is 7 characters long. As a result, it matches twice. I don't know what results you actually do want, so I haven't looked into it.
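One way to report each duplicate only once, at its full length, is to emit a match only when it cannot be extended by the next character. A hedged sketch of that idea as a standalone function (the name `maximal_duplicates` and the `>= min_len` threshold are mine, not from the code above):

```python
def maximal_duplicates(file_content, min_len):
    """Return each duplicate of at least min_len characters once,
    at its full length, by only reporting matches that cannot be
    extended by the following character."""
    n = len(file_content)
    lengths = [[0] * (n + 1) for _ in range(n + 1)]
    results = set()
    for x in range(1, n + 1):
        for y in range(1, x):
            if file_content[x - 1] == file_content[y - 1]:
                lengths[x][y] = lengths[x - 1][y - 1] + 1
                # extendable: both next characters exist and agree,
                # so a longer duplicate will be reported later
                extendable = (x < n and
                              file_content[x] == file_content[y])
                if lengths[x][y] >= min_len and not extendable:
                    results.add(file_content[x - lengths[x][y]:x])
    return results
```

On `"balwmichaelkajfalsdfamsmichaelasjfal"` with `min_len=6`, this reports only `michael`, because the interior length-6 match `michae` can still be extended by `l` and is therefore skipped.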
That's brilliant! Exactly what I was looking for. A clean code analysis. Can you additionally show how you would merge the output to a consistent form efficiently? "interr interru interrup interrupt" -> "interrupt". I think it's about finding the 'longest' match. (Michael, Dec 29, 2011)
@Michael, the easiest way is just to check whether the next letter matches. If it does, we don't have the longest match, and we shouldn't report that match. (Winston Ewert, Jan 5, 2012)
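The Programming Pearls technique linked in the comment on the question attacks the related longest-duplicate problem by sorting suffixes and comparing neighbours. A hedged sketch (function name mine; note each suffix comparison can itself be O(n), so this is not linear-time):

```python
def longest_duplicate(text):
    """Longest substring that occurs at least twice, found by
    sorting all suffixes: the answer is the longest common prefix
    of some pair of neighbours in the sorted order."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # length of the common prefix of two adjacent suffixes
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best
```

This also handles overlapping repeats, e.g. `longest_duplicate("banana")` returns `"ana"`.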
Loop through the lines of the file, rather than reading it all into memory with fileHandle.read(). You can use line.split() to break the line into a list of words. Keep the words seen as a set, and the duplicates as another set. Each time you do in on a list, it has to scan through each element. Sets can do membership checking much more quickly.
I'd do this:
# Constants
DUPLICATE_LENGTH = 30
FILE_NAME = "data.txt"

seen = set()
duplicates = set()
with open(FILE_NAME) as input_file:
    for line in input_file:
        for word in line.split():
            if len(word) >= DUPLICATE_LENGTH and word in seen:
                duplicates.add(word)
            seen.add(word)
Edit: to look for duplicate sequences of length DUPLICATE_LENGTH, you could use a generator to yield successive chunks:
DUPLICATE_LENGTH = 30
CHUNKS = 100

def read_sequences(stream, size):
    while True:
        chunk = stream.read(size * CHUNKS)
        if not chunk:
            break
        for i in range(len(chunk) - size + 1):
            yield chunk[i:i + size]
Then the loop would become:
seen = set()
duplicates = set()
with open(FILE_NAME) as input_file:
    for word in read_sequences(input_file, DUPLICATE_LENGTH):
        if word in seen:
            duplicates.add(word)
        seen.add(word)
Thanks for your comment. There's one difference in processing the data with my code: it can only recognize single words; probably my example lacked a bit. Another one: what about "balwmichaelkajfalsdfamsmichaelasjfal" - find "michael". (Michael, Dec 28, 2011)
Ah, ok - so you're looking for repeated sequences of length DUPLICATE_LENGTH? I'll update with more info. (wilberforce, Dec 28, 2011)
To find a duplicate of length 6 or more in a string, you could use regular expressions:
>>> import re
>>> data = "balwmichaelkajfalsdfamsmichaelasjfal"
>>> m = re.search(r"(.{6,}).*?\1", data)
>>> m and m.group(1)
'michael'
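To generalize this to an arbitrary DUPLICATE_LENGTH and collect every repeat rather than only the first, the pattern can be built at runtime. A hedged sketch (function name mine; the greedy group keeps the longest repeat at each position, and re.DOTALL is my addition so repeats may span newlines):

```python
import re

def find_repeats(data, min_len):
    """Collect substrings of at least min_len characters that occur
    again later in data, using the same backreference idea: the
    group captures a candidate and \1 demands a second occurrence."""
    pattern = re.compile(r"(.{%d,}).*?\1" % min_len, re.DOTALL)
    repeats = []
    pos = 0
    while True:
        m = pattern.search(data, pos)
        if not m:
            break
        repeats.append(m.group(1))
        pos = m.end(1)  # resume after the first occurrence
    return repeats
```

The same backtracking performance caveats as the bare regex apply, so an upper bound on the group (e.g. `{6,1000}`) is still a sensible guard on large inputs.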
nice! Do you know how efficient it is? (Winston Ewert, Jan 5, 2012)
@Winston Ewert: it depends on the re implementation and the input data. It can be very inefficient, e.g., "abcdefgh"*100 takes 3 ms and "abcdefgh"*1000 takes 2 seconds (10x input leads to 1000x time). It can be fixed in this case by specifying an upper limit on the duplicate length, r"(.{6,1000}).*?\1", or by using the regex module, which has a different implementation. (jfs, Jan 5, 2012)