Remove repetitive strings from a given sentence efficiently

Question 1

I recently gave a test on HackerRank:

Given a string, return the most concise string that can be formed from it.

For example:
string = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
The following string contains many duplicates like "Watson represents." We have to ignore extra spacing between characters or lower/upper case. "watson Represents" and "watson represents" are the same thing.

Semicolons and commas represent the same thing. For example, "Cognitive Systems; a new era" is present inside "Watson represents a first step into cognitive systems, a new era of computing."

Your final string should not contain any duplicates. Ignore lowercase/uppercase or extra space if you have any.

My answer:

watson = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
import re
watson = re.sub(r';', r',', watson) #replace the semicolon with colon
watson = watson.strip().split('|')
removeWatson = list(watson)
for i in range(len(watson)):
 for j in range(len(watson)):
 if j == i:
 continue
 if " ".join(watson[i].strip().lower().split()) in " ".join(watson[j].strip().lower().split()):
 removeWatson[i] = ''
print "|".join(filter(None, removeWatson))

My answer is definitely inefficient and I am wondering if you can suggest an alternative way of solving this problem.

Most concise string:

Watson represents a first step into cognitive systems, a new era of computing.|what does watson represent

Question 2

I cannot find this problem on hackerrank.com. Can you update your post to include a link to it, please?

Question 3

Hey it was on a Hacker rank test question, I don't think they generally publish such questions on their public website.

Question 4

Here's my shot at this puzzle.

Benchmark of original code, over 1,000,000 iterations (with print() commented out):

real 1m8.486s
user 0m0.000s
sys 0m0.000s

Benchmark of my code, over 1,000,000 iterations (with print() commented out):

real 0m16.226s
user 0m0.000s
sys 0m0.015s

Performance gain: 422.07%

It looks like you want to preserve the string order and uppercase/lowercase, so I took that into account. Otherwise this could be even faster/shorter.

The strategy here is to preprocess the strings, instead of going through strip(), lower(), and split() each time you need to do a comparison. Additionally, sort the strings by length, so you only perform comparisons you need to perform. See comments for more details.

WATSON_RAW = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
# 1. Preprocess the string.
# Remove extraneous whitespace, replace semicolon with comma.
modified = ' '.join(WATSON_RAW.split())
modified = modified.replace(';', ',')
# 2. Split string by delimiter.
split = modified.split('|')
# 3. Generate string metadata, sorted by length
# Keep track of the string contents, its index in the "split" variable, and its length.
# Tuple has a speed advantage over list.
metadata = [(i, len(s), s.lower()) for i, s in enumerate(split)]
metadata = sorted(metadata, key=lambda x: x[1])
# 4. Uniqify the strings.
indexes_of_unique_strings = []
for index, _, string_lowercase1 in metadata:
 # Only search from current index on, skipping comparisions with earlier indexes.
 for _, _, string_lowercase2 in metadata[i:]:
 if string_lowercase1 in string_lowercase2 and string_lowercase1 != string_lowercase2:
 # This string is redundant.
 break
 else:
 # This string is unique.
 indexes_of_unique_strings.append(index)
# 5. Map indexes back to original, correctly capitalized strings, in original order.
indexes_of_unique_strings = sorted(indexes_of_unique_strings)
final = '|'.join(split[i] for i in indexes_of_unique_strings)
print(final)

Question 5

# Replace the semicolon with colon is a misleading comment, because you're actually replacing with a comma, not a colon. Be careful to make sure comments are accurate.

Also you don't really need to use r in front of the strings. You might have mistakenly thought that's a regex convention in Python but it's actually just something called a raw string. It tells Python not to treat backslashes as escape characters, so it will interpret: r'\n\n\t\r' literally, instead of turning them into special whitespace characters. This is often used in regex, but not necessary for your semicolon and comma strings.

You make a copy of watson called removeWatson, but I don't understand why. You use watson to test the length but then you make sure not to change the length of removeWatson anyway. You've modified watson at the start so it's not about data preservation. I suspect that this is a remnant of an approach that ended up not working and you never thought to tidy it away. Unless you intend to print original watson alongside your modified one, I'd only use watson. And if you intend to print them together, then make one entirely untouched copy of watson at the start and do everything in the copy.

To run two for loops at once, it's neater to use product from the itertools module. It basically does what your nested for loops do, but it saves you a level of nesting and makes it clear that it's a simple nested loop. I'd also make the range(len(watson)) up front rather than typing it twice.

from itertools import product
loop = range(len(watson))
for i, j in product(loop, loop):

Your line where all the magic happens is quite unclear, if you explained how it works then it'd make your code much more readable.

And lastly, why are you filtering out None in the end? If you're going to do that you could set None instead of empty strings (it's marginally clearer to use None), but it signals that maybe you've gotten errant None values and this is your way of masking them. If there's a better reason I've missed then it's worth noting in a comment.

Question 6

I actually can't think of a greatly more efficient way to do this. Just some general Python style tips:

I would change the variable name removeWatson to watson_list, to use a more informative name, (also underscores are more common in python than camel case)

I would do the "ignore uppercase/lowercase" and "remove extra spaces" like this:

watson = watson.lower()
watson_list = [w.strip() for w in watson.split('|')]

Now that you've done this, your code is a more concise. Also if you set the range of your inner loop as starting at i then you don't need the j==i check inside the loop.

for i in range(len(watson_list)):
 for j in range(i, len(watson_list)):
 if watson_list[i] in watson_list(j):
 watson_list[i] = ''

Question 7

Thank you for comments but there is a mistake in your code. The second loop will always start from 0 not i because otherwise you will miss out some repetitive strings. Try to run my code with your approach, you will see the difference.

midrare midrare 1413 bronze badges · Answer 1 · 2015-10-20 05:07:07Z

Here's my shot at this puzzle.

Benchmark of original code, over 1,000,000 iterations (with print() commented out):

real 1m8.486s
user 0m0.000s
sys 0m0.000s

Benchmark of my code, over 1,000,000 iterations (with print() commented out):

real 0m16.226s
user 0m0.000s
sys 0m0.015s

Performance gain: 422.07%

It looks like you want to preserve the string order and uppercase/lowercase, so I took that into account. Otherwise this could be even faster/shorter.

The strategy here is to preprocess the strings, instead of going through strip(), lower(), and split() each time you need to do a comparison. Additionally, sort the strings by length, so you only perform comparisons you need to perform. See comments for more details.

WATSON_RAW = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
# 1. Preprocess the string.
# Remove extraneous whitespace, replace semicolon with comma.
modified = ' '.join(WATSON_RAW.split())
modified = modified.replace(';', ',')
# 2. Split string by delimiter.
split = modified.split('|')
# 3. Generate string metadata, sorted by length
# Keep track of the string contents, its index in the "split" variable, and its length.
# Tuple has a speed advantage over list.
metadata = [(i, len(s), s.lower()) for i, s in enumerate(split)]
metadata = sorted(metadata, key=lambda x: x[1])
# 4. Uniqify the strings.
indexes_of_unique_strings = []
for index, _, string_lowercase1 in metadata:
 # Only search from current index on, skipping comparisions with earlier indexes.
 for _, _, string_lowercase2 in metadata[i:]:
 if string_lowercase1 in string_lowercase2 and string_lowercase1 != string_lowercase2:
 # This string is redundant.
 break
 else:
 # This string is unique.
 indexes_of_unique_strings.append(index)
# 5. Map indexes back to original, correctly capitalized strings, in original order.
indexes_of_unique_strings = sorted(indexes_of_unique_strings)
final = '|'.join(split[i] for i in indexes_of_unique_strings)
print(final)

score 3 · Answer 2 · 2015-10-20 10:01:30Z

# Replace the semicolon with colon is a misleading comment, because you're actually replacing with a comma, not a colon. Be careful to make sure comments are accurate.

Also you don't really need to use r in front of the strings. You might have mistakenly thought that's a regex convention in Python but it's actually just something called a raw string. It tells Python not to treat backslashes as escape characters, so it will interpret: r'\n\n\t\r' literally, instead of turning them into special whitespace characters. This is often used in regex, but not necessary for your semicolon and comma strings.

You make a copy of watson called removeWatson, but I don't understand why. You use watson to test the length but then you make sure not to change the length of removeWatson anyway. You've modified watson at the start so it's not about data preservation. I suspect that this is a remnant of an approach that ended up not working and you never thought to tidy it away. Unless you intend to print original watson alongside your modified one, I'd only use watson. And if you intend to print them together, then make one entirely untouched copy of watson at the start and do everything in the copy.

To run two for loops at once, it's neater to use product from the itertools module. It basically does what your nested for loops do, but it saves you a level of nesting and makes it clear that it's a simple nested loop. I'd also make the range(len(watson)) up front rather than typing it twice.

from itertools import product
loop = range(len(watson))
for i, j in product(loop, loop):

Your line where all the magic happens is quite unclear, if you explained how it works then it'd make your code much more readable.

And lastly, why are you filtering out None in the end? If you're going to do that you could set None instead of empty strings (it's marginally clearer to use None), but it signals that maybe you've gotten errant None values and this is your way of masking them. If there's a better reason I've missed then it's worth noting in a comment.

Max Flander Max Flander 1613 bronze badges · Answer 3 · 2015-10-20 04:27:14Z

I actually can't think of a greatly more efficient way to do this. Just some general Python style tips:

I would change the variable name removeWatson to watson_list, to use a more informative name, (also underscores are more common in python than camel case)

I would do the "ignore uppercase/lowercase" and "remove extra spaces" like this:

watson = watson.lower()
watson_list = [w.strip() for w in watson.split('|')]

Now that you've done this, your code is a more concise. Also if you set the range of your inner loop as starting at i then you don't need the j==i check inside the loop.

for i in range(len(watson_list)):
 for j in range(i, len(watson_list)):
 if watson_list[i] in watson_list(j):
 watson_list[i] = ''

Thank you for comments but there is a mistake in your code. The second loop will always start from 0 not i because otherwise you will miss out some repetitive strings. Try to run my code with your approach, you will see the difference.

Stack Exchange Network

Remove repetitive strings from a given sentence efficiently

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Remove repetitive strings from a given sentence efficiently

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions