4
\$\begingroup\$

I recently gave a test on HackerRank:

Given a string, return the most concise string that can be formed from it.

For example:

string = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
  • The following string contains many duplicates like "Watson represents." We have to ignore extra spacing between characters or lower/upper case. "watson Represents" and "watson represents" are the same thing.
  • Semicolons and commas represent the same thing. For example, "Cognitive Systems; a new era" is present inside "Watson represents a first step into cognitive systems, a new era of computing."
  • Your final string should not contain any duplicates. Ignore lowercase/uppercase or extra space if you have any.

My answer:

watson = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
import re
watson = re.sub(r';', r',', watson) #replace the semicolon with colon
watson = watson.strip().split('|')
removeWatson = list(watson)
for i in range(len(watson)):
 for j in range(len(watson)):
 if j == i:
 continue
 if " ".join(watson[i].strip().lower().split()) in " ".join(watson[j].strip().lower().split()):
 removeWatson[i] = ''
print "|".join(filter(None, removeWatson))

My answer is definitely inefficient and I am wondering if you can suggest an alternative way of solving this problem.

Most concise string:

Watson represents a first step into cognitive systems, a new era of computing.|what does watson represent

Pimgd
22.5k5 gold badges68 silver badges144 bronze badges
asked Oct 20, 2015 at 3:14
\$\endgroup\$
2
  • \$\begingroup\$ I cannot find this problem on hackerrank.com. Can you update your post to include a link to it, please? \$\endgroup\$ Commented Oct 20, 2015 at 14:02
  • \$\begingroup\$ Hey it was on a Hacker rank test question, I don't think they generally publish such questions on their public website. \$\endgroup\$ Commented Oct 20, 2015 at 23:04

3 Answers 3

4
\$\begingroup\$

Here's my shot at this puzzle.

Benchmark of original code, over 1,000,000 iterations (with print() commented out):

real 1m8.486s
user 0m0.000s
sys 0m0.000s

Benchmark of my code, over 1,000,000 iterations (with print() commented out):

real 0m16.226s
user 0m0.000s
sys 0m0.015s

Performance gain: 422.07%


It looks like you want to preserve the string order and uppercase/lowercase, so I took that into account. Otherwise this could be even faster/shorter.

The strategy here is to preprocess the strings, instead of going through strip(), lower(), and split() each time you need to do a comparison. Additionally, sort the strings by length, so you only perform comparisons you need to perform. See comments for more details.

WATSON_RAW = 'watson Represents|watson represents|Watson represents a first step into cognitive systems, a new era of computing.|first step into Cognitive|Cognitive Systems; a new era|what does watson represent'
# 1. Preprocess the string.
# Remove extraneous whitespace, replace semicolon with comma.
modified = ' '.join(WATSON_RAW.split())
modified = modified.replace(';', ',')
# 2. Split string by delimiter.
split = modified.split('|')
# 3. Generate string metadata, sorted by length
# Keep track of the string contents, its index in the "split" variable, and its length.
# Tuple has a speed advantage over list.
metadata = [(i, len(s), s.lower()) for i, s in enumerate(split)]
metadata = sorted(metadata, key=lambda x: x[1])
# 4. Uniqify the strings.
indexes_of_unique_strings = []
for index, _, string_lowercase1 in metadata:
 # Only search from current index on, skipping comparisions with earlier indexes.
 for _, _, string_lowercase2 in metadata[i:]:
 if string_lowercase1 in string_lowercase2 and string_lowercase1 != string_lowercase2:
 # This string is redundant.
 break
 else:
 # This string is unique.
 indexes_of_unique_strings.append(index)
# 5. Map indexes back to original, correctly capitalized strings, in original order.
indexes_of_unique_strings = sorted(indexes_of_unique_strings)
final = '|'.join(split[i] for i in indexes_of_unique_strings)
print(final)
answered Oct 20, 2015 at 5:07
\$\endgroup\$
3
\$\begingroup\$

# Replace the semicolon with colon is a misleading comment, because you're actually replacing with a comma, not a colon. Be careful to make sure comments are accurate.

Also you don't really need to use r in front of the strings. You might have mistakenly thought that's a regex convention in Python but it's actually just something called a raw string. It tells Python not to treat backslashes as escape characters, so it will interpret: r'\n\n\t\r' literally, instead of turning them into special whitespace characters. This is often used in regex, but not necessary for your semicolon and comma strings.

You make a copy of watson called removeWatson, but I don't understand why. You use watson to test the length but then you make sure not to change the length of removeWatson anyway. You've modified watson at the start so it's not about data preservation. I suspect that this is a remnant of an approach that ended up not working and you never thought to tidy it away. Unless you intend to print original watson alongside your modified one, I'd only use watson. And if you intend to print them together, then make one entirely untouched copy of watson at the start and do everything in the copy.

To run two for loops at once, it's neater to use product from the itertools module. It basically does what your nested for loops do, but it saves you a level of nesting and makes it clear that it's a simple nested loop. I'd also make the range(len(watson)) up front rather than typing it twice.

from itertools import product
loop = range(len(watson))
for i, j in product(loop, loop):

Your line where all the magic happens is quite unclear, if you explained how it works then it'd make your code much more readable.

And lastly, why are you filtering out None in the end? If you're going to do that you could set None instead of empty strings (it's marginally clearer to use None), but it signals that maybe you've gotten errant None values and this is your way of masking them. If there's a better reason I've missed then it's worth noting in a comment.

answered Oct 20, 2015 at 10:01
\$\endgroup\$
1
\$\begingroup\$

I actually can't think of a greatly more efficient way to do this. Just some general Python style tips:

I would change the variable name removeWatson to watson_list, to use a more informative name, (also underscores are more common in python than camel case)

I would do the "ignore uppercase/lowercase" and "remove extra spaces" like this:

watson = watson.lower()
watson_list = [w.strip() for w in watson.split('|')]

Now that you've done this, your code is a more concise. Also if you set the range of your inner loop as starting at i then you don't need the j==i check inside the loop.

for i in range(len(watson_list)):
 for j in range(i, len(watson_list)):
 if watson_list[i] in watson_list(j):
 watson_list[i] = ''
answered Oct 20, 2015 at 4:27
\$\endgroup\$
1
  • \$\begingroup\$ Thank you for comments but there is a mistake in your code. The second loop will always start from 0 not i because otherwise you will miss out some repetitive strings. Try to run my code with your approach, you will see the difference. \$\endgroup\$ Commented Oct 20, 2015 at 4:42

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.