3
\$\begingroup\$

I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.

I'm curious about other methods for finding words without using whitespace.

def unique_words(string):
 words = dict()
 p1 = 0 # String slice position 1
 p2 = 1 # String slice position 2
 len_string = len(string)
 while p2 < len_string:
 p2 += 1
 sub1 = string[p1:p2] # A shorter sub
 sub2 = string[p1:(p2 + 1)] # A longer sub
 sub1_count = string.count(sub1) # Counts the frequency of the shorter sub
 sub2_count = string.count(sub2) # Counts the frequency of the longer sub
 if sub2_count * len(sub2) < sub1_count * len(sub1): # True if the frequency of sub1 * its length is greater
 words[sub1] = ('') # Add 
 p1 = p2
 return words

The above code works when the number of unique words is small but fails when it is large. I've used the website TextMechanic to generate a random string like

'updownleftupdowndownleftupleftrightupdownleftup'

and the above code returns a dictionary exactly as desired:

{'up': '', 'down': '', 'left': '', 'right': ''}

Here's the problem:

When the number of unique words increases, there is a point where the occurrence of single letters out numbers the total character count of any word in a string.

My current solution uses the algorithm on short slices of the original string, but this involves trial-and-error and has artifacts.

asked Sep 9, 2014 at 22:39
\$\endgroup\$
3
  • \$\begingroup\$ how do you know they are words ? \$\endgroup\$ Commented Sep 9, 2014 at 22:47
  • \$\begingroup\$ It was originally a log file with names and locations. If you look at the file you can clearly see the words are intact, they're just out of order, repeated many times, and have no white space. The original files are gone. I want to know for each of my files, what the original information was. \$\endgroup\$ Commented Sep 9, 2014 at 22:54
  • 2
    \$\begingroup\$ 'you can clearly see the words are intact'. This is my question: how a computer could know that than you can see so clearly ?. Maybe you can use some initial seed to start breaking the text, using tools from natural language or simply a dictionary prepared with the more frequent words on a similar log file. Once you take worrds from the middle the others will shine. \$\endgroup\$ Commented Sep 9, 2014 at 22:59

2 Answers 2

1
\$\begingroup\$

Disclaimer : I am not quite sure how this is supposed to work and as far as I can tell the example you gave is wrong. Indeed, I don't have "right" in the output. I'll try give you a few advices anyway.

Use the relevant type

You are storing substrings in a dictionnary with a useless associated value. It seems like what you want to use is a set.

words = set()
...
words.add(sub1)

Don't use while when for does the trick

As far as I can tell,

p2 = 1 # String slice position 2
len_string = len(string)
while p2 < len_string:
 p2 += 1
 ... something that doesn't change p2

can be used :

for p2 in range(2, 1+len(string)):
 ... something that doesn't change p2

For the latter, it is easy to see the different values that p2 will take. It is also more concise and more idiomatic.

Also, my guess is that you do one pointless iteration at the end. Indeed, you have sub1 == sub2 and then basically nothing happens. I wouldn't been surprised if :

for p2 in range(2, len(string)):

was more correct. Also, my instinct tells me that starting from 1 isn't that stupid but I have no good reason for this.

answered Jul 3, 2015 at 16:15
\$\endgroup\$
3
\$\begingroup\$

You should give your variables better names. For example, while you might know what the variable p1, or p2 does, people reading your code don't. Giving your variables better names also reduces the need for inline comments like # String slice position 2.

Rather than using the dict function to initialize an empty dictionary, you can just type the following: words = {}.

Finally, you don't need the parentheses around the '' in words[sub1] = (''). It can be changed to the following: words[sub1] = ''.

answered Jul 3, 2015 at 15:40
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.