I'm trying to parse words from a badly garbled text file that contains many repeats. It's about 100k characters in length and was formed from joining many substrings in alphabetical order.
I'm curious about other methods for finding words without using whitespace.
def unique_words(string):
words = dict()
p1 = 0 # String slice position 1
p2 = 1 # String slice position 2
len_string = len(string)
while p2 < len_string:
p2 += 1
sub1 = string[p1:p2] # A shorter sub
sub2 = string[p1:(p2 + 1)] # A longer sub
sub1_count = string.count(sub1) # Counts the frequency of the shorter sub
sub2_count = string.count(sub2) # Counts the frequency of the longer sub
if sub2_count * len(sub2) < sub1_count * len(sub1): # True if the frequency of sub1 * its length is greater
words[sub1] = ('') # Add
p1 = p2
return words
The above code works when the number of unique words is small but fails when it is large. I've used the website TextMechanic to generate a random string like
'updownleftupdowndownleftupleftrightupdownleftup'
and the above code returns a dictionary exactly as desired:
{'up': '', 'down': '', 'left': '', 'right': ''}
Here's the problem:
When the number of unique words increases, there is a point where the occurrence of single letters out numbers the total character count of any word in a string.
My current solution uses the algorithm on short slices of the original string, but this involves trial-and-error and has artifacts.
-
\$\begingroup\$ how do you know they are words ? \$\endgroup\$joaquin– joaquin2014年09月09日 22:47:55 +00:00Commented Sep 9, 2014 at 22:47
-
\$\begingroup\$ It was originally a log file with names and locations. If you look at the file you can clearly see the words are intact, they're just out of order, repeated many times, and have no white space. The original files are gone. I want to know for each of my files, what the original information was. \$\endgroup\$12345678910111213– 123456789101112132014年09月09日 22:54:12 +00:00Commented Sep 9, 2014 at 22:54
-
2\$\begingroup\$ 'you can clearly see the words are intact'. This is my question: how a computer could know that than you can see so clearly ?. Maybe you can use some initial seed to start breaking the text, using tools from natural language or simply a dictionary prepared with the more frequent words on a similar log file. Once you take worrds from the middle the others will shine. \$\endgroup\$joaquin– joaquin2014年09月09日 22:59:32 +00:00Commented Sep 9, 2014 at 22:59
2 Answers 2
Disclaimer : I am not quite sure how this is supposed to work and as far as I can tell the example you gave is wrong. Indeed, I don't have "right" in the output. I'll try give you a few advices anyway.
Use the relevant type
You are storing substrings in a dictionnary with a useless associated value. It seems like what you want to use is a set
.
words = set()
...
words.add(sub1)
Don't use while
when for
does the trick
As far as I can tell,
p2 = 1 # String slice position 2
len_string = len(string)
while p2 < len_string:
p2 += 1
... something that doesn't change p2
can be used :
for p2 in range(2, 1+len(string)):
... something that doesn't change p2
For the latter, it is easy to see the different values that p2
will take. It is also more concise and more idiomatic.
Also, my guess is that you do one pointless iteration at the end. Indeed, you have sub1 == sub2
and then basically nothing happens. I wouldn't been surprised if :
for p2 in range(2, len(string)):
was more correct. Also, my instinct tells me that starting from 1 isn't that stupid but I have no good reason for this.
You should give your variables better names. For example, while you might know what the variable p1
, or p2
does, people reading your code don't. Giving your variables better names also reduces the need for inline comments like # String slice position 2
.
Rather than using the dict
function to initialize an empty dictionary, you can just type the following: words = {}
.
Finally, you don't need the parentheses around the ''
in words[sub1] = ('')
. It can be changed to the following: words[sub1] = ''
.