Catogorising strings into random versus non-random

Mon Dec 21 04:56:24 EST 2015

Am 21.12.15 um 09:24 schrieb Peter Otten:
> Steven D'Aprano wrote:
>>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
>>>>>> baby lions at play
>> saturday_morning12
>> Fukushima
>> ImpossibleFork
>>>>>> (note that some use underscores, others spaces, and some CamelCase) while
>> others are completely meaningless (or mostly so):
>>>>>> xy39mGWbosjY
>> 9sjz7s8198ghwt
>> rz4sdko-28dbRW00u
>>>>>> Let's call the second group "random" and the first "non-random", without
>> getting bogged down into arguments about whether they are really random or
>> not. I wish to process the strings and automatically determine whether
>> each string is random or not. I need to split the strings into three
>> groups:
>>>> - those that I'm confident are random
>> - those that I'm unsure about
>> - those that I'm confident are non-random
>>>> Ideally, I'll get some sort of numeric score so I can tweak where the
>> boundaries fall.
>>>> Strings are *mostly* ASCII but may include a few non-ASCII characters.
>>>> Note that false positives (detecting a meaningful non-random string as
>> random) is worse for me than false negatives (miscategorising a random
>> string as non-random).
>>>> Does anyone have any suggestions for how to do this? Preferably something
>> already existing. I have some thoughts and/or questions:
>>>> - I think nltk has a "language detection" function, would that be
>> suitable?
>>>> - If not nltk, are there are suitable language detection libraries?
>>>> - Is this the sort of problem that neural networks are good at solving?
>> Anyone know a really good tutorial for neural networks in Python?
>>>> - How about Bayesian filters, e.g. SpamBayes?
>> A dead simple approach -- look at the pairs in real words and calculate the
> ratio
>> pairs-also-found-in-real-words/num-pairs

Sounds reasonable. Building on this approach, two simple improvements:
- calculate the log-likelihood instead, which also makes use of the 
frequency of the digraphs in the training set
- Use trigraphs instead of digraphs
- preprocess the string (lowercase), but more sophisticated 
preprocessing could be an option (i.e. converting under_scores and 
CamelCase to spaces)
The main reason for the low score of the baby lions is the space 
character, I think - the word list does not contain that much spaces. 
Maybe one should feed in some long wikipedia article to calculate the 
digraph/trigraph probabilities
=====================================
Apfelkiste:Tests chris$ cat score_my.py
from __future__ import division
from collections import Counter, defaultdict
from math import log
import sys
WORDLIST = "/usr/share/dict/words"
SAMPLE = """\
baby lions at play
saturday_morning12
Fukushima
ImpossibleFork
xy39mGWbosjY
9sjz7s8198ghwt
rz4sdko-28dbRW00u
""".splitlines()
def extract_pairs(text):
 for i in range(len(text)-1):
 yield text.lower()[i:i+2]
 # or len(text)-2 and i:i+3
def load_pairs():
 pairs = Counter()
 with open(WORDLIST) as f:
 for line in f:
 pairs.update(extract_pairs(line.strip()))
 # normalize to sum
 total_count = sum([pairs[x] for x in pairs])
 N = total_count+len(pairs)
 dist = defaultdict(lambda:1/N, ((x, (pairs[x]+1)/N) for x in pairs))
 return dist
def get_score(text, dist):
 ll = 0
 for i, x in enumerate(extract_pairs(text), 1):
 ll += log(dist[x])
 return ll / i
def main():
 pair_dist = load_pairs()
 for text in sys.argv[1:] or SAMPLE:
 score = get_score(text, pair_dist)
 print("%.3g %s" % (score, text))
if __name__ == "__main__":
 main()
Apfelkiste:Tests chris$ python score_my.py
-8.74 baby lions at play
-7.63 saturday_morning12
-6.38 Fukushima
-5.72 ImpossibleFork
-10.6 xy39mGWbosjY
-12.9 9sjz7s8198ghwt
-12.1 rz4sdko-28dbRW00u
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-9.43 bnsip atl ayba loy
Apfelkiste:Tests chris$
and using trigraphs:
Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
-12.5 bnsip atl ayba loy
Apfelkiste:Tests chris$ python score_my.py
-11.5 baby lions at play
-9.88 saturday_morning12
-9.85 Fukushima
-7.68 ImpossibleFork
-13.4 xy39mGWbosjY
-14.2 9sjz7s8198ghwt
-14.2 rz4sdko-28dbRW00u
==============================