Categorising strings into random versus non-random
Steven D'Aprano
steve at pearwood.info
Mon Dec 21 05:36:28 EST 2015
On Mon, 21 Dec 2015 08:56 pm, Christian Gollwitzer wrote:
> Apfelkiste:Tests chris$ python score_my.py
> -8.74 baby lions at play
> -7.63 saturday_morning12
> -6.38 Fukushima
> -5.72 ImpossibleFork
> -10.6 xy39mGWbosjY
> -12.9 9sjz7s8198ghwt
> -12.1 rz4sdko-28dbRW00u
> Apfelkiste:Tests chris$ python score_my.py 'bnsip atl ayba loy'
> -9.43 bnsip atl ayba loy
Thanks Christian and Peter for the suggestion; I'll certainly investigate
this further.
But the scoring doesn't seem very good. "baby lions at play" is 100% English
words, and ought to have a radically different score from (say)
xy39mGWbosjY, which is extremely non-English-like. (How many English words
do you know of with a W, an X, two Ys and a J?) And yet they are only two
units apart. "baby lions..." has a score almost as negative as the authentic
gibberish, while Fukushima (a Japanese word) has a much less negative
score. Using trigraphs doesn't change that:
> -11.5 baby lions at play
> -9.85 Fukushima
> -13.4 xy39mGWbosjY
So this test appears to find that English-like words are nearly as "random"
as actual random strings.
But it's certainly worth looking into.
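
For anyone who wants to experiment, below is a minimal sketch of the sort of
character-bigram scorer being discussed. The tiny training text, the add-one
smoothing and the per-bigram normalisation here are illustrative guesses
rather than what Christian's score_my.py actually does, and a real model
would need a much larger training corpus.

import math
from collections import Counter

# Illustrative training data only; a real scorer needs a large English corpus.
TRAINING_TEXT = (
    "the quick brown fox jumps over the lazy dog "
    "she sells sea shells by the sea shore "
    "it was the best of times it was the worst of times"
)

def bigrams(text):
    text = text.lower()
    return zip(text, text[1:])

# Frequency tables built from the training text.
bigram_counts = Counter(bigrams(TRAINING_TEXT))
unigram_counts = Counter(TRAINING_TEXT.lower())
vocab_size = len(set(TRAINING_TEXT.lower())) + 1  # +1 allows for unseen characters

def score(candidate):
    """Average log2 probability per bigram; less negative = more English-like."""
    pairs = list(bigrams(candidate))
    if not pairs:
        return float("-inf")
    total = 0.0
    for a, b in pairs:
        # P(b | a) with add-one (Laplace) smoothing so unseen pairs aren't -inf.
        numer = bigram_counts[(a, b)] + 1
        denom = unigram_counts[a] + vocab_size
        total += math.log2(numer / denom)
    return total / len(pairs)

if __name__ == "__main__":
    for s in ["baby lions at play", "Fukushima", "xy39mGWbosjY"]:
        print(round(score(s), 2), s)

Dividing by the number of bigrams is deliberate: without that normalisation
the score mostly tracks string length rather than how English-like the
string is.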
--
Steven