Categorising strings into random versus non-random
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Dec 21 03:57:36 EST 2015
On Monday 21 December 2015 15:22, Chris Angelico wrote:
> On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
[...]
> The first thing that comes to my mind is poking the string into a
> search engine and seeing how many results come back. You might need to
> do some preprocessing to recognize multi-word forms (maybe a handful
> of recognized cases like snake_case, CamelCase,
> CamelCasewiththeLittleWordsLeftUnchanged, etc),
I could possibly split the string into "words", based on CamelCase, spaces,
hyphens or underscores. That would cover most of the cases.
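
Something along these lines might do for the splitting. This is only a rough sketch: the function name split_words and both regexes are my own assumptions, not anything tested against the real data. It breaks on whitespace, hyphens and underscores first, then on CamelCase boundaries inside each piece.

```python
import re

# Hypothetical helper (name and patterns are illustrative assumptions).
# Separators: runs of whitespace, underscores or hyphens.
_SEPARATORS = re.compile(r"[\s_\-]+")

# Within a chunk, a "word" is one of:
#   - a run of capitals not followed by a lowercase letter (e.g. "HTTP"),
#   - an optional capital followed by lowercase letters (e.g. "Camel"),
#   - a run of digits.
# Any other characters (dots, brackets, ...) are simply dropped.
_CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_words(name):
    """Split a file-name-like string into candidate words."""
    words = []
    for chunk in _SEPARATORS.split(name):
        words.extend(_CAMEL.findall(chunk))
    return words


if __name__ == "__main__":
    for s in ["CamelCase", "snake_case", "my-file name2", "HTTPServer"]:
        print(s, "->", split_words(s))
```

With this sketch, "CamelCase" gives ['Camel', 'Case'] and "snake_case" gives ['snake', 'case']. It cannot separate lowercase words that have been run together, so "CamelCasewiththeLittleWordsLeftUnchanged" still yields 'Casewiththe' as a single piece.
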
> How many of these keywords would you be looking up, and would a
> network transaction (a search engine API call) for each one be too
> expensive?
Tens or hundreds of thousands of strings, and yes, a network transaction
for each one would probably be a bit much. I'd rather not have Google or
Bing be a dependency :-)
--
Steve