Categorising strings into random versus non-random

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Dec 21 03:57:36 EST 2015


On Monday 21 December 2015 15:22, Chris Angelico wrote:
> On Mon, Dec 21, 2015 at 2:01 PM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>> I have a large number of strings (originally file names) which tend to
>> fall into two groups. Some are human-meaningful, but not necessarily
>> dictionary words e.g.:
[...]
> The first thing that comes to my mind is poking the string into a
> search engine and seeing how many results come back. You might need to
> do some preprocessing to recognize multi-word forms (maybe a handful
> of recognized cases like snake_case, CamelCase,
> CamelCasewiththeLittleWordsLeftUnchanged, etc),

I could possibly split the string into "words", based on CamelCase, spaces, 
hyphens or underscores. That would cover most of the cases.
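
A minimal sketch of that splitting step, assuming ASCII-only names. The helper name split_words and the exact regular expression are illustrative, not something from this thread; spaces, hyphens, underscores and any other punctuation are simply treated as separators.

```python
import re

# Hypothetical helper for illustration only; assumes ASCII letters and digits.
_WORD_RE = re.compile(
    r"[A-Z]+(?![a-z])"   # run of capitals not followed by lowercase ("HTTP" in "HTTPServer")
    r"|[A-Z]?[a-z]+"     # optional leading capital plus lowercase ("Camel", "case")
    r"|[0-9]+"           # run of digits
)


def split_words(name):
    """Split a file-name-like string into candidate words.

    Anything that is not an ASCII letter or digit (space, hyphen,
    underscore, dot, ...) acts as a separator and is dropped.
    """
    return _WORD_RE.findall(name)


if __name__ == "__main__":
    for sample in ("snake_case_name", "CamelCaseName", "my-file name2", "xK9vQ2"):
        print(sample, "->", split_words(sample))
```

The resulting pieces could then be passed to whatever "is this a meaningful word?" check ends up being used. Strings that split into many very short fragments may themselves be a hint of randomness, though that is only a guess.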

> How many of these keywords would you be looking up, and would a
> network transaction (a search engine API call) for each one be too
> expensive?

Tens or hundreds of thousands of strings, and yes, a network transaction
for each one would probably be a bit much. I'd rather not have Google or
Bing be a dependency :-)
-- 
Steve

