Add support for emoji and flags in any Lucene-compatible search engine!
If you wish to search 🍩 to find donuts in your documents, you came to the
right place. We offer synonym files ready for use in Elasticsearch and OpenSearch analyzers.
There are no requirements for Elasticsearch >= 6.7.
Using an older version of Elasticsearch? 😱
| Version | Requirements |
|---|---|
| Elasticsearch >= 6.4 and < 6.7 | You need to install the official ICU Plugin. See our blog post about this change. |
| Elasticsearch < 6.4 | You need our custom ICU Tokenizer Plugin, see our blog post (2016). |
Run the following test to verify that you get 4 EMOJI tokens:
GET _analyze
{
  "text": ["🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"]
}
What you need to search with emoji is a way to expand them to words that can match searches and documents, in your language. That's the goal of the synonym dictionaries.
We build Solr / Lucene compatible synonym files in all languages supported by Unicode CLDR so you can set them up in an analyzer. They look like this:
👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
👩‍✈ => 👩‍✈, pilot, plane, woman
🥓 => 🥓, bacon, meat, food
🥔 => 🥔, potato, vegetable, food
😅 => 😅, cold, face, open, smile, sweat
😆 => 😆, face, laugh, mouth, open, satisfied, smile
🚎 => 🚎, bus, tram, trolley
🇫🇷 => 🇫🇷, france
🇬🇧 => 🇬🇧, united kingdom
For emoticons, use this mapping with a char_filter to replace emoticons with emoji.
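As a rough illustration of how the emoticon mapping could be plugged in, here is a minimal sketch using the standard `mapping` character filter with its `mappings_path` option. The index name `tweets_with_emoticons`, the char filter name `emoticons_char_filter` and the analyzer name `english_with_emoticons` are hypothetical, and the paths assume both files are stored under `config/analysis`.

```json
PUT /tweets_with_emoticons
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons_char_filter": {
          "type": "mapping",
          "mappings_path": "analysis/emoticons.txt"
        }
      },
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        }
      },
      "analyzer": {
        "english_with_emoticons": {
          "char_filter": ["emoticons_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_emoji"]
        }
      }
    }
  }
}
```

Character filters run before the tokenizer, so an emoticon is rewritten to its emoji first, and the emoji synonym filter can then expand it like any other emoji.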
Download the emoji and emoticon files you want from this repository and store
them in PATH_TO_ES/config/analysis (or anywhere Elasticsearch can read).
config
├── analysis
│   ├── cldr-emoji-annotation-synonyms-en.txt
│   └── emoticons.txt
├── elasticsearch.yml
...
Use them like this (this is a complete English example with Elasticsearch >= 6.7):
PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "emoji_variation_selector_filter",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}
You can now test the result with:
GET tweets/_analyze
{
  "field": "content",
  "text": "🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"
}
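To see the analyzer at work in an actual search, one possible check is to index a document and query it with an emoji. This is only a sketch: the document ID and sample text are made up, and it assumes the English synonym file expands 🍩 to a term such as "donut" (so that the stemmed "donuts" in the document can match).

```json
PUT tweets/_doc/1?refresh
{
  "content": "I love donuts"
}

GET tweets/_search
{
  "query": {
    "match": {
      "content": "🍩"
    }
  }
}
```

Because the `content` field uses the same analyzer at index and search time, the emoji in the query is expanded to its synonyms before matching.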
You will need:
- PHP CLI
- PHP zip, mbstring, xml and curl extensions
- a running Elasticsearch (make start)
Edit the tag in tools/build-released.php and run php tools/build-released.php.
Run php tools/build-emoticon.php.
Emoji data courtesy of CLDR. See unicode-license.txt for details. Some modifications are made to the data, see here. Emoticon data based on https://github.com/wooorm/emoticon/ (MIT).
This repository is distributed under the MIT License. Feel free to use and contribute as you please!