✔ Train large-scale semantic NLP models
✔ Represent text as semantic vectors
✔ Find semantically related documents
from gensim import corpora, models, similarities, downloader
# Stream a training corpus directly from S3.
corpus = corpora.MmCorpus("s3://path/to/corpus")
# Train Latent Semantic Indexing with 200D vectors.
lsi = models.LsiModel(corpus, num_topics=200)
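# Convert another corpus to the LSI space and index it.
# `another_corpus` is a placeholder for any iterable of bag-of-words vectors.
index = similarities.MatrixSimilarity(lsi[another_corpus])
# Compute similarity of a query vs indexed documents.
# `query` is a placeholder and must be a vector in the same LSI space as the index.
sims = index[query]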
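To make the last step concrete: `MatrixSimilarity` scores documents by cosine similarity. The sketch below shows that computation in plain Python on sparse `(id, weight)` vectors. It is an illustration only, not Gensim's implementation; the helper names `cosine_similarity` and `query_index` and the toy vectors are assumptions made up for this example.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def query_index(indexed_docs, query_vec):
    """Score the query against every indexed document, in index order."""
    return [cosine_similarity(query_vec, doc) for doc in indexed_docs]


# Hypothetical 2-D "semantic" vectors standing in for LSI output.
docs = [
    [(0, 1.0)],            # points along dimension 0
    [(1, 1.0)],            # points along dimension 1
    [(0, 1.0), (1, 1.0)],  # halfway between the two
]
query = [(0, 1.0)]
sims = query_index(docs, query)
```

Here `sims` holds one score per indexed document, in the same order as `docs`, which mirrors the shape of the result returned by `index[query]` above.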
The fastest library for training of vector embeddings – Python or otherwise. The core algorithms in Gensim use battle-hardened, highly optimized & parallelized C routines.
Gensim can process arbitrarily large corpora, using data-streamed algorithms. There are no "dataset must fit in RAM" limitations.
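A minimal sketch of what data streaming means in practice: Gensim accepts a corpus as any iterable that yields one bag-of-words document (a list of `(token_id, count)` pairs) at a time, so the full dataset never has to sit in memory. The class name `StreamedCorpus`, the fixed `vocab` mapping, and the temporary demo file are assumptions for illustration, and the code uses only the standard library.

```python
import os
import tempfile


class StreamedCorpus:
    """Yield one bag-of-words vector per line of a text file, lazily."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # Only the current line is held in memory at any time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = {}
                for token in line.lower().split():
                    token_id = self.token2id.get(token)
                    if token_id is not None:  # skip out-of-vocabulary tokens
                        counts[token_id] = counts.get(token_id, 0) + 1
                yield sorted(counts.items())


# Write a tiny two-document demo file (hypothetical data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human computer interface\ncomputer system computer\n")
    demo_path = tmp.name

vocab = {"human": 0, "computer": 1, "interface": 2, "system": 3}
corpus = StreamedCorpus(demo_path, vocab)

# Materialized here only to inspect the demo output; a real pipeline
# would pass `corpus` itself to the model so documents stay streamed.
vectors = list(corpus)
os.remove(demo_path)
```

Because `__iter__` reopens the file on every pass, the same object can be iterated several times, which matters for algorithms that need more than one pass over the data.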
Gensim runs on Linux, Windows and OS X, as well as any other platform that supports Python and NumPy.
With thousands of companies using Gensim every day, over 2600 academic citations and 1M downloads per week, Gensim is one of the most mature ML libraries.
All Gensim source code is hosted on GitHub under the GNU LGPL license, maintained by its open source community. For commercial arrangements, see Business Support.
The Gensim community also publishes pretrained models for specific domains like legal or health, via the Gensim-data project.
pip install --upgrade gensim
conda install -c conda-forge gensim
That's it! Congratulations, you can proceed to the tutorials.
Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that supports Python 3.8+ and NumPy. Gensim depends on the following software:
Gensim uses continuous integration, automatically running a full test suite on each pull request:
| CI service | Task | Build status |
|---|---|---|
| GitHub Actions | Run tests on Linux and Mac, plus check code-style | GitHub Action |
| AppVeyor | Run tests on Windows | AppVeyor |
| CircleCI | Build documentation | CircleCI |
Or, to install and test Gensim locally:
pip install -e . # compile and install Gensim from the current directory
pytest gensim # run the tests
Doing something interesting with Gensim? Sponsor Gensim and ask to be featured among adopters.
"Here at Tailwind, we use Gensim to help our customers post interesting and relevant content to Pinterest. No fuss, no muss. Just fast, scalable language processing."
"We are using Gensim every day. Over 15 thousand times per day to be precise. Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about. It simply works."
"Gensim hits the sweetest spot of being a simple yet powerful way to access some incredibly complex NLP goodness."
"I used Gensim at Ghent university. I found it easy to build prototypes with various models, extend it with additional features and gain empirical insights quickly. It's a reliable library that can be used beyond prototyping too."
"We used Gensim in several text mining projects at Sports Authority. The data were from free-form text fields in customer surveys, as well as social media sources. Having Gensim significantly sped our time to development, and it is still my go-to package for topic modeling with large retail data sets."
"Semantic analysis is a hot topic in online marketing, but there are few products on the market that are truly powerful. Gensim is undoubtedly one of the best frameworks that efficiently implement algorithms for statistical analysis. Few products, even commercial, have this level of quality."
"Based on our experience with Gensim on DML-CZ, we naturally opted to use it on a much bigger scale for similarity of fulltexts of scientific papers in the European Digital Mathematics Library. In evaluation with other approaches, Gensim became a clear winner, especially because of speed, scalability and ease of use."
"We have been using Gensim in several DTU courses related to digital media engineering and find it immensely useful as the tutorial material provides students an excellent introduction to quickly understand the underlying principles in topic modeling based on both LSA and LDA."