Gensim: Topic modelling for humans


Topic modelling
for humans

Gensim is a FREE Python library

✔ Train large-scale semantic NLP models

✔ Represent text as semantic vectors

✔ Find semantically related documents

from gensim import corpora, models, similarities
# Stream a training corpus directly from S3 (placeholder path).
corpus = corpora.MmCorpus("s3://path/to/corpus")
# Train Latent Semantic Indexing with 200D vectors.
lsi = models.LsiModel(corpus, num_topics=200)
# Convert another corpus to the LSI space and index it.
# `another_corpus` is a placeholder for a bag-of-words corpus built with the same vocabulary.
index = similarities.MatrixSimilarity(lsi[another_corpus])
# Compute similarity of a query vs indexed documents.
# `query` is a placeholder; it must already be a vector in the same LSI space.
sims = index[query]
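
The final query returns one similarity score per indexed document. As a rough illustration of the idea only, the sketch below computes cosine similarity between sparse vectors in plain Python and ranks the documents by score. The helper names `cosine` and `rank_documents` and the toy vectors are hypothetical; this is not Gensim's implementation, which operates on optimized matrices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {feature_id: weight} dicts."""
    dot = sum(weight * v.get(feature_id, 0.0) for feature_id, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_documents(query, documents):
    """Return (document_index, score) pairs, most similar first."""
    scores = [(idx, cosine(query, doc)) for idx, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (hypothetical data): three documents and one query.
documents = [
    {0: 1.0, 1: 1.0},
    {1: 1.0, 2: 1.0},
    {2: 1.0, 3: 1.0},
]
query = {0: 1.0, 1: 1.0}
ranking = rank_documents(query, documents)
```

Sorting the scores in descending order is the usual way to turn the raw similarity array into a "most related documents" list.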

Why Gensim?

Super fast

Gensim is the fastest library for training vector embeddings, in Python or otherwise. Its core algorithms use battle-hardened, highly optimized and parallelized C routines.

Data Streaming

Gensim can process arbitrarily large corpora, using data-streamed algorithms. There are no "dataset must fit in RAM" limitations.
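
A minimal sketch of what streaming means in practice, assuming a plain-text file with one document per line: Gensim accepts a corpus as any iterable that yields one bag-of-words vector (a list of `(token_id, count)` pairs) at a time, so the whole file never has to be in memory. The class name `StreamedCorpus`, the `token2id` mapping, and the sample text are hypothetical and written in plain Python so the example runs without Gensim.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words vector per line of a text file, reading lazily."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id  # hypothetical vocabulary: token -> integer id

    def __iter__(self):
        # The file is reopened on every pass, so the corpus can be iterated
        # repeatedly while only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id
                )
                yield sorted(counts.items())


# Small demonstration with a temporary two-document file.
token2id = {"human": 0, "computer": 1, "graph": 2}
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("human computer human\ngraph computer\n")
    corpus_path = tmp.name

corpus = StreamedCorpus(corpus_path, token2id)
first_pass = list(corpus)   # materialized here only to inspect the output
second_pass = list(corpus)  # a second pass works because __iter__ reopens the file
os.remove(corpus_path)
```

Using a class with `__iter__` rather than a one-shot generator matters when a training algorithm needs more than one pass over the data.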

Platform independent

Gensim runs on Linux, Windows and macOS, as well as any other platform that supports Python and NumPy.

Proven

With thousands of companies using Gensim every day, over 2600 academic citations and 1M downloads per week, Gensim is one of the most mature ML libraries.

Open source

All Gensim source code is hosted on GitHub under the GNU LGPL license and is maintained by its open source community. For commercial arrangements, see Business Support.

Ready-to-use models and corpora

The Gensim community also publishes pretrained models for specific domains like legal or health, via the Gensim-data project.
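
As a hedged sketch, pretrained resources from Gensim-data can be fetched through the `gensim.downloader` module. The wrapper names `load_pretrained` and `list_available_models` are hypothetical, and the default model name is a general-purpose example chosen for illustration, not one of the domain-specific models mentioned above. The import is deferred so that defining the helpers requires neither Gensim nor network access.

```python
def load_pretrained(name="glove-wiki-gigaword-50"):
    """Download (on first use) and load a resource published in Gensim-data.

    The default name is only an illustrative choice; pass the name of the
    resource you actually need.
    """
    # Deferred import: calling this function needs Gensim and a network connection.
    import gensim.downloader as api

    return api.load(name)


def list_available_models():
    """Return the sorted names of the models currently listed by Gensim-data."""
    import gensim.downloader as api

    return sorted(api.info()["models"])
```

Calling `list_available_models()` first is a reasonable way to see which names are available before downloading anything.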

Installation

Quick install

Run in your terminal (recommended):

pip install --upgrade gensim

or, alternatively for conda environments:

conda install -c conda-forge gensim

That's it! Congratulations, you can proceed to the tutorials.

Code dependencies

Gensim runs on Linux, Windows and macOS, and should run on any other platform that supports Python 3.8+ and NumPy. Gensim depends on the following software:

Python, tested with versions 3.8, 3.9, 3.10 and 3.11.
NumPy for number crunching.
smart_open for transparently opening remote or compressed files.
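
The requirements above can be checked before installing with a short script. This is a sketch using only the standard library; the function name `check_environment` and the message wording are hypothetical, and the minimum version and module names are taken from the list above.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)
REQUIRED_MODULES = ("numpy", "smart_open")


def check_environment(min_python=MIN_PYTHON, modules=REQUIRED_MODULES):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    return problems


problems = check_environment()
```

Note that pip normally installs missing dependencies automatically, so a report of missing modules before installation is expected rather than an error.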

Testing Gensim

Gensim uses continuous integration, automatically running a full test suite on each pull request:

CI service	Task
GitHub Actions	Run tests on Linux and Mac, and check code style
AppVeyor	Run tests on Windows
CircleCI	Build documentation

Or, to install and test Gensim locally:


pip install -e .  # compile and install Gensim from the current directory
pytest gensim  # run the tests

Who is using Gensim?

Doing something interesting with Gensim? Sponsor Gensim and ask to be featured among adopters.

"Here at Tailwind, we use Gensim to help our customers post interesting and relevant content to Pinterest. No fuss, no muss. Just fast, scalable language processing."

Waylon Flinn
Tailwind
"We are using Gensim every day. Over 15 thousand times per day to be precise. Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about. It simply works."

Andrius Butkus
Issuu
"Gensim hits the sweetest spot of being a simple yet powerful way to access some incredibly complex NLP goodness."

Alan J. Salmoni
Roistr.com
"I used Gensim at Ghent university. I found it easy to build prototypes with various models, extend it with additional features and gain empirical insights quickly. It's a reliable library that can be used beyond prototyping too."

Dieter Plaetinck
IBCN group
"We used Gensim in several text mining projects at Sports Authority. The data were from free-form text fields in customer surveys, as well as social media sources. Having Gensim significantly sped our time to development, and it is still my go-to package for topic modeling with large retail data sets."

Josh Hemann
Sports Authority
"Semantic analysis is a hot topic in online marketing, but there are few products on the market that are truly powerful. Gensim is undoubtedly one of the best frameworks that efficiently implement algorithms for statistical analysis. Few products, even commercial, have this level of quality."

Bruno Champion
DynAdmic
"Based on our experience with Gensim on DML-CZ, we naturally opted to use it on a much bigger scale for similarity of fulltexts of scientific papers in the European Digital Mathematics Library. In evaluation with other approaches, Gensim became a clear winner, especially because of speed, scalability and ease of use."

Petr Sojka
EuDML
"We have been using Gensim in several DTU courses related to digital media engineering and find it immensely useful as the tutorial material provides students an excellent introduction to quickly understand the underlying principles in topic modeling based on both LSA and LDA."

Michael Kai Petersen
Technical University of Denmark
