A machine learning tool that predicts a movie's IMDb rating from its metadata and plot description — with the ability to suggest similar existing movies for reference.
It's hard to judge a movie concept early because:
- Complex Factors: Success depends on story, genre, and audience taste
- Human Intuition: Limited comparisons and subjective biases
- No Reference Points: "Great" ideas can be risky without context
Learn patterns from thousands of past movies to:
- Predict expected audience rating
- Find similar movies for context and benchmarking
All inputs are available before release — making predictions realistic and useful for creators.
| Dataset | Purpose | Key Features |
|---|---|---|
| IMDb Dataset | Core Training Labels | Year, runtime, genres, IMDb score/votes |
| TMDB Movies Dataset | NLP Features | Plot overview, budget, revenue, credits |
- Plot Summary ("overview") - Converted to embeddings via NLP
- Genres (Action, Drama, Sci-Fi, etc.)
- Runtime + Release Year
- Budget (optional)
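As a rough illustration of how the metadata inputs above could become a numeric feature row, here is a minimal sketch. The genre vocabulary `GENRES` and the helper names are assumptions made for this example, not the project's actual feature code, and the plot embedding is left out.

```python
# Hypothetical genre vocabulary; the real project may use a different list.
GENRES = ["Action", "Comedy", "Drama", "Sci-Fi", "Thriller"]


def encode_genres(genres: str) -> list[int]:
    """Turn a comma-separated genre string into a multi-hot vector."""
    selected = {g.strip() for g in genres.split(",") if g.strip()}
    return [1 if g in selected else 0 for g in GENRES]


def build_metadata_features(year: int, runtime: int, genres: str) -> list[float]:
    """Concatenate year, runtime, and multi-hot genres into one feature row."""
    return [float(year), float(runtime), *map(float, encode_genres(genres))]


print(build_metadata_features(2024, 118, "Sci-Fi, Thriller"))
# [2024.0, 118.0, 0.0, 0.0, 0.0, 1.0, 1.0]
```

A multi-hot vector lets a movie belong to several genres at once, which a single categorical column cannot express.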
- Rating Prediction (Metadata)
  - Input: Year, runtime, genres
  - Output: Predicted IMDb rating (e.g., 7.3/10)
- Rating + Plot Understanding (NLP)
  - Enhanced predictions using plot summary, keywords, and tagline
  - Better accuracy through content understanding
- Simple Demo UI
  - Select genre + runtime
  - Paste a plot description
  - Get instant predicted rating
- Similar Movies Suggestions: Top 5 most similar existing movies with their ratings as benchmarks
- Explainability: Show which factors (keywords, genre, runtime) influenced the prediction
- Confidence Scoring: High/Medium/Low confidence levels based on training data coverage
Input:
- Genre: Sci-Fi, Thriller
- Runtime: 118 min
- Plot: "A detective investigates crimes in a city controlled by AI..."
Output:
- Predicted Rating: 7.2/10
- Similar Movies:
  - Blade Runner 2049 (8.0)
  - Minority Report (7.6)
  - Ex Machina (7.7)
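The similar-movies lookup can be pictured as a nearest-neighbour search over plot embeddings. The project lists FAISS for this; the sketch below swaps in a plain-Python cosine similarity over made-up 3-dimensional vectors, so the titles, vectors, and function names are illustrative assumptions only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec: list[float], catalog: dict[str, list[float]], k: int = 5) -> list[str]:
    """Return the k catalog titles whose vectors are closest to the query."""
    scored = [(cosine(query_vec, vec), title) for title, vec in catalog.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]


# Toy embeddings, invented for the example.
catalog = {
    "Movie A": [0.9, 0.1, 0.0],
    "Movie B": [0.0, 1.0, 0.0],
    "Movie C": [0.7, 0.3, 0.1],
}
print(top_k_similar([1.0, 0.0, 0.0], catalog, k=2))
# ['Movie A', 'Movie C']
```

A brute-force scan like this is fine for a handful of movies; an index library becomes worthwhile once the catalog reaches tens of thousands of titles.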
| Metric | Value | Notes |
|---|---|---|
| R2 Score | 0.42 | For new movies (no vote data) |
| Training Data | 39k movies | Rich dataset with TMDB plot data |
| Algorithm | GradientBoosting | Best performer across 18 experiments |
| Features | 49 | IMDb metadata + 20 PCA components from plot embeddings |
See notebooks/03_model_training.ipynb for full experiment results.
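To put the R2 score in context: it compares the model's squared error against a baseline that always predicts the mean rating, so 0.42 would mean roughly 42% of the variance in ratings is explained. The sketch below computes the metric by hand on invented ratings; the numbers are not project results.

```python
def r2_score(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - (residual error / error of the mean baseline)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented ratings, for illustration only.
actual = [6.0, 7.0, 8.0]
predicted = [6.5, 7.0, 7.5]
print(r2_score(actual, predicted))        # 0.75
print(r2_score(actual, [7.0, 7.0, 7.0]))  # 0.0, same as always guessing the mean
```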
Python 3.12 · pandas · NumPy · scikit-learn · GradientBoostingRegressor · sentence-transformers · all-MiniLM-L6-v2 · BAAI/bge-base-en-v1.5 · FAISS · FastAPI · jQuery · Select2 · Docker · Fly.io · Google Cloud Storage
floportop/
├── apps/
│   ├── api/                             # FastAPI app
│   └── frontend/                        # Streamlit app
├── src/
│   └── floportop/                       # Shared prediction/search package
├── deploy/
│   ├── cloudbuild.yaml                  # Google Cloud Build config
│   └── docker/
│       ├── Dockerfile
│       └── .dockerignore
├── requirements/
│   ├── prod.in
│   ├── prod.lock
│   └── dev.txt
├── models/                              # Trained model artifacts
├── cache/                               # Runtime model caches
├── data/                                # Local datasets (not in production image)
├── notebooks/
│   ├── 01_data_pipeline.ipynb           # IMDb + TMDB → clean datasets
│   ├── 02_feature_engineering.ipynb     # Embeddings, PCA, genre encoding → features
│   ├── 03_model_training.ipynb          # 18 experiments → model v5
│   └── archive/                         # Team explorations & earlier iterations
├── scripts/                             # Data and notebook helpers
├── docs/
│   └── restructure-plan.md
├── Makefile
├── start.sh
└── README.md
pip install -e .
PYTHONPATH=src:. uvicorn apps.api.app:app --reload

The API will be available at http://localhost:8000. Interactive docs at http://localhost:8000/docs.
| Endpoint | Method | Description |
|---|---|---|
| / | GET | Health check |
| /predict | GET | Predict movie rating from metadata |
| /similar-film | GET | Find similar movies by text query |
Predict rating (v5):
curl "http://localhost:8000/predict?startYear=2024&runtimeMinutes=148&genres=Action,Sci-Fi&overview=A%20team%20of%20astronauts%20travel%20through%20a%20wormhole%20in%20search%20of%20a%20new%20home%20for%20humanity"

Parameters:
- startYear (required): Release year
- runtimeMinutes (required): Movie length in minutes
- overview (required): Plot description, used for semantic analysis
- genres (optional): Comma-separated genres (default: "Drama")
- budget (optional): Production budget in dollars
Find similar movies:
curl "http://localhost:8000/similar-film?query=dark+sci-fi+time+travel&k=5"

Note: The similarity search index is built lazily on the first /similar-film call. Subsequent calls use the cached index from cache/.
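The lazy-build behaviour described in the note can be sketched with an in-memory cache. This is only an illustration of the pattern: the real service caches on disk under cache/, and `get_index`, `build_count`, and the toy index contents are hypothetical names invented here.

```python
from functools import lru_cache

build_count = 0  # counts how often the expensive build actually runs


@lru_cache(maxsize=1)
def get_index() -> dict[str, list[float]]:
    """Build the toy index on the first call; later calls reuse the cached object."""
    global build_count
    build_count += 1
    # Stand-in for an expensive step such as embedding movies and building a search index.
    return {"Movie A": [0.9, 0.1], "Movie B": [0.2, 0.8]}


first = get_index()   # triggers the build
second = get_index()  # served from the cache
print(build_count, first is second)
# 1 True
```

Deferring the build keeps startup fast at the cost of a slower first request.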
PYTHONPATH=src:. python -m floportop.movie_search "dark sci-fi time travel"

| Name | Role |
|---|---|
| Igor Novokshonov | Team Leader |
| Benjamin Steinborn | Developer |
| Jesús López | Developer |
| Kyle Thomas | Developer |
| mucahit TIMAR | Developer |
- Project ID: wagon-bootcamp-479218
- Region: europe-west1
- Engine: Use OrbStack (recommended for Mac) or Docker Desktop.
  - Note: OrbStack is a lightweight, drop-in replacement that uses the same docker commands but with better performance on Apple Silicon.
Critical: Google Cloud Run requires linux/amd64 images.
- The Issue: Apple Silicon Macs (M1/M2/M3) build arm64 images by default.
- The Fix: Use Remote Builds. Running gcloud builds submit builds the image natively on Google’s amd64 servers, bypassing local architecture mismatches.
| Task | Command | Description |
|---|---|---|
| Build & Push | make gcp_build | Remote build on GCP; ensures amd64 compatibility. |
| Live Deploy | make gcp_deploy | Launches the latest image to the public Cloud Run URL. |
| Full Ship | make gcp_ship | Runs both build and deploy in one sequence. |
gcloud run deploy floportop-v2 \
  --image gcr.io/wagon-bootcamp-479218/floportop-v2 \
  --memory 2Gi \
  --set-env-vars KAGGLE_API_TOKEN=your_token_here \
  --region europe-west1
- Streamlit UI: https://floportop-v2-25462497140.europe-west1.run.app
- Features: Rating prediction + Similar films search (two tabs)
- Note: Cold starts take ~60s due to model loading. The container runs both Streamlit (port 8501, exposed) and FastAPI (port 8080, internal).
- Logs: View live server logs in the terminal: gcloud run services logs read floportop-v2 --region europe-west1
If the app deploys but the logs show exec user process caused "exec format error", you have pushed an arm64 image instead of amd64.
- Verification: Run docker inspect [IMAGE_NAME] | grep Architecture.
- The Fix: Re-run make gcp_build, or build manually with the --platform linux/amd64 flag.
- Memory Requirements: This service requires at least 2Gi of RAM to load the FAISS index and models.
- Image Size: Optimized to ~1.8GB using CPU-only PyTorch and production-only dependencies.
- Ports: Container runs API on 8080 (internal) and Streamlit on 8501 (exposed to Cloud Run).
- FAISS Index: Downloaded from GCS during build (https://storage.googleapis.com/floportop-models/index.faiss).
- Lazy Imports: Do not move the Kaggle import back to the top of movie_search.py; it must remain inside the function to allow the API to boot.
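The lazy-import rule above follows a general Python pattern: an import placed inside a function is resolved only when that function runs, so a problematic dependency cannot stop the module from loading. The sketch below demonstrates this with a deliberately missing module, `kaggle_stand_in_for_demo`, which is a hypothetical placeholder and not the real Kaggle client; the function names are also invented for the example.

```python
def health_check() -> dict[str, str]:
    """Works as soon as the module loads, regardless of optional dependencies."""
    return {"status": "ok"}


def download_dataset() -> str:
    # Deferred import: resolved only when this function is called.
    # Hypothetical module name, deliberately not installed.
    import kaggle_stand_in_for_demo  # noqa: F401
    return "downloaded"


print(health_check())  # the module loaded, so the API could boot
try:
    download_dataset()
except ImportError as exc:
    print(f"failure deferred until call time: {exc}")
```

If the same import sat at the top of the file, the error would surface while the module was loading and the health check would never become reachable.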
# Build optimized image (CPU-only, ~1.8GB)
docker build -f deploy/docker/Dockerfile -t floportop .

# Run locally (exposes both API and Streamlit UI)
docker run -p 8080:8080 -p 8501:8501 floportop

# Access:
# - Streamlit UI: http://localhost:8501
# - API directly: http://localhost:8080

# Test API endpoints
curl http://localhost:8080/
curl "http://localhost:8080/predict?startYear=2024&runtimeMinutes=120&genres=Action&overview=A%20hero%20saves%20the%20world"
curl "http://localhost:8080/similar-film?query=comedy&k=5"
Final project for Le Wagon Batch #2201 (2025)
This project demonstrates real-world data processing, NLP, and machine learning — combining prediction with discovery to help creators and fans alike.