Album recommendation system built from Sputnikmusic data.
This project is educational only. Built to learn web scraping, data processing, and recommender systems. It implements rate limiting, delays between requests, and ethical scraping practices to avoid overloading the platform.
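As an illustration of the "delays between requests" idea, here is a minimal sketch of a minimum-delay limiter. The names `MinDelayLimiter` and `polite_fetch_all` are hypothetical and are not taken from this repository's scraper; the `fetch` callable stands in for whatever HTTP client is used.

```python
import time


class MinDelayLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_delay_s, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_s = min_delay_s
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_delay_s has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_delay_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def polite_fetch_all(urls, fetch, limiter):
    """Call fetch(url) for each URL, pausing between calls as the limiter requires."""
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

The first request goes out immediately; every later request waits until the configured delay has elapsed since the previous one.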
The project is split into two main stages:
Scraping and crawling pipeline that collects the following from Sputnikmusic:
- Artists and complete discographies
- Releases and metadata
- User interactions
- User profiles (roles and statistics)
RRF-Ensemble: a hybrid engine that merges the rankings of multiple strategies via Reciprocal Rank Fusion to produce personalized recommendations. It combines:
- NMF: Non-negative Matrix Factorization
- Two Towers: Deep-learning architecture
- Co-occurrence: consumption co-occurrence signals
- Content-based: genre + artist profiles
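To make the rank fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion: each item scores the sum of `1 / (k + rank)` over every ranking in which it appears. The function name, the example release IDs, and `k=60` (a commonly used default) are assumptions for illustration, not values taken from this project's configuration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists into one list ordered by RRF score.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant; larger values flatten the gap between ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by descending score; ties are broken by item for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical outputs from three strategies (release IDs are made up).
nmf_ranking = ["A", "B", "C"]
two_towers_ranking = ["B", "A", "D"]
cooccurrence_ranking = ["B", "C", "A"]

fused = reciprocal_rank_fusion(
    [nmf_ranking, two_towers_ranking, cooccurrence_ranking]
)
print(fused)  # ['B', 'A', 'C', 'D']
```

Because RRF uses only rank positions, strategies whose raw scores live on different scales can be combined without normalization.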
Note: the results notebooks are written in Spanish, but they should be easy to interpret via the plots, tables, and code.
    # Create environment
    mamba env create -f environment.yml
    conda activate sputnik-sr

    # Initialize database
    sqlite3 data/sputnik.db < data/schema.sql

    # Fetch yearly charts with soundoffs
    python -m crawler --start-year 1960 --end-year 2025 --db data/sputnik.db

    # Expand artist discographies
    python -m crawler.discography --db data/sputnik.db --batch-size 25

    # Expand user ratings
    python -m crawler.user_expander --db data/sputnik.db --batch-size 25

    # Build co-occurrences
    python offline_recommender/build_release_pairs.py --database data/sputnik.db

    # Build NMF embeddings
    python offline_recommender/build_nmf_embeddings.py --database data/sputnik.db

    # Build Two Towers embeddings
    python offline_recommender/build_two_towers.py --database data/sputnik.db

    # Start web app
    python -m app.app

    # Open http://localhost:5050
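As a rough picture of what a co-occurrence build step might compute, the sketch below counts how many users rated each pair of releases. This is an assumption about the general technique, not the actual logic of `build_release_pairs.py`; the function name and the `(user_id, release_id)` input shape are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_release_pairs(ratings):
    """Count, for each unordered pair of releases, how many users rated both.

    ratings: iterable of (user_id, release_id) tuples.
    Returns a Counter keyed by (release_a, release_b) with release_a < release_b.
    """
    by_user = {}
    for user_id, release_id in ratings:
        by_user.setdefault(user_id, set()).add(release_id)

    pairs = Counter()
    for releases in by_user.values():
        # Sorting gives each pair a canonical order so (a, b) and (b, a) merge.
        for a, b in combinations(sorted(releases), 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up interactions: three users, three releases.
ratings = [("u1", 1), ("u1", 2), ("u1", 3), ("u2", 1), ("u2", 2), ("u3", 2)]
pair_counts = count_release_pairs(ratings)
print(pair_counts.most_common(1))  # [((1, 2), 2)]
```

Pair counting grows quadratically with the number of releases per user, so a real build would likely cap or sample very active users.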
sputnik-SR/
├── scraper/ # HTML parsing and HTTP client
├── crawler/ # Crawling orchestrators
├── app/ # Flask app and recommender engine
├── offline_recommender/ # Build + evaluation scripts
├── maintenance/ # DB health + optimization scripts
├── data/ # SQL schema and databases
├── models/ # Trained models and vocabularies
├── scripts/ # Bash utilities
├── tests/ # Test suite
├── notebooks/ # EDA + evaluation notebooks
└── docs/ # Detailed documentation
| Document | Description |
|---|---|
| Data Extraction | Scraping, crawling, ingestion flow, monitoring |
| Recommendation Strategies | Algorithms, metrics, configuration, evaluation |
| Maintenance | DB health and optimization scripts |
    # Run tests
    pytest -q

    # Linter
    ruff check .

    # Pre-commits
    pre-commit install
    pre-commit run --all-files
MIT License - see LICENSE for details.