UFC is the largest MMA promotion in the world and ranks among the top 5-10 sports promotions in the US, and it is still growing. Over the last 7 years, it has outpaced leagues such as the NBA, F1, the Champions League, the NHL, the NFL, and MLB, among others, in relative percentage growth.
This repository is a complete UFC fights dataset project containing every single UFC fight, fighter stats, and official fight scorecards.
Data was scraped, OCR-parsed from scorecard images (using PaddleOCR), cleaned, and preprocessed for analysis. Finally, the data was explored through EDA-driven storytelling and summarized in a presentation of key findings.
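As a rough illustration of the OCR-parsing step, the sketch below turns recognized text lines into per-round scores. It assumes the OCR stage has already produced one text string per scorecard row. The `ROUND_ROW` pattern, the function names, and the sample lines are hypothetical and are not taken from this repository's code. Real PaddleOCR output also includes bounding boxes and confidence values, which are omitted here.

```python
import re

# Hypothetical row format: two integers separated by whitespace, e.g. "10 9".
ROUND_ROW = re.compile(r"^\s*(\d{1,2})\s+(\d{1,2})\s*$")


def parse_round_scores(lines):
    """Return a list of (red, blue) round scores found in OCR text lines."""
    scores = []
    for line in lines:
        match = ROUND_ROW.match(line)
        if not match:
            continue  # skip headers, names, totals, and other non-score text
        red, blue = int(match.group(1)), int(match.group(2))
        # Assumption: round scores typically fall between 7 and 10,
        # so anything outside that range is treated as an OCR misread.
        if 7 <= red <= 10 and 7 <= blue <= 10:
            scores.append((red, blue))
    return scores


def total(scores):
    """Sum the per-round scores into a (red_total, blue_total) pair."""
    if not scores:
        return (0, 0)
    return tuple(sum(column) for column in zip(*scores))


# Made-up sample of what OCR text for one judge's card might look like.
ocr_lines = ["Judge: A. Example", "10 9", "9 10", "10 9", "TOTAL 29 28"]
rounds = parse_round_scores(ocr_lines)
print(rounds, total(rounds))
```

Recomputing the total from the parsed rounds and comparing it with the printed total on the card is one simple way to flag OCR misreads for manual review.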
By the end of this project, the following were achieved:
- UFC fight stats scraped
- UFC scorecards scraped
- Scorecards OCR-parsed
- Dataset cleaned & preprocessed
- Final dataset organized for analysis
- EDA questions posed & answered
- Results presented in a clear report
- Scraping UFC Stats & Scorecards: automated collection of official data
- OCR Processing: extracting structured data from official scorecard images
- Data Cleaning & Preprocessing: ready-to-use datasets for research & analysis
- Organized Dataset Storage: structured for smooth EDA workflows
- Exploratory Data Analysis (EDA): insights into UFC fights, fighters, and outcomes
- Presentation of Findings: data-driven stories and visualizations
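One way to picture how fight stats and parsed scorecards could be combined into a single dataset is a left join on a normalized fight key. This is a minimal sketch under stated assumptions: the field names (`event`, `red_fighter`, `blue_fighter`, `judge_totals`) and the sample rows are hypothetical, and the repository's actual merge logic may differ.

```python
def fight_key(event, red, blue):
    """Build a case- and whitespace-insensitive key for one fight."""
    def norm(text):
        return " ".join(text.lower().split())
    return (norm(event), norm(red), norm(blue))


def merge_stats_and_scorecards(stats_rows, scorecard_rows):
    """Left-join scorecards onto stats rows; unmatched fights get None."""
    cards_by_key = {
        fight_key(row["event"], row["red_fighter"], row["blue_fighter"]): row
        for row in scorecard_rows
    }
    merged = []
    for row in stats_rows:
        key = fight_key(row["event"], row["red_fighter"], row["blue_fighter"])
        card = cards_by_key.get(key)
        merged.append({**row, "judge_totals": card["judge_totals"] if card else None})
    return merged


# Made-up rows; the differing case and spacing mimic scraped vs. OCR text.
stats = [
    {"event": "Event 1", "red_fighter": "Fighter A", "blue_fighter": "Fighter B", "method": "Decision"},
    {"event": "Event 1", "red_fighter": "Fighter C", "blue_fighter": "Fighter D", "method": "KO/TKO"},
]
scorecards = [
    {"event": "EVENT 1", "red_fighter": "fighter  a", "blue_fighter": "Fighter B",
     "judge_totals": [(29, 28), (29, 28), (30, 27)]},
]
merged = merge_stats_and_scorecards(stats, scorecards)
print(merged)
```

A left join keeps every fight from the stats table, which matters because fights that end by finish have no scorecard to match.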
Clone the repo and set up the environment:
```bash
# 1. Clone the repository and enter it
git clone https://github.com/komaksym/UFC-DataLab.git
cd UFC-DataLab

# 2. Verify conda installation
conda --version

# 3. Create virtual environment from config
conda env create -f environment.yml

# 4. Activate the environment
conda activate paddle_env
```
```
UFC-DataLab/
├── data/
│   ├── external_data/              # External reference datasets
│   ├── merged_stats_n_scorecards/  # Final merged dataset
│   ├── scorecards/                 # Raw + OCR-processed scorecards
│   ├── src/                        # Data-related scripts & notebooks
│   └── stats/                      # UFC fight statistics
│
├── src/
│   ├── EDA/                        # Exploratory Data Analysis notebooks
│   ├── scorecard_OCR/              # OCR parsing scripts
│   └── scraping/                   # Web scraping spiders
│
└── tests/
    ├── OCR_parsing/                # OCR unit tests
    └── scrapers/                   # Scraper unit tests
```
Contributions are welcome!
- Open an Issue to report bugs or request features
- Submit a Pull Request (PR) for improvements
This project provides one of the most complete UFC datasets available, combining official fight stats with OCR-parsed scorecards. It opens the door for:
- Sports analytics & machine learning models
- UFC win prediction research
- Fighter performance tracking
- Data storytelling around MMA
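As a small example of the kind of analysis that scorecard data makes possible, the sketch below labels a three-judge decision as unanimous, split, or majority from each judge's totals. The function name and input layout are hypothetical and not part of this repository. The labels follow the common usage of these terms in MMA judging.

```python
def classify_decision(judge_totals):
    """Classify a three-judge decision from (red, blue) totals per judge."""
    if len(judge_totals) != 3:
        raise ValueError("expected totals from exactly three judges")
    red = sum(1 for r, b in judge_totals if r > b)
    blue = sum(1 for r, b in judge_totals if b > r)
    draws = 3 - red - blue
    if red == 3 or blue == 3:
        return "unanimous"
    if draws == 3:
        return "unanimous draw"
    if draws == 2:
        return "majority draw"  # two judges score a draw, one picks a winner
    if draws == 1 and red == 1 and blue == 1:
        return "split draw"     # one judge each way, one draw
    if draws == 1:
        return "majority"       # two judges for one fighter, one draw
    return "split"              # two judges for one fighter, one for the other


print(classify_decision([(29, 28), (28, 29), (29, 28)]))
```

Counting these labels across all decisions gives a quick view of how often judges disagree.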
If you find this project useful, don't forget to star this repository to support its growth!