Uber Real-Time Data Streaming Pipeline

A real-time data streaming pipeline that simulates Uber ride data using Kafka, PostgreSQL, and Streamlit for live visualization.

Architecture

┌─────────────┐ ┌───────┐ ┌──────────┐ ┌──────────────┐ ┌───────────┐
│ uber_sample │ ───> │ Kafka │ ───> │ Kafka │ ───> │ PostgreSQL │ ───> │ Streamlit │
│ .csv │ │ Topic │ │ Consumer │ │ Database │ │ Dashboard │
└─────────────┘ └───────┘ └──────────┘ └──────────────┘ └───────────┘
 Producer rides_raw Stores data rides table Visualizes

Tech Stack

  • Kafka: Message streaming platform
  • PostgreSQL: Data storage
  • Streamlit: Real-time dashboard
  • Python: Data processing (pandas, kafka-python, sqlalchemy)
  • Docker: Containerization for Kafka, Zookeeper, and PostgreSQL

Project Structure

uber_case_study/
├── data/
│   ├── uber_sample.csv      # Source dataset
│   └── clean.py             # Data cleaning script
├── producer/
│   └── producer.py          # Kafka producer
├── consumer/
│   └── consumer.py          # Kafka consumer
├── dashboard/
│   └── app.py               # Streamlit dashboard
├── init_database.py         # Database initialization
├── script.sh                # Environment setup script
├── requirements.txt         # Python dependencies
└── docker-compose.yml       # Docker services

Quick Start

1. Prerequisites

This project requires Docker to run the pipeline's services.

  • Please install Docker Desktop (or Docker Engine on Linux) before you begin.

2. Setup & Run (All Systems)

Follow these manual steps to get the pipeline running.

1. Start Services

Note: If you are on Linux, you can run:

chmod +x script.sh
./script.sh

Otherwise, this single command starts Kafka, Zookeeper, and Postgres in the background:

docker compose up -d

Note: Wait about 60 seconds for the services to fully initialize.
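If you prefer not to rely on a fixed sleep, a small hedged check like the one below polls the published ports until they accept connections. The host port numbers (9092 for Kafka, 5432 for Postgres) are assumptions about the docker-compose.yml port mappings, not values taken from this repository.

# Optional helper: poll the published service ports instead of waiting 60 seconds.
import socket
import time

def wait_for(host: str, port: int, timeout: float = 120.0) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                print(f"{host}:{port} is up")
                return
        except OSError:
            time.sleep(2)  # not ready yet; retry
    raise TimeoutError(f"{host}:{port} did not come up within {timeout}s")

wait_for("localhost", 9092)  # Kafka broker (assumed host-facing listener)
wait_for("localhost", 5432)  # PostgreSQL
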

2. Create Kafka Topic

Run this command in your terminal to create the rides_raw topic with 3 partitions for better parallelism:

docker exec kafka kafka-topics \
 --create \
 --topic rides_raw \
 --partitions 3 \
 --replication-factor 1 \
 --bootstrap-server kafka:29092 \
 --if-not-exists
3. Create Python Environment

python3 -m venv .venv
4. Activate Environment

On Windows:

# On Command Prompt
.venv\Scripts\activate.bat
# On PowerShell
.venv\Scripts\Activate.ps1

On macOS & Linux:

source .venv/bin/activate
5. Install Dependencies

pip install -r requirements.txt
6. Initialize Database

This script connects to the Postgres container and creates the rides table:

python init_database.py
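
For orientation, here is a minimal sketch of what such an initialization script might look like with SQLAlchemy. The connection string, credentials, and column set are assumptions for illustration, not the repository's actual schema; check docker-compose.yml and init_database.py for the real values.

# Sketch only: assumed connection details and an assumed rides schema.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

DDL = """
CREATE TABLE IF NOT EXISTS rides (
    id SERIAL PRIMARY KEY,
    pickup_datetime TIMESTAMP,
    pickup_latitude DOUBLE PRECISION,
    pickup_longitude DOUBLE PRECISION,
    passenger_count INTEGER
);
"""

with engine.begin() as conn:
    conn.execute(text(DDL))  # idempotent thanks to IF NOT EXISTS
print("rides table is ready")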

3. Run the Pipeline

Open 3 separate terminal windows (and make sure your virtual environment is activated in each one):

Terminal 1: Start Producer

python producer/producer.py
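
For reference, a minimal sketch of the producing side with kafka-python and pandas, matching the 0.2 s delay described in the Notes. The host-facing broker address and the exact serialization are assumptions rather than the repository's code.

# Sketch only: stream CSV rows into the rides_raw topic as JSON messages.
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed host-facing listener
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

df = pd.read_csv("data/uber_sample.csv")
for record in df.to_dict(orient="records"):
    producer.send("rides_raw", record)   # topic created in the setup step
    time.sleep(0.2)                      # simulate real-time arrival

producer.flush()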

Terminal 2: Start Consumer

python consumer/consumer.py
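
And a matching sketch of the consuming side, which reads JSON messages from rides_raw and appends them to the rides table. The connection details and the row-at-a-time insert are simplifying assumptions; the repository's consumer.py may batch writes or transform columns differently.

# Sketch only: consume rides_raw and append each message to Postgres.
import json

import pandas as pd
from kafka import KafkaConsumer
from sqlalchemy import create_engine

consumer = KafkaConsumer(
    "rides_raw",
    bootstrap_servers="localhost:9092",  # assumed host-facing listener
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",        # replay the topic from the beginning
)
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

for message in consumer:
    # Append each ride as one row; a production consumer would batch these inserts.
    pd.DataFrame([message.value]).to_sql("rides", engine, if_exists="append", index=False)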

Terminal 3: Start Dashboard

streamlit run dashboard/app.py
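
A minimal sketch of what the dashboard does with streamlit-autorefresh and cached queries. Column names such as pickup_latitude/pickup_longitude and the Postgres connection string are assumptions; the actual app.py may add Plotly charts and further metrics.

# Sketch only: auto-refreshing Streamlit view over the rides table.
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine
from streamlit_autorefresh import st_autorefresh

st_autorefresh(interval=5000, key="refresh")  # re-run the app every 5 seconds

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

@st.cache_data(ttl=5)                         # cache query results for 5 seconds
def load_rides() -> pd.DataFrame:
    return pd.read_sql("SELECT * FROM rides", engine)

rides = load_rides()
st.metric("Total rides", len(rides))
if not rides.empty:
    st.map(rides.rename(columns={"pickup_latitude": "lat", "pickup_longitude": "lon"}))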

Access Points

Once everything is running, you can access:

  • Streamlit dashboard: http://localhost:8501
  • Kafka UI: http://localhost:8080

Features

  • Real-time streaming: Simulates live ride data from CSV
  • Live map visualization: Shows ride pickup locations
  • Auto-refresh dashboard: Updates every 5 seconds
  • Metrics tracking: Total rides, passengers, and timestamps

Cleanup

To stop all services:

docker compose down

To stop and remove volumes (deletes all data):

docker compose down -v

To deactivate virtual environment:

deactivate

Dependencies

Core Python packages (see requirements.txt):

  • pandas==2.2.3
  • kafka-python==2.0.2
  • sqlalchemy==2.0.36
  • psycopg2-binary==2.9.9
  • streamlit==1.39.0
  • plotly==5.24.1
  • streamlit-autorefresh==1.0.1

Troubleshooting

Issue: "Module not found"

Solution: Make sure the virtual environment is activated:

# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate.bat

Issue: Dashboard shows no data

Solution: Make sure the producer and the consumer are both running before starting the dashboard.
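
If both are running and the map is still empty, a quick check is to count the rows that have reached Postgres. The connection string below repeats the assumptions from the sketches above and may need adjusting to match docker-compose.yml.

# Sketch only: verify that rows are arriving in the rides table.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")
with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM rides")).scalar()
print(f"rows in rides table: {count}")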


Notes

  • The producer simulates real-time data by sending records with a 0.2s delay
  • The consumer stores all incoming messages in PostgreSQL
  • The dashboard auto-refreshes every 5 seconds to show new data
  • Data is cached for 5 seconds to improve performance

Quick Reference

Component    Command                           URL
Setup        ./script.sh                       -
Producer     python producer/producer.py       -
Consumer     python consumer/consumer.py       -
Dashboard    streamlit run dashboard/app.py    http://localhost:8501
Kafka UI     -                                 http://localhost:8080
Stop All     docker compose down               -

Contributing

Feel free to submit issues or pull requests for improvements!

License

This project is for educational purposes.
