Uber Real-Time Data Streaming Pipeline

A real-time data streaming pipeline that simulates Uber ride data using Kafka, PostgreSQL, and Streamlit for live visualization.

Architecture

┌─────────────┐ ┌───────┐ ┌──────────┐ ┌──────────────┐ ┌───────────┐
│ uber_sample │ ───> │ Kafka │ ───> │ Kafka │ ───> │ PostgreSQL │ ───> │ Streamlit │
│ .csv │ │ Topic │ │ Consumer │ │ Database │ │ Dashboard │
└─────────────┘ └───────┘ └──────────┘ └──────────────┘ └───────────┘
 Producer rides_raw Stores data rides table Visualizes

Tech Stack

  • Kafka: Message streaming platform
  • PostgreSQL: Data storage
  • Streamlit: Real-time dashboard
  • Python: Data processing (pandas, kafka-python, sqlalchemy)
  • Docker: Containerization for Kafka, Zookeeper, and PostgreSQL

Project Structure

uber_case_study/
├── data/
│   ├── uber_sample.csv      # Source dataset
│   └── clean.py             # Data cleaning script
├── producer/
│   └── producer.py          # Kafka producer
├── consumer/
│   └── consumer.py          # Kafka consumer
├── dashboard/
│   └── app.py               # Streamlit dashboard
├── init_database.py         # Database initialization
├── script.sh                # Environment setup script
├── requirements.txt         # Python dependencies
└── docker-compose.yml       # Docker services

Quick Start

1. Prerequisites

This project requires Docker to run the pipeline's services.

  • Please install Docker Desktop (or Docker Engine on Linux) before you begin.

2. Setup & Run (All Systems)

Follow these manual steps to get the pipeline running.

1. Start Services

Note: If you are on Linux, you can run:

chmod +x script.sh
./script.sh

Otherwise, this single command starts Kafka, Zookeeper, and Postgres in the background:

docker compose up -d

Note: Wait about 60 seconds for the services to fully initialize.
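If you prefer not to rely on a fixed sleep, a small hedged check like the one below polls the published ports until they accept connections. The host port numbers (9092 for Kafka, 5432 for Postgres) are assumptions about the docker-compose.yml port mappings, not values taken from this repository.

# Optional helper: poll the published service ports instead of waiting 60 seconds.
import socket
import time

def wait_for(host: str, port: int, timeout: float = 120.0) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                print(f"{host}:{port} is up")
                return
        except OSError:
            time.sleep(2)  # not ready yet; retry
    raise TimeoutError(f"{host}:{port} did not come up within {timeout}s")

wait_for("localhost", 9092)  # Kafka broker (assumed host-facing listener)
wait_for("localhost", 5432)  # PostgreSQL
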

2. Create Kafka Topic

Run this command in your terminal to create the rides_raw topic with 3 partitions for better parallelism:

docker exec kafka kafka-topics \
 --create \
 --topic rides_raw \
 --partitions 3 \
 --replication-factor 1 \
 --bootstrap-server kafka:29092 \
 --if-not-exists
3. Create Python Environment

python3 -m venv .venv
4. Activate Environment

On Windows:

# On Command Prompt
.venv\Scripts\activate.bat
# On PowerShell
.venv\Scripts\Activate.ps1

On macOS & Linux:

source .venv/bin/activate
5. Install Dependencies

pip install -r requirements.txt
6. Initialize Database

This script connects to the Postgres container and creates the rides table:

python init_database.py
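
For orientation, here is a minimal sketch of what such an initialization script might look like with SQLAlchemy. The connection string, credentials, and column set are assumptions for illustration, not the repository's actual schema; check docker-compose.yml and init_database.py for the real values.

# Sketch only: assumed connection details and an assumed rides schema.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

DDL = """
CREATE TABLE IF NOT EXISTS rides (
    id SERIAL PRIMARY KEY,
    pickup_datetime TIMESTAMP,
    pickup_latitude DOUBLE PRECISION,
    pickup_longitude DOUBLE PRECISION,
    passenger_count INTEGER
);
"""

with engine.begin() as conn:
    conn.execute(text(DDL))  # idempotent thanks to IF NOT EXISTS
print("rides table is ready")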

3. Run the Pipeline

Open 3 separate terminal windows (and make sure your virtual environment is activated in each one):

Terminal 1: Start Producer

python producer/producer.py
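
For reference, a minimal sketch of the producing side with kafka-python and pandas, matching the 0.2 s delay described in the Notes. The host-facing broker address and the exact serialization are assumptions rather than the repository's code.

# Sketch only: stream CSV rows into the rides_raw topic as JSON messages.
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed host-facing listener
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

df = pd.read_csv("data/uber_sample.csv")
for record in df.to_dict(orient="records"):
    producer.send("rides_raw", record)   # topic created in the setup step
    time.sleep(0.2)                      # simulate real-time arrival

producer.flush()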

Terminal 2: Start Consumer

python consumer/consumer.py
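
And a matching sketch of the consuming side, which reads JSON messages from rides_raw and appends them to the rides table. The connection details and the row-at-a-time insert are simplifying assumptions; the repository's consumer.py may batch writes or transform columns differently.

# Sketch only: consume rides_raw and append each message to Postgres.
import json

import pandas as pd
from kafka import KafkaConsumer
from sqlalchemy import create_engine

consumer = KafkaConsumer(
    "rides_raw",
    bootstrap_servers="localhost:9092",  # assumed host-facing listener
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",        # replay the topic from the beginning
)
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

for message in consumer:
    # Append each ride as one row; a production consumer would batch these inserts.
    pd.DataFrame([message.value]).to_sql("rides", engine, if_exists="append", index=False)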

Terminal 3: Start Dashboard

streamlit run dashboard/app.py
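
A minimal sketch of what the dashboard does with streamlit-autorefresh and cached queries. Column names such as pickup_latitude/pickup_longitude and the Postgres connection string are assumptions; the actual app.py may add Plotly charts and further metrics.

# Sketch only: auto-refreshing Streamlit view over the rides table.
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine
from streamlit_autorefresh import st_autorefresh

st_autorefresh(interval=5000, key="refresh")  # re-run the app every 5 seconds

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

@st.cache_data(ttl=5)                         # cache query results for 5 seconds
def load_rides() -> pd.DataFrame:
    return pd.read_sql("SELECT * FROM rides", engine)

rides = load_rides()
st.metric("Total rides", len(rides))
if not rides.empty:
    st.map(rides.rename(columns={"pickup_latitude": "lat", "pickup_longitude": "lon"}))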

Access Points

Once everything is running, you can access:

  • Streamlit dashboard: http://localhost:8501
  • Kafka UI: http://localhost:8080

Features

  • Real-time streaming: Simulates live ride data from CSV
  • Live map visualization: Shows ride pickup locations
  • Auto-refresh dashboard: Updates every 5 seconds
  • Metrics tracking: Total rides, passengers, and timestamps

Cleanup

To stop all services:

docker compose down

To stop and remove volumes (deletes all data):

docker compose down -v

To deactivate virtual environment:

deactivate

Dependencies

Core Python packages (see requirements.txt):

  • pandas==2.2.3
  • kafka-python==2.0.2
  • sqlalchemy==2.0.36
  • psycopg2-binary==2.9.9
  • streamlit==1.39.0
  • plotly==5.24.1
  • streamlit-autorefresh==1.0.1

Troubleshooting

Issue: "Module not found"

Solution: Make sure the virtual environment is activated:

# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate.bat

Issue: Dashboard shows no data

Solution: Make sure the producer and the consumer are both running before starting the dashboard.
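
If both are running and the map is still empty, a quick check is to count the rows that have reached Postgres. The connection string below repeats the assumptions from the sketches above and may need adjusting to match docker-compose.yml.

# Sketch only: verify that rows are arriving in the rides table.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")
with engine.connect() as conn:
    count = conn.execute(text("SELECT COUNT(*) FROM rides")).scalar()
print(f"rows in rides table: {count}")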


Notes

  • The producer simulates real-time data by sending records with a 0.2s delay
  • The consumer stores all incoming messages in PostgreSQL
  • The dashboard auto-refreshes every 5 seconds to show new data
  • Data is cached for 5 seconds to improve performance

Quick Reference

Component    Command                           URL
Setup        ./script.sh                       -
Producer     python producer/producer.py       -
Consumer     python consumer/consumer.py       -
Dashboard    streamlit run dashboard/app.py    http://localhost:8501
Kafka UI     -                                 http://localhost:8080
Stop All     docker compose down               -

Contributing

Feel free to submit issues or pull requests for improvements!

License

This project is for educational purposes.
