A real-time data streaming pipeline that simulates Uber ride data using Kafka, PostgreSQL, and Streamlit for live visualization.
```
┌─────────────┐      ┌───────┐      ┌──────────┐      ┌────────────┐      ┌───────────┐
│ uber_sample │ ───> │ Kafka │ ───> │  Kafka   │ ───> │ PostgreSQL │ ───> │ Streamlit │
│    .csv     │      │ Topic │      │ Consumer │      │  Database  │      │ Dashboard │
└─────────────┘      └───────┘      └──────────┘      └────────────┘      └───────────┘
   Producer          rides_raw      Stores data        rides table         Visualizes
```
- Kafka: Message streaming platform
- PostgreSQL: Data storage
- Streamlit: Real-time dashboard
- Python: Data processing (pandas, kafka-python, sqlalchemy)
- Docker: Containerization for Kafka, Zookeeper, and PostgreSQL
```
uber_case_study/
├── data/
│   ├── uber_sample.csv    # Source dataset
│   └── clean.py           # Data cleaning script
├── producer/
│   └── producer.py        # Kafka producer
├── consumer/
│   └── consumer.py        # Kafka consumer
├── dashboard/
│   └── app.py             # Streamlit dashboard
├── init_database.py       # Database initialization
├── script.sh              # Environment setup script
├── requirements.txt       # Python dependencies
└── docker-compose.yml     # Docker services
```
This project requires Docker to run the pipeline's services.
- Please install Docker Desktop (or Docker Engine on Linux) before you begin.
Follow these manual steps to get the pipeline running.
1. Start Services
Note: if you are on Linux, you can run:

```bash
chmod +x script.sh
./script.sh
```
Otherwise, this single command starts Kafka, Zookeeper, and Postgres in the background:

```bash
docker compose up -d
```
Note: Wait about 60 seconds for the services to fully initialize.
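Rather than waiting a fixed 60 seconds, you can poll the services until they answer. Below is a minimal readiness-check sketch using kafka-python and psycopg2 (both already in requirements.txt). The host ports and Postgres credentials are assumptions based on common docker-compose defaults, so adjust them to match this project's docker-compose.yml.

```python
# readiness_check.py -- hedged sketch; hostnames, ports, and credentials
# are assumed docker-compose defaults, not taken from this project.
import time

import psycopg2
from kafka.admin import KafkaAdminClient
from kafka.errors import NoBrokersAvailable


def wait_for_services(timeout=60):
    deadline = time.time() + timeout
    kafka_ok = postgres_ok = False
    while time.time() < deadline and not (kafka_ok and postgres_ok):
        if not kafka_ok:
            try:
                KafkaAdminClient(bootstrap_servers="localhost:9092").close()
                kafka_ok = True
            except NoBrokersAvailable:
                pass  # broker not up yet, keep polling
        if not postgres_ok:
            try:
                psycopg2.connect(
                    host="localhost", port=5432,  # assumed defaults
                    user="postgres", password="postgres", dbname="postgres",
                ).close()
                postgres_ok = True
            except psycopg2.OperationalError:
                pass  # database not up yet, keep polling
        time.sleep(2)
    print(f"Kafka ready: {kafka_ok}, Postgres ready: {postgres_ok}")


if __name__ == "__main__":
    wait_for_services()
```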
2. Create Kafka Topic

Run this command in your terminal to create the rides_raw topic with 3 partitions for better parallelism:
```bash
docker exec kafka kafka-topics \
  --create \
  --topic rides_raw \
  --partitions 3 \
  --replication-factor 1 \
  --bootstrap-server kafka:29092 \
  --if-not-exists
```

3. Create Python Environment
```bash
python3 -m venv .venv
```
4. Activate Environment

On Windows:

```bash
# Command Prompt
.venv\Scripts\activate.bat

# PowerShell
.venv\Scripts\Activate.ps1
```
On macOS & Linux:

```bash
source .venv/bin/activate
```

5. Install Dependencies
```bash
pip install -r requirements.txt
```
6. Initialize Database

This script connects to the Postgres container and creates the rides table:

```bash
python init_database.py
```
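For orientation, here is a minimal sketch of what such an initialization script can look like with SQLAlchemy. The connection string and the exact column set are illustrative assumptions; init_database.py holds the project's actual schema.

```python
# Hedged sketch of a database init script. Connection string and columns
# are illustrative assumptions -- see init_database.py for the real schema.
from sqlalchemy import create_engine, text

# Assumed docker-compose defaults; adjust to your docker-compose.yml.
engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/postgres"
)

DDL = """
CREATE TABLE IF NOT EXISTS rides (
    id SERIAL PRIMARY KEY,
    pickup_datetime TIMESTAMP,
    pickup_latitude DOUBLE PRECISION,
    pickup_longitude DOUBLE PRECISION,
    passenger_count INTEGER
);
"""

with engine.begin() as conn:  # begin() commits the DDL on exit
    conn.execute(text(DDL))
print("rides table is ready")
```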
7. Run the Pipeline

Open 3 separate terminal windows (and make sure your virtual environment is activated in each one):

```bash
# Terminal 1: producer
python producer/producer.py

# Terminal 2: consumer
python consumer/consumer.py

# Terminal 3: dashboard
streamlit run dashboard/app.py
```
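To give a feel for what the producer step does, here is a minimal sketch of a CSV-replay producer. The rides_raw topic and the 0.2s delay come from this README; the bootstrap address and CSV handling are illustrative assumptions, and producer/producer.py remains the source of truth.

```python
# Hedged sketch of a CSV-replay producer like producer/producer.py.
# Topic name and 0.2s delay come from this README; the bootstrap
# address and CSV columns are assumptions.
import json
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed host-side listener
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

df = pd.read_csv("data/uber_sample.csv")
for _, row in df.iterrows():
    producer.send("rides_raw", row.to_dict())  # one ride per message
    time.sleep(0.2)                            # simulate real-time arrival

producer.flush()  # make sure everything is delivered before exiting
```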
Once everything is running, you can access:
- Streamlit Dashboard: http://localhost:8501
- Kafka UI: http://localhost:8080
- PostgreSQL: localhost:5432
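To spot-check that data is actually landing in PostgreSQL, here is a short query sketch; the connection credentials are assumed docker-compose defaults, not taken from this project.

```python
# Hedged sketch: count rows in the rides table.
# Credentials are assumed docker-compose defaults.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/postgres"
)
with engine.connect() as conn:
    total = conn.execute(text("SELECT COUNT(*) FROM rides")).scalar()
print(f"rides stored so far: {total}")
```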
- Real-time streaming: Simulates live ride data from CSV
- Live map visualization: Shows ride pickup locations
- Auto-refresh dashboard: Updates every 5 seconds
- Metrics tracking: Total rides, passengers, and timestamps
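Below is a stripped-down sketch of how a dashboard with these features can be assembled. The 5-second refresh and cache intervals match the notes later in this README; the column names and connection string are illustrative assumptions (see dashboard/app.py for the real implementation).

```python
# Hedged sketch of a live dashboard like dashboard/app.py. Refresh and
# cache intervals match this README; column names and the connection
# string are illustrative assumptions.
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine
from streamlit_autorefresh import st_autorefresh

st_autorefresh(interval=5000, key="refresh")  # re-run every 5 seconds


@st.cache_data(ttl=5)  # cache query results for 5 seconds
def load_rides():
    engine = create_engine(
        "postgresql+psycopg2://postgres:postgres@localhost:5432/postgres"
    )
    return pd.read_sql("SELECT * FROM rides", engine)


df = load_rides()
st.metric("Total rides", len(df))
st.metric("Total passengers", int(df["passenger_count"].sum()))
# st.map expects 'latitude'/'longitude' columns
st.map(df.rename(columns={"pickup_latitude": "latitude",
                          "pickup_longitude": "longitude"}))
```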
To stop all services:

```bash
docker compose down
```

To stop and remove volumes (deletes all data):

```bash
docker compose down -v
```

To deactivate the virtual environment:

```bash
deactivate
```
Core Python packages (see requirements.txt):
- pandas==2.2.3
- kafka-python==2.0.2
- sqlalchemy==2.0.36
- psycopg2-binary==2.9.9
- streamlit==1.39.0
- plotly==5.24.1
- streamlit-autorefresh==1.0.1
Solution: Make sure the virtual environment is activated:

```bash
# On macOS/Linux:
source .venv/bin/activate

# On Windows:
.venv\Scripts\activate.bat
```
Solution: Make sure the producer and consumer are both running before starting the dashboard.
- The producer simulates real-time data by sending records with a 0.2s delay
- The consumer stores all incoming messages in PostgreSQL
- The dashboard auto-refreshes every 5 seconds to show new data
- Data is cached for 5 seconds to improve performance
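To make the consumer item above concrete, here is a minimal sketch of a Kafka-to-Postgres loop. The topic and table names come from this README; the deserializer, credentials, and row-at-a-time insert strategy are assumptions (consumer/consumer.py has the actual logic).

```python
# Hedged sketch of a Kafka-to-Postgres consumer like consumer/consumer.py.
# Topic and table names come from this README; the credentials and
# insert strategy are illustrative assumptions.
import json

import pandas as pd
from kafka import KafkaConsumer
from sqlalchemy import create_engine

consumer = KafkaConsumer(
    "rides_raw",
    bootstrap_servers="localhost:9092",  # assumed host-side listener
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",        # replay from the beginning
)
engine = create_engine(
    "postgresql+psycopg2://postgres:postgres@localhost:5432/postgres"  # assumed
)

for message in consumer:
    # Append each ride as a single row to the rides table.
    pd.DataFrame([message.value]).to_sql(
        "rides", engine, if_exists="append", index=False
    )
```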
| Component | Command | URL |
|---|---|---|
| Setup | `./script.sh` | - |
| Producer | `python producer/producer.py` | - |
| Consumer | `python consumer/consumer.py` | - |
| Dashboard | `streamlit run dashboard/app.py` | http://localhost:8501 |
| Kafka UI | - | http://localhost:8080 |
| Stop All | `docker compose down` | - |
Feel free to submit issues or pull requests for improvements!
This project is for educational purposes.