Python License Jupyter Hackathon
Research-grade analytics for India's digital identity ecosystem
IdentityLab is a comprehensive data analysis platform for UIDAI Aadhaar enrollment, demographic, and biometric data. Built to uncover actionable insights in identity access and adoption patterns, this platform analyzes ~5 million records to identify trends, detect anomalies, and support evidence-based policy decisions.
- π Automated Data Loading: Seamlessly load and combine multiple CSV files
- π§Ή Data Preprocessing: Clean, validate, and transform raw data with built-in quality checks
- π Statistical Analysis: Univariate, bivariate, and multivariate analysis tools
- π Advanced Visualizations: Interactive charts with Plotly, Seaborn, and Matplotlib
- πΊοΈ Geospatial Analysis: District and state-level mapping capabilities
- β±οΈ Time Series Analysis: Temporal pattern detection and forecasting
βββ Dataset/ # Raw CSV data files
β βββ api_data_aadhar_biometric/
β βββ api_data_aadhar_demographic/
β βββ api_data_aadhar_enrolment/
βββ notebooks/ # Jupyter notebooks for analysis
β βββ 01_data_exploration.ipynb
β βββ 02_data_cleaning.ipynb
β βββ 03_analysis.ipynb
β βββ 04_visualization.ipynb
βββ src/ # Source code modules
β βββ data_loader.py
β βββ preprocessing.py
β βββ analysis.py
β βββ visualization.py
βββ outputs/ # Generated visualizations and reports
βββ requirements.txt # Python dependencies
βββ README.md # This file
- Python 3.9 or higher
- pip package manager
# Clone the repository git clone https://github.com/Rakshit-2005/IdentityLab.git cd IdentityLab # Create virtual environment python -m venv .venv # Activate virtual environment # Windows: .\.venv\Scripts\activate # Linux/Mac: source .venv/bin/activate # Install dependencies pip install -r requirements.txt # Register Jupyter kernel (optional) python -m ipykernel install --user --name=identity-lab --display-name "IdentityLab"
# Load data from src.data_loader import AadhaarDataLoader loader = AadhaarDataLoader('.') df_enrolment = loader.load_enrolment_data() df_demographic = loader.load_demographic_data() df_biometric = loader.load_biometric_data() # Preprocess from src.preprocessing import AadhaarDataPreprocessor preprocessor = AadhaarDataPreprocessor() df_clean = preprocessor.clean_enrolment_data(df_enrolment) # Analyze from src.analysis import AadhaarAnalyzer analyzer = AadhaarAnalyzer() stats = analyzer.univariate_analysis(df_clean, 'total_enrolments') staModules ### `data_loader.py` - Load and combine multiple CSV files - Automatic data type detection - Memory-efficient loading with optional Dask support ### `preprocessing.py` - Data cleaning and validation - Date normalization - Outlier detection (IQR and Z-score methods) - Feature engineering (temporal features, totals, ratios) ### `analysis.py` - Univariate statistics (mean, median, skewness, kurtosis) - Bivariate analysis (correlation, statistical tests) - Temporal aggregation (daily, weekly, monthly) - Geographical aggregation (state, district, pincode) - Seasonality detection - Growth rate calculation ### `visualization.py` - Interactive time series plots - Distribution analysis (histograms, box plots) - Top N bar charts - Correlation heatmaps - Geospatial choropleth maps - Seasonal pattern visualization viz = AadhaarVisualizer(output_dir="outputs") fig = viz.plot_time_series(df_clean, 'date', 'total_enrolments', 'Enrolments Over Time')
# Start Jupyter jupyter notebook # Open notebooks in sequence: # 1. notebooks/01_data_exploration.ipynb # 2. notebooks/02_data_cleaning.ipynb # 3. notebooks/03_analysis.ipynb # 4. notebooks/04_visualization.ipynb
- New Aadhaar registrations by date, location, and age group
- FContributing Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
- UIDAI for providing the hackathon opportunity and datasets
- Open-source community for the amazing data science tools
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This project was developed for the UIDAI Hackathon 2026. The datasets used are for educational and analytical purposes.strict, pincode, bio_age_5_17, bio_age_17_
- Total Records Analyzed: ~4.35 million (after cleaning)
- Enrolment: 983,000 records
- Demographic Updates: 1,596,358 records
- Biometric Updates: 1,766,148 records
- Data Quality: Removed 591,180 duplicate records (11.9% of raw data)
- Date Range: March 2025 - December 2025
- Geographic Coverage: 50+ states/UTs, 950+ districts
- Total Enrolments: 5,331,661 new Aadhaar registrations
- Total Demographic Updates: 36,595,767 demographic changes
- Total Biometric Updates: 68,247,029 biometric captures
- Update-to-Enrolment Ratio: ~20:1 (updates significantly exceed new enrolments)
- Strong Seasonality: Coefficient of variation 109.75%
- Peak Activity Month: July 2025 (month 7)
- Lowest Activity Month: October 2025 (month 10)
- Weekday Pattern: Higher activity mid-week, reduced on weekends
- Age Distribution (Enrolments):
- Age 0-5: 65.16% (3,474,307 enrolments) - Majority children
- Age 5-17: 31.71% (1,690,892 enrolments)
- Age 18+: 3.12% (166,462 enrolments)
- Insight: Aadhaar enrollment heavily focused on children, particularly infants (0-5 years)
- Gini Coefficient: 0.74-0.77 across all metrics
- Interpretation: High geographical inequality - activity concentrated in specific states
- Top State: Accounts for disproportionately high percentage of total activity
- District Variation: Top 20 districts account for significant portion of national activity
- Missing Values: 0% (all fields complete after cleaning)
- Invalid Dates: <0.01% removed during preprocessing
- Pincode Validation: 100% conform to 6-digit format
- Outliers Detected:
- Enrolment: 3.52% of records
- Demographic Updates: 5.70% of records
- Biometric Updates: 7.52% of records
- Demographic Update Ratio: Varies by state (0-2000%+)
- Biometric Update Ratio: More uniform distribution than demographic
- Pattern: Some states show extremely high update ratios, indicating data correction campaigns or policy initiatives
- 22 HTML files generated in
outputs/directory - Categories: Temporal trends, distributions, geographical maps, seasonality, age groups, update ratios, correlations
- Comprehensive Dashboard: Multi-panel overview combining key metrics
- Objective: Initial profiling of ~5M raw records
- Techniques:
- Schema validation and data type inference
- Missing value analysis (0% missing found)
- Duplicate detection (591K duplicates identified)
- Date range validation (March-December 2025)
- Outputs: Data quality report, summary statistics, unique value counts
- Date Normalization: Converted dd-mm-yyyy format to datetime64
- String Cleaning: Title-cased state/district names for consistency
- Pincode Validation: Zero-padded to 6-digit format
- Numeric Conversion:
pd.to_numeric()with error coercion, filled NaN with 0 - Total Calculation: Summed age group columns to create totals
- Zero-Row Removal: Eliminated records with zero counts across all metrics
- Feature Engineering:
- Temporal features: year, month, day_of_week, week_of_year, quarter
- Derived metrics: total_enrolments, total_demo_updates, total_bio_updates
- Duplicate Removal: Dropped exact duplicates across all columns
- Outlier Detection: IQR method (threshold=3.0) for flagging anomalies
- Univariate Analysis: Mean, median, std, skewness, kurtosis for all metrics
- Temporal Aggregation:
- Monthly trends using
pd.Grouper(freq='M') - Weekly/daily patterns for granular analysis
- Monthly trends using
- Seasonality Detection:
- Monthly average calculation
- Coefficient of variation to quantify seasonal strength
- Peak/low month identification
- Geographical Aggregation:
- State-level: Sum, mean, count by state
- District-level: Top N analysis
- Concentration Analysis:
- Gini coefficient calculation (0.74-0.77)
- Inequality measurement across geographical units
- Update Ratio Calculation:
- (Total Updates / Total Enrolments) Γγ°γ€ 100 by state
- Identified states with high correction activity
- Age Group Breakdown: Distribution analysis across 3 age cohorts
- Time Series Plots: Plotly interactive line charts with
hovermode='x unified' - Distribution Analysis: Histogram + box plot combinations
- Top N Bar Charts: Horizontal bars for state/district rankings
- Seasonality Patterns: Monthly average line plots
- Age Group Charts: Pie charts and grouped bar charts
- Correlation Heatmaps: Plotly heatmap with annotated values
- Multi-Panel Dashboard: 6-panel comprehensive overview
- Output Format: All visualizations saved as standalone HTML files
- Color Scheme: Consistent
plotly_whitetemplate throughout
Raw CSV Files (5M records)
β
Data Loading (AadhaarDataLoader)
β
Data Cleaning (AadhaarDataPreprocessor)
β [Remove 591K duplicates, normalize dates, add features]
Clean Data (4.35M records)
β
Statistical Analysis (AadhaarAnalyzer)
β [Univariate, temporal, geographical, seasonality]
Analytical Insights
β
Visualization (AadhaarVisualizer)
β [22 interactive HTML charts]
Actionable Reports
- IQR Outlier Detection:
Q1 - Γγ°γ€IQRtoQ3 + Γγ°γ€IQRbounds - Gini Coefficient: Measure of statistical dispersion (0=equality, 1=inequality)
- Seasonality CV: Standard deviation / mean Γγ°γ€ 100
- Temporal Features: ISO week extraction, day-of-week encoding
- Aggregation: Multi-level groupby operations (state β district β pincode)
- Class-Based Design: Separate classes for loading, preprocessing, analysis, visualization
- Method Chaining: Sequential processing through pipeline stages
- Logging: INFO-level logging for all major operations
- Error Handling:
errors='coerce'for graceful handling of invalid data - Memory Optimization: Optional Dask support for large datasets
- β All pincodes 6 digits (100% compliance)
- β No missing values in cleaned data
- β Date range consistent across datasets
- β Geographic coverage complete (50+ states)
- β Temporal features correctly derived
- Data Processing: Pandas, NumPy, Dask
- Statistical Analysis: SciPy, Statsmodels, Scikit-learn
- Time Series: Prophet
- Visualization: Plotly, Seaborn, Matplotlib, Folium
- Geospatial: GeoPandas
Harsh Vardhan Github : https://github.com/The-Harsh-Vardhan Rakshit Modanwal Github : https://github.com/Rakshit-2005
This project is created for the UIDAI Hackathon 2026.
[Add contact information]