A robust Python application that syncs Gmail messages to a local SQLite database for analysis and archival purposes.
- Incremental Sync: Only downloads new messages by default
- Full Sync: Option to download all messages and detect deletions
- Parallel Processing: Multi-threaded message fetching for improved performance
- Robust Error Handling: Automatic retries with exponential backoff
- Graceful Shutdown: Handles interruption signals cleanly
- Type Safety: Comprehensive type hints throughout the codebase
- Python 3.8 or higher
- Google Cloud Project with Gmail API enabled
- OAuth 2.0 credentials file (
credentials.json)
-
Clone the repository:
git clone https://github.com/marcboeker/gmail-to-sqlite.git cd gmail-to-sqlite -
Install dependencies:
# Using uv uv sync -
Set up Gmail API credentials:
- Go to the Google Cloud Console
- Create a new project or select an existing one
- Enable the Gmail API
- Create OAuth 2.0 credentials (Desktop application)
- Download the credentials file and save it as
credentials.jsonin the project root
You can run the application using either python directly or via uv:
# Incremental sync (default) python main.py sync --data-dir ./data # or: uv run main.py sync --data-dir ./data # Full sync with deletion detection python main.py sync --data-dir ./data --full-sync # Sync a specific message python main.py sync-message --data-dir ./data --message-id MESSAGE_ID # Detect and mark deleted messages only python main.py sync-deleted-messages --data-dir ./data # Use custom number of worker threads python main.py sync --data-dir ./data --workers 8 # Get help python main.py --help python main.py sync --help
command: Required. One ofsync,sync-message, orsync-deleted-messages--data-dir: Required. Directory where the SQLite database will be stored--full-sync: Optional. Forces a complete sync of all messages--message-id: Required forsync-message. The ID of a specific message to sync--workers: Optional. Number of worker threads (default: number of CPU cores)--help: Show help information for commands and options
The application supports graceful shutdown when you press CTRL+C:
- Stops accepting new tasks
- Waits for currently running tasks to complete
- Saves progress of completed work
- Exits cleanly
Pressing CTRL+C a second time will force an immediate exit.
The application creates a SQLite database with the following schema:
| Field | Type | Description |
|---|---|---|
| message_id | TEXT | Unique Gmail message ID |
| thread_id | TEXT | Gmail thread ID |
| sender | JSON | Sender information (name, email) |
| recipients | JSON | Recipients by type (to, cc, bcc) |
| labels | JSON | Array of Gmail labels |
| subject | TEXT | Message subject |
| body | TEXT | Message body (plain text) |
| size | INTEGER | Message size in bytes |
| timestamp | DATETIME | Message timestamp |
| is_read | BOOLEAN | Read status |
| is_outgoing | BOOLEAN | Whether sent by user |
| is_deleted | BOOLEAN | Whether deleted from Gmail |
| last_indexed | DATETIME | Last sync timestamp |
SELECT sender->>'$.email', COUNT(*) AS count FROM messages GROUP BY sender->>'$.email' ORDER BY count DESC
This is great to determine who is spamming you the most with uninteresting emails.
SELECT sender->>'$.email', COUNT(*) AS count FROM messages WHERE is_read = 0 GROUP BY sender->>'$.email' ORDER BY count DESC
- For years:
strftime('%Y', timestamp) - For months in a year:
strftime('%m', timestamp) - For days in a month:
strftime('%d', timestamp) - For weekdays:
strftime('%w', timestamp) - For hours in a day:
strftime('%H', timestamp)
SELECT strftime('%Y', timestamp) AS period, COUNT(*) AS count FROM messages GROUP BY period ORDER BY count DESC
This is an amateurish way to find all newsletters and group them by sender. It's not perfect, but it's a start. You could also use
SELECT sender->>'$.email', COUNT(*) AS count FROM messages WHERE body LIKE '%newsletter%' OR body LIKE '%unsubscribe%' GROUP BY sender->>'$.email' ORDER BY count DESC
SELECT sender->>'$.email', sum(size)/1024/1024 AS size FROM messages GROUP BY sender->>'$.email' ORDER BY size DESC
SELECT count(*) FROM messages WHERE EXISTS ( SELECT 1 FROM json_each(messages.recipients->'$.to') WHERE json_extract(value, '$.email') = 'foo@example.com' ) AND sender->>'$.email' = 'foo@example.com'
SELECT sender->>'$.email', sum(size)/1024/1024 as total_size FROM messages WHERE is_outgoing=false GROUP BY sender->>'$.email' ORDER BY total_size DESC
SELECT message_id, subject, timestamp FROM messages WHERE is_deleted=1 ORDER BY timestamp DESC