Script that extracts information from emails, processes it with OCR, fits the result to a custom JSON structure, and presents it in Google Sheets. The version on master is intended for deployment on Railway.

Floressek/Gmail-extractor

Gmail Extractor

Overview

Gmail Extractor is an automated system for processing email attachments from a Gmail account. It downloads attachments, processes them based on their file type, and saves the processed data in a structured format. The system is designed to handle various file types including PDFs, Word documents, Excel spreadsheets, CSVs, and images.

Table of Contents

  1. Project Structure
  2. Prerequisites
  3. Installation
  4. Configuration
  5. Usage + Process Flow
  6. File Processing
  7. Troubleshooting
  8. Deployment
  9. Contributing
  10. License

Project Structure

gmail-extractor/
│
├── config/
│   └── constants.js
│
├── logs/
│
├── src/
│   ├── attachments/
│   │   ├── fileHandler/
│   │   │   ├── imageHandler.js
│   │   │   ├── pdfHandler.js
│   │   │   ├── spreadsheetHandler.js
│   │   │   └── wordHandler.js
│   │   └── attachmentProcessor.js
│   │
│   ├── auth/
│   │   └── authHandler.js
│   │
│   ├── email/
│   │   ├── emailProcessor.js
│   │   ├── imapListener.js
│   │   └── resetEmailsAndAttachments.js
│   │
│   ├── google-sheets/
│   │   └── google-sheets-api.js
│   │
│   ├── utils/
│   │   ├── combineEmailData.js
│   │   ├── convertPdfToImage.js
│   │   ├── createDataDirectories.js
│   │   ├── deleteFile.js
│   │   ├── fileUtils.js
│   │   └── logger.js
│   │
│   └── zod-json/
│       ├── emailDataProcessor.js
│       └── emailDataSchema.js
│
├── .env
├── .gitignore
├── credentials.json
├── Dockerfile
├── index.js
├── package.json
├── README.md
└── token.json

Prerequisites

  • Node.js (v14 or later)
  • npm or Yarn
  • A Gmail account
  • Google Cloud Console project with Gmail API enabled

Installation

  1. Clone the repository:

    git clone https://github.com/Floressek/Gmail-extractor.git
    cd Gmail-extractor
    
  2. Install dependencies:

    npm install
    

    or if you're using Yarn:

    yarn install
    
  3. Copy the .env.example file to .env:

    cp .env.example .env
    

Configuration

Environment Variables

Edit the .env file and fill in your specific details:

  • EMAIL_ADDRESS: Your Gmail address
  • PROCESSED_DIR: Directory for processed attachments (e.g., processed_attachments)
  • Add any other necessary environment variables
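
Because a missing variable tends to surface later as a confusing runtime error, it can help to validate the configuration at startup. The following is a minimal sketch, not the project's actual code: `loadConfig` is a hypothetical helper, and it assumes the values from .env have already been loaded into `process.env` by whatever loader the project uses.

```javascript
// Hypothetical helper (not part of the repository): validates the two
// variables named in this README and fails fast if either is missing.
function loadConfig(env = process.env) {
  const required = ['EMAIL_ADDRESS', 'PROCESSED_DIR'];

  // Treat undefined, empty, and whitespace-only values as missing.
  const missing = required.filter(
    (name) => !env[name] || env[name].trim() === ''
  );
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  return {
    emailAddress: env.EMAIL_ADDRESS.trim(),
    processedDir: env.PROCESSED_DIR.trim(),
  };
}
```

Calling `loadConfig()` once near the top of the entry point would stop the application before it attempts any network connection with an incomplete configuration.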

Google Cloud Console Setup

  1. Go to the Google Cloud Console.
  2. Create a new project or select an existing one.
  3. Enable the Gmail API for your project.
  4. Go to "Credentials" and create an OAuth 2.0 Client ID.
  5. Set up the OAuth consent screen if prompted.
  6. For "Application type", choose "Web application".
  7. Add http://localhost:3000/auth/google/callback to the "Authorized redirect URIs".

Credentials File

Create a credentials.json file in the root directory with the following structure:

{
 "web": {
 "client_id": "YOUR_CLIENT_ID.apps.googleusercontent.com",
 "project_id": "your-project-name",
 "auth_uri": "https://accounts.google.com/o/oauth2/auth",
 "token_uri": "https://oauth2.googleapis.com/token",
 "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
 "client_secret": "YOUR_CLIENT_SECRET",
 "redirect_uris": ["http://localhost:3000/auth/google/callback"]
 }
}
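
To illustrate how the fields in this file are used, the sketch below builds the Google consent URL from a credentials object of the shape shown above. It uses only Node.js built-ins. `buildAuthUrl` and `exampleCredentials` are hypothetical names, and the project's authHandler.js may well rely on a client library instead. The scope `https://mail.google.com/` is the one Google documents for IMAP access, and `access_type=offline` asks for a refresh token.

```javascript
// Hypothetical helper: builds the OAuth 2.0 consent URL from credentials.json content.
function buildAuthUrl(credentials, scopes) {
  const { client_id, auth_uri, redirect_uris } = credentials.web;

  const url = new URL(auth_uri);
  url.search = new URLSearchParams({
    client_id,
    redirect_uri: redirect_uris[0], // must match an authorized redirect URI
    response_type: 'code',          // authorization-code flow
    scope: scopes.join(' '),        // scopes are space-separated
    access_type: 'offline',         // request a refresh token
  }).toString();

  return url.toString();
}

// Placeholder values mirroring the structure above (assumed for illustration).
const exampleCredentials = {
  web: {
    client_id: 'YOUR_CLIENT_ID.apps.googleusercontent.com',
    auth_uri: 'https://accounts.google.com/o/oauth2/auth',
    redirect_uris: ['http://localhost:3000/auth/google/callback'],
  },
};

console.log(buildAuthUrl(exampleCredentials, ['https://mail.google.com/']));
```

In a real run the object would come from parsing credentials.json rather than from an inline literal.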

Gmail Account Settings

  1. Enable IMAP in your Gmail settings.
  2. If not using OAuth, create an App Password:
    • Go to your Google Account settings.
    • Select "Security".
    • Under "Signing in to Google," select "App Passwords".
    • Generate a new App Password for "Mail" and "Other (Custom name)".
    • Use this password in your .env file instead of your regular Gmail password.
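
When OAuth is used instead of an App Password, Gmail's IMAP server expects the access token to be presented through the XOAUTH2 mechanism. The sketch below shows how that token string is formed. `buildXoauth2Token` is a hypothetical helper, and it is an assumption that imapListener.js needs to build this value itself; many IMAP libraries accept it directly or build it internally.

```javascript
// Hypothetical helper: formats the SASL XOAUTH2 initial response for Gmail IMAP.
// Format: "user=<email>^Aauth=Bearer <token>^A^A", base64-encoded,
// where ^A is the control character 0x01.
function buildXoauth2Token(user, accessToken) {
  const raw = `user=${user}\x01auth=Bearer ${accessToken}\x01\x01`;
  return Buffer.from(raw, 'utf8').toString('base64');
}
```

The access token itself comes from the OAuth2 flow and expires, so the string has to be rebuilt whenever the token is refreshed.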

Usage

To start the Gmail extractor:

npm start

On first run, you'll be prompted to authorize the application. Follow the URL provided in the console to complete the OAuth2 flow.

Process Flow

Below is a sequence diagram illustrating the main process flow of the Gmail Extractor:

sequenceDiagram
    participant User
    participant ImapListener
    participant EmailProcessor
    participant AttachmentProcessor
    participant FileHandlers
    participant AuthHandler
    participant ZodProcessor
    participant OpenAIProcessor
    participant Gmail
    participant GoogleSheets
    User->>ImapListener: Start application
    ImapListener->>AuthHandler: Request authentication
    AuthHandler->>Gmail: Authenticate (OAuth2)
    Gmail-->>AuthHandler: Return access token
    AuthHandler-->>ImapListener: Authentication successful
    loop Listen for new emails
        ImapListener->>Gmail: Check for new emails
        Gmail-->>ImapListener: New email notification
        ImapListener->>EmailProcessor: Process new email
        EmailProcessor->>Gmail: Fetch email content
        Gmail-->>EmailProcessor: Return email content
        EmailProcessor->>AttachmentProcessor: Process attachments
        AttachmentProcessor->>FileHandlers: Handle specific file types
        FileHandlers-->>AttachmentProcessor: Return processed data
        AttachmentProcessor-->>EmailProcessor: Return processed attachments
        EmailProcessor->>EmailProcessor: Combine email data (all_{emailId}.json)
        EmailProcessor->>ZodProcessor: Validate combined data
        ZodProcessor-->>EmailProcessor: Return validated data
        EmailProcessor->>OpenAIProcessor: Process data with OpenAI
        OpenAIProcessor-->>EmailProcessor: Return structured data
        EmailProcessor->>EmailProcessor: Save processed_offer_{emailId}.json
        EmailProcessor->>GoogleSheets: Update spreadsheet with processed data
        GoogleSheets-->>EmailProcessor: Confirmation
    end
    ImapListener->>User: Notification of processed emails
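
The body of the loop in the diagram can be read as a linear pipeline. The sketch below expresses that ordering in code under stated assumptions: `processEmail` and the `stages` object with its five functions are hypothetical names chosen for illustration, not the actual exports of emailProcessor.js. Each stage is injected so the ordering can be exercised without Gmail, OpenAI, or Google Sheets.

```javascript
// Hypothetical sketch of the per-email pipeline shown in the diagram.
// `stages` supplies one function per external step, so each can be stubbed.
async function processEmail(emailId, stages) {
  // 1. Fetch the message and process its attachments.
  const content = await stages.fetchEmail(emailId);
  const attachments = await stages.processAttachments(content);

  // 2. Combine everything into one object (the diagram's all_{emailId}.json).
  const combined = { emailId, content, attachments };

  // 3. Validate, then turn into structured data
  //    (the diagram's processed_offer_{emailId}.json).
  const validated = await stages.validate(combined);
  const structured = await stages.structure(validated);

  // 4. Push the result to the spreadsheet.
  await stages.updateSheet(structured);
  return structured;
}
```

Keeping the stages as parameters is a design choice of this sketch only: a failure in any stage rejects the returned promise, so the caller can decide whether to retry or skip the email.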

File Processing

The system processes the following file types:

  • PDF: Handled by pdfHandler.js
  • Word (.doc, .docx): Handled by wordHandler.js
  • Excel (.xls, .xlsx), CSV: Handled by spreadsheetHandler.js
  • Images (.png, .jpg, .jpeg): Handled by imageHandler.js

Processed files and their extracted data are managed by attachmentProcessor.js.
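
The mapping from file type to handler listed above can be sketched as a lookup on the file extension. This is an illustration only: `HANDLERS_BY_EXTENSION` and `pickHandler` are hypothetical names, and attachmentProcessor.js may select handlers differently, for example by MIME type.

```javascript
const path = require('path');

// Extension-to-handler table taken from the list above.
const HANDLERS_BY_EXTENSION = {
  '.pdf': 'pdfHandler.js',
  '.doc': 'wordHandler.js',
  '.docx': 'wordHandler.js',
  '.xls': 'spreadsheetHandler.js',
  '.xlsx': 'spreadsheetHandler.js',
  '.csv': 'spreadsheetHandler.js',
  '.png': 'imageHandler.js',
  '.jpg': 'imageHandler.js',
  '.jpeg': 'imageHandler.js',
};

// Returns the handler file name for an attachment, or null if unsupported.
function pickHandler(filename) {
  const ext = path.extname(filename).toLowerCase(); // case-insensitive match
  return HANDLERS_BY_EXTENSION[ext] ?? null;
}
```

Returning `null` for unknown extensions lets the caller log and skip unsupported attachments instead of throwing.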

Troubleshooting

  • If you encounter authentication issues, ensure your credentials.json file is correctly set up and your Gmail account settings are properly configured.
  • Check the logs in the logs/ directory for detailed error messages.
  • For IMAP connection issues, verify that IMAP is enabled in your Gmail settings and that your network allows the connection.
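
When checking the logs, a small script can pull out the relevant lines. The sketch below scans every file in a directory for lines containing the word "error". `findErrorLines` is a hypothetical helper, and the log format written by logger.js is not documented here, so the case-insensitive match on "error" is an assumption to adjust as needed.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: collects lines mentioning "error" from all files in logDir.
function findErrorLines(logDir) {
  const hits = [];
  for (const name of fs.readdirSync(logDir)) {
    const file = path.join(logDir, name);
    if (!fs.statSync(file).isFile()) continue; // skip subdirectories

    const lines = fs.readFileSync(file, 'utf8').split(/\r?\n/);
    lines.forEach((text, index) => {
      if (/error/i.test(text)) {
        hits.push({ file: name, line: index + 1, text });
      }
    });
  }
  return hits;
}
```

For example, `findErrorLines('logs')` run from the project root would return one entry per matching line, with its file name and line number.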

Deployment

For deploying to a production environment:

  1. Ensure that sensitive files (such as credentials.json, token.json, and .env) are secured and never committed to your repository.
  2. Consider using environment variables for all sensitive information.
  3. If deploying to a cloud service, follow their specific guidelines for Node.js applications.
  4. Use a process manager like PM2 to keep the application running continuously.
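
For the PM2 step, a configuration file keeps the start-up options in one place. The following is a minimal sketch of an ecosystem.config.js, assuming index.js in the project root is the entry point. The app name and restart limit are illustrative values, not settings taken from this project.

```javascript
// ecosystem.config.js (sketch; values are assumptions, adjust to your setup)
const pm2Config = {
  apps: [
    {
      name: 'gmail-extractor',          // label shown in `pm2 list`
      script: 'index.js',               // assumed entry point
      autorestart: true,                // restart the process if it exits
      max_restarts: 10,                 // stop retrying after repeated crashes
      env: { NODE_ENV: 'production' },  // non-secret environment values only
    },
  ],
};

module.exports = pm2Config;
```

With this file in place, `pm2 start ecosystem.config.js` launches the application. Secrets are deliberately left out of the file so that it can be committed safely.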

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT
