This project aims to collect and scrub large volumes of valuable data for future machine learning (ML) projects, with a focus on business-related use cases.
The project utilizes various data scrubbers to gather information from different sources, categorized as follows:
- Political News
- Business News
- Business Magazine Articles
- Marketing Data
- Weather Data
- Legal News and Regulations
These scrubbers are designed to perform a PESTLE analysis, ensuring comprehensive coverage of Political, Economic, Social, Technological, Legal, and Environmental factors.
The data is collected from a variety of sources, including news websites, business magazines, weather services, and legal databases.
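The mapping from scrubber categories to PESTLE factors could be sketched as below. The category names and factor assignments are illustrative assumptions, not the project's actual configuration:

```python
# Hypothetical mapping of scrubber categories to the PESTLE factors they cover.
# The keys and assignments are illustrative; the project may organise these
# differently.
PESTLE_COVERAGE = {
    "political_news": ["Political"],
    "business_news": ["Economic"],
    "business_magazines": ["Economic", "Social"],
    "marketing_data": ["Social", "Technological"],
    "weather_data": ["Environmental"],
    "legal_news": ["Legal"],
}

def covered_factors(sources):
    """Return the set of PESTLE factors covered by the given sources."""
    return {factor for s in sources for factor in PESTLE_COVERAGE.get(s, [])}
```

Running all six categories through `covered_factors` yields full PESTLE coverage.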
The project is built using Django for the backend and PostgreSQL for the database. The architecture is designed to support local deployment while being future-proofed for potential deployment on platforms like Heroku.
To deploy the project locally, follow these steps:
- Clone the repository.
- Install the necessary dependencies.
- Set up the PostgreSQL database.
- Run the Django server.
Detailed setup instructions will be provided below.
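For the database step, the Django settings would need a PostgreSQL entry along these lines. This is a minimal sketch with placeholder credentials and an assumed database name, not the project's actual configuration:

```python
# settings.py (excerpt) - hypothetical PostgreSQL configuration.
# Replace the placeholder name, user, and password with your local values.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "scrubber_db",     # assumed database name
        "USER": "postgres",
        "PASSWORD": "change-me",   # placeholder only
        "HOST": "localhost",
        "PORT": "5432",
    }
}
```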
The primary dependencies for this project include:
- Django
- PostgreSQL
- Django REST framework (for building APIs)
- Additional libraries as needed for specific data sources and scrubbing tasks.
Users can interact with the project through a RESTful API, which allows them to initiate data scrubbing tasks, retrieve scrubbed data, and manage the data collection process.
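A client call to initiate a scrubbing task might look like the sketch below, using only the standard library. The base URL, endpoint path, and payload field are assumptions for illustration; the real routes depend on the project's URL configuration:

```python
# Minimal sketch of a client request to the project's REST API.
# The /api/tasks/ endpoint and the "scrubber" field are hypothetical.
import json
import urllib.request

BASE_URL = "http://localhost:8000/api"  # assumed local dev address

def build_start_task_request(scrubber: str) -> urllib.request.Request:
    """Build a POST request that would initiate a scrubbing task."""
    payload = json.dumps({"scrubber": scrubber}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/tasks/",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_start_task_request("weather")
```

Sending the request with `urllib.request.urlopen(req)` would then hand the task off to the backend.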
Future enhancements and planned features include:
- Additional scrubbers for more data sources.
- Improved data cleaning and enrichment algorithms.
- Enhanced user interface for managing scrubbing tasks.
- Deployment on cloud platforms like Heroku.
- Data cleansing.
- Data annotation (pending).
- Implemented centralised task updates
- Implemented data cleansing - the function is tested and works inside test.py
- Implemented cleansing function as part of Scrapy
- Implemented technology scraper using the Inquirer scraper
- Added messages on all scrapers
- Implemented task IDs
- Decided to skip the PNA scraper for now - 403 on testing; might need headers
- Completed business news with the Inquirer scraper
- Successfully connected celery-django and Scrapy
- Added page selection to scraping logic
- inquirerscrapy now working - added basic logs; needs further refinement
- Now able to scrape author and publication date
- Revised all scraping logic - use Scrapy
- Working inquirer.net national scraping
- Implemented Docker (Django, Redis, Celery)
- Weather data scrubbing working in Celery
- Implemented logging system
- Started working on the Political News scrubber - completed
- Added check before saving political data to DB - unique entries only
- Removed third-party cookies warning
- Added check before saving weather data to DB - unique entries only
- Weather scrubbers now working
- Initialised project
- Started working on a simple UI
- Added scheduled run for weather scrubbers
- Created auto backup of weather scrubber data
- Need to improve query time (SQL cache?)
- Improve notifications on Celery tasks (get best practice)
- Improve logger - less verbose, add more logs for Scrapy logic
- PNA scraper for political news (adding diversity to news sources)
- Need to clean data source - remove unreadable text
- JSON to DB (backup re-upload)
- Dashboard
- Jupyter in a webpage
- Need to refine message pop-up - "task completed" not showing on UI
- Pause long tasks?
- Technology on inquirer.net only has 219 pages - will scrub this as soon as it's working, and the latest news will always be on page 1
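The "unique entries only" checks noted above can be sketched with a plain in-memory store; in the project itself this would more likely map to something like `Model.objects.get_or_create()` on a unique field, which this sketch does not use:

```python
# Sketch of a save-only-unique-records check using a content hash.
# The in-memory set stands in for a DB uniqueness constraint.
import hashlib

_seen = set()
saved = []

def save_unique(record: dict) -> bool:
    """Save a record only if an identical one has not been saved before."""
    key = hashlib.sha256(
        repr(sorted(record.items())).encode("utf-8")
    ).hexdigest()
    if key in _seen:
        return False  # duplicate - skip the save
    _seen.add(key)
    saved.append(record)
    return True
```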
The Weather API is now working but needs further refinement. Need to make sure duplicate records are not allowed in the DB. Create a way to pause the app midway through a run:
- Use this idea to check for the pause state every time the program finishes a cycle.
Duplicate records are checked during weather fetching.
Pausing the app midway through a run needs a bit more thought; I'll push this to a later date when refining the function.
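One way to realise the check-the-pause-state-every-cycle idea is for the worker to poll a shared flag between cycles. In the sketch below a `threading.Event` stands in for what would likely be a Redis key or DB row in the Celery setup:

```python
# Sketch of a worker loop that checks a pause flag once per cycle.
# threading.Event is a stand-in for a shared flag (e.g. a Redis key).
import threading

pause_flag = threading.Event()

def run_cycles(cycles, work):
    """Run work() once per cycle, stopping early if the pause flag is set."""
    completed = 0
    for _ in range(cycles):
        if pause_flag.is_set():  # checked at the start of every cycle
            break
        work()
        completed += 1
    return completed
```

Setting the flag from another thread (or the UI handler) would then stop the run at the next cycle boundary rather than mid-fetch.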