Commit 9ac2397

Merge pull request #823 from devkumar24/dev

MOVIE REVIEWS SCRAPING

2 parents 9268c10 + c9ae615, commit 9ac2397

File tree

8 files changed: +262 -0 lines changed

WebScrapingScripts/Movie Review Scraping
- .ipynb_checkpoints
  - IMBD DATA SCRAPING-checkpoint.ipynb
- Images
  - img.png
  - terminal.png
- README.md
- img.png
- requirements.txt
- scrap_data.py
- terminal.png

`WebScrapingScripts/Movie Review Scraping/.ipynb_checkpoints/IMBD DATA SCRAPING-checkpoint.ipynb`

Lines changed: 6 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,6 @@`
	`1`	`+{`
	`2`	`+ "cells": [],`
	`3`	`+ "metadata": {},`
	`4`	`+ "nbformat": 4,`
	`5`	`+ "nbformat_minor": 4`
	`6`	`+}`

`WebScrapingScripts/Movie Review Scraping/Images/img.png`

422 KB

`WebScrapingScripts/Movie Review Scraping/Images/terminal.png`

425 KB

`WebScrapingScripts/Movie Review Scraping/README.md`

Lines changed: 57 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,57 @@`
	`1`	`+# MOVIE REVIEW SCRAPING`
	`2`	`+`
	`3`	`+## AIM`
	`4`	`+> To Extract the Reviews of Movies.`
	`5`	`+`
	`6`	`+## DESCRIPTION`
	`7`	+This Python script extracts movie reviews from IMDb. It uses the `requests` and `bs4` packages to fetch and parse the data.
	`8`	`+`
	`9`	`+## PURPOSE`
	`10`	`+In this project you'll learn about HTTP requests and how to send them with the requests package, and how to extract the required data from HTML pages using a few simple functions from the beautifulsoup module. Sentiment analysis is a very popular task in machine learning, so this script collects review data that you can use for that and other NLP tasks.`
	`11`	`+`
	`12`	`+## PACKAGES USED`
	`13`	`+> The purpose of each package in this project`
	`14`	+- `requests` - Sends HTTP requests and receives the responses used to fetch data from IMDb.
	`15`	+- `bs4` - Extracts HTML elements from the fetched pages.
	`16`	+- `json` - Used as a helper to save the list of movies and their links.
	`17`	+- `pandas` - Creates dataframes and stores them in .csv format.
	`18`	`+`
	`19`	`+`
	`20`	`+## Workflow`
	`21`	`+- Import the packages mentioned above.`
	`22`	`+- Extract the movies and their links.`
	`23`	`+- Extract the reviews along with their ratings.`
	`24`	`+- Save the data in .csv format.`
	`25`	`+`
	`26`	`+## SETUP PACKAGES`
	`27`	+- `pip install requests `
	`28`	+- `pip install pandas`
	`29`	+- `pip install bs4`
	`30`	+- `pip install lxml`
	`31`	`+`
	`32`	`+## COMPILATION STEPS`
	`33`	`+> Go to terminal`
	`34`	`+`
	`35`	+> Run command : `python3 scrap_data.py`
	`36`	`+`
	`37`	`+> The script will do the rest.`
	`38`	`+## SOURCE`
	`39`	`+ ### IMDB`
	`40`	`+ ![Image](Images/img.png)`
	`41`	`+`
	`42`	`+`
	`43`	`+`
	`44`	`+## OUTPUT`
	`45`	`+ ### VS CODE TERMINAL`
	`46`	`+ ![OUTPUT](Images/terminal.png)`
	`47`	`+`
	`48`	`+`
	`49`	`+## AUTHOR`
	`50`	`+`
	`51`	`+---`
	`52`	`+### NAME : DEV KUMAR`
	`53`	`+---`
	`54`	`+### EMAIL : dev247kumar@gmail.com`
	`55`	`+---`
	`56`	`+### GitHub : devkumar24`
	`57`	`+---`
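
Review note on the `json` helper step described in the README: the script stores each movie as a `(rating, link)` tuple, and JSON has no tuple type, so the values come back as lists after a round trip. A minimal sketch, assuming a hypothetical one-entry dictionary whose title and URL are placeholders rather than scraped data:

```python
import json

# Hypothetical sample in the shape get_movies_list() returns:
# {movie name: (rating, link)}. The title and URL are placeholders.
movies = {"Example Movie": ("8.5", "https://www.imdb.com/title/tt0000000/")}

# json.dumps serializes Python tuples as JSON arrays.
encoded = json.dumps(movies)

# Loading the text back yields lists, not tuples.
decoded = json.loads(encoded)

# Unpacking still works, because lists unpack the same way tuples do.
rating, link = decoded["Example Movie"]
```

Code that reads the saved file back should therefore index or unpack the values rather than compare them against tuples.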

`WebScrapingScripts/Movie Review Scraping/img.png`

422 KB

`WebScrapingScripts/Movie Review Scraping/requirements.txt`

Lines changed: 4 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,4 @@`
	`1`	`+bs4==0.0.1`
	`2`	`+requests==2.25.1`
	`3`	`+pandas`
	`4`	`+pip==21.1.2`
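
Review note: the `json` module that `scrap_data.py` imports ships with Python, so it does not need to be installed with pip or pinned as a dependency. A best-effort way to check this for any module name is sketched below; `is_stdlib` is a hypothetical helper, and the fallback branch for interpreters older than Python 3.10 is only a path heuristic:

```python
import importlib.util
import sys


def is_stdlib(name: str) -> bool:
    """Best-effort check: True if `name` ships with the interpreter."""
    # sys.stdlib_module_names is available on Python 3.10 and later.
    names = getattr(sys, "stdlib_module_names", None)
    if names is not None:
        return name in names
    # Fallback heuristic (assumption): third-party packages live under
    # a site-packages directory, standard-library modules do not.
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return False
    return "site-packages" not in spec.origin


json_is_stdlib = is_stdlib("json")
```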

`WebScrapingScripts/Movie Review Scraping/scrap_data.py`

Lines changed: 195 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -0,0 +1,195 @@`
	`1`	`+# Import important packages that are used to fetch the details of the movies.`
	`2`	`+`
	`3`	`+from bs4 import BeautifulSoup`
	`4`	`+import requests`
	`5`	`+import pandas as pd`
	`6`	`+import json`
	`7`	`+`
	`8`	`+`
	`9`	`+# Enter the URL from which you want to fetch the movie data.`
	`10`	`+# In this program we fetch data from IMDb.`
	`11`	`+# Here, we collect data from the top-rated movies on IMDb and the most popular movies as of 21 July 2021.`
	`12`	`+# There are approximately 250 top-rated movies and 100 most popular movies on IMDb, for a total of about 350 movies.`
	`13`	`+`
	`14`	`+`
	`15`	`+# urls`
	`16`	`+top_rated_movies = "https://www.imdb.com/chart/top"`
	`17`	`+most_popular_movies = "https://www.imdb.com/chart/moviemeter/"`
	`18`	`+`
	`19`	`+`
	`20`	`+# ---------------------------------------------------------------------------------------------------------------------------`
	`21`	`+def get_movies_list(url):`
	`22`	`+    """`
	`23`	`+    Get the list of movies that are present at the given url.`
	`24`	`+    This function takes an input url and collects all the movies listed on that page.`
	`25`	`+    It returns each movie with its corresponding rating and link, so that we can`
	`26`	`+    fetch its reviews.`
	`27`	`+`
	`28`	`+    Return Type : Dictionary`
	`29`	`+    A dict keeps a separate rating and link for each movie, so the data is not confusing to read.`
	`30`	`+    With a plain list instead of a dict, it would be unclear what each entry holds.`
	`31`	`+    """`
	`32`	`+`
	`33`	`+    # Send a request to access the given url`
	`34`	`+    response = requests.get(url)`
	`35`	`+    soup = BeautifulSoup(response.content, 'lxml')`
	`36`	`+    content = soup.find_all('tbody', class_ = "lister-list")`
	`37`	`+`
	`38`	`+    # Get the movie names using a list comprehension`
	`39`	`+    movies_names = [content[0].find_all('tr')[i].find('td', class_ = "titleColumn").a.text for i in range(len(content[0].find_all('tr')))]`
	`40`	`+`
	`41`	`+    # A list comprehension is not used here because some movies don't have ratings`
	`42`	`+    rating = []`
	`43`	`+    for i in range(len(content[0].find_all('tr'))):`
	`44`	`+`
	`45`	`+        try:`
	`46`	`+            rating.append(content[0].find_all('tr')[i].find('td', class_ = "ratingColumn imdbRating").strong.text)`
	`47`	`+        except AttributeError:`
	`48`	`+            # A missing element yields None, so mark the rating as empty here;`
	`49`	`+            # it can be filled in later with a suitable technique`
	`50`	`+            rating.append(" ")`
	`51`	`+`
	`52`	`+    # Links for each movie`
	`53`	`+    links = [content[0].find_all('tr')[i].find('td', class_ = "titleColumn").a['href'] for i in range(len(content[0].find_all('tr')))]`
	`54`	`+`
	`55`	`+    # Build the movies dictionary, which holds the data for each movie.`
	`56`	`+    movies = {}`
	`57`	`+    for i in range(len(content[0].find_all('tr'))):`
	`58`	`+        if movies.get(movies_names[i]) is None:`
	`59`	`+            movies[movies_names[i]] = {}`
	`60`	`+            link = "https://www.imdb.com" + links[i]`
	`61`	`+            movies[movies_names[i]] = (rating[i], link)`
	`62`	`+        else:`
	`63`	`+            link = "https://www.imdb.com" + links[i]`
	`64`	`+            movies[movies_names[i]] = (rating[i], link)`
	`65`	`+`
	`66`	`+`
	`67`	`+    return movies # Return type: DICT`
	`68`	`+`
	`69`	`+`
	`70`	`+`
	`71`	`+# ---------------------------------------------------------------------------------------------------------------------------`
	`72`	`+def fetch_data(movies):`
	`73`	`+    """`
	`74`	`+    Fetch the reviews for the movies returned by get_movies_list().`
	`75`	`+    It takes as input a movies dictionary that holds each movie and its link.`
	`76`	`+`
	`77`	`+    It returns a list of reviews, in which each review is a tuple.`
	`78`	`+    e.g-> review = [('6',`
	`79`	`+                     'Average Marvel Movie',`
	`80`	`+                     'As the perspective is everything in reviewing movies')]`
	`81`	`+`
	`82`	`+    rating = review[0][0]`
	`83`	`+    title = review[0][1]`
	`84`	`+    review_content = review[0][2]`
	`85`	`+    """`
	`86`	`+    reviews = list()`
	`87`	`+    for key, val in movies.items():`
	`88`	`+`
	`89`	`+        # Send a request to access the movie's url`
	`90`	`+        movie_url = val[1]`
	`91`	`+        print("Getting Data of Movie : {}".format(key))`
	`92`	`+        response = requests.get(movie_url)`
	`93`	`+        soup = BeautifulSoup(response.content, 'lxml')`
	`94`	`+        content = soup.find_all('section', class_ = "ipc-page-section ipc-page-section--base")`
	`95`	`+`
	`96`	`+        review_url = soup.find_all('a', class_ = "ipc-title ipc-title--section-title ipc-title--base ipc-title--on-textPrimary ipc-title-link-wrapper")`
	`97`	`+        review_url = "https://www.imdb.com" + review_url[2]['href']`
	`98`	`+`
	`99`	`+        review_url_response = requests.get(review_url)`
	`100`	`+        review_url_soup = BeautifulSoup(review_url_response.content, 'lxml')`
	`101`	`+`
	`102`	`+        # A single movie has several reviews.`
	`103`	`+        total_reviews = review_url_soup.find_all('div', class_ = "review-container")`
	`104`	`+        # Loop over all of them, because every review is important to us.`
	`105`	`+        for review in total_reviews:`
	`106`	`+            # Use exception handling in case the title, review, or rating is not present.`
	`107`	`+            try:`
	`108`	`+                rating = review.find("div", class_ = "ipl-ratings-bar")`
	`109`	`+                rating = rating.find('span').text.strip().split("/")[0]`
	`110`	`+            except AttributeError:`
	`111`	`+                rating = " "`
	`112`	`+            try:`
	`113`	`+                title = review.find('a', class_ = "title").text.strip()`
	`114`	`+            except AttributeError:`
	`115`	`+                title = "NaN"`
	`116`	`+            try:`
	`117`	`+                review_content = review.find('div', class_ = "text show-more__control").text.strip()`
	`118`	`+            except AttributeError:`
	`119`	`+                review_content = None`
	`120`	`+`
	`121`	`+`
	`122`	`+            # Append the data to the list`
	`123`	`+            reviews.append((rating, title, review_content))`
	`124`	`+`
	`125`	`+    print("Total reviews fetched: {}".format(len(reviews)))`
	`126`	`+`
	`127`	`+    return reviews # return type: list of tuples`
	`128`	`+`
	`129`	`+`
	`130`	`+`
	`131`	`+# ---------------------------------------------------------------------------------------------------------------------------`
	`132`	`+def to_csv(reviews,flocation : str = "", return_data = True):`
	`133`	`+    """`
	`134`	`+    Build a dataframe from the reviews so that the data is easy to read and understand.`
	`135`	`+    The main aim of this function is to save the data in csv format.`
	`136`	`+`
	`137`	`+    : If no file location is given, the data is stored in a file named`
	`138`	`+    "data.csv"`
	`139`	`+`
	`140`	`+    : To skip returning the data, pass return_data = False`
	`141`	`+    """`
	`142`	`+    dataFrame = pd.DataFrame(data = reviews, columns = ['Rating', 'Title', 'Review'])`
	`143`	`+`
	`144`	`+    if flocation:`
	`145`	`+        dataFrame.to_csv(flocation)`
	`146`	`+    else:`
	`147`	`+        dataFrame.to_csv("data.csv")`
	`148`	`+`
	`149`	`+    if return_data:`
	`150`	`+        return dataFrame`
	`151`	`+    else:`
	`152`	`+        pass`
	`153`	`+`
	`154`	`+`
	`155`	`+`
	`156`	`+`
	`157`	`+# ---------------------------------------------------------------------------------------------------------------------------`
	`158`	`+def to_json(movies, fname : str = ""):`
	`159`	`+    """`
	`160`	`+    A helper function that saves the movie names and their links.`
	`161`	`+    """`
	`162`	`+    with open(fname, 'w') as file:`
	`163`	`+        json.dump(movies, file)`
	`164`	`+`
	`165`	`+`
	`166`	`+`
	`167`	`+# ---------------------------------------------------------------------------------------------------------------------------`
	`168`	`+def selectMovie(**kwargs):`
	`169`	`+    # **kwargs arrives as a dictionary, so we iterate over its items to read the options`
	`170`	`+    for key, val in kwargs.items():`
	`171`	`+`
	`172`	`+        # If we want to get data from top-rated movies`
	`173`	`+        if key == "top_rated_movies" and val == True:`
	`174`	`+            # fetch data from top-rated movies`
	`175`	`+            movies = get_movies_list(top_rated_movies)`
	`176`	`+            reviews = fetch_data(movies = movies)`
	`177`	`+            to_csv(reviews = reviews,flocation = "datasets/reviews_top-rated.csv" ,return_data=False)`
	`178`	`+`
	`179`	`+        # If we want to get the data from most-popular movies`
	`180`	`+        elif key == "most_popular_movies" and val == True:`
	`181`	`+            # fetch data from most-popular movies`
	`182`	`+            movies = get_movies_list(most_popular_movies)`
	`183`	`+            reviews = fetch_data(movies = movies)`
	`184`	`+            to_csv(reviews = reviews,flocation = "datasets/reviews_most-pop.csv" ,return_data=False)`
	`185`	`+`
	`186`	`+`
	`187`	`+`
	`188`	`+`
	`189`	`+`
	`190`	`+`
	`191`	`+if __name__ == "__main__":`
	`192`	`+    # Fetch both data sets from IMDb`
	`193`	`+    selectMovie(top_rated_movies = True)`
	`194`	`+    selectMovie(most_popular_movies = True)`
	`195`	`+`
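
Review note: `selectMovie` writes to paths under `datasets/`, but nothing in this commit creates that directory, and as far as I know `DataFrame.to_csv` does not create missing parent folders. A minimal sketch of creating the folder before writing is shown below. It swaps in the standard library's `csv` module for pandas so that it runs without extra installs; `save_reviews` is a hypothetical name, not a function from the commit:

```python
import csv
from pathlib import Path


def save_reviews(reviews, flocation="data.csv"):
    """Write (rating, title, review) tuples to a CSV file, creating parent folders."""
    path = Path(flocation)
    # Create e.g. "datasets/" if it does not exist yet; a no-op when it does.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Rating", "Title", "Review"])  # same columns as to_csv()
        writer.writerows(reviews)
    return path
```

With pandas, the same `mkdir` call placed just before `dataFrame.to_csv(flocation)` should serve the same purpose.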

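Review note: `get_movies_list` and `fetch_data` build absolute URLs by string concatenation and extract the score inline by splitting the rating text on `/`. The same two steps can be sketched as small standard-library helpers. `absolute_link`, `parse_rating`, and the placeholder href in the usage lines are hypothetical and only illustrate the idea:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"  # the same base the script concatenates by hand


def absolute_link(href: str) -> str:
    """Resolve a relative href against BASE_URL; absolute URLs pass through unchanged."""
    return urljoin(BASE_URL, href)


def parse_rating(text: str) -> str:
    """Return the part before '/' in text such as '6/10', or ' ' when the text is empty."""
    text = text.strip()
    if not text:
        return " "  # mirrors the script's placeholder for a missing rating
    return text.split("/")[0].strip()


# Usage with placeholder inputs (not scraped data).
example_link = absolute_link("/title/tt0000000/")
example_rating = parse_rating(" 6/10 ")
```
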
`WebScrapingScripts/Movie Review Scraping/terminal.png`

425 KB
