Commit 9ac2397

Merge pull request #823 from devkumar24/dev
MOVIE REVIEWS SCRAPING
2 parents 9268c10 + c9ae615 commit 9ac2397

8 files changed: +262 -0 lines changed
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}
Binary image file added (422 KB)
Binary image file added (425 KB)
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# MOVIE REVIEW SCRAPING

## AIM
> To extract the reviews of movies.

## DESCRIPTION
This is a Python script that extracts movie reviews from IMDb. It uses the `requests` and `bs4` packages to fetch and parse the data.

## PURPOSE
In this project you will learn about HTTP requests and how to send them using the `requests` package, and how to extract the required data from HTML pages using a few simple functions of the `beautifulsoup` module. Sentiment Analysis is a very popular task in Machine Learning, so this script gathers the data for you so that you can run several such NLP tasks on it.
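
As a quick illustration of the idea, here is a minimal sketch (not one of the project files) of sending a request and parsing the response. It assumes `requests`, `bs4` and `lxml` are installed, and it borrows the chart URL and CSS class used later in `scrapy_data.py`; IMDb's markup may change over time.

```python
# Minimal sketch: fetch a page and pull data out of its HTML with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.imdb.com/chart/top")  # send the HTTP request
soup = BeautifulSoup(response.content, 'lxml')             # parse the returned HTML

# Collect the movie titles from the chart table (same class name as in scrapy_data.py).
titles = [cell.a.text for cell in soup.find_all('td', class_="titleColumn")]
print(titles[:5])
```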

## PACKAGES USED
> The purpose of these packages in the project
- `requests` - Used to send HTTP requests and receive the responses in order to fetch the data from IMDb.
- `bs4` - Used to extract the required HTML elements from the website.
- `json` - Used as a helper to save the list of movies and their links.
- `pandas` - Used to create dataframes and store them in .csv format.

## Workflow
- Import the packages mentioned above.
- Extract the movie names and their links.
- Extract the reviews along with their ratings.
- Save the data in .csv format (a minimal sketch of how these steps fit together is shown below).
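
The sketch below strings the workflow together using the functions defined in `scrapy_data.py`. Treat it as an illustration only; the output file name here is just an example.

```python
# Sketch only: assumes scrapy_data.py sits in the same directory and is importable.
from scrapy_data import get_movies_list, fetch_data, to_csv, top_rated_movies

movies = get_movies_list(top_rated_movies)   # {movie name: (rating, link)}
reviews = fetch_data(movies)                 # [(rating, title, review text), ...]
to_csv(reviews, flocation="example_reviews.csv", return_data=False)  # write the CSV
```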

## SETUP PACKAGES
- `pip install requests`
- `pip install pandas`
- `pip install bs4`
- `json` ships with the Python standard library, so it needs no installation.

## COMPILATION STEPS
> Go to the terminal.

> Run the command: `python3 scrapy_data.py`

> The script will do the rest.

## SOURCE
### **IMDB**
![Image](Images/img.png)

## OUTPUT
### VS CODE TERMINAL
![OUTPUT](Images/terminal.png)
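
Once the script has finished, the saved reviews can be loaded back for inspection or for sentiment-analysis experiments. A minimal sketch follows; the path is the one used in the script's `__main__` block, and the column names are the ones written by `to_csv`.

```python
# Load the scraped reviews into a dataframe for further NLP work.
import pandas as pd

df = pd.read_csv("datasets/reviews_top-rated.csv")
print(df[['Rating', 'Title', 'Review']].head())  # columns written by to_csv()
```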

## AUTHOR

---
### NAME : DEV KUMAR
---
### EMAIL : dev247kumar@gmail.com
---
### GitHub : devkumar24
---
Binary image file added (422 KB)
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
bs4==0.0.1
requests==2.25.1
pandas
lxml
pip==21.1.2
Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
# Import the packages that are used to fetch the details of the movies.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import json


# Enter the url from which you want to fetch the movie data.
# In this program we fetch data from IMDb.
# Here, we collect data from the top-rated movies on IMDb and the most popular movies as of 21 July 2021.
# There are approximately 250 top-rated movies and 100 most popular movies on IMDb, so there are about 350 movies in total.


# urls
top_rated_movies = "https://www.imdb.com/chart/top"
most_popular_movies = "https://www.imdb.com/chart/moviemeter/"


# ---------------------------------------------------------------------------------------------------------------------------
def get_movies_list(url):
    """
    This function helps us get the list of movies that are present at the given url.
    It takes an input url and gets the list of all movies present on that page.
    It returns the movies with their corresponding rating and link, so that we can
    get our reviews.

    Return type : dictionary
    We keep a separate link and rating for each movie, so that we don't get confused while looking at the data.
    If we used a list instead of a dict, it would be hard to tell what is in the data.
    """

    # sending a request to access the particular url
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    content = soup.find_all('tbody', class_="lister-list")
    rows = content[0].find_all('tr')

    # We get our movie names using a list comprehension
    movies_names = [rows[i].find('td', class_="titleColumn").a.text for i in range(len(rows))]

    # here we do not use a list comprehension because some movies do not have a rating
    rating = []
    for i in range(len(rows)):
        try:
            rating.append(rows[i].find('td', class_="ratingColumn imdbRating").strong.text)
        except AttributeError:
            # Mark the rating as empty if no rating is present; later, while performing any task,
            # we will fill this value using proper techniques.
            rating.append(" ")

    # Links for each movie
    links = [rows[i].find('td', class_="titleColumn").a['href'] for i in range(len(rows))]

    # here we build the movies dictionary in which all the data for each movie is present.
    movies = {}
    for i in range(len(rows)):
        link = "https://www.imdb.com" + links[i]
        movies[movies_names[i]] = (rating[i], link)

    return movies  # Return type: dict


# ---------------------------------------------------------------------------------------------------------------------------
def fetch_data(movies):
    """
    This function gives us the reviews for the movies that we got from get_movies_list().
    It takes as input a movies dictionary in which the movies and their links are present.

    It returns a list of reviews, where each review is a tuple.
    e.g. -> review = [('6',
                       'Average Marvel Movie',
                       'As the perspective is everything in reviewing movies')]

    rating = review[0][0]
    title = review[0][1]
    review_content = review[0][2]
    """
    reviews = list()
    for key, val in movies.items():

        # sending a request to access the particular url
        movie_url = val[1]
        print("Getting Data of Movie : {}".format(key))
        response = requests.get(movie_url)
        soup = BeautifulSoup(response.content, 'lxml')
        content = soup.find_all('section', class_="ipc-page-section ipc-page-section--base")

        review_url = soup.find_all('a', class_="ipc-title ipc-title--section-title ipc-title--base ipc-title--on-textPrimary ipc-title-link-wrapper")
        review_url = "https://www.imdb.com" + review_url[2]['href']

        review_url_response = requests.get(review_url)
        review_url_soup = BeautifulSoup(review_url_response.content, 'lxml')

        # here we get several reviews for a single movie.
        total_reviews = review_url_soup.find_all('div', class_="review-container")
        # we iterate in a loop because the page contains several reviews, and every review is important to us.
        for review in total_reviews:
            # using exception handling in case the title, review or rating is not present.
            try:
                rating = review.find("div", class_="ipl-ratings-bar")
                rating = rating.find('span').text.strip().split("/")[0]
            except AttributeError:
                rating = " "
            try:
                title = review.find('a', class_="title").text.strip()
            except AttributeError:
                title = "NaN"
            try:
                review_content = review.find('div', class_="text show-more__control").text.strip()
            except AttributeError:
                review_content = None

            # Appending the data to the list
            reviews.append((rating, title, review_content))

    print("Total Reviews Fetched from the data are : {}".format(len(reviews)))

    return reviews  # return type: list of tuples


# ---------------------------------------------------------------------------------------------------------------------------
def to_csv(reviews, flocation: str = "", return_data=True):
    """
    This function builds a dataframe of the reviews so that the data is easy to read and understand;
    the main aim of the function is to save the data in csv format.

    : If we don't enter a file location, it automatically stores the data in a file named "data.csv".

    : If we don't want the data returned, we can say so by passing return_data = False.
    """
    dataFrame = pd.DataFrame(data=reviews, columns=['Rating', 'Title', 'Review'])

    if flocation:
        dataFrame.to_csv(flocation)
    else:
        dataFrame.to_csv("data.csv")

    if return_data:
        return dataFrame
    else:
        pass


# ---------------------------------------------------------------------------------------------------------------------------
def to_json(movies, fname: str = ""):
    """
    A helper function which is used to save the movie names and their links.
    """
    with open(fname, 'w') as file:
        json.dump(movies, file)


# ---------------------------------------------------------------------------------------------------------------------------
def selectMovie(**kwargs):
    # **kwargs creates a dictionary, so we use dictionary access to work out which data to fetch
    for key, val in kwargs.items():

        # If we want to get data for the top-rated movies
        if key == "top_rated_movies" and val == True:
            # fetch data for the top-rated movies
            movies = get_movies_list(top_rated_movies)
            reviews = fetch_data(movies=movies)
            to_csv(reviews=reviews, flocation="datasets/reviews_top-rated.csv", return_data=False)

        # If we want to get data for the most popular movies
        elif key == "most_popular_movies" and val == True:
            # fetch data for the most popular movies
            movies = get_movies_list(most_popular_movies)
            reviews = fetch_data(movies=movies)
            to_csv(reviews=reviews, flocation="datasets/reviews_most-pop.csv", return_data=False)


if __name__ == "__main__":
    # here we fetch both sets of data from IMDb
    selectMovie(top_rated_movies=True)
    selectMovie(most_popular_movies=True)
Binary image file added (425 KB)
