Scraping real estate data on houses and apartments for sale in Belgium.

silventesa/Scraping

ImmoScraping

ImmoScraping is a Python project using web scraping to create a dataset of Belgian real estate sales data.

This project was carried out by Emre Ozan, Joachim Kotek and Sara Silvente in Sept 2020 in the context of the course Bouman 2.22 organized by BeCode, a network of inclusive coding bootcamps.

Context

We used the real estate website Immoweb as our source and looked for 10,000 houses and apartments for sale across Belgium.

Currently, the program collects the following information (stored in the dataset):

  • Locality
  • Type of property (House/apartment)
  • Subtype of property (Bungalow, Chalet, Mansion, ...)
  • Price
  • Type of sale (life sales excluded)
  • Number of rooms
  • Area
  • Fully equipped kitchen (Yes/No)
  • Furnished (Yes/No)
  • Open fire (Yes/No)
  • Terrace (Yes/No)
    • If yes: Area
  • Garden (Yes/No)
    • If yes: Area
  • Surface of the land
  • Surface area of the plot of land
  • Number of facades
  • Swimming pool (Yes/No)
  • State of the building (New, to be renovated, ...)

Source code

Source code files correspond to each main project step, and are sequentially ordered:

  1. Get URLs for each property for sale (web automation)
  2. Extract and select target data
  3. Store data

Scraping data from the website

We used the Chrome WebDriver through Selenium, together with Parsel, to load each result page and extract URLs with XPath selectors (the web_drivers folder also contains the driver for Firefox).

Our search yielded 333 pages of results. Replace the value 334 in the loop below with your own number of result pages plus one, since range() excludes its upper bound.

# 1) Obtain URLs through WebDriver
from selenium import webdriver

# executable_path is the Selenium 3 keyword; Selenium 4 uses a Service object instead
driver = webdriver.Chrome(executable_path='../web_drivers/chromedriver.exe')
# Each URL will be stored in a list.
houses_url = []
# Iterate through all result pages and build the URL of each of them.
# REPLACE 334 WITH YOUR NUMBER OF RESULT PAGES + 1
for i in range(1, 334):
    apikey = str(i) + '&orderBy=relevance'
    url = 'https://www.immoweb.be/en/search/house/for-sale?countries=BE&page=' + apikey

Next, we get the URL of each house and store it in a CSV file. The same procedure was performed for apartments, since different URLs are used for each kind of property.
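The page loop and the CSV step can be sketched without a browser as follows. This is a minimal sketch, not the project's code: build_search_urls and save_urls are hypothetical helper names, and the "apartment" path segment is an assumption modelled on the house URL shown above.

```python
import csv


def build_search_urls(property_type, n_pages):
    """Return one search-result URL per page (pages are 1-indexed)."""
    # Assumption: only the property-type segment differs between searches.
    base = ("https://www.immoweb.be/en/search/{}/for-sale"
            "?countries=BE&page={}&orderBy=relevance")
    return [base.format(property_type, page) for page in range(1, n_pages + 1)]


def save_urls(urls, path):
    """Write one URL per row to a CSV file."""
    with open(path, "w", newline="") as file:
        writer = csv.writer(file)
        for url in urls:
            writer.writerow([url])
```

The property URLs collected from each result page could then be passed to save_urls, once for houses and once for apartments.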

Accessing the information

We obtained a parsed HTML tree of each property page using the BeautifulSoup library.

Data on properties was contained in "window.classified", under a <script> tag.

[Image: HTML_PROPERTY_WINDOW_CLASSIFIED]

We selected this text and converted it into a Python dictionary, where keys are property features and values are the corresponding values (check the features dict layout here).

    def house_dict(self):
        '''
        Create the dictionary with attributes as keys and the property's values as values.
        '''
        try:
            # The relevant info is under a "script" tag in the website
            result_set = self.soup.find_all('script', attrs={"type": "text/javascript"})

            # Iterate through the "script" tags found and keep the one containing
            # the substring "window.classified", which holds all the relevant info
            for tag in result_set:
                if 'window.classified' in str(tag.string):
                    window_classified = tag
                    # When we've found the right tag we can stop the loop early
                    break

            # Access the string attribute of the tag and remove leading and trailing whitespace
            wcs = window_classified.string.strip()

            # Keep only the part of the string that will be converted into a dictionary
            wcs = wcs[wcs.find("{"):wcs.rfind("}") + 1]

            # Convert it into a dictionary through the json library
            house_dict = json.loads(wcs)
            return house_dict
        except Exception:
            # No matching tag was found, or the text was not valid JSON
            return None
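The slicing step is easier to see in isolation. Below is a minimal sketch that works on the script text alone, assuming the script has the shape "window.classified = {...};". The function name extract_classified and the sample string are invented for illustration; only the bedroomCount key appears in this project.

```python
import json


def extract_classified(script_text):
    """Return the dict embedded in the script text, or None if there is none."""
    start = script_text.find("{")    # first opening brace
    end = script_text.rfind("}")     # last closing brace
    if start == -1 or end < start:
        return None
    try:
        return json.loads(script_text[start:end + 1])
    except json.JSONDecodeError:
        return None


# Invented sample shaped like the script content described above
sample = 'window.classified = {"property": {"bedroomCount": 3}};'
classified = extract_classified(sample)
```

Slicing from the first "{" to the last "}" drops the JavaScript assignment and the trailing semicolon, so what remains can be parsed as JSON.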

We defined a HouseApartmentScraping class and used its methods to get each property attribute (and store it as a dictionary value), iterating over the URLs previously stored in the CSV file.

    # Example with number of rooms
    def num_rooms(self):
        try:
            return int(self.house_dict['property']['bedroomCount'])
        except (KeyError, TypeError, ValueError):
            # Missing key, missing dictionary, or a non-numeric value
            return None
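Since every attribute follows the same "look up nested keys, return None on failure" pattern, it could be factored into one helper. This is a sketch under that assumption, not the project's code; get_nested is a hypothetical name.

```python
def get_nested(data, *keys, cast=None):
    """Follow keys through nested dicts; return None if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    if current is None or cast is None:
        return current
    try:
        return cast(current)
    except (TypeError, ValueError):
        # The value exists but cannot be converted
        return None


rooms = get_nested({"property": {"bedroomCount": "3"}}, "property", "bedroomCount", cast=int)
```

With such a helper, each attribute method reduces to a single call with a different key path.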

Store values into a defaultdict()

from collections import defaultdict

houses_apartments_dict = defaultdict(list)
with open('../csv_files/houses_apartments_urls.csv', 'r') as file:
    url = file.readline()
    while url != "":

        houses_class = HouseApartmentScraping(url)

        houses_apartments_dict['Locality'].append(houses_class.locality)
        houses_apartments_dict['Type of property'].append(houses_class.type_property)
        houses_apartments_dict['Subtype of property'].append(houses_class.subtype)
        houses_apartments_dict['Price'].append(houses_class.price)

        ...
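The accumulation pattern can be sketched independently of the scraping class. In this hedged example, collect_columns is a hypothetical helper and scrape stands for any callable that maps a URL to a dict of feature values; it assumes every call returns the same keys, otherwise the columns would end up with different lengths.

```python
from collections import defaultdict


def collect_columns(lines, scrape):
    """Build a dict of column lists from an iterable of URL lines."""
    columns = defaultdict(list)
    for line in lines:
        url = line.strip()       # drop the trailing newline read from the file
        if not url:              # skip blank lines
            continue
        for feature, value in scrape(url).items():
            columns[feature].append(value)
    return columns


# Stand-in for the real scraper, for illustration only
def fake_scrape(url):
    return {"Locality": url[-4:], "Price": None}


columns = collect_columns(["https://example.com/1000\n", "\n", "https://example.com/9000\n"], fake_scrape)
```

Iterating over the file object directly (for line in file) would yield the same lines as repeated readline() calls.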

Store data

We finally converted our dict into a pandas DataFrame and saved it as a csv.
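This last step can be sketched as follows, assuming the dict maps column names to lists of equal length. The function name save_dataset and the sample values are illustrative, not taken from the project.

```python
import pandas as pd


def save_dataset(columns, path):
    """Convert a dict of equal-length lists to a DataFrame and write it as CSV."""
    df = pd.DataFrame(columns)
    # index=False keeps the row index out of the file
    df.to_csv(path, index=False)
    return df


# Illustrative values only
example = {"Locality": ["1000", "9000"], "Price": [250000, 310000]}
```

Calling save_dataset(example, "dataset.csv") would write a two-row file with a Locality and a Price column.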
