From the last few days I have been studying scraping webpages and for further development I challenge myself to transform the script I have created into a class which will receive inputs from a user.
The main reason I'm posting this is because I would like advice and insights on how should I approach and manipulate instance variables from given object.
I had quite the hard time to figure it out how should I start or initiate the flow of the program. I will give you briefly what the scrape script does and see for yourself if I make sense I have done in the later script.
Here's the scraper.
It's to scrape all the jobs vacancies from given State or city.
The program receive the first page of a given url from a State/city and extract 3 main information. The last page, the total of jobs and how many jobs it has each page.
Then it use this information to build a loop to keep scraping until it reaches the last page. I put the loop inside of CSV function in order to save the data into a file at the same time.
require 'nokogiri'
require 'rest-client'
require 'csv'
url = 'https://www.infojobs.com.br/empregos-em-santa-catarina.aspx'
begin
html = RestClient.get(url)
rescue => e
puts "ERROR:#{e.message}"
next
parsed_html = Nokogiri::HTML(html)
total = parsed_html.css('.js_xiticounter').text.gsub('.', '')
page = 1
per_pagina = parsed_html.css('.element-vaga').count
last_page = (total.to_f / per_page).round
CSV.open('data_infojobsSC.csv', 'w') do |csv|
csv << ['Title', 'Company', 'City', 'Área']
until page >= last_page
current_url = "https://www.infojobs.com.br/empregos-em-santa-catarina.aspx?Page=#{page}"
begin
current_html = RestClient.get(current_url)
rescue => e
puts "ERROR: #{current_url}"
puts "Exception Message:#{e.message}"
next
end
parsed_current_html = Nokogiri::HTML(current_html)
jobs = parsed_current_html.css('.element-vaga')
jobs.each do |job|
title = job.css('div.vaga > a > h2').text.strip()
company = job.css('div.vaga-company > a').text.strip()
company = job.empty? ? "Confidential Company" : company
city = job.css('p.location2 > span > span > span').text
area = job.css('p.area > span').text
csv << [title, company, city, area]
end
puts "Registrados da página #{page} salvos com sucesso."
page+=1
end
end
Here's my POO code. This is the third version of the first functional code. Each time I try to make more modular. This is my first POO ruby code. It was really hard to me grasp how the class itself would work because all the previous classes I had written were simple as a dog class and bark function from beginners videos...
require 'nokogiri'
require 'rest-client'
require 'csv'
class InfoJobs #this is the name of the website
attr_accessor :url, :parsed_html, :total, :per_page, :last_page, :list
attr_reader :city, :state
def city=(city)
@city = city.chomp.downcase.gsub(' ', '-')
end
def state=(state)
@state = state.chomp.downcase
end
def build_url
@url = 'https://www.infojobs.com.br/empregos-em-' + @city + ',-' + @state + '.aspx'
end
def parsing(url) #since I need to parse many urls I decided to make a function
begin
html = RestClient.get(url)
rescue => e
puts "ERROR on #{url}"
puts "Exception Class:#{e.class.name}"
puts "Exception Message:#{e.message}"
end
@parsed_html = Nokogiri::HTML(html)
end
def get_page_values #to the initial values to know how many pages to scrape and use these values to build a loop
self.parsing(@url)
@total = @parsed_html.css('.js_xiticounter').text.gsub('.', '')
@per_page = @parsed_html.css('.element-vaga').count
@last_page = (@total.to_f / @per_page).round
end
def scraping
@list = []
page = 1
@url = @url + "?Page="
until page >= @last_page
@url + page.to_s
jobs = self.parsing(@url).css('.element-vaga') #in this part of program, the instance variable @url is no longer that previosly page we sent to get_page_values method. Is ok to use in this way the same instance variable ? or the best practice is to create another instance variable called url_pagination?
jobs.each do |job|
company = job.css('div.vaga-company > a').text.strip()
company = company.empty? ? "Confidential Company" : company
data = {
title: job.css('div.vaga > a > h2').text.strip(),
company: company,
city: job.css('p.location2 > span > span > span').text,
area: job.css('p.area > span').text
}
@list << data
puts "The page #{page} was successfully saved."
page+=1
end
end
end
def writing
CSV.open("list_jobs.csv", "wb", {headers: list.first.keys} ) do |csv|
csv << ['Title', 'Company', 'City', 'Area']
list.each do |hash|
csv << hash
end
end
end
def run
self.build_url
self.parsing(@url)
self.get_page_values
self.scraping
self.writing
end
end
test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
teste.state = gets
puts "Processing..."
test.run
The user input is to set a State and a city. And then build the url with these values.
Just a note...Previously to call each class method I had something like this (below). In the code above I made a method 'run' to keep the object calling the methods inside the class. I really don't know if this is a correct approach or not...
teste = InfoJobs.new
test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
teste.state = gets
puts "Processing..."
teste.build_url
teste.get_page_values
teste.scraping
teste.writing
NOTE: Each code runs perfectly fine.
1 Answer 1
In general, your OOP approach looks good to me.
Please check the comments to know more about the changes/details I found:
require 'nokogiri'
require 'rest-client'
require 'csv'
# Just as a general reference, I changed all the code's indentation from 4 to 2
# spaces as it's the most used way, also suggested by the ruby style guide
# https://rubystyle.guide/#spaces-indentation , but it's up to you to use 4 if
# you feel more comfortable with it.
class InfoJobs
# If you're not really reading/writing object attributes when you create an
# instance, like:
#
# test = InfoJobs.new(city, state)
# test.total = 12345
# puts test.parsed_html
#
# there's no need to expose them for the object's instance. That's why I
# remove all `attr_accessor`s. I have the feeling `city` and `state` are the
# only attributes should be accessible, but only read access. Another option
# would be to declare attributes as private `attr_reader`s, but I leave that
# to you.
attr_reader :city, :state
# city/state attributes should be initialized in the constructor. Also is more
# like an OOP approach.
def initialize(city, state)
@city = city.chomp.downcase.tr(' ', '-')
@state = state.chomp.downcase
end
def run
# Removed:
# - `build_url`, as `base_url` will be called from every method when be
# required.
# - `parsing(@url)` as it's later called in `initial_page_values`, where it
# really do something with the received html.
# - All `self.` definitions, in general `self.` is required only when you're
# modifying a value of the instance, but it's not needed to read values or
# call methods.
initial_page_values
scraping
ensure
# With `ensure`, in case you ctrl+c the process or fail for some reason, you
# "ensure" the gathered info so far is saved.
writing
end
# The only accessible method for the class instance is `.run`, so the rest of
# them should be kept as private.
private
# Changed method `build_url` to memoized method `base_url`. More info in
# https://www.rubydoc.info/gems/rubocop/RuboCop/Cop/Naming/MemoizedInstanceVariableName
def base_url
# Changes:
# - Using ||= to set a value only when this is empty,
# - Using already defined city/state attr_readers instead of accessing
# directly to the global variables @city and @state
# - Instead of using + to concatenate, I'm interpolating, which is the most
# prefered way. https://rubystyle.guide/#string-interpolation
@base_url ||= "https://www.infojobs.com.br/empregos-em-#{city},-#{state}.aspx"
end
def parsing(url)
# No need of `begin` when `rescue` contemplates the whole method
html = RestClient.get(url)
# No sense to have html parsing out of begin/rescue block. For `rescue`
# case, the parsing would fail, so moving html parsing before rescue.
# Also the result of this operation should be returned, not kept in a global
# variable `@parsed_html`.
Nokogiri::HTML(html)
rescue StandardError => e
puts "ERROR on #{url}"
puts "Exception Class:#{e.class.name}"
puts "Exception Message:#{e.message}"
end
# Minor change, but `get_` and `set_` method names are discouraged in ruby.
# More info in https://rubystyle.guide/#accessor_mutator_method_names
def initial_page_values
# Following what was described in the previous method, you're going to need
# the result of the parsing method only for the scope of this method, so
# there's no need to save it as a global variable.
parsed_html = parsing(base_url)
# You're only using @last page out of this method, so there shouldn't be
# necessary to have @total and @per_page as global variables.
total = parsed_html.css('.js_xiticounter').text.delete('.')
per_page = parsed_html.css('.element-vaga').count
@last_page = (total.to_f / per_page).round
puts "#{@last_page} pages found"
end
def scraping
@list = []
page = 1
# Removed
# @url += '?Page='
# as it should be better to handle it directly when the parsing is invoked.
until page >= @last_page
# This is wrong:
# @url + page.to_s
# as you're adding the page to @url in a void context, after this line,
# @url will keep having the same value as before this line.
# This should be the url to parse:
jobs = parsing("#{base_url}?Page=#{page}").css('.element-vaga')
# Moving all the job parsing to its own method. Also to decrease ABC
# metric https://en.wikipedia.org/wiki/ABC_Software_Metric and Cyclomatic
# complexity https://en.wikipedia.org/wiki/Cyclomatic_complexity for this
# `scraping` method.
jobs.each { |job| @list.push(job_data(job)) }
# Page increasing should be out of the jobs loop.
puts "The page #{page} was successfully saved"
page += 1
end
end
def job_data(job)
company = job.css('div.vaga-company > a').text.strip
company = company.empty? ? 'Confidential Company' : company
{
title: job.css('div.vaga > a > h2').text.strip,
company: company,
city: job.css('p.location2 > span > span > span').text,
area: job.css('p.area > span').text
}#.tap { |jd| puts jd } # Uncomment if you need to debug values.
end
def writing
# If you don't have values to save, simply return
return unless @list && @list.any?
puts "saving #{@list.count} jobs."
# Just a minor change to write to a different file every time you run this,
# but feel free to remove it.
filename = "list_jobs_#{Time.now.to_i}.csv"
CSV.open(filename, 'wb', headers: @list.first.keys) do |csv|
csv << %w[Title Company City Area]
@list.each do |hash|
csv << hash
end
end
end
end
puts '[ Type a city ] '
city = gets
puts '[ Type the state ]'
state = gets
puts 'Processing...'
test = InfoJobs.new(city, state)
test.run
When I talk about void context, check the following example:
# First definition of variable with a value
@url = 'www.stackoverflow.com' # => "www.stackoverflow.com"
# Then I'm adding a text, which is returning a value in a void context, in other
# words, there's no context, like a variable or a method, which is receiving the
# resulting value.
@url + '/tags' # => "www.stackoverflow.com/tags"
# If I access to the variable again, it has the same initial value
@url # => "www.stackoverflow.com"
# A different situation is when there's a context, in this case the same @url
# variable, which it will receive the resulting value of the addition.
@url = @url + '/tags' # => "www.stackoverflow.com/tags"
@url # => "www.stackoverflow.com/tags"
# Or also using the shorter += version
@url += '?tab=name' # => "www.stackoverflow.com/tags?tab=name"
@url # => "www.stackoverflow.com/tags?tab=name"
I'm far to be an OOP expert (I've forgotten most of theory learned in Uni.😅), but the idea is try to keep only the essential information of your object as attributes.
For your case, I think city and state are essential to make all the object's functioning works as expected.
Also, could be good to keep as attributes information to be accessed multiple times and that takes a while to get it. For your case, the jobs list was saved to be written later to a file (actually, rethinking the problem, maybe it wasn't necessary at all to have @list
if after writing to a file isn't going to be used anymore).
Finally, you must to think that any created instance variable isn't garbage collected and will keep living in memory as long as your object's instance is alive. For your particular case, maybe this isn't a problem, but in the eventual case where you have thousands of those objects created, they should be as lightweight as possible or you could run out of memory.
-
\$\begingroup\$ You have my sincere thanks. I really appreciate the break down you have done in my code. I could grasp most of all you wrote, but I couldn't understand what do you mean by "void context". Another thing, instance variables are the DNA of each object, right? So you don't keep changing their values like a did in the previously code right? like I was setting different values to my '@url' and '@parsed_html' variables each time the page changes. \$\endgroup\$Greg M– Greg M2020年02月27日 01:00:15 +00:00Commented Feb 27, 2020 at 1:00