Tentative to transform this scrape script into a class object program

Question 1

From the last few days I have been studying scraping webpages and for further development I challenge myself to transform the script I have created into a class which will receive inputs from a user.

The main reason I'm posting this is because I would like advice and insights on how should I approach and manipulate instance variables from given object.

I had quite the hard time to figure it out how should I start or initiate the flow of the program. I will give you briefly what the scrape script does and see for yourself if I make sense I have done in the later script.

Here's the scraper.

It's to scrape all the jobs vacancies from given State or city.

The program receive the first page of a given url from a State/city and extract 3 main information. The last page, the total of jobs and how many jobs it has each page.

Then it use this information to build a loop to keep scraping until it reaches the last page. I put the loop inside of CSV function in order to save the data into a file at the same time.

require 'nokogiri'
require 'rest-client'
require 'csv'
url = 'https://www.infojobs.com.br/empregos-em-santa-catarina.aspx'
begin
 html = RestClient.get(url)
rescue => e
 puts "ERROR:#{e.message}"
 next
parsed_html = Nokogiri::HTML(html)
total = parsed_html.css('.js_xiticounter').text.gsub('.', '')
page = 1
per_pagina = parsed_html.css('.element-vaga').count
last_page = (total.to_f / per_page).round
CSV.open('data_infojobsSC.csv', 'w') do |csv|
 csv << ['Title', 'Company', 'City', 'Área']
 until page >= last_page
 current_url = "https://www.infojobs.com.br/empregos-em-santa-catarina.aspx?Page=#{page}"
 begin
 current_html = RestClient.get(current_url)
 rescue => e
 puts "ERROR: #{current_url}"
 puts "Exception Message:#{e.message}"
 next
 end
 parsed_current_html = Nokogiri::HTML(current_html)
 jobs = parsed_current_html.css('.element-vaga')
 jobs.each do |job|
 title = job.css('div.vaga > a > h2').text.strip()
 company = job.css('div.vaga-company > a').text.strip()
 company = job.empty? ? "Confidential Company" : company
 city = job.css('p.location2 > span > span > span').text
 area = job.css('p.area > span').text
 csv << [title, company, city, area]
 end
 puts "Registrados da página #{page} salvos com sucesso."
 page+=1
 end
end

Here's my POO code. This is the third version of the first functional code. Each time I try to make more modular. This is my first POO ruby code. It was really hard to me grasp how the class itself would work because all the previous classes I had written were simple as a dog class and bark function from beginners videos...

require 'nokogiri'
require 'rest-client'
require 'csv'
class InfoJobs #this is the name of the website
 attr_accessor :url, :parsed_html, :total, :per_page, :last_page, :list
 attr_reader :city, :state
 
 def city=(city)
 @city = city.chomp.downcase.gsub(' ', '-')
 end
 
 def state=(state)
 @state = state.chomp.downcase
 end
 def build_url
 @url = 'https://www.infojobs.com.br/empregos-em-' + @city + ',-' + @state + '.aspx'
 end
 def parsing(url) #since I need to parse many urls I decided to make a function
 begin
 html = RestClient.get(url)
 rescue => e
 puts "ERROR on #{url}"
 puts "Exception Class:#{e.class.name}"
 puts "Exception Message:#{e.message}"
 end
 @parsed_html = Nokogiri::HTML(html)
 end
 def get_page_values #to the initial values to know how many pages to scrape and use these values to build a loop
 self.parsing(@url)
 @total = @parsed_html.css('.js_xiticounter').text.gsub('.', '')
 @per_page = @parsed_html.css('.element-vaga').count
 @last_page = (@total.to_f / @per_page).round
 end
 def scraping
 @list = []
 page = 1
 @url = @url + "?Page=" 
 until page >= @last_page
 @url + page.to_s
 jobs = self.parsing(@url).css('.element-vaga') #in this part of program, the instance variable @url is no longer that previosly page we sent to get_page_values method. Is ok to use in this way the same instance variable ? or the best practice is to create another instance variable called url_pagination?
 jobs.each do |job|
 company = job.css('div.vaga-company > a').text.strip()
 company = company.empty? ? "Confidential Company" : company
 data = {
 title: job.css('div.vaga > a > h2').text.strip(),
 company: company,
 city: job.css('p.location2 > span > span > span').text,
 area: job.css('p.area > span').text
 }
 @list << data
 puts "The page #{page} was successfully saved."
 page+=1
 end
 end
 end
 def writing
 CSV.open("list_jobs.csv", "wb", {headers: list.first.keys} ) do |csv|
 csv << ['Title', 'Company', 'City', 'Area']
 list.each do |hash|
 csv << hash
 end
 end
 end
 def run
 self.build_url
 self.parsing(@url)
 self.get_page_values
 self.scraping
 self.writing
 end
end
test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
teste.state = gets
puts "Processing..."
test.run

The user input is to set a State and a city. And then build the url with these values.

Just a note...Previously to call each class method I had something like this (below). In the code above I made a method 'run' to keep the object calling the methods inside the class. I really don't know if this is a correct approach or not...

teste = InfoJobs.new
test = InfoJobs.new
puts "[ Type a city ] "
test.city = gets
puts "[ Type the state ]"
teste.state = gets
puts "Processing..."
teste.build_url
teste.get_page_values
teste.scraping
teste.writing

NOTE: Each code runs perfectly fine.

Question 2

In general, your OOP approach looks good to me.
Please check the comments to know more about the changes/details I found:

require 'nokogiri'
require 'rest-client'
require 'csv'
# Just as a general reference, I changed all the code's indentation from 4 to 2
# spaces as it's the most used way, also suggested by the ruby style guide
# https://rubystyle.guide/#spaces-indentation , but it's up to you to use 4 if
# you feel more comfortable with it.
class InfoJobs
 # If you're not really reading/writing object attributes when you create an
 # instance, like:
 #
 # test = InfoJobs.new(city, state)
 # test.total = 12345
 # puts test.parsed_html
 #
 # there's no need to expose them for the object's instance. That's why I
 # remove all `attr_accessor`s. I have the feeling `city` and `state` are the
 # only attributes should be accessible, but only read access. Another option
 # would be to declare attributes as private `attr_reader`s, but I leave that
 # to you.
 attr_reader :city, :state
 # city/state attributes should be initialized in the constructor. Also is more
 # like an OOP approach.
 def initialize(city, state)
 @city = city.chomp.downcase.tr(' ', '-')
 @state = state.chomp.downcase
 end
 def run
 # Removed:
 # - `build_url`, as `base_url` will be called from every method when be
 # required.
 # - `parsing(@url)` as it's later called in `initial_page_values`, where it
 # really do something with the received html.
 # - All `self.` definitions, in general `self.` is required only when you're
 # modifying a value of the instance, but it's not needed to read values or
 # call methods.
 initial_page_values
 scraping
 ensure
 # With `ensure`, in case you ctrl+c the process or fail for some reason, you
 # "ensure" the gathered info so far is saved.
 writing
 end
 # The only accessible method for the class instance is `.run`, so the rest of
 # them should be kept as private.
 private
 # Changed method `build_url` to memoized method `base_url`. More info in
 # https://www.rubydoc.info/gems/rubocop/RuboCop/Cop/Naming/MemoizedInstanceVariableName
 def base_url
 # Changes:
 # - Using ||= to set a value only when this is empty,
 # - Using already defined city/state attr_readers instead of accessing
 # directly to the global variables @city and @state
 # - Instead of using + to concatenate, I'm interpolating, which is the most
 # prefered way. https://rubystyle.guide/#string-interpolation
 @base_url ||= "https://www.infojobs.com.br/empregos-em-#{city},-#{state}.aspx"
 end
 def parsing(url)
 # No need of `begin` when `rescue` contemplates the whole method
 html = RestClient.get(url)
 # No sense to have html parsing out of begin/rescue block. For `rescue`
 # case, the parsing would fail, so moving html parsing before rescue.
 # Also the result of this operation should be returned, not kept in a global
 # variable `@parsed_html`.
 Nokogiri::HTML(html)
 rescue StandardError => e
 puts "ERROR on #{url}"
 puts "Exception Class:#{e.class.name}"
 puts "Exception Message:#{e.message}"
 end
 # Minor change, but `get_` and `set_` method names are discouraged in ruby.
 # More info in https://rubystyle.guide/#accessor_mutator_method_names
 def initial_page_values
 # Following what was described in the previous method, you're going to need
 # the result of the parsing method only for the scope of this method, so
 # there's no need to save it as a global variable.
 parsed_html = parsing(base_url)
 # You're only using @last page out of this method, so there shouldn't be
 # necessary to have @total and @per_page as global variables.
 total = parsed_html.css('.js_xiticounter').text.delete('.')
 per_page = parsed_html.css('.element-vaga').count
 @last_page = (total.to_f / per_page).round
 puts "#{@last_page} pages found"
 end
 def scraping
 @list = []
 page = 1
 # Removed
 # @url += '?Page='
 # as it should be better to handle it directly when the parsing is invoked.
 until page >= @last_page
 # This is wrong:
 # @url + page.to_s
 # as you're adding the page to @url in a void context, after this line,
 # @url will keep having the same value as before this line.
 # This should be the url to parse:
 jobs = parsing("#{base_url}?Page=#{page}").css('.element-vaga')
 # Moving all the job parsing to its own method. Also to decrease ABC
 # metric https://en.wikipedia.org/wiki/ABC_Software_Metric and Cyclomatic
 # complexity https://en.wikipedia.org/wiki/Cyclomatic_complexity for this
 # `scraping` method.
 jobs.each { |job| @list.push(job_data(job)) }
 # Page increasing should be out of the jobs loop.
 puts "The page #{page} was successfully saved"
 page += 1
 end
 end
 def job_data(job)
 company = job.css('div.vaga-company > a').text.strip
 company = company.empty? ? 'Confidential Company' : company
 {
 title: job.css('div.vaga > a > h2').text.strip,
 company: company,
 city: job.css('p.location2 > span > span > span').text,
 area: job.css('p.area > span').text
 }#.tap { |jd| puts jd } # Uncomment if you need to debug values.
 end
 def writing
 # If you don't have values to save, simply return
 return unless @list && @list.any?
 puts "saving #{@list.count} jobs."
 # Just a minor change to write to a different file every time you run this,
 # but feel free to remove it.
 filename = "list_jobs_#{Time.now.to_i}.csv"
 CSV.open(filename, 'wb', headers: @list.first.keys) do |csv|
 csv << %w[Title Company City Area]
 @list.each do |hash|
 csv << hash
 end
 end
 end
end
puts '[ Type a city ] '
city = gets
puts '[ Type the state ]'
state = gets
puts 'Processing...'
test = InfoJobs.new(city, state)
test.run

When I talk about void context, check the following example:

# First definition of variable with a value
@url = 'www.stackoverflow.com' # => "www.stackoverflow.com"
# Then I'm adding a text, which is returning a value in a void context, in other
# words, there's no context, like a variable or a method, which is receiving the
# resulting value.
@url + '/tags' # => "www.stackoverflow.com/tags"
# If I access to the variable again, it has the same initial value
@url # => "www.stackoverflow.com"
# A different situation is when there's a context, in this case the same @url
# variable, which it will receive the resulting value of the addition.
@url = @url + '/tags' # => "www.stackoverflow.com/tags"
@url # => "www.stackoverflow.com/tags"
# Or also using the shorter += version
@url += '?tab=name' # => "www.stackoverflow.com/tags?tab=name"
@url # => "www.stackoverflow.com/tags?tab=name"

I'm far to be an OOP expert (I've forgotten most of theory learned in Uni.😅), but the idea is try to keep only the essential information of your object as attributes.
For your case, I think city and state are essential to make all the object's functioning works as expected.
Also, could be good to keep as attributes information to be accessed multiple times and that takes a while to get it. For your case, the jobs list was saved to be written later to a file (actually, rethinking the problem, maybe it wasn't necessary at all to have @list if after writing to a file isn't going to be used anymore).
Finally, you must to think that any created instance variable isn't garbage collected and will keep living in memory as long as your object's instance is alive. For your particular case, maybe this isn't a problem, but in the eventual case where you have thousands of those objects created, they should be as lightweight as possible or you could run out of memory.

Question 3

You have my sincere thanks. I really appreciate the break down you have done in my code. I could grasp most of all you wrote, but I couldn't understand what do you mean by "void context". Another thing, instance variables are the DNA of each object, right? So you don't keep changing their values like a did in the previously code right? like I was setting different values to my '@url' and '@parsed_html' variables each time the page changes.

Alter Lagos Alter Lagos 5212 silver badges9 bronze badges · Accepted Answer · 2020-02-26 22:50:48Z

In general, your OOP approach looks good to me.
Please check the comments to know more about the changes/details I found:

require 'nokogiri'
require 'rest-client'
require 'csv'
# Just as a general reference, I changed all the code's indentation from 4 to 2
# spaces as it's the most used way, also suggested by the ruby style guide
# https://rubystyle.guide/#spaces-indentation , but it's up to you to use 4 if
# you feel more comfortable with it.
class InfoJobs
 # If you're not really reading/writing object attributes when you create an
 # instance, like:
 #
 # test = InfoJobs.new(city, state)
 # test.total = 12345
 # puts test.parsed_html
 #
 # there's no need to expose them for the object's instance. That's why I
 # remove all `attr_accessor`s. I have the feeling `city` and `state` are the
 # only attributes should be accessible, but only read access. Another option
 # would be to declare attributes as private `attr_reader`s, but I leave that
 # to you.
 attr_reader :city, :state
 # city/state attributes should be initialized in the constructor. Also is more
 # like an OOP approach.
 def initialize(city, state)
 @city = city.chomp.downcase.tr(' ', '-')
 @state = state.chomp.downcase
 end
 def run
 # Removed:
 # - `build_url`, as `base_url` will be called from every method when be
 # required.
 # - `parsing(@url)` as it's later called in `initial_page_values`, where it
 # really do something with the received html.
 # - All `self.` definitions, in general `self.` is required only when you're
 # modifying a value of the instance, but it's not needed to read values or
 # call methods.
 initial_page_values
 scraping
 ensure
 # With `ensure`, in case you ctrl+c the process or fail for some reason, you
 # "ensure" the gathered info so far is saved.
 writing
 end
 # The only accessible method for the class instance is `.run`, so the rest of
 # them should be kept as private.
 private
 # Changed method `build_url` to memoized method `base_url`. More info in
 # https://www.rubydoc.info/gems/rubocop/RuboCop/Cop/Naming/MemoizedInstanceVariableName
 def base_url
 # Changes:
 # - Using ||= to set a value only when this is empty,
 # - Using already defined city/state attr_readers instead of accessing
 # directly to the global variables @city and @state
 # - Instead of using + to concatenate, I'm interpolating, which is the most
 # prefered way. https://rubystyle.guide/#string-interpolation
 @base_url ||= "https://www.infojobs.com.br/empregos-em-#{city},-#{state}.aspx"
 end
 def parsing(url)
 # No need of `begin` when `rescue` contemplates the whole method
 html = RestClient.get(url)
 # No sense to have html parsing out of begin/rescue block. For `rescue`
 # case, the parsing would fail, so moving html parsing before rescue.
 # Also the result of this operation should be returned, not kept in a global
 # variable `@parsed_html`.
 Nokogiri::HTML(html)
 rescue StandardError => e
 puts "ERROR on #{url}"
 puts "Exception Class:#{e.class.name}"
 puts "Exception Message:#{e.message}"
 end
 # Minor change, but `get_` and `set_` method names are discouraged in ruby.
 # More info in https://rubystyle.guide/#accessor_mutator_method_names
 def initial_page_values
 # Following what was described in the previous method, you're going to need
 # the result of the parsing method only for the scope of this method, so
 # there's no need to save it as a global variable.
 parsed_html = parsing(base_url)
 # You're only using @last page out of this method, so there shouldn't be
 # necessary to have @total and @per_page as global variables.
 total = parsed_html.css('.js_xiticounter').text.delete('.')
 per_page = parsed_html.css('.element-vaga').count
 @last_page = (total.to_f / per_page).round
 puts "#{@last_page} pages found"
 end
 def scraping
 @list = []
 page = 1
 # Removed
 # @url += '?Page='
 # as it should be better to handle it directly when the parsing is invoked.
 until page >= @last_page
 # This is wrong:
 # @url + page.to_s
 # as you're adding the page to @url in a void context, after this line,
 # @url will keep having the same value as before this line.
 # This should be the url to parse:
 jobs = parsing("#{base_url}?Page=#{page}").css('.element-vaga')
 # Moving all the job parsing to its own method. Also to decrease ABC
 # metric https://en.wikipedia.org/wiki/ABC_Software_Metric and Cyclomatic
 # complexity https://en.wikipedia.org/wiki/Cyclomatic_complexity for this
 # `scraping` method.
 jobs.each { |job| @list.push(job_data(job)) }
 # Page increasing should be out of the jobs loop.
 puts "The page #{page} was successfully saved"
 page += 1
 end
 end
 def job_data(job)
 company = job.css('div.vaga-company > a').text.strip
 company = company.empty? ? 'Confidential Company' : company
 {
 title: job.css('div.vaga > a > h2').text.strip,
 company: company,
 city: job.css('p.location2 > span > span > span').text,
 area: job.css('p.area > span').text
 }#.tap { |jd| puts jd } # Uncomment if you need to debug values.
 end
 def writing
 # If you don't have values to save, simply return
 return unless @list && @list.any?
 puts "saving #{@list.count} jobs."
 # Just a minor change to write to a different file every time you run this,
 # but feel free to remove it.
 filename = "list_jobs_#{Time.now.to_i}.csv"
 CSV.open(filename, 'wb', headers: @list.first.keys) do |csv|
 csv << %w[Title Company City Area]
 @list.each do |hash|
 csv << hash
 end
 end
 end
end
puts '[ Type a city ] '
city = gets
puts '[ Type the state ]'
state = gets
puts 'Processing...'
test = InfoJobs.new(city, state)
test.run

When I talk about void context, check the following example:

# First definition of variable with a value
@url = 'www.stackoverflow.com' # => "www.stackoverflow.com"
# Then I'm adding a text, which is returning a value in a void context, in other
# words, there's no context, like a variable or a method, which is receiving the
# resulting value.
@url + '/tags' # => "www.stackoverflow.com/tags"
# If I access to the variable again, it has the same initial value
@url # => "www.stackoverflow.com"
# A different situation is when there's a context, in this case the same @url
# variable, which it will receive the resulting value of the addition.
@url = @url + '/tags' # => "www.stackoverflow.com/tags"
@url # => "www.stackoverflow.com/tags"
# Or also using the shorter += version
@url += '?tab=name' # => "www.stackoverflow.com/tags?tab=name"
@url # => "www.stackoverflow.com/tags?tab=name"

I'm far to be an OOP expert (I've forgotten most of theory learned in Uni.😅), but the idea is try to keep only the essential information of your object as attributes.
For your case, I think city and state are essential to make all the object's functioning works as expected.
Also, could be good to keep as attributes information to be accessed multiple times and that takes a while to get it. For your case, the jobs list was saved to be written later to a file (actually, rethinking the problem, maybe it wasn't necessary at all to have @list if after writing to a file isn't going to be used anymore).
Finally, you must to think that any created instance variable isn't garbage collected and will keep living in memory as long as your object's instance is alive. For your particular case, maybe this isn't a problem, but in the eventual case where you have thousands of those objects created, they should be as lightweight as possible or you could run out of memory.

You have my sincere thanks. I really appreciate the break down you have done in my code. I could grasp most of all you wrote, but I couldn't understand what do you mean by "void context". Another thing, instance variables are the DNA of each object, right? So you don't keep changing their values like a did in the previously code right? like I was setting different values to my '@url' and '@parsed_html' variables each time the page changes.

Stack Exchange Network

Tentative to transform this scrape script into a class object program

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Tentative to transform this scrape script into a class object program

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions