Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

mguideng/gdscrapeR

Folders and files

NameName
Last commit message
Last commit date

Latest commit

History

10 Commits
R
R

Repository files navigation

gdscrapeR: scrape Glassdoor company reviews in R

GitHub Release Date

ABOUT

gdscrapeR is an R package that scrapes company reviews from Glassdoor using a single function: get_reviews. It returns a data frame structure for holding the text data, which can be further prepped for text analytics learning projects.

INSTALL & LOAD

The latest version from GitHub:

install.packages("devtools")
devtools::install_github("mguideng/gdscrapeR")
library(gdscrapeR)

USAGE

Example

The URL to scrape the awesome SpaceX company will be: www.glassdoor.com/Reviews/SpaceX-Reviews-E40371.htm.

spacex-url

Function

Pass the company number through the get_reviews function. The company number is a string representing a company's unique ID number. Identified by navigating to a company's Glassdoor reviews web page and reviewing the URL for characters between "Reviews-" and ".htm" (usually starts with an "E" and followed by digits).

# Create data frame of: Date, Summary, Rating, Title, Pros, Cons, Helpful
df <- get_reviews(companyNum = "E40371")

This will scrape the following variables:

  • Date - of when review was posted
  • Summary - e.g., "Great People"
  • Rating - star rating between 1.0 and 5.0
  • Title - e.g., "Current Employee - Manager in Hawthorne, CA"
  • Pros - upsides of the workplace
  • Cons - downsides of the workplace
  • Helpful - count marked as being helpful, if any
  • (and other info related to the source link)

Result

spacex-results

PREP FOR TEXT ANALYTICS

RegEx & Export

Use regular expressions to clean and extract additional variables and then export:

#### REGEX ####
# Package
library(stringr) # pattern matching functions
# Add: PriKey (uniquely identify rows 1 to N, sorted from first to last review by date)
df$rev.pk <- as.numeric(rownames(df))
# Extract: Year, Status, Position, Location 
df$rev.year <- as.numeric(sub(".*, ","", df$rev.date))
df$rev.stat <- str_match(df$rev.title, ".+?(?= Employee -)")
df$rev.pos <- str_replace_all(df$rev.title, ".* Employee - |\\sin .*|\\s$", "")
df$rev.loc <- sub(".*\\sin ", "", df$rev.title)
df$rev.loc <- ifelse(df$rev.loc %in% 
 (grep("Former Employee|Current Employee|^+$", df$rev.loc, value = T)), 
 "Not Given", df$rev.loc)
# Clean: Pros, Cons, Helpful
df$rev.pros <- gsub("&amp;", "&", df$rev.pros)
df$rev.cons <- gsub("&amp;", "&", df$rev.cons)
df$rev.helpf <- as.numeric(gsub("\\D", "", df$rev.helpf))
#### EXPORT ####
write.csv(df, "df-results.csv", row.names = F)

Exploration ideas

gdscrapeR was made for learning purposes. Analyze the unstructured text, extract relevant information, and transform it into useful insights.

  • Apply Natural Language Processing (NLP) methods to show what is being written about the most.
  • Sentiment analysis by categorizing the text data to determine whether a review is considered positive, negative, or neutral as a way of deriving the emotions and attitudes of employees. Here's a sample project: "Text Mining Company Reviews (in R) - Case of MBB Consulting".
  • Create a metrics profile for a company to track how star rating distributions are changing over time.
  • The "Text Mining with R" book by Julia Silge and David Robinson is highly recommended for further ideas.

If you find this package useful, feel free to star ⭐ it. Thanks for visiting ❀️ .

NOTES

  • Uses the rvest and purrr packages to make it easy to scrape company reviews into a data frame.
  • A common issue with scraping is the need to keep up with changes made to a website's pages. The Glassdoor site will change. Errors due to CSS selector changes will be shown as "Could not scrape data from website.".
  • Be polite.
    • A system sleeper is built in so there will be delays to slow down the scraper (expect ~1.5 minutes for every 100 reviews).
    • Also, saving the data frame to avoid redundant scraping sessions is suggested.
  • To contact maintainer: [imlearningthethings at gmail].

AltStyle γ«γ‚ˆγ£γ¦ε€‰ζ›γ•γ‚ŒγŸγƒšγƒΌγ‚Έ (->γ‚ͺγƒͺγ‚ΈγƒŠγƒ«) /