
I wanted to create a scraper in Python that can fetch the data I need from LinkedIn. I tried many different approaches in Python, but I could not get it working until I used Selenium. In the end I built it and got it working the way I wanted.

The most difficult part of building this crawler was that there are hundreds of profile pages, which can be located with mostly three different XPath patterns. I managed to merge the three patterns into a single expression, and now it works great.

The scraper first clicks on the "view all recommendations" tab on the home page, then parses 200 profiles (the limit is customizable) by visiting each profile's main page. I've tried to make it error-free. Here is what I've done:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def producing_links(driver, wait):
    driver.get('https://www.linkedin.com/')
    driver.find_element_by_xpath('//*[@id="login-email"]').send_keys('someusername')
    driver.find_element_by_xpath('//*[@id="login-password"]').send_keys('somepassword')
    driver.find_element_by_xpath('//*[@id="login-submit"]').click()

    wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(@class,'feed-s-follows-module__view-all')]")))
    driver.find_element_by_xpath("//a[contains(@class,'feed-s-follows-module__view-all')]").click()

    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")))
        links = [item.get_attribute("href") for item in driver.find_elements_by_xpath("//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")]
        if len(links) == 200:
            break

    for link in links:
        get_docs(driver, wait, link)


def get_docs(driver, wait, name_link):
    driver.get(name_link)
    try:
        for item in driver.find_elements_by_xpath("//div[contains(@class,'pv-top-card-section__information') or contains(@class,'org-top-card-module__details') or (@class='org-top-card-module__main-column')]"):
            name = item.find_element_by_xpath(".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]").text
            title = item.find_element_by_xpath(".//span[contains(@class,'company-industries')]|.//h2[contains(@class,'pv-top-card-section__headline')]").text
    except Exception as e:
        print(e)
    finally:
        try:
            print(name, title)
        except Exception as ex:
            print(ex)


if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        producing_links(driver, wait)
    finally:
        driver.quit()
alecxe
asked Jul 17, 2017 at 13:15
  • Is this your own code? This question appears to have been posted before, under a different account. – Commented Jul 17, 2017 at 22:13

1 Answer


I would recommend a more modular design: a LinkedInScraper class, initialized with a login and password, with separate methods for logging in and for collecting profile links.

Also, I think you are overusing XPath overall. Whenever possible, first check whether a "by id", "by name" or "by css selector" locator can reach the element, and fall back to XPath only when the other locators cannot.

Also note that wait.until combined with the built-in expected conditions returns the located WebElement instance. If you are waiting for a specific element and then clicking it, you can do both in one go without re-finding the element.
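To make that concrete, the wait-then-click steps can be folded into a tiny helper. This is just a sketch (click_when_ready is a hypothetical name, not a Selenium API; the usage comment assumes the login-submit locator from the question):

    def click_when_ready(wait, condition):
        # wait.until() returns whatever the condition callable returns; for
        # the built-in element conditions that is the located WebElement,
        # so it can be clicked directly instead of being found a second time.
        element = wait.until(condition)
        element.click()
        return element

    # With a real driver this would be used roughly like:
    #   click_when_ready(wait, EC.visibility_of_element_located(
    #       (By.ID, 'login-submit')))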

Unfortunately, I cannot test the code below (for some reason, I don't see the recommendation link on the main page when logging in with my credentials), but I hope it is still useful:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class LinkedInScraper:
    def __init__(self, username, password):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        self.login(username, password)

    def __del__(self):
        self.driver.close()

    def login(self, username, password):
        self.driver.get('https://www.linkedin.com/')
        self.driver.find_element_by_id('login-email').send_keys(username)
        self.driver.find_element_by_id('login-password').send_keys(password)
        self.driver.find_element_by_id('login-submit').click()

    def links(self):
        follow_link = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.feed-s-follows-module__view-all")))
        follow_link.click()

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.feed-s-follow-recommendation-card__profile-link")))
            links = [item.get_attribute("href") for item in self.driver.find_elements_by_css_selector("a.feed-s-follow-recommendation-card__profile-link")]
            if len(links) == 200:
                break

        return links

    def profiles(self):
        for link in self.links():
            yield from self.profile(link)

    def profile(self, profile_link):
        self.driver.get(profile_link)
        for item in self.driver.find_elements_by_xpath("//div[contains(@class,'pv-top-card-section__information') or contains(@class,'org-top-card-module__details') or (@class='org-top-card-module__main-column')]"):
            name = item.find_element_by_xpath(".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]").text
            title = item.find_element_by_xpath(".//span[contains(@class,'company-industries')]|.//h2[contains(@class,'pv-top-card-section__headline')]").text
            yield (name, title)


if __name__ == '__main__':
    scraper = LinkedInScraper(username='username',
                              password='password')

    for profile in scraper.profiles():
        print(profile)

I am pretty sure the profile() method can be refactored as well, but I cannot get to that page to see whether its locators can be simplified.

answered Jul 17, 2017 at 13:56
  • Thanks, sir alecxe, for caring to answer. I'm getting a slightly problematic result: "<generator object LinkedInScraper.profile at 0x02277090>". Moreover, is it possible to exhaust all the records rather than limiting them to 200? Thanks again. – Commented Jul 17, 2017 at 14:12
  • @ShahinIqbal ah, sure, it needed the yield from; fixed, please try again. – Commented Jul 17, 2017 at 14:13
  • @ShahinIqbal yes, we can go over all the records. I'd love to help but, for some reason, I cannot see that "recommendations" link on the main page. Could you send me a direct link to the page that you then scroll? Thanks. – Commented Jul 17, 2017 at 14:14
  • I thought it was available to everyone, because even someone with a single connection should get recommendations to broaden their connectivity. However, if you can see it, give it a try; otherwise just leave it. Thanks for everything: "linkedin.com/feed". – Commented Jul 17, 2017 at 14:26
  • I like the way you write classes, sir alecxe. It is easy to understand. This "yield from" is new to me. Is there any major difference between "yield" and "yield from"? A one-liner explanation would be much appreciated. – Commented Jul 17, 2017 at 16:06
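On that last question, the difference between yield and yield from is easy to demonstrate outside Selenium. A minimal sketch (profile here is a stand-in for the scraper method, not real scraping code):

    def profile(link):
        # Imagine this yields (name, title) pairs scraped from one page.
        yield (link + "-name", link + "-title")

    def profiles_with_yield(links):
        for link in links:
            yield profile(link)       # yields generator objects -- the bug seen above

    def profiles_with_yield_from(links):
        for link in links:
            yield from profile(link)  # delegates: yields the pairs themselves

    print(list(profiles_with_yield_from(["a", "b"])))
    # [('a-name', 'a-title'), ('b-name', 'b-title')]

In one line: yield emits a single value (here, the inner generator object itself), while yield from delegates to an iterable and re-yields each of its items.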
