I wanted to create a scraper in Python that can fetch the required data from LinkedIn. I tried several approaches in Python, but I couldn't make it work until I used Selenium. However, I have now created it and got it working as I wanted.
The most difficult part of making this crawler was that the hundreds of profile pages mostly follow three different XPath patterns. I managed to combine the three different XPath patterns into one expression, and now it is working well.
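For the curious, the combining relies on XPath's or operator inside a predicate (the | union operator between full paths works too, and both appear in the code below). A minimal sketch, assuming a driver as in the script below and purely hypothetical class names:

# One expression covering three hypothetical layouts: an element matches
# if any of the class conditions holds.
combined = ("//div[contains(@class,'layout-a')"
            " or contains(@class,'layout-b')"
            " or contains(@class,'layout-c')]")
items = driver.find_elements_by_xpath(combined)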
The scraper first clicks the "view all recommendations" link on the home page, then parses 200 profiles (the count is customized in this case) by visiting each profile's main page. I've tried to make it error-free. Here is what I've done:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def producing_links(driver, wait):
    # Log in with hard-coded credentials.
    driver.get('https://www.linkedin.com/')
    driver.find_element_by_xpath('//*[@id="login-email"]').send_keys('someusername')
    driver.find_element_by_xpath('//*[@id="login-password"]').send_keys('somepassword')
    driver.find_element_by_xpath('//*[@id="login-submit"]').click()

    # Open the "view all recommendations" page.
    wait.until(EC.visibility_of_element_located(
        (By.XPATH, "//a[contains(@class,'feed-s-follows-module__view-all')]")))
    driver.find_element_by_xpath(
        "//a[contains(@class,'feed-s-follows-module__view-all')]").click()

    # Keep scrolling until exactly 200 profile links have loaded.
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.visibility_of_element_located(
            (By.XPATH, "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")))
        links = [item.get_attribute("href") for item in driver.find_elements_by_xpath(
            "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")]
        if len(links) == 200:
            break

    for link in links:
        get_docs(driver, wait, link)


def get_docs(driver, wait, name_link):
    # Visit the profile page and pull the name and title, whichever of the
    # three page layouts is in use.
    driver.get(name_link)
    try:
        for item in driver.find_elements_by_xpath(
                "//div[contains(@class,'pv-top-card-section__information')"
                " or contains(@class,'org-top-card-module__details')"
                " or (@class='org-top-card-module__main-column')]"):
            name = item.find_element_by_xpath(
                ".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]").text
            title = item.find_element_by_xpath(
                ".//span[contains(@class,'company-industries')]"
                "|.//h2[contains(@class,'pv-top-card-section__headline')]").text
    except Exception as e:
        print(e)
    finally:
        try:
            # Raises NameError if the lookups above failed, hence the inner try.
            print(name, title)
        except Exception as ex:
            print(ex)


if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        producing_links(driver, wait)
    finally:
        driver.quit()
- Is this your own code? This question appears to have been posted before, under a different account. – 200_success, Jul 17, 2017 at 22:13
1 Answer
I would recommend a more modular design: a LinkedInScraper class, initialized with a login and password, with separate methods for logging in and getting profile links.
Also, I think you are overusing XPaths overall: whenever possible, first explore whether you can use "by id", "by name", or "by css selector" locators, and fall back to XPath only if you cannot get to the element with other locators.
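As a small illustration with the login field from this page, all three of the following find the same element, from most to least preferred:

# Prefer the simplest locator the page allows; fall back to XPath last.
driver.find_element_by_id('login-email')                  # by id
driver.find_element_by_css_selector('#login-email')       # by css selector
driver.find_element_by_xpath('//*[@id="login-email"]')    # XPath, as a last resort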
Also note that wait.until combined with a built-in expected condition returns a WebElement instance: if you are waiting for a specific element and then clicking it, you can do it in one go without re-finding the element.
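A minimal sketch of the one-step version, using the locator from the code below:

# visibility_of_element_located returns the WebElement once it is visible,
# so the result of wait.until() can be clicked directly:
wait.until(EC.visibility_of_element_located(
    (By.CSS_SELECTOR, "a.feed-s-follows-module__view-all"))).click()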
Unfortunately, I cannot test the code below (for some reason, I don't see the recommendation link on the main page when logging in with my credentials), but I hope it is still useful:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class LinkedInScraper:
    def __init__(self, username, password):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        self.login(username, password)

    def __del__(self):
        self.driver.close()

    def login(self, username, password):
        self.driver.get('https://www.linkedin.com/')
        self.driver.find_element_by_id('login-email').send_keys(username)
        self.driver.find_element_by_id('login-password').send_keys(password)
        self.driver.find_element_by_id('login-submit').click()

    def links(self):
        follow_link = self.wait.until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "a.feed-s-follows-module__view-all")))
        follow_link.click()

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            self.wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "a.feed-s-follow-recommendation-card__profile-link")))
            links = [item.get_attribute("href")
                     for item in self.driver.find_elements_by_css_selector(
                         "a.feed-s-follow-recommendation-card__profile-link")]
            if len(links) == 200:
                break
        return links

    def profiles(self):
        for link in self.links():
            yield from self.profile(link)

    def profile(self, profile_link):
        self.driver.get(profile_link)
        for item in self.driver.find_elements_by_xpath(
                "//div[contains(@class,'pv-top-card-section__information')"
                " or contains(@class,'org-top-card-module__details')"
                " or (@class='org-top-card-module__main-column')]"):
            name = item.find_element_by_xpath(
                ".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]").text
            title = item.find_element_by_xpath(
                ".//span[contains(@class,'company-industries')]"
                "|.//h2[contains(@class,'pv-top-card-section__headline')]").text
            yield (name, title)


if __name__ == '__main__':
    scraper = LinkedInScraper(username='username',
                              password='password')
    for profile in scraper.profiles():
        print(profile)
I am pretty sure we can also refactor the profile() method, but I cannot get to that page in order to see whether the locators can be simplified.
- Thanks, sir alecxe, for caring to answer. I'm getting a slightly problematic result: "<generator object LinkedInScraper.profile at 0x02277090>". Moreover, is it possible to exhaust all the records rather than limiting them to 200? Thanks again. – MITHU, Jul 17, 2017 at 14:12
- @ShahinIqbal Ah, sure, it needed the yield from; fixed, please try again. – alecxe, Jul 17, 2017 at 14:13
- @ShahinIqbal Yes, we can go over all the records. I'd love to help, but for some reason I cannot see that "recommendations" link on the main page. Could you send me a direct link to the page that you then scroll? Thanks. – alecxe, Jul 17, 2017 at 14:14
- I thought it was available to everyone, because even someone with a single connection should get recommendations to broaden their network. However, if you can see it here, then give it a try; otherwise just leave it. Thanks for everything. "linkedin.com/feed" – MITHU, Jul 17, 2017 at 14:26
- I like the way you write classes, sir alecxe; it is easy to understand. This "yield from" is new to me. Is there any major difference between "yield" and "yield from"? A one-line explanation would be much appreciated. – MITHU, Jul 17, 2017 at 16:06
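On that last question: yield hands back the exact value of the expression after it (here a whole generator object, which is also what the earlier comment's "<generator object ...>" output shows), while yield from delegates to the inner generator and yields each of its items in turn. A minimal illustration:

def inner():
    yield 1
    yield 2

def outer_yield():
    yield inner()        # one item: the generator object itself

def outer_yield_from():
    yield from inner()   # two items: 1, then 2

print(list(outer_yield()))       # [<generator object inner at 0x...>]
print(list(outer_yield_from()))  # [1, 2]

And on exhausting all the records rather than stopping at 200: one untested possibility, sketched here as an assumption since the recommendations page could not be reached for verification, is to scroll until the number of collected links stops growing, e.g. as an extra LinkedInScraper method:

import time

def all_links(self):
    # Hypothetical, untested sketch: scroll until no new profile cards load,
    # instead of breaking at a fixed count of 200.
    seen = 0
    while True:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude pause so newly loaded cards can render
        links = [item.get_attribute("href")
                 for item in self.driver.find_elements_by_css_selector(
                     "a.feed-s-follow-recommendation-card__profile-link")]
        if len(links) == seen:  # nothing new appeared on this pass
            return links
        seen = len(links)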