I wanted to create a scraper in Python that can fetch the required data from LinkedIn. I tried several approaches in Python, but I couldn't make it work until I used Selenium. However, I have now created it and got it working as I wanted.
The most difficult part of making this crawler was that the hundreds of profile pages mostly follow three different XPath patterns. I managed to combine the three different XPath patterns into one expression, and now it is working well.
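For the curious, the combining relies on XPath's or operator inside a predicate (the | union operator between full paths works too, and both appear in the code below). A minimal sketch, assuming a driver as in the script below and purely hypothetical class names:

# One expression covering three hypothetical layouts: an element matches
# if any of the class conditions holds.
combined = ("//div[contains(@class,'layout-a')"
            " or contains(@class,'layout-b')"
            " or contains(@class,'layout-c')]")
items = driver.find_elements_by_xpath(combined)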
The scraper first clicks the "view all recommendations" link on the home page, then parses 200 profiles (the count is customized in this case) by visiting each profile's main page. I've tried to make it error-free. Here is what I've done:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def producing_links(driver, wait):
    # Log in with hard-coded credentials.
    driver.get('https://www.linkedin.com/')
    driver.find_element_by_xpath('//*[@id="login-email"]').send_keys('someusername')
    driver.find_element_by_xpath('//*[@id="login-password"]').send_keys('somepassword')
    driver.find_element_by_xpath('//*[@id="login-submit"]').click()

    # Open the "view all recommendations" page.
    wait.until(EC.visibility_of_element_located(
        (By.XPATH, "//a[contains(@class,'feed-s-follows-module__view-all')]")))
    driver.find_element_by_xpath(
        "//a[contains(@class,'feed-s-follows-module__view-all')]").click()

    # Keep scrolling until exactly 200 profile links have loaded.
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait.until(EC.visibility_of_element_located(
            (By.XPATH, "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")))
        links = [item.get_attribute("href") for item in driver.find_elements_by_xpath(
            "//a[contains(@class,'feed-s-follow-recommendation-card__profile-link')]")]
        if len(links) == 200:
            break

    for link in links:
        get_docs(driver, wait, link)


def get_docs(driver, wait, name_link):
    # Visit the profile page and pull the name and title, whichever of the
    # three page layouts is in use.
    driver.get(name_link)
    try:
        for item in driver.find_elements_by_xpath(
                "//div[contains(@class,'pv-top-card-section__information')"
                " or contains(@class,'org-top-card-module__details')"
                " or (@class='org-top-card-module__main-column')]"):
            name = item.find_element_by_xpath(
                ".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]").text
            title = item.find_element_by_xpath(
                ".//span[contains(@class,'company-industries')]"
                "|.//h2[contains(@class,'pv-top-card-section__headline')]").text
    except Exception as e:
        print(e)
    finally:
        try:
            # Raises NameError if the lookups above failed, hence the inner try.
            print(name, title)
        except Exception as ex:
            print(ex)


if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    try:
        producing_links(driver, wait)
    finally:
        driver.quit()
- Is this your own code? This question appears to have been posted before, under a different account. – 200_success, Jul 17, 2017 at 22:13
1 Answer
I would recommend a more modular design: a LinkedInScraper class, initialized with a login and password, with separate methods for logging in and getting profile links.
Also, I think you are overusing XPaths overall: whenever possible, first explore whether you can use "by id", "by name", or "by css selector" locators, and fall back to XPath only if you cannot get to the element with other locators.
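As a small illustration with the login field from this page, all three of the following find the same element, from most to least preferred:

# Prefer the simplest locator the page allows; fall back to XPath last.
driver.find_element_by_id('login-email')                  # by id
driver.find_element_by_css_selector('#login-email')       # by css selector
driver.find_element_by_xpath('//*[@id="login-email"]')    # XPath, as a last resort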
Also note that wait.until combined with a built-in expected condition returns a WebElement instance: if you are waiting for a specific element and then clicking it, you can do it in one go without re-finding the element.
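A minimal sketch of the one-step version, using the locator from the code below:

# visibility_of_element_located returns the WebElement once it is visible,
# so the result of wait.until() can be clicked directly:
wait.until(EC.visibility_of_element_located(
    (By.CSS_SELECTOR, "a.feed-s-follows-module__view-all"))).click()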
Unfortunately, I cannot test the code below (for some reason, I don't see the recommendation link on the main page when logging in with my credentials), but I hope it is still useful:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class LinkedInScraper:
    def __init__(self, username, password):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)
        self.login(username, password)

    def __del__(self):
        self.driver.close()

    def login(self, username, password):
        self.driver.get('https://www.linkedin.com/')
        self.driver.find_element_by_id('login-email').send_keys(username)
        self.driver.find_element_by_id('login-password').send_keys(password)
        self.driver.find_element_by_id('login-submit').click()

    def links(self):
        follow_link = self.wait.until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "a.feed-s-follows-module__view-all")))
        follow_link.click()

        while True:
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            self.wait.until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "a.feed-s-follow-recommendation-card__profile-link")))
            links = [item.get_attribute("href")
                     for item in self.driver.find_elements_by_css_selector(
                         "a.feed-s-follow-recommendation-card__profile-link")]
            if len(links) == 200:
                break
        return links

    def profiles(self):
        for link in self.links():
            yield from self.profile(link)

    def profile(self, profile_link):
        self.driver.get(profile_link)
        for item in self.driver.find_elements_by_xpath(
                "//div[contains(@class,'pv-top-card-section__information')"
                " or contains(@class,'org-top-card-module__details')"
                " or (@class='org-top-card-module__main-column')]"):
            name = item.find_element_by_xpath(
                ".//h1[@title]|.//h1[contains(@class,'pv-top-card-section__name')]").text
            title = item.find_element_by_xpath(
                ".//span[contains(@class,'company-industries')]"
                "|.//h2[contains(@class,'pv-top-card-section__headline')]").text
            yield (name, title)


if __name__ == '__main__':
    scraper = LinkedInScraper(username='username',
                              password='password')
    for profile in scraper.profiles():
        print(profile)
I am pretty sure we can also refactor the profile() method, but I cannot get to that page in order to see whether the locators can be simplified.
- Thanks, sir alecxe, for caring to answer. I'm getting a slightly problematic result: "<generator object LinkedInScraper.profile at 0x02277090>". Moreover, is it possible to exhaust all the records rather than limiting them to 200? Thanks again. – MITHU, Jul 17, 2017 at 14:12
- @ShahinIqbal Ah, sure, it needed the yield from; fixed, please try again. – alecxe, Jul 17, 2017 at 14:13
- @ShahinIqbal Yes, we can go over all the records. I'd love to help, but for some reason I cannot see that "recommendations" link on the main page. Could you send me a direct link to the page that you then scroll? Thanks. – alecxe, Jul 17, 2017 at 14:14
- I thought it was available to everyone, because even someone with a single connection should get recommendations to broaden their network. However, if you can see it here, then give it a try; otherwise just leave it. Thanks for everything. "linkedin.com/feed" – MITHU, Jul 17, 2017 at 14:26
- I like the way you write classes, sir alecxe; it is easy to understand. This "yield from" is new to me. Is there any major difference between "yield" and "yield from"? A one-line explanation would be much appreciated. – MITHU, Jul 17, 2017 at 16:06
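On that last question: yield hands back the exact value of the expression after it (here a whole generator object, which is also what the earlier comment's "<generator object ...>" output shows), while yield from delegates to the inner generator and yields each of its items in turn. A minimal illustration:

def inner():
    yield 1
    yield 2

def outer_yield():
    yield inner()        # one item: the generator object itself

def outer_yield_from():
    yield from inner()   # two items: 1, then 2

print(list(outer_yield()))       # [<generator object inner at 0x...>]
print(list(outer_yield_from()))  # [1, 2]

And on exhausting all the records rather than stopping at 200: one untested possibility, sketched here as an assumption since the recommendations page could not be reached for verification, is to scroll until the number of collected links stops growing, e.g. as an extra LinkedInScraper method:

import time

def all_links(self):
    # Hypothetical, untested sketch: scroll until no new profile cards load,
    # instead of breaking at a fixed count of 200.
    seen = 0
    while True:
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude pause so newly loaded cards can render
        links = [item.get_attribute("href")
                 for item in self.driver.find_elements_by_css_selector(
                     "a.feed-s-follow-recommendation-card__profile-link")]
        if len(links) == seen:  # nothing new appeared on this pass
            return links
        seen = len(links)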