The task is simple: extract the link to the pronunciation audio for a word from a Yahoo Dictionary web page, e.g. Yahoo's dictionary entry for "real".
Using ChroPath, I can locate the XPath of the element that contains the ".mp3" src link. The XPath is
//div[@class='compText ml-10 d-ib']//span[contains(@class,'d-ib dict-sound va-mid audio-0')]
However, when I run the code below, the find_element_by_xpath call seems to return nothing (note the SoundURL part):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import os
# instantiate a chrome options object so you can set the size and headless preference
options = Options()
options.add_argument("--headless")
word = "real"
print("start driver...", end='')
# raw string so the backslashes in the Windows path are not treated as escape sequences
driver = webdriver.Chrome(options=options, executable_path=r"F:\Python_Module\chromedriver.exe")
driver.get('https://hk.dictionary.search.yahoo.com/search?p='+ word)
Pronunciation = driver.find_element_by_class_name(" fz-14").text
Meaning = driver.find_element_by_xpath("//div[@class='compList mb-25 ml-25 p-rel']//ul").get_attribute('innerHTML')
SoundURL = driver.find_element_by_xpath("//div[@class='compText ml-10 d-ib']//span[contains(@class,'d-ib dict-sound va-mid audio-0')]").get_attribute('innerHTML')
print("Print Function started")
print("begin pronunciation")
print(Pronunciation)
print("end pronunciation")
print("begin Meaning")
print(Meaning)
print("end Meaning")
print("begin sound")
print(SoundURL)
print("end sound")
As shown in the screencap, I would like to extract the following element:
<audio src="https://s.yimg.com/bg/dict/ox/mp3/v1/real@_us_2.mp3" xpath="1"></audio>
1 Answer
The problem is that for 1-2 seconds, the span element is present in the DOM, but its audio child element hasn't been injected yet. You can verify this by adding a time.sleep(3) before grabbing your SoundURL variable.
How you solve this problem in your script depends on your requirements. There are basically three options:
- time.sleep() - simple but inefficient
- selenium implicit wait
- selenium explicit wait - more complicated to set up but efficient
If you want to learn about Selenium waits, refer here: link
With a wait strategy, you'll probably want to find the audio element itself rather than getting it through the containing span element. Here's an example along those lines (using implicit wait):
driver.implicitly_wait(3)
sound_url = driver.find_element_by_tag_name('audio').get_attribute('src')
# sound_url now contains 'https://s.yimg.com/bg/dict/ox/mp3/v1/real@_us_2.mp3'
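For comparison, here is a rough sketch of the polling idea behind an explicit wait, written as a plain Python helper. The name wait_for_audio_srcs and its timeout and poll parameters are made up for illustration and are not part of Selenium. It assumes only that the driver exposes find_elements(by, value), passing the locator strategy as the string "tag name" (the value of By.TAG_NAME). Selenium's own WebDriverWait is the built-in way to do this, so treat the following as an illustration of the mechanism rather than a replacement.

```python
import time


def wait_for_audio_srcs(driver, timeout=10.0, poll=0.5):
    """Poll the page until at least one <audio> element has a non-empty src.

    Hypothetical helper for illustration; returns the list of all src values found.
    """
    deadline = time.monotonic() + timeout
    while True:
        # "tag name" is the string value of selenium's By.TAG_NAME constant
        elems = driver.find_elements("tag name", "audio")
        srcs = [e.get_attribute("src") for e in elems]
        srcs = [s for s in srcs if s]  # drop elements whose src is still empty
        if srcs:
            return srcs
        if time.monotonic() >= deadline:
            raise TimeoutError("no <audio> element with a src appeared in time")
        time.sleep(poll)


# Usage with the driver from the question (not run here, needs a browser):
#   srcs = wait_for_audio_srcs(driver, timeout=10)
#   print(srcs[0])
```

Because the helper returns every src it finds, a page with more than one audio element yields a list, and the individual URLs can be picked out by index.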
-
Thanks. I tried your method, but after a 10-second delay, the element I get using the XPath of the span class is the following: <selenium.webdriver.remote.webelement.WebElement (session="0ceb1228ebde4600e44bb05202a7f2f0", element="0.2536322356105647-3")> – KCT, Feb 6, 2019 at 1:45
-
@KCT Is that not the element you want? I can't tell which element that is without looking at its attributes. – Mike B, Feb 6, 2019 at 1:51
-
No. That's not the element I want. The element I want is the audio element whose src contains the mp3 URL, as circled in my screencap. Even with the delay, the element I get still does not contain any mp3 link, as it should. – KCT, Feb 6, 2019 at 12:39
-
Okay. I tried again using your code (by tag name) and got the URL. However, the same page has two elements with the tag name 'audio', so the tag name is not unique. How do I get the second URL? Thanks again. – KCT, Feb 6, 2019 at 14:21
-
@KCT you can use a CSS selector or XPath to find the 2nd one specifically, or you can use the same approach as above but change element to elements so it returns a list of all matching elements, e.g. audio_elems = driver.find_elements_by_tag_name('audio'), then to get the 2nd URL, do print(audio_elems[1].get_attribute('src')) – Mike B, Feb 6, 2019 at 22:01