I am trying to scrape a YouTube video page to get the video's metadata, using this YouTube page as an example. You can find the HTML contents of that page over here (I have removed some not-so-useful JavaScript and kept the relevant HTML). I am using Jsoup (a Java HTML parser) for this. I am getting the content I want, but I would like to know whether this is the right way to do it.
public VideoData getVideoData(String videoUrl) throws IOException {
Document doc = Jsoup.connect(videoUrl).header("User-Agent", "Chrome").get();
Element body = doc.body();
String videoThumbnail = body.getElementsByAttributeValue("itemprop", "thumbnailUrl").get(0).attr("href");
String videoEmbedUrl = body.getElementsByAttributeValue("itemprop", "embedURL").get(0).attr("href");
String videoTitle = body.getElementById("eow-title").attr("title");
String userLink = body.getElementById("watch7-user-header").getElementsByAttributeValue("class", "yt-user-photo yt-uix-sessionlink spf-link").attr("href");
String userPhoto = body.getElementById("watch7-user-header").getElementsByTag("img").attr("data-thumb");
String channelLink = body.getElementById("watch7-user-header").getElementsByClass("yt-user-info").get(0).child(0).attr("href");
String channelName = body.getElementById("watch7-user-header").getElementsByClass("yt-user-info").get(0).child(0).wholeText();
boolean isChannelVerified;
try {
isChannelVerified = body.getElementById("watch7-user-header").getElementsByClass("yt-user-info").get(0).child(1).attr("aria-label").equalsIgnoreCase("Verified") ? true : false;
} catch (Exception e) {
isChannelVerified = false;
}
String noOfSubs = body.getElementsByClass("yt-subscription-button-subscriber-count-branded-horizontal yt-subscriber-count").attr("title");
String viewCount = body.getElementsByClass("watch-view-count").text();
String noOfLikes = body.getElementsByAttributeValue("title", "I like this").get(0).text();
String noOfDislikes = body.getElementsByAttributeValue("title", "I dislike this").get(0).text();
String publishedOn = body.getElementById("watch-uploader-info").text().replace("Published on ", "");
String description = body.getElementById("watch-description-text").children().text();
boolean isFamilyFriendly = body.getElementsByAttributeValue("itemprop", "isFamilyFriendly").attr("content").equalsIgnoreCase("True") ? true : false;
String genre = body.getElementsByAttributeValue("itemprop", "genre").attr("content");
VideoData videoData=new VideoData(videoThumbnail,videoEmbedUrl,videoTitle,userLink,userPhoto,channelLink,channelName,isChannelVerified,noOfSubs,viewCount,noOfLikes,noOfDislikes,publishedOn,description,isFamilyFriendly,genre);
return videoData;
}
-
You need to define "right way". The only problem I see is that your code is brittle: it will break as soon as the website changes a tiny bit (a class name change, for example). So you need to program defensively and check more things instead of assuming the structure will never change. Also, you shouldn't repeat names like "watch7-user-header"; define them as constants and use those everywhere. This will simplify maintenance. Finally, YouTube has a dedicated API for this kind of retrieval. See whether it suits you, as it will be more robust than HTML scraping. – Patrick Mevzek, Mar 6, 2018 at 14:57
-
@PatrickMevzek You said, "it will break as soon as the website changes a tiny bit". This is exactly what I am afraid of. I don't want to use the YouTube API. I was specifically interested in scraping and chose YouTube for it. I wanted to know the "right way" to scrape the web page so that my program is less likely to break when the website modifies its HTML a little. Can you give me some suggestions on how to lower the probability of breakage? – Nandan Desai, Mar 6, 2018 at 17:12
-
You are at the mercy of the website whatever you do, up to it banning you or imposing captchas. I have only two generic ideas in mind: 1) do not repeat the same ID/class names in your code; put them aside as constants; 2) use more relative paths instead of starting from the root (body) each time (depending on your library, have a look at things like XPath or CSS selectors). This should lower (but not eliminate) the number of changes needed if the website changes. Your try/catch structure is probably something to apply to each case. – Patrick Mevzek, Mar 7, 2018 at 23:03
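The two ideas from the comment above (selector names as constants, a try/catch around each lookup) could be sketched roughly as follows. This is only an illustration: `SafeExtract`, `orDefault`, and the constant names are made up for the example and are not part of Jsoup.

```java
import java.util.function.Supplier;

/** Hypothetical helper: names and defaults are illustrative, not part of Jsoup. */
final class SafeExtract {
    // Selector strings kept in one place so a markup change needs a single edit.
    static final String USER_HEADER_ID = "watch7-user-header";
    static final String TITLE_ID = "eow-title";

    private SafeExtract() {}

    /**
     * Runs one extraction step and returns the fallback if the step throws
     * (for example NullPointerException or IndexOutOfBoundsException when an
     * element is missing) or yields null.
     */
    static String orDefault(Supplier<String> step, String fallback) {
        try {
            String value = step.get();
            return value != null ? value : fallback;
        } catch (RuntimeException e) {
            return fallback;
        }
    }
}
```

With this in place, a single lookup from the question might become `SafeExtract.orDefault(() -> body.getElementById(SafeExtract.TITLE_ID).attr("title"), "")`, so one missing element costs one empty field rather than the whole method call.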
1 Answer
There is not much to say: this is one big block of procedural code. You can improve it by keeping references to elements and navigating within them instead of looking them up again from the root:
String userLink = body.getElementById("watch7-user-header").getElementsByAttributeValue("class", "yt-user-photo yt-uix-sessionlink spf-link").attr("href");
String userPhoto = body.getElementById("watch7-user-header").getElementsByTag("img").attr("data-thumb");
// Can be
Element user = body.getElementById("watch7-user-header");
String userLink = user.getElementsByAttributeValue("class", "yt-user-photo yt-uix-sessionlink spf-link").attr("href");
String userPhoto = user.getElementsByTag("img").attr("data-thumb");
If you want to change the way you parse the page, you can introduce a parsing object (https://www.javacodegeeks.com/2018/03/dont-parse-use-parsing-objects.html).
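As a rough sketch of what such a parsing object might look like: instead of extracting everything up front and passing sixteen values to a constructor, the object keeps a handle on the source and reads each value on demand. Everything here is hypothetical; `ParsedVideo` and its `lookup` function are stand-ins invented for the example, and in the real code the lookup would wrap Jsoup queries on the parsed document.

```java
import java.util.Optional;
import java.util.function.Function;

/**
 * Hypothetical parsing object: it holds a lookup into the source and
 * extracts each value when asked, rather than copying all fields eagerly.
 * The lookup function is a stand-in for the real Jsoup queries.
 */
final class ParsedVideo {
    private final Function<String, Optional<String>> lookup;

    ParsedVideo(Function<String, Optional<String>> lookup) {
        this.lookup = lookup;
    }

    /** Title, or an empty string when the page does not provide one. */
    String title() {
        return lookup.apply("title").orElse("");
    }

    /** Genre, or an empty string when the page does not provide one. */
    String genre() {
        return lookup.apply("genre").orElse("");
    }

    /** True only when the value is present and equals "True", ignoring case. */
    boolean isFamilyFriendly() {
        return lookup.apply("isFamilyFriendly")
                .map("True"::equalsIgnoreCase)
                .orElse(false);
    }
}
```

Because the lookup is injected, the object can be exercised against a plain map in a unit test, and the scraping details stay behind a handful of small methods.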
-
Thanks! I will definitely incorporate the changes you've mentioned. But I want to know if this is the right way to scrape a webpage. I want to wait for some more answers before accepting yours. – Nandan Desai, Mar 6, 2018 at 14:01
-
Also, will there be any performance improvement from changing the code as you suggested, or is it just an improvement of the code structure? – Nandan Desai, Mar 6, 2018 at 14:11
-
I do not think there is a "right" way to scrape a webpage. There is no "right" way to do things, only "wrong" ones. Again, you can improve the readability of your code by introducing some intermediate classes that encapsulate the scraping details; you can learn more about that by searching for "page object". You may get a performance improvement if you navigate inside elements instead of re-evaluating them every time, but I am not a Jsoup expert. – gervais.b, Mar 6, 2018 at 14:16
-
If the answer is helpful enough, could you consider marking it as accepted? – gervais.b, Mar 8, 2018 at 21:38