YouTube page scraping using Jsoup

Question 1

I am trying to scrape a YouTube video page to get the video's metadata. I am using this YouTube page as an example. You can find the HTML contents of that page over here (I have removed some irrelevant JavaScript and kept the relevant HTML). I am using Jsoup (a Java HTML parser) for this. I am getting the content I want, but I would like to know whether this is the right way to do it.

public VideoData getVideoData(String videoUrl) throws IOException {
    Document doc = Jsoup.connect(videoUrl).header("User-Agent", "Chrome").get();
    Element body = doc.body();
    String videoThumbnail = body.getElementsByAttributeValue("itemprop", "thumbnailUrl").get(0).attr("href");
    String videoEmbedUrl = body.getElementsByAttributeValue("itemprop", "embedURL").get(0).attr("href");
    String videoTitle = body.getElementById("eow-title").attr("title");
    String userLink = body.getElementById("watch7-user-header").getElementsByAttributeValue("class", "yt-user-photo yt-uix-sessionlink spf-link").attr("href");
    String userPhoto = body.getElementById("watch7-user-header").getElementsByTag("img").attr("data-thumb");
    String channelLink = body.getElementById("watch7-user-header").getElementsByClass("yt-user-info").get(0).child(0).attr("href");
    String channelName = body.getElementById("watch7-user-header").getElementsByClass("yt-user-info").get(0).child(0).wholeText();
    boolean isChannelVerified;
    try {
        isChannelVerified = body.getElementById("watch7-user-header").getElementsByClass("yt-user-info").get(0).child(1).attr("aria-label").equalsIgnoreCase("Verified") ? true : false;
    } catch (Exception e) {
        isChannelVerified = false;
    }
    String noOfSubs = body.getElementsByClass("yt-subscription-button-subscriber-count-branded-horizontal yt-subscriber-count").attr("title");
    String viewCount = body.getElementsByClass("watch-view-count").text();
    String noOfLikes = body.getElementsByAttributeValue("title", "I like this").get(0).text();
    String noOfDislikes = body.getElementsByAttributeValue("title", "I dislike this").get(0).text();
    String publishedOn = body.getElementById("watch-uploader-info").text().replace("Published on ", "");
    String description = body.getElementById("watch-description-text").children().text();
    boolean isFamilyFriendly = body.getElementsByAttributeValue("itemprop", "isFamilyFriendly").attr("content").equalsIgnoreCase("True") ? true : false;
    String genre = body.getElementsByAttributeValue("itemprop", "genre").attr("content");
    VideoData videoData = new VideoData(videoThumbnail, videoEmbedUrl, videoTitle, userLink, userPhoto, channelLink, channelName, isChannelVerified, noOfSubs, viewCount, noOfLikes, noOfDislikes, publishedOn, description, isFamilyFriendly, genre);
    return videoData;
}

Question 2

You need to define "right way". The only problem I see is that your code is brittle: it will break as soon as the website changes a tiny bit (a class name change, for example). So you need to program defensively and check more things instead of assuming the structure will never change. Also, you shouldn't repeat string literals like "watch7-user-header"; define them once as constants and use those everywhere. This will simplify maintenance. Finally, YouTube has a dedicated API for this kind of retrieval; check whether it suits you, since it will be more robust than HTML scraping.
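A minimal sketch of those two suggestions (named constants plus an explicit check), using only the JDK so it compiles on its own. The names `YouTubeSelectors` and `Scrape.requirePresent` are hypothetical and not part of Jsoup; the IDs and class name are copied from the question's code. The check matters because Jsoup's `getElementById` returns `null` when nothing matches.

```java
/** Hypothetical holder for the IDs and class names used in the question's code. */
final class YouTubeSelectors {
    static final String USER_HEADER_ID = "watch7-user-header";
    static final String TITLE_ID = "eow-title";
    static final String VIEW_COUNT_CLASS = "watch-view-count";

    private YouTubeSelectors() {}
}

/** Hypothetical helper: fail with a clear message instead of a bare NullPointerException. */
final class Scrape {
    private Scrape() {}

    static <T> T requirePresent(T value, String name) {
        if (value == null) {
            throw new IllegalStateException(
                    "Element not found (page layout changed?): " + name);
        }
        return value;
    }
}

// Possible use with Jsoup (shown as a comment so the sketch has no dependency):
// Element header = Scrape.requirePresent(
//         body.getElementById(YouTubeSelectors.USER_HEADER_ID),
//         YouTubeSelectors.USER_HEADER_ID);
```

With this in place, a renamed ID surfaces as one descriptive exception, and the fix is a one-line change in `YouTubeSelectors`.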

Question 3

@PatrickMevzek You said, "it will break as soon as the website changes a tiny bit". This is exactly what I am afraid of. I don't want to use the YouTube API. I was particularly interested in scraping and chose YouTube for it. I wanted to know the "right way" to scrape the web page so that my program is less likely to break when the website modifies its HTML source a little. Can you give me some suggestions on how to lower the probability of breakage?

Question 4

You are at the mercy of the website whatever you do, up to and including it banning you or imposing CAPTCHAs. I have only two generic ideas in mind: 1) do not repeat the same IDs and class names in your code; define them as constants. 2) Use more relative paths instead of starting from the root (body) each time (depending on your library, have a look at things like XPath or CSS selectors). This should lower (but not eliminate) the number of changes needed when the website changes. Your try/catch structure is probably something to apply to each lookup.
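One way to apply the try/catch idea to every lookup without repeating the block is a small wrapper, sketched here with JDK classes only. `Fields.orDefault` is a hypothetical helper, not a Jsoup API. It catches `RuntimeException`, which covers the `NullPointerException` and `IndexOutOfBoundsException` that the question's chained calls throw when an element is missing.

```java
import java.util.function.Supplier;

/** Hypothetical helper that isolates each field extraction from the others. */
final class Fields {
    private Fields() {}

    /** Runs one extraction; returns the fallback if it throws or yields null. */
    static <T> T orDefault(Supplier<T> extractor, T fallback) {
        try {
            T value = extractor.get();
            return value != null ? value : fallback;
        } catch (RuntimeException e) {
            // A missing element only costs this one field, not the whole scrape.
            return fallback;
        }
    }
}

// Possible use with Jsoup (shown as a comment so the sketch has no dependency):
// String videoTitle = Fields.orDefault(
//         () -> body.getElementById("eow-title").attr("title"), "");
```

Whether a silent default is acceptable depends on the field; for values you cannot do without, logging or rethrowing inside the catch may be the better choice.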

Question 5

There is nothing special to say: this is a big block of procedural code. You could improve it by keeping references to elements and navigating within them instead of re-reading them from the root:

String userLink = body.getElementById("watch7-user-header").getElementsByAttributeValue("class", "yt-user-photo yt-uix-sessionlink spf-link").attr("href");
String userPhoto = body.getElementById("watch7-user-header").getElementsByTag("img").attr("data-thumb");
// can be rewritten as:
Element user = body.getElementById("watch7-user-header");
String userLink = user.getElementsByAttributeValue("class", "yt-user-photo yt-uix-sessionlink spf-link").attr("href");
String userPhoto = user.getElementsByTag("img").attr("data-thumb");

If you want to restructure the parsing itself, you can introduce a parsing object (https://www.javacodegeeks.com/2018/03/dont-parse-use-parsing-objects.html).

Question 6

Thanks! I will definitely incorporate the changes you've mentioned. But, I want to know if this is the right way to scrape a webpage. I want to wait for some more answers before accepting yours.

Question 7

Also, will changing the code as you suggested improve performance, or does it only improve the code structure?

Question 8

I do not think there is a "right" way to scrape a webpage. There is no "right" way to do things, only "wrong" ones. Again, you can improve the readability of your code by introducing some intermediate classes that encapsulate the scraping details; you can learn more about that by searching for "page object". You may get a performance improvement if you navigate inside elements instead of re-evaluating them from the root every time. But I am not a Jsoup expert.
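As a rough illustration of such an intermediate class, here is a sketch of a page object that hides the selectors and the text clean-up behind named methods. `WatchPage` and its `textOf` lookup function are hypothetical; the selectors are CSS forms of the class and ID used in the question. The lookup is injected so the sketch compiles without Jsoup and can be exercised with fake data.

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical page object: callers ask for fields, not for selectors. */
final class WatchPage {
    // CSS selectors built from the class and ID in the question's code.
    private static final String VIEW_COUNT = ".watch-view-count";
    private static final String UPLOADER_INFO = "#watch-uploader-info";

    // Maps a CSS selector to the text of the matched element, if any.
    private final Function<String, Optional<String>> textOf;

    WatchPage(Function<String, Optional<String>> textOf) {
        this.textOf = textOf;
    }

    String viewCount() {
        return textOf.apply(VIEW_COUNT).orElse("");
    }

    String publishedOn() {
        // The prefix stripping from the question now lives in one place.
        return textOf.apply(UPLOADER_INFO)
                .map(text -> text.replace("Published on ", ""))
                .orElse("");
    }
}

// Possible wiring with Jsoup (shown as a comment so the sketch has no dependency):
// WatchPage page = new WatchPage(
//         css -> Optional.ofNullable(doc.selectFirst(css)).map(Element::text));
```

If the markup changes, only `WatchPage` needs editing, and the class can be unit-tested by passing a map-backed lookup instead of a live document.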

Question 9

If the answer was helpful enough, could you consider marking it as accepted?

Accepted answer by gervais.b · 2018-03-06 07:17:23Z

