Web based application - HTML parsing

Question 1

I'm working on a web-based application that loads the HTML content of a URL through a call to http://www.whateverorigin.org/. This avoids violating the same-origin policy.

var url = 'http://' + document.getElementById("urlText").value;
$.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent(url) + '&callback=?', function(data){
    // data.contents holds the fetched page's HTML as a string
    var doc = new DOMParser().parseFromString(data.contents, 'text/html');
});

If I need to extract the meaningful, visible text from this HTML string, is there a way to do it similar to what BeautifulSoup does in Python? I'm a beginner in JavaScript.

Question 2

Use jQuery to find and iterate over the appropriate elements. Then you can decide what to print out, for example the text of visible items. Here is a jsfiddle with a working script example: http://jsfiddle.net/w147o9f6/1/

<body>
 <div id="outputTexts">OUTPUT:</div>
</body>

javascript:

var parser = new DOMParser();
var doc;
var meaningfulTexts = [];
$.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent('https://www.facebook.com') + '&callback=?', function(data){
    // Parse the fetched HTML string into a separate document
    doc = parser.parseFromString(data.contents, "text/html");
    var ELMS = $(doc).find("div, p, a, span");
    ELMS.each(function(index, element) {
        // Keep elements that are not hidden inline and that contain some text
        if (element.style.display !== "none" && $(element).text() !== "") {
            $("#outputTexts").append('<br>' + element.tagName + ' - ' + $(element).text());
            meaningfulTexts.push($(element).text());
        }
    });
});
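
The check on `element.style.display` in the script above reads only the element's inline `style` attribute, so an element hidden by a stylesheet rule still passes it. As a minimal sketch, assuming the goal is only to catch elements hidden by their own attributes, a helper could look like the following. The name `isHiddenInline` is hypothetical and not part of the answer; it relies only on the standard `hasAttribute` and `getAttribute` methods.

```javascript
// Hypothetical helper (not from the answer): reports whether an element is
// hidden by its own attributes. It looks only at the `hidden` attribute and
// the inline `style` attribute; stylesheet rules are NOT considered.
function isHiddenInline(element) {
  if (element.hasAttribute('hidden')) {
    return true;
  }
  var style = (element.getAttribute('style') || '').toLowerCase();
  // Match "display: none" or "visibility: hidden" as a whole declaration.
  return /(^|;)\s*display\s*:\s*none\b/.test(style) ||
         /(^|;)\s*visibility\s*:\s*hidden\b/.test(style);
}

// Possible use inside the .each() callback shown above:
//   if (!isHiddenInline(element) && $(element).text() !== "") { ... }
```

This is still a rough filter: it cannot tell whether a parent element is hidden, so it narrows the results rather than guaranteeing visibility.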

Question 3

I see CSS styling information as part of the meaningful text. Is there a way to remove it?

Question 4

I checked my code against Facebook and some other websites and it worked very well. When I checked it against Google, it showed that CSS code (it sits inside a span tag). I don't know whether the problem is with my code or with Google's site. Is google.com the website you intend to work with?

Question 5

The web-based application will fetch the visible text from any site. I changed the selector to $(doc).find("p, a"); and this seemed to work better.
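
Narrowing the selector helps, but CSS or script source can still turn up if a `<style>` or `<script>` element is nested inside a matched element. One possible alternative, sketched here under the assumption that "meaningful text" means every non-empty text node outside those elements, is to walk the parsed tree and skip them entirely. The function name `collectVisibleText` and the list of skipped tags are illustrative choices, not something from the thread.

```javascript
// Tags whose contents are code or fallback markup rather than readable text.
// This list is an assumption; adjust it to taste.
var SKIPPED_TAGS = ['SCRIPT', 'STYLE', 'NOSCRIPT'];

// Hypothetical helper: depth-first walk over a DOM subtree that collects the
// trimmed value of every non-empty text node, ignoring SKIPPED_TAGS subtrees.
function collectVisibleText(node, out) {
  out = out || [];
  if (node.nodeType === 3) {
    // Text node
    var text = node.nodeValue.trim();
    if (text !== '') {
      out.push(text);
    }
  } else if (node.nodeType === 1 || node.nodeType === 9) {
    // Element or document node
    if (node.nodeType === 1 &&
        SKIPPED_TAGS.indexOf(node.tagName.toUpperCase()) !== -1) {
      return out;
    }
    for (var i = 0; i < node.childNodes.length; i++) {
      collectVisibleText(node.childNodes[i], out);
    }
  }
  return out;
}

// In the $.getJSON callback this might be called as:
//   var texts = collectVisibleText(doc.body);
```

Because it uses only `nodeType`, `nodeValue`, `tagName` and `childNodes`, the walk does not depend on jQuery and returns each piece of text once, in document order.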

Question 6

It looks like this is what you need? The code below fetches google.nl through whateverorigin.org and adds the result to a div. If not, please explain what more you need!

jQuery:

$(document).ready(function() {
    $.getJSON('http://whateverorigin.org/get?url=' + encodeURIComponent('http://www.google.nl') + '&callback=?', function(data){
        $('.result').html(data.contents);
    });
});

HTML:

<div class="result"></div>

Example: http://jsfiddle.net/qddekhnc/1/

Question 7

Thanks a lot, Jeffrey. I need the meaningful text information as raw strings.
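
If raw strings are the goal, an array such as `meaningfulTexts` from the earlier script usually needs some cleanup first: jQuery's `.text()` includes the text of all descendants, so nested matches (a `div` containing a `p`, for instance) repeat the same text, and page markup often carries runs of whitespace. A minimal sketch of that cleanup, assuming exact duplicates should be dropped and whitespace collapsed, could be the following; the name `toRawStrings` is hypothetical.

```javascript
// Hypothetical helper: turns an array of extracted texts into clean strings.
// It collapses whitespace runs to a single space, trims each entry, and
// drops entries that are empty or exact repeats of an earlier entry.
function toRawStrings(texts) {
  var seen = {};
  var result = [];
  texts.forEach(function (text) {
    var clean = String(text).replace(/\s+/g, ' ').trim();
    if (clean !== '' && !Object.prototype.hasOwnProperty.call(seen, clean)) {
      seen[clean] = true;
      result.push(clean);
    }
  });
  return result;
}

// Possible use once the $.getJSON callback has filled the array:
//   var rawStrings = toRawStrings(meaningfulTexts);
```

This removes only exact repeats. A parent's combined text that merely contains a child's text is kept, so collecting per text node rather than per element may still be the cleaner route.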

Noam L 1294 bronze badges · Accepted Answer · 2014-11-28 14:52:07Z
