1
\$\begingroup\$

I have a file containing lines with this structure:

var marker25 = createMarker(point, '<div id="infowindow" style="white-space: nowrap;"><h3>Katz\'s Deli</h3>205 E. Houston Street, Manhattan, New York City,<br />New York, USA<br /><br /> <a href="/When_Harry_Met_Sally/filming_locations"><img style="margin-right: 5px; float: left; width: 100px; height: 150px" src="/images/posters/64-title.jpg" /></a></div>');
var marker26 = createMarker(point, '<div id="infowindow" style="white-space: nowrap;"><h3>Rockerfeller Roof Gardens</h3>5th Avenue between 49th &amp; 50th Streets, Manhattan, New York City,<br />New York, USA<br /><br /> <a href="/Spider-man/filming_locations"><img style="margin-right: 5px; float: left; width: 100px; height: 150px" src="/images/posters/8-title.jpg" /></a></div>');

I want to quickly turn each of them into a line like this:

Katz\'s Deli 205 E. Houston Street, Manhattan, New York City,New York, USA,When_Harry_Met_Sally
Rockerfeller Roof Gardens,5th Avenue between 49th &amp; 50th Streets,,New York, USA, Spider-man

To reiterate, I want to strip all the HTML/JS markup and keep only the name, address, country and film title.

Here is what I have so far:

grep 'createMarker' New_York_Filming_Locations.php 
cut -d'<' -f1,2,3,4,5,6,7,8 
sed 's/<br \/>/,/g;s/<h3>/,/g;s/<\/h3>/,/g' 
cut -d',' -f3,4,5,7,8,9,11,12 
sed 's/<a href="\///g;s/\/filming_locations">//g' 
cut -d',' -f1,2,4,5,6,7

These commands run as a single pipeline, but for clarity I have put each one on a separate line.
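For reference, the six commands joined back together would look roughly like the sketch below. The function name `extract_locations` is only an illustrative wrapper, and it reads from standard input instead of naming the PHP file directly; the commands themselves are unchanged.

```shell
#!/usr/bin/env bash
# The commands from the question chained with pipes.
# extract_locations is a hypothetical name used only for this sketch;
# it filters stdin so it can be fed a file or a test string.
extract_locations() {
  grep 'createMarker' \
    | cut -d'<' -f1,2,3,4,5,6,7,8 \
    | sed 's/<br \/>/,/g;s/<h3>/,/g;s/<\/h3>/,/g' \
    | cut -d',' -f3,4,5,7,8,9,11,12 \
    | sed 's/<a href="\///g;s/\/filming_locations">//g' \
    | cut -d',' -f1,2,4,5,6,7
}

# Usage:
#   extract_locations < New_York_Filming_Locations.php
```

The result follows the comma-separated shape of the second sample output line above. Because every `cut` relies on fixed field positions, an extra comma or tag in a line would shift the fields.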

Is this the best way to do it? It is a small file, so I do not care about performance.

The sed syntax can be hard to understand. Is there another way to do this in bash?

rolfl
asked Dec 25, 2012 at 10:46
\$\endgroup\$
3
  • 2
    \$\begingroup\$ Bash, sed and regexps are notoriously ill-suited to parsing HTML. Use a specialized library in another language instead. \$\endgroup\$ Commented Dec 25, 2012 at 20:37
  • \$\begingroup\$ @gniourf_gniourf, I guess they are even less suitable for parsing PHP. But as I understand it, this is expected to be fragile. \$\endgroup\$ Commented Dec 25, 2012 at 20:44
  • \$\begingroup\$ It is just to parse a really small file quickly, without having to use any library or anything. I totally know that it is not suitable, but it is possible. I am interested in the how, not the why :) \$\endgroup\$ Commented Dec 25, 2012 at 23:11

1 Answer

2
\$\begingroup\$

Consider removing tags with

sed 's/<[^>]*>/ /g'
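As a rough illustration of how that substitution behaves, here is a minimal sketch. The `strip_tags` name and the two whitespace clean-up expressions are additions for this example, not part of the suggestion itself, and the sample fragment is modelled on the question's data.

```shell
#!/usr/bin/env bash
# strip_tags is a hypothetical helper name used only in this sketch.
strip_tags() {
  # 1) replace every <...> tag with a space (the substitution suggested above)
  # 2) collapse runs of whitespace into one space (added for tidiness)
  # 3) trim a leading and a trailing space (added for tidiness)
  sed -e 's/<[^>]*>/ /g' \
      -e 's/[[:space:]]\{1,\}/ /g' \
      -e 's/^ //; s/ $//'
}

# Sample fragment modelled on the lines in the question.
printf '%s\n' '<h3>Rockerfeller Roof Gardens</h3>5th Avenue<br />New York, USA<br />' \
  | strip_tags
```

Note that the film title sits inside the `href` attribute of the `<a>` tag, so it disappears along with the tag; it would have to be pulled out before this step if it is needed in the output.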
answered Dec 25, 2012 at 20:40
\$\endgroup\$
