I have a file containing lines with this structure:
var marker25 = createMarker(point, '<div id="infowindow" style="white-space: nowrap;"><h3>Katz\'s Deli</h3>205 E. Houston Street, Manhattan, New York City,<br />New York, USA<br /><br /> <a href="/When_Harry_Met_Sally/filming_locations"><img style="margin-right: 5px; float: left; width: 100px; height: 150px" src="/images/posters/64-title.jpg" /></a></div>'); var marker26 = createMarker(point, '<div id="infowindow" style="white-space: nowrap;"><h3>Rockerfeller Roof Gardens</h3>5th Avenue between 49th & 50th Streets, Manhattan, New York City,<br />New York, USA<br /><br /> <a href="/Spider-man/filming_locations"><img style="margin-right: 5px; float: left; width: 100px; height: 150px" src="/images/posters/8-title.jpg" /></a></div>');
I want to quickly turn them into result lines of this type:
Katz\'s Deli 205 E. Houston Street, Manhattan, New York City,New York, USA,When_Harry_Met_Sally
Rockerfeller Roof Gardens,5th Avenue between 49th & 50th Streets,,New York, USA, Spider-man
To reiterate, I want to remove all HTML/JS strings to keep only addresses, country and so on.
What I have done so far is:
grep 'createMarker' New_York_Filming_Locations.php
cut -d'<' -f1,2,3,4,5,6,7,8
sed 's/<br \/>/,/g;s/<h3>/,/g;s/<\/h3>/,/g'
cut -d',' -f3,4,5,7,8,9,11,12
sed 's/<a href="\///g;s/\/filming_locations">//g'
cut -d',' -f1,2,4,5,6,7
It is all run as a single pipeline, but for clarity I have put each command on a separate line.
Is this the best way to do it? It is a small file, so I do not care about performance.
The sed syntax can be hard to understand. Do you guys know another way to do it using bash?
- Bash, sed and regexps are notoriously not suited for parsing HTML. Use a specialized library in another language instead. – gniourf_gniourf, Dec 25, 2012 at 20:37
- @gniourf_gniourf, I guess for parsing PHP it is even less suitable. But as I understand it, this is expected to be fragile. – ony, Dec 25, 2012 at 20:44
- It is just to parse a really small file quickly, without having to use any library or anything. I totally know that it is not suitable, but it is possible. I am interested in the how, not the why :) – Jeremy D, Dec 25, 2012 at 23:11
1 Answer
Consider removing tags with
sed 's/<[^>]*>/ /g'
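To show how that tag-stripping expression could replace the chain of cut calls, here is a minimal sketch. It assumes one createMarker call per input line (split the file first if several share a line) and that every link has the form /Film_Name/filming_locations. The function name clean_marker and the choice to put commas between all fields are my own, not something from the question. It is still a fragile regex approach, as the comments point out.

```bash
#!/usr/bin/env bash

# One sample marker line, copied from the question.
IFS= read -r sample <<'EOF'
var marker25 = createMarker(point, '<div id="infowindow" style="white-space: nowrap;"><h3>Katz\'s Deli</h3>205 E. Houston Street, Manhattan, New York City,<br />New York, USA<br /><br /> <a href="/When_Harry_Met_Sally/filming_locations"><img style="margin-right: 5px; float: left; width: 100px; height: 150px" src="/images/posters/64-title.jpg" /></a></div>');
EOF

# clean_marker (hypothetical helper): reads marker lines on stdin,
# writes one comma-separated record per line.
clean_marker() {
  sed -e "s/^.*createMarker(point, '//" \
      -e "s/');[[:space:]]*$//" \
      -e 's|<a href="/\([^/"]*\)/filming_locations">|,\1|' \
      -e 's|</h3>|,|' \
      -e 's/<[^>]*>//g' \
      -e 's/[[:space:]]*,[[:space:]]*/,/g'
  # 1-2: drop the JavaScript wrapper around the HTML string
  # 3:   keep the film name from the link, prefixed with a comma
  # 4:   turn the end of the title into a field separator
  # 5:   remove every remaining tag (the expression from this answer)
  # 6:   trim whitespace around commas
}

result=$(printf '%s\n' "$sample" | clean_marker)
printf '%s\n' "$result"
# Katz\'s Deli,205 E. Houston Street,Manhattan,New York City,New York,USA,When_Harry_Met_Sally
```

On the real file you would run something like `grep 'createMarker' New_York_Filming_Locations.php | clean_marker`. The film name has to be captured before the generic tag removal, because it only exists inside the href attribute.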