Parsing of a file using sed

\$\begingroup\$

I have a file containing lines with this structure:

var marker25 = createMarker(point, '<div id="infowindow" style="white-space: nowrap;"><h3>Katz\'s Deli</h3>205 E. Houston Street, Manhattan, New York City,<br />New York, USA<br /><br /> <a href="/When_Harry_Met_Sally/filming_locations"><img style="margin-right: 5px; float: left; width: 100px; height: 150px" src="/images/posters/64-title.jpg" /></a></div>');
var marker26 = createMarker(point, '<div id="infowindow" style="white-space: nowrap;"><h3>Rockerfeller Roof Gardens</h3>5th Avenue between 49th &amp; 50th Streets, Manhattan, New York City,<br />New York, USA<br /><br /> <a href="/Spider-man/filming_locations"><img style="margin-right: 5px; float: left; width: 100px; height: 150px" src="/images/posters/8-title.jpg" /></a></div>');

I want to quickly turn these into result lines of this type:

Katz\'s Deli 205 E. Houston Street, Manhattan, New York City,New York, USA,When_Harry_Met_Sally
Rockerfeller Roof Gardens,5th Avenue between 49th &amp; 50th Streets,,New York, USA, Spider-man

To reiterate, I want to remove all HTML/JS strings to keep only addresses, country and so on.

What I have done so far is:

grep 'createMarker' New_York_Filming_Locations.php 
cut -d'<' -f1,2,3,4,5,6,7,8 
sed 's/<br \/>/,/g;s/<h3>/,/g;s/<\/h3>/,/g' 
cut -d',' -f3,4,5,7,8,9,11,12 
sed 's/<a href="\///g;s/\/filming_locations">//g' 
cut -d',' -f1,2,4,5,6,7

It is done in a pipe, but for the clarity of the question, I put each command on a separate line.
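For reference, here is a sketch of the same commands joined into one pipeline. The function name `extract_locations` and passing the file name as an argument are assumptions added for illustration; the commands themselves are unchanged.

```shell
# Hypothetical wrapper (the name is an assumption): runs the six commands
# listed above as a single pipeline on the file given as $1.
extract_locations() {
  grep 'createMarker' "$1" |
    cut -d'<' -f1,2,3,4,5,6,7,8 |                        # drop everything from the <img> tag onward
    sed 's/<br \/>/,/g;s/<h3>/,/g;s/<\/h3>/,/g' |        # turn <br />, <h3> and </h3> into commas
    cut -d',' -f3,4,5,7,8,9,11,12 |                      # drop the JavaScript/HTML prefix fields
    sed 's/<a href="\///g;s/\/filming_locations">//g' |  # keep only the film name from the link
    cut -d',' -f1,2,4,5,6,7
}

# Usage: extract_locations New_York_Filming_Locations.php
```

The field numbers passed to `cut` depend on every line containing the same number of `<` and `,` characters, so an address with one extra comma would shift the columns.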

Is this the best way to do it? It is a small file, so I do not care about performance.

The sed syntax can be hard to understand. Do you guys know another way to do it using bash?

edited Mar 5, 2014 at 18:42

rolfl

asked Dec 25, 2012 at 10:46

Jeremy D

\$\endgroup\$

Bash, sed and regexps are notoriously ill-suited to parsing HTML. Use a specialized library in another language instead.

– gniourf_gniourf
Commented Dec 25, 2012 at 20:37

@gniourf_gniourf, I guess it is even less suitable for parsing PHP. But as I understand it, this is expected to be fragile.

– ony
Commented Dec 25, 2012 at 20:44

It is just to parse a really small file quickly, without having to use any library or anything. I totally know that it is not suitable, but it is possible. I am interested in the how, not the why :)

– Jeremy D
Commented Dec 25, 2012 at 23:11

1 Answer

\$\begingroup\$

Consider removing the tags with:

sed 's/<[^>]*>/ /g'
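As a rough sketch of how that substitution behaves on a fragment of the question's data: the helper name `strip_tags` and the two whitespace clean-up expressions are assumptions added for illustration, not part of the suggestion itself.

```shell
# Hypothetical helper (the name is an assumption): replace every <...> tag
# with a space, then squeeze runs of spaces and trim both ends of the line.
strip_tags() {
  sed -e 's/<[^>]*>/ /g' \
      -e 's/  */ /g' \
      -e 's/^ //' \
      -e 's/ $//'
}

printf '%s\n' '<h3>Rockerfeller Roof Gardens</h3>5th Avenue between 49th &amp; 50th Streets, Manhattan, New York City,<br />New York, USA<br /><br />' | strip_tags
# prints: Rockerfeller Roof Gardens 5th Avenue between 49th &amp; 50th Streets, Manhattan, New York City, New York, USA
```

Two caveats apply to the lines in the question. The film name sits inside the `href` attribute of the `<a>` tag, so this substitution discards it along with the tag. The JavaScript around the HTML (`var marker26 = createMarker(point, '` and the closing `');`) is not a tag, so it survives and would still need to be cut away.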

answered Dec 25, 2012 at 20:40

ony

\$\endgroup\$
