Bash script to extract HTML comment into a Markdown file

Question 1

I learned Bash a million years ago. I just wrote this simple script used to get the first lot of HTML comments from a file, and spit it out in order to create a README.md file.

It is just. So. Ugly. I read bits and pieces over the years, and I am sure it can be improved so much...

Here we go:

#!/bin/bash
IFS=''
active='0';
cat hot-form-validator.html | while read "line";do
 echo $line | grep '\-\->' > /dev/null
 if [ $active = '1' -a $? = '0' ];then 
 exit 0;
 fi;
 suppress=0;
 echo $line | grep '^ *@' > /dev/null
 if [ $? = '0' ];then suppress='1'; fi;
 if [ $active = '1' -a $suppress = '0' ];then echo $line;fi;
 echo $line | grep "<!--" > /dev/null
 if [ $? = '0' ];then active='1'; fi;
done

Questions:

Is there a better way to do grep and then check $?? Back in the day it was the way to go, but...
Should active be a proper number rather than a string with a number? I know, it could be anything... but having a string that can be 0 or 1 just feels wrong.
Is there a better way to preserve spaces, rather than zapping IFS?
Any more pearls of wisdom, other than quitting my (short lived) career of bash scripter?

Question 2

A bug

If your html contains a backspace character e.g

<p>The special character \n is a way to include read line</p>

your code interprets it a

The special character n is a way to include read line.

to avoid this, add "-r" switch to your read line

cat hot-form-validator.html | while read -r "line";do

You want to check if your file exists before reading it

if [ -f $filename ]; then
#do something
else
#do something
fi

Question 3

A Bash loop is not the best approach. If you are doing a lot of line-by-line processing, and invoking grep a lot, then awk would be a more appropriate tool.

awk '/<!--/ { ACTIVE = 1; next }
 /-->/ { exit }
 ACTIVE { print }' < hot-form-validator.html

But line-by-line processing is not an appropriate way to parse HTML. xsltproc, for example, could do a proper job of extracting the first comment in an HTML file:

xsltproc --html first-comment.xsl hot-form-validator.html

... where first-comment.xml contains:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>
 <xsl:template match="/"><xsl:value-of select="//comment()[1]"/></xsl:template>
</xsl:stylesheet>

Question 4

I don't really want turn xsltproc a requirement to generate the documentation -- especially since I control the contents of the starting file. But, I am SOLD on AWK! What's the best resource to learn it?

Question 5

But also... what about the script in the original post? What if I really wanted to improve it?

Question 6

This tutorial seems good. Or, if you are a masochist, you could try to figure it all out from the man page.

Question 7

Note that you can get a bunch of Bash tips instantly by pasting your code on http://www.shellcheck.net/

A couple of things jump in the eye:

Too many unnecessary ; at the end of lines
Instead of grep ... > /dev/null, it would be shorter to write grep -q ...
Too many statements on a line when it would be more readable if split to multiple lines
Instead of cat file | while ..., it would be better to write while ... < file
Instead of echo ... | grep ..., it would be better to use here-strings, grep ... <<< ...
Instead of grep ... and then checking the value of $? in an if, you could move the grep inside the if
Quoting where it's not needed
Not quoting where it's needed

With the above points improved:

#!/bin/bash
IFS=
active=0
while read "line"; do
 if [ $active = 1 ] && grep -q '\-\->' <<< "$line"; then 
 exit 0
 fi
 suppress=0
 grep -q '^ *@' <<< "$line" && suppress=1
 [ $active = 1 -a $suppress = 0 ] && echo $line
 grep -q "<!--" <<< "$line" && active=1 
done < hot-form-validator.html

I rewrote using grep -q ... <<< ... only for the sake of an example of using grep inside if statements. But a better solution is to use native Bash pattern matching with [[ ... ]], for example instead of:

 grep -q '^ *@' <<< "$line" && suppress=1

This is better, as it doesn't spawn a grep process:

 [[ "$line" =~ ^\ *@ ]] && suppress=1

Tolani Tolani 2,5017 gold badges31 silver badges49 bronze badges · Answer 1 · 2016-07-18 18:03:27Z

A bug

If your html contains a backspace character e.g

<p>The special character \n is a way to include read line</p>

your code interprets it a

The special character n is a way to include read line.

to avoid this, add "-r" switch to your read line

cat hot-form-validator.html | while read -r "line";do

You want to check if your file exists before reading it

if [ -f $filename ]; then
#do something
else
#do something
fi

200_success 200_success 146k22 gold badges190 silver badges479 bronze badges · Answer 2 · 2016-07-18 19:01:32Z

A Bash loop is not the best approach. If you are doing a lot of line-by-line processing, and invoking grep a lot, then awk would be a more appropriate tool.

awk '/<!--/ { ACTIVE = 1; next }
 /-->/ { exit }
 ACTIVE { print }' < hot-form-validator.html

But line-by-line processing is not an appropriate way to parse HTML. xsltproc, for example, could do a proper job of extracting the first comment in an HTML file:

xsltproc --html first-comment.xsl hot-form-validator.html

... where first-comment.xml contains:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>
 <xsl:template match="/"><xsl:value-of select="//comment()[1]"/></xsl:template>
</xsl:stylesheet>

I don't really want turn xsltproc a requirement to generate the documentation -- especially since I control the contents of the starting file. But, I am SOLD on AWK! What's the best resource to learn it?
But also... what about the script in the original post? What if I really wanted to improve it?
This tutorial seems good. Or, if you are a masochist, you could try to figure it all out from the man page.

janos janos 113k15 gold badges154 silver badges396 bronze badges · Answer 3 · 2016-07-30 21:45:02Z

Note that you can get a bunch of Bash tips instantly by pasting your code on http://www.shellcheck.net/

A couple of things jump in the eye:

Too many unnecessary ; at the end of lines
Instead of grep ... > /dev/null, it would be shorter to write grep -q ...
Too many statements on a line when it would be more readable if split to multiple lines
Instead of cat file | while ..., it would be better to write while ... < file
Instead of echo ... | grep ..., it would be better to use here-strings, grep ... <<< ...
Instead of grep ... and then checking the value of $? in an if, you could move the grep inside the if
Quoting where it's not needed
Not quoting where it's needed

With the above points improved:

#!/bin/bash
IFS=
active=0
while read "line"; do
 if [ $active = 1 ] && grep -q '\-\->' <<< "$line"; then 
 exit 0
 fi
 suppress=0
 grep -q '^ *@' <<< "$line" && suppress=1
 [ $active = 1 -a $suppress = 0 ] && echo $line
 grep -q "<!--" <<< "$line" && active=1 
done < hot-form-validator.html

I rewrote using grep -q ... <<< ... only for the sake of an example of using grep inside if statements. But a better solution is to use native Bash pattern matching with [[ ... ]], for example instead of:

 grep -q '^ *@' <<< "$line" && suppress=1

This is better, as it doesn't spawn a grep process:

 [[ "$line" =~ ^\ *@ ]] && suppress=1

Stack Exchange Network

Bash script to extract HTML comment into a Markdown file

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Bash script to extract HTML comment into a Markdown file

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions