Friday, January 4, 2008
Getting Rid Of HTML Tags And Leaving The Comments
Hey there,
If you've ever found yourself stuck in a spot where you needed to convert an html page to straight up text (for whatever reason), today's script might be a nice addition to your tool-kit. It's been tested in ash and bash on Linux and on sh, ksh and bash for Solaris.
The main reason I wrote this script was to break down html pages (at the code level) by removing all the html tags, while leaving all the comments. I primarily prefer to do this, when I can, in the shell rather than cut-and-paste from Firefox or Internet Explorer since the cut-and-paste method has a nasty habit of removing all formatting. Well, that's not entirely true. The carriage returns do usually manage to make it through unscathed ;)
This shell script is basically a wrapper for a few sed commands. While most readers are probably somewhat familiar with sed and how to use it, I find that a lot of folks on the forums request help when it comes to using some of its more advanced functionality, like matching patterns across multiple lines. This script matches (and removes) any tags that begin with the "<" character and end with the ">" character. I use it for html, but it could easily be used (without any porting at all) on any markup language file that uses the same tagging convention.
For clarity, I split up the single-line match from the multi-line match. You'll note that the second invocation of sed is where we write our expression to work on an html tag that spills over onto the next line. We'll go into that more in a future post, but for now, note that we basically find any opening "<" character that doesn't have an ending ">" character on the same line, pull in the following line with sed's "N" command and, if the ending ">" shows up there, consume everything up to, and including, it. The simplest solution, once we've determined that we're going to cross a line boundary, is to replace the newline ("\n") with a space. So, basically, when we do the multi-line match, we're converting that multi-line entry into a single line so that the match can be made. As written, the expression only pulls in one extra line per match, so a tag spread over three or more lines would need it to loop. As I said, this subject can be a little too convoluted to go into too much detail for the purposes of this post ;)
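To see the two passes at work on their own, here is a minimal sketch that runs the same two sed expressions over a tag split across two lines. The function names (strip_inline_tags, strip_split_tags) and the sample input are made up for illustration and aren't part of htstrip. It also assumes a sed in which "[^>]" can match an embedded newline, as GNU sed does.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; htstrip itself works on files.

# Pass 1: remove tags that open and close on the same line.
strip_inline_tags() {
    sed -e 's/<[^>]*>//g'
}

# Pass 2: when a line has an unclosed "<", append the next line (N);
# if the ">" turns up there, join the two lines and remove the tag.
strip_split_tags() {
    sed -e '/<[^>]*$/{
$!N
/^[^>]*>/{
s/\n/ /
s/<[^>]*>//g
}
}'
}

# The opening tag here is split across two lines.
printf '%s\n' '<a' 'href="x">link text</a>' | strip_inline_tags | strip_split_tags
```

Feeding the three-line comment from the example through the same pipeline leaves it untouched, because no ">" appears on its first two lines.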
I've called this script htstrip (call it whatever you like) and you can invoke it like this:
host # ./htstrip filea fileb filec <-- Doesn't matter what the file names are; they don't have to be .htm files, etc.
And, as an example, this is the type of output you could expect to see:
host # cat filea.htm
<a> bob1 </a>
<a> bob2
</a>
<a> bob3
bob4
</a>
<!-- this is
a comment
-->
<a> bob5
bob6
bob7
</a>
host # ./htstrip filea.htm
host # cat filea.htm <-- Note that our original file is saved as filea.htm.old, just in case
bob1
bob2
bob3
bob4
<!-- this is
a comment
-->
bob5
bob6
bob7
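Note that, as I mentioned in the title, I've purposefully made it so that standard html comments, like the one in the example, will remain. One caveat: a comment only survives when neither its opening line nor the line right after it contains a ">"; a comment that closes on its opening line, or on the very next one, looks just like a tag to these expressions and gets stripped with the rest. If you don't want comments in your output either, a small modification to the script can make sure those go away too. Note also that we use shorthand in the "for x" loop. This post has nothing to do with that, but it's nice to know that you don't need to write "for x in BLAH" if you're using the default input (the script's command line arguments). Technically, it's a better idea to use the full form, especially if you're writing a huge script or are dealing with issues of scope, etc. ...For another day.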
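As a quick illustration of that "for x" shorthand, here is a small sketch comparing it with the full form. The two function names are made up for the demo; inside a function, "for x" walks the function's own arguments, the same way it walks the script's arguments at the top level.

```shell
#!/bin/sh
# Hypothetical demo functions, not part of htstrip.

# Shorthand: with no "in ...", the loop runs over the positional parameters.
short_form() {
    for x
    do
        echo "arg: $x"
    done
}

# Full form: the same loop, spelled out.
full_form() {
    for x in "$@"
    do
        echo "arg: $x"
    done
}

short_form filea "file b" filec
full_form filea "file b" filec
```

Both calls print the same three lines, and an argument containing a space stays in one piece either way.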
Hope this helps you out :)
Creative Commons License
This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#!/bin/sh
#
# htstrip - 2008 - Mike Golvach - eggi@comcast.net
#
# Usage: htstrip filea fileb filen
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License;
#
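# Clean up the temp files if we get interrupted (signal 9 can't be trapped, so it isn't listed)
trap 'rm -f temp temp2 temp3;exit 1' 1 2 3 15
if [ $# -lt 1 ]
then
    echo "htstrip needs to know what files you want to strip!"
    exit 1
fi
for x
do
    # Pass 1: strip tags that open and close on the same line
    sed -e 's/<[^>]*>//g' "$x" > temp
    # Pass 2: join the next line onto a line with an unclosed tag, then strip the tag
    sed -e '/<[^>]*$/{
$!N
/^[^>]*>/{
s/\n/ /
s/<[^>]*>//g
}
}' temp > temp2
    # Pass 3: drop lines that are empty or hold nothing but spaces
    sed -e '/^ *$/d' temp2 > temp3
    rm temp temp2
    # Keep the original around as FILE.old, just in case
    mv "$x" "${x}.old"
    mv temp3 "$x"
done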
, Mike
linux unix internet technology
Tuesday, October 30, 2007
An Easy To Understand Shell Script To Save You Some Hassle
Hey again,
Just to round off the day, I thought this might be of interest to someone other than me.
Personally, I hate editing a shell script and then writing it to disk, chmod'ing it and then running it. I know it's petty and not that big of a deal, but that's what shell scripting's for, right? Taking care of all those boring things you have to do over and over and over again.
This little script I call "vix", and I run it instead of "vi" or "vim" when I'm going to be writing a new executable script. (Your preference; just modify the script. Use emacs if you want ;)
#!/bin/ksh
#
# 2007 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#
# Quoting "$@" keeps file names with spaces in them in one piece
/usr/bin/touch "$@"
/usr/bin/chmod 755 "$@"
/usr/bin/vi "$@"
Simple, right? $@ is just your command line arguments (the names of the scripts you're going to edit/write), so you can call:
vix scripta.sh scriptb.sh ...
instead of
vi scripta.sh scriptb.sh ...
and your script will be executable when you're done editing it. Silly and simplistic? Yes. Useful, too, if you're sick and tired of typing chmod 755 (or whatever) after every script you write :)
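Since vix hands $@ straight to touch, chmod and vi, here is a minimal sketch of how $@ behaves with and without double quotes. This only matters once a script name contains a space. The helper names are made up for the demo and aren't part of vix.

```shell
#!/bin/sh
# Hypothetical helpers that just report how many arguments they received.
count_args() { echo "$#"; }

# Unquoted: the shell splits the arguments again on whitespace.
unquoted() { count_args $@; }

# Quoted: each original argument is passed along intact.
quoted() { count_args "$@"; }

unquoted "my script.sh" other.sh   # prints 3
quoted "my script.sh" other.sh     # prints 2
```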
, Mike
Posted by Mike Golvach at 8:59 PM
administration, advice, edit, executable script, linux, script, scripting, scripts, technology, tips, touch, tricks, unix, vi
Sunday, October 28, 2007
Floating Point Arithmetic and Percentage Comparison in Ksh!
Hello again,
Today, we've got a somewhat heavy script. If you're an administrator, or just a concerned user, you've probably run into a situation where you needed to be notified if a file got too large. This is a fairly common concern, since you're in for lots of extra work if you don't get fair warning before the file fills up the partition it resides on!
Below is a little something I whipped up to keep tabs on pretty much any file. It involves several concepts, which we'll go over in the future. These include mailing out to users from within the script, watching a file's size and comparing it with an "expected" size, and using the shell's limited integer arithmetic to evaluate size and comparative percentage with simulated floating-point decimals.
Below: The script. Over the next few days or weeks, I'll revisit some of the finer points involved here, as they require way too much space when dealt with all at once!
#!/bin/ksh
#
# 2007 - Mike Golvach - eggi@comcast.net
#
# Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
#
concerned_users="user1@xyz.com,user2@xyz.com"
# The file to watch; this has to be set before the existence check below
db_file=/path/to/db.file
if [ ! -f "$db_file" ]
then
    # Headers (Subject, From, To) go first, then a blank line, then the body
    (echo "Subject: Database File Missing!";echo "From: root@xyz.com";for x in $concerned_users;do echo "To: $x";done;echo;echo "The file $db_file does not seem to exist!?";echo "Please check this out a.s.a.p. unless it is a known issue!")|/usr/lib/sendmail -t
    exit 1
fi
# Size in Kb, as reported by du
db_dusk=`/usr/bin/du -sk "$db_file"`
db_size=`echo $db_dusk|/usr/bin/awk '{print $1}'`
db_limit=2097152.20
# dc gives size/limit to 4 decimal places (e.g. .8500); dropping the "." leaves an integer
let threshold=`echo 4 k ${db_size} $db_limit / p |/usr/bin/dc|/usr/bin/sed 's/\.//'`
# Put the decimal point back before the last two digits (e.g. 8500 becomes 85.00)
threshold_pct=`echo $threshold|/usr/bin/sed 's/^\(.*\)\(..\)$/\1\.\2/'`
# Assuming 2Gb Crash Point - 2097152.2 Kb
# -- Start Warning At 85%
if [ $db_size -gt 1782579 ]
then
    (echo "Subject: Database Approaching Maximum Capacity";echo "From: root@xyz.com";for x in $concerned_users;do echo "To: $x";done;echo "";echo "";echo "Database $db_file is currently at";echo "$db_size Kb";echo "This exceeds the 2Gb 85% threshold at ${threshold_pct}%.";echo "Please reduce the size immediately, if possible")|/usr/lib/sendmail -t
fi
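The percentage trick in the script (scale up, divide as integers, then put the decimal point back) can be tried on its own. The sketch below is a swapped-in variant rather than the script's own method: it uses the shell's built-in $(( )) arithmetic instead of dc, so the limit has to be a whole number here. The percent_of name is made up for the demo.

```shell
#!/bin/sh
# percent_of SIZE LIMIT: hypothetical helper. Prints SIZE as a percentage of
# LIMIT with two decimal places, using integer arithmetic only.
percent_of() {
    # Scale by 10000 first so the integer division keeps two decimal digits
    scaled=$(( $1 * 10000 / $2 ))
    # Put the decimal point back: 8500 becomes 85.00
    printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}

percent_of 1782580 2097152   # prints 85.00
```

The result is truncated, not rounded, so 1 out of 3 comes out as 33.33.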
Hopefully, this script will be helpful to you. I look forward to digging deeper into the specifics in future posts :)
, Mike
Posted by Mike Golvach at 1:44 AM
administration, advice, arithmetic, comparison, executable script, floating point, linux, percent, percentage, script, scripting, scripts, technology, tips, tricks, unix
Saturday, October 27, 2007
Getting SSH to Run in a Shell Loop
This is a question I hear a lot, because it's still the default behavior for SSH.
A lot of times, you'll want to run a command remotely, and to do so expediently, you'll want to run that command in a shell loop. This works perfectly well for rsh (we're assuming for rsh that you've set up your .rhosts files and for SSH you have passwordless key exchange set up so these loops don't require user interaction):
for x in hosta hostb hostc
do
rsh $x "hostname" <----------- Or whatever command you want to run
done
This produces: hosta hostb hostc
That will run the command on every machine in your "for x" line and spit out the results. Interestingly enough, if you do this with SSH, only the first host will get processed, and the script will exit, producing only this: hosta
The issue here is with the way SSH deals with its standard input. By default, ssh hooks itself up to the same stdin (your tty, or whatever is feeding the script) that the loop is using, and by the time that first invocation finishes, it may have swallowed or closed the input the rest of the loop depends on, so the loop stops short.
The solution is simple enough; just set up your SSH command so that, in each invocation of the loop, it reads from /dev/null instead (a "null" input), like so (the "-n" option is standard in OpenSSH, but may vary depending on what release of SSH you're using):
for x in hosta hostb hostc
do
ssh -n $x "hostname"
done
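What "-n" changes is ssh's standard input: it gets pointed at /dev/null. One place where the effect is easy to reproduce without any remote hosts is a loop that reads its host list from stdin. The sketch below uses cat as a stand-in for ssh, so it only imitates the stdin behavior; the function names and host names are made up.

```shell
#!/bin/sh
# Stand-in for plain ssh: like ssh, it reads whatever is on stdin.
fake_remote() { cat >/dev/null; echo "ran on $1"; }

# Stand-in for "ssh -n": stdin comes from /dev/null instead.
fake_remote_n() { cat </dev/null >/dev/null; echo "ran on $1"; }

# The first call swallows the rest of the host list, so the loop ends early.
loop_without_n() {
    printf '%s\n' hosta hostb hostc | while read -r host
    do
        fake_remote "$host"
    done
}

# With stdin redirected, every host gets its turn.
loop_with_n() {
    printf '%s\n' hosta hostb hostc | while read -r host
    do
        fake_remote_n "$host"
    done
}

loop_without_n
loop_with_n
```

The first loop reports only hosta; the second reports all three hosts.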
Problem solved :)
I usually add a little extra to the end of the SSH line: ssh -n $x "hostname" 2>/dev/null
The "2>/dev/null" throws away STDERR output so you don't have to see all the connect and disconnect information. That information is useful for simple debugging, but, once your script is working okay, it's just more clutter.
Hopefully this will help you automate processes more securely and more easily in the future!
, Mike
Posted by Mike Golvach at 10:33 PM
administration, advice, executable script, linux, non-interactive, script, scripting, scripts, shell, shell loop, shell script, ssh, technology, tips, tricks, unix
Friday, October 19, 2007
Skipping Blank Lines in Input Files Using Perl
Hey There,
This is all based on the presumption that you've read in a file, examined the comma-delimited fields of each line and put them back together (you've performed whatever actions you needed to on the existing lines and now want to move or print the output, but don't want to print any lines that have no content left after your parsing).
After you've executed the line of code that puts your line back together (added the values back into a comma-delimited line), you can do a chomp on the scalar value (the $line string in the example below).
So, rather than just add an eol character (\n, \r, etc.), make it conditional to remove those blank lines (the example below might not be exactly what you're working with, but the spirit's the same).
I've used SPACE to denote an actual space key type and TAB to denote an actual tab key type. You can use \s (any whitespace) and \t for this also, but typing the literal keys is my habit. (I had to use the SPACE and TAB words here because the space and tab key strikes won't show in this post.)
Run this in a loop on your input (a while loop reading from your filehandle):
chomp($line);
if ( $line !~ /^SPACE*TAB*$/ ) {
    print "$line\n";
}
And you've got your output, parsed per your requirements and no empty lines mucking it up :)
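For comparison, the same filtering idea can be sketched in the shell instead of Perl. This is a swapped-in technique, not the Perl code above: it uses grep -v with a whitespace character class to drop lines that hold nothing but spaces and tabs. The skip_blank_lines name and the sample data are made up.

```shell
#!/bin/sh
# skip_blank_lines: hypothetical helper. Passes through only the lines that
# contain something other than whitespace.
skip_blank_lines() {
    grep -v '^[[:space:]]*$'
}

# Two real records with an empty line and a spaces-and-tab line between them.
printf 'a,b,c\n\n \t \nd,e,f\n' | skip_blank_lines
```

Only the two comma-delimited records come out the other end.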
, Mike
Posted by Mike Golvach at 9:13 PM
administration, advice, blank lines, chomp, eol, executable script, linux, perl, scripting, skipping, technology, tips, tricks, unix
Sunday, October 14, 2007
Using Cron To Send Out Raw Email
Hey there,
It's been a while since work hasn't gotten in the way of my blog-baby here ;) Unfortunately, I've been on-call and it's taken every moment of my time. I'll be handing off the pager here in just about an hour. Yay!!
I got a question about how to use "cron" to send out emails without calling a separate script, using nothing but a mailer that's built in to linux/unix. It's actually very simple to do; it just looks complicated.
For instance, if you wanted to send out an email every 30 minutes on the hour and half hour, with a Subject of "Your Regular Email" and a body with just a few lines in it, like:
"Howdy,
Hope you like this email!
, Mike"
You could just add this entry to your crontab:
0,30 * * * * (echo "Subject: Your Regular Email";echo;echo "Howdy";echo " Hope you like this email";echo;echo " , Mike")|/usr/lib/sendmail you@email.com
It's a simple concept - You create the email on the left side of the pipe (|) and use sendmail on the right side to mail that out to the address (you@email.com).
The most important thing to remember is to write out the entire contents of your email in a subshell (in parentheses). If you wrote all those echo statements outside of a subshell, only the last echo would get passed to Sendmail, and you'd get an email with no subject and a body that said:
" , Mike"
Generally, you only send mail this way for really simple stuff, but it can be as complicated as you like :)
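The difference the parentheses make can be checked without sending any mail. In this sketch, a made-up fake_mailer function stands in for /usr/lib/sendmail and simply marks every line it receives with "mail> ", so you can see what actually went down the pipe.

```shell
#!/bin/sh
# Hypothetical stand-in for sendmail: tags each line that reaches it.
fake_mailer() { sed 's/^/mail> /'; }

# Grouped in a subshell: every echo's output goes into the pipe.
grouped() {
    (echo "Subject: Your Regular Email"; echo; echo "Howdy") | fake_mailer
}

# Not grouped: only the last echo is connected to the pipe; the others
# just print to standard output.
ungrouped() {
    echo "Subject: Your Regular Email"; echo; echo "Howdy" | fake_mailer
}

grouped
ungrouped
```

In the grouped case all three lines carry the "mail> " mark; in the ungrouped case only the "Howdy" line does.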
, Mike
Posted by Mike Golvach at 4:15 AM
administration, advice, cron, crontab, email, executable script, linux, raw email, scripting, sendmail, technology, tips, unix