Wednesday, June 17, 2009
Easily Removing VCS resources From A Running Configuration On Unix Or Linux
Hey there,
Today's post is going to be a bit brief (which is a relative statement. If you've never been here before, this post may seem incredibly long ;). The subject matter (as noted in the Subject ;) deals with how to easily, and safely, remove resources from a running VCS (Veritas Cluster Server) configuration. And, just as a matter of course, I feel I must qualify the previous statement by noting that if you remove a vital resource from your running VCS configuration, no matter how safely and correctly you do it, you may end up with a major headache ;) (Please see our other posts on VCS for Linux or Unix if any part of this run-down requires further explanation. Hopefully, we've already covered it :)
And here we go. The method to doing this is so simple, I will be writing the rest of this post in "Dick and Jane" style. See VCS configuration. See VCS configuration running. See Sysadmin. See Sysadmin answer 15 Instant Messages and respond to 3 emails while fielding questions on a conference call... No. That won't work ;)
Here we go, for real. Removing VCS resources from a running VCS configuration can be extremely simple. In fact, in order to make sure that you not only do it simply, but also correctly, you're going to make use of a file that VCS creates automatically when you save your main.cf file. That file is called "main.cmd". The "main.cmd" file should exist already, but if there's any reason you have to doubt that your main.cmd is correct (or you just don't have one - it should be located in /etc/VRTSvcs/conf/config), you can always create one from your current main.cf by doing the following with the "hacf" command (I do this, usually, just to be sure):
host # cp /etc/VRTSvcs/conf/config/main.cf /var/tmp/main.cf
host # hacf -cftocmd /var/tmp
The above sequence of commands will create a file named "main.cmd" in your /var/tmp directory. This file contains every single command line (from the most basic to the most specific) that you would need to completely recreate your VCS configuration (of course, using the "-cmdtocf" flag might be a bit easier ;) As such, you can, of course, also use this file as a guide to easily "add" resources to your running VCS config, but that's beyond the scope of this post. "hacf" also allows you to use the "-dest" option so you can put the main.cmd that you created in a separate location, like so:
host # hacf -cftocmd /var/tmp -dest /home/user1/.
And here's where it gets easy :)
Let's say that you wanted to remove a Veritas disk volume from your running VCS configuration (which would, also, usually mean that you would want to remove the mount point from VCS control and remove any dependencies). Assuming that you know the volume's name, you can find out everything you need to know about the volume and the mountpoint, including dependency linkage, by simply using the "grep" command against your main.cmd file (Note that, even if you don't know your volume's name, you can figure that out just as easily by simply grepping for "Volume" - the VCS resource type, which is case sensitive - in your main.cmd file, and looking for the name of the disk volume that way).
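If you don't know any of the resource names to begin with, a quick grep for the "-add" lines of the Volume type will give you the full list to pick from. A minimal sketch (the one output line shown is just the resource from our example - yours will obviously differ):
host # grep '^hares -add' /var/tmp/main.cmd | grep ' Volume '
hares -add VOL_volume54_host Volume SG_host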
Then let's say that your volume's name is VOL_volume54_host. And let's also say that you have the privilege and means to execute the VCS commands you'll be discovering (I'm heavily fighting a strong Airplane joke urge right now ;)
host # grep VOL_volume54 /var/tmp/main.cmd
hares -add VOL_volume54_host Volume SG_host
hares -modify VOL_volume54_host Critical 0
hares -modify VOL_volume54_host Volume volume54
hares -modify VOL_volume54_host DiskGroup hostdg
hares -modify VOL_volume54_host Enabled 1
hares -link MNT_mount54_host VOL_volume54_host
hares -link VOL_volume54_host DG_hostdg_host
For now, we'll just use this information to find the disk mount (Mount) name and grep for that, as well:
host # grep MNT_mount54_host /var/tmp/main.cmd
hares -add MNT_mount54_host Mount SG_host
hares -modify MNT_mount54_host Critical 0
hares -modify MNT_mount54_host MountPoint "/disk54/files"
hares -modify MNT_mount54_host BlockDevice "/dev/vx/dsk/hostdg/volume54"
hares -modify MNT_mount54_host FSType vxfs
hares -modify MNT_mount54_host MountOpt largefiles
hares -modify MNT_mount54_host FsckOpt "%-y"
hares -modify MNT_mount54_host CkptUmount 1
hares -modify MNT_mount54_host SecondLevelMonitor 0
hares -modify MNT_mount54_host SecondLevelTimeout 30
hares -modify MNT_mount54_host Enabled 1
hares -link MNT_mount54_host VOL_volume54_host
hares -link RANDOMRESOURCE_resourcename_host MNT_mount54_host
Now you'll put these all together in order (I prefer to remove link dependencies first, then remove the mount followed by the volume - doing otherwise may cause problems for you). Note that one dependency link showed up in both greps. You can run the same command twice without doing any harm, but I'm stripping the duplicate out for neatness' sake. You're almost ready to go. At this point, you'll end up with this list of VCS commands (which you shouldn't run, but - if you do, accidentally - won't cause any harm):
hares -link MNT_mount54_host VOL_volume54_host
hares -link VOL_volume54_host DG_hostdg_host
hares -link RANDOMRESOURCE_resourcename_host MNT_mount54_host
hares -add MNT_mount54_host Mount SG_host
hares -modify MNT_mount54_host Critical 0
hares -modify MNT_mount54_host MountPoint "/disk54/files"
hares -modify MNT_mount54_host BlockDevice "/dev/vx/dsk/hostdg/volume54"
hares -modify MNT_mount54_host FSType vxfs
hares -modify MNT_mount54_host MountOpt largefiles
hares -modify MNT_mount54_host FsckOpt "%-y"
hares -modify MNT_mount54_host CkptUmount 1
hares -modify MNT_mount54_host SecondLevelMonitor 0
hares -modify MNT_mount54_host SecondLevelTimeout 30
hares -modify MNT_mount54_host Enabled 1
hares -add VOL_volume54_host Volume SG_host
hares -modify VOL_volume54_host Critical 0
hares -modify VOL_volume54_host Volume volume54
hares -modify VOL_volume54_host DiskGroup hostdg
hares -modify VOL_volume54_host Enabled 1
Now, you'll strip this down and "reverse" the commands. And by "reverse" I mean reverse the intent, and not the order ;) So, all -link options will become -unlink options, etc. This will leave you with the following (we're removing every "unnecessary" command, since, for instance, removing a volume automatically removes all of its attributes):
hares -unlink MNT_mount54_host VOL_volume54_host
hares -unlink VOL_volume54_host DG_hostdg_host
hares -unlink RANDOMRESOURCE_resourcename_host MNT_mount54_host
hares -delete MNT_mount54_host
hares -delete VOL_volume54_host
And that's a much shorter, and much nicer, list. Now all you have to do to remove the resources successfully is to run the following (also, consider shutting down VCS on all but the primary node, if possible, so that your new configuration doesn't get overwritten, and so that all your secondary, tertiary, etc., VCS nodes do a remote build of the new configuration from your primary node when they come back up):
host # haconf -makerw
host # hares -unlink MNT_mount54_host VOL_volume54_host
host # hares -unlink VOL_volume54_host DG_hostdg_host
host # hares -unlink RANDOMRESOURCE_resourcename_host MNT_mount54_host
host # hares -delete MNT_mount54_host
host # hares -delete VOL_volume54_host
host # haconf -dump -makero
And you're all set. Easy Peasy :)
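One last aside: if you end up doing this sort of cleanup regularly, the shell can build most of that command list for you. Here's a rough sketch (the script name is made up, it only handles one resource at a time - so you'd run it once for the Mount and once for the Volume - and you should still eyeball its output before pasting anything into a live cluster):
#!/bin/sh
# gen_vcs_remove.sh - print (don't run!) the hares commands needed to
# unlink and delete a single resource, based on an existing main.cmd
# Usage: ./gen_vcs_remove.sh RESOURCE_NAME [/path/to/main.cmd]
RES="$1"
CMDFILE="${2:-/var/tmp/main.cmd}"
# Every "hares -link parent child" line that mentions the resource
# becomes a matching "hares -unlink parent child" (the -w keeps us from
# matching similarly-named resources; use an XPG4/GNU grep if your stock
# grep doesn't know -w)
grep '^hares -link' "$CMDFILE" | grep -w "$RES" | sed 's/-link/-unlink/'
# ...and the resource itself gets a single delete (its attributes go with it)
echo "hares -delete $RES"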
Cheers,
Mike
Monday, April 20, 2009
Prepping For Setting Up VCS NFS Clustering On Solaris 10
Hey there,
Hope your work week is beginning swimmingly :) Mine is kind of like the end of last week, although that sentence is a bit of a non sequitur. I actually haven't stopped working, so it still is last week this week (???) You know what I mean (Although, I hope that you don't ;)
Today we're going to look at preparing your Solaris 10 system for clustering NFS (The Network File System ... the "The" is silent ;) on VCS (Veritas Cluster Server). In many ways, it's the same as setting it up on previous versions of Solaris, but it differs in many ways, as well. Apparently, I spend way too much time looking at this issue in many many ways ;)
NOTE: This post is kind of a wrapper around our previous posts on adding NFS to an existing VCS cluster and adding NFS to a VCS cluster with no down time. You can check either of those out if you want to read up on doing the VCS configuration part of the "VCS NFS" setup. This post is of purely a preparatory nature (with, admittedly, some post-installation test steps and a pointer to another old post on an uncommon Solaris 10 VCS NFS error and how to fix it ).
1. First, we'll do the Solaris 10 setup. This is very important, since the SMF (Soul Macerating Futility or, possibly, Service Management Facility) has changed the way in which "services" and "run levels" are either dealt with or completely subverted ;)
a. If you're going to be depending on VCS for NFS management, it will interest you to know that VCS won't have anything to do with NFS if Solaris is also trying to manage it on its own (on the same machine). For that reason, we're going to use svccfg to "delete" the following services, rather than using svcadm to "disable" them.
host # svccfg delete -f svc:/network/nfs/server:default
host # svccfg delete -f svc:/network/nfs/mapid:default
host # svccfg delete -f svc:/network/nfs/status:default
host # svccfg delete -f svc:/network/nfs/nlockmgr:default
b. Doing the above may (actually, should) kill the lockd and statd daemons that are probably already running (I'm trying not to be too presumptuous ;) If that's the case, you'll need to start those up again.
host # (/usr/lib/nfs/lockd &)
host # (/usr/lib/nfs/statd &)
VCS will take care of starting them once it's good to go.
If the above way of starting those services from the command line seems goofy, check out our aging post on what to do when nohup hangs up anyway from way back when. It's a fair read and may still be interesting to a certain degree ;)
c. Finally, you just need to make one directory, for convenience's sake (and also so that the "NFSRestart" resource will actually work):
host # mkdir /opt/VRTSvcs/lock
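One quick aside before moving on: if you ever need to back out of step "a" and hand NFS back to Solaris (say, the box leaves the cluster), the deleted services can be re-imported from their manifests. A hedged sketch - the manifest paths below are the usual Solaris 10 locations, but double-check them on your release:
host # svccfg import /var/svc/manifest/network/nfs/server.xml
host # svccfg import /var/svc/manifest/network/nfs/mapid.xml
host # svccfg import /var/svc/manifest/network/nfs/status.xml
host # svccfg import /var/svc/manifest/network/nfs/nlockmgr.xml
host # svcadm enable svc:/network/nfs/server:default <-- and whatever else needs re-enabling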
2. Now (whooshing right past the "actual" VCS NFS setup referenced above from the two previous posts on that subject), you're ready to do a few simple tests.
a. Once you have NFS running on VCS on your primary node, pick another node (we'll just assume you picked the secondary) and test that the NFS mount is up and working like VCS says it is (You can't always take it at its word):
host2 # showmount -e host
export list for host:
/that/vcs/nfs/directory/share (everyone)
Then just make sure you can actually mount the share.
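The mount test itself can be as simple as something like this (the share name is carried over from the made-up example above, and /mnt is just a convenient scratch mount point):
host2 # mount -F nfs host:/that/vcs/nfs/directory/share /mnt
host2 # df -k /mnt
host2 # umount /mnt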
b. Then fail over to your secondary node and run the same test on the primary:
host # showmount -e host2
export list for host2:
/that/vcs/nfs/directory/share (everyone)
Again, mount the share just to be sure everything's in working order.
c. If you encounter an error while running "showmount" on either server, like this:
showmount: host: RPC: Rpcbind failure - RPC: Authentication error
accompanied by your being able to generate this error (although, not necessarily):
host # rpcinfo -p host
rpcinfo: can't contact portmapper: RPC: Authentication error; why = Failed (unspecified error)
be sure to check out our previous post on this little Solaris 10 VCS NFS gotcha and, hopefully, you'll end up knowing more than you ever wanted to about how to straighten that out :)
And, finally, in an out-of-sequence series of only four posts, you're done setting up NFS in VCS on Solaris 10. Hopefully, you finished a long time ago. It's been six months since some of the referenced posts were originally published. Over time, it keeps getting harder and harder to dot all the i's on this blog ;)
Cheers,
Mike
Thursday, April 9, 2009
A Few More Obscure NetBackup Command Line Quickies
Hey There,
To finish off a Hellish week, or so, of NetBackup insanity (that's gone from not too slickly creating a command line NetBackup Activity Monitor to copying and modifying policies, schedules and clients between NetBackup hosts all the way to a simple way to get better NetBackup support for VCS ), we bring you the final NetBackup post of the week (and, probably, for a while :)
Today's subject is going to be a little more scatter-brained than I usually am (??? ;) The following are a few extra little command lines you can use to make your NetBackup (Linux and Unix) command line experience somewhat less unenjoyable :) The location of the binaries (though spelled out fully here) may differ depending on your installation. This post contains the standard defaults for NetBackup on Solaris 10 and OpenSolaris:
1. List out all your backup pools:
host # /usr/openv/volmgr/bin/vmpool -list_all -bx
pool index max partially full description
---------------------------------------------------------------------------------------------
None 0 0 the None pool
NetBackup 1 0 the NetBackup pool
2. Create new backup pools:
host # /usr/openv/volmgr/bin/vmpool -create -pn MY_Backups -description "Not Yours" -mpf 0
host # /usr/openv/volmgr/bin/vmpool -create -pn MY_Database_Backups -description "I said MINE" -mpf 0
and then make sure they're there:
host # /usr/openv/volmgr/bin/vmpool -list_all -bx
pool index max partially full description
---------------------------------------------------------------------------------------------
None 0 0 the None pool
NetBackup 1 0 the NetBackup pool
MY_backups 2 0 Not Yours
MY_Database_backups 3 0 I Said MINE
3. Inventory your robot (the robot type is tld and the robot number is 0):
host # /usr/openv/volmgr/bin/vmcheckxxx -rt tld -rn 0 -list
Robot Contents
Slot Tape Barcode
==== ==== ============
1 Yes CLN099
2 Yes CLN098
3 Yes C04440
4 Yes C04441
5 Yes C04442
4. List out your tape pools and the number of tapes each has:
host # while read a b c;do echo "Pool no. $b = $a";/usr/openv/volmgr/bin/vmquery -p $b -b|awk '{print $1}'|egrep -v '^media|^ID'|echo $(expr `echo $(wc -l)` - 1);done <<<"`/usr/openv/volmgr/bin/vmpool -list_all -bx|sed 1,2d`"
Pool no. 0 = None
2
Pool no. 1 = NetBackup
0
Pool no. 2 = MY_backups
1
Pool no. 3 = My_Database_backups
2
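If that one-liner makes your eyes cross, here's a longer-winded sketch that does roughly the same thing as item 4. It assumes the same default binary locations, and the header filtering is based on the vmquery output shown later in this post, so sanity-check it against your own version's output first:
#!/bin/sh
# tapes_per_pool.sh - list each volume pool and a count of the tapes in it
VMPOOL=/usr/openv/volmgr/bin/vmpool
VMQUERY=/usr/openv/volmgr/bin/vmquery
# Skip the two header lines of "vmpool -list_all -bx", then read the
# pool name and index from each remaining line
$VMPOOL -list_all -bx | sed 1,2d | while read name index rest
do
    # Count media IDs, filtering out vmquery's header and separator lines
    tapes=`$VMQUERY -p $index -b 2>/dev/null | egrep -cv '^media|^ID|^-'`
    echo "Pool no. $index ($name) = $tapes tape(s)"
done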
5. Do the same thing a different way:
host # while read a b c;do echo -n "$b ";/usr/openv/volmgr/bin/vmquery -p $b -b|awk '{print $1}'|egrep -v '^media|^ID'|echo $(expr `echo $(wc -l)` - 1);done <<<"`/usr/openv/volmgr/bin/vmpool -list_all -bx|sed 1,2d`"
0 2
1 0
2 1
3 2
6. And again (All this "proof of concept" stuff is great for keeping management at bay ;)
host # count=0;for x in $(while read a b c;do echo -n "$b ";/usr/openv/volmgr/bin/vmquery -p $b -b|awk '{print $1}'|egrep -v '^media|^ID'|echo $(expr `echo $(wc -l)` - 1);done <<<"`/usr/openv/volmgr/bin/vmpool -list_all -bx|sed 1,2d`");do if [[ $count -eq 0 ]];then echo -n "Pool No. $x = ";count=$count+1;else echo "Count $x";count=0;fi;done
Pool No. 0 = Count 2
Pool No. 1 = Count 0
Pool No. 2 = Count 1
Pool No. 3 = Count 2
7. Show the commands you'd need to run to recreate your current setup:
host # count=0;for x in $(while read a b c;do echo -n "$b ";/usr/openv/volmgr/bin/vmquery -p $b -b|awk '{print $1}'|egrep -v '^media|^ID'|echo $(expr `echo $(wc -l)` - 1);done <<<"`/usr/openv/volmgr/bin/vmpool -list_all -bx|sed 1,2d`");do if [[ $count -eq 0 ]];then poolnum=$x;count=$count+1;else numtapes=$x;count=0;if [ $numtapes -gt 0 ];then tape_array=($(/usr/openv/volmgr/bin/vmquery -p $poolnum -b|sed 1,3d|awk '{print $1}'|xargs echo));while [ $numtapes -gt 0 ];do let numtapes=$numtapes-1;if [[ $(expr "${tape_array[$numtapes]}" : 'CLN') -eq 3 ]];then echo "CLN /usr/openv/volmgr/bin/vmchange -p $poolnum -m ${tape_array[$numtapes]}";else echo "POOL $poolnum /usr/openv/volmgr/bin/vmchange -p $poolnum -m ${tape_array[$numtapes]}";fi;done;fi;fi;done
CLN /usr/openv/volmgr/bin/vmchange -p 0 -m CLN099
CLN /usr/openv/volmgr/bin/vmchange -p 0 -m CLN098
POOL 2 /usr/openv/volmgr/bin/vmchange -p 2 -m C04440
POOL 3 /usr/openv/volmgr/bin/vmchange -p 3 -m C04441
POOL 3 /usr/openv/volmgr/bin/vmchange -p 3 -m C04442
8. Yet another way to list your tapes, and the pools to which they belong, using your existing setup as input:
host # while read a b c;do echo "Pool no. $b = $a";/usr/openv/volmgr/bin/vmquery -p $b -b|awk '{print $1}'|egrep -v '^media|^ID'|echo $(expr `echo $(wc -l)` - 1);done <<<"`/usr/openv/volmgr/bin/vmpool -list_all -bx|sed 1,2d`"
Pool no. 0 = None
2
Pool no. 1 = NetBackup
0
Pool no. 2 = MY_backups
1
Pool no. 3 = MY_database_backups
2
9. Create an array of volume pools - each with a number of members equal to the number of members in the pool (you'll see why in a second):
host # tanum=0;while read a b c;do tnum=$(/usr/openv/volmgr/bin/vmquery -p $b -b|awk '{print $1}'|egrep -v '^media|^ID'|echo $(expr `echo $(wc -l)` - 1));if [[ $tnum -gt 0 ]];then while [[ $tnum -gt 0 ]];do let tnum=$tnum-1;tarray[$tanum]=$b;let tanum=$tanum+1;done;fi;done <<<"`/usr/openv/volmgr/bin/vmpool -list_all -bx|sed 1,2d`";echo ${tarray[@]}
0 0 2 3 3
10. Use the array you just created to create new media pools on a separate host (note that this will only assign the CLN tapes to the 0 pool - specifically - and will randomly (in order ;) assign the number of tapes necessary to each of your pools. Don't forget to set those up first (as above :))
PROOF OF CONCEPT RUN
host # tarray=(0 0 2 3 3);count=0;for x in $(echo ${tarray[@]});do tape_array=($(/usr/openv/volmgr/bin/vmquery -a -b|sed 1,3d|awk '{print $1}'|sort -rn|xargs echo));if [[ $(expr "${tape_array[$count]}" : 'CLN') -eq 3 ]];then echo "CLN $x /usr/openv/volmgr/bin/vmchange -p $x -m ${tape_array[$count]}";let count=$count+1;else echo "POOL $x /usr/openv/volmgr/bin/vmchange -p $x -m ${tape_array[$count]}";let count=$count+1;fi;done
CLN 0 /usr/openv/volmgr/bin/vmchange -p 0 -m CLN009
CLN 0 /usr/openv/volmgr/bin/vmchange -p 0 -m CLN010
POOL 2 /usr/openv/volmgr/bin/vmchange -p 2 -m C08880
POOL 3 /usr/openv/volmgr/bin/vmchange -p 3 -m C08881
POOL 3 /usr/openv/volmgr/bin/vmchange -p 3 -m C08882
AND THE REAL DEAL:
host # tarray=(0 0 4 4 4 4 4 4 4 4 4 5 5 7 7 7 7 7 7 7 7 7);count=0;for x in $(echo ${tarray[@]});do tape_array=($(/usr/openv/volmgr/bin/vmquery -a -b|sed 1,3d|awk '{print $1}'|sort -rn|xargs echo));if [[ $(expr "${tape_array[$count]}" : 'CLN') -eq 3 ]];then echo "Running /usr/openv/volmgr/bin/vmchange -p $x -m ${tape_array[$count]}";/usr/openv/volmgr/bin/vmchange -p $x -m ${tape_array[$count]};let count=$count+1;else echo "Running /usr/openv/volmgr/bin/vmchange -p $x -m ${tape_array[$count]}";/usr/openv/volmgr/bin/vmchange -p $x -m ${tape_array[$count]};let count=$count+1;fi;done
Running /usr/openv/volmgr/bin/vmchange -p 0 -m CLN009
Running /usr/openv/volmgr/bin/vmchange -p 0 -m CLN010
Running /usr/openv/volmgr/bin/vmchange -p 2 -m C08880
Running /usr/openv/volmgr/bin/vmchange -p 3 -m C08881
Running /usr/openv/volmgr/bin/vmchange -p 3 -m C08882
11. Verify that your tape assignments worked:
host # /usr/openv/volmgr/bin/vmpool -list_all -bx
pool index max partially full description
---------------------------------------------------------------------------------------------
None 0 0 the None pool
NetBackup 1 0 the NetBackup pool
MY_backups 2 0 Not Yours
MY_Database_backups 3 0 I said MINE
host # while read a b c;do echo "Pool no. $b = $a";/usr/openv/volmgr/bin/vmquery -p $b -b|awk '{print $1}'|egrep -v '^media|^ID';done <<<"`/usr/openv/volmgr/bin/vmpool -list_all -bx|sed 1,2d`"
Pool no. 0 = None
-------------------------------------------------------------------------------
CLN009
CLN010
Pool no. 1 = NetBackup
-------------------------------------------------------------------------------
Pool no. 2 = MY_backups
-------------------------------------------------------------------------------
C08880
Pool no. 3 = MY_Database_backups
-------------------------------------------------------------------------------
C08881
C08882
And so much more... but not for today ;)
Cheers,
Mike
Wednesday, April 8, 2009
A Simple Way To Get Better Symantec NetBackup Support For VCS
Hey there,
This post is a little gift for those of you out there who run clustered NetBackup on VCS to go along with our earlier posts on emulating the NetBackup activity monitor on the command line and copying and modifying NetBackup policies, schedules and clients between hosts :)
As most of you who've run NetBackup in a VCS cluster may know, sometimes getting support for NetBackup (used with VCS) can be a royal pain; never mind that NetBackup (Symantec or, previously, Veritas) can be "purchased" as a VCS cluster add-on. The problem (and, to me, this is inexplicable, since NetBackup used to be a Veritas product) is that, if you ever experience issues with your NetBackup cluster and need to call Symantec for help, they generally have an entirely contrary attitude if they find out that you're actually running NetBackup in a VCS cluster that includes other "certified" add-on components (?)
And, yes, you read that correctly: You're going to have problems getting reasonable support for NetBackup if you've incorporated any other VCS modules in your cluster. If you want to get decent support for your clustered NetBackup setup (in my experience) your cluster should "only" include the NetBackup add-on Module. This is hardly a realistic scenario, but (for instance), if you're running the Oracle and NetBackup add-on modules in the same cluster, your support experience will be less than pleasant.
Hurdling right over the argument that Symantec "should" enthusiastically support NetBackup in a Cluster, no matter what else is clustered with it, we'll proceed directly to the solution. It's not exactly Kosher, but it works :)
Let's say, for instance, that you have a 2 node VCS cluster. Let's also say that your cluster has both Oracle and NetBackup add-on modules installed and operational. Then, finally, let's say the NetBackup component begins to fail miserably. What then?
As a little aside, in the spirit of Airplane: "For instance, that you have a 2 node VCS cluster. That your cluster has both Oracle and NetBackup add-on modules installed and operational. The NetBackup component begins to fail miserably. What then?"
Sorry; couldn't help it ;)
Basically, in the situation described (and repeated, per instruction ;) above, you'll get half-hearted support for your NetBackup problem, at best, if (and this is important) you "tell the truth about your setup." I'm not advocating dishonesty (although, as they say, it "is" the second best policy ;), but, if you want Symantec to fully support your NetBackup issue and not waste a lot of time pointing fingers at Veritas and, generally, giving you the run-around, you'll need to assure them that you're "only" running NetBackup on your cluster (no matter how cost-inefficient that may be ;)
Please be sure, before you use the method below, that you have conclusively ruled out other add-on modules in your cluster as being a part of (or the entirety of) the problem.
Once you've determined that NetBackup is, indeed, the issue, you'll need to call Symantec for that support you're "paying for" (another thing that makes this whole ordeal ridiculous). And, in order to get that full support, you'll need to implement the following two methods:
1. Always play your cards very close to the vest. Don't divulge unnecessary information and only provide very specific responses to questions asked. For instance, if NetBackup Support wants to see the permissions of your main.cf file, only do an "ls -l" of the one specific main.cf. If they want to see the contents, only show them that "one" file's contents. And, of course, don't let them "take over" your machine via WebEx or any other "remote control" mechanism.
2. Make sure that you have a "prop" main.cf to show them (This is the gift attached below. It may only be a template that needs customizing to suit your own site's naming conventions, etc, but it's a nice gesture, at least ;)
Feel free to manipulate the following workable example as necessary. The only things you should need to change would be the specific "names" that you've given your resources (and NIC device names, IP addresses, etc). And, remember, this main.cf is for "support eyes only" - it probably doesn't do everything you need to do in your "real" cluster configuration (which is why you should be sure you rename it to something else before you fire up your first node ;), but it "does" do the most important thing of all when it comes to getting support for NetBackup on VCS: It describes a VCS cluster that contains the NetBackup add-on module and no others!
Enjoy :)
include "types.cf"
include "/usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf"
cluster MyClusterName (
)
system MyHost1 (
)
system MyHost2 (
)
group MyMainServiceGroup (
SystemList = { MyHost1 = 0, MyHost2 = 1 }
AutoStartList = { MyHost1 }
)
// resource dependency tree
//
// group MyMainServiceGroup
// {
// }
group MyNetBackupServiceGroup (
SystemList = { MyHost1 = 0, MyHost2 = 1 }
AutoStartList = { MyHost1, MyHost2 }
)
DiskGroup MyNetBackupDiskGroup (
DiskGroup = netbackupdg
)
IP MyNetBackupIP (
Device = ce0
Address = "10.99.99.99"
NetMask = "255.255.255.0"
)
Mount MyNetBackupMount (
MountPoint = "/shared/opt/VRTSnbu"
BlockDevice = "/dev/vx/dsk/netbackupdg/sharedoptvrtsnbu"
FSType = vxfs
MountOpt = largefiles
FsckOpt = "-y"
)
NIC MyNetBackupNIC (
Device = ce0
)
NetBackup MyNetBackupServer (
ServerName = MyNetBackupServer
ServerType = NBUMaster
)
Volume MyNetBackupVolume (
Volume = sharedoptvrtsnbu
DiskGroup = netbackupdg
)
MyNetBackupIP requires MyNetBackupNIC
MyNetBackupMount requires MyNetBackupVolume
MyNetBackupServer requires MyNetBackupIP
MyNetBackupServer requires MyNetBackupMount
MyNetBackupVolume requires MyNetBackupDiskGroup
// resource dependency tree
//
// group MyNetBackupServiceGroup
// {
// NetBackup MyNetBackupServer
// {
// IP MyNetBackupIP
// {
// NIC MyNetBackupNIC
// }
// Mount MyNetBackupMount
// {
// Volume MyNetBackupVolume
// {
// DiskGroup MyNetBackupDiskGroup
// }
// }
// }
// }
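One last thought on that prop file: before you ever have to show it to anyone, it's worth making sure it at least parses. A quick, hedged sketch - the staging directory name below is made up, and the source path is wherever you actually keep your copy:
host # mkdir /var/tmp/nbu_prop
host # cp /path/to/your/prop_main.cf /var/tmp/nbu_prop/main.cf <-- wherever you keep it
host # cp /etc/VRTSvcs/conf/config/types.cf /var/tmp/nbu_prop/
host # hacf -verify /var/tmp/nbu_prop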
Mike
Thursday, February 19, 2009
Cluster Server Failover Testing On Linux And Unix
A fine how do you do :)
WARNING/GUARANTEE: Today's post is the product of a tired mind that just finished working and didn't have much time to think beyond the automatic. If you feel you may be entertained, please continue reading. If you want to learn some really useful tricks to test a two-node cluster's robustness, this article may be for you, too. If you're looking for information you can apply in the workplace without being escorted from the building by armed guards, proceed no further ;)
As today's post title obliquely suggests, I'm going to take another day to properly formulate my response to our "F" reading experiment (not to be confused with the anti-literacy initiative ;) that we began on Monday. I've received a number of very interesting comments on the subject of the article that got the whole idea rolling around in that butter churn in between my ears. Although none of the responses have radically changed my feelings on the subject, they have augmented them and provided some fresh perspective. Although I still intend to throw a little signature "meta" style into the post (because if we all read in the F formation, my post is going to have to play off of that to whatever degree I can manage :), I'm now reconsidering my original rough-draft and, possibly, working some additional angles into it. I've got some emails out there (as I always request permission to use other folks' opinions when they're kind enough to share) and hope to hear back soon. Worst case, I'll post the original tomorrow and add the comments as their own entities (attached, of course) at a later date.
Also, as this post's title overtly suggests, I spent most of my day testing cluster failover scenarios at work. I won't mention any proprietary or freeware brand names, as this post isn't specific enough to warrant the reference, but, after today's exercise (which, of course, I've had to do more than a couple different ways at a couple of different places of employment) I decided to put together a small comprehensive list of two-node cluster disaster/failure/failover scenarios that one should never push a cluster into production without performing.
It goes without saying that the following is a joke. Which is, of course, why I "wrote" it with my lips sealed ;)
Comprehensive Two-Node Cluster Failover Testing Procedure - v0.00001alpha
Main assumption: You have a two-node cluster all set up in a single rack, all service groups and resources are set to critical, no service groups or resources are frozen and pretty much everything should cause flip-flop (technical term ;)
1. Take down one service within each cluster service group (SG), one at a time. Expected result: Each cluster service group should fault and failover to the secondary node. The SG's should show as faulted in your cluster status output on the primary node, and online on the secondary.
2. Turn all the services, for each SG, back on, one by one, on the primary node. Expected result: All of the SG's should offline on the secondary node and come back up on the primary.
3. Do the same thing, but on the secondary. Expected result for the first test: Nothing happens, except the SG's show as faulted on the secondary node. Expected result for the second test: Nothing happens, except the SG's show as back offline on the secondary node.
4. Switch SG's from the primary to secondary node cleanly. Expected result: What did I just write?
5. Switch SG's from the secondary node back to the primary node cleanly. Expected result: Please don't make me repeat myself ;)
6. Unplug all heartbeat cables (serial, high priority ethernet, low priority, disk, etc) except one on the primary node. Expected result: Nothing happens except, if you're on the system console, you can't type anything anymore because the cluster is going freakin' nuts with all of its diagnostic messages!
7. Plug all those cables back in. Expected result: Everything calms down, nothing happens (no cluster failover) except you realize that you accidentally typed a few really harmful commands and may have hit enter while your screen was draped with garbage characters. The primary node may be making strange noises now ;)
8. Do the same thing on the secondary node. Expected result: No cluster failover, but the secondary node may now be making strange low beeping sounds and visibly shaking ;)
9. Pull the power cords out of the primary node. Expected result: Complete cluster failover to the secondary node.
10. Put the plugs back in. Expected result: Complete cluster failback to the primary node.
11. Do the same thing to the secondary node. Expected results for both actions: Absolutely nothing. But you knew this already. Are you just trying to waste the company's time? ;)
12. Kick the rack, containing the primary and secondary node, lightly. Expected results: Hopefully, the noises will stop now...
13. Grab a screwdriver and repeatedly stab the primary node. Expected Result: If you're careful you won't miss and cut yourself on the razor sharp rack mounts. Otherwise, everything should be okay.
14. Pull the fire alarm and run. Expected result: The guy you blame it on may have to spend the night in lock-up ;)
15. Tell everyone everything's fine and the cluster is working as expected. Expected result: General contentment in the ranks of the PMO.
16. Tell everyone something's gone horribly wrong and you have no idea what. Use the console terminal window on your desktop and export it via WebVNC so that everyone can see the output from it. Before exporting your display, start up a program you wrote (possibly using script and running it with the "-t" option to more accurately reflect realistic timing, although a bit faster). Ensure that this program runs in a continuous loop. Expected Result: General pandemonium. Emergency conference calls, 17 or 18 chat sessions asking for status every 5 seconds and dubious reactions to your carefully pitched voice, which should speak in reassuring terms, but tremble just slightly like you're a hair's breadth away from a complete nervous breakdown.
17. Go out to lunch. Expected Result: What do you care? Hopefully, you'll feel full afterward ;)
Cheers,
Mike
Tuesday, February 17, 2009
Adding Slightly Different Types In VCS On Linux And Unix
Hey there,
Today we're going to take a look at creating new "types" for use with Veritas Cluster Server (VCS). In a broad sense of the term, almost everything you'll ever define in your main.cf (the main configuration file for VCS) is based on a specific "type," which is actually described in the only standard include file in that configuration file: types.cf - Note that both main.cf and types.cf are located in /etc/VRTSvcs/conf/config. You could move the types.cf to an alternate location fairly easily and only have to modify the "include" line in main.cf, but it's not recommended. VCS has the potential to make your life miserable in many built-in ways ;)
For instance, if you had an entry like this in your main.cf:
Apache httpd_server (
httpdDir = "/usr/local/apache"
HostName = www
User = nobody
ConfigFile = "/usr/local/apache/conf/httpd.conf"
)
that Apache instance you described (its name "httpd_server" is arbitrary and/or up to you, but is how you would reference that instance of that type later on in the config file) would actually be based on the "Apache" type (all types in VCS are "cAse SensiTIVe ;) in types.cf, which is described thusly (as you can see, it has many attributes that, mostly, are left at their defaults - the ones we specifically defined above are worth noting):
type Apache (
static str ArgList[] = { ResLogLevel, State, IState, httpdDir, SharedObjDir, EnvFile, HostName, Port, User, SecondLevelMonitor, SecondLevelTimeout, ConfigFile, EnableSSL, DirectiveAfter, DirectiveBefore }
str ResLogLevel = INFO
str httpdDir
str SharedObjDir
str EnvFile
str HostName
int Port = 80
str User
boolean SecondLevelMonitor = 0
int SecondLevelTimeout = 30
str ConfigFile
boolean EnableSSL = 0
str DirectiveAfter{}
str DirectiveBefore{}
)
As you may have noted, a lot of defaults are set in the standard type definition. For instance, the Port is set to 80 by default. You could override that in your main.cf by simply including a line in your "Apache httpd_server (" definition that reads: Port = 8080
or whatever you preferred.
Assuming that you will only be running Apache web servers on either port 80 or 8080 (we're going to skip 443, since it "silently" gets defined if you set "EnableSSL = 1" which includes that port automatically - although we may be putting that in a way that seems slightly off ;) you can either do things the easy way and just describe two differently-named Apache instances in your main.cf, like so (Be sure to check out our older posts on modifying the main.cf file properly if you're uncertain as to whether or not you're updating and/or distributing config files appropriately):
Apache regular_server (
httpdDir = "/usr/local/apache"
HostName = www
User = nobody
Port = 80
ConfigFile = "/usr/local/apache/conf/httpd.conf"
)
Apache alternate_server (
httpdDir = "/usr/local/apache"
HostName = www
User = nobody
Port = 8080
ConfigFile = "/usr/local/apache/conf/httpd.conf"
)
Or do things the hard way. The hard way can be dangerous (especially if you make typos and/or any other serious errors editing the types.cf file - back it up before you muck with it ;) and is generally not necessary. We just made it the topic of today's post to show you how to do it in the event that you want to customize to that degree and/or need to. Keep in mind that (if you change, or add to, types.cf) you should always keep a backup of both the original and your modified version of types.cf handy. If you ever apply a patch or service/maintenance pack from Veritas, it may very well overwrite your custom types.cf file.
Assuming you've read the preceding paragraph and are willing to take the risk, you might modify your types.cf file to include two different versions of the Apache type: One for servers running on port 80 (the default) and one for servers running on port 8080. As we mentioned, types in types.cf are "case sensitive," which makes it easy to create a new type with a similar, but still unique, name. This works out well in Unix and Linux, since most types are associated with an actual binary directory in /opt/VRTSvcs/bin (which gets the OS involved, and the OS is case sensitive).
So, assuming we wanted to add an "ApachE8080" type to types.cf, our first move would be to duplicate/modify the binary directory in /opt/VRTSvcs. In our example, this is very simplistic, since we're not creating a new type from scratch and doing everything the hack'n'slash way (If you prefer order over speedy-chaos, check out the "hatype," "haattr" and "hares" commands, although not necessarily in that order ;)
host # ls /opt/VRTSvcs/bin/Apache
Apache.pm ApacheAgent monitor online
Apache.xml clean offline
The /opt/VRTSvcs/bin/Apache directory contains a number of programs and generic "methods" required by VCS for most resource types (As a counter-example, persistent, always-on types like NIC don't require the "online" or "offline" scripts/methods listed under Apache, since they're "always on" in theory ;). In order for us to create our ApachE8080 type, we'll need to do this one simple step first:
host # cp -pr /opt/VRTSvcs/bin/Apache /opt/VRTSvcs/bin/ApachE8080
host # ls /opt/VRTSvcs/bin/ApachE8080
Apache.pm ApacheAgent monitor online
Apache.xml clean offline
host # diff -r /opt/VRTSvcs/bin/Apache /opt/VRTSvcs/bin/ApachE8080
host # <-- In this case, no news is good news :)
Now, all we have to do is modify our types.cf file. We're not sure if it's required, but we always duplicate the empty lines between type definitions just in case (It doesn't hurt). This is what our new type will look like (Note that the only line changed - aside from the name of the type - is the "int Port" line):
type ApachE8080 (
static str ArgList[] = { ResLogLevel, State, IState, httpdDir, SharedObjDir, EnvFile, HostName, Port, User, SecondLevelMonitor, SecondLevelTimeout, ConfigFile, EnableSSL, DirectiveAfter, DirectiveBefore }
str ResLogLevel = INFO
str httpdDir
str SharedObjDir
str EnvFile
str HostName
int Port = 8080
str User
boolean SecondLevelMonitor = 0
int SecondLevelTimeout = 30
str ConfigFile
boolean EnableSSL = 0
str DirectiveAfter{}
str DirectiveBefore{}
)
And, you're all set. Now you can change your main.cf file to look like the following and everything should work just as if you had done it the easy way (Be sure to check out our older posts on modifying the main.cf file properly if you're uncertain as to whether or not you're updating and/or distributing config files appropriately):
Apache regular_server (
httpdDir = "/usr/local/apache"
HostName = www
User = nobody
ConfigFile = "/usr/local/apache/conf/httpd.conf"
)
ApachE8080 alternate_server (
httpdDir = "/usr/local/apache2"
HostName = www
User = nobody
ConfigFile = "/usr/local/apache2/conf/httpd.conf"
)
and, now, you no longer have to go through all the trouble of adding that "Port = 80" (or "Port = 8080") line to your main.cf type specification...
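Once the edited types.cf and main.cf are in place, it doesn't hurt to confirm that the configuration still parses and that VCS actually knows about the new type (keep in mind that hatype talks to the running cluster, so the new type will only show up once VCS has been rebuilt/restarted with the new config). A hedged sketch, assuming the default config directory:
host # hacf -verify /etc/VRTSvcs/conf/config
host # hatype -list | grep -i apache
Apache
ApachE8080
host # hatype -display ApachE8080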
Six of one, half dozen of another? Whatever works for you, depending on your situation, is our basic rule. Or, in other words, 13 of one, baker's dozen of another ;)
Cheers,
Mike
Monday, February 9, 2009
Getting Faster Support For Your VCS-Clustered NetBackup Servers
Hey There,
Today's post is a little trick that anyone running Veritas/Symantec NetBackup (Linux or Unix) on VCS - Veritas Cluster Server - should know. As the title suggests, doing this one little thing will almost guarantee you more responsive support from Symantec (given the highly specific situation outlined in the first sentence, of course ;). The funny thing, though, is that many people have this problem already; they just may not have had to deal with Symantec, with regards to it, yet.
The problem you have (or will have) to face is that Symantec "does not" fully support VCS Clustered NetBackup installations if that same cluster has other Service Groups and/or major resources active on it. I'd almost say that they don't support it at all, but that's not entirely true (just keep complaining. You'll see ;). However, Symantec's stated position is that they really "don't" support it at all (and this is only hearsay, of course. Something I've heard while on the phone with another person who answered the phone at the number listed on our contract for Symantec NetBackup Support. So, of course, I may not be entirely correct, here ...).
Basically, what this means is that Symantec has VCS Cluster support for NetBackup (They own both NetBackup and Veritas Cluster Server), have VCS modules (types files, pkg's, etc) for NetBackup and even have full documentation online to help you set up your NetBackup Server(s) in VCS. Even given that fact, they do not support NetBackup in a VCS cluster that also runs, say, NFS or Oracle or even a shared mountpoint that isn't NB-related. NOTE: They especially don't support IPMultiNICB (I'm not sure why, since NIC auto-failback seems like a great idea to me)!! Their contention is that NetBackup is meant to function on VCS as the "only service or resource." This is fine, if you can spare 2 or 3 servers to just run NetBackup (larger companies, and data security/storage/backup companies, probably have no issue with this). However, most businesses can't afford to spare the extra servers "just" for NetBackup and will often have NetBackup in the same VCS cluster that runs an NFS group/resource, other shared mountpoints, web servers, databases, etc. Hopefully not "all" of those on one machine, but times are tight; you never know ;)
Now, here's the trick (it's not even a trick, really, just some good advice to help you out) to getting speedy support from Symantec when you call them about an issue with NetBackup on VCS: You're going to have to do a little bit of prep work and (even if it hurts) lie a little bit. If you're not comfortable "bending the truth," just remember that dishonesty is the second-best policy. Things could be worse ;)
1. Before you call, have a backup main.cf ready to go. And, this is very important if you're working on a cluster that's been around for a while, make sure that you do one of two things when Symantec Support asks for anything like an "ls" of your /etc/VRTSvcs/conf/config directory:
a. Do a very specific ls:
host # ls /etc/VRTSvcs/conf/config/main.cf <-- They don't need to see the other 500 backups that are three times the size of your faked-up file ;)
b. Create a staging directory, copy only the minimal files you want into it, and then give them an "ls" of that:
host # cd /etc/VRTSvcs/conf/config/stage <-- Don't show them this!
host # ls -l *
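If it helps, building that staging area is only a couple of commands - something along these lines (the "main.cf.nbu_only" name is invented for the example; use whatever you call your cleaned-up copy):
host # mkdir /etc/VRTSvcs/conf/config/stage
host # cp -p /etc/VRTSvcs/conf/config/types.cf /etc/VRTSvcs/conf/config/stage/
host # cp -p /var/tmp/main.cf.nbu_only /etc/VRTSvcs/conf/config/stage/main.cf
host # cd /etc/VRTSvcs/conf/config/stage
host # ls -l *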
2. As noted above, you need a clean main.cf for NetBackup Support, or you're going to end up getting bounced around from department to department, at best. If you're running anything aside from NetBackup in one cluster, you'll need to remove it (not literally, of course). Here's a quick example (This setup is going to be very basic, just to keep it short - which means the commented dependency trees that VCS puts into the conf file are removed, for this example, as well) - BTW, feel free to use this exact main.cf (just change a few names and numbers) if it works for you. It's been verified on the system I'm goofing with, but it probably won't work out-of-the-box on your system :)
Since this example is so long, I'll bid you farewell for the day, now. Hope this helps you if you ever get in a "We don't support such-and-such" or "You really shouldn't be doing such-and-such" or "It's a VCS Problem <--> It's a NetBackup Problem" sort-of-situation. And, remember, even if you goof up once and tell them the truth (who amongst us isn't guilty of being upfront every once in a while ;), you can always ask them to cancel your request ("Oh wait, I see what the problem is... Sorry, you can close the ticket" etc) and then call back later and go through normal channels so you'll hopefully get a different person (and, when you do, be sure to intimate that the problem you're having now is totally unrelated to the previous ticket, if they're efficient and happen to look for, and find, it. It's worth a shot ;)
Cheers,
--- Stripped-Down main.cf ---
include "types.cf"
include "/usr/openv/netbackup/bin/cluster/vcs/NetBackupTypes.cf"
cluster VCScluster1 (
UserNames = { admin = EncryptedPassword }
ClusterAddress = "10.10.10.199"
Administrators = { admin }
)
system VCShost1 (
)
system VCShost2 (
)
group ClusterService (
SystemList = { VCShost1 = 0, VCShost2 = 1 }
AutoStartList = { VCShost1 }
)
IP webip (
Device = bge0
Address = "10.10.10.199"
NetMask = "255.255.255.0"
)
NIC webnic (
Device = bge0
)
webip requires webnic
group SG_VCSNB1 (
SystemList = { VCShost1 = 0, VCShost2 = 1 }
AutoStartList = { VCShost1 }
)
IPMultiNIC nbuIP (
Address = "10.10.10.199"
MultiNICResName = mnic
)
MultiNICA mnic (
Device @VCShost1 = { bge1 = "10.10.10.197", bge2 = "10.10.10.198" }
Device @VCShost2 = { bge2 = "10.10.10.198", bge1 = "10.10.10.197" }
ArpDelay = 5
IfconfigTwice = 1
PingOptimize = 0
)
DiskGroup VCShostDG (
DiskGroup = VCShostDG
)
Mount MNT_NBmount_VCSNB1 (
MountPoint = "/opt/VRTSnbu"
BlockDevice = "/dev/vx/dsk/VCShostDG/NBmount"
FSType = vxfs
MountOpt = largefiles
FsckOpt = "-y"
)
NetBackup APP_nbu_VCSNB1 (
Critical = 0
ServerName = NBU_Server
ServerType = NBUMaster
)
Volume VOL_NBmount_VCSNB1 (
Volume = NBmount
DiskGroup = VCShostDG
)
APP_nbu_VCSNB1 requires nbuIP
APP_nbu_VCSNB1 requires MNT_NBmount_VCSNB1
MNT_NBmount_VCSNB1 requires VOL_NBmount_VCSNB1
VOL_NBmount_VCSNB1 requires VCShostDG
nbuIP requires mnic
Mike
Wednesday, January 14, 2009
How Many Different Ways Can You Stop VCS On Linux Or Unix?
Hey there,
This little rundown may seem trivial in comparison to some of our older posts on Veritas Cluster Server (VCS) , as they all attacked a very specific aspect of VCS functionality and, mostly, were aimed at either explaining more advanced concepts or showing you how to get away with stuff that you're not supposed to do ;)
However, even in its simplicity, today's topic is just as valid as our previous posts, and (quite probably) more useful to the admin or user who wants to know enough to administrate VCS well, but doesn't care about all the nitty-gritty. Because, let's face it, the nitty-gritty doesn't really matter all the time. I find that I almost never think of the nitty gritty every morning when I start my car (I reserve that special time for praying that it won't explode ;)
So, before I veer too far off the beaten path, we'll get down to today's subject: How to shutdown a Veritas Cluster, using VCS, the many different ways to do it and what they all mean.
Of course, the base command for stopping VCS (Sometimes also referred to as HAD for the High Availability Daemon, which explains why all the commands for VCS start with "ha" :) is "hastop." hastop has a number of ways in which you can call it; all of which can have a significant impact on your day if used arbitrarily ;)
1. hastop -local. This usage stops VCS on your local machine (or the machine on which you're running it - same thing). It also causes any service groups on your local system to be taken offline.
2. hastop -local -evacuate. This invocation is very similar to the previous, except that it migrates (if possible) all of your service groups to the default failover system before it stops the VCS services on your local machine. This option isn't available if you use the -all flag (since, obviously, there won't be any systems left up in the cluster to fail over to ;)
3. hastop -local -noautodisable. The -noautodisable option (also not available for use in conjunction with the -all argument) keeps VCS from automatically disabling your service groups when it stops on the local system.
4. As an aside of sorts, it should be noted that you can use -evacuate and -noautodisable together, although you can't use either with other arguments, like -force or -all.
5. hastop (-all|-local) -force. These options, when presented to hastop, cause it to stop VCS services on all systems in your cluster (-all) or just your local system (-local), but it does not do anything to your service groups or offline their resources. This is convenient when you have planned maintenance on VCS-managed resources and don't want to create a fault condition (and all its accompanying alarms, bells and whistles ;) When VCS is stopped in this manner, hastart merely resumes running VCS services as normal (It doesn't bring VCS up with an overt assumption that something wrong has happened). One downside of using the -force option is that it doesn't check whether or not you may be editing the main.cf file at the time you decide to call it quits. If you have it open, in read/write mode, and you stop VCS this way, you may be in for an unpleasant restart.
6. hastop -all. This stops VCS on all systems in your cluster, as the previous command does, but it doesn't ignore your service groups and will take them offline.
7. hastop -sys SYSTEMNAME. This way of executing hastop causes VCS to react in much the same way as the -local flag does (including accepting the -evacuate, -noautodisable and -force arguments - the same rules, listed above, apply to combinations of -evacuate and -noautodisable (ok) and -force (not ok to mix with either of the others)). One thing to note is that you can't use the -local and/or -all flags with the -sys flag (again, for obvious reasons. It's nice to know that some things in this world make sense ;), since the -sys flag requires you to specify a specific system (or host) in your cluster on which to execute the command.
8. hastop -help. This is a great option to use when you have no idea who you are, where you are, how you got there and/or what to do ;)
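To tie a few of those flags together, a typical planned-maintenance run on a single node might look something like the following sketch (check your service group states before and after, and adjust the flags to taste):
host # hastatus -summary <-- see where everything is running first
host # hastop -local -evacuate <-- push the SGs to the failover node, then stop VCS here
... do your maintenance ...
host # hastart <-- bring VCS back up; it will do a remote build from a running node
host # hastatus -summary <-- confirm the SGs are where you expect them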
One final (and, of course, interesting ;) note about modifying hastop's behaviour is that you can modify the EngineShutdown "cluster attribute" (outside the scope of this little post) to set different default behaviours for hastop. This can be a big help if you always do things "the not normal way" ;) The EngineShutdown cluster attribute(s) (ESCA from now on, since I'm getting tired of mistyping the name of this thing ;) that you can set (with the haclus command) are:
a. Enable: This ESCA indicates that all hastop commands should be processed. This is the default.
b. Disable: This ESCA indicates that all hastop commands will not be processed. The only exception to this setting is if you use the -force argument to hastop.
c. DisableClusStop: This ESCA makes it so that the "hastop -all" command is rejected or not processed. All other hastop commands are processed as they normally would be.
d. PromptClusStop: This ESCA will cause the operator/administrator to be prompted (Do you really want to do this? Are you sure? C'mon, seriously? ;) before it executes the "hastop -all" command. All other hastop commands are processed normally and don't require answering any prompts.
e. PromptLocal: This ESCA prompts the admin whenever "hastop -local" is invoked (Are you positive you want to do this? Have you given this any thought at all? Remember; it's your arse. Do you still want to do this? ;). All other hastop commands are processed normally.
f. PromptAlways: This ESCA (which is also referred to in non-academic circles as the Big PITA ;) causes the admin/qualified-user to be prompted when executing any and all hastop commands. When you get sick of answering questions, and you have the permissions, you might want to turn this off since it's not default and someone you work with may have turned it on just to eventually drive you nuts ;)
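And, just so it's all in one place, actually setting one of these is a one-line cluster-attribute change with haclus (a hedged sketch - double-check the haclus man page on your version):
host # haconf -makerw
host # haclus -modify EngineShutdown PromptClusStop
host # haconf -dump -makero
host # haclus -value EngineShutdown
PromptClusStop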
And, there you have it; hastop in a big complicated nutshell (Or outside of a nutshell ...whatever the opposite of in-a-nutshell is ;) One thing I should mention if you work with cluster attributes or any parts of the main.cf, types.cf or other VCS configuration files; it's always a good idea to make sure you know what the defaults for your version are. VCS's configuration file style will just "not include" most values that are set to their defaults, so you would never see the Enable variable and value in the ESCA, since it's the default.
A somewhat cool trick (if you're not sure whether a particular variable/value pair is set to its default) is to add it to your configuration file(s). After you rebuild your config files the proper way (or manually edit it/them the somewhat-frowned-upon way), if the value you entered is the default, it will have been removed!
Cheers,
Mike
Friday, October 17, 2008
Configuring A Basic HACMP Cluster On AIX Unix
Hey there,
Today we're going to go back to the AIX well (which we haven't visited since our fairly-old post on working with AIX LVM). That post links back to a bunch of other posts in the series, but it "has" been about 3 months since we've touched on the AIX OS, which is grossly disproportionate to the amount of attention Solaris, Linux and Open Source Software get. Perhaps, someday, I'll work on it enough that I'll feel more comfortable digging into its guts. ...figuratively, of course ;)
This little how-to post isn't meant to cover all the angles of cluster setup and configuration that there are to cover. If it were, that would be (aside from slightly-more-than-presumptuous) a long long long post. Your children would have children while you tried to remain interested ;) This is a simple little "get started" guide and only meant for a simple 2 node cluster. If you want to get fancier, visit IBM's HACMP Knowledge Base where they've got more documentation than you could ever wish for. Check the bottom of the post for some other good resource sites.
First things first (before we get to step one), as with any other sort of cluster setup (VCS, SunCluster , etc) you'll need to determine what you want from your cluster. That's beyond the scope of this post (or the other ones we just referenced), but you should have a clear plan of attack. Know what you want and everything you can think of that it might take to get there (in this case 2 nodes for redundancy and uptime - On the flipside, what we're looking to do here, which is also very important, is to just set up the basics of our cluster and disk heartbeats. The network is up and running, cabled and doing well. This is one "huge" assumption, but it just means less for us to have to soak up in one sitting :) It's always a good idea to write a list. ...unless you're in the habit of losing your lists, in which case you should keep everything in your head where you'll retain most everything until you lose "it." (And you can take that either way ;)
NOTE: We're going to be running through this using an AIX CLI utility called "smit." Of course, it's more of a TUI (Terminal Based User Interface) than a real CLI, but, while you're using it, you can make it show you the underlying commands it's running with judicious use of the F6 key. Copy and paste all that stuff into a script and amaze your friends :)
1. Choose one node from which to do your configuration. It can be either of the two we're going to use (host1 or host2). It's important that you stick with your choice, though.
2. Check for any possible HACMP configuration already on the host (just in case) by using smit (called as smitty):
NOTE: For this how-to, we'll represent menu choices (drill-down) using tabs for each level down, and set them in boldface type. Oh, yes, and step two will fail if there are no cluster components to remove. It'll make you feel all warm and cozy inside, though ;)
host1 # smitty hacmp
Extended Configuration
Extended Topology Configuration
Configure an HACMP Cluster
Remove an HACMP Cluster
3. Then, proceed to configure the topology of your cluster (For us, this simply means two nodes sharing a heartbeat address)
host1 # smitty hacmp
Extended Configuration
Extended Topology Configuration
4. Here you have a few different things to do (under Extended Topology Configuration):
Configure an HACMP Cluster <-- You'll need at least two hosts to create a cluster, and you can't have more than one cluster per pair of hosts (only 1 for 2).
Configure HACMP Nodes <-- Be sure to use your boot IP address in the "communication path" field. We haven't gotten to the virtual IP yet.
5. Now, on each node of your cluster (host1 and host2), be sure to start the clcomdES service if it isn't running already (might be on host1 since we're configuring the cluster on it, but shouldn't be on host2)
To check for it, type:
host1 # lssrc -s clcomdES
To start it , type:
host1 # startsrc -s clcomdES
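Just so you know what "good" looks like, the output from those two commands should be roughly along these lines (the PID is made up, obviously, and the exact message numbers may differ a touch between AIX levels):
host1 # lssrc -s clcomdES
Subsystem Group PID Status
clcomdES clcomdES 217132 active
host1 # startsrc -s clcomdES
0513-059 The clcomdES Subsystem has been started. Subsystem PID is 217132.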
6. Now go back up one level (to Extended Configuration) and run through these options (You can just ignore the ones we don't mention):
Configure HACMP Networks <-- Add your public, or virtual, Ethernet network information here. Be sure to configure the subnet mask appropriately and set "Enable IP Address Takeover via IP Aliases" equal to "yes." Also, unset (leave or make blank) the "IP Address Offset for Heartbeating over IP Aliases" setting.
Configure HACMP Interfaces and Devices <-- If you find any old, or strange, node names in here, delete them. They're probably from a previous setup. You can keep them, but they might cause you headaches later on. Note that you'll generally only see this type of behaviour if the machine has been cloned. You can also manually define interfaces here (if you choose), but you should have already done that under "Configure HACMP Networks" above. If you need to add any interfaces here, just choose "Add Pre-defined interface" and go to town :)
Configure HACMP Persistent Node IP Label/Address <-- This should be the address associated with your host's DNS name (not the cluster's DNS name!).
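Since the boot, persistent and (coming up in step 7) service labels can get confusing fast, here's a completely made-up /etc/hosts sketch showing how they might line up for our two nodes. The names and addresses are hypothetical, and the exact subnetting rules depend on which flavor of IPAT you're using, so check IBM's docs before you commit to anything:
192.168.10.11 host1-boot # boot (base) address on host1's NIC
192.168.10.12 host2-boot # boot (base) address on host2's NIC
192.168.20.21 host1 # persistent address; stays with host1 and matches its DNS name
192.168.20.22 host2 # persistent address; stays with host2 and matches its DNS name
192.168.20.31 app-svc # service address; floats to whichever node owns the resource group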
7. Now add service addresses for all of the service types you want to host on your cluster (services are generally disk, network, etc - resources you want to share within your cluster and keep highly available). We're only dealing with disk and IP here.
host1 # smitty hacmp
Extended Configuration
Extended Resource Configuration
HACMP Extended Resources Configuration
Configure HACMP Service IP Labels/Addresses
Here, add a volume group solely dedicated to providing disk-based heartbeat! Name it whatever you want, like "host1-diskhb," and make sure its type is set to "Enhanced-Capable Concurrent Volume Group." You won't ever actually need this volume group for anything (disk storage, etc, I mean), but making it this type of volume prevents it from being discovered. You "Do Not" want to share your host's disk heartbeat device!
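If you'd rather create that dedicated heartbeat volume group from the command line (or just want a rough idea of what smit will do on your behalf), the general shape of it is below. The VG name matches our example, but hdisk9 is a made-up shared disk, and on some AIX/HACMP levels you'll need the concurrent LVM filesets (bos.clvm and friends) installed before this flies:
host1 # mkvg -n -C -y host1-diskhb hdisk9 <-- -C asks for a concurrent-capable (enhanced concurrent, on later AIX levels) volume group; -n keeps it from auto-varying on at boot
host1 # lsvg host1-diskhb <-- double-check what you just created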
8. Now, proceed to configure your disk heartbeat network (this part practically does the work for you. Just keep stabbing at those keys ;)
host1 # smitty hacmp
Extended Configuration
Extended Topology Configuration
Configure HACMP Networks
Add a Network
Discovered
host1-diskhb
9. Next, add devices to our disk heartbeat network:
host1 # smitty hacmp
Extended Configuration
Extended Topology Configuration
Configure HACMP Communication Interfaces/Devices
Add Communication Interfaces/Devices
Add Pre-defined Communication Interfaces and Devices
Communication Devices
Choose the (hopefully, at this point) only heartbeat network available :) Name the device whatever you want (again, we'll go with host1-diskhb), supply the logical path to the device (if necessary) and supply the name of one of your nodes when required to. Note that this will trigger an error the first time you run it (this is perfectly normal!). It will state the obvious in a slightly different manner, but your error message will basically say that you can't have only one device defined, since two are required. The simple fix: run step 9 again exactly, except pick your other node's name when required to. Now you have two members in your disk heartbeat network.
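If you'd like to prove to yourself that the two nodes really can talk over that disk before HACMP ever leans on it, RSCT ships a little utility called dhb_read that's handy for exactly this. The path and the precise success message wander a bit between releases (and hdisk9 is, again, our made-up shared heartbeat disk), so treat this as a sketch: put one node in receive mode, then transmit from the other, and both ends should report the link operating normally.
host1 # /usr/sbin/rsct/bin/dhb_read -p hdisk9 -r <-- receive mode; start this one first
host2 # /usr/sbin/rsct/bin/dhb_read -p hdisk9 -t <-- transmit mode; run this from the other node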
10. Now, set up your basic networking:
host1 # smitty tcpip
Further Configuration
Static routes
Add a Static Route
Make sure to set your "Destination Type" to "net," your "Destination Address" to "default" and set the "Default Gateway" to the IP address of your default router (same as when you set up your persistent host address back in step 6 - note that it's not shown in the documentation here, but it will be there if you follow the smitty-steps down to the correct menu level :)
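However you end up adding it, it never hurts to verify that the default route actually landed where you think it did. netstat will tell you straight away (the gateway address below is made up, and the column spacing will vary by release):
host1 # netstat -rn | grep default
default 192.168.20.1 UG 2 104321 en0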
11. And you're set. Just for caution's sake, you should do some simple testing by running:
host1 # smitty hacmp
Extended Configuration
Extended Verify and Synchronization
Accept the defaults here and your changes and setup will be verified and synchronized with every node in your cluster (In our case today, the "other one" ;)
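Assuming the verification comes back clean, you can take it one step further and actually start cluster services to see if everything stays happy. Hedging a bit here: the fastpath names and log locations shift between HACMP versions, so double-check yours before leaning on these:
host1 # smitty clstart <-- start cluster services; do the same on host2, or select both nodes from the menu
host1 # lssrc -g cluster <-- clstrmgrES (and friends) should settle into an "active" state
host1 # tail -f /tmp/hacmp.out <-- watch the event scripts do their thing, if you're curious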
And you're good to go :)
Remember, our little experiment here was a very quick and dirty how-to with very limited scope. If you want an equally expedient, but more comprehensive and wide-ranging, look at setting up HACMP, check out TriParadigm's HACMP Configuration Page. It's a lot better than this one, but may be too much if you're just getting started (?) :)
Cheers,
Mike
Please note that this blog accepts comments via email only . See our Mission And Policy Statement for further details.
Tuesday, September 30, 2008
How To Resolve Veritas Disk Group Cluster Volume Management Problems On Linux or Unix
Hey There,
Today we're going to look at an issue that, while it doesn't happen all that often, happens just enough to make it post-worthy. I've only seen it a few times in my "career," but I don't always have access to the fancy software, so this problem may be more widespread than I've been led to believe ;) The issue we'll deal with today is: What do you do when disk groups, within a cluster, conflict with one another? Or, more correctly, what do you do when disk groups within a cluster conflict with one another even though all the disk is being shared by every node in the cluster? If that still doesn't make sense (and I'm not judging "you," it just doesn't sound right to me, yet ;) what do you do in a situation where every node in a cluster shares a common disk group and, for some bizarre reason, this creates a conflict between nodes in the cluster and some of them refuse to use the disk even though it's supposed to be accessible through every single node? Enough questions... ;)
Check out these links for a smattering of other posts we've done on dealing with Veritas Volume Manager and fussing with Veritas Cluster Server . Some of the material covered may be useful if you have problems with any of the concepts glossed over in the problem resolution at the end.
Like I mentioned, this "does" happen from time to time, and not for the reasons you might generally suspect (like one node having a lock on the disk group and refusing to share, etc). In fact, the reason this happens sometimes (in this very particular case) is quite interesting. Even quite disturbing, since you'd expect that this shouldn't be able to happen.
Here's the setup, and another reason this problem seems kind of confusing. A disk group (we'll call it DiskGroupDG1 because we're all about creativity over here ;) is being shared between 2 nodes in a 2 node cluster. Both nodes have Veritas Cluster Server (VCS) set up correctly and no other problems with Veritas exist. If the DiskGroupDG1 disk group is imported on Node1, using the Cluster Volume Manager (CVM), it can be mounted and accessed by Node2 without any issues. However, if DiskGroupDG1 is imported on Node2, using CVM, it cannot be mounted and/or accessed by Node1.
All things being equal, this doesn't readily make much sense. There are no disparities between the nodes (insofar as the Veritas Cluster and Volume Management setup are concerned) and things should be just peachy going one way or the other. So, what's the deal, then?
The problem, actually, has very little to do with VCS and/or CVM (Although they're totally relevant and deserve to be in the title of the post -- standard disclaimer ;). The actual issue has to do, mostly, with minor disk numbering on the Node1 and Node2 servers. What???
Here's what happens:
In the first scenario (where everything's hunky and most everything's dorey) the DiskGroupDG1 disk group is imported by CVM on Node1 and Node1 notices that the "minor numbers" of the disks in the disk group are exactly the same as the "minor numbers" on disks it already has mounted locally. You can always tell a disk's (or any other device's) minor number by using the ls command on Linux or Unix, like so:
host # /dev/dsk # ls -ls c0t0d0s0 <-- In this instance, the device's "major number" is 32 and the device's "minor number" is 0 (you can see them in the output of the second ls below). Generally, with virtual disks, etc, you won't see numbers that low.
2 lrwxrwxrwx 1 root root 41 May 11 2001 c0t0d0s0 -> ../../devices/pci@1f,4000/scsi@3/sd@0,0:a
host # /dev/dsk # ls -ls ../../devices/pci@1f,4000/scsi@3/sd@0,0:a
0 brw-r----- 1 root sys 32, 0 May 11 2001 ../../devices/pci@1f,4000/scsi@3/sd@0,0:a
Now, Node1, since it recognizes this conflict on import, does what Veritas VM naturally does to avoid conflict; it renumbers the imported volumes ("minor number" only) so that the imported volumes won't conflict with volumes in another disk group that's already resident on the system it's managing. Therefore, when Node2 attempts to mount with CVM, the command is successful.
In the second scenario (where things are a little bit hunky, but not at all dorey), Node2 imports the DiskGroupDG1 disk group and none of the minor numbers in that disk group's volumes conflict with any of its local (or already mounted) disks. The disk group's volumes are imported with no error, but the "minor numbers" are not temporarily changed, either. You see where this is going. It's a freakin' train wreck waiting to happen ;)
Now, when Node1 attempts to mount, it determines there's a conflict, but can't renumber the "minor numbers" on the disk group's volumes (since they're already imported and mounted on Node2) and, therefore, takes the only other course of action it can think of and bails completely.
So, how do you get around this for once and all time? Well, I'm not sure it's entirely possible to anticipate this problem with a variable number of nodes in a cluster, all with independent disk groups and, also, sharing volume groups between nodes, although you could take simple measures to prevent it most of the time (like running ls against every volume in every disk group in a cluster every now and again and making sure no conflicts existed. The script should be pretty easy to whip up).
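If you do want to whip that script up, the guts of it can be as simple as walking the standard VxVM device tree and eyeballing (or sorting) the major/minor column for every imported disk group. A minimal sketch, assuming your volumes live under the usual /dev/vx/dsk layout:
host # for dg in /dev/vx/dsk/*; do
> echo "### ${dg##*/}"
> ls -lL "$dg"
> done
The -L flag just makes sure ls follows any links along the way, so you get the real major and minor numbers on the underlying device nodes rather than the details of a symlink.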
Basically, in this instance (and any like it), the solution involves doing what Veritas VM did in the first scenario; except doing it all-the-way. No temporary-changing of "minor numbers." For our purposes, we'd like to change them permanently, so that they never conflict again! It can be done in a few simple steps.
1. Stop VCS on the problem node first.
2. Stop any applications using the local disk group whose "minor numbers" conflict with the "minor numbers" of the volumes in DiskGroupDG1.
3. Unmount (umount) the filesystems and deport the affected disk group.
4. Now, pick a new "minor number" that won't conflict with the DiskGroupDG1 "minor numbers." Higher is generally better, but I'd check the minor numbers on all the devices in my device tree just to be sure.
5. Run the following command against your local disk group (named, aptly, LocalDG1 ;) :
host # vxdg reminor LocalDG1 3900 <-- Note that this number is the base, so every volume, past the initial, within the disk group will have a "minor number" one integer higher than the last (3900, 3901, etc)
6. Reimport the LocalDG1 disk group
7. Remount your filesystems, restart your applications and restart VCS on the affected node.
8. You don't have to, but I'd do the same thing on all the nodes, if I had a window in which to do it.
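Strung together, exactly as the steps above lay them out, the whole dance looks something like this sketch on the node that's having the problem (the disk group name matches our example, but the mount point is made up, and you'd obviously stop and start your real applications however they prefer to be handled):
host # hastop -local <-- step 1: stop VCS on the problem node only
host # umount /localdata <-- step 3: unmount whatever lives on LocalDG1 (made-up mount point)
host # vxdg deport LocalDG1 <-- step 3, continued
host # vxdg reminor LocalDG1 3900 <-- step 5: 3900 is just our example base; pick one clear of everything you found in step 4
host # vxdg import LocalDG1 <-- step 6
host # vxvol -g LocalDG1 startall <-- you may or may not need this, depending on how the volumes come back up
host # mount /localdata <-- step 7
host # hastart <-- step 7, continued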
And, that would be that. Problem solved.
You may never ever see this issue in your lifetime. But, if you do, hopefully, this page (or one like it) will still be cyber-flotsam on the info-sea ;)
Cheers,
Mike
Please note that this blog accepts comments via email only . See our Mission And Policy Statement for further details.
Posted by Mike Golvach at 12:31 AM
cluster, disk, group, linux, minor number, unix, veritas, volume manager
Friday, May 30, 2008
Troubleshooting Veritas Cluster Server LLT Issues On Linux and Unix
Hey There,
Today's post is going to steer away from the Linux and/or Unix Operating Systems just slightly, and look at a problem a lot of folks run into, but have problems diagnosing, when they first set up a Veritas cluster.
Our only assumptions for this post are that Veritas Cluster Server is installed correctly on a two-node farm, everything is set up to failover and switch correctly in the software and no useful information can be obtained via the standard Veritas status commands (or, in other words, the software thinks everything's fine, yet it's reporting that it's not working correctly ;)
Generally, with issues like this one (the software being unable to diagnose its own condition), the best place to start is at the lowest level. So, we'll add the fact that the physical network cabling and connections have been checked to our list of assumptions.
Our next step would be to take a look at the next layer up on the protocol stack, which would be the LLT (low latency transport protocol) layer (which, coincidentally, shares the same level as the MAC, so you may see it referred to, elsewhere, as MAC/LLT, or just MAC, when LLT is actually meant!) This is the base layer at which Veritas controls how it sends its heartbeat signals.
The layer-2 LLT protocol is most commonly associated with the DLPI (all these initials... man. These stand for the Data Link Provider Interface). Which brings us around to the point of this post ;)
Veritas Cluster Server comes with a utility called "dlpiping" that will specifically test device-to-device (basically NIC-to-NIC or MAC-to-MAC) communication at the LLT layer. Note that if you can't find the dlpiping command, it comes standard as a component in the VRTSllt package and is generally placed in /opt/VRTSllt/ by default. If you want to use it without having to type the entire command, you can just add that directory to your PATH environment variable by typing:
host # PATH=$PATH:/opt/VRTSllt;export PATH
In order to use dlpiping to troubleshoot this issue, you'll need to set up a dlpiping server on at least one node in the cluster. Since we only have two nodes in our imaginary cluster, having it on only one node should be perfect.
To set up the dlpiping server on either node, type the following at the command prompt (unless otherwise noted, all of these Veritas-specific commands are in /opt/VRTSllt and all system information returned, by way of example here, is intentionally bogus):
host # getmac /dev/ce:0 <--- This will give us the MAC address of the NIC we want to set the server up on (ce0, in this instance). For this command, even if your device is actually named ce0, eth0, etc, you need to specify it as "device:instance"
/dev/ce:0 00:00:00:FF:FF:FF
Next, you just have to start it up and configure it slightly, like so (Easy peasy; you're done :)
host # dlpiping -s /dev/ce:0
This command runs in the foreground by default. You can background it if you like, but once you start it running on whichever node you start it on, you're better off leaving that system alone so that anything else you do on it can't possibly affect the outcome of your tests. Since our pretend machine's cluster setup is completely down right now anyway, we'll just let it run in the foreground. You can stop the server, at any time, by simply typing a ctl-C:
^C
host #
Now, on every other server in the cluster, you'll need to run the dlpiping client. We only have one other server in our cluster, but you would, theoretically, repeat this process as many times as necessary; once for each client. Note, also, that for the dlpiping server and client setups, you should repeat the setup-and-test process for at least one NIC on every node in the cluster that forms a distinct heartbeat-chain. You can determine which NIC's these are by looking in the /etc/llttab file.
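If you've never actually poked around in /etc/llttab, a bare-bones two-link example looks roughly like the following. The node name, cluster number and devices are all invented for illustration (and the device syntax on the "link" lines varies a bit between Solaris and Linux), but the devices you see there are exactly the ones you'd feed to getmac and dlpiping, one heartbeat chain at a time:
set-node host1
set-cluster 101
link ce0 /dev/ce:0 - ether - -
link ce1 /dev/ce:1 - ether - -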
host # dlpiping -c /dev/ce:0 00:00:00:FF:FF:FF <--- This is the exact output from the getmac command we issued on the dlpiping server host.
If everything is okay with that connection, you'll see a response akin to a Solaris ping reply:
00:00:00:FF:FF:FF is alive
If something is wrong, the output is equally simple to decipher:
no response from 00:00:00:FF:FF:FF
Assuming everything is okay, and you still have problems, you should check out the support site for Veritas Cluster Server and see what they recommend you try next (most likely testing the IP layer functionality - ping! ;)
If things don't work out, and you get the error, that's great (assuming you're a glass-half-full kind of person ;) Getting an error at this layer of the stack greatly reduces the possible-root-cause pool and leaves you with only a few options that are worth looking into. And, since we've already verified physical cabling connectivity (no loose or poorly fitted ethernet cabling in any NIC) and traced the cable (so we know NICA-1 is going to NICB-1, as it should), you can be almost certain that the issue is with the quality or type of your ethernet cabling.
For instance, your cable may be physically damaged or improperly pinned-out (assuming you make your own cables and accidentally made a bad one - mass manufacturers make mistakes, too, though). Also, you may be using a standard ethernet cable, where a crossover (or, in some instances, rollover) cable is required. Of course, whenever you run into a seeming dead-end like this, double check your Veritas Cluster main.cf file to make sure that it's not in any way related to a slight error that you may have missed earlier on in the process.
In any event, you are now very close to your solution. You can opt to leave your dlpiping server running for as long as you want. To my knowledge it doesn't cause any latency issues that are noticeable (at least in clusters with a small number of nodes). Once you've done your testing, however, it's also completely useless unless you enjoy running that command a lot ;)
Cheers,
Mike