Explorer generates error messages after 25.2.25.04.15.
We have a few Solaris systems that run “explorer -q -P” from cron. Since we upgraded I’m seeing these “quiet” runs generate root email containing
Your "cron" job on <hostname>
/usr/lib/explorer/bin/explorer -q -P
* For more information on the error below, run rda.sh -E RDA-00013
* For more information on the error below, run rda.sh -E RDA-20421
RDA-20421: SDCL execution error
* For more information on the error below, run rda.sh -E RDA-07605 in
RDA:MCend near line 82
RDA-07605: Error encountered in method
"RDA::Object::Report::write_cmd_rpts"
* For more information on the error below, run rda.sh -E RDA-22817
RDA-22817: Invalid path "/usr/bin/ls -ld /var/adm"
* For more information on the error below, run rda.sh -E RDA-22223
RDA-22223: Error encountered in the block called
I was unable to find anything helpful on the net (and indeed “Doc ID 1492341.1 Troubleshooting of Oracle Explorer Data Collector as part of Oracle Services Tools Bundle (STB)” is WOEFULLY out of date). As such I logged a ticket.
It appears that this is a known bug that will be fixed in RDA 25.4 (whenever that is released).
Bug 38053397 Running Explorer VAR module - RDA-22817: Invalid path "/usr/bin/ls -ld /var/adm"
The error is informational and the explorer is still collected, though I'd be interested in knowing exactly what this error stops from running. I'm assuming that it's a test just before it does something with /var/adm, but I have really not had a chance to look too deeply into it.
I’ve asked when we might see RDA 25.4 and whether or not it will be able to be upgraded without installing the full SRU.
In the meantime I guess I just need to live with a bunch of extra emails.
Processes are cheap (but not free)
I did a talk at a Sun conference with this title, with a slant towards pointing out how a single-threaded shell script could bring an F15k to its knees because of the ecache flushing that goes on in a process knockdown when you have lots of short-lived processes. That was actually what prompted me to look at the whole shell DTrace provider thing that ended up in an appendix in Brendan's DTrace book.
In my current role one of the things I am currently doing is going through cron jobs that are being run on a number of systems under my purview, and I came across a monitoring script that is being run every five minutes on some systems and every minute on others. This script looks at the output of a “netstat -n” command, specifically using a particular interface, gathering the local port number and remote address. It takes this list and merges it with the existing list. I have no idea of the history of why this is here, but am assuming it is actually serving a purpose for the time being.
#!/bin/ksh
TF1=/tmp/netstat.tmp
TF2=/tmp/netstat2.tmp
> $TF2
OUTFILE=/export/corp/logs/netlist
netstat -n | grep -v stream-ord | grep "10." | grep -v 9100 > /tmp/netstat.tmp
cat $TF1 | sed 's/ */\ /g' | while read line
do
PORT=`echo ${line} | cut -d'.' -f5 | cut -d' ' -f1`
ADDR=`echo ${line} | cut -d' ' -f2`
N1=`echo ${ADDR} | cut -d '.' -f1`
N2=`echo ${ADDR} | cut -d '.' -f2`
N3=`echo ${ADDR} | cut -d '.' -f3`
N4=`echo ${ADDR} | cut -d '.' -f4`
typeset -i OP=`echo ${ADDR} | cut -d '.' -f5`
typeset -i PCOUNT=`echo ${PORT} | wc -c`
typeset -i PTI=`echo ${PORT}`
if [ ${PCOUNT} -lt 6 -a ${PTI} -ne 23 -a ${PTI} -ne 1400 ]; then
echo "${N1}.${N2}.${N3}.${N4}:${PORT}" >> $TF2
fi
done
cat $OUTFILE >> $TF2
cat $TF2 | sort | uniq > $OUTFILE
Yes, I shuddered when I saw it.
On one of our S7-2s, this script takes about a minute to run (which is problematic on those systems running it every minute) and spawns a bit over 30,000 processes.
I could have simply moved to using ksh internals to do pretty much all of this work (e.g. the maths and string manipulation), but I took a step further back and realised there was a better way to do this. I also moved to coding the intent, rather than the actuality, by comparing against the actual ports I was interested in, rather than just grepping them out.
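For illustration only, here is a rough, untested sketch of what that ksh-internals approach might have looked like (same temp files and the same filtering intent as the original); parameter expansion and built-in arithmetic replace all of the echo/cut/wc calls inside the loop:
# hedged sketch: same per-line logic as the original, but no processes spawned in the loop
while read laddr raddr junk
do
    lport=${laddr##*.}      # local port: last dot-separated component of the local address
    rip=${raddr%.*}         # remote IP: the remote address minus its trailing port
    # same test as the original: port below 10000 and not 23 or 1400
    if (( lport < 10000 && lport != 23 && lport != 1400 )); then
        print "${rip}:${lport}" >> $TF2
    fi
done < $TF1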
I replaced the main loop with a bit of awk, giving us the following.
#!/bin/ksh
TF1=/tmp/netstat.tmp
trap "rm -f ${TF1}" 0
OUTFILE=/export/corp/logs/netlist
(netstat -n | awk -F'[. ]' '1ドル == 10 {
port = 5ドル
if (port<10000 && port != 23 && port != 1400 && port != 9100){
printf("%d.%d.%d.%d:%d\n", 6,ドル 7,ドル 8,ドル 9,ドル port)}}'; \
cat ${OUTFILE}) > ${TF1}
sort -u ${TF1} > ${OUTFILE}
It’s dropped now to only five spawned processes and runs in less than a second.
It’s all very well to know that you can use spawned processes to do things, but you also need to understand the overheads involved when you start putting them in tight loops.
For those interested, the DTrace for counting the spawned processes was:
# dtrace -qn '
BEGIN {c = 0}
proc:::create /pid == $target/{c++}
END {printf("%d spawned processes\n", c)}' -c ./monnet.12.new
5 spawned processes
As an aside, for shits and giggles, I redid the script in Python. The Python script runs in 0.1–0.2 seconds. I'm going with the Python. I also added some code to the Python version so I could place the log file under logadm control. If the LOCKFILE exists, it simply sleeps for a second and tries again (it really shouldn't need to go through the loop more than once, but to be sure I'll make it exit after ten fails and syslog something).
Building Chef Inspec from git
So, I'm looking at using inspec for some checking on Solaris systems, but I actually want to build it from git, as the community version available for download is sufficiently old that it does not work with our current sshd configurations. Specifically, I'd have to add
HostKeyAlgorithms +ssh-rsa
to my port 22 sshd_config, and I don’t really want to do that.
I set up WSL on my work desktop, as I already had some scripting for doing things across multiple hosts, and besides, you all know I'm a UNIX guy from way back and I'm much more comfortable in at least a UNIX-like environment. As an aside, I did get it to build on my local Solaris 11.4 x86 VirtualBox instance, but I don't have one of those on the appropriate work networks.
The “Install it from source” instructions in the README.md at https://github.com/inspec/inspec/tree/main are mostly good. Unfortunately they leave out four gem commands. I’ll likely do a fuller explanation of the build, but after the “bundle install”, you need to run the following before it will let you do the final gem install that is listed.
$ cd inspec-bin
$ gem build inspec-bin.gemspec
$ gem install inspec-bin-6.8.38.gem
$ cd ..
$ gem build inspec-core.gemspec
$ gem install inspec-core-6.8.38.gem
before you can run the
$ gem build inspec.gemspec
$ gem install inspec-6.8.38.gem
Note that I'm also specifically listing the version in there (that's where the main branch is at the time of writing), as I'd been experimenting with other versions and the .gem files for those were still lying around; using '*', as the documentation says, would have picked them up too.
So far I've managed to knock up some libraries for coreadm and logadm checking (hmmm, I didn't think to check if there is already a module for that), and I'm making use of the 'sshd_config' plugin for checking the various 'sshd_config's that we have.
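For anyone who hasn't used it before, a typical invocation against a remote Solaris host over ssh looks something like the following; the profile path, user and hostname here are placeholders rather than anything from my environment:
$ inspec exec ./solaris-baseline -t ssh://admin@somehost -i ~/.ssh/id_rsa --sudo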
Openat(2) failing with EINTR
I've been dealing with a curious issue over the last week or so. We had an application attempting to open a zvol, but it was returning with an error, only on one of the attempts to open it. This had some flow-on consequences for the run of the application.
Note that I’m changing the names of the application and other stuff so as to NOT identify the application.
I ran something similar to the following DTrace to watch what the kernel was doing for the failed openat() system call.
#!/sbin/dtrace -Fs
BEGIN {printf("Monitoring, ...\n");}
syscall::openat:entry /execname == "appname" && copyinstr(arg1) == "/dev/zvol/dsk/rpool/volume"/ {
printf("%s [%d] Opening /dev/zvol/dsk/rpool/volume\n",
execname, pid);
self->interest = 1
}
fbt:::entry /self->interest/ {}
fbt:::return /self->interest/ { printf("rc=%d errno=%d", arg1, errno); }
syscall::openat:return /self->interest && errno == 4/ {exit(0);}
syscall::openat:return /self->interest/ { self->interest = 0; }
Which gave us the following (trimmed for brevity):
6 <- zfsdev_dispatch rc=4 errno=0 retaddr=zfsdev_dispatch+0x18c
...
6 <- zfsdev_ioctl rc=4 errno=0 retaddr=zfsdev_ioctl+0x46c
...
6 <- spec_cb_ioctl rc=4 errno=0 retaddr=spec_cb_ioctl+0x8c
...
6 <- ldi_ioctl rc=4 errno=0 retaddr=ldi_ioctl+0x160
...
6 <- devzvol_handle_ioctl rc=4 errno=0 retaddr=devzvol_handle_ioctl+0x150
...
6 <- devzvol_objset_check rc=4 errno=0 retaddr=devzvol_objset_check+0x124
...
6 <- devzvol_validate rc=4 errno=0 retaddr=devzvol_validate+0x1fc
...
6 <- devname_lookup_func rc=4 errno=0 retaddr=devname_lookup_func+0x5fc
...
6 <- devzvol_lookup rc=4 errno=0 retaddr=devzvol_lookup+0x3e4
6 <- fop_lookup rc=4 errno=0 retaddr=fop_lookup+0x304
...
0 <- lookuppnvp rc=4 errno=0
0 <- lookuppnatcred rc=4 errno=0
0 <- lookupnameatcred rc=4 errno=0
0 <- lookupnameat rc=4 errno=0
0 <- vn_openat rc=4 errno=0
We see that during the call we get a return code of 4 from the call to zfsdev_dispatch(). In earlier code (looking at the OpenSolaris code), devzvol_lookup() ignored a return code of 4 coming back up from this path.
In Solaris 11.4 sru 24 the following change is listed.
30675505 devzvol_lookup should just pass EINTRs along
which explains why it passes it back up the call stack, resulting in the failure with an EINTR. The return code of 4 is an indication that a signal has been received while processing the openat() system call.
We now need to determine what the signal is and where it came from.
Inside the zfsdev_dispatch() call we get a signal, so we end up in issig_forreal() to deal with it.
6 -> issig_forreal
6 -> schedctl_finish_sigblock
6 <- schedctl_finish_sigblock rc=426610995637088 errno=0 retaddr=schedctl_finish_sigblock+0x4c
6 -> fsig
6 -> sigdiffset
6 <- sigdiffset rc=2890567951920 errno=0 retaddr=sigdiffset+0x2c
6 <- fsig rc=0 errno=0 retaddr=fsig+0x150
6 -> fsig
6 -> sigdiffset
6 <- sigdiffset rc=2890567951920 errno=0 retaddr=sigdiffset+0x2c
6 -> lowbit
6 <- lowbit rc=18 errno=0 retaddr=lowbit+0x84
6 <- fsig rc=18 errno=0 retaddr=fsig+0x170
6 -> sigdeq
6 <- sigdeq rc=426611868820408 errno=0 retaddr=sigdeq+0x138
6 -> isjobstop
6 <- isjobstop rc=0 errno=0 retaddr=isjobstop+0x124
If we look at the return code of fsig() we are told the signal that we received. It’s 18, or a SIGCHLD, which is
#define SIGCLD 18 /* child status change */
This means that we’ve had a child process exit and notify us of its exit. So, what was it?
With some updated DTrace we see the following (column 1 is a nanosecond timestamp) during one of the failed runs. Note that in this run the application process that we are interested in is PID 4624.
1733723047273935775 application [4624] openat called
1733723047273935775 application [4624] Created PID 4626
libc.so.1`__forkx+0xc
...
application`main+0xc
application`_start+0x64
1733723047273935775 application [4626] exec /etc/application_parallel_startup
1733723047273935775 application [4626] exec /usr/xpg4/bin/sh
1733723047283936690 sh [4628] exec /usr/bin/date
1733723047283936690 sh [4626] exit status 1 ppid=4624
1733723047283936690 sh [4626] post SIGCLD to 4624 (application) *****
1733723047283936690 application [4624] openat returns
We see that we've started up "/etc/application_parallel_startup". Seeing it exec the pathname and then exec the shell shows that it's a shell script. We see it run a date and then exit.
Having a look at the script, that is basically what it does. It’s there for local customisation of other stuff we may want to do at the time we start the application.
It is noteworthy that the application has used the functionality that it has for starting longer-running daemons to run this, and does NOT do a waitpid() on it.
It looks like we are simply unfortunate enough that about three times out of ten we get that SIGCLD in that 1.5ms window of the openat().
As a workaround, I’ve added a “sleep 5” to the end of that script so it finishes much later (I could probably have made it smaller, or even just run another command to give us a context switch, which would add up to 10ms), but this looks to work.
Thoughts
I think that it would be better practice for the application to actually do a waitpid() on this short-lived process. That would definitely prevent this problem from occurring.
After a little more thought, I am also concerned that this change to devzvol_lookup() has introduced a bug.
In the normal course of events, if the application was executing in user space, the receipt of a SIGCLD with the default action would not impact the process. This change to devzvol_lookup() means that if we happen to be in a system call it DOES impact the process, and in this instance impacted it pretty badly.
My feeling is that the behaviour in devzvol_lookup() of signal handling should have the same results as handling that signal in user space; but we’ll have to see how far that opinion gets me with Solaris Engineering.
Updating Ops Center Agents
It has been my (unfortunate) experience that, after applying the Final Aggregate Patch for Ops Center (which delivers the v12.4.0.3201 Agent):
- We don’t always get the option to upgrade the target in the Web UI
- Running “upgrade_all_agents” on the Ops Center CLI does not always do all of the targets
- Some of the targets have the agent in an unclear status
I found two documents particularly helpful here:
- Doc ID 1991863.1 OPS Center Agent Configuration Fails with: “Cannot register SMF service for instance: [scn-agent]” – https://support.oracle.com/epmos/faces/SearchDocDisplay?_afrLoop=430249586014284&_afrWindowMode=0&_adf.ctrl-state=dbni93kzd_4
- Chapter 2 “Manage Assets” of the Ops Center 12c Release 4 “Enterprise Manager Ops Center Configuration Reference” – https://docs.oracle.com/cd/ops-center-12.4/doc.1240/e59970/GUID-17FB2DF7-D3C0-4132-A65B-FBE1B557AFD6.htm#GUID-CBEC20F8-E69B-4ED6-B174-C4909DB2A2B3
It should be noted that the previous Doc points to the v12.2 version of this page.
In short, I created a tarball (/tmp/OC.tar.gz) containing all of the following in a directory called "OC":
- mytoken – The token on the last line of "/var/opt/sun/xvm/persistence/scn-proxy/connection.properties", with
  - backslash characters removed
  - all text up to and including the first '=' removed
  (see the sketch after this list)
- The extracted contents of the zip file "/var/opt/sun/xvm/images/agent/OpsCenterAgent.SolarisIPS.all.12.4.0.3201.zip"
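A rough sketch of the mytoken extraction, assuming the property of interest really is on the last line of the file as described above (check the result before using it):
# hypothetical sketch; verify the output against connection.properties
$ mkdir -p OC
$ tail -1 /var/opt/sun/xvm/persistence/scn-proxy/connection.properties | \
    tr -d '\\' | sed 's/^[^=]*=//' > OC/mytoken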
Given a list of target hosts in ${LDOMS}, the knowledge that the admin account on these targets can perform "svcadm" functions with "pfexec", and a script that can be used to disable puppet in the global zone and any non-global zones (our puppet agents run every hour and do a few pkg commands that could interfere with running pkg updates in the global zone), we run
for ldom in ${LDOMS}; do echo === ${ldom}; scp /tmp/OC.tar.gz ${ldom}:/var/tmp; ssh ${ldom} pfexec puppetsvc.sh -d; done
Then we ssh to each LDOM and as root run
OPSCENTER_IP={IP Address of Ops Center Server}
cd /var/tmp ; tar zvxpf OC.tar.gz ; cd OC ;\
cacaoadm prepare-uninstall ;\
/opt/SUNWxvmoc/bin/agentadm unconfigure ;\
OpsCenterAgent/install -p ${OPSCENTER_IP} ;\
/opt/SUNWxvmoc/bin/agentadm configure -t /var/tmp/OC/mytoken -x ${OPSCENTER_IP}
We answer any questions that may be posed. When done, we clean up after ourselves and restart puppet:
for ldom in ${LDOMS}; do echo === ${ldom}; ssh ${ldom} 'pfexec puppetsvc.sh -e ; cd /var/tmp ; rm -rf OC OC.tar.gz' ; done
The above seems to have a 100% success rate for the targets that had not been updated by other means.
Written by Alan
May 24, 2024 at 2:12 pm
Posted in Ops Center, Solaris, Work
Ops Center – Recovering Credential Information
I recently found myself in the position of needing to remove and reinstall Ops Center. While gathering information that I would need to recreate everything, I got stuck on some credentials (specifically IPMI and SNMPv3).
I can’t be the only one who has needed to try to pull this kind of information back out of the Ops Center database.
Now there is obviously no way to do this in the Ops Center BUI, so enter the Ops Center CLI.
# /opt/SUNWoccli/bin/oc
Oracle Enterprise Manager Ops Center
Copyright (c) 2007, 2019 Oracle and/or its affiliates. All rights reserved.
Use is subject to license terms.
Use the -connect- command to get started
The tab key will always show the available command set
xvmSh > connect
localhost > credentials
localhost/credentials > list -l
Driver Credentials:
ID | Name | Type | Description | Attributes |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
166404852| OCopscenter | SNMPV3 | None |snmpUserName=OCopscenter,authPassword=*****,authProtocol=MD5,privPassword=*****,privProtocol=DES
14351 | SNMP1 | SNMPV3 | SNMPv3 Credentials |snmpUserName=OCipmi,authPassword=*****,authProtocol=SHA,privPassword=*****,privProtocol=AES
166404851| Prod IPMI | IPMI |Production IPMI Credentials| login=root,sharedSecret=*****
Now we see the sensitive information blanked out. The python code that does this is in /opt/SUNWoccli/share/lib/occli/xvm_discovery.py in the code:
137 _hide_data=('protocol','id')
138 _sensitive_data_display_value="*****"
139
140 def strip_secure_data(key, value):
141 """
142 returns (boolean,string) True if value should be shown
143 """
144 if key in _hide_data:
145 return (False,"")
146
147 if not value:
148 #nothing to be stripped
149 return (True,value)
150
151 from com.oracle.sysman.services.discovery import DriverCredentials
152 if key in DriverCredentials.SENSITIVE_DATA:
153 return (True,_sensitive_data_display_value)
154 return (True, value)
Now, if we comment out that conditional
152 #if key in DriverCredentials.SENSITIVE_DATA:
153 # return (True,_sensitive_data_display_value)
154 return (True, value)
Then we will get the hidden information
localhost/credentials > list -l
Driver Credentials:
ID | Name | Type | Description | Attributes |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
166404852| OCopscenter | SNMPV3 | None |snmpUserName=OCopscenter,authPassword=PasswordInClearText,authProtocol=MD5,privPassword=PasswordInClearText,privProtocol=DES
14351 | SNMP1 | SNMPV3 | SNMPv3 Credentials |snmpUserName=OCipmi,authPassword=PasswordInClearText,authProtocol=SHA,privPassword=PasswordInClearText,privProtocol=AES
166404851| Prod IPMI | IPMI |Production IPMI Credentials| login=root,sharedSecret=PasswordInClearText
Don’t forget to change it back.
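If you want to be doubly safe, keep a pristine copy of the file around before you touch it and put it back when you're finished; a trivial sketch:
# cp -p /opt/SUNWoccli/share/lib/occli/xvm_discovery.py /var/tmp/xvm_discovery.py.orig
(make the edit and run the list -l, then restore)
# cp -p /var/tmp/xvm_discovery.py.orig /opt/SUNWoccli/share/lib/occli/xvm_discovery.py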
Written by Alan
March 27, 2024 at 1:25 pm
Posted in Ops Center, Security, Solaris
How I addressed a geo-infrastructure geocluster failure
Description of Issue
We are running a two-cluster geocluster: ProdNFS and DRNFS. On the production side we can no longer run any of the commands like geoadm or geopg. The geo-infrastructure resource group is in "Pending_online" and the geo-failovercontrol resource is faulted.
Investigation
It looks like we are stuck in the "infinite loop" in /usr/cluster/lib/rgm/rt/hamasa/cmas_service_ctrl_start with "cacaoadm list-trusted-certs" returning status 11.
39 CLASSPATH=${CMASSPATH}:${CACAOPATH}:${JDMKPATH}:${JMXPATH}:${OPENJDMKPATH}
40 CACAOADM=${ROOT_DIR}/usr/sbin/cacaoadm
41
42 # is cacao ready?
43 if [ ! -x ${CACAOADM} ]
44 then
45 status=100 # exit err
46 else
47 ${CACAOADM} list-trusted-certs > /dev/null 2>&1
48 status=$?
49
50 if [ $status -ne 0 ]
51 then
52
53 if [ $status -eq 3 ] # cacao not running
54 then
55 ${CACAOADM} enable > /dev/null 2>&1
56 ${CACAOADM} start > /dev/null 2>&1
57 fi
58
59 # infinite loop OK; rgm will kill us if we can't exit on our own
60 while [ $status -ne 0 ]
61 do
62 /usr/bin/sleep 5
63 ${CACAOADM} list-trusted-certs > /dev/null 2>&1
64 status=$?
65 done
66
67 fi
68
69 java $JAVA_OPTS -classpath ${CLASSPATH} ServiceControl $AGENT $COMMAND $FAILOVER_GROUP
70 status=$?
71 fi
Specifically, we are timing out in the infinite loop in lines 60-65. The cacaoadm command is returning status 13, for which the man page is not particularly helpful. I also notice that if I run "cacaoadm list-trusted-certs -v" on the other side of the geocluster, the certificates from ProdNFS are listed as having expired mid last month, and the actual signing cacao certificate has expired as well. This last turns out to be our issue.
root@*****:/etc/cacao/instances/default/security/jsse# keytool -keystore /etc/cacao/instances/default/security/jsse/keystore -list -v
Enter keystore password:
Keystore type: jks
Keystore provider: SUN

Your keystore contains 1 entry

Alias name: cacao_agent
Creation date: 16/07/2020
Entry type: PrivateKeyEntry
Certificate chain length: 2
Certificate[1]:
Owner: CN=*****_agent
Issuer: CN=*****_ca
Serial number: 3a4c1ff9
Valid from: Tue Jun 16 20:52:46 PGT 2020 until: Fri Sep 16 20:52:46 PGT 2022
Certificate fingerprints:
SHA1: 13:BE:B5:FC:BE:E6:69:98:ED:76:F3:41:C0:16:0F:09:30:01:FF:19
SHA256: F4:B2:BF:EF:39:D8:17:2A:69:3E:D0:72:6C:FC:9D:30:6A:7B:00:7B:B3:35:F6:80:90:4F:95:1F:77:7F:34:8C
Signature algorithm name: SHA256withRSA
Subject Public Key Algorithm: 2048-bit RSA key
Version: 3
Extensions:
#1: ObjectId: 2.5.29.19 Criticality=false
BasicConstraints:[
CA:false
PathLen: undefined
]
#2: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: 1A 7A 19 7B 64 66 90 C8 A3 69 05 3D 16 57 59 27 .z..df...i.=.WY'
0010: 07 04 A2 BB ....
]
]
Certificate[2]:
Owner: CN=*****_ca
Issuer: CN=*****_ca
Serial number: 6997e07d
Valid from: Tue Jun 16 20:52:46 PGT 2020 until: Fri Sep 16 20:52:46 PGT 2022
Certificate fingerprints:
SHA1: 84:DE:5D:CD:D9:99:2A:F3:81:75:7C:D0:0A:0E:0B:07:EC:73:36:D6
SHA256: 8C:47:32:FA:0E:2F:9A:48:8A:C8:59:C4:8B:16:89:34:63:DA:19:D1:32:3A:28:30:4E:2D:33:7D:FD:74:86:A4
Signature algorithm name: SHA256withRSA
Subject Public Key Algorithm: 2048-bit RSA key
Version: 3
Extensions:
#1: ObjectId: 2.5.29.19 Criticality=true
BasicConstraints:[
CA:true
PathLen:1
]

*******************************************
*******************************************

These are the certificates generated when we initialised this cacao instance. They appear to have been generated with a two year lifetime, which is a known issue with Oracle support. There are documents available to address the issue with Ops Center, but not with this Solaris Cluster and Geocluster.
Solution
We can actually kill two birds with one stone here. The original process that I got from Oracle had us using geops to remove trust and then use geops add-trust to regenerate the certificates. The problem with this approach is that the cacao_agent certificate has also expired, so the geocluster commands can't communicate with it.
If, however, we stop the cacao daemons and generate new certificates then we
- Have a new agent certificate with a 20 year expiry
- Remove the expired geocluster certificates from the trust store
This process is going to describe replacing all of the certificates as I note that the ones on DRNFS are due to expire in the next month or so as well.
On each side of the geocluster we need to
- On the node on which we originally had the certificates generated, shut down cacao, regenerate the agent certificates and restart it
# cacaoadm stop
# cacaoadm create-keys -f
# cacaoadm start
- Make a tarball of /etc/cacao/instances/default/security and copy it to the other node of this cluster
# cd /etc/cacao/instances/default/security
# tar cf /tmp/cacao.tar *
# scp /tmp/cacao.tar othernode:/tmp/cacao.tar
- On the other node, stop cacao, extract the tarball and restart it.
# cacaoadm stop
# cd /etc/cacao/instances/default/security
# tar xpf /tmp/cacao.tar
# cacaoadm start
Once this has been done on each geocluster we can add trust again.
On the production geocluster (ProdNFS)
# geops add-trust -c DRNFS
On the DR geocluster (DRNFS)
# geops add-trust -c ProdNFS
ProdNFS and DRNFS are the names of the respective geoclusters; you'd need to replace those with the ones that you chose.
Once I had done this, I could again run the geocluster command and control programs.
Written by Alan
October 25, 2022 at 3:24 pm
Posted in geocluster, Solaris, Work
OBDX 18.0.0 Installer and Solaris 11.4 beyond sru 21
This one took us a couple of weeks to work out and I’m writing this to save others the pain we’ve just been through.
For whatever reason our customer is using this particular version of Oracle OBDX. The installer code is written in python 2.7.
We recently had the non-production systems upgraded to Solaris 11.4 SRU 32, taking us to something much more current, with a view to pushing this out to pre-prod and then prod/DR and moving to more regular updates.
We noticed the installer failing, being unable to find the cx_Oracle packages for python 2.7.
To cut a long story short, in 11.4 sru21 we had the following package changes:
- library/python/cx_oracle
  - Among other things, a dependency on runtime/python-27 was removed
- library/python/cx_oracle-27
  - This package delivered the cx_Oracle.so python module and also had a requirement of developer/oracle/odpi-23. The files were removed from the package
- developer/oracle/odpi-23
  - Files required by cx_Oracle.so were removed and this became an empty package.
I can only assume that there was not a PSARC contract between Solaris and the folks looking after this product installer, or appropriate notifications would have occurred before the removal.
Because of the way that dependencies have been done, it is not possible to just re-install the packages.
It is possible to work around this issue by installing the following missing files on the system we wish to do the install on (and probably removing them afterwards as they are not being tracked in the packaging system). In my case I gathered them from a system running Solaris 11.4 SRU 20.
usr/lib/python2.7/vendor-packages/64/cx_Oracle.so
usr/lib/python2.7/vendor-packages/cx_Oracle.so
usr/lib/libodpic.so.1
usr/lib/sparcv9/libodpic.so.1
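In case it helps, a rough sketch of copying those files across from a reference host that is still on SRU 20 (the hostname is a placeholder, and this should be run as root on the target system):
# hypothetical example; preserves paths and permissions relative to /
ssh sru20-refhost 'cd / && tar cf - \
    usr/lib/python2.7/vendor-packages/64/cx_Oracle.so \
    usr/lib/python2.7/vendor-packages/cx_Oracle.so \
    usr/lib/libodpic.so.1 \
    usr/lib/sparcv9/libodpic.so.1' | (cd / && tar xpf -)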
A cleaner solution (and the one I took, as I'd like to have these tracked by IPS) is to take the files that constitute the old packages and create a new IPS package. I ended up creating a local package called "obdx-install-prereq" that contains:
usr/include/odpi-230/dpi.h
usr/lib/libodpic.so.1
usr/lib/python2.7/vendor-packages/64/cx_Oracle.so
usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/PKG-INFO
usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/SOURCES.txt
usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/dependency_links.txt
usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/top_level.txt
usr/lib/python2.7/vendor-packages/cx_Oracle.so
usr/lib/sparcv9/libodpic.so.1
usr/share/man/man3lib/libodpic-230.3lib
We currently have a support call open with Oracle where I have requested the creation of Support documentation (with appropriate application and Solaris metadata) to describe this issue and how to work around it. The version of OBDX we are using is well beyond “current – 1”, so I don’t expect any fixes in the installer, but documentation of the workaround would be helpful.
For those interested, my package manifest (obdx-install-prereq.p5m) looks like:
set name=pkg.fmri value="obdx-install-prereq@11.4-11.4.21.1"
set name=pkg.summary value="OBDX Installer prerequisites"
set name=pkg.description value="Copies of the files in packages removed in sru21 that the OBDX installer requires, incorporating pkg:library/python/cx_python-27@6.3-11.4.19, pkg:developer/oracle/odpi-230@2.3.0-11.4.19 and their dependencies"
set name=variant.arch value=sparc
depend fmri=pkg:/library/python/cx_oracle@7.1.1 type=require
depend fmri=database/oracle/instantclient-122 fmri=database/oracle/instantclient-121 type=require-any
file path=usr/lib/libodpic.so.1 mode=555 owner=root group=bin
file path=usr/lib/sparcv9/libodpic.so.1 mode=555 owner=root group=bin
file path=usr/include/odpi-230/dpi.h mode=444 owner=root group=bin
file path=usr/share/man/man3lib/libodpic-230.3lib mode=444 owner=root group=bin
file path=usr/lib/python2.7/vendor-packages/64/cx_Oracle.so mode=555 owner=root group=bin
file path=usr/lib/python2.7/vendor-packages/cx_Oracle.so mode=555 owner=root group=bin
file path=usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/PKG-INFO mode=555 owner=root group=bin
file path=usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/SOURCES.txt mode=555 owner=root group=bin
file path=usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/dependency_links.txt mode=555 owner=root group=bin
file path=usr/lib/python2.7/vendor-packages/cx_Oracle-6.3-py2.7.egg-info/top_level.txt mode=555 owner=root group=bin
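For completeness, this is roughly how such a manifest gets turned into an installable package. Treat it as a sketch rather than exactly what I ran; the repository path and publisher name are placeholders, and proto/ is a directory containing the files laid out under their usr/... paths:
pkgrepo create /var/tmp/localrepo
pkgrepo set -s /var/tmp/localrepo publisher/prefix=site
pkgsend publish -s /var/tmp/localrepo -d proto obdx-install-prereq.p5m
pkg set-publisher -g file:///var/tmp/localrepo site
pkg install obdx-install-prereq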
Problem with the Oracle 19c Solaris Prerequisite Package
Oh boy, what a day; this did take a 12 hour day to work out what the hell was going on. I'm documenting this in order to save anyone else having the day I've just had tracking this down.
In my day job I’ve had to automate zone creation for potentially hundreds of zones. Obviously I have a rather customised script that does this for me.
Today I went to install one of my database zones and got the following
DOWNLOAD PKGS FILES XFER (MB) SPEED
Completed 533/533 114192/114192 1212/1212 13.0M/s
PHASE ITEMS
Installing new actions 149645/149645
Updating package state database Done
Updating package cache 0/0
Updating image state Done
Creating fast lookup database Done
Updating package cache 9/9
cannot mount 'BSPZOBPDB8001_rpool/rpool/export' on '/system/volatile/install.26494/a/export': directory is not empty
Error occurred during execution of 'update-filesystem-owner-group' checkpoint.
Failed Checkpoints:
update-filesystem-owner-group
Checkpoint execution error:
Command '['/usr/sbin/zfs', 'mount', '-o', 'mountpoint=/system/volatile/install.26494/a/export', 'BSPZOBPDB8001_rpool/rpool/export']' returned unexpected exit status 1
cannot mount 'BSPZOBPDB8001_rpool/rpool/export' on '/system/volatile/install.26494/a/export': directory is not empty
Installation: Failed. See install log at /system/volatile/install.26494/install_log
So what looks like is happening, and only on the database zones, is that we are getting to the section where it mounts the ZFS datasets and fixes the ownership and permissions. It goes to mount the export dataset but finds that the export directory already has something in it, which ZFS really doesn't like.
The unfortunate part is that after the failure the installation process does a zfs destroy on all of the datasets, so I couldn't see exactly what was there. My original suspicion was that the build process had gone through and created the mountpoints for all of the datasets before mounting any of them, but then I found that I could install one of my application zones without a problem.
This led to looking at differences between the configuration files for the application and database zones, and also looking for what might have changed to stop the database zones installing.
I should note that I'd had problems installing this particular server over the last few days as well, so I had not ruled that out as a cause. Well, I did rule it out after I attempted the same zone build on another server and saw the same failure.
So, to the actual problem.
One of the things that I need to do is to add a number of extra packages to the default manifest. One of those is for the Oracle Database prerequisites. We recently updated to Oracle 19c, so the prerequisite package is pkg:/group/prerequisite/oracle/oracle-database-preinstall-19c. A few days back I updated it to this from what we had been using previously.
I found that if I commented out this package from the manifest, everything worked just fine. I took one of the zones that built and tried to install the package from the command line. It worked perfectly, but also gave the following note:
Release Notes:
pkg://solaris/group/prerequisite/oracle/oracle-database-os-configuration
This package will create, by default, the oracle user (oracle) and two
Unix groups, (dba and oinstall). If these defaults are not required
then uninstall the package:
group/prerequisite/oracle/oracle-database-os-configuration
Do not modify the created entries in /etc/passwd or /etc/group because
this will cause 'pkg verify' and 'pkg fix' to report and undo the
modifications.
If these actions are not wanted prior to the install then running
pkg avoid oracle-database-os-configuration
will ensure that the package is not installed when the database
prerequisite package is installed.
We actually have specific oracle users and groups in LDAP, so we don't want these anyway. I don't think the previous prerequisite package did this.
OK, let’s have a look at the manifest of this group.
$ pkg contents -rm oracle-database-os-configuration
set name=pkg.fmri value=pkg://solaris/group/prerequisite/oracle/oracle-database-os-configuration@11.4,5.11-11.4.32.0.1.88.3:20210330T001524Z
set name=pkg.summary value="Oracle database initial configuration"
set name=pkg.description value="Provides some default setup configuration for the Oracle Database"
set name=info.classification value="org.opensolaris.category.2008:Meta Packages/Group Packages"
set name=org.opensolaris.consolidation value=solaris_re
set name=variant.arch value=i386 value=sparc
dir group=dba mode=0755 owner=oracle path=export/home/oracle
file 1f7e4694adb88bcf4f77833338e953ef439422be chash=212f65c650165647495a273de054549a31808dbb group=root mode=0444 must-display=true owner=root path=usr/share/doc/release-notes/oracle-database-preinstall-19c.txt pkg.content-hash=file:sha512t_256:cfe92df2d974ed8ac4c5fcea4661812bca59628f2361f45713d913ac92ac408f pkg.content-hash=gzip:sha512t_256:401a80920eda0583923faf985dcc4da6e11c8fa0bcf9dacd6f36d2b941957d28 pkg.csize=323 pkg.size=603 release-note=feature/pkg/self@0
group gid=69 groupname=dba
group gid=68 groupname=oinstall
user ftpuser=false gcos-field="Oracle Database Owner" group=dba group-list=oinstall home-dir=/export/home/oracle login-shell=/bin/bash uid=69 username=oracle
signature b9e38504b3c149270fd54d6416ce65594f97309d algorithm=rsa-sha256 chain=370b6b4fba7b0ad472465ffe9377f8f6040b2cfd chain.chashes=ff591399c9e679500060a00196932e292872eeb1 chain.csizes=984 chain.sizes=1269 chash=774089cf732c83322727e12d298e2ca91837a709 pkg.csize=987 pkg.size=1314 value=b7ab54dd59ae2f378980f996f03359f37f42487106035cb778fcdcd4849f27f21b812953324d1e4177914c98bdc298d7e8875a6f46c853da0aa31a692538661e88b42469eb8b019599e8c9a49b95529d11065e1cc03411551d799f2481c50fd590353f07435935fbb7fed79a755151a34500831294eba5ed9ea9fc9cd0e6ee46da7bb4e55bec5613e5521f08d6d561736dcf153d65d703d6a04b6bf71f4470abefd4d024cf06e1331dea37c360ef72a97aec414c017e84e7450b8ebb458a0a8d661ddc97a8085f9e44443aacbbdda65f7607a88797102120b0b46362903ce01184f4f3d4724c46c1e446e6ca55f1dc69be4fc8ae40a473992b0453b3ede6a835 version=0
There is our problem: the dir action above that creates export/home/oracle.
This package is installed before export/home is mounted, and has created the home directory for the oracle user. It works fine once we are up, as everything is mounted. However that is not the state that the system is in when doing a zone build using auto install.
The fix took a little bit of work to find the right tags to use in the manifest to do the equivalent of a "pkg avoid oracle-database-os-configuration". The trick was to add the "avoid" software_data section shown below to the AI manifest. Note that it is crucial that the avoid actions appear above the install actions.
<software_data action="avoid">
<name>pkg:/group/prerequisite/oracle/oracle-database-os-configuration</name>
</software_data>
<software_data action="install">
<name>pkg:/group/system/solaris-small-server</name>
<name>pkg:/developer/gcc/gcc-go-9</name>
<name>pkg:/developer/versioning/git</name>
<name>pkg:/runtime/python-37</name>
<name>pkg:/runtime/ruby-25</name>
<name>pkg:/runtime/ruby-26</name>
<name>pkg:/library/python/cx_oracle</name>
<name>pkg:/developer/java/jdk-8</name>
<name>pkg:/runtime/java/jre-8</name>
<name>pkg:/diagnostic/top</name>
<name>pkg:/group/prerequisite/oracle/oracle-database-preinstall-19c</name>
<name>pkg:/puppet-agent</name>
</software_data>
Now I've asked Oracle to create an info doc on this, and I also think that having a package create a directory under export/home is potentially a bug for something that could be included in an auto install manifest.
Update (August 27)
I have just been notified that this has been fixed in Solaris 11.4 SRU 36.
Implementing Geocluster Replication on HA-NFS
Also known as The “Disaster Recovery Framework”, or DRF.
This entry is going to undergo a bit of revision as I discovered some more things that need to go into a new entry.
Given the fun and games I've just had doing this and the fact that I could find nothing online about anyone having done it, as promised, I'm writing this walk through. The main gotchas that I had were to do with the way that networking had been configured on the different clusters that I was implementing this on, as well as the fact that I had not used the same resource group name on both clusters for the resource group that I wanted replication working for.
I’m only going to list one of the resource groups for this walk through.
We also need to have the replication logical hostnames (the rplnfsNN-repl addresses used later in this walk through) defined in the DNS.
Another thing that caught me out was that I was trying to use the logical hostname of the HA-NFS service for this. You can't do this; it has to be another address.
Installation of DRF on all cluster nodes
The clusters should already have an ha-cluster publisher, which would have been used to install the initial cluster. To install the DRF software, do the following on all cluster nodes
# pkg install ha-cluster-geo-full
Enable the DRF Software
Now our clusters are "special". The cluster addresses we use are on a different subnet to the cluster hosts. This means that on an inactive cluster node we have no addresses active on the correct subnet, which means that geoadm start will fail. We need to either plumb a "dummy" interface on the inactive cluster nodes while we build the geocluster, or move another resource group with a logical hostname to the other node. I just happened to have another one (emnfs-rg) that I could do this with, so that is the path I took. You could just as easily have added an address to the appropriate subnet interface using ipadm (sketched below).
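If you go the ipadm route, it is just a temporary address on the correct subnet on the otherwise-inactive node, removed once geoadm start has completed; something like the following (the address and interface name are placeholders, much like the temporary addresses used later for the protection group setup):
clnode0102# ipadm create-addr -T static -a {unused IP address} ipmp1/geotmp
(run geoadm start, then remove the temporary address)
clnode0102# ipadm delete-addr ipmp1/geotmp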
ProdNFS
root@clnode0101:~# clrg switch -n clnode0102 emnfs-rg
root@clnode0101:~# geoadm start
... checking for management agent ...
... management agent check done ....
... starting product infrastructure ... please wait ...
Registering resource type <SUNW.HBmonitor>...done.
Registering resource type <SUNW.SCGeoInitSvc>...done.
Registering resource type <SUNW.scmasa>...done.
Resource type <SUNW.SCGeoZC> has been registered already
Creating scalable resource group <geo-clusterstate>...done.
Creating disaster recovery framework initalization resource <geo-init-svc>...
Oracle Solaris Cluster disaster recovery framework initilization resource created successfully ....
Creating failover resource group <geo-infrastructure>...done.
Creating logical host resource <geo-clustername>...
Logical host resource created successfully ....
Creating resource <geo-hbmonitor> ...done.
Creating resource <geo-failovercontrol> ...done.
Bringing RG <geo-clusterstate> to managed state ...done.
Bringing resource group <geo-infrastructure> to managed state ...done.
Enabling resource <geo-clustername> ...done.
Enabling resource <geo-hbmonitor> ...done.
Enabling resource <geo-failovercontrol> ...done.
Node clnode0101: Bringing resource group <geo-infrastructure> online ...done.
Oracle Solaris Cluster disaster recovery framework started successfully.
root@clnode0101:~# clrg switch -n clnode0102 emnfs-rg
DRNFS
root@clnode0201:/var/explorer/output# clrg switch -n clnode0201 emnfs-rg
root@clnode0201:/var/explorer/output# geoadm start
... checking for management agent ...
... management agent check done ....
... starting product infrastructure ... please wait ...
Resource type <SUNW.HBmonitor> has been registered already
Resource type <SUNW.SCGeoInitSvc> has been registered already
Resource type <SUNW.scmasa> has been registered already
Resource type <SUNW.SCGeoZC> has been registered already
Resource group <geo-clusterstate> already exists
Resource <geo-init-svc> already exists
Resource group <geo-infrastructure> already exists
Creating logical host resource <geo-clustername>...
Logical host resource created successfully ....
Creating resource <geo-hbmonitor> ...done.
Creating resource <geo-failovercontrol> ...done.
Bringing RG <geo-clusterstate> to managed state ...done.
Bringing resource group <geo-infrastructure> to managed state ...done.
Enabling resource <geo-clustername> ...done.
Enabling resource <geo-hbmonitor> ...done.
Enabling resource <geo-failovercontrol> ...done.
Node clnode0201: Bringing resource group <geo-infrastructure> online ...done.
Oracle Solaris Cluster disaster recovery framework started successfully.
root@clnode0201:/var/explorer/output# clrg switch -n clnode0201 emnfs-rg
Network issue on a different cluster
At this point it is worth noting an issue I had on another pair of clusters.
The problem with this pair was that I had two ipmp interfaces onto the same subnet. When I ran geoadm start I got the following:
Creating logical host resource <geo-clustername>...
FAILED: clrslh create -g geo-infrastructure -p R_description="Oracle Solaris Cluster Geo logical hostname for communication with the partner clusters" -h ProdNFS geo-clustername
Creation of logical host resource failed with following message: clrslh: multiple PNM objects are available on node clnode0101 which can host given hostname(s). Specify the NetIfList property to resolve the ambiguity.
To get past this, I ran the listed command adding in a -N argument to list the required interfaces.
e.g.
root@clnode0101:~# clrslh create -g geo-infrastructure -p R_description="Oracle Solaris Cluster Geo logical hostname for communication with the partner clusters" -h ProdNFS -N ipmp1@clnode0101,ipmp1@clnode0102 geo-clustername
Setting up the partnership
DRNFS trusting ProdNFS
Just in case it’s not obvious, I changed the fingerprints from what I actually got.
root@clnode0201:~# geops add-trust -c ProdNFS
Local cluster : DRNFS
Local node : clnode0201
Cleaning up certificate files in /etc/cacao/instances/default/security/jsse on clnode0201
Retrieving certificates from ProdNFS ... Done
New Certificate:
Owner: CN=clnode0102_agent
Issuer: CN=clnode0102_ca
Serial number: 3a4c1ff9
Valid from: Tue Jun 16 20:52:46 PGT 2020 until: Fri Sep 16 20:52:46 PGT 2022
Certificate fingerprints:
M12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SH12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SHA212: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
Signature algorithm name: SHA256withRSA
Subject Public Key Algorithm: 2048-bit RSA key
Version: 3
Extensions:
#1: ObjectId: 2.5.29.19 Criticality=false
BasicConstraints:[
CA:false
PathLen: undefined
]
#2: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: 1A 7A 19 7B 64 66 90 C8 A3 69 05 3D 16 57 59 27 .z..df...i.=.WY'
0010: 07 04 A2 BB ....
]
]
Do you trust this certificate? [y/n] y
Adding certificate to truststore on clnode0202 ... Done
Adding certificate to truststore on clnode0201 ... Done
New Certificate:
Owner: CN=clnode0102_ca
Issuer: CN=clnode0102_ca
Serial number: 6997e07d
Valid from: Tue Jun 16 20:52:46 PGT 2020 until: Fri Sep 16 20:52:46 PGT 2022
Certificate fingerprints:
M12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SH12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SHA212: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
Signature algorithm name: SHA256withRSA
Subject Public Key Algorithm: 2048-bit RSA key
Version: 3
Extensions:
#1: ObjectId: 2.5.29.19 Criticality=true
BasicConstraints:[
CA:true
PathLen:1
]
Do you trust this certificate? [y/n] y
Adding certificate to truststore on clnode0202 ... Done
Adding certificate to truststore on clnode0201 ... Done
Operation completed successfully. All certificates are added to truststore on nodes of cluster DRNFS
ProdNFS Trusting DRNFS
root@clnode0101:~# geops add-trust -c DRNFS
Local cluster : ProdNFS
Local node : clnode0101
Cleaning up certificate files in /etc/cacao/instances/default/security/jsse on clnode0101
Retrieving certificates from DRNFS ... Done
New Certificate:
Owner: CN=clnode0202_agent
Issuer: CN=clnode0202_ca
Serial number: 47336d88
Valid from: Sun Aug 23 12:52:11 PGT 2020 until: Wed Nov 23 12:52:11 PGT 2022
Certificate fingerprints:
M12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SH12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SHA212: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
Signature algorithm name: SHA256withRSA
Subject Public Key Algorithm: 2048-bit RSA key
Version: 3
Extensions:
#1: ObjectId: 2.5.29.19 Criticality=false
BasicConstraints:[
CA:false
PathLen: undefined
]
#2: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: FF ED EA 95 86 9D 10 8C 21 11 D0 FA 03 9D 8F 0A ........!.......
0010: 9B 3F 63 65 .?ce
]
]
Do you trust this certificate? [y/n] y
Adding certificate to truststore on clnode0102 ... Done
Adding certificate to truststore on clnode0101 ... Done
New Certificate:
Owner: CN=clnode0202_ca
Issuer: CN=clnode0202_ca
Serial number: 5a3eb54c
Valid from: Sun Aug 23 12:52:11 PGT 2020 until: Wed Nov 23 12:52:11 PGT 2022
Certificate fingerprints:
M12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SH12: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
SHA212: 12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12:12
Signature algorithm name: SHA256withRSA
Subject Public Key Algorithm: 2048-bit RSA key
Version: 3
Extensions:
#1: ObjectId: 2.5.29.19 Criticality=true
BasicConstraints:[
CA:true
PathLen:1
]
Do you trust this certificate? [y/n] y
Adding certificate to truststore on clnode0102 ... Done
Adding certificate to truststore on clnode0101 ... Done
Operation completed successfully. All certificates are added to truststore on nodes of cluster ProdNFS
Verifying Trust
Does ProdNFS Trust DRNFS?
root@clnode0101:~# geops verify-trust -c DRNFS
Local cluster : ProdNFS
Local node : clnode0101
Retrieving information from cluster DRNFS ... Done
Verifying connections to nodes of cluster DRNFS:
clnode0101 -> clnode0202 / {IP address} ... OK
clnode0101 -> clnode0201 / {IP address} ... OK
Operation completed successfully, able to establish secure connections from clnode0101 to all nodes of cluster DRNFS
Does DRNFS Trust ProdNFS?
root@clnode0201:~# geops verify-trust -c ProdNFS
Local cluster : DRNFS
Local node : clnode0201
Retrieving information from cluster ProdNFS ... Done
Verifying connections to nodes of cluster ProdNFS:
clnode0201 -> clnode0102 / {IP address} ... OK
clnode0201 -> clnode0101 / {IP Adress} ... OK
Operation completed successfully, able to establish secure connections from clnode0201 to all nodes of cluster ProdNFS
Creating the Partnerships
ProdNFS
root@clnode0101:~# geops create -c DRNFS NFS
Partnership between local cluster "ProdNFS" and remote cluster "DRNFS" successfully created.
root@cnodel0101:~# geoadm status
Cluster: ProdNFS
Partnership "NFS" : Degraded
Partner clusters : DRNFS
Synchronization : Unknown
ICRM Connection : Error
Heartbeat "hb_ProdNFS~DRNFS" monitoring "DRNFS": Offline
Plug-in "ping_plugin" : Inactive
Plug-in "tcp_udp_plugin" : Inactive
DRNFS
root@clnode0201:~# geops join-partnership ProdNFS NFS
Local cluster "DRNFS" is now partner of cluster "ProdNFS".
root@clnode0201:~# geops list
NFS
root@clnode0201:~# geoadm status
Cluster: DRNFS
Partnership "NFS" : OK
Partner clusters : ProdNFS
Synchronization : OK
ICRM Connection : OK
Heartbeat "hb_DRNFS~ProdNFS" monitoring "ProdNFS": OK
Plug-in "ping_plugin" : Inactive
Plug-in "tcp_udp_plugin" : OK
Note that after DRNFS joins the “NFS” partnership, all statuses now show “OK”.
‘zfsrepl’ User
Each cluster node needs to have a replication user which will be used to run the replication. Each cluster needs separate ssh credentials for this user.
You could create the user with something like this (I happened to choose user:group 101:101):
# groupadd -g 101 zfsrepl
# useradd -u 101 -g 101 -m -d /export/home/zfsrepl -s /bin/bash -c "ZFS Replication User" zfsrepl
# passwd zfsrepl
On one node of each cluster, as the zfsrepl user, set up an ssh key.
IMPORTANT: a bug in the DRF code passes the ssh passphrase unquoted in shell arguments. As such, do not use spaces or other characters special to the shell in your choice of ssh passphrase (here speaks the experience of someone who spent a LONG time trying to figure out why my passphrase, which contained a space, was not working in some of the commands below).
zfsrepl@clnode0101:~ $ ssh-keygen -C zfsrepl@prodnfs
zfsrepl@clnode0201:~ $ ssh-keygen -C zfsrepl@drnfs
Once you have the keys generated, make sure that both nodes in each cluster have the same id_rsa and id_rsa.pub files that you just generated.
Now, ~zfsrepl/.ssh/authorized_keys on all cluster nodes need to contain both public keys. The public keys are found in the respective id_rsa.pub files. e.g.
ssh-rsa {public key} zfsrepl@prodnfs
ssh-rsa {public key} zfsrepl@drnfs
~zfsrepl/.ssh/known_hosts must contain every possible address that we could connect to. For this example, we need to list every cluster node and the rplnfsNN-repl address.
clnode0101,{IP Address} ssh-ed25519 {clnode0101 public key}
rplnfs01-repl,{IP Address} ssh-ed25519 {clnode0101 public key}
clnode0102,{IP Address} ssh-ed25519 {clnode0102 public key}
rplnfs01-repl,{IP Address} ssh-ed25519 {clnode0102 public key}
clnode0201,{IP Address} ssh-ed25519 {clnode0201 public key}
rplnfs02-repl,{IP Address} ssh-ed25519 {clnode0201 public key}
clnode0202,{IP Address} ssh-ed25519 {clnode0202 public key}
rplnfs02-repl,{IP Address} ssh-ed25519 {clnode0202 public key}
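One way to populate known_hosts is to run ssh-keyscan from one of the nodes and copy the result around, although you should verify the keys before trusting them, and you may still want to add the IP addresses alongside the hostnames as shown above. A hedged sketch, using the hostnames from this example:
# run as zfsrepl on one node; check the gathered keys before appending them
ssh-keyscan -t ed25519 clnode0101 clnode0102 clnode0201 clnode0202 \
    rplnfs01-repl rplnfs02-repl >> ~/.ssh/known_hosts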
Protection Groups
This is where we start to configure the ZFS replication. For all items below, the configuration directory I’m using is
CONF=/var/tmp/geo/zfs_snapshot
Warning: Do not try to build a protection group on a cluster with multiple interfaces on the same subnet. This actually requires some modification to the geocluster code, as the current code provides no way to provide a -N argument to the clrslh command.
The rplnfs zpool is part of the rplnfs-rg resource group which consists of
- rplnfs-hastp-rs (HA-Storage Plus)
- rplnfs-lh-rs (Logical Host)
- rplnfs-server-rs (HA-NFS Server)
${CONF}/sbpconf-rplnfs needs to exist on all cluster nodes
clnode0101/2:${CONF}/sbpconf-rplnfs
rplnfs-rep|any|clnode0101,clnode0102
clnode0201/2:${CONF}/sbpconf-rplnfs
rplnfs-rep|any|clnode0201,clnode0202
As the source point of this replication is on ProdNFS, ${CONF}/zfs_snap_geo_config-rplnfs on clnode0101 contains:
PS=NFS
PG=rpl-zfssnap-pg
REPCOMP=rplnfs-rep
REPRS=rplnfs-rep-rs
REPRG=rplnfs-zfssnap-rep-rg
DESC="ZFS Snapshot rplnfs"
APPRG=rplnfs-rg
CONFIGFILE=/var/tmp/geo/zfs_snapshot/sbpconf-rplnfs
LOCAL_REP_USER=zfsrepl
REMOTE_REP_USER=zfsrepl
LOCAL_PRIV_KEY_FILE=
REMOTE_PRIV_KEY_FILE=
LOCAL_ZPOOL_RS=rplnfs-hastp-rs
REMOTE_ZPOOL_RS=rplnfs-hastp-rs
LOCAL_LH=rplnfs01-repl
REMOTE_LH=rplnfs02-repl
LOCAL_DATASET=rplnfs/data
REMOTE_DATASET=rplnfs/data
REPLICATION_INTERVAL=120
NUM_OF_SNAPSHOTS_TO_STORE=2
REPLICATION_STREAM_PACKAGE=false
SEND_PROPERTIES=true
INTERMEDIARY_SNAPSHOTS=false
RECURSIVE=true
MODIFY_PASSPHRASE=false
The user zfsrepl needs specific permissions on the pools for replication. This needs to be done on both clusters.
pools="rplnfs"
for p in $pools
do zfs allow zfsrepl create,destroy,hold,mount,receive,release,send,rollback,snapshot $p/data
done
The resource group containing the zpools needs to be offline and the underlying resources disabled before we can run the configuration script. This is where consistent resource naming pays off.
clrg offline rplnfs-rg
for i in server- hastp- 'lh-'; do clrs disable rplnfs-${i}rs; done
clrg unmanage rplnfs-rg
clrg set -p Auto_start_on_new_cluster=false rplnfs-rg
In our case, on the inactive cluster nodes, we also need to have an address in the correct subnet on the interface that we want the replication address to be assigned to. This cannot be an address that anything else is using.
clnode0102# ipadm create-addr -T static -a {IP address} ipmp1/temp
clnode0202# ipadm create-addr -T static -a {IP address} ipmp1/temp
On the source cluster node run the configuration
clnode0101# /opt/ORCLscgrepzfssnap/util/zfs_snap_geo_register -f /var/tmp/geo/zfs_snapshot/zfs_snap_geo_config-rplnfs
answering the prompts.
If this completes successfully then on the same node run
clnode0101# geopg get --partnership NFS rplnfs-zfssnap-rg
Remove the temporary ipmp entries
clnode0102# ipadm delete-addr ipmp1/temp
clnode0202# ipadm delete-addr ipmp1/temp
Now that both clusters know about the protection group, on the primary cluster’s active node (clnode0101) for this protection group, start it.
clnode0101# geopg start -e global rplnfs-zfssnap-pg
and we are done.