Friday, December 24, 2010

VCS Maintenance


1. Do I have to run "hastart" on each node to manually start up the cluster?
Yes. If you need to start the cluster manually, you have to run "hastart" on each node. There is no command to start VCS on every node at once. If you only execute "hastart" on one node, VCS will come up, but it probably won't start your Service Groups. VCS has to probe each machine the Service Group can come online on, and it can't do that if VCS isn't running on one of the nodes.

2. How do I start VCS when one node is down?
Normally, VCS has to seed all the nodes in your cluster before becoming fully operational. VCS may actually start up, but none of the commands will work. If one of your nodes is down and you need to start VCS on the other nodes, you must manually seed the other node(s). Run this command on each node that is up:
     /sbin/gabconfig -cx
VCS should then be starting up. You may have to online some Service Groups manually:
     hagrp -online {Service Group} -sys {hostname}
If the gabconfig command doesn't work, reconfigure GAB and LLT and try again.
Do the following on *both* nodes:
     1) Make sure had and hashadow are not in the process table. Check
        "ps -ef" and kill them if you have to.
     2) /sbin/gabconfig -U
     3) /sbin/lltconfig -U  (answer yes)
     4) /sbin/lltconfig -c
     5) /sbin/gabconfig -cx
     6) hastart
VCS should then start up on each node that is up.
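
If you want to rehearse steps 2 through 6 before touching a real node, a dry-run wrapper is one way to do it. This is only a sketch: the run and reseed_node function names and the DRY_RUN variable are made up for illustration, and step 1 (checking that had and hashadow are gone) is still left to you. Note that "lltconfig -U" asks for confirmation when run for real.

```shell
#!/bin/sh
# Dry-run sketch of the reseed sequence (steps 2-6 above).
# DRY_RUN defaults to 1, so commands are only printed, never executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reseed_node() {
    # Stop at the first command that fails.
    run /sbin/gabconfig -U &&
    run /sbin/lltconfig -U &&
    run /sbin/lltconfig -c &&
    run /sbin/gabconfig -cx &&
    run hastart
}

reseed_node
```

With DRY_RUN left at its default, the script just lists the five commands in order, which makes it easy to compare against the steps above.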

3. How can I shut down VCS without shutting down my applications?
Use the "hastop -force" option.
   (1) hastop -all -force  (shuts down VCS on all nodes)
   (2) hastop -local -force  (shuts down VCS on the local node only)
WARNING: Always make the cluster read-only before doing a force shutdown.
     haconf -dump -makero
If you force stop a cluster while it is in read-write mode, you will get a stale configuration error upon VCS restart.
To see if your cluster is in read-only mode, run "haclus -display". The
"ReadOnly" attribute should have a value of 1. If not, then run
"haconf -dump -makero" to make it read-only.
If you start VCS and get a stale configuration error, you have two main choices.
   (1) Run "hastop -all -force", check your nodes for any
       inconsistencies, remove any .stale files in /etc/VRTSvcs/conf/config/,
       and restart VCS.
       If you see no .stale files, then your configuration file might have a
       syntax error. Execute these commands to see where the syntax errors are:
         cd /etc/VRTSvcs/conf/config/
         hacf -verify .
   (2) Continue to start VCS by running "hasys -force {hostname}". Pick the
       hostname of the machine you want VCS to load the configuration from.
Usually you would choose the second option if the cluster is not in production or if you're confident the configuration on the specified machine is good enough.
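
The read-only check before a force stop can be scripted. The sketch below assumes that "haclus -display" prints one attribute per line, with the attribute name and its value separated by whitespace; check that against your own output first. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether the configuration must be dumped before a force stop.
# Assumption: "haclus -display" output has lines like "ReadOnly   1".

readonly_value() {
    # Reads "haclus -display" output on stdin and prints the ReadOnly value.
    awk '$1 == "ReadOnly" { print $2; exit }'
}

prestop_command() {
    # Prints the command to run before "hastop -force", or nothing if the
    # cluster is already read-only. A missing ReadOnly line is treated as
    # "not read-only" to stay on the safe side.
    value=$(readonly_value)
    if [ "$value" != "1" ]; then
        echo "haconf -dump -makero"
    fi
}
```

You would feed it the live output, for example "haclus -display | prestop_command", and run whatever it prints before the force stop.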

4. How do I fail over a Service Group?
You can manually fail over a Service Group in two ways:
   (1) hagrp -switch {Service Group} -to {target node}
   (2) hagrp -offline {Service Group} -sys {current node}
       hagrp -online {Service Group} -sys {target node}

The second way simply gives you more control. After you offline the Group, you can online it anywhere, whenever you want to. The first way is for an immediate "hands-off" failover.
VCS can automatically fail over Groups if you do any of the following:
   (1) Execute "init 6" or "shutdown -ry 0"
   (2) Execute "reboot"
   (3) Switch off the machine's power
   (4) Pull out all heartbeat cables simultaneously
   (5) Cause a "fault", i.e. manually shutdown some service or resource in
       your Service Group.
   (6) Panic the machine.
WARNING: Doing #4 will result in immediate split brain, which may lead
         to data corruption. Never do this on a production cluster.

5. Is offlining a Service Group the same thing as failing it over? What does offline mean?
No. When you offline a Group, you are shutting down all the services in the group, but you are not onlining it anywhere else. Offline for a Group means the services in that group are currently unavailable on every node in the cluster. You can then online the Group at any time on the same node or on another node if you want.
A failover is when a Group offlines from one node and immediately tries to online on another.

6. What's the difference between Agents and Resources?
Agents are VCS processes that control and monitor the Resources. Resources are all those objects in your Service Group, and they all require Agents. For example, all your filesystems are resources, and they all use the Mount Agent. Your virtual IP address is a resource, and it uses the IP or IPMultiNIC Agent. The Veritas Volume Manager Disk Group is a resource, and it uses the DiskGroup Agent. Some Agents, such as the Oracle Enterprise Agent, have to be purchased separately.

7. Does each Service Group have its own IP and DiskGroup?
Usually, Service Groups have their own IP and DiskGroup resources, but this is not technically required; it all depends on your applications. All a Service Group really needs is some resource. Most resources, however, cannot be shared across Service Groups. That is why Service Groups usually do have their own IPs, DiskGroup, filesystems, etc. Groups can share certain resources like NIC and MultiNICA, although the resource has a unique name in each Group.

8. Should I put everything in one Service Group, or should I have more than one Service Group?
Usually people try to separate different applications as much as possible. Service Groups serve as logical divisions for your applications. You don't want a failure of one application to cause a failover of all your applications if it's unnecessary. If all your applications are using the same Group, then a failure in that Group can cause all your applications to fail. The goal of high availability is to minimize single points of failure. That is why separate Service Groups are usually recommended for separate applications in a cluster.

9. How do I add another Service Group to the configuration?
You can add a Service Group using the VCS GUI, using VCS commands, or by editing the configuration file.

10. Can I use vi to edit the configuration file?
You can edit the configuration file only when VCS is shut down. VCS does not read the file when it is already running; it only reads from the configuration it has in memory. If you edit the file while VCS is running, VCS will not read your updates, and it will overwrite the file with the configuration it has in memory.
Alternatively, you can edit a copy of the file, shut down VCS, move the new file into /etc/VRTSvcs/conf/config/, and restart VCS.
Here's an example...
    1) haconf -dump
    2) cp /tmp/
    3) vi /tmp/
    4) haconf -dump -makero
    5) hastop -all -force
    6) cp /tmp/ /etc/VRTSvcs/conf/config/
    7) hastart

11. Can different Resources have the same name if they are in different Service Groups?
No, two resources in the cluster cannot have the same name, even if they are in different Service Groups. Resource names must be unique in the entire cluster.

12. What does autodisable mean? Why did VCS autodisable my Service Group?
VCS does not allow failovers or online operations of a Service Group if it is autodisabled. VCS has to autodisable a Service Group when VCS on a particular node shuts down *but* the GAB heartbeat is still running. Once GAB is unloaded, e.g. when the node actually shuts down to PROM level, reboots, or powers off, VCS on the other nodes can automatically clear the autodisable flag. While a Group is autodisabled, VCS won't allow that Group to fail over or be onlined anywhere within the cluster. This is a safety feature to protect against "split brain", when more than one machine is using the same resources, such as the same filesystems and virtual IP, at the same time.
Once a node leaves the cluster, VCS has to assume that the machine can still be user-controlled before it goes down, and that theoretically someone can log in to that machine and manually start up services. It is for that reason that VCS autodisables a Group within the existing cluster. But VCS does let you clear the autodisable flag yourself. Once you're sure that the node that left the cluster doesn't have any services running, you can clear the autodisable flag with this command:
       hagrp -autoenable {name of Group} -sys {name of node}

Repeat the command for each Group that has been autodisabled. The Groups that are autodisabled and the nodes they are autodisabled for can be found with this command:
       hastatus -sum

Most of the time VCS autodisables a Group for a short period of time and then clears the autodisable flag without you knowing it. If the node that leaves the cluster actually shuts down, the GAB module is also unloaded, and VCS running on the other nodes will assume that node has shut down. VCS will then automatically clear the autodisable flags for you.
There's one caveat: by default, VCS on the running cluster requires GAB to be unloaded within 60 seconds after VCS on that node is stopped. After 60 seconds, if GAB still isn't unloaded, VCS on the existing cluster will assume that node isn't shutting down, and will keep the autodisable flags until the administrator clears them.
To increase the 60 second window to 120 seconds, run this:
       hasys -modify {hostname} ShutdownTimeout 120
For large systems that take a long time to shut down, it is a good idea to increase ShutdownTimeout.
Please read the VCS User's Guide for more information on autodisable.
NOTE: In VCS 3.5, the default ShutdownTimeout has been increased to 120.
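
If several Groups are autodisabled, you can generate the autoenable commands from the summary instead of typing them one by one. This sketch assumes the group lines of "hastatus -sum" look like "B  group  system  probed  autodisabled  state", which is the layout of the sample output later in this post. The function name is made up. It only prints commands; you still have to confirm that the departed node has no services running before you execute any of them.

```shell
#!/bin/sh
# Sketch: turn "hastatus -sum" group lines into hagrp -autoenable commands.
# Assumed line layout:
#   B  <group>  <system>  <probed>  <autodisabled>  <state>

autoenable_commands() {
    # Print one command for every group/system pair flagged AutoDisabled=Y.
    awk '$1 == "B" && $5 == "Y" {
        print "hagrp -autoenable " $2 " -sys " $3
    }'
}
```

Typical use would be "hastatus -sum | autoenable_commands", then reviewing the printed list by hand.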

13. Does VCS require license keys to run? Did VCS 1.3 require license keys?
The latest versions of VCS require license keys to run. VCS 1.3 and before did not.

14. Do I need to create the same VxVM DiskGroup on both machines?
No, when you create a Volume Manager DiskGroup, just pick one machine to create the DiskGroup on. You do not create the same DiskGroup on both nodes.
After you create a DiskGroup, you can add it as a Resource to your VCS configuration. VCS will then use VxVM commands to import and deport the DiskGroup between the systems during Service Group online, offline, or failover.

15. Can I run different versions of VCS in the same cluster?
No, absolutely not! Different versions of VCS, and even different patch levels of VCS, cannot run at the same time in the same cluster. Therefore, when you install VCS patches, you must install them on *all* nodes at the same time!
The cluster will have to be partially or completely shut down during upgrades or patching. Of course, you can shut down VCS without shutting down your services.

16. Does VCS require "shared disks"?
No, VCS does not require that your nodes are connected to shared disks. However, most people like to have storage that can be deported and imported during failover. If your applications do not need this, then you do not need shared storage. The VCS installation and setup will not ask if you have shared storage. This is great for people who don't have the shared storage ready, but still want to try out or test VCS.

17. What is the difference between freezing the system and freezing a Group? Which is better for maintenance?
Freezing a system prevents VCS from onlining a Service Group onto that system. This is usually done when a machine in the cluster is unstable or undergoing maintenance, and you don't want VCS to try to fail over a Group to that machine. However, if a Group is already online on a frozen system, VCS can still offline that Group.
Freezing a Service Group is the most common practice when maintenance needs to be done on the nodes while VCS is still running. When you freeze a Group, VCS and its Agents will take no action (not even calling Clean) on that Group or its Resources no matter what happens to the resources. That means you can take down your services, like IP's, filesystems, databases and applications, and VCS won't do anything. VCS won't offline the Group, or offline any resources. VCS also won't online anything in that Group, and it won't online that Group anywhere. This basically "locks" the Group on a node, and prevents it from onlining until you unfreeze the Group.
One thing that may be surprising is that VCS will still monitor a frozen Group and its resources. So, during maintenance, VCS might tell you that your resources have faulted, or the Group is offline. If you manually bring everything back up after maintenance, VCS monitoring should refresh and see all your resources and the Group are online again. This is a good thing, since it is best to know if VCS thinks your Group and its resources are online before you unfreeze the Group.
To freeze a Group:
   haconf -makerw
   hagrp -freeze {Group name} -persistent
   haconf -dump -makero
To unfreeze a Group:
   haconf -makerw
   hagrp -unfreeze {Group name} -persistent
   haconf -dump -makero
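
Since freezing and unfreezing differ by a single word, a small helper can print the right three-command sequence for either case. This is just a sketch: the maintenance_commands name is hypothetical, and it prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: print the freeze or unfreeze sequence for one Service Group.

maintenance_commands() {
    # $1 = freeze or unfreeze, $2 = Service Group name
    action=$1
    group=$2
    case "$action" in
        freeze|unfreeze) ;;
        *) echo "usage: maintenance_commands freeze|unfreeze GROUP" >&2
           return 1 ;;
    esac
    if [ -z "$group" ]; then
        echo "missing Service Group name" >&2
        return 1
    fi
    # Open the configuration, change the frozen state, then save and close it.
    printf '%s\n' \
        "haconf -makerw" \
        "hagrp -$action $group -persistent" \
        "haconf -dump -makero"
}
```

For example, "maintenance_commands freeze oragrp" prints the three freeze commands for a Group named oragrp.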

18. I just added a DiskGroup to VCS, and VCS offlined everything. Why?
The diskgroup you added was probably already imported manually or through VMSA, and without the "-t" option.
    vxdg import {disk group}
VCS imports diskgroups using "-t", which sets the diskgroup's noautoimport flag to "on".
    vxdg -t import {disk group}
So, when you added the diskgroup to VCS, VCS detected the new diskgroup was imported outside of VCS because the noautoimport flag was set to "off". This is considered a violation, and the DiskGroup Agent monitor script will then offline the entire Service Group. This is a precaution to prevent split brain. You can see a diskgroup's noautoimport flag by doing:
    vxprint -at {disk group}

If you've imported a new diskgroup, and have not yet added it to VCS, you can deport the diskgroup first, and then add it to VCS. You do not need to import a diskgroup to add it to VCS.

19. I need to play with a resource inside a Service Group, but I don't want to cause the Group to fault. What do I need to do?
You should first make the resource non-critical.
  hares -modify {resource name} Critical 0
By making the resource non-critical, VCS will not offline the Group if it thinks this resource faulted.
You must also make any Parents of this resource non-critical. Run this to check if there are any parents for this resource:
  hares -dep

If you don't want VCS to monitor your resource, you can disable monitoring by doing this:
  hares -modify {resource name} Enabled 0

This prevents VCS from monitoring the state of this resource, so it won't fault the Group no matter what you do to the resource, even if it has Critical=1.
If the Group is in production, you might want to freeze the Group just to be safe.

20. After someone started up some process on the other node, VCS reports a "Concurrency Violation", and tries to offline that process. What is this, and is it bad?

A Concurrency Violation is reported when the Agent of a resource reports that the same resource or process is running on another node. The Agent will then try to run the offline script for that resource on that other node. This is to prevent split brain. If the Agent cannot offline the process on the other node, then you may want to manually offline the process or change the Agent's monitoring.

Sometimes a Concurrency Violation is more or less a "false alarm", because it has a lot to do with how good your monitoring is. You need to find out exactly how your Agent is monitoring. If it is an Application Agent resource, look at the MonitorProgram script, or look at MonitorProcesses. If it looks like the Agent is just monitoring for something very superficial, then just change the monitoring. If you are changing the monitoring in production, you may want to freeze the Service Group or make the resource non-Critical.

Some agents have a "second level" or "deep" monitor feature, either built into the agent or requiring you to write a custom script. If you write one, you need to make it better than the first level monitor, which is obviously superficial if it reports online but the resource is really offline.

21. The cluster is hanging, the system is hanging, everything seems to be hanging, and I'm not sure what's going on. What should and shouldn't I do with VCS?
In emergency situations, it's probably not a good idea to blindly run commands if you don't know what state your services are in. Doing so can cause Concurrency Violations and split brains, which can cause further confusion or data corruption. Here are some safe commands to gather data and orient yourself before calling support:
hastatus -sum
hares -probe {resource name} -sys {machine name}
/sbin/gabconfig -a
ps -ef
ifconfig -a
vxdg list
vxdisk -o alldgs list
df -kl
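
If you want to capture the output of these commands in one go, something like the following sketch can help. It skips tools that aren't installed instead of failing. The function names are made up, "hares -probe" is left out because it needs a resource name, and keep in mind that on a badly hung system any of these commands may itself block.

```shell
#!/bin/sh
# Sketch: run read-only diagnostics, skipping tools that are not installed.

run_if_present() {
    # $1 is the program; the remaining arguments are passed through.
    if command -v "$1" >/dev/null 2>&1; then
        echo "=== $*"
        "$@" 2>&1
    else
        echo "=== $1: not found, skipped"
    fi
}

collect_diagnostics() {
    run_if_present hastatus -sum
    run_if_present /sbin/gabconfig -a
    run_if_present ps -ef
    run_if_present ifconfig -a
    run_if_present vxdg list
    run_if_present vxdisk -o alldgs list
    run_if_present df -kl
}
```

Redirect the output of collect_diagnostics to a file of your choice so you have something to hand to support.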
Sometimes it may be wise to freeze a Service Group or force stop VCS:

hagrp -freeze {Service Group} -persistent
haconf -dump -makero
hastop -local -force or hastop -all -force

Force stopping VCS is a common practice when "things get stuck". Force stopping VCS also lets your applications stay up (if they are still up). Also, when a Group is frozen, a force stop is the only way to shut down VCS.
The following operations are usually not very helpful when things are hanging:

hagrp -offline {Service Group} -sys {machine name}
hagrp -online {Service Group} -sys {machine name}
hagrp -switch {Service Group} -to {machine name}
hastop -local
hastop -all

Why? Because these commands assume your systems are behaving normally. These commands tell VCS to online or offline services in an orderly manner. But if your system or cluster is already hung, running these commands probably won't do any good. The commands may themselves hang, get queued up in a job scheduler, and add additional load to your system.

Also, if you are unfamiliar with the cluster, running "hastop -all" could shut down or hang *everything* on all nodes, causing additional unnecessary downtime. In an emergency situation where you are unfamiliar with the cluster, it's probably best to gather information and call Support, instead of trying to make VCS do things haphazardly.

Veritas Cluster Server is a high availability server. This means that processes switch between servers when a server fails. All database processes are run through this server, so it needs to run smoothly. Note that the Oracle process should only actually be running on the server which is active. On monitoring tools, the procs light for whichever box is secondary should be yellow, because Oracle is not running. Yet the cluster is running on both systems.
Cluster Not Up -- HELP
The normal debugging steps include: checking status, restarting if there are no faults, checking licenses, clearing faults if needed, and checking logs.
To find out Current Status:
            /opt/VRTSvcs/bin/hastatus -summary
            This will give the general status of each machine and processes
            /opt/VRTSvcs/bin/hares -display
            This gives much more detail - down to the resource level.
If hastatus fails on both machines (it returns that the cluster is not up or returns nothing), try to start the cluster with hastart. Afterwards,
        /opt/VRTSvcs/bin/hastatus -summary
        will tell you if processes started properly. hastart will NOT start processes on a FAULTED system.
Starting Single System NOT Faulted
If the system is NOT FAULTED and only one system is up, the cluster probably needs to have gabconfig manually started. Do this by running:
/sbin/gabconfig -c -x
/opt/VRTSvcs/bin/hastatus -summary 
If the system is faulted, check licenses and clear the faults as described next.
To check licenses:
vxlicense -p
Make sure all licenses are current and NOT expired! If they are expired, that is your problem. Call VERITAS to get temporary licenses.
There is a BUG with VERITAS licenses. VERITAS will not run if there are ANY expired licenses, even if you have the valid ones you need. To get VERITAS to run, you will need to MOVE the expired licenses. [Note: from what I understand, you will minimally need the VXFS, VxVM and RAID licenses to NOT be expired.]
vxlicense -p
Note the NUMBER after the license (e.g. Feature name: DATABASE_EDITION [100])
cd /etc/vx/elm
mkdir old
mv lic.number old [do this for all expired licenses]
vxlicense -p [Make sure there are no expired licenses AND your good licenses are there]
If it still fails, call VERITAS for temporary licenses. Otherwise, be certain to do the same on your second machine.
To clear FAULTS:
hares -display
For each resource that is faulted run:
hares -clear resource-name -sys faulted-system
If all of these clear, then run hastatus -summary and make sure that the faults are gone. If some don't clear, you MAY be able to clear them at the group level. Only do this as a last resort:
hagrp -disableresources groupname
hagrp -flush group -sys sysname
hagrp -enableresources groupname
To get a group to go online:
hagrp -online group -sys desired-system
If it did NOT clear, did you check licenses?
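
To avoid hunting through "hares -display" by eye, you can have awk build the clear commands for you. This is a sketch that assumes the output columns are resource, attribute, system, value, which is the layout of the "hares -display" excerpt shown in the status listing in this post. The clear_commands name is hypothetical, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: build "hares -clear" commands from "hares -display" output.
# Assumed columns: <resource>  <attribute>  <system>  <value>

clear_commands() {
    # Match any State value containing FAULTED (for example a combined
    # state such as OFFLINE|FAULTED).
    awk '$2 == "State" && $4 ~ /FAULTED/ {
        print "hares -clear " $1 " -sys " $3
    }'
}
```

Run it as "hares -display | clear_commands" and review the list before executing anything.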

System has the following EXACT status:
gedb002# hastatus -summary
-- System               State                Frozen
A  gedb001              RUNNING              0
A  gedb002              RUNNING              0
-- Group           System               Probed     AutoDisabled    State        
B  oragrp          gedb001              Y          N               OFFLINE      
B  oragrp          gedb002              Y          N               OFFLINE      
gedb002#  hares -display | grep  ONLINE
nic-qfe3  State           gedb001   ONLINE
nic-qfe3  State           gedb002   ONLINE
gedb002# vxdg list
NAME         STATE           ID
rootdg       enabled  957265489.1025.gedb002
gedb001# vxdg list
NAME         STATE           ID
rootdg       enabled  957266358.1025.gedb001

Recovery Commands:
hastop -all
on one machine hastart
wait a few minutes
on other machine hastart
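
A status like the one above, where a group is OFFLINE on every system, can be spotted automatically. The sketch below reads "hastatus -summary" output and lists such groups; it assumes the "B  group  system  probed  autodisabled  state" layout shown in the listing, and the function name is made up.

```shell
#!/bin/sh
# Sketch: list groups that are not online on any system.

offline_everywhere() {
    awk '
        $1 == "B" {
            seen[$2] = 1
            # Any state without OFFLINE in it counts as (at least partly) up.
            if ($6 !~ /OFFLINE/) up[$2] = 1
        }
        END {
            for (g in seen) if (!(g in up)) print g
        }
    '
}

# Demo using the status lines from the listing above.
offline_everywhere <<'EOF'
-- Group           System               Probed     AutoDisabled    State
B  oragrp          gedb001              Y          N               OFFLINE
B  oragrp          gedb002              Y          N               OFFLINE
EOF
# prints: oragrp
```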
Reviewing Log Files
If you are still having troubles, look at the logs in /var/VRTSvcs/log. Look at the most recent ones for debugging purposes (ls -ltr). Here is a short description of the logs in /var/VRTSvcs/log:
hashadow-log_A: hashadow checks to see if the ha cluster daemon (had) is up and restarts it if needed. This is the log of that process.
engine.log_A: primary log, usually what you will be reading for debugging
Oracle_A: oracle process log (related to cluster only)
Sqlnet_A: sqlnet process log (related to cluster only)
IP_A: related to shared IP
Volume_A: related to Volume manager
Mount_A: related to mounting actual filesystems
DiskGroup_A: related to Volume Manager/Cluster Server
NIC_A: related to actual network device
By looking at the most recent logs, you can tell what failed last (or most recently). You can also tell what did NOT run, which may be just as much of a clue. Of course, if none of this helps, open a call with Veritas tech support.
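
To see which logs were written to most recently, a tiny helper like this can be handy. It is only a sketch with a made-up function name; it lists newest files first, whereas "ls -ltr" lists them last.

```shell
#!/bin/sh
# Sketch: show the most recently modified files in the VCS log directory.

recent_logs() {
    # $1 = log directory (defaults to /var/VRTSvcs/log)
    # $2 = how many file names to print (defaults to 3)
    dir=${1:-/var/VRTSvcs/log}
    count=${2:-3}
    ls -t "$dir" | head -n "$count"
}
```

Calling "recent_logs" with no arguments prints the three newest file names in /var/VRTSvcs/log.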
Calling Tech Support:
If you have tried the previously described debugging methods, call Veritas tech support: 800-634-4747. Your company needs to have a Veritas support contract.
Restarting Services:
If a system is gracefully shut down and it was running Oracle or other high availability services, it will NOT transfer them. It only transfers services when the system crashes or has an error.
hastatus -summary
will tell you if processes started properly. It will NOT start processes on a FAULTED system. If the system is faulted, clear the faults as described above.
Doing Maintenance on DBs:
BEFORE working on DB
Run hastop -all -force
AFTER working on DBs:
You MUST bring up Oracle on the same machine
Once Oracle is up, run:
hastart on the same machine you started the work on (the first system with Oracle running)
wait 3-5 minutes
then run hastart on the other system 
If you need the instance to run on the other system, you can run: hagrp -switch oragrp -to othersystem
Shutting down db machines:
If you shut down the machine that is running the Veritas cluster, it will NOT start on the other machine. It only fails over if the machine crashes. You need to manually switch the services if you shut down the machine. To switch processes:
Find out groups to transfer over
hagrp -display
Switch over each group
hagrp -switch group-to-move -to new-system
Then shut down the machine as desired. When rebooted, it will start the cluster daemon automatically.
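
If the machine hosts several groups, the switch commands can be generated from the status summary. This sketch uses "hastatus -summary" rather than "hagrp -display" as its input and assumes the "B  group  system  probed  autodisabled  state" line layout shown earlier. The switch_commands name is hypothetical and nothing is executed.

```shell
#!/bin/sh
# Sketch: print a "hagrp -switch" command for every group that is ONLINE
# on the machine about to be shut down.

switch_commands() {
    # $1 = system being shut down, $2 = system that should take the groups
    awk -v from="$1" -v to="$2" '
        $1 == "B" && $3 == from && $6 == "ONLINE" {
            print "hagrp -switch " $2 " -to " to
        }'
}
```

For example, "hastatus -summary | switch_commands gedb001 gedb002" lists the switches needed before shutting down gedb001.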
Doing Maintenance on Admin Network:
If the admin network (which the Veritas cluster uses) is brought down, Veritas WILL fault both machines AND bring down Oracle (nicely). You will need to do the following to recover:
      hastop -all
      On ONE machine: hastart
      wait 5 minutes
      On other machine: hastart
If possible, use the section on DB maintenance instead. Only use the manual procedure below if the system fails on coming up AND you KNOW that it is due to a DB configuration error. If you manually start up filesystems/Oracle, manually shut them down and restart using hastart when done.
To start up: Make sure ONLY the rootdg disk group is active on BOTH nodes [i.e. oradg or xxoradg is NOT present]. This is EXTREMELY important: if the Oracle disk group is active on both nodes, corruption occurs.
vxdg list
hastatus (stop on both as you are faulted on both machines )
hastop -all (if either was active make sure you are truly shutdown!)
Once you have confirmed that the oracle datagroup is not active, on ONE machine do the following:
vxdg import oradg [this may be xxoradg where xx is the client 2 char code]
vxvol -g oradg startall
mount -F vxfs /dev/vx/dsk/oradg/name /mountpoint [Find volumes and mount points in /etc/VRTSvcs/conf/config/]
Let DBAs do their stuff
To shutdown:
umount /mountpoint [for each mountpoint]
vxvol -g oradg stopall
vxdg deport oradg
clear faults; start cluster as described above
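
Because importing the Oracle disk group on two nodes at once is so dangerous, it may be worth scripting the "only rootdg is present" check. This sketch parses "vxdg list" output in the NAME / STATE / ID layout shown earlier in this post; the function names are made up. The check has to pass on BOTH nodes before you import anything.

```shell
#!/bin/sh
# Sketch: refuse to continue if any disk group other than rootdg is imported.

non_root_diskgroups() {
    # Reads "vxdg list" output; skips the header line and blank lines.
    awk 'NR > 1 && NF > 0 && $1 != "rootdg" { print $1 }'
}

safe_to_import() {
    # Succeeds only if no disk group besides rootdg is listed on stdin.
    [ -z "$(non_root_diskgroups)" ]
}
```

You could use it as "vxdg list | safe_to_import && echo OK" on each node, and only proceed when both report OK.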


Anonymous said...

Really, I appreciate your hard work to gather all this information. It helps learners like me a lot. Thanks a lot again, and keep posting.

venkatadry said...

madam i have a question... may be my question is not wise but i have some confusion..

for q21 answer

If the system is hung, can we run the commands to gather the information? According to my knowledge, if the system is hung we cannot run any commands.

Thanks in Advance.....

System Admin said...

Q21 says about cluster hung.

In case of a server hang, we cannot run any commands; we can do any diagnosis in maintenance mode.

Anonymous said...

Can you please update the notes for how to configure VCS