Heartbeat reporting

   Dejan Muhamedagic
   <[1]dmuhamedagic@suse.de>
   v1.0

   hb_report is a utility to collect all information relevant to Heartbeat
   over the given period of time.

Quick start

   Run hb_report on one of the nodes or on the host which serves as a
   central log server. Run hb_report without parameters to see usage.

   A few examples:
    1. Last night during the backup there were several warnings
       encountered (logserver is the log host):
logserver# hb_report -f 3:00 -t 4:00 /tmp/report
       collects everything from all nodes from 3am to 4am last night. The
       files are stored in /tmp/report and compressed to a tarball
       /tmp/report.tar.gz.
    2. Just found a problem during testing:
node1# date : note the current time
node1# /etc/init.d/heartbeat start
node1# nasty_command_that_breaks_things
node1# sleep 120 : wait for the cluster to settle
node1# hb_report -f time /tmp/hb1
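
   In the second example, "time" stands for the time noted with date at
   the start. A minimal sketch of capturing it in a variable instead of
   writing it down; the helper name now_for_hb_report is made up for
   illustration, and the HH:MM form (as in the 15:00 example under Times)
   is only good enough if the whole test happens on the same day:

```shell
# Hypothetical helper: print the current time as HH:MM, the same shape
# as the "15:00" example accepted by the -f option.
now_for_hb_report() {
    date '+%H:%M'
}

# Usage on a node (left as comments so nothing is started by accident):
#   start=$(now_for_hb_report)
#   /etc/init.d/heartbeat start
#   nasty_command_that_breaks_things
#   sleep 120
#   hb_report -f "$start" /tmp/hb1
```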

Introduction

   Managing clusters is cumbersome. Heartbeat v2 with its numerous
   configuration files and multi-node clusters just adds to the
   complexity. No wonder then that most problem reports were less than
   optimal. This is an attempt to rectify that situation and make life
   easier for both the users and the developers.

On security

   hb_report is a fairly complex program. As some of you are probably
   going to run it as root let us state a few important things you should
   keep in mind:
    1. Don't run hb_report as root! It is fairly simple to set things up
       in such a way that root access is not needed. I won't go into
       details, just stress that all information collected should be
       readable by accounts belonging to the haclient group.
    2. If you still have to run this as root, well, don't use the -C
       option.
    3. Of course, every possible precaution has been taken not to disturb
       processes, or touch or remove files outside of the given
       destination directory. If you (by mistake) specify an existing
       directory, hb_report will bail out early. If you specify a relative
       path, it won't work either.

   The final product of hb_report is a tarball. However, the destination
   directory is not removed on any node, unless the user specifies -C. If
   you're too lazy to clean up after the previous run, do yourself a
   favour and just supply a new destination directory. You've been warned.
   If you worry about the space used, just put all your directories under
   /tmp and set up a cron job to remove those directories once a week:
        for d in /tmp/*; do
                test -d "$d" ||
                        continue
                test -f "$d/description.txt" || test -f "$d/.env" ||
                        continue
                grep -qs 'By: hb_report' "$d/description.txt" ||
                        grep -qs '^UNIQUE_MSG=Mark' "$d/.env" ||
                        continue
                rm -r "$d"
        done

Mode of operation

   Cluster data collection is straightforward: just run the same procedure
   on all nodes and collect the reports. There is, apart from many small
   ones, one large complication: a central syslog destination. So, in
   order to allow this to be fully automated, we should sometimes run the
   procedure on the log host too. Actually, if there is a log host, then
   the best way is to run hb_report there.

   We use ssh for the remote program invocation. Even though it is
   possible to run hb_report without ssh by doing a more menial job, the
   overall user experience is much better if ssh works. Anyway, how else
   do you manage your cluster?

   Another ssh related point: In case your security policy proscribes
   loghost-to-cluster-over-ssh communications, then you'll have to copy
   the log file to one of the nodes and point hb_report to it.

Prerequisites

    1. ssh
       This is not strictly required, but you won't regret having a
       password-less ssh. It is not too difficult to set up and will save
       you a lot of time. If you can't have it, for example because your
       security policy does not allow such a thing, or you just prefer
       menial work, then you will have to resort to the semi-manual,
       semi-automated report generation. See below for instructions.
       If you need to supply a password for your passphrase/login, then
       please use the -u option.
    2. Times
       In order to find files and messages in the given period and to
       parse the -f and -t options, hb_report uses perl and one of the
       Date::Parse or Date::Manip perl modules. Note that you need only
       one of these. Furthermore, on nodes which have no logs and where
       you don't run hb_report directly, no date parsing is necessary. In
       other words, if you run this on a loghost then you don't need these
       perl modules on the cluster nodes.
       On rpm based distributions, you can find Date::Parse in
       perl-TimeDate and on Debian and its derivatives in
       libtimedate-perl.
    3. Core dumps
       To backtrace core dumps, gdb and the Heartbeat packages with
       debugging info are needed. The debug info packages may be installed
       at the time the report is created. Let's hope that you will need
       this only rarely.
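
   The three prerequisites above can be checked before the first run. The
   following is only a sketch: the function names are made up for
   illustration and are not part of hb_report, and the node names are
   whatever you pass as arguments. BatchMode=yes is a standard OpenSSH
   option which makes ssh fail instead of prompting for a password.

```shell
# Succeeds if at least one of the two perl date modules can be loaded.
have_date_module() {
    perl -MDate::Parse -e 1 2>/dev/null ||
        perl -MDate::Manip -e 1 2>/dev/null
}

# Succeeds if ssh to node $1 works without any prompt.
have_passwordless_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

# Print one line per prerequisite; arguments are the nodes to test.
check_prerequisites() {
    if have_date_module; then
        echo "date module: ok"
    else
        echo "date module: missing"
    fi
    if command -v gdb >/dev/null 2>&1; then
        echo "gdb: ok"
    else
        echo "gdb: missing (only needed for core dumps)"
    fi
    for node in "$@"; do
        if have_passwordless_ssh "$node"; then
            echo "ssh $node: ok"
        else
            echo "ssh $node: failed"
        fi
    done
}
```

   For example, check_prerequisites node1 node2 run on the log host
   prints one status line for the date modules, one for gdb, and one per
   node.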

What is in the report

    1. Heartbeat related
          + heartbeat version/release information
          + heartbeat configuration (CIB, ha.cf, logd.cf)
          + heartbeat status (output from crm_mon, crm_verify, ccm_tool)
          + pengine transition graphs (if any)
          + backtraces of core dumps (if any)
          + heartbeat logs (if any)
    2. System related
          + general platform information (uname, arch, distribution)
          + system statistics (uptime, top, ps, netstat -i, arp)
    3. User created :)
          + problem description (template to be edited)
    4. Generated
          + problem analysis (generated)

   It is preferred that Heartbeat is running at the time of the report,
   but not absolutely required. hb_report will also do a quick analysis of
   the collected information.

Times

   Specifying times can at times be a nuisance. That is why we have chosen
   to use one of the perl modules--they do allow a certain freedom when it
   comes to dates. You can either read the instructions at the
   [2]Date::Parse examples page or just rely on common sense and try stuff
   like:
3:00          (today at 3am)
15:00         (today at 3pm)
2007/9/1 2pm  (September 1st at 2pm)

   hb_report will (probably) complain if it can't figure out what you
   mean.

   Try to delimit the event as closely as possible in order to reduce the
   size of the report, while still leaving a minute or two around it for
   good measure.

   Note that the -f option is mandatory. And don't forget to quote dates
   when they contain spaces.
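
   Why the quoting matters: without quotes the shell splits the date into
   two arguments, and the second half ends up where the destination is
   expected. A small demonstration follows; count_args is a throwaway
   helper invented for this example and only reports how many arguments
   it was given:

```shell
# Hypothetical helper: print the number of arguments received.
count_args() {
    echo "$#"
}

# Quoted: the date stays one argument, so there are three in total.
count_args -f "2007/9/1 2pm" /tmp/report

# Unquoted: the date is split at the space, giving four arguments.
count_args -f 2007/9/1 2pm /tmp/report
```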

   It is also possible to extract a CTS test. Just prefix the test number
   with cts: in the -f option.

Should I send all this to the rest of the Internet?

   We make an effort to remove sensitive data from the Heartbeat
   configuration (CIB, ha.cf, and transition graphs). However, you have to
   tell us what is sensitive! Use the -p option to specify additional
   regular expressions to match variable names which may contain
   information you don't want to leak. For example:
# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

   We look by default for variable names matching "pass.*" and the
   stonith_host ha.cf directive.

   Logs and other files are not filtered. Please filter them yourself if
   necessary.
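
   Since logs are not filtered, one way to do it yourself is a sed pass
   over the log before sharing it. The sketch below is not what hb_report
   does internally; mask_secrets is a made-up helper and the pattern is an
   assumption modelled on the default "pass.*" match. It only handles
   name=value text, so adapt it to whatever your logs actually contain.

```shell
# Hypothetical helper: read text on stdin and replace the value in
# name=value pairs whose name starts with "pass" by four asterisks.
mask_secrets() {
    sed -E 's/(pass[A-Za-z0-9_]*=)[^[:space:]"]+/\1****/g'
}

# Usage (paths are examples only):
#   mask_secrets < /tmp/report/ha-log > /tmp/report/ha-log.filtered
```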

Logs

   It may be tricky to find syslog logs. The scheme used is to log a
   unique message on all nodes and then look it up in the usual syslog
   locations. This procedure is not foolproof, in particular if the syslog
   files are in a non-standard directory. We look in /var/log /var/logs
   /var/syslog /var/adm /var/log/ha /var/log/cluster. In case we can't
   find the logs, please supply their location:
# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1

   If you have different log locations on different nodes, well, perhaps
   you'd like to make them the same and make life easier for everybody.
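
   The lookup scheme described above (log a unique message, then search
   the usual locations for it) can be imitated by hand when you need to
   find out where syslog writes. This is a sketch under assumptions, not
   hb_report's own code: find_marked_logs is an invented name, and the
   mark text is whatever you choose to log.

```shell
# Hypothetical helper: print every regular file in the given directories
# which contains the mark. Usage: find_marked_logs MARK DIR...
find_marked_logs() {
    mark=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/*; do
            [ -f "$f" ] || continue
            if grep -qsF -- "$mark" "$f"; then
                printf '%s\n' "$f"
            fi
        done
    done
    return 0
}

# Usage (left as comments; logger writes to the real syslog):
#   mark="logmark-$$-$(date +%s)"
#   logger "$mark"
#   sleep 2
#   find_marked_logs "$mark" /var/log /var/logs /var/syslog /var/adm \
#       /var/log/ha /var/log/cluster
```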

   The log files are collected from all hosts where found. In case your
   syslog is configured to log to both the log server and local files and
   hb_report is run on the log server, you will end up with multiple logs
   with the same content.

   Files starting with "ha-" are preferred. In case syslog sends messages
   to more than one file, if one of them is named ha-log or ha-debug those
   will be favoured over syslog or messages.

   If there is no separate log for Heartbeat, possibly unrelated messages
   from other programs are included. We don't filter logs, just pick a
   segment for the period you specified.

   NB: Don't have a central log host? Read the CTS README and set one up.

Manual report collection

   So, your ssh doesn't work. In that case, you will have to run this
   procedure on all nodes. Use -S so that we don't bother with ssh:
# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

   If you also have a log host which is not in the cluster, then you'll
   have to copy the log to one of the nodes and tell us where it is:
# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1

   Furthermore, to prevent hb_report from asking you to edit the report to
   describe the problem on every node, use -D on all but one:
# hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1
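
   With more than two or three nodes it is easy to lose track of which
   command goes where. The sketch below only prints a checklist of the
   per-node commands, using -S on the first node and -DS on the rest; it
   does not run anything. The function name and the /tmp/report_NODE
   naming are assumptions taken from the examples above.

```shell
# Hypothetical helper: print the command to run on each node.
# Usage: print_manual_commands FROM TO NODE...
print_manual_commands() {
    from=$1
    to=$2
    shift 2
    opts=-S
    for node in "$@"; do
        printf '%s# hb_report -f "%s" -t "%s" %s /tmp/report_%s\n' \
            "$node" "$from" "$to" "$opts" "$node"
        # Only the first node keeps the description prompt.
        opts=-DS
    done
}

# Usage:
#   print_manual_commands 5:20pm 5:30pm node1 node2 node3
```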

   If you reconsider and want the ssh setup, take a look at the CTS README
   file for instructions.

Analysis

   The point of analysis is to extract the most important information from
   what is probably several thousand lines of text. Perhaps this should
   more properly be called report review, as it is rather simple, but
   let's pretend that we are doing something utterly sophisticated.

   The analysis consists of the following:
     * compare files coming from different nodes; if they are equal, make
       one copy in the top level directory, remove duplicates, and create
       soft links instead
     * print errors, warnings, and lines matching -L patterns from logs
     * report if there were coredumps and by whom
     * report crm_verify results
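
   The first step in the list above, replacing identical per-node files
   with one top-level copy and soft links, can be sketched as follows.
   This is an illustration of the idea and not the code hb_report uses;
   dedup_file is an invented name, and the layout (one subdirectory per
   node under the report directory) is an assumption.

```shell
# Hypothetical helper. Usage: dedup_file REPORT_DIR FILE NODE...
# If FILE is identical on all nodes, keep one copy in REPORT_DIR and
# replace the per-node copies by relative soft links. Returns non-zero
# and changes nothing if the copies differ or one is missing.
dedup_file() {
    top=$1
    name=$2
    shift 2
    first=$1
    [ -f "$top/$first/$name" ] || return 1
    for node in "$@"; do
        cmp -s "$top/$first/$name" "$top/$node/$name" || return 1
    done
    cp "$top/$first/$name" "$top/$name" || return 1
    for node in "$@"; do
        rm "$top/$node/$name" && ln -s "../$name" "$top/$node/$name"
    done
}

# Usage:
#   dedup_file /tmp/report ha.cf node1 node2
```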

The goods

    1. Common
          + ha-log (if found on the log host)
          + description.txt (template and user report)
          + analysis.txt
    2. Per node
          + ha.cf
          + logd.cf
          + ha-log (if found)
          + cib.xml (cibadmin -Ql or cp if Heartbeat is not running)
          + ccm_tool.txt (ccm_tool -p)
          + crm_mon.txt (crm_mon -1)
          + crm_verify.txt (crm_verify -V)
          + pengine/ (only on DC, directory with pengine transitions)
          + sysinfo.txt (static info)
          + sysstats.txt (dynamic info)
          + backtraces.txt (if coredumps found)
          + DC (well...)
          + RUNNING or STOPPED

   Last updated 29-Nov-2007 16:12:02 CEST

References

   1. mailto:dmuhamedagic@suse.de
   2. http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES
