# [Pvfs2-cvs] commit by pcarns in pvfs2-1/doc: pvfs2-ha-heartbeat-v2.tex

CVS commit program cvs at parl.clemson.edu
Thu Apr 24 12:18:59 EDT 2008

Update of /projects/cvsroot/pvfs2-1/doc
In directory parlweb1:/tmp/cvs-serv30827/doc

Modified Files:
pvfs2-ha-heartbeat-v2.tex
Log Message:
simplifying and updating the pvfs heartbeat/failover document:
- use file system labels rather than device files for SAN mounts
- switch to simpler IPMI example stonith configuration
- move hardware specific example scripts to a separate subdirectory from the
scripts needed for a basic configuration

Index: pvfs2-ha-heartbeat-v2.tex
===================================================================
RCS file: /projects/cvsroot/pvfs2-1/doc/pvfs2-ha-heartbeat-v2.tex,v
diff -p -u -r1.1 -r1.2
--- pvfs2-ha-heartbeat-v2.tex	7 Nov 2007 21:54:12 -0000	1.1
+++ pvfs2-ha-heartbeat-v2.tex	24 Apr 2008 16:18:59 -0000	1.2
@@ -16,8 +16,8 @@

-\title{PVFS2 High-Availability Clustering using Heartbeat 2.0}
-\date{2007}
+\title{PVFS High Availability Clustering using Heartbeat 2.0}
+\date{2008}

\pagestyle{plain}
\begin{document}
@@ -37,39 +37,26 @@

\section{Introduction}

-This document describes how to configure PVFS2 for high availability
-using Heartbeat version 2.x from www.linux-ha.org.  See pvfs2-ha.tex for
-documentation on how to configure PVFS2 for high availability using
-Heartbeat version 1.x.
-
-Heartbeat 2.x offers several improvements.  First of all, it allows for
-an arbitrary cluster size.  The servers do not have to be paired up for
-failover.  For example, if you configure 16 servers and one of them
-fails, then any of the remaining 15 can serve as the failover machine.
-
-Secondly, Heartbeat 2.x supports monitoring of resources.  Examples of
-resources that you may want to actively monitor include the PVFS2 server
-daemon, the IP interface, and connectivity to storage hardware.
-
-Finally, Heartbeat 2.x includes a configuration mechanism to express
-dependencies between resources.  This can be used to
-express a preference for where certain servers run within the cluster,
-or to enforce that resources need to be started or stopped in a specific
-order.
-
-This document describes how to set up PVFS2 for high availability with
-an arbitrary number of active servers and an arbitrary number of passive spare
-nodes.  Spare nodes are not required unless you wish to avoid
-performance degradation upon failure.  As configured in this document,
-PVFS2 will be able to tolerate $\lceil ((N/2)-1) \rceil$ node failures,
-where N is the number of nodes present in the Heartbeat cluster
-including spares.  Over half of the nodes
-must be available in order to reach a quorum and decide if another node has
-failed.
+This document describes how to configure PVFS for high availability
+using Heartbeat version 2.x from www.linux-ha.org.

-No modifications of PVFS2 are required.  Example scripts referenced in
+The combination of PVFS and Heartbeat can support an arbitrary number
+of active server nodes and an arbitrary number of passive spare nodes.
+Spare nodes are not required unless you wish to avoid performance
+degradation upon failure.  As configured in this document, PVFS will
+be able to tolerate $\lceil ((N/2)-1) \rceil$ node failures, where N is
+the number of nodes present in the Heartbeat cluster including spares.
+Over half of the nodes must be available in order to reach a quorum and
+decide if another node has failed.
+
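The failure-tolerance rule above is easy to sanity-check. The following short sketch (plain Python; the helper name is ours, not part of PVFS or Heartbeat) computes how many node failures a cluster of a given size can survive while more than half of its nodes remain alive:

```python
import math

def tolerated_failures(n_nodes):
    """Failures a Heartbeat cluster can survive while a majority
    quorum (more than half of the nodes) remains: ceil(N/2) - 1."""
    return math.ceil(n_nodes / 2) - 1

# The 5-node example in this document (4 active + 1 spare):
print(tolerated_failures(5))   # 2: three surviving nodes still form a quorum
print(tolerated_failures(3))   # 1: the smallest cluster that tolerates a failure
```

For the 16-node configuration mentioned later in this section, the same formula gives 7 tolerated failures.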
+Heartbeat can be configured to monitor IP connectivity, storage hardware
+connectivity, and responsiveness of the PVFS daemons.  Failure of any of
+these components will trigger a node level failover event.  PVFS clients
+will transparently continue operation following a failover.
+
+No modifications of PVFS are required.  Example scripts referenced in
this document are available in the \texttt{examples/heartbeat} directory of
-the PVFS2 source tree.
+the PVFS source tree.

\section{Requirements}

@@ -77,60 +64,57 @@ the PVFS2 source tree.

\subsubsection{Nodes}

-Any number of nodes may be configured, although you need at least 3 in
-order to tolerate a failure.  See the explanation in the introduction of
-this document.  You may also use any number of spare nodes.  A spare
-node is a node that does not run any services until a failover occurs.
-If you have one or more spares, then they will be selected first to run
-resources in a failover situation.  If you have no spares (or all spares
-are exhausted), then at least one node will have to run two services
-
-The examples in this document will use 4 active nodes and one spare
-node.
+Any number of nodes may be configured, although you need at least three
+total in order to tolerate a failure.  You may also use any number of
+spare nodes.  A spare node is a node that does not run any services until
+a failover occurs.  If you have one or more spares, then they will be
+selected first to run resources in a failover situation.  If you have
+no spares (or all spares are exhausted), then at least one node will
+have to run two services simultaneously, which may degrade performance.
+
+The examples in this document will use 4 active nodes and 1 spare node,
+for a total of 5 nodes.  Heartbeat has been tested with up to 16 nodes
+in configurations similar to the one outlined in this document.

\subsubsection{Storage}

-The specific type of storage hardware is not important, but it must be
-possible to allocate a separate block device to each server, and all
-servers must be capable of accessing all block devices.
-
-One way of achieving this is by using a SAN.  In the examples used in
-this document, the SAN has been divided into 4 LUNs.  Each of the 5
-servers in the cluster is capable of mounting all 4 LUNs.  However, the
-same LUN should never be mounted on two nodes simultaneously.  This
-document assumes that each block device is formatted using ext3.
-The Heartbeat software will insure that a given LUN is mounted in only
-one location at a time.
-
-It is also important that the device naming be consistent across all
-nodes.  For example, if node1 mounts /dev/fooa, then it should see the
-same data as if node2 were to mount /dev/fooa.  Likewise for /dev/foob,
-etc.
+A shared storage device is required.  The storage must be configured
+to allocate a separate block device to each PVFS daemon, and all nodes
+(including spares) must be capable of accessing all block devices.
+
+One way of achieving this is by using a SAN.  In the examples used in this
+document, the SAN has been divided into 4 LUNs.  Each of the 5 nodes in
+the cluster is capable of mounting all 4 LUNs.  The Heartbeat software
+will ensure that a given LUN is mounted in only one location at a time.
+
+Each block device should be formatted with a local file system.
+This document assumes the use of ext3.  Unique labels should be set on
+each file system (for example, using the \texttt{-L} argument to
+\texttt{mke2fs} or \texttt{tune2fs}).  This will allow the block devices
+to be mounted by consistent labels regardless of how Linux assigns
+device file names to the devices.

\subsubsection{Stonith}

-Heartbeat needs some mechanism to fence or stonith a failed node.  One
-straightforward way to do this is to connect each server node to a
-network controllable power strip.  That will allow any given server to
-send a command over the network to power off another server.
+Heartbeat needs some mechanism to fence or stonith a failed node.
+Two popular ways to do this are either to use IPMI or a network
+controllable power strip.  Each node needs to have a mechanism available
+to reset any other node in the cluster.  The example configuration in
+this document uses IPMI.

-It is possible to configure PVFS2 and Heartbeat without a power control
+It is possible to configure PVFS and Heartbeat without a power control
device.  However, if you deploy this configuration for any purpose other
than evaluation, then you run a very serious risk of data
corruption.   Without stonith, there is no way to guarantee that a
failed node has completely shutdown and stopped accessing its
storage device before failing over.

-The example in this document is using an APC switched PDU (which allows
-commands to be sent via SNMP or ssh) as the power control device.
-
\subsection{Software}

-This document assumes that you are using Hearbeat version 2.0.8, and
-PVFS2 version 2.6.x or greater.  You may also wish to use example
-scripts included in the \texttt{examples/heartbeat} directory of the PVFS2 source
-tree.
+This document assumes that you are using Heartbeat version 2.1.3,
+and PVFS version 2.7.x or greater.  You may also wish to use example
+scripts included in the \texttt{examples/heartbeat} directory of the
+PVFS source tree.

\subsection{Network}

@@ -139,40 +123,45 @@ used with Heartbeat.  First of all, you
address to use for communication within the cluster nodes.

Secondly, you need to allocate an extra IP address and hostname for each
-active PVFS2 server.  In the example that this document uses, we must
-allocate 4 extra IP addresses, along with 4 hostnames in DNS
-for those IP addresses.  In this document, we will refer to these as
-``virtual addresses''.  Each active PVFS2 server will be configured
+PVFS daemon.  In the example that this document uses, we must allocate 4
+extra IP addresses, along with 4 hostnames in DNS for those IP addresses.
+In this document, we will refer to these as ``virtual addresses'' or
+``virtual hostnames''.  Each active PVFS server will be configured
to automatically bring up one of these virtual addresses to use for
communication.  If the node fails, then that IP address is migrated to
another node so that clients will appear to communicate with the same
server regardless of where it fails over to.  It is important that you
-not use the primary IP address of each node for this purpose.
+\emph{not} use the primary IP address of each node for this purpose.

In the example in this document, we use 225.0.0.1 as the multicast
-address, node\{1-5\} as the normal node hostnames, and
-virtualnode\{1-4\} as the virtual hostnames.
+address, node\{1-5\} as the normal node hostnames,
+virtual\{1-4\} as the virtual hostnames, and 192.168.0.\{1-4\} as the
+virtual addresses.
+
+Note that the virtual addresses must be on the same subnet as the true
+IP addresses of the nodes.

-\section{Configuring PVFS2}
+\section{Configuring PVFS}

+A PVFS file system configuration must be generated
+for use on each of the active nodes.

-There are a few points to consider when configuring PVFS2:
+There are a few points to consider when configuring PVFS:
\begin{itemize}
-\item Use the virtual addresses when specifying meta servers and I/O
+\item Use the virtual hostnames when specifying meta servers and I/O
servers
\item Synchronize file data on every operation (necessary for consistency on
failover)
\item Synchronize meta data on every operation (necessary for consistency on
-failover)
+failover).  Coalescing is allowed.
\item Use the \texttt{TCPBindSpecific} option (this allows multiple daemons to
-run on the same node if needed)
+run on the same node using different virtual addresses)
\item Tune retry and timeout values appropriately for your system.  This
may depend on how long it takes for your power control device to safely
shutdown a node.
\end{itemize}

-Figure~\ref{fig:pvfs2conf} shows one example of how to configure PVFS2.
+Figure~\ref{fig:pvfs2conf} shows one example of how to configure PVFS.
Only the parameters relevant to the Heartbeat scenario are shown.

\begin{figure}
@@ -191,36 +180,34 @@ Only the parameters relevant to the Hear
</Defaults>

<Aliases>
-        Alias virtualnode1_tcp3334 tcp://virtualnode1:3334
-        Alias virtualnode2_tcp3334 tcp://virtualnode2:3334
-        Alias virtualnode3_tcp3334 tcp://virtualnode3:3334
-        Alias virtualnode4_tcp3334 tcp://virtualnode4:3334
+        Alias virtual1_tcp3334 tcp://virtual1:3334
+        Alias virtual2_tcp3334 tcp://virtual2:3334
+        Alias virtual3_tcp3334 tcp://virtual3:3334
+        Alias virtual4_tcp3334 tcp://virtual4:3334
</Aliases>

<Filesystem>
...
<MetaHandleRanges>
-                Range virtualnode1_tcp3334 4-536870914
-                Range virtualnode2_tcp3334 536870915-1073741825
-                Range virtualnode3_tcp3334 1073741826-1610612736
-                Range virtualnode4_tcp3334 1610612737-2147483647
+                Range virtual1_tcp3334 4-536870914
+                Range virtual2_tcp3334 536870915-1073741825
+                Range virtual3_tcp3334 1073741826-1610612736
+                Range virtual4_tcp3334 1610612737-2147483647
</MetaHandleRanges>
<DataHandleRanges>
-                Range virtualnode1_tcp3334 2147483648-2684354558
-                Range virtualnode2_tcp3334 2684354559-3221225469
-                Range virtualnode3_tcp3334 3221225470-3758096380
-                Range virtualnode4_tcp3334 3758096381-4294967291
+                Range virtual1_tcp3334 2147483648-2684354558
+                Range virtual2_tcp3334 2684354559-3221225469
+                Range virtual3_tcp3334 3221225470-3758096380
+                Range virtual4_tcp3334 3758096381-4294967291
</DataHandleRanges>
<StorageHints>
TroveSyncMeta yes
TroveSyncData yes
-                CoalescingHighWatermark 1
-                CoalescingLowWatermark 1
</StorageHints>
</Filesystem>
\end{verbatim}
\end{scriptsize}
-\caption{Example \texttt{pvfs2-fs.conf} file}
+\caption{Example \texttt{fs.conf} file}
\label{fig:pvfs2conf}
\end{figure}
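The handle ranges in the figure simply divide the meta and data handle spaces evenly among the four virtual servers. A small sketch (a hypothetical helper, not part of the PVFS tools) reproduces the arithmetic:

```python
def handle_ranges(first, last, n_servers):
    """Split the inclusive handle range [first, last] evenly across
    n_servers, in the style of the MetaHandleRanges and
    DataHandleRanges entries in the example fs.conf."""
    size = (last - first + 1) // n_servers
    ranges = []
    start = first
    for i in range(n_servers):
        # Give any remainder to the last server so the range ends at `last`.
        end = last if i == n_servers - 1 else start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# Meta handles span 4..2147483647, data handles 2147483648..4294967291:
print(handle_ranges(4, 2147483647, 4)[0])             # (4, 536870914)
print(handle_ranges(2147483648, 4294967291, 4)[-1])   # (3758096381, 4294967291)
```

Each tuple corresponds to one `Range` line in the figure, so a configuration for a different number of servers can be generated the same way.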

@@ -231,13 +218,26 @@ start the Heartbeat service.
\section{Configuring storage}

Make sure that there is a block device allocated for each active server
-in the file system.  Format each one with ext3.  Do not create a PVFS2
+in the file system.  Format each one with ext3.  Do not create a PVFS
storage space yet, but you can create subdirectories within each file
system if you wish.

-Confirm that each block device can be mounted from every node, and that
-the device names are consistent.  Do this one node at a time.  Never mount
-the same block device concurrently on two or more nodes.
+Confirm that each block device can be mounted from every node using the
+file system label.  Do this one node at a time.  Never mount
+the same block device concurrently on two nodes.
+
+\section{Configuring stonith}
+
+Make sure that your stonith device is accessible and responding from each
+node in the cluster.  For the IPMI stonith example used in this document,
+this means confirming that \texttt{ipmitool} is capable of monitoring
+each node.  Each node will have its own IPMI IP address, username, and
+password.
+
+\begin{verbatim}
+$ ipmitool -I lanplus -U Administrator -P password -H 192.168.0.10 power status
+Chassis Power is on
+\end{verbatim}

\section{Distributing Heartbeat scripts}

@@ -245,13 +245,7 @@ The scripts that are in the \texttt{exam
installed to the following suggested locations on each server node:

\begin{itemize}
\item pvfs2-ha-heartbeat-configure.sh: /usr/bin
-\item apc*: /usr/bin
-\item baytech*: /usr/bin
-\item qla*: /usr/bin
\item PVFS2: /usr/lib/ocf/resource.d/external/
-\item PVFS2-notify: /usr/lib/ocf/resource.d/external
-\item Filesystem-qla-monitor: /usr/lib/ocf/resource.d/external
-\item pvfs2-stonith-plugin: /usr/lib/stonith/plugins/external
\end{itemize}

\section{Base Heartbeat configuration}

@@ -300,8 +294,7 @@ It is possible to start the Heartbeat se
CIB, but it is simpler to begin with a populated XML file on all
nodes.

\texttt{cib.xml.example} provides an example of a fully populated
-Heartbeat configuration with 5 nodes and 4 active PVFS2 servers.  It
-also includes some optional components for completeness.  Relevant
+Heartbeat configuration with 5 nodes and 4 active PVFS servers.  Relevant
portions of the XML file are outlined below.

This file should be modified to reflect your configuration, and then

@@ -337,37 +330,9 @@ and SAN mount point at the same time.  G
start or stop all associated resources for a node with one unified
command.

In the example \texttt{cib.xml}, there are 4 groups (server0 through
-server3).  These represent the 4 active PVFS2 servers that will run on
+server3).  These represent the 4 active PVFS servers that will run on
the cluster.

-\subsection{PVFS2-notify}
-
-The \texttt{PVFS2-notify} resources, such as \texttt{server0\_notify}, are
-used as a mechanism to send alerts when a server process fails over to
-another node.  This is provided by the \texttt{PVFS2-notify} script in
-the examples directory.
-
-The use of a notify resource is entirely optional and may be omitted.
-This particular script is designed to take four parameters:
-\begin{itemize}
-\item \texttt{firsthost}: name of the node that the server group should normally
-run on
-\item \texttt{fsname}: arbitrary name for the PVFS2 file system
-\item \texttt{conf\_dir}: location of notification configuration files
-\item \texttt{title}: component of the title for the notification
-\end{itemize}
-
-The \texttt{PVFS2-notify} script serves as an example for how one might
-implement a notification mechanism.  However, it is incomplete on its
-own.  This example relies on a secondary script called
-\texttt{fs-instance-alarm.pl} to send the actual notification.  For
-example, one could implement a script that sends an email when a failure
-occurs.  The \texttt{conf\_dir} parameter could be passed along to
-provide a location to read a configurable list of email addresses from.
-
-\texttt{fs-instance-alarm.pl} is not provided with this example or
-documentation.
-
\subsection{IPaddr}

The \texttt{IPaddr} resources, such as \texttt{server0\_address}, are

@@ -380,23 +345,24 @@ on your network.  See the network requir

The \texttt{Filesystem} resources, such as \texttt{server0\_fs}, are used
to describe the shared storage block devices that serve as back end storage
-for PVFS2.  This is where the PVFS2 storage space for each server will
-be created.  In this example, the device names are \texttt{/dev/fooa1}
-through \texttt{/dev/food1}.  They are each mounted on directories such
-as \texttt{/san\_mounta1} through \texttt{/san\_mountd1}.  Please note
+for PVFS.  This is where the PVFS storage space for each server will
+be created.  In this example, the devices are labeled \texttt{label0}
+through \texttt{label3}.  They are each mounted on directories such
+as \texttt{/san\_mount0} through \texttt{/san\_mount3}.  Please note
that each device should be mounted on a different mount point to allow
multiple \texttt{pvfs2-server} processes to operate on the same node without
-collision.
+collision.  The file system type can be changed to reflect the use of
+alternative underlying file systems.

-\subsection{PVFS2}
+\subsection{PVFS}

The \texttt{PVFS2} resources, such as \texttt{server0\_daemon}, are used
to describe each \texttt{pvfs2-server} process.  This resource is
provided by the PVFS2 script in the examples directory.  The parameters
to this resource are listed below:
\begin{itemize}
-\item \texttt{fsconfig}: location of PVFS2 fs configuration file
-\item \texttt{serverconfig}: location of PVFS2 server configuration file
+\item \texttt{fsconfig}: location of PVFS fs configuration file
\item \texttt{port}: TCP/IP port that the server will listen on (must match
server configuration file)
\item \texttt{ip}: IP address that the server will listen on (must match both the file

@@ -404,15 +370,15 @@ system configuration file and the IPAddr
\item \texttt{pidfile}: Location where a pid file can be written
\end{itemize}

-Also notice that there is a monitor operation associated with the PVFS2
+Also notice that there is a monitor operation associated with the PVFS
resource.  This will cause the \texttt{pvfs2-check-server} utility to be
triggered periodically to make sure that the \texttt{pvfs2-server} process is not only
-running, but is correctly responding to PVFS2 protocol requests.  This
+running, but is correctly responding to PVFS protocol requests.  This
allows problems such as hung \texttt{pvfs2-server} processes to be
treated as failure conditions.

Please note that the PVFS2 script provided in the examples will attempt
-to create a storage space for each server if it is not already present.
+to create a storage space on startup for each server if it is not already present.

\subsection{rsc\_location}

@@ -420,7 +386,7 @@ The \texttt{rsc\_location} constraints,
are used to express a preference for where each resource group
should run (if possible).  It may be useful for administrative purposes
to have the first server group default to run on the first node of your
cluster,
-etc.
+for example.  Otherwise the placement will be left up to Heartbeat.

\subsection{rsc\_order}

@@ -430,64 +396,17 @@ which resources must be started or stopp
organized into groups, but without ordering constraints, the resources
within a group may be started in any order relative to each other.
These constraints are necessary because a \texttt{pvfs2-server} process will not
-start properly if the IP address that it should listen on and the shared
-storage that it should use are not available yet.
+start properly until its IP address and storage are available.

-\subsection{pvfs2-stonith-plugin}
+\subsection{stonith}

-The \texttt{pvfs2-stonith-plugin} resource is an example of how to
-configure a stonith device for use in Heartbeat.  See the Heartbeat
-documentation for a list of officially supported devices.
-
-In this example, the stonith device is setup as a clone, which means
-that there are N identical copies of the resource (one per node).  This
-allows any node in the cluster to quickly send a stonith command if
-needed.
-
-The \texttt{pvfs2-stonith-plugin} is provided by a script in the
-examples directories.  It requires a parameter to specify the file
-system name, and a parameter to specify a configuration directory.  This
-plugin is not complete by itself, however.  It relies on three scripts
-to actually perform the stonith commands:
-\begin{itemize}
-\item \texttt{fs-power-control.pl}: used to send commands to control power to a
-node
-\item \texttt{fs-power-gethosts.pl}: used to print a list of nodes that can be
-controlled with this device
-\item \texttt{fs-power-monitor.pl}: used to monitor the stonith device and
-confirm that is available
-\end{itemize}
+The \texttt{external/ipmi} stonith device is used in this example.
+Please see the Heartbeat documentation for instructions on configuring
+other types of devices.
-These three stonith scripts are not provided with these examples.  They
-may need to be specifically implemented for your environment.  As an alternative,
-you can simply use one of the standard stonith devices that are
-supported by Heartbeat (see Heartbeat documentation for details).
-
-The following scripts provide lower level examples of how to control an APC power
-strip (via SNMP or SSH) or a Baytech power strip (via SSH):
-\begin{itemize}
-\item \texttt{apc-switched-pdu-hybrid-control.pl}
-\item \texttt{apc-switched-pdu-hybrid-monitor.pl}
-\item \texttt{baytech-mgmt-control.pl}
-\item \texttt{baytech-mgmt-monitor.pl}
-\end{itemize}
-
-One approach to implementing power control would be to use the
-pvfs2-stonith-plugin device script and write
-\texttt{fs-power\{control/monitor/gethosts\}} scripts that can parse
-configuration files describing your cluster and send appropriate
-commands to the above provided APC and Baytech control scripts.
-
-\subsection{SAN monitoring}
-
-The example CIB configuration does not use this feature, but an
-additional resource script has been included that modifies the
-\texttt{Filesystem} resource to allow it to monitor SAN connectivity.  This
-script is called \texttt{Filesystem-qla-monitor}.  It requires that the
-nodes use QLogic fibre channel adapters and EMC PowerPath
-software for SAN connectivity.  If this configuration is available, then this script can
-issue appropriate PowerPath commands periodically to confirm that there
-is connectivity between each node and its block device.
+There is one IPMI stonith device for each node.  The attributes for that
+resource specify which node is being controlled, and the username,
+password, and IP address of the corresponding IPMI device.

\section{Starting Heartbeat}

@@ -512,18 +431,18 @@ $ crm_mon -r

\section{Mounting the file system}

-Mounting PVFS2 with high availability is no different than mounting a
-normal PVFS2 file system, except that you must use the virtual hostname
-for the PVFS2 server rather than the primary hostname of the node.
+Mounting PVFS with high availability is no different than mounting a
+normal PVFS file system, except that you must use the virtual hostname
+for the PVFS server rather than the primary hostname of the node.
Figure~\ref{fig:mount} provides an example.

\begin{figure}
\begin{scriptsize}
\begin{verbatim}
-$mount -t pvfs2 tcp://virtualnode1:3334/pvfs2-fs /mnt/pvfs2 +$ mount -t pvfs2 tcp://virtual1:3334/pvfs2-fs /mnt/pvfs2
\end{verbatim}
\end{scriptsize}
-\caption{Mounting PVFS2 file system}
+\caption{Mounting PVFS file system}
\label{fig:mount}
\end{figure}

@@ -532,18 +451,18 @@ \$ mount -t pvfs2 tcp://virtualnode1:3334
The following example illustrates the steps that occur when a node fails:

\begin{enumerate}
-\item Node2 (which is running a \texttt{pvfs2-server} on the virtualnode2 IP
+\item Node2 (which is running a \texttt{pvfs2-server} on the virtual2 IP
address) fails
\item Client node begins timeout/retry cycle
-\item Heartbeat services running on remaining servers notice that node2
+\item Heartbeat services running on remaining nodes notice that node2
is not responding
-\item After a timeout has elapsed, remaining servers reach a quorum and
+\item After a timeout has elapsed, remaining nodes reach a quorum and
vote to treat node2 as a failed node
\item Node1 sends a stonith command to reset node2
\item Node2 either reboots or remains powered off (depending on nature
of failure)
\item Once the stonith command succeeds, node5 is selected to replace it
-\item The virtualnode2 IP address, mount point, and
+\item The virtual2 IP address, mount point, and
\texttt{pvfs2-server} service
are started on node5
\item Client node retry eventually succeeds, but now the network
@@ -565,6 +484,26 @@ without a true failure event.
particular resource group.
\item \texttt{crm\_verify}: can be used to confirm if the CIB
information is valid and consistent
+\end{itemize}
+
+
+The \texttt{examples/heartbeat/hardware-specific} directory contains
+additional example scripts for specific types of hardware:
+
+\begin{itemize}
+\item \texttt{pvfs2-stonith-plugin}: An example stonith plugin
+that can use an arbitrary script to power off nodes.  May be used (for
+example) with the \texttt{apc*} and \texttt{baytech*} scripts to control
+remote controlled power strips if the scripts provided by Heartbeat are
+not sufficient.
+\item \texttt{Filesystem-qla-monitor}: A modified version of the
+standard FileSystem resource that uses the \texttt{qla-monitor.pl}
+script to provide additional monitoring capability for QLogic fibre
+channel cards.
+\item \texttt{PVFS2-notify}: An example of a dummy resource that could
+be used to send notification when a server fails over to another node.
+\end{itemize}