High availability, load balancing, and clustering are inter-related topics that involve a mix of IdP deployment choices and additional components and configuration, depending on the intended goals. This page discusses all of these features and some of the options available for achieving them.
Within this document, the following terms are used:
node - a single IdP instance
cluster - a collection of nodes
high availability - the ability for nodes to fail without loss of existing operational data (e.g. sessions)
failover - the ability to detect a node failure and redirect work destined for a failed node to an operational node
load balancing - the ability to (relatively) evenly distribute the cluster's workload amongst all nodes
Note that the number of nodes within a cluster need not correlate to the number of servers (physical or virtualized) within the cluster. Each server may run more than one instance of the IdP software and thus count as more than one node. Also note that high availability, failover, and load balancing are distinct features and not all solutions provide all features.
Finally, be aware that failover and load balancing are completely outside the control of the IdP. These require some mechanism (some are described below) for routing traffic before it even reaches a node.
Application Architecture and Effects on Clustering
The IdP is a stateful application. That is, it maintains operational information between requests that is required to answer subsequent requests. There are a number of different sets of state information, some specific to a particular client device, some spanning all clients, and data that varies in temporal scope (i.e., long term state vs. shorter term state). Some of the most common examples are:
Spring Web Flow conversational state during profile request processing (essentially a single login, query, or other operation)
An "IdP session" capturing authentication results (so they can be reused for SSO) and optionally tracking services for logout
Message replay cache
SAML artifact store
CAS ticket store
The first few are examples of per-client state that, subject to implementation constraints, can potentially be stored in the client. Consent storage can be (and is) supported client-side, but is very limited in space and utility there. The others are examples of cross-client state that by definition have to be managed on the server node (or a data store attached to each node). In every deployment, then, there can be a mix of state with different properties.
The first bullet above is an exceptional case because it represents state that is implemented by software other than the IdP itself, namely Spring Web Flow. Most web flow executions, specifically those involving views, require a stateful conversation between the client and server. This state is managed in a SWF component called a "flow execution repository", and by default this repository is implemented in-memory within the Java container and state is tracked by binding each flow execution to the container session (the one typically represented by the JSESSIONID cookie).
So, out of the box, the IdP software requires that flows involving views start and finish on the same node, the most common example being a login that requires a form prompt or redirect.
While some containers do have the capability to serialize session state across restarts or replicate sessions between nodes, and Spring Web Flow is able to leverage that mechanism, the IdP does not support that because the objects it stores in the session are not required to be "serializable" in the formal Java sense of that term. This greatly simplifies the development of the software, but makes clustering harder.
At present, there is no solution provided to replicate the per-request conversational state. This means that 100% high availability is not supported; a failed node will disrupt any requests that are in the midst of being processed by the node. It also means that some degree of session "stickiness" is required. Clients must have some degree of node affinity so that requests will continue to go to a single node for the life of a request. This was always strongly encouraged for performance reasons, but is now formally required.
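The node affinity requirement can be illustrated with a small sketch of a front-end router. This is not part of the IdP and not a model of any particular load balancer; the class name, node names, and client identifiers are hypothetical. It pins each client to one node, and only moves the client when that node is marked as failed.

```python
import hashlib


class StickyRouter:
    """Toy front-end router: pins each client to one node, fails over if it is down."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # all configured nodes (hypothetical names)
        self.down = set()          # nodes currently marked as failed
        self.affinity = {}         # client id -> node the client is pinned to

    def _pick(self, client_id):
        # Hash the client id so new clients spread roughly evenly over live nodes.
        live = [n for n in self.nodes if n not in self.down]
        if not live:
            raise RuntimeError("no operational nodes")
        digest = hashlib.sha256(client_id.encode("utf-8")).digest()
        return live[int.from_bytes(digest[:4], "big") % len(live)]

    def route(self, client_id):
        node = self.affinity.get(client_id)
        if node is None or node in self.down:
            # First request, or the pinned node failed: choose an operational node.
            node = self._pick(client_id)
            self.affinity[client_id] = node
        return node

    def mark_down(self, node):
        self.down.add(node)


router = StickyRouter(["idp1", "idp2", "idp3"])
pinned = router.route("client-a")   # later requests from client-a go to the same node
```

After `router.mark_down(pinned)`, the next call to `router.route("client-a")` returns a different node. Any login that was in progress on the failed node would have to start over, since its conversational state existed only there.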
All other state in the IdP falls into a second category, that of "non-conversational" data that the IdP stores and manages itself. The majority of this data is read and written using the org.opensaml.storage.StorageService API. Any implementation of this API is able to handle storage for a wide range of purposes within the IdP.
Not every use case involving a StorageService can use any implementation interchangeably because of other considerations. The most common example is that some state cannot be stored in a client at all, and other state may not fit in cookies, which have draconian size limits. For example, the replay cache and SAML artifact stores require a server-side implementation because the data is not specific to a single client, and the tracking of services for logout requires too much space for cookies.
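To make the cookie size constraint concrete, here is a rough sketch that estimates how large a piece of session state becomes once serialized. The 4096-byte limit, the JSON/base64 encoding, and the field names are assumptions made for illustration. The IdP's real encoding differs, and securing the data adds further overhead.

```python
import base64
import json

COOKIE_LIMIT_BYTES = 4096  # common per-cookie browser limit; an assumption of this sketch


def encoded_size(state):
    """Size in bytes of the state once serialized to JSON and base64-encoded."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return len(base64.urlsafe_b64encode(raw))


def fits_in_cookie(state, limit=COOKIE_LIMIT_BYTES):
    return encoded_size(state) <= limit


# A bare authentication result is small...
small = {"principal": "jdoe", "authn": ["password"]}

# ...but tracking one record per service for logout grows quickly.
large = dict(small, sp_sessions=[
    {"sp": "https://sp%03d.example.org/shibboleth" % i, "id": "_" + "a" * 32}
    for i in range(100)
])
```

Here `fits_in_cookie(small)` is true, while `fits_in_cookie(large)` is false. That gap is the reason logout tracking is described above as needing more space than cookies offer.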
Storage Service Implementations
At present the software includes the following storage service implementations:
in-memory using a hashtable
client-side using secured cookies and HTML5 Local Storage
JPA/Hibernate, using a relational database
memcache
The first two are configured automatically after installation and are both used for various purposes by default. The last two require special configuration (and obviously additional software with its own impact on clustering) to use.
Excluding user credentials and user attribute data more generally, there is one exceptional case of data that may be managed by the IdP but is not managed by the unified StorageService API discussed above.
By default, the strategy used to generate "persistent", pair-wise identifiers for use in SAML assertions is based on salted hashing of a user attribute, and does not store any data.
An alternative strategy available relies on a JDBC connection to a relational database with a specialized table layout (one that is compatible with the StoredID connector plugin provided in older versions). The requirements of this use case make it impractical to leverage the more generic StorageService API, but the IdP is extensible to other approaches to handling this data.
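The default salted-hash strategy matters for clustering because every node can compute the same identifier independently, with nothing to replicate. The sketch below illustrates that property only. The input layout, the "!" separator, and the use of SHA-256 are assumptions, not a description of the IdP's actual algorithm, and the function name is hypothetical.

```python
import base64
import hashlib


def computed_id(sp_entity_id, source_value, salt):
    """Derive a stable, per-SP identifier from a user attribute and a secret salt."""
    # Assumed layout for illustration: SP name, attribute value, then the salt.
    material = "!".join([sp_entity_id, source_value]).encode("utf-8") + b"!" + salt
    return base64.b64encode(hashlib.sha256(material).digest()).decode("ascii")


salt = b"example-salt-shared-by-all-nodes"
id_for_sp1 = computed_id("https://sp1.example.org", "jdoe", salt)
id_for_sp2 = computed_id("https://sp2.example.org", "jdoe", salt)
```

Any node holding the same salt and the same source attribute produces the same value for a given service, while two services receive different values for the same user.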
Below are the most common methods for creating a cluster of nodes that look like one single service instance to the world at large.
Hardware- or Software-Based Clustering
The intended approach is to rely on special hardware or software designed to intercept and route traffic to the various nodes in a cluster (so the hardware or software basically becomes a networking switch in front of the nodes). This switch is then given the hostname(s) of all the services provided by the cluster behind it.
Advantages:
Guaranteed and flexible high-availability, load-balancing, and failover characteristics
Fine-grained control over node activation and deactivation, making online maintenance simple
Disadvantages:
More difficult to set up
Requires purchase of equipment (some solutions can be very costly)
Adds additional hardware/software configuration
Because of the guaranteed characteristics provided by this solution, we recommend this approach. Caution should be taken to ensure that the load balancing hardware does not become a single point of failure (i.e., one needs to buy and run two of them as well as addressing network redundancy).
DNS Round Robin
DNS round robin means that each cluster node is registered in DNS under the same hostname. When a DNS lookup is performed for that hostname, the DNS server returns a list of IP addresses (one for each active node) and the client chooses which one to contact.
Advantages:
Easy to set up
Disadvantages:
No guarantee of failover characteristics. If a client chooses an IP from the list and then continues to stick with that address, even if a node is unreachable, the service will appear as unavailable for that client until their DNS cache expires.
No guarantee of load-balancing characteristics. Because the client is choosing which node to contact, clients may "clump up" on a single node. For example, if the client's method of choosing which node to use is to pick the one with the lowest IP address, all requests would end up going to one node (this is an extreme example, and a real client is unlikely to behave this way).
If a client randomly chooses an IP from the list for each request (unlikely, but not disallowed), then the requests will fail if they depend on conversational state.
This approach cannot be used to run multiple nodes on the same IP address since DNS does not include port information.
We strongly discourage this approach. It is mentioned only for completeness.
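The load-balancing concern with DNS round robin can be shown with a short simulation. The addresses come from a documentation-only range, and the two client strategies are hypothetical extremes rather than the behaviour of any real resolver.

```python
import ipaddress
import random
from collections import Counter

NODE_ADDRESSES = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]  # documentation-range IPs


def lowest_address(addresses, rng):
    # Extreme strategy: always use the numerically lowest address (rng is unused).
    return min(addresses, key=ipaddress.ip_address)


def random_address(addresses, rng):
    # Friendlier strategy: pick any address from the answer at random.
    return rng.choice(addresses)


def simulate(strategy, clients=3000, seed=1):
    """Count how many clients land on each node under a client-side strategy."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(clients):
        answer = NODE_ADDRESSES[:]
        rng.shuffle(answer)   # the order of the returned list may vary per lookup
        hits[strategy(answer, rng)] += 1
    return hits
```

With `simulate(lowest_address)`, every client lands on 192.0.2.10 regardless of the order in which the addresses were returned. With `simulate(random_address)`, the load spreads across all three nodes. The DNS server controls neither outcome, which is why no load-balancing guarantee can be made.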
Default IdP Configuration
By default, the IdP uses the following strategies for managing its state:
The message replay cache and SAML artifact store use an in-memory StorageService bean.
The IdP session manager uses a cookie- and HTML5 Local Storage-based StorageService bean (with session cookies) and does track SP sessions for logout.
The CAS support relies on a ticket service that produces encrypted and self-recoverable ticket strings to avoid the need for clustered storage, though this can sometimes break older CAS clients due to string length.
The Local Storage use and logout defaults are applicable to new installs, and not systems upgraded from V3.
The client-side StorageServices used in the default configuration use a secret key to secure the cookies and storage blobs, and this key needs to be carefully protected and managed. Simple tools to manage the secret key are provided.
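One practical consequence for clustering is that state written to the client by one node can only be read by another node that holds the same key. The sketch below demonstrates this with a simple HMAC signature from the Python standard library. It is a deliberately simplified stand-in and not the IdP's mechanism: it only signs the data and does not encrypt it, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def seal(payload, key):
    """Return payload plus an HMAC-SHA256 tag, encoded for use in a cookie value."""
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode("ascii")


def unseal(value, key):
    """Return the payload if the tag verifies under this key, otherwise None."""
    raw = base64.urlsafe_b64decode(value.encode("ascii"))
    tag, payload = raw[:32], raw[32:]   # SHA-256 tags are 32 bytes long
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None


shared_key = b"k" * 32   # placeholder only; real keys must be random and protected
other_key = b"x" * 32
cookie = seal(b"session-state", shared_key)
```

A node calling `unseal(cookie, shared_key)` recovers the state, while a node configured with `other_key` gets `None` and would treat the client as having no session.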
These defaults mean that, out of the box, the IdP itself is easily clusterable with the most critical data stored in the client and the rest designed to be transient, making it simple to deploy any number of nodes without additional software. This does not address the need to make authentication and attribute sources redundant, of course, as these are outside the scope of the IdP itself. The consent features are also quite limited in utility, but are at least usable without deploying a database, though a database is still assumed for real-world use of the feature.
Provided some form of load balancing and failover routing is available from the surrounding environment (see above), this provides a baseline degree of failover and high availability out of the box (with the caveat that high availability is limited to recovery of session state between nodes, but not mid-request), scaling to any number of nodes.
Replay detection is limited, of course, to a per-node cache.
SAML 1.1 artifact use is not supported if more than one node is deployed, because that requires a global store accessible to all nodes.
SAML 2.0 artifact use is not supported by default if more than one node is deployed, but it is possible to make that feature work with additional configuration (discussion TBD).
Combining these missing features with clustering requires the use of alternative StorageService implementations (e.g., memcache, JPA/Hibernate, or something else). The defaults can in part be overridden via the idp.replayCache.StorageService and idp.artifact.StorageService properties (and others). A more complete discussion of these options can be found in the StorageConfiguration topic.
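The per-node replay cache limitation noted above can be sketched as follows. The class is a toy model, not the IdP's replay cache, and the `shared` instance merely stands in for any store that every node can reach.

```python
class NodeReplayCache:
    """Toy in-memory replay cache that only knows about messages its own node saw."""

    def __init__(self):
        self._seen = {}   # message id -> expiration time (seconds)

    def check(self, message_id, now, ttl=300):
        """Return True if the message is new to this cache, False if it is a replay."""
        # Forget entries whose validity window has passed.
        self._seen = {m: exp for m, exp in self._seen.items() if exp > now}
        if message_id in self._seen:
            return False
        self._seen[message_id] = now + ttl
        return True


node_a, node_b = NodeReplayCache(), NodeReplayCache()
shared = NodeReplayCache()   # stands in for a store reachable from every node

node_a.check("_msg1", now=0)        # True: first sighting on node A
node_a.check("_msg1", now=10)       # False: node A detects the replay
replayed_elsewhere = node_b.check("_msg1", now=10)   # True: node B never saw it
```

With separate caches, a message replayed to a different node goes undetected. If both nodes consulted `shared` instead, the second check would return `False` no matter which node handled it.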