
High availability, load balancing, and clustering are inter-related topics that involve a mix of IdP deployment choices and additional components and configuration, depending on the intended goals. This page discusses all of these features and some of the options available for achieving them.


Terminology

Within this document, the following terms are used:

...

The IdP is a stateful application. That is, it maintains operational information between requests that is required to answer subsequent requests. There are a number of different sets of state information, some specific to a particular client device, some spanning all clients, and data that varies in temporal scope (i.e., long term state vs. shorter term state). Some of the most common examples are:

Spring Web Flow conversational state during profile request processing (essentially a single login, query, or other operation)
An "IdP session" capturing authentication results (so they can be reused for SSO) and optionally tracking services for logout
Attribute release and terms of use ("consent") store
Message replay cache
SAML artifact store
CAS ticket store

The first few are examples of per-client state that, subject to implementation constraints, can potentially be stored in the client. Consent storage can be (and is) supported client-side, but is very limited in space and utility there. The others are examples of cross-client state that by definition has to be managed on the server node (or in a data store attached to each node). In every deployment, then, there can be a mix of state with different properties.

...

All other state in the IdP falls into a second category, that of "non-conversational" data that the IdP stores and manages itself. The majority of this data is read and written using the org.opensaml.storage.StorageService API. Any implementation of this API is able to handle storage for a wide range of purposes within the IdP.
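To make the idea of a general-purpose store concrete, here is a minimal sketch of a keyed store with contexts and expirations. It is not the org.opensaml.storage.StorageService API; the class name InMemoryStore, its method names, and the ttl_seconds parameter are hypothetical and chosen only for illustration.

```python
import time
from typing import Dict, Optional, Tuple


class InMemoryStore:
    """Hypothetical, simplified store; NOT the real StorageService API."""

    def __init__(self) -> None:
        # Maps (context, key) -> (value, absolute expiry time or None).
        self._records: Dict[Tuple[str, str], Tuple[str, Optional[float]]] = {}

    def create(self, context: str, key: str, value: str,
               ttl_seconds: Optional[float] = None) -> bool:
        """Store a new record; return False if a live record already exists."""
        if self.read(context, key) is not None:
            return False
        expiry = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._records[(context, key)] = (value, expiry)
        return True

    def read(self, context: str, key: str) -> Optional[str]:
        """Return the stored value, or None if it is absent or expired."""
        record = self._records.get((context, key))
        if record is None:
            return None
        value, expiry = record
        if expiry is not None and time.monotonic() >= expiry:
            # Expired records are dropped lazily on access.
            del self._records[(context, key)]
            return None
        return value

    def delete(self, context: str, key: str) -> bool:
        """Remove a record; return True if something was removed."""
        return self._records.pop((context, key), None) is not None
```

Because every record is addressed by a context and a key, one store of this shape could serve several purposes at once (for example a replay cache and an artifact map) without the entries colliding.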

...

At present the software includes the following storage service implementations:

in-memory using a hashtable
client-side using secured cookies and HTML5 Local Storage
relational database via Hibernate
memcached

The first two are configured automatically after installation and are both used for various purposes by default. The last two require special configuration (and obviously additional software with its own impact on clustering) to use.

...

The intended approach is to rely on special hardware or software designed to intercept and route traffic to the various nodes in a cluster (so the hardware or software basically becomes a networking switch in front of the nodes). This switch is then given the hostname(s) of all the services provided by the cluster behind it.

Pros:

Guaranteed and flexible high-availability, load-balancing, and failover characteristics
Fine-grained control over node activation and deactivation, making online maintenance simple

Cons:

More difficult to set up
Requires purchase of equipment (some solutions can be very costly)
Adds additional hardware/software configuration

Because of the guaranteed characteristics provided by this solution, we recommend this approach. Take care that the load balancing hardware does not itself become a single point of failure (i.e., one needs to buy and run two load balancers and also address network redundancy).
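The routing behavior described above can be sketched in a few lines. This is a toy model, not a description of any particular load balancer product: the class StickyBalancer, its methods, and the node names are hypothetical. It keeps each client on one node (so conversational state is not lost between requests) and reroutes only when that node is deactivated.

```python
from typing import Dict, List


class StickyBalancer:
    """Toy model of a switch in front of a cluster (hypothetical names)."""

    def __init__(self, nodes: List[str]) -> None:
        self._nodes = list(nodes)
        self._active = set(self._nodes)
        self._assigned: Dict[str, str] = {}  # client id -> node
        self._next = 0  # round-robin cursor for new assignments

    def set_active(self, node: str, active: bool) -> None:
        """Take a node out of rotation (e.g., for maintenance) or restore it."""
        if active and node in self._nodes:
            self._active.add(node)
        else:
            self._active.discard(node)

    def route(self, client_id: str) -> str:
        """Return the node for this client, reassigning only on failure."""
        node = self._assigned.get(client_id)
        if node in self._active:
            return node
        node = self._pick_next()
        self._assigned[client_id] = node
        return node

    def _pick_next(self) -> str:
        # Walk the node list at most once, skipping inactive nodes.
        for _ in range(len(self._nodes)):
            candidate = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if candidate in self._active:
                return candidate
        raise RuntimeError("no active nodes available")
```

Calling set_active("idp1", False) before maintenance moves that node's clients to another node on their next request; restoring the node later does not move them back.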

...

A round robin means that each cluster node is registered in DNS under the same hostname. When a DNS lookup is performed for that hostname, the DNS server returns a list of IP addresses (one for each active node) and the client chooses which one to contact.

Pros:

Easy to set up

Cons:

No guarantee of failover characteristics. If a client chooses an IP from the list and then continues to stick with that address, even if that node is unreachable, the service will appear unavailable to that client until its DNS cache expires.
No guarantee of load-balancing characteristics. Because the client is choosing which node to contact, clients may "clump up" on a single node. For example, if the client's method of choosing which node to use is to pick the one with the lowest IP address, all requests would end up going to one node (this is an extreme example, and a client would be unwise to use this method).
If a client randomly chooses an IP from the list for each request (unlikely, but not disallowed), then the requests will fail if they depend on conversational state.
This approach cannot be used to run multiple nodes on the same IP address since DNS does not include port information.
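The "clumping" risk in the list above can be illustrated with a small simulation. No DNS lookups are performed; the addresses come from the 192.0.2.0/24 documentation range, and the function names are hypothetical. It is a sketch of client-side selection only, assuming every client receives the same address list.

```python
import ipaddress
import random
from collections import Counter
from typing import Callable, List

# Example addresses only (reserved documentation range), one per node.
NODE_ADDRESSES = ["192.0.2.30", "192.0.2.10", "192.0.2.20"]


def pick_lowest(addresses: List[str]) -> str:
    """A client strategy that always picks the numerically lowest address."""
    return min(addresses, key=ipaddress.ip_address)


def simulate(chooser: Callable[[List[str]], str], clients: int) -> Counter:
    """Count how many simulated clients land on each node."""
    return Counter(chooser(NODE_ADDRESSES) for _ in range(clients))


# Every client using pick_lowest ends up on the same node.
lowest_counts = simulate(pick_lowest, 100)

# A randomly choosing client spreads load, but offers no per-client affinity.
rng = random.Random(0)
random_counts = simulate(lambda addrs: rng.choice(addrs), 300)
```

With pick_lowest, all 100 simulated clients contact 192.0.2.10 and the other two nodes receive nothing, which is the extreme case described in the list.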

We strongly discourage this approach. It is mentioned only for completeness.

...

By default, the IdP uses the following strategies for managing its state:

The message replay cache and SAML artifact store use an in-memory StorageService bean.
The IdP session manager uses a cookie- and HTML Local Storage-based StorageService bean (with session cookies) and does track SP sessions for logout.
The attribute release and terms of use consent features use a cookie- and HTML Local Storage-based StorageService bean (with persistent cookies).
The CAS support relies on a ticket service that produces encrypted and self-recoverable ticket strings to avoid the need for clustered storage, though this can sometimes break older CAS clients due to string length.

The Local Storage use and logout defaults are applicable to new installs, and not systems upgraded from V3.

...

Provided some form of load balancing and failover routing is available from the surrounding environment (see above), this provides a baseline degree of failover and high availability out of the box (with the caveat that high availability is limited to recovery of session state between nodes, but not mid-request), scaling to any number of nodes.
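The reason client-side storage lets session state survive a move between nodes can be sketched as follows. This is an illustration of the principle only, under two loudly stated assumptions: that all nodes share a secret key, and that integrity protection alone is enough for the demonstration. The sketch signs the data with an HMAC and does not encrypt it, so it is not how the IdP secures its cookies; the function names seal and unseal are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def seal(shared_key: bytes, session: dict) -> str:
    """Serialize session state and append an HMAC tag (signing only)."""
    raw = json.dumps(session, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw).decode("ascii")
    tag = hmac.new(shared_key, payload.encode("ascii"), hashlib.sha256).hexdigest()
    return payload + "." + tag


def unseal(shared_key: bytes, cookie: str) -> Optional[dict]:
    """Return the session state if the tag verifies, otherwise None."""
    payload, sep, tag = cookie.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(shared_key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison of the two tags as bytes.
    if not hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


# Assumption: both nodes were provisioned with the same key.
key = b"example-shared-key"
cookie_from_node_a = seal(key, {"user": "alice", "authn": "password"})
recovered_on_node_b = unseal(key, cookie_from_node_a)
```

Because the state travels with the client, the second node needs no shared server-side store to recover it. A node holding a different key would reject the cookie.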

Feature Limitations

Replay detection is limited, of course, to a per-node cache.
SAML 1.1 artifact use is not supported if more than one node is deployed, because that requires a global store accessible to all nodes.
SAML 2.0 artifact use is not supported by default if more than one node is deployed, but it is possible to make that feature work with additional configuration (discussion TBD).

Combining these missing features with clustering requires the use of alternative StorageService implementations (e.g., memcached, JPA/Hibernate, or something else). The defaults can be overridden in part via the idp.replayCache.StorageService and idp.artifact.StorageService properties (and others). A more complete discussion of these options can be found in the StorageConfiguration topic.
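The per-node replay limitation, and the effect of moving to a store shared by all nodes, can be shown with a toy model. The ReplayCache class below is hypothetical and uses a plain Python set as a stand-in for whatever storage backs the cache; it is not the IdP's replay cache implementation.

```python
from typing import Optional, Set


class ReplayCache:
    """Toy replay cache; the backing set stands in for a storage service."""

    def __init__(self, store: Optional[Set[str]] = None) -> None:
        # Each cache gets a private set unless a shared one is supplied.
        self._seen: Set[str] = store if store is not None else set()

    def check(self, message_id: str) -> bool:
        """Return True if the message is new, False if it is a replay."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        return True


# Per-node caches: a replay sent to a different node goes undetected.
node_a, node_b = ReplayCache(), ReplayCache()
first_on_a = node_a.check("msg-1")   # new message
replay_on_a = node_a.check("msg-1")  # caught on the same node
replay_on_b = node_b.check("msg-1")  # missed: node B never saw it

# Shared store: both nodes consult the same record of seen messages.
shared: Set[str] = set()
shared_a, shared_b = ReplayCache(shared), ReplayCache(shared)
shared_first = shared_a.check("msg-1")
shared_replay = shared_b.check("msg-1")  # caught on the other node
```

In the per-node case the second node accepts the replayed message, while with the shared set it is rejected regardless of which node receives it.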
