Clustering

High availability, load balancing, and clustering are inter-related topics that involve a mix of IdP deployment choices and additional components and configuration, depending on the intended goals. This page discusses all of these features and some of the options available for achieving them.

1 Terminology
2 Application Architecture and Effects on Clustering
- 2.1 Conversational State
- 2.2 Non-Conversational State
  - 2.2.1 Storage Service Implementations
- 2.3 Exceptional State
3 Common Clustering Approaches
- 3.1 Hardware- or Software-Based Clustering
- 3.2 DNS Round Robin
4 Default IdP Configuration
- 4.1 Feature Limitations of these Defaults

Terminology

Within this document, the following terms are used:

node - a single IdP instance

cluster - a collection of nodes

high availability - the ability for nodes to fail without loss of existing operational data (e.g. sessions)

failover - the ability to detect a node failure and redirect work destined for a failed node to an operational node

load balancing - the ability to (relatively) evenly distribute the cluster's workload amongst all nodes

Note that the number of nodes within a cluster need not correlate to the number of servers (physical or virtualized) within the cluster. Each server may run more than one instance of the IdP software and thus count as more than one node. Also note that high availability, failover, and load balancing are distinct features and not all solutions provide all features.

Finally, be aware that failover and load balancing are really outside the control of the IdP. These require some mechanism (some are described below) for routing traffic before it even reaches a node.

Application Architecture and Effects on Clustering

The IdP is a stateful application. That is, it maintains operational information between HTTP requests that is required to answer subsequent HTTP requests in particular/expected ways. There are a number of different sets of state information, some specific to a particular client device, some spanning all clients, and data that varies in temporal scope (i.e., long term state vs. shorter term state). Some of the most common examples are:

Spring Web Flow conversational state during profile request processing (essentially a single login, query, or other operation). This refers to the start and finish of a particular “logical” operation being performed; the most common tend to be SAML, CAS, or (with the OP plugin) OpenID Connect authentication request and response sequences. In some cases this results in several HTTP request/response pairs and may even involve redirects to external sites in the middle.
An "IdP session" capturing authentication results (so they can be reused for SSO) and optionally tracking services for logout. This is state that is created by specific authentication requests but extends beyond their lifetime.
Attribute release and terms of use ("consent") storage
Message replay cache
SAML artifact store
CAS ticket store

The first few are examples of per-client state that, subject to implementation constraints, can potentially be stored in the client. Consent storage can be (and is) supported client-side, but is very limited in space and utility there. The others are examples of multiple-client state that by definition has to be managed on the server node (or in a data store attached to each node). In every deployment, then, there is a mix of state with different properties.

Conversational State

The first bullet above is an exceptional case because it represents state that is implemented by software other than the IdP itself, namely Spring Web Flow. Most web flow executions, specifically those involving views, require a stateful conversation between the client and server. This state is managed in a SWF component called a "flow execution repository", and by default this repository is implemented in-memory within the Java container and state is tracked by binding each flow execution to the container session (the one typically represented by the JSESSIONID cookie).

So, out of the box, the IdP software requires that flows involving views must start and finish on the same node, the most common example being a login that requires a form prompt or redirect.

While Java containers do have the capability to serialize session state across restarts or replicate sessions between nodes, and Spring Web Flow is able to leverage that mechanism, the IdP does not support that because the objects it stores in the session are not required to be "serializable" in the formal Java sense of that term. This greatly simplifies the development of the software, but makes clustering harder. We do not have plans at present to fix this restriction.

At present, there is no solution provided to replicate the per-request conversational state. This means that 100% high availability is not supported; a failed node will disrupt any requests that are in the midst of being processed by the node. It also means that some degree of session "stickiness" is required. Clients must have some degree of node affinity so that requests will continue to go to a single node for the life of a request. This was always strongly encouraged for performance reasons, but is now formally required.

The most common techniques for implementing this “stickiness” in load balancers are affinity based on client/source IP address, and affinity based on cookies. Using cookies of course requires that the load balancer implement HTTP(S) itself and proxy traffic to the actual IdP servers. Using the IP address is more flexible and less invasive to the client and allows a wider range of load balancing scenarios but is also dependent on clients using somewhat unique, and stable, addresses, at least for the life of a series of related requests.
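As a rough illustration of source-IP affinity, the sketch below hashes the client address to pick a node, so repeated requests from the same address reach the same node. The node names and the `pick_node` function are hypothetical; real load balancers implement this internally and often use more robust schemes.

```python
import hashlib

# Hypothetical node names; a real load balancer keeps this list in its own configuration.
NODES = ["idp-node-1", "idp-node-2", "idp-node-3"]

def pick_node(client_ip, nodes=NODES):
    """Map a client IP address to a node deterministically."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a list index.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

first = pick_node("192.0.2.10")
second = pick_node("192.0.2.10")  # same address, so the same node as `first`
```

Note that with plain modulo hashing, adding or removing a node changes the mapping for many clients, which would interrupt flows that are in progress at that moment.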

Non-Conversational State

All other state in the IdP falls into a second category, that of "non-conversational" data that the IdP stores and manages itself. The majority of this data is read and written using the StorageService API. Any implementation of this API is able to handle storage for a wide range of purposes within the IdP.

Not every use case involving a StorageService can use any implementation interchangeably, because of other considerations. The most common example is that some state cannot be stored in a client at all, and some does not fit in cookies, which have draconian size limitations. For example, the replay cache and SAML artifact stores require a server-side implementation because the data is not specific to a single client, and the tracking of services for logout requires too much space for cookies.

Storage Service Implementations

At present the software includes (or plugins exist for) the following storage service implementations:

in-memory using a hashtable
client-side using secured cookies and HTML5 Local Storage
memcached
relational database via JDBC (via plugin)

The first two are configured automatically after installation and are both used for various purposes by default. The last two require special configuration (and obviously additional software with its own impact on clustering) to use.

Exceptional State

Excluding user credentials and user attribute data more generally, there is one exceptional case of data that may be managed by the IdP but is not managed by the unified StorageService API discussed above.

By default, the strategy used to generate "persistent", pairwise identifiers for use in SAML assertions (or OIDC tokens) is based on salted hashing of a user attribute, and does not rely on stored data.
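The reason a hash-based strategy needs no shared storage can be shown with a small sketch: any node holding the same salt derives the same identifier from the same inputs. The function name, the input ordering, the `!` delimiter, and the choice of SHA-256 are all assumptions made for illustration and should not be read as the IdP's actual algorithm.

```python
import base64
import hashlib

def pairwise_id(user_value, sp_entity_id, salt):
    """Derive a stable per-service identifier without storing anything.

    Illustrative only: the layout of the hashed input is an assumption.
    """
    h = hashlib.sha256()
    h.update(sp_entity_id.encode("utf-8"))
    h.update(b"!")
    h.update(user_value.encode("utf-8"))
    h.update(b"!")
    h.update(salt)
    return base64.b64encode(h.digest()).decode("ascii")

salt = b"example-salt-shared-by-all-nodes"  # hypothetical value
id_on_node_a = pairwise_id("jdoe", "https://sp.example.org/shibboleth", salt)
id_on_node_b = pairwise_id("jdoe", "https://sp.example.org/shibboleth", salt)
id_other_sp = pairwise_id("jdoe", "https://other.example.org/shibboleth", salt)
```

The same user receives a different value for each service, and every node must be configured with the same salt for the values to agree across the cluster.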

An alternative strategy available relies on a JDBC connection to a relational database with a specialized table layout (one that is compatible with the StoredID connector plugin dating to older versions). The requirements of this use case make it impractical to leverage the more generic StorageService API, but the IdP is extensible to other approaches to handling this data.

The PersistentNameIDGenerationConfiguration topic describes this feature in more detail.

Common Clustering Approaches

Below are the most common methods for creating a cluster of nodes that look like one single service instance to the world at large.

Hardware- or Software-Based Clustering

The intended approach is to rely on special hardware or software designed to intercept and route traffic to the various nodes in a cluster (so the hardware or software basically becomes a networking switch in front of the nodes). This switch is then given the hostname(s) of all the services provided by the cluster behind it.

Pros:

Guaranteed and flexible high-availability, load-balancing, and failover characteristics
Fine-grained control over node activation and deactivation, making online maintenance simple

Cons:

More difficult to set up
Requires purchase of equipment (some solutions can be very costly)
Adds additional hardware/software configuration

Because of the guaranteed characteristics provided by this solution, we recommend this approach. Caution should be taken to ensure that the load balancing hardware does not become a single point of failure (i.e., one needs to buy and run two of them as well as addressing network redundancy).

DNS Round Robin

A round robin means that each cluster node is registered in DNS under the same hostname. When a DNS lookup is performed for that hostname, the DNS server returns a list of IP addresses (one for each active node) and the client chooses which one to contact.

Pros:

Easy to set up

Cons:

No guarantee of failover characteristics. If a client chooses an IP from the list and then continues to stick with that address, even if a node is unreachable, the service will appear as unavailable for that client until their DNS cache expires.
No guarantee of load-balancing characteristics. Because the client is choosing which node to contact, clients may "clump up" on a single node. For example, if the client's method of choosing which node to use is to pick the one with the lowest IP address, all requests would end up going to one node (this is an extreme example, and a client would be unwise to use this method).
If a client randomly chooses an IP from the list for each request (unlikely, but not disallowed), then the requests will fail if they depend on conversational state.
This approach cannot be used to run multiple nodes on the same IP address since DNS does not include port information.

We strongly discourage this approach. It is mentioned only for completeness.
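The load-balancing concern listed above can be made concrete with a small simulation. It assumes three nodes behind one hostname (using documentation-range addresses) and compares a client policy of always picking the lowest address with a policy of taking the first address from a shuffled answer. The function names and the client count are arbitrary choices for this sketch.

```python
import ipaddress
import random
from collections import Counter

# Documentation-range addresses standing in for three nodes behind one hostname.
ADDRESSES = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]

def lowest_address(answer):
    """Client policy: always pick the numerically lowest address."""
    return min(answer, key=ipaddress.ip_address)

def first_address(answer):
    """Client policy: take whichever address is listed first."""
    return answer[0]

def simulate(choose, clients=300, seed=1):
    """Count how many simulated clients end up on each node."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(clients):
        answer = ADDRESSES[:]
        rng.shuffle(answer)  # stand-in for a DNS server varying the order
        counts[choose(answer)] += 1
    return counts

clumped = simulate(lowest_address)  # every client lands on 192.0.2.11
spread = simulate(first_address)    # clients are distributed across nodes
```

Neither policy is under the deployer's control, which is the underlying problem: the distribution depends entirely on client behavior.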

Default IdP Configuration

By default, the IdP uses the following strategies for managing its state:

The message replay cache and SAML artifact store use an in-memory StorageService bean.
The IdP session manager uses a cookie- and HTML Local Storage-based StorageService bean (with session cookies) and does track SP sessions for logout.
The attribute release and terms of use consent features use a cookie- and HTML Local Storage-based StorageService bean (with persistent cookies), but we naturally expect people will deploy a more persistent storage option for this use case.
The CAS support relies on a ticket service that produces encrypted and self-recoverable ticket strings to avoid the need for clustered storage, though this can sometimes break older CAS clients due to string length.

The Local Storage use and logout defaults are applicable to new installs, and not systems originally upgraded from V3.

The client-side StorageServices used in the default configuration use a secret key to secure the cookies and storage blobs, and this key needs to be carefully protected and managed. Simple tools to manage the secret key are provided.
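To see why the key matters in a cluster, consider the simplified sketch below: data sealed by one node can only be accepted by another node that holds the same key. This is not the IdP's actual mechanism or data format; the `seal` and `unseal` functions are hypothetical and use an HMAC for integrity only, without encrypting the data.

```python
import hashlib
import hmac

def seal(value, key):
    """Prefix the value with an HMAC tag computed under the key."""
    tag = hmac.new(key, value, hashlib.sha256).hexdigest().encode("ascii")
    return tag + b"." + value

def unseal(blob, key):
    """Return the value if the tag verifies under the key, else raise ValueError."""
    tag, _, value = blob.partition(b".")
    expected = hmac.new(key, value, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(tag, expected):
        raise ValueError("blob was not sealed with this key")
    return value

shared_key = b"key-distributed-to-every-node"  # hypothetical value
blob = seal(b"session-data", shared_key)        # produced on one node
recovered = unseal(blob, shared_key)            # accepted on another node

try:
    unseal(blob, b"a-different-key")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

A node with a different key rejects the client's stored data, so every node in a cluster needs access to the same key material.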

These defaults mean that, out of the box, the IdP itself is easily clusterable with the most critical data stored in the client and the rest designed to be transient, making it simple to deploy any number of nodes without additional software. This does not address the need to make authentication and attribute sources redundant, of course, as these are outside the scope of the IdP itself. The consent features are also quite limited in utility, but are at least usable without deploying a database, though this is still assumed for real-world use of the feature.

Provided some form of load balancing and failover routing is available from the surrounding environment (see above), this provides a baseline degree of failover and high availability out of the box (with the caveat that high availability is limited to recovery of session state between nodes, but not mid-request), scaling to any number of nodes.

Feature Limitations of these Defaults

Replay detection is limited, of course, to a per-node cache.
SAML 1.1 artifact use is not supported if more than one node is deployed, because that requires a global store accessible to all nodes.
SAML 2.0 artifact use is not supported by default if more than one node is deployed, but it is possible to make that feature work with additional configuration by prefixing the artifacts and then implementing advanced routing within a load balancer. This is rarely done.
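The replay limitation can be illustrated with a toy model in which each node keeps its own set of seen message IDs. The `Node` class is hypothetical and omits details such as entry expiration; it only shows why a per-node cache misses a replay sent to a different node, and how a store shared by all nodes closes that gap.

```python
class Node:
    """Toy IdP node with a replay cache of seen message IDs."""

    def __init__(self, store=None):
        # Use a private in-memory set unless a shared store is supplied.
        self.seen = store if store is not None else set()

    def accept(self, message_id):
        """Return True for a new message ID, False for a detected replay."""
        if message_id in self.seen:
            return False
        self.seen.add(message_id)
        return True

# Per-node caches: a replay is only caught on the node that saw the original.
node_a, node_b = Node(), Node()
first = node_a.accept("msg-1")          # True: new message
replay_same = node_a.accept("msg-1")    # False: caught on the same node
replay_other = node_b.accept("msg-1")   # True: missed on a different node

# Shared store: both nodes consult the same data, so the replay is caught.
shared = set()
node_c, node_d = Node(shared), Node(shared)
shared_first = node_c.accept("msg-2")   # True
shared_replay = node_d.accept("msg-2")  # False
```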

Combining these missing features with clustering requires the use of alternative StorageService implementations (e.g., memcached, JDBC, or something else). The defaults can in part be overridden via the idp.replayCache.StorageService and idp.artifact.StorageService properties (and others). A more complete discussion of these options can be found in the StorageConfiguration topic.
