High availability, load balancing, and clustering are inter-related topics that involve a mix of IdP deployment choices and additional components and configuration, depending on the intended goals. This page discusses all of these features and some of the options available for achieving them.
Within this document, the following terms are used:
node - a single IdP instance
cluster - a collection of nodes
high availability - the ability for nodes to fail without loss of existing operational data (e.g. sessions)
failover - the ability to detect a node failure and redirect work destined for a failed node to an operational node
load balancing - the ability to (relatively) evenly distribute the cluster's workload amongst all nodes
Note that the number of nodes within a cluster need not correlate to the number of servers (physical or virtualized) within the cluster. Each server may run more than one instance of the IdP software and thus count as more than one node. Also note that high availability, failover, and load balancing are distinct features and not all solutions provide all features.
Finally, be aware that failover and load balancing are completely outside the control of the IdP. These require some mechanism (some described below) for routing traffic before it even reaches a node.
The IdP is a stateful application. That is, it maintains operational information between requests that is required to answer subsequent requests. There are a number of different sets of state information: some specific to a particular client device, some spanning all clients, and some varying in temporal scope (i.e., long-term state vs. shorter-term state). Some of the most common examples are:
Spring Web Flow conversational state during profile request processing (essentially a single login, query, or other operation)
An "IdP session" capturing authentication results (so they can be reused= for SSO) and optionally tracking services for logout
Attribute release and terms of use ("consent") store
Message replay cache
SAML artifact store
CAS ticket store
The first few are examples of per-client state that, subject to implementation constraints, can potentially be stored in the client. Consent storage can be (and is) supported client-side, but is very limited in space and utility there. The others are examples of cross-client state that by definition have to be managed on the server node (or a data store attached to each node). In every deployment, then, there can be a mix of state with different properties.
The first bullet above is an exceptional case because it represents state that is implemented by software other than the IdP itself, namely Spring Web Flow. Most web flow executions, specifically those involving views, require a stateful conversation between the client and server. This state is managed in a SWF component called a "flow execution repository", and by default this repository is implemented in-memory within the Java container, with state tracked by binding each flow execution to the container session (the one typically represented by the JSESSIONID cookie).
So, out of the box, the IdP software requires that flows involving views start and finish on the same node, the most common example being a login that requires a form prompt or redirect.
While some containers do have the capability to serialize session state across restarts or replicate sessions between nodes, and Spring Web Flow is able to leverage that mechanism, the IdP does not support that because the objects it stores in the session are not required to be "serializable" in the formal Java sense of that term. This greatly simplifies the development of the software, but makes clustering harder.
At present, there is no solution provided to replicate the per-request conversational state. This means that 100% high availability is not supported; a failed node will disrupt any requests that are in the midst of being processed by that node. It also means that some degree of session "stickiness" is required. Clients must have some degree of node affinity so that requests will continue to go to a single node for the life of a request. This was always strongly encouraged for performance reasons, but is now formally required.
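To illustrate one common way of providing that affinity, the snippet below is a minimal, hypothetical HAProxy backend definition (not part of the IdP distribution) that pins each client to the node that issued its container session cookie. It assumes TLS is terminated at the balancer, each node runs Jetty on port 8080, and the server names and addresses are placeholders:

    backend idp_nodes
        balance roundrobin
        # reuse the container's JSESSIONID cookie to keep a client on one node
        cookie JSESSIONID prefix nocache
        server idp1 192.0.2.10:8080 check cookie idp1
        server idp2 192.0.2.11:8080 check cookie idp2

Any other mechanism that produces the same per-client affinity (a hardware balancer's persistence profile, Apache mod_proxy_balancer's stickysession parameter, etc.) serves equally well.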
All other state in the IdP falls into a second category, that of "non-conversational" data that the IdP stores and manages itself. The majority of this data is read and written using the org.opensaml.storage.StorageService API. Any implementation of this API is able to handle storage for a wide range of purposes within the IdP.
Not every use case involving a StorageService can use any implementation interchangeably, because of other considerations. The most common example is that not every piece of state can be stored in a client, or may not fit in cookies, which have very draconian size limitations. For example, the replay cache and SAML artifact stores require a server-side implementation because the data is not specific to a single client, and the tracking of services for logout requires too much space for cookies.
At present the software includes the following storage service implementations:
in-memory using a hashtable
client-side using secured cookies and HTML5 Local Storage
relational database via Hibernate
The first two are configured automatically after installation and are both used for various purposes by default. The others require special configuration (and obviously additional software, with its own impact on clustering) to use.
Excluding user credentials and user attribute data more generally, there is one exceptional case of data that may be managed by the IdP but is not managed by the unified StorageService API discussed above.
By default, the strategy used to generate "persistent", pair-wise identifiers for use in SAML assertions is based on salted hashing of a user attribute, and does not store any data.
An alternative strategy relies on a JDBC connection to a relational database with a specialized table layout (one that is compatible with the StoredID connector plugin provided in older versions). The requirements of this use case make it impractical to leverage the more generic StorageService API, but the IdP is extensible to other approaches to handling this data.
The PersistentNameIDGenerationConfiguration topic describes this feature in more detail.
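As a rough sketch of how the two strategies differ in configuration, the property names below reflect those used in saml-nameid.properties by recent IdP versions; treat the values as placeholders and consult the PersistentNameIDGenerationConfiguration topic for the authoritative details:

    # salted-hash ("computed") strategy, the default; stores nothing
    idp.persistentId.sourceAttribute = uid
    idp.persistentId.salt = replacewithalongrandomvalue
    idp.persistentId.generator = shibboleth.ComputedPersistentIdGenerator

    # stored strategy: switch the generator and reference a deployer-defined
    # JDBC DataSource bean (the bean id below is hypothetical)
    #idp.persistentId.generator = shibboleth.StoredPersistentIdGenerator
    #idp.persistentId.dataSource = PersistentIdDataSource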
Below are the most common methods for creating a cluster of nodes that look like one single service instance to the world at large.
The intended approach is to rely on special hardware or software designed to intercept and route traffic to the various nodes in a cluster (so the hardware or software basically becomes a networking switch in front of the nodes). This switch is then given the hostname(s) of all the services provided by the cluster behind it.
Pros:
Guaranteed and flexible high-availability, load-balancing, and failover characteristics
Fine-grained control over node activation and deactivation, making online maintenance simple
Cons:
More difficult to set up
Requires purchase of equipment (some solutions can be very costly)
Adds additional hardware/software configuration
Because of the guaranteed characteristics provided by this solution, we recommend this approach. Caution should be taken to ensure that the load balancing hardware does not become a single point of failure (i.e., one needs to buy and run two of them, as well as addressing network redundancy).
A DNS round robin means that each cluster node is registered in DNS under the same hostname. When a DNS lookup is performed for that hostname, the DNS server returns a list of IP addresses (one for each active node) and the client chooses which one to contact.
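For concreteness, a round robin is nothing more than multiple address records published under one hostname; the hypothetical zone fragment below uses documentation addresses:

    ; two nodes answering for the same IdP hostname
    idp.example.org.    300  IN  A  192.0.2.10
    idp.example.org.    300  IN  A  192.0.2.11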
Pros:
Easy to set up
Cons:
No guarantee of failover characteristics. If a client chooses an IP from the list and then continues to stick with that address even if the node is unreachable, the service will appear as unavailable for that client until their DNS cache expires.
No guarantee of load-balancing characteristics. Because the client is choosing which node to contact, clients may "clump up" on a single node. For example, if a client's method of choosing which node to use were to pick the one with the lowest IP address, all requests would end up going to one node (this is an extreme example, and it would be unwise for a client to behave this way).
If a client randomly chooses an IP from the list for each request (unlikely, but not disallowed), then the requests will fail if they depend on conversational state.
This approach cannot be used to run multiple nodes on the same IP address, since DNS does not include port information.
We strongly discourage this approach. It is mentioned only for completeness.
By default, the IdP uses the following strategies for managing its state:
The message replay cache and SAML artifact store use an in-memory StorageService bean.
The IdP session manager uses a cookie- and HTML5 Local Storage-based StorageService bean (with session cookies) and does track SP sessions for logout.
The attribute release and terms of use consent features use a cookie- and HTML5 Local Storage-based StorageService bean (with persistent cookies).
The CAS support relies on a ticket service that produces encrypted and self-recoverable ticket strings to avoid the need for clustered storage, though this can sometimes break older CAS clients due to string length.
The Local Storage use and logout defaults are applicable to new installs, and not to systems upgraded from V3.
The client-side StorageServices used in the default configuration use a secret key to secure the cookies and storage blobs, and this key needs to be carefully protected and managed. Simple tools to manage the secret key are provided.
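As an illustration, recent versions ship a small wrapper script for this purpose; the invocation below is a sketch that assumes the default "sealer" keystore layout created by the installer, and the exact script name and options may differ between versions:

    cd /opt/shibboleth-idp
    bin/seckeygen.sh --storefile credentials/sealer.jks \
        --storepass password \
        --versionfile credentials/sealer.kver \
        --alias secret

Each run adds a new key version; the updated keystore and version file must then be distributed to every node so that all nodes can read each other's cookies and storage blobs.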
These defaults mean that, out of the box, the IdP itself is easily clusterable, with the most critical data stored in the client and the rest designed to be transient, making it simple to deploy any number of nodes without additional software. This does not address the need to make authentication and attribute sources redundant, of course, as these are outside the scope of the IdP itself. The consent features are also quite limited in utility, but are at least usable without deploying a database, though a database is still assumed for real-world use of the feature.
Provided some form of load balancing and failover routing is available from the surrounding environment (see above), this provides a baseline degree of failover and high availability out of the box (with the caveat that high availability is limited to recovery of session state between nodes, but not mid-request), scaling to any number of nodes.
Replay detection is limited, of course, to a per-node cache.
SAML 1.1 artifact use is not supported if more than one node is deployed, because that requires a global store accessible to all nodes.
SAML 2.0 artifact use is not supported by default if more than one node is deployed, but it is possible to make that feature work with additional configuration (discussion TBD).
To combine these missing features with clustering requires the use of alternative StorageService implementations (e.g., memcache, JPA/Hibernate, or something else). This can in part be overridden via the idp.replayCache.StorageService and idp.artifact.StorageService properties (and others). A more complete discussion of these options can be found in the StorageConfiguration topic.
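As a minimal sketch, assuming a server-side StorageService bean has already been defined as described in the StorageConfiguration topic (the bean id shibboleth.JPAStorageService below is a hypothetical, deployer-chosen name), the override amounts to pointing the relevant properties at that bean in idp.properties:

    # route the replay cache and artifact map to a shared, server-side store
    idp.replayCache.StorageService = shibboleth.JPAStorageService
    idp.artifact.StorageService = shibboleth.JPAStorageService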