Clustering
The goal of a typical deployment is NOT to cluster the SP. Clustering the SP is, at most, one possible means of solving the underlying problem, which is clustering the application itself.
Clustering the SP requires that you understand a number of interrelated issues:
Is the application itself clusterable?
How does the application's session management interact with the SP's session management, if at all?
What are the capabilities of your load balancer?
What are your tolerances for single points of failure?
The key issue is session management. The SP does maintain other information in memory, but generally this only includes the replay cache, and the protection you get from replay checks probably isn't enough to justify worrying about clustering it.
Session management is critical because most web applications that are "high end" enough to cluster have their own session mechanism, and do not rely on the SP session. To cluster an application like this, dealing with the SP isn't enough. You'll still have to do something about the application itself, and you should start there before you even get to the SP. In other words, if you can't cluster the application without the SP, chances are you won't be able to with it.
Having addressed that, now you can turn to the SP itself. Usually, though not always, with applications that maintain their own sessions, the SP is "bolted" onto the front of the application to protect a login script, and the rest of the interactions are handled by the application or the container session. Occasionally, you might see a hybrid design where both kinds of sessions are maintained. The usual reason is to preserve the ability to do logout through the SP. Before you bother with that, note that logout is itself very complex, and incredibly difficult to pull off. Consider whether that feature alone is worth all the trouble.
Avoidance
If you can minimize the footprint of the SP session to a short-term entry point, then the easiest way to cluster is not to do it at all. Instead, rely on your load balancer to maintain session affinity (so-called stickiness) for a minute or two. This ensures that the message from the IdP containing the SAML assertion is processed by, and the browser redirected back to, the same server that received it. That redirect will normally be to the "protected" login script, and the rest of the traffic should be handled by the application session, so it can go to any server.
If you're using Apache, you may be able to avoid even that short-term stickiness requirement by using proxy session creation.
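As a rough illustration of the stickiness approach, here is a minimal sketch using Apache's mod_proxy_balancer route-cookie pattern (the backend hostnames and the /secure path are hypothetical; substitute your own):

    # Requires mod_headers, mod_proxy, and mod_proxy_balancer.
    # Issue a route cookie when the balancer first assigns a node, so the
    # browser stays pinned to that node for the rest of the login sequence.
    Header add Set-Cookie "ROUTEID=.%{BALANCER_WORKER_ROUTE}e; path=/" env=BALANCER_ROUTE_CHANGED

    <Proxy "balancer://appcluster">
        # Hypothetical backend nodes; substitute your own.
        BalancerMember "http://app1.example.org:8080" route=node1
        BalancerMember "http://app2.example.org:8080" route=node2
        ProxySet stickysession=ROUTEID
    </Proxy>

    ProxyPass "/secure" "balancer://appcluster"
    ProxyPassReverse "/secure" "balancer://appcluster"

A route cookie like this actually persists longer than the minute or two required; if your load balancer supports a stickiness timeout, a short one is sufficient to cover the assertion-processing redirect.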
Now, let's say you can't do that. There are a few other options discussed below.
Cookie-Based Session Recovery
The SP has a feature to store a limited amount of attribute and session data in the client in an encrypted cookie; with the key shared across a set of servers, the session can be recovered on each node as the client moves between them. This is transparent, but due to the overhead it isn't intended for a client constantly moving between servers on every request. The feature is discussed in the SessionCache page.
This is generally the recommended approach now.
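A minimal shibboleth2.xml sketch of this, assuming the SP 3 configuration syntax (the key file path and the attribute IDs are hypothetical; consult the SessionCache topic for the authoritative element names and placement):

    <!-- Symmetric key used to encrypt the recovery cookie; the same key
         file must be deployed to every node in the cluster. -->
    <DataSealer type="Versioned" path="/etc/shibboleth/sealer.keys"/>

    <!-- persistedAttributes lists the attributes carried in the cookie;
         anything not listed is absent after a session is recovered. -->
    <SessionCache type="StorageService" StorageService="mem"
                  cacheAllowance="900"
                  persistedAttributes="eppn affiliation"/>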
Shared Process
The "simple", but not advisable, solution is to take advantage of the fact that the SP is divided into two pieces, and all of the session state is maintained in the shibd
process rather than the web server. While the SP installation requires that you install both halves on each machine, you don't technically have to use both halves on each server. If you have a fast enough, and secure enough, network, you could utilize a TCP connection to connect a number of web servers running the SP to a single shibd
"listener" process. This process can run on any of the cluster nodes, or on a separate box devoted to it. To set this up, just follow the documentation for using the TCP Listener plugin.
On Windows, this plugin is already the default, so it's just a matter of configuring it. On Unix, you'll usually have to switch to it from the "Unix" variant. Normally, the listener component binds itself to the local loopback address and blocks traffic from any other source. Just configure the machine running shibd
and set the listener's address to an actual network address, and set the ACL to a list of addresses corresponding to your web servers. Each of them in turn has the SP installed with the same configuration, which allows them to connect to the shared process.
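A sketch of the relevant shibboleth2.xml element, with hypothetical addresses (the same element goes into the identical configuration deployed on the shibd host and on each web server):

    <!-- Bind the listener to a real network address instead of the
         loopback, and restrict clients to the web servers' addresses.
         The acl attribute is a space-delimited list of permitted IPs. -->
    <TCPListener address="10.0.1.5" port="1600" acl="10.0.1.10 10.0.1.11"/>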
While the protocol was made less chatty over the years, it still performs poorly in this arrangement because every single request has to traverse this extra network hop. For a typical application not under extreme load, this is likely to be viable for the most part. However, you have to understand a few things:
Support for this is more or less unofficial and generally limited to consortium members (meaning nobody’s likely to jump in and help with this if you have problems).
Performance becomes pretty horrible very quickly as load increases. This is a bad solution for any but a lightly used site.
The protocol between the servers is NOT secure. It's a simple XML protocol running over TCP. You MUST rely on a secure network between the servers, ideally a private subnet.
The server running shibd is a single point of failure. If the process fails or the server fails, you'll lose all active sessions. However, you can restart any of the web servers without losing any state. They will reload any sessions as needed.
Session affinity is still important here. You'll get much better performance if you keep some locality of reference, because the sessions are cached in the web servers as well.
Shared Database
If you really want to try to cluster the SP in a persistent way, a plugin is provided for this purpose based on using an ODBC-compliant database. Its documentation records a lot of caveats.
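A rough sketch of the shibboleth2.xml wiring, with a hypothetical DSN (the details here are from memory of the ODBC plugin's documentation; verify the element names and your existing OutOfProcess block against it):

    <!-- Load the ODBC storage plugin into the shibd process. -->
    <OutOfProcess>
        <Extensions>
            <Library path="odbc-store.so" fatal="true"/>
        </Extensions>
    </OutOfProcess>

    <!-- Define a database-backed StorageService and point the session
         cache (and optionally the replay cache) at it. -->
    <StorageService type="ODBC" id="db" cleanupInterval="900">
        <ConnectionString>DSN=shibboleth;UID=shibd;PWD=password</ConnectionString>
    </StorageService>
    <SessionCache type="StorageService" StorageService="db"/>
    <ReplayCache StorageService="db"/>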
Keep in mind that unless you actually cluster the database itself, you still have a single point of failure. The only advantage over the shared process approach is that sessions can be recovered from disk once a temporary database failure is corrected. But in practice, by the time the failure is corrected, chances are most of your users have tried to restart their sessions anyway.
Clustering databases is certainly possible, but it can be very difficult and even expensive. To cluster this kind of data, you either need very rapid replication, or you'll have to get around that with session affinity.