IdP Key and Certificate Management

Background

The topic of key and certificate management for IdPs is probably the one that comes up the most on various lists, yet it has the fewest good answers and is the hardest to respond to concisely, particularly when the people asking don’t have a full grounding in SAML or public key cryptography. It’s also prone to a lot of “it can’t be this bad” incredulity from people asking about it, because they quite rightly can’t fathom how unbelievably dumb the world we find ourselves in has turned out. (If you had told me, in 2001, that after all of the work to design a viable trust management model for SAML that’s relatively simple to implement, virtually the entire world outside higher education would react by shrugging and ignoring the problem, I probably would have either not believed it or given up right there.)

In a nutshell, outside of the small number of metadata-capable implementations out there (like Shibboleth), most of them have absolutely no viable means of changing or revoking the acceptance of a key at all, and a few others try and hack in an approach based on certificates expiring, which does not solve the problem fully and exacerbates the whole mess by requiring constant flag days issuing new certificates (or even worse, new keys).

So that’s your baseline: the world didn’t standardize on a way to do all this, and in practice there’s no effective way to change or revoke one’s key outside of controlled ecosystems and/or specific implementations. That’s where you start the conversation, so the answer to “how do I do this?” is “badly and painfully” if you have a decent number of SPs to deal with that aren’t all running one of a few different software products.

What I hope to cover in this HowTo are a few common problems and how impactful they are, along with an overall strategy I arrived at over time that takes advantage of the IdP’s advanced features to position my deployment to address future challenges. I suspect at least something in here will be useful to people.

The earlier sections are background/education about the problem, so if you’re just looking for Shibboleth configuration help, it’s later on, starting with the “Approach Overview” section.

Key Types

The first thing you have to understand is the different types of keys involved in operating the IdP. A deeper dive into this can be found in the SecurityAndNetworking topics in the IdP documentation. This is a briefer overview.

There are primarily three different types of keys that the running IdP needs to access:

Cross-Node Secret Key
Decryption Key (this is referred to in specs and config as encryption, but it’s really for decryption by the IdP, and used by SPs to encrypt to it)
Signing Key

The secret key is mentioned just for completeness: nothing but the IdP should ever have access to it, it is only for internal use, and it can be changed at any time because the provided tools maintain a history of keys, allowing legacy data to be decrypted, etc. Easy to manage (as much as any key is).

The decryption key, while generally an RSA keypair, is vastly less critical than the signing key because the amount of decryption the IdP does is very minimal. Almost no SPs except for Shibboleth actually even support encrypting the identifiers in Logout requests, and much of the time those identifiers aren’t even sensitive, and the “sensitive” ones are a privacy concern, not a matter of security more generally.

The signing key is of course the whole ballgame. There are basically two types of SSO protocols, those that involve direct communication from SP to IdP and those that strictly push information through the browser. The former have the advantage of some additional security mitigations and can even avoid the need for signing altogether if designed sensibly, but are harder to deploy and manage over time. The latter, which includes the primary way SAML gets used, have manageability advantages, but are vulnerable to more catastrophic breaks because possession of the signing key is literally all that’s required to forge logins to any trusting services.

It is absolutely critical to limit the people who could ever get access to the key, which is why traditional systems management approaches in which lots of constantly shuffling systems administration staff could have access to it are an absolutely terrible idea. HSMs help with this problem, but are expensive and can be difficult to scale for online use.

Certificate and Key Rotation

There are no known attacks against the RSA (or ECDSA) algorithms that justify constantly rotating the signing or decryption keys. This is good, because of how painful it is to do. Unfortunately, the use of public key cryptography on the web has convinced people that it’s just necessary to constantly rotate them. That is not true; it is merely easy on the web, because those keys are used in conjunction with certificates signed by roots of trust that are just baked into browsers and quite frankly just don’t mean much. It’s only a trust infrastructure for the weakest definition of the term.

With no independent authority to define the roots of trust, and no ability to rely on existing roots to sign SAML keys (EV certificates were kind of an exception and those have been killed off anyway), that approach just doesn’t work. The model adopted by Shibboleth and later defined as a SAML standard is based on certificates in XML metadata files and does not require or even allow any evaluation of the certificates themselves. The public keys are what matter.

And in this model, it is actually pretty easy to rotate these keys by publishing metadata and adjusting the IdP configuration in a particular sequence. But that only works when the SPs support metadata fully, and that did not happen more broadly in the world, and is virtually unheard of in the Software as a Service world. That makes rotating keys a mess because you have a mix of compliant and non-compliant SPs and often don’t know for sure which ones are which. In practice, actually changing a key is a long, manual process that risks breaking many connected SPs and is only done when you absolutely have to, typically because of a compromise or the possibility of one.

This is made more complex because of the certificate “baggage”. It’s not universally safe to reissue a certificate around the same public key and expect things to just work. Many, if not most, non-compliant SPs will not accept a different certificate even if the key is the same. Some will. And that is also very hard to know absent testing. So the worst thing you can do is deploy short-lived certificates that constantly need to be re-issued. If you’re going to go through the pain, it’s probably worth just changing the key.

It is true, of course, that the longer a key is around, the greater the chance that it could have been stolen. Unfortunately, that doesn’t magically make it simple to rotate them, and the best you can do is limit access so aggressively that the actual risk of theft is not much greater over time.

All that said, there is a major, major difference between rotating the signing and decryption keys in terms of pain and risk, and this is why it’s hugely important to never reuse the same key for both. That is why the IdP software installs with separate default keys. If you fell into this trap for some reason, you should move aggressively to split them, by rotating to a new decryption key, since that is much less painful and will put things on a sounder footing.

Decryption Key Rotation

As I noted above, there just isn’t a lot of usage of this key, and it’s likely that the majority of usage is by metadata-capable SPs. That doesn’t necessarily help you if you’re not operating in the context of a federation that supports metadata, but I really don’t know of much usage of the decryption key outside of that context. I can’t really think of any “manual” or “self-managed” SPs that even have support for this or will ask for your decryption key. Some of them might even be so broken as to just assume it’s the same as the signing key.

Ultimately this is also only about logout in the vast majority of deployments, and folks, logout doesn’t work at any scale, and is not a good solution to any problem people think it solves. Breaking logout is not even a blip on my operational radar.

So my advice is to just not worry about it and follow the standard metadata-based approach for this. You add a new decryption key to the IdP configuration, you add it to the metadata you publish or have published on your behalf, you wait a day, remove the old decryption key, and fix everything else you find out about later.
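
As a rough sketch of the IdP side of that overlap, assuming the stock shibboleth.EncryptionCredentials list in conf/credentials.xml and borrowing the labeling pattern used for the signing keys later in this article, both keypairs would be loaded at once. The bean IDs and property names shown are hypothetical.

```xml
<!-- conf/credentials.xml (fragment, inside the existing <beans> root) -->
<!-- Hypothetical names: osu.EncryptionCredential.* and idp.encryption.*.2004/2019 -->
<util:list id="shibboleth.EncryptionCredentials">
    <!-- New keypair: the one now being published in metadata. -->
    <ref bean="osu.EncryptionCredential.2019" />
    <!-- Old keypair: kept so anything still encrypted to the old key decrypts. -->
    <ref bean="osu.EncryptionCredential.2004" />
</util:list>

<bean id="osu.EncryptionCredential.2019"
    class="net.shibboleth.idp.profile.spring.factory.BasicX509CredentialFactoryBean"
    p:privateKeyResource="%{idp.encryption.key.2019}"
    p:certificateResource="%{idp.encryption.cert.2019}"
    p:entityId-ref="entityID" />

<bean id="osu.EncryptionCredential.2004"
    class="net.shibboleth.idp.profile.spring.factory.BasicX509CredentialFactoryBean"
    p:privateKeyResource="%{idp.encryption.key.2004}"
    p:certificateResource="%{idp.encryption.cert.2004}"
    p:entityId-ref="entityID" />
```

Once the waiting period is over, dropping the 2004 reference and bean completes the change.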

In the event you disagree, or have other decryption use cases, the approaches I suggest for dealing with the signing key could certainly be adapted to this problem.

Signing Key Rotation

This is of course “the problem”. It’s helpful to look at some different cases to motivate short term answers and then the bigger picture, which is to work toward a solution that can help position a deployment for long term success.

One case that comes up occasionally is the need to accommodate a broken SP. The most common reason for this tends to be certificate content, but it could also be something like a key that’s too big or too small. With certificates, the biggest problem over the last 10 years or so has been digest algorithms. Older certificates were often issued using SHA-1 digests, or even older, MD5. Both are considered broken for certificate use, but of course they are not in fact broken for SAML because it’s not the certificate that matters, it’s the key inside it. But having said that, libraries over time tend to drop support for algorithms considered broken and so it may not be a real choice by the SP to insist on this but a consequence of code they don’t control. It’s still a bug in SAML terms, but it’s not necessarily an avoidable one.

Assuming that only a few SPs exhibit a problem, addressing this case is not per se a scenario requiring complete rotation of the key or certificate, but can be addressed by overriding the signing keypair used for some services. Unfortunately, Shibboleth was not designed to make this particularly easy because our target was to drive many SPs at once with metadata, not to implement pairwise configuration for specific SPs. Some settings are pretty easy to override quickly, but overriding the key is somewhat more complex to handle.

The broader case, of course, is a desire to fully rotate the key or certificate for every SP. That has no simple “just do this” answer for the reasons already stated, but there is a general approach to configuring the IdP that makes doing that as reasonable as possible while also making per-SP overrides more straightforward. So the rest of this article covers how to do that, making the simpler problem a subset of the larger one.

Lastly, a note about per-SP keys. Some newer implementations of SAML take the view that the way to solve this problem is to always create a dedicated keypair for every SP so that any key rotation is always limited to one SP at a time. That certainly has some appeal, but the plain fact is that this approach simply cannot allow for multi-lateral, metadata-based federation with SPs at scale. You can’t federate automatically with SPs if you have to manually set up a dedicated key and configuration for every SP, and that approach would simply undermine what Shibboleth does that most other SAML IdPs can’t, and will never, support. But, sure, if that’s all you care about, then it’s certainly a road you could go down, and I’ll touch on what that looks like because it’s just a generalization of the techniques below. But it isn’t our design center and it will never be as clean as people would like if that’s what they want to do.

Approach Overview

The major precondition to this approach is knowledge. You have to fix the problem of not knowing what the behavior of your SPs actually is in order to put them into appropriate buckets/categories, principally two: compliant and non-compliant. There are some additional issues that complicate things but overall it’s really just “SPs that properly implemented SAML and support metadata and automated key rotation” and “the rest”.

Once that’s done, the goal of my approach is to curate SP metadata such that any SPs in the “the rest” bucket are tagged with an EntityAttribute extension that identifies which key they are currently using by pointing to the IdP SecurityConfiguration bean that contains that key. The name of that bean is the value of the tag, and the IdP’s MetadataDrivenConfiguration feature is used to tell the IdP which configuration bean, and thus which key, to use for that SP. What this does is “lock” those SPs to use a specific key so that the default key (which is broadly published via federation metadata) can be changed without causing the new key to immediately be used for SPs that are broken. In this manner, you can automatically rotate the key for any SPs that can handle that while deferring the rest for the long, manual, painful trek of getting them all updated.

For your own purposes, it may be useful to further subdivide “the rest” of the SPs that are broken into these buckets:

Manual
- These are the bane of the SAML world, and no amount of scorn can possibly be enough. These are SPs who think it’s reasonable to manually configure your key(s) by emailing them and give you no direct control over the configuration of your own IdP settings. There just aren’t words to express how ridiculous this is. The “beauty” of this is that there’s almost nothing they can do to implement a practical and safe mechanism for changing the key. If it’s easy, they’re vulnerable to trivial attacks by bad actors getting them to add a key they control. If it’s hard (because they’re taking it seriously enough to demand some kind of evidence from you), then you can’t get it changed or revoked in the event of a security incident in a timely manner. If you are such an SP, please take a long look in the mirror. Stop.
Team- or Customer-Managed
- The rest of the broken SPs are becoming the dominant set in the cloud world; they rely on the customer to access administrative functionality within the application to configure SSO settings, including the key. This is not good, so don’t pat yourself on the back if this is you. It’s simply less ridiculous than the manual approach. With hundreds of these systems, there is no practical way to respond to a security breach without long delays, and you have to have the needed access to these systems to fix them, which may or may not be palatable. Some vendors even have the audacity to charge extra for these admin licenses that they themselves make necessary through their broken SAML code. That’s at least audacious, if not outright theft.

These bad actors are obviously the ones that give rise to the idea that every one of them should have its own key. It’s certainly possible to go down that road, as I noted earlier, but the hidden trap with this idea is the threat model. Why would you need to change the key(s)? Usually because of exposure. Guess what happens if one of your keys is exposed? Chances are they all were, or could have been, because they’re likely all in one place. You now have to change them all, so your revolutionary strategy just turned back into “touch every system anyway”, not to mention the additional attack surface of juggling hundreds of keys (or more). I don’t see the win, but YMMV.

Configuration Examples

This is what you’re actually here for, how to actually configure additional keys and then tag SPs to use them. In the examples below, an “old” and “new” signing keypair are used to illustrate how this looks during transitions. Ultimately, you may end up collapsing things back down to a single keypair and get rid of the older beans once they’re not in use.

All of my examples use a naming convention of “osu.” as a prefix. That’s just a convenience for me to keep my locally-defined beans isolated. Ultimately just don’t use “shibboleth.” as a prefix and collisions are generally avoided, but the idea is a good pattern to use.

Credentials

The first step is to actually lay out multiple signing credentials, which is done in conf/credentials.xml. Actually adding keys here doesn’t impact anything, because out of the box only the “default” credential is recognized.

My approach is to use sensible labels on these beans so that if I’m in a key transition, it’s very clear which key is which (i.e., “one” and “two” don’t really do it for me). I used labels involving the year I generated the keypair, but anything descriptive is good enough. Per usual, I used properties to actually define the file paths and passwords and such; this is just the wiring to get them defined.

conf/credentials.xml

<util:list id="shibboleth.SigningCredentials">
    <ref bean="osu.SigningCredential.2019" />
    <ref bean="osu.SigningCredential.2004" />
</util:list>

<alias alias="shibboleth.DefaultSigningCredential" name="osu.SigningCredential.2019" />

<bean id="osu.SigningCredential.2004"
    class="net.shibboleth.idp.profile.spring.factory.BasicX509CredentialFactoryBean"
    p:privateKeyResource="%{idp.signing.key.2004}"
    p:privateKeyPassword="%{idp.signing.password.2004}"
    p:certificateResource="%{idp.signing.cert.2004}"
    p:entityId-ref="entityID" />

<bean id="osu.SigningCredential.2019"
    class="net.shibboleth.idp.profile.spring.factory.BasicX509CredentialFactoryBean"
    p:privateKeyResource="%{idp.signing.key.2019}"
    p:privateKeyPassword="%{idp.signing.password.2019}"
    p:certificateResource="%{idp.signing.cert.2019}"
    p:entityId-ref="entityID" />

In my example, the new key is identified as the default via the alias. This is notable because when you start out transitioning, it’s the old key you would have defaulted. The significance of the default key is that it’s the one used in the absence of any other information and is the one you publish in federation metadata, so after a transition in the metadata from old to new, you’re then changing the default to be the new key. At all times, that’s the key used for “compliant” SPs.
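
For comparison, at the start of a transition the same file would plausibly look like the sketch below, with the alias still on the old key. This is only an illustration built from the bean names above, not a copy of any real configuration.

```xml
<!-- conf/credentials.xml (fragment): state before the default is flipped -->
<util:list id="shibboleth.SigningCredentials">
    <ref bean="osu.SigningCredential.2019" />
    <ref bean="osu.SigningCredential.2004" />
</util:list>

<!-- Default still points at the old key, so untagged ("compliant") SPs
     keep receiving signatures made with the 2004 key. -->
<alias alias="shibboleth.DefaultSigningCredential" name="osu.SigningCredential.2004" />
```

Flipping the default later is then a one-line change to the name attribute of the alias.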

Defining Security Configurations

The thing that complicates Shibboleth in this area is that there’s no direct setting for “use this key for an SP” via the relying-party.xml file. There are a ton of complex settings for signing and encryption behavior that are wrapped up in a larger structure called a SecurityConfiguration. That’s the thing you can actually set on a per-SP basis. Deep inside that bean is the actual reference to the signing keypair to use.

So the next phase is to illustrate how to lay out multiple security configurations with different keys. Again, you want to label these beans clearly, because these names are the ones that will show up in the metadata or in filters to add them on the fly.

conf/relying-party.xml

    <bean id="osu.SecurityConfig.2019" parent="shibboleth.DefaultSecurityConfiguration">
        <property name="signatureSigningConfiguration">
            <bean parent="shibboleth.SigningConfiguration.SHA256" p:signingCredentials-ref="osu.SigningCredential.2019" />
        </property>
    </bean>

    <bean id="osu.SecurityConfig.2004" parent="shibboleth.DefaultSecurityConfiguration">
        <property name="signatureSigningConfiguration">
            <bean parent="shibboleth.SigningConfiguration.SHA256" p:signingCredentials-ref="osu.SigningCredential.2004" />
        </property>
    </bean>

It doesn’t take a lot of XML, but it’s a bit non-obvious without knowing the APIs, and you have to make sure to inherit from the appropriate SigningConfiguration beans, which in most cases means basing it on the one that uses SHA-256.

Installing Security Configurations

In order to make use of the beans defined above, you have to install some additional wiring, but a lot of what’s going on is very implicit and not spelled out. This is because in the absence of a setting specific to a profile and relying party, the system automatically falls back to the bean called shibboleth.DefaultSecurityConfiguration, which in turn uses the signing key identified by the bean name shibboleth.DefaultSigningCredential. Thus, you don’t have to tag SPs that should use the default key and that’s not the purpose of defining two beans here.

Rather, it’s the broken SPs that will be pointed to one of those two beans as part of the transition from one key to the other (or possibly just permanently pointing at an exceptional key used for just a single SP).
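
The single-SP exception can be sketched the same way. The example below assumes a hypothetical extra keypair labeled “Legacy” with made-up property names; in the credentials example earlier, every credential is also referenced from the shibboleth.SigningCredentials list, and this sketch assumes the same would be done for the extra key.

```xml
<!-- conf/credentials.xml: hypothetical keypair reserved for one broken SP -->
<bean id="osu.SigningCredential.Legacy"
    class="net.shibboleth.idp.profile.spring.factory.BasicX509CredentialFactoryBean"
    p:privateKeyResource="%{idp.signing.key.legacy}"
    p:privateKeyPassword="%{idp.signing.password.legacy}"
    p:certificateResource="%{idp.signing.cert.legacy}"
    p:entityId-ref="entityID" />

<!-- conf/relying-party.xml: security configuration wrapping that keypair -->
<bean id="osu.SecurityConfig.Legacy" parent="shibboleth.DefaultSecurityConfiguration">
    <property name="signatureSigningConfiguration">
        <bean parent="shibboleth.SigningConfiguration.SHA256"
            p:signingCredentials-ref="osu.SigningCredential.Legacy" />
    </property>
</bean>
```

The one SP that needs it would then be pointed at osu.SecurityConfig.Legacy, and every other SP stays on whatever it was using.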

To make this work, you need to define a lookup strategy that will allow metadata tags to supply the SecurityConfiguration to use, and then install that strategy function into the various profile beans you use in your defaults. By doing this, you do not need to define any RelyingParty overrides for this at all. You may need them for other reasons, but they’re not needed here.

conf/relying-party.xml

    <bean id="osu.SecurityConfigurationLookupStrategy" parent="shibboleth.MDDrivenBeanProperty" p:propertyName="securityConfiguration"
        p:propertyType="#{T(net.shibboleth.idp.profile.config.SecurityConfiguration)}" />

    <!-- OSU default SSO profile baselines -->
    <bean id="osu.Shibboleth.SSO" parent="Shibboleth.SSO"
        p:securityConfigurationLookupStrategy-ref="osu.SecurityConfigurationLookupStrategy" />
    <bean id="osu.SAML2.SSO" parent="SAML2.SSO"
        p:securityConfigurationLookupStrategy-ref="osu.SecurityConfigurationLookupStrategy" />
    <bean id="osu.SAML2.ECP" parent="SAML2.ECP"
        p:securityConfigurationLookupStrategy-ref="osu.SecurityConfigurationLookupStrategy" />
    <bean id="osu.SAML2.Logout" parent="SAML2.Logout"
        p:securityConfigurationLookupStrategy-ref="osu.SecurityConfigurationLookupStrategy" />

    <bean id="shibboleth.DefaultRelyingParty" parent="RelyingParty">
        <property name="profileConfigurations">
            <list>
                <ref bean="osu.Shibboleth.SSO" />
                <ref bean="osu.SAML2.SSO" />
                <ref bean="osu.SAML2.ECP" />
                <ref bean="osu.SAML2.Logout" />
            </list>
        </property>
    </bean>

In my example, I’m supporting 4 different profiles, and installing the lookup strategy in all of them, then using those as my “baseline” via the default relying party bean. I do have other settings I use on my profile defaults, but I’m omitting them as outside the scope here. The point is just that this is how to get the lookup strategy applied to every SP across the board.

What this does is tell the system to look in the metadata for a particular tag name that will contain the bean ID of the SecurityConfiguration to use, and if none is found, it will fall back to default system behavior.

Note that, yes, you can do this by adopting the “full” metadata-driven configuration approach and using the “.MDDriven”-suffixed beans mentioned in the MetadataDrivenConfiguration topic. That lookup strategy bean is actually just the same bean that’s inside the system wiring for that feature. I show it this way because it’s much faster to only wire up support for tags you intend to use than to go whole hog and force every setting to route into the metadata, and well, I know how, so that’s what I did.
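
For reference, a sketch of that “full” alternative, assuming the suffixed beans follow the naming convention described in the MetadataDrivenConfiguration topic (check the bean names against your IdP version before relying on them):

```xml
<!-- conf/relying-party.xml: alternative default that routes every supported
     setting through metadata tags, not only securityConfiguration.
     The .MDDriven bean names are assumed from the documented suffix convention. -->
<bean id="shibboleth.DefaultRelyingParty" parent="RelyingParty">
    <property name="profileConfigurations">
        <list>
            <ref bean="Shibboleth.SSO.MDDriven" />
            <ref bean="SAML2.SSO.MDDriven" />
            <ref bean="SAML2.ECP.MDDriven" />
            <ref bean="SAML2.Logout.MDDriven" />
        </list>
    </property>
</bean>
```

With this variant, the osu.SecurityConfigurationLookupStrategy bean and the osu.* profile beans would not be needed, at the cost of evaluating tag lookups for settings you never intend to tag.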

Tagging Metadata

The final example just illustrates how to tag metadata to actually make use of all the work above. This is done for the broken SPs to ensure they’re using either the old or new key, depending on where they are in the transition process. Since it’s based on metadata, it’s fully reloadable and can be deployed essentially instantly to coordinate with a change on the SP side.

There are two general cases with metadata: you control it or you don’t. Since you should NEVER rely on remotely supplied metadata from a vendor (there are a ton of reasons, I’m not going to go into them here), the only two cases that should matter are federation members and non-members. You should always control the metadata for non-members, and so for those cases, however you curate, generate, etc. that metadata should be extended to support the EntityAttributes extension. The tag needed here looks like this:

<EntityDescriptor xmlns="urn:oasis:names:tc:SAML:2.0:metadata"
  xmlns:mdattr="urn:oasis:names:tc:SAML:metadata:attribute"
  xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
  entityID="https://sp.example.org">
  
  <Extensions>
    <mdattr:EntityAttributes>
      <saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:uri"
          Name="http://shibboleth.net/ns/profiles/securityConfiguration">
        <saml:AttributeValue>osu.SecurityConfig.2019</saml:AttributeValue>
      </saml:Attribute>
    </mdattr:EntityAttributes>
  </Extensions>
...
</EntityDescriptor>

The other case is federation metadata, which you don’t control directly. For these cases, you must use the EntityAttributesFilter to attach the tag at runtime. This is where brute force enumeration of the SPs comes into play. This example would typically live inside a <MetadataProvider> element, but it is possible to maintain such filters externally in a separate file by means of the ByReferenceFilter feature.

          <!-- Locked to newer signing key. -->
          <MetadataFilter xsi:type="EntityAttributes">
            <saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:uri" Name="http://shibboleth.net/ns/profiles/securityConfiguration">
              <saml:AttributeValue>osu.SecurityConfig.2019</saml:AttributeValue>
            </saml:Attribute>

            <Entity>https://sp.example.org</Entity>
            <Entity>https://another.example.org/sp</Entity>
          </MetadataFilter>

Typically you would have two of these filters for an extended period, attaching either the old or the new tag to the SPs as needed.
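
The companion filter for SPs that have not yet moved might look like the following sketch. The entityIDs are hypothetical placeholders; the tag value is the old configuration bean defined earlier.

```xml
<!-- Still locked to older signing key. -->
<MetadataFilter xsi:type="EntityAttributes">
  <saml:Attribute NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:uri"
      Name="http://shibboleth.net/ns/profiles/securityConfiguration">
    <saml:AttributeValue>osu.SecurityConfig.2004</saml:AttributeValue>
  </saml:Attribute>

  <!-- Hypothetical entityIDs of SPs not yet updated. -->
  <Entity>https://legacy.example.org/sp</Entity>
  <Entity>https://vendor.example.com/saml</Entity>
</MetadataFilter>
```

Moving an SP from the old key to the new one then amounts to moving its <Entity> line from this filter to the other at the moment the SP side is changed.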

Future Key Rotations

Given a configuration as outlined, the process for future key changes is thus:

Define a new credential bean for the new key.
Define a new SecurityConfiguration bean using the new key.
Publish the new key via federation metadata.
Wait a day.
Change the alias for the “default” signing credential to the new key.
Remove the old key from the federation metadata.
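
During the overlap between publishing the new key and removing the old one, the IdP’s own published metadata carries both signing keys. A minimal sketch of that state, with certificate bodies and endpoints left out:

```xml
<!-- Excerpt of the IdP's published metadata during the overlap window. -->
<IDPSSODescriptor xmlns="urn:oasis:names:tc:SAML:2.0:metadata"
    xmlns:ds="http://www.w3.org/2000/09/xmldsig#"
    protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">

  <!-- Current default signing key; removed in the last step. -->
  <KeyDescriptor use="signing">
    <ds:KeyInfo>
      <ds:X509Data>
        <ds:X509Certificate><!-- base64 of the current certificate --></ds:X509Certificate>
      </ds:X509Data>
    </ds:KeyInfo>
  </KeyDescriptor>

  <!-- New signing key; published first, made the default after the wait. -->
  <KeyDescriptor use="signing">
    <ds:KeyInfo>
      <ds:X509Data>
        <ds:X509Certificate><!-- base64 of the new certificate --></ds:X509Certificate>
      </ds:X509Data>
    </ds:KeyInfo>
  </KeyDescriptor>

  <!-- SingleSignOnService and other endpoints omitted from this sketch. -->
</IDPSSODescriptor>
```

A compliant SP accepts signatures from either key while both are listed, which is what makes the alias flip safe for it.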

In parallel, start addressing all the other SPs one at a time, flipping the tag value from the old config bean to the new config bean as they are addressed. Once all are done, remove the old credential and config beans and you’re done.

Sounds simple but of course is a massive amount of work.

You will miss stuff. When you’re wrong about what an SP is doing, you’ll find out and simply have to adjust things to reflect that going forward. Usually there will be reasons why you were wrong, such as “they changed software” or “they can’t really handle multiple keys in the federation metadata and so are actually broken”, and you capture that information for the next time.

The important thing is to try and accurately gauge this up front when adding new SPs so that you minimize the mistakes later.