Troubleshooting
Basic Troubleshooting
The system provides ample logging to diagnose errors, and your first responsibility in any situation is to review that information. You should also be familiar with all the logs involved and what each is for, so you know where to look.
The core logs consist of:
The IdP process log, usually in idp.home/logs/idp-process.log (warnings and errors will be mirrored to idp.home/logs/idp-warn.log), is the primary troubleshooting resource for most problems. It should be the first place you look.
A servlet container log from Tomcat, Jetty, etc. This is usually NOT all that useful unless the IdP is actually refusing to initialize and start up at all, but may be important in those cases. Note that if the Status page is viewable, that likely means the IdP started properly.
A web server log in the case that you are proxying traffic from Apache or the like. This is also probably not involved unless the problem is very low-level and involves connectivity to the back-end system.
Web access logs, which track requests to the web server. These will be wherever traffic is handled (servlet container, Apache, etc.) and help with understanding what clients are asking for. This can help diagnose issues with the advertised locations for IdP functionality.
Back-end logs of authentication and attribute sources may also be necessary in some cases, but most of the time the IdP should provide enough in its own log to indicate what the problems are.
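If you want to watch the primary logs while reproducing a problem, something as simple as the following is enough; the path below is the conventional default install location and should be adjusted to your idp.home:
tail -f /opt/shibboleth-idp/logs/idp-process.log /opt/shibboleth-idp/logs/idp-warn.log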
Slow startup time
There are two known reasons for experiencing unusually slow startup: using Tomcat in certain ways and a blocking PRNG. Taken in order:
If you're using Tomcat, this issue is discussed on the (old) prep page along with some workarounds, primarily forcing it to expand the warfile, counter to the usual advice we give with Jetty.
With all containers, it's not uncommon for startup to be delayed, sometimes by several minutes, because Java defaults to using /dev/random as a source of randomness, which is a blocking device. The most common workaround is to switch to /dev/urandom instead, which can be done by passing a system property to Java during startup (e.g. on Jetty via start.ini):
-Djava.security.egd=file:/dev/urandom
Traditionally this has been viewed as less safe, though there are alternate views; the Shibboleth Project does not take an official position on the question.
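As a sketch of where that property goes on Jetty (the exact file layout varies between installations; the --exec directive tells Jetty's start.jar to fork a separate server JVM so that JVM arguments listed in the file take effect):
start.ini
--exec
-Djava.security.egd=file:/dev/urandom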
Common Error Screens
The login service was unable to identify a compatible way to respond to the requested application...
This is a human translation of the EndpointResolutionFailed event in the IdP, which triggers when the AssertionConsumerServiceURL in an SP's request does not match any endpoint in the SP's metadata; the IdP fails the request in accordance with the standard's requirement to validate the response location.
Most SAML SPs, and certainly most Shibboleth SPs, will include a full AssertionConsumerServiceURL attribute in their <AuthnRequest> message to the IdP. The value of the URL in a Shibboleth SP is determined by the computed request URL that led to the issuance of the request and is primarily a function of web server configuration (on Apache) or the SP's <ISAPI> site mapping configuration (on IIS). Whatever it's set to, that URL generally has to be one of the <AssertionConsumerService> endpoints in the SP's metadata used by the IdP, or this error message is returned by default and the IdP log records the problem.
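As a hypothetical illustration, an SP that sends AssertionConsumerServiceURL="https://sp.example.org/Shibboleth.sso/SAML2/POST" (an invented location) will only pass this check if the metadata the IdP has for it contains a matching endpoint, along these lines:
SP metadata (fragment)
<md:SPSSODescriptor protocolSupportEnumeration="urn:oasis:names:tc:SAML:2.0:protocol">
    <!-- The Location (and Binding) must correspond to what the SP actually sends. -->
    <md:AssertionConsumerService
        Binding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST"
        Location="https://sp.example.org/Shibboleth.sso/SAML2/POST"
        index="1"/>
</md:SPSSODescriptor>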
As an IdP operator you either have bad metadata, or you have a broken SP, and unless you manage the metadata yourself (likely in the case of a commercial SP), it's not your problem to solve. Only the SP operator knows whether the URL is valid, so you may have to update the metadata or they will have to stop generating requests including the bad URL. The cause of a bad URL is generally a failure to properly configure a web server running the SP to account for local virtualization, load balancing, etc. It could also simply be a failure to update metadata to reflect a change that the SP deployer has made.
Common Log Messages
org.springframework.webflow.conversation.impl.LockTimeoutException: Unable to acquire conversation lock after 30 seconds
If you see this in the log, usually accompanied by other problems and reports of users unable to log in successfully, you are experiencing a problem somewhere between the IdP and some back-end system you have configured for authentication, attribute resolution, or some other kind of remotely accessed functionality. That layer isn't failing or timing out quickly enough to prevent users from replaying requests, smashing form buttons, etc., and the result is clients stacking up requests that hang until the limit on simultaneous requests from one client is reached. Eventually new requests time out trying to acquire locks to join the queue, and this message starts getting logged.
You can't fix this other than by determining which aspect of your deployment is hung up and adjusting it to either behave properly or at least fail quickly enough to produce errors and get transactions completed. It may not work, but it shouldn't hang.
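Purely as one illustration: if the component that hangs turns out to be an LDAP authentication back end, tightening its timeouts so it fails promptly at least surfaces errors instead of piling up locked conversations. The property names below are the ones used in recent releases' ldap.properties (check them against your version), and the values are only examples.
ldap.properties
# Fail quickly rather than hang on an unresponsive directory (example values)
idp.authn.LDAP.connectTimeout = PT3S
idp.authn.LDAP.responseTimeout = PT3S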
Invalid XXXXService configuration
The XXXX above might be "NameIdentifierGeneration" or "RelyingPartyResolver" or "AttributeResolver", or any of the other service names. When you see this message while debugging something that's not working, you are missing the real problem and focusing on the aftermath.
The log is not just the last message you see; you have to look at all of it, particularly during startup or reloading a new configuration for a particular part of the system. The real error is that you've broken a part of the configuration in a particularly significant way, usually with a syntax error of some sort, and you need to look at the error that describes that problem to make any progress.
Sometimes that will be a fairly complex Spring message because of a problem creating Java objects, but occasionally it will be fairly obvious what the error is if the problem mentions an XML syntax error or a property name that turns out to be misspelled. Regardless, that's the real problem and you should ignore whatever comes after it until you resolve the original issue and get the configuration of the affected service at least loading again.
If you're really confused, try changing the relevant "failfast" property in the services.properties file to force the IdP to fail at startup. At least then you'll know where the real issue is.
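For example, to make a broken attribute resolver configuration abort startup rather than limp along, assuming the property names used in the stock services.properties (analogous failFast properties exist for the other services):
services.properties
# Refuse to start if the attribute resolver configuration is invalid
idp.service.attribute.resolver.failFast = true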
Error: "unable to find resource 'status.vm'"
The first time the status page is accessed after starting the IdP, the following log message may be seen:
logs/idp-warn.log
ERROR [org.apache.velocity:96] - ResourceManager : unable to find resource 'status.vm' in any resource loader.
Although this is presented by the Velocity library as an ERROR level message, this should be ignored unless you have customised the status page yourself. In fact, the default status page is implemented as a JSP and not as a Velocity template, and the message merely indicates that the fallback from Velocity to JSP is occurring for the status view the first time it is invoked. This message is filtered in the logging configuration in all recent IdP versions.
Error: "unable to find resource 'login.vm'"
If the default login UI is changed from a Velocity template to a JSP, the following error will be seen at least once:
logs/idp-warn.log
ERROR [org.apache.velocity:96] - ResourceManager : unable to find resource 'login.vm' in any resource loader.
Although this is presented by the Velocity library as an ERROR level message, this can be ignored. The message merely indicates that the fallback from Velocity to JSP is occurring. This message is filtered in the logging configuration in all recent IdP versions.
Miscellaneous issues
IdP stops sending authentication statements after 2 weeks; restarting the IdP temporarily resolves the issue
IdPs that consume metadata using the FileBackedHTTPMetadataProvider regularly download metadata from an external metadata source. The IdP has no control over the size of that metadata file, and it may have increased sufficiently in size from one download to the next that the IdP cannot process the later file. Since the IdP stores the previous file, it continues to use that until the metadata it contains expires (perhaps a couple of weeks after the initial out-of-memory error), at which point the IdP stops recognizing SPs. Restarting the IdP temporarily resolves the issue because it allows the IdP to download new metadata, until the next time it runs out of memory.
Symptoms include entries in idp-warn.log reporting out-of-memory errors at around the time the metadata refresh runs.
Please see the Installation pages for your particular deployment to determine how to set an appropriate Java heap size, and what that size should be, or (much more sensibly) just switch to the DynamicHTTPMetadataProvider via an MDQ service hosted by your federation.
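A minimal sketch of that alternative, assuming your federation operates an MDQ service at the invented base URL below; this goes in metadata-providers.xml alongside (or instead of) the existing provider, and in practice you would also attach the usual signature validation filter:
metadata-providers.xml (fragment)
<!-- Resolves each SP's metadata on demand from the federation's MDQ service. -->
<MetadataProvider id="FederationMDQ" xsi:type="DynamicHTTPMetadataProvider">
    <MetadataQueryProtocol>https://mdq.example.org/</MetadataQueryProtocol>
</MetadataProvider>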
Unhelpful logging information from exception traces
The default logging configuration in logback.xml truncates the information recorded for exceptions. Particularly when debugging a development system, it can be helpful to turn up the amount of detail in the log. This page has more details, but changing the relevant line in logback.xml so that full exception traces are logged can often provide a useful hint of where a fault has come from; a complete stack trace is usually far more helpful when debugging than a truncated one.
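Purely as an illustration of the kind of change involved (the stock pattern differs between versions, so compare against what actually ships in your logback.xml): the exception conversion word at the end of the encoder pattern controls the amount of trace output, with %ex{short} truncating it and %ex{full} (or plain %ex) printing the whole stack.
logback.xml (illustrative encoder only)
<encoder>
    <pattern>%date{ISO8601} - %level [%logger:%line] - %msg%n%ex{full}</pattern>
</encoder>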