Session Cleanup for Database Session Storage can cause performance issues
Description
Environment
90 Windows servers running the Shibboleth daemon (shibd) with database session management. The database is hosted on SQL Server.
Activity
Scott Cantor January 29, 2016 at 4:45 PM
Added a check so cleanup thread doesn't get created if interval is 0.
Scott Cantor October 1, 2015 at 9:28 PM
If it was actually crashing, I was hoping to see any log output from shibd prior to the crash, but honestly unless it were set very high, it's not likely to show much. When you said it "crashed" did you literally mean shibd crashed? Now I'm thinking you just meant the app itself backlogged into an unusable state and had to be restarted. The IIS half doesn't actually talk to the database itself, so if shibd were still running but was just "stuck" waiting on locks, that isn't what I meant by crash.
Unless you're feeding it requests with the same NameID value over and over, the issue in 636 isn't your bug. If you are doing that, you'd need thousands of sessions with that value before it would make a substantial impact.
A bunch of new sessions should not generally need to update the same row, unless they did happen to have the same NameID in them, and a few dozen even from one user with the same NameID shouldn't really be noticeable. Of course if it's all just backlogged by constant table locks doing deletes, I'm sure that will be horrible, just trying to understand the scope of the problem.
I honestly never considered the obvious implications of every node running that thread, but it's clear in hindsight.

Laura Stewart October 1, 2015 at 9:01 PM
The ability to disable the cleanup for certain servers would help the issue.
We do have logs for the issue. Is there a specific log/message that you want to see?
Here is an explanation of what we saw. On the SQL Server we saw a lot of blocking, including one session that blocked 1659 other queries of all types. That is when the ASP.NET errors and ISAPI failures occurred. What we think happened is that a delete query ran and locked the table. A user then tried to log in but had to wait because of the lock, got impatient, and retried again and again, and the same happened for multiple users. This created multiple database sessions all trying to update the same database row. When the delete lock was released, the database tried to process these updates, but they deadlocked because they all wanted the same row. These deadlocks in turn caused more locking, specifically of deletes, which caused more insert/select/update/delete locking, and so on. The result is a cascade of locks that ends in an outage. The update deadlocking looks to be related to SSPCPP-636. We are going to test changing the setting mentioned there in our performance lab to see if that prevents the issue.
Scott Cantor October 1, 2015 at 4:59 PM
If you have any log output from a deadlock that's crashing, I'd like to see it. There should never be crashes, and it shouldn't deadlock either: the cleanup thread is separate from any threads that would be taking locks. But if it's timing out the transactions waiting for locks, that might not be getting handled somewhere.
There's currently no way to disable the cleanup thread: it ignores a value of 0 for the cleanupInterval setting, which is wrong.
Scott Cantor October 1, 2015 at 3:33 PM
Safe to say 90 servers is way more than I would have thought could ever work, so that's interesting. But yeah, that makes sense, though I'm wondering what you mean by "in the database layer" since that's not my layer. If you mean through some manual cleanup process, I guess that's possible but most people wouldn't do it.
Either way, having the option to disable the cleanup thread is clearly the right thing so that it can be handled by only one of the servers.
We probably have this bug on the IdP side as well, will investigate since that's the only place we're actively working at the moment.
Re: the index, that's just documentation, feel free to edit as you see fit.
The session cleanup interval is defined in the shibboleth2.xml file. With database session management, each web server hosting the Shibboleth daemon serves as a listener, so every server tries to clean up database sessions. As the number of servers increases, this causes contention and prolonged locking. For instance, with 90 servers all running the cleanup every 15 minutes (a cleanup pass starting every 10 seconds on average, if the start times are evenly spread), the texts table could be locked for most of each 15-minute interval, depending on how long the cleanup takes. This is an issue because it blocks new sessions from being created. It can also cause deadlocks, which can crash the system (a DLL deadlock failure that required app pool recycles to correct).
The suggestion is to allow session cleanup to be handled at the database level rather than from each web server. Also, an index on the expires column of the texts and strings tables greatly improves the performance of these calls and reduces locking between the delete calls and new session creation.
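To illustrate the suggestion, here is a minimal sketch of an index on the expires column combined with a single cleanup job that runs once instead of on every web server. It uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere. The column layout (context, id, expires, version, value), the index names, and the helper names create_store and cleanup_expired are assumptions for illustration; only the texts and strings tables and their expires column come from this report.

```python
import sqlite3

TABLES = ("texts", "strings")  # table names from the report


def create_store(conn):
    """Create simplified session tables plus an index on expires.

    The column layout is an assumption, not the exact production schema.
    """
    for table in TABLES:
        conn.execute(
            f"CREATE TABLE {table} ("
            " context TEXT NOT NULL,"
            " id TEXT NOT NULL,"
            " expires INTEGER NOT NULL,"  # expiry time as epoch seconds
            " version INTEGER NOT NULL,"
            " value TEXT NOT NULL,"
            " PRIMARY KEY (context, id))"
        )
        # Index suggested in the report: lets the cleanup delete locate
        # expired rows without scanning the whole table.
        conn.execute(f"CREATE INDEX {table}_expires_idx ON {table} (expires)")
    conn.commit()


def cleanup_expired(conn, now):
    """Delete expired rows from both tables; return how many were removed.

    Intended to be run by one designated job rather than by every node.
    """
    removed = 0
    for table in TABLES:
        cur = conn.execute(f"DELETE FROM {table} WHERE expires < ?", (now,))
        removed += cur.rowcount
        conn.commit()  # commit per table so locks are released promptly
    return removed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_store(conn)
    conn.execute("INSERT INTO texts VALUES ('session1', 'a', 100, 1, 'x')")
    conn.execute("INSERT INTO texts VALUES ('session2', 'a', 200, 1, 'y')")
    conn.commit()
    print(cleanup_expired(conn, now=150))
```

On SQL Server the same idea would be a `CREATE INDEX` statement on each table's expires column and a scheduled job issuing the delete, with the cleanup thread disabled on the web servers by setting cleanupInterval to 0 once that option is honored.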