Persistence (Historical Design Notes / Comments)

The persistence layer is responsible for insulating components that have to preserve data across multiple web requests from the specifics of data storage. This includes data associated with specific clients as well as data that has to be globally accessible when servicing all clients. Components with more specialized requirements that are not easy to abstract behind a common interface need not use this layer.

Draft Proposal

The following technical requirements for the abstract API are suggested based on experience with the Service Provider's overlapping requirements:

  • String-Based API
    • Handle storing string and text data (blobs can be encoded as text), keeping serialization of objects separate.
    • One of the consequences of this is that aliasing has to be implemented by hand by managing alternate indexes to information. For example, a secondary key B to an object keyed by A would be stored as a mapping of the strings (B, A) so that B can be used to find A. If the mapping of B to A is not unique, then the value becomes a list requiring upkeep, and this can cause performance problems if the set of A is unbounded or large. If this is a common case, building in explicit (and thus more efficient) secondary indexing may be worth considering.
  • Two-Part Keys
    • Supporting "partitions" or "contexts" makes it practical to share one instance of a storage back-end across different client components. Not such a big deal with database tables or in-memory storage, but very useful for options like memcache. Ultimately many back-ends will have to combine the keys, but that can be left to implementations to deal with.
  • Exposing Capabilities
    • Exposing back-end implementation capabilities such as maximum key size enables clients to intelligently query for them and adapt behavior. For example, some components might be able to truncate or hash keys while others might not. This might be something to enhance by adding pluggable strategy objects to shorten keys. Another aspect of variable behavior might be support for versioning, which a client-side storage option wouldn't handle (you can't conditionally set a cookie).
  • Internal Synchronization
    • All operations should be atomic to simplify callers.
  • Versioning
    • Attaching a simple incrementing version to records makes detecting collisions and resolving contention relatively simple without necessarily losing data. Callers can determine whether to ignore or reconcile conflicts. As noted, this may need to be an optionally supported feature.
  • TTLs
    • All records normally would get a TTL value to support cleanup. This wouldn't work for some use cases, so we probably need a permanent option (which again, might be negotiable).
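
To make the requirements above concrete, here is a minimal in-memory sketch of what such an API could look like. Every name in it (StorageRecord, InMemoryStorage, create, read, updateWithVersion, maxKeySize) is hypothetical and chosen for illustration only, as is the convention that a null expiration means "permanent". It is not a proposed final interface.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** One stored record: string value, incrementing version, optional expiration. */
final class StorageRecord {
    final String value;
    final long version;
    final Long expiresAtMillis; // null means the record never expires (assumed convention)

    StorageRecord(String value, long version, Long expiresAtMillis) {
        this.value = value;
        this.version = version;
        this.expiresAtMillis = expiresAtMillis;
    }
}

/** Hypothetical in-memory back-end with two-part keys, versions, and TTLs. */
final class InMemoryStorage {
    // context -> (key -> record); the back-end decides how to combine the two key parts.
    private final Map<String, Map<String, StorageRecord>> contexts = new ConcurrentHashMap<>();

    /** Capability exposure: callers can query limits and adapt, for example by hashing keys. */
    int maxKeySize() { return 255; } // arbitrary illustrative limit
    boolean supportsVersioning() { return true; }

    /** Creates the record only if no live record exists; returns false otherwise. */
    boolean create(String context, String key, String value, Long expiresAtMillis, long now) {
        Map<String, StorageRecord> ctx = contexts.computeIfAbsent(context, c -> new ConcurrentHashMap<>());
        boolean[] created = {false};
        // compute() on a ConcurrentHashMap runs atomically for the given key.
        ctx.compute(key, (k, old) -> {
            if (old != null && !expired(old, now)) return old;
            created[0] = true;
            return new StorageRecord(value, 1, expiresAtMillis);
        });
        return created[0];
    }

    /** Returns the live record, or empty if it is missing or expired. */
    Optional<StorageRecord> read(String context, String key, long now) {
        Map<String, StorageRecord> ctx = contexts.get(context);
        if (ctx == null) return Optional.empty();
        StorageRecord r = ctx.get(key);
        return (r == null || expired(r, now)) ? Optional.empty() : Optional.of(r);
    }

    /** Updates only if the caller's version matches; returns the new version, or -1 on conflict or missing record. */
    long updateWithVersion(String context, String key, long expectedVersion,
                           String value, Long expiresAtMillis, long now) {
        Map<String, StorageRecord> ctx = contexts.get(context);
        if (ctx == null) return -1;
        long[] result = {-1};
        ctx.computeIfPresent(key, (k, old) -> {
            if (expired(old, now) || old.version != expectedVersion) return old;
            result[0] = old.version + 1;
            return new StorageRecord(value, result[0], expiresAtMillis);
        });
        return result[0];
    }

    private static boolean expired(StorageRecord r, long now) {
        return r.expiresAtMillis != null && r.expiresAtMillis <= now;
    }
}
```

A caller that loses the version check gets -1 back and can decide whether to re-read and reconcile or simply ignore the conflict. The current time is passed in explicitly only to keep the sketch easy to exercise.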

At least in the SP, eventing or pub/sub has not been a requirement to date and I'd like to avoid it if we can, since it greatly limits the possible implementations.

Use Cases

Replay Cache

Most identity protocols assume the use of nonces (usually via message IDs) to prevent replay attacks, though these checks are usually of low importance within the IdP. The more valuable capability is in detecting stale requests to prevent the browser from being trapped in a back-button / login loop. Because of the low security importance, an unreplicated in-memory storage service is usually sufficient. A passively replicated data store would also work well. Client-side storage is not an option, obviously.

Use of the storage API is straightforward; a context is used to isolate the namespace of possible values being checked and the value to check is the key. The value is irrelevant. The key size here can potentially exceed a desirable key size, though not in general, and hashing is sufficient to address that.
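
As a rough illustration of the check described above, the following sketch keeps an unreplicated in-memory map and hashes any key that exceeds an assumed size limit. The class name, the 64-character limit, and the "!" separator used to combine context and key are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical replay cache layered on an unreplicated in-memory map. */
final class ReplayCacheSketch {
    private static final int MAX_KEY_SIZE = 64; // assumed back-end limit, for illustration only

    // Combined (context, key) -> expiration time in milliseconds. The value itself carries no meaning.
    private final Map<String, Long> entries = new ConcurrentHashMap<>();

    /** Returns true if the value is new within the context, false if it is a replay. */
    boolean check(String context, String messageId, long expiresAtMillis, long now) {
        // Assumes context names never contain the "!" separator.
        String key = context + "!" + shorten(messageId);
        boolean[] fresh = {false};
        // Atomic check-and-insert: an existing, unexpired entry means a replay.
        entries.compute(key, (k, oldExpiry) -> {
            if (oldExpiry != null && oldExpiry > now) return oldExpiry;
            fresh[0] = true;
            return expiresAtMillis;
        });
        return fresh[0];
    }

    /** Hashes keys that exceed the assumed limit; shorter keys are stored unchanged. */
    static String shorten(String key) {
        if (key.length() <= MAX_KEY_SIZE) return key;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString(); // 64 hex characters, which fits the assumed limit
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

Expired entries are only overwritten here, never swept. A real implementation would need some form of cleanup.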

Artifact Store

The SAML artifact mechanism requires associating artifact message handles with assertions or messages. For SAML 1 artifacts to function, all servers responding to artifact lookup requests need access to the data store, making in-memory implementations suitable only for single-node systems. Replication would need to be rapid and reliable. For SAML 2 artifacts, it's possible to associate an artifact with a server URL. With additional work to deploy dedicated TLS-protected virtual hosts with unique names, it's possible to avoid a replicated artifact store. Normally every server in a cluster would be load-balanced behind one name and certificate, so this is much more complex to support, probably requiring additional addresses or ports. In either case, client-side storage is not an option.

The two-part key mechanism is irrelevant here because all artifacts are unique by themselves. The message handle is the key, and the serialized message is the value. The key size here can potentially exceed a desirable key size, though not in general, and hashing is sufficient to address that. The value is a potentially non-trivial message on the order of 10k in size.

Terms of Use

No experience with this use case, but I would speculate that this is associating some kind of local user identity with an identifier representing some kind of ToU. I would imagine a ToU could contain parameterized sections or require user input that would need to be preserved, and that would be a simple matter of storing a more complex object produced by a particular ToU module. I could imagine needing a TTL for this data for ToU that have to be renewed periodically, but permanence might also be needed.

Server-side storage here seems awkward without replication, since a user wouldn't understand why he/she was being prompted again. Client-side storage is possible but also quite awkward due to multiple devices. Also seems like a bad thing to eat into our exceedingly limited cookie space. This could be a use case for Web Storage.

Consent

Need to investigate existing uApprove code to see what's being stored.

Technology considerations seem similar to the Terms of Use case, only more so. No way anything more than a global yes/no fits into a cookie, but Web Storage is a possibility if the extra prompting from multiple devices isn't a concern.

Session Store

We need some form of persistence for user sessions to support SSO, and features like logout depend on what we store and how we store it. This is a primary use case for client-side storage, but also a difficult one because of size limitations, particularly if logout is involved. This is a likely candidate for storing some kind of structured data as a blob but unlike the SP, sessions shouldn't need to be arbitrarily extensible.

As a first cut, the data involved is:

  • a unique ID, highly random (16 bytes)
  • representation of the user (ideally a canonical name) (256 bytes)
    • currently this is defined per service and allows us to attach things like the client address so that the resolver can use it
  • expiration based on time of last use (8 bytes)
  • n-ary authentication state (time, duration, method) (8 + 8 + 2 bytes)
  • n-ary service login records (entityID, method, NameID, SessionIndex) (256 + 2 + ? + 32 bytes)
    • method mainly serves here to drive attribute filters based on authentication method, can we toss this?
    • do we need time of login to a service?

Lookup of sessions is primarily by the unique ID, except when logout is involved. Then we need lookup by (entityID, NameID, SessionIndex?).

A simple layering on top of the API might be to pickle the entire structure against the session ID as a single-part key, and then create reverse mappings for the entityID + NameID (or a hash) of each service login to the session ID. The reverse mappings ought to expire on the same basis as the primary record, but that might not be efficient to manage, not clear at this point.
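
The layering described above might be sketched as follows, with a plain nested map standing in for the string-based storage API. The class, method, and context names are hypothetical, and the sketch assumes each (entityID, NameID) pair maps to at most one session, which sidesteps the list-valued index problem noted earlier.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical session layering: primary record by session ID plus reverse mappings for logout. */
final class SessionIndexSketch {
    // Stand-in for the string-based storage API: context -> (key -> value).
    private final Map<String, Map<String, String>> store = new HashMap<>();

    /** Stores the serialized session against its unique ID. */
    void saveSession(String sessionId, String serializedSession) {
        put("sessions", sessionId, serializedSession);
    }

    /** Adds the reverse mapping hash(entityID, NameID) -> session ID for one service login. */
    void addServiceLogin(String sessionId, String entityId, String nameId) {
        put("sessions.byService", reverseKey(entityId, nameId), sessionId);
    }

    /** Logout path: resolves (entityID, NameID) to the serialized session, or null if unknown. */
    String findSessionForLogout(String entityId, String nameId) {
        String sessionId = get("sessions.byService", reverseKey(entityId, nameId));
        return sessionId == null ? null : get("sessions", sessionId);
    }

    /** Fixed-size key derived from both values, so an arbitrarily long NameID cannot overflow key limits. */
    static String reverseKey(String entityId, String nameId) {
        try {
            // The length prefix keeps ("ab", "c") distinct from ("a", "bc").
            String joined = entityId.length() + ":" + entityId + nameId;
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(joined.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }

    private void put(String context, String key, String value) {
        store.computeIfAbsent(context, c -> new HashMap<>()).put(key, value);
    }

    private String get(String context, String key) {
        Map<String, String> ctx = store.get(context);
        return ctx == null ? null : ctx.get(key);
    }
}
```

Hashing is only safe for the lookup key. The NameID itself still has to be kept intact inside the serialized session so it can be propagated in a logout request. Expiring the reverse mappings together with the primary record is left out of the sketch entirely.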

With a server-side approach, data needs to be replicated at least by the time load-balancer stickiness wears off, or SSO won't happen (nor logout, of course). A client-side approach is the holy grail here, but see below.

Estimated size data is shown above. We could use 2-byte shorts to represent authentication methods, and expand those into URIs only when needed. This saves substantial space in capturing authentication state. An entityID can be up to 1024 bytes according to the specification, but in practice they are much shorter and well under 256 bytes. The outlier is the NameID, which is nearly unbounded in theory, and we can't hash it down because the whole point is to be able to propagate it in a logout request.

Even a simple case study is already well in excess of some browser limits on total cookie size for a domain, even without including overhead for padding, encryption, a MAC, and encoding. Compression would help a little but probably not significantly. Web Storage does not seem like a good fit for this use case either. Session information needs to be accessible to the server without a lot of hoop-jumping, and Web Storage does not allow for this. We would have to generate interstitial pages that use JavaScript to read and post back the session data to the server in the middle of the conversation.
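
For a rough sense of scale, the arithmetic can be written out using the per-field estimates from the data list above. The NameID size is left open in that list, so the sketch takes it as a caller-supplied parameter. The class and method names are hypothetical, and the Base64 step is only one of the overheads mentioned (padding, encryption, and a MAC would add more).

```java
/** Back-of-the-envelope session size estimate using the per-field figures from the list above. */
final class SessionSizeEstimate {
    static final int ID = 16;
    static final int USER = 256;
    static final int EXPIRATION = 8;
    static final int AUTHN_STATE = 8 + 8 + 2;            // time, duration, method
    static final int SERVICE_LOGIN_FIXED = 256 + 2 + 32; // entityID, method, SessionIndex

    /** Raw size in bytes; nameIdBytes is an assumption supplied by the caller. */
    static int rawBytes(int authnStates, int serviceLogins, int nameIdBytes) {
        return ID + USER + EXPIRATION
                + authnStates * AUTHN_STATE
                + serviceLogins * (SERVICE_LOGIN_FIXED + nameIdBytes);
    }

    /** Base64 turns every 3 bytes (rounded up) into 4 characters. */
    static int base64Chars(int rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }
}
```

With one authentication state, five service logins, and an arbitrarily chosen 256-byte NameID, rawBytes gives 3028 bytes and base64Chars gives 4040 characters, before any encryption or MAC overhead is counted.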

I think a likely direction here is to split off the data associated with service logins because that's only required for logout, and is the entire reason this becomes unmanageable to store in cookies. Thus, the session cache component could incorporate multiple storage service instances injected for the different subsets of data.

Another problem here is that the current server-side design allows us to make data available to the resolver about the user or the client extensibly via Java subject/principal objects. Moving that to the client creates problems with attribute queries, and it's been bad in the past to support functionality that only works with push, because it breaks the symmetry and consistency of the resolver's behavior in different flows. This may be another opportunity to push advanced needs to server-side storage.

Possible Implementations

In-Memory

Not much to say, this is obviously straightforward.

Memcache

There's an existing implementation of the V2 session cache, and a version of the SP interface, which leads me to assume this should be possible. What isn't clear to me is the point of it. I know memcache's value as a cache, but this is a storage layer, not a cache. Unless the service were deployed separately from any IdP node, there would be no simple way to take down the server with the memcache daemon. With a single point of failure like that, a database seems like a much better choice. Probably this is another case where non-persistent state and true persistence lead to different back-ends.

JDBC

Clearly possible, and the SP has an ODBC implementation. JDBC should be a straightforward port even without optimizing it.

Cookies

Supporting cookies is principally a size problem. Full portability means limiting total cookie usage to 4k for the whole IdP, and we probably lose 25% to securing the data. Chunking is probably a waste of time unless we want to target browsers without the tiny domain-wide limit. Opera's probably practical to treat exceptionally, but I don't think Safari is.

The storage API would obviously need direct or indirect access to the HttpServletRequest/Response pair, and there could be timing issues if an attempt to update the data were made after generating a response to the client.

The V2 session cache uses a different HMAC key for every session because it's storing the session key on the server. A client-only model would mean using a fixed key or keys. This isn't a major problem except that we also need to encrypt, which is not something the V2 code does. Keeping the encryption key safe means handling key versioning and ideally automating the generation of new keys, perhaps on a schedule.
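
One way the key-versioning idea could look is sketched below. Note the substitution: instead of separate encryption and HMAC steps, the sketch uses AES-GCM, which provides confidentiality and integrity in one primitive. The class name, the output layout, and the in-memory key map are all assumptions. Real keys would need to be shared across nodes and stored safely.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sealer: encrypts under the current key and tags the output with the key version. */
final class VersionedKeySealer {
    private static final int IV_LEN = 12;
    private static final int TAG_BITS = 128;

    private final Map<Integer, SecretKey> keys = new HashMap<>();
    private final SecureRandom random = new SecureRandom();
    private int currentVersion = 0;

    VersionedKeySealer() {
        rotate(); // start with key version 1
    }

    /** Adds a new key and makes it current; older keys remain available for decryption. */
    int rotate() {
        byte[] raw = new byte[16];
        random.nextBytes(raw);
        keys.put(++currentVersion, new SecretKeySpec(raw, "AES"));
        return currentVersion;
    }

    /** Output layout (assumed): 4-byte key version | 12-byte IV | ciphertext including the GCM tag. */
    byte[] seal(byte[] plaintext) {
        try {
            byte[] iv = new byte[IV_LEN];
            random.nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, keys.get(currentVersion), new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = c.doFinal(plaintext);
            return ByteBuffer.allocate(4 + IV_LEN + ct.length)
                    .putInt(currentVersion).put(iv).put(ct).array();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Selects the key by the embedded version; throws IllegalArgumentException on an unknown key or failed check. */
    byte[] open(byte[] sealed) {
        try {
            ByteBuffer buf = ByteBuffer.wrap(sealed);
            SecretKey key = keys.get(buf.getInt());
            if (key == null) throw new IllegalArgumentException("unknown key version");
            byte[] iv = new byte[IV_LEN];
            buf.get(iv);
            byte[] ct = new byte[buf.remaining()];
            buf.get(ct);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            return c.doFinal(ct);
        } catch (GeneralSecurityException e) {
            throw new IllegalArgumentException("tampered or undecryptable data", e);
        }
    }
}
```

In this layout the fixed overhead is 32 bytes per sealed value (version, IV, and tag) before any text encoding, which matters given how little cookie space there is to begin with. A scheduled job calling rotate() would cover the automated generation of new keys. Retiring old keys is not shown.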

Applying the notion of a cookie name/value pair to the proposed technical design above, one might represent every individual record as a separate cookie, but this seems impractical because of the overhead of securing them. If we imagine that use of cookie storage would be relatively minimal because of the size limitations, it seems possible to serialize an entire set of mappings into a cookie named by a storage service instance. That is, the cookie name acts like a database connection name and the storage plugin "connects" to the cookie when asked to read the data, and writes back changes. Clearly this involves some overhead, but it maps well to the design and seems to work well if the number of mappings is low or one.
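
A minimal sketch of the "whole set of mappings in one cookie" idea follows. The codec name and the wire format (Base64url-encoded keys and values, "." between key and value, ":" between entries) are invented for the example. Securing the resulting string and attaching it to the HTTP response are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical codec that packs a small set of key/value records into one cookie-safe string. */
final class CookieMapCodec {
    private static final Base64.Encoder ENC = Base64.getUrlEncoder().withoutPadding();
    private static final Base64.Decoder DEC = Base64.getUrlDecoder();

    /** Encodes all records as key.value pairs joined by ':'; an empty map becomes an empty string. */
    static String encode(Map<String, String> records) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : records.entrySet()) {
            if (sb.length() > 0) sb.append(':');
            sb.append(b64(e.getKey())).append('.').append(b64(e.getValue()));
        }
        return sb.toString();
    }

    /** Reverses encode(); throws IllegalArgumentException on malformed input. */
    static Map<String, String> decode(String cookieValue) {
        Map<String, String> records = new LinkedHashMap<>();
        if (cookieValue.isEmpty()) return records;
        for (String entry : cookieValue.split(":", -1)) {
            String[] kv = entry.split("\\.", -1);
            if (kv.length != 2) throw new IllegalArgumentException("malformed entry");
            records.put(unb64(kv[0]), unb64(kv[1]));
        }
        return records;
    }

    private static String b64(String s) {
        return ENC.encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    private static String unb64(String s) {
        return new String(DEC.decode(s), StandardCharsets.UTF_8);
    }
}
```

Under this scheme the storage plugin would decode the cookie once when asked to read, apply changes to the map, and encode it again when writing back, so the cost of securing the data is paid once per cookie rather than once per record. Base64 costs roughly a third in size, so a real implementation might prefer a tighter encoding.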

Versioning wouldn't be easy here since different nodes could both update and write back the same cookie, but I suppose one could have some kind of server-side synchronization of updates to the information such that it's in a consistent state before a cookie gets written back. Seems like a lot of work and hard to manage, and I would guess that the use cases for using the cookie for storage could live without versioning.

Web Storage

Web Storage has much better capacity than cookies, but when you dig into it, it is a very poor solution for storing data generated and manipulated by the server. It's totally targeted at client-side application logic. The only way the data gets to the server is via a JavaScript-triggered post operation to the server, which is an awkward thing to do. The overall implementation looks very awkward because even writing back data involves being able to inject JavaScript into a page at an appropriate time.

Another deficiency is that there's no support for expiration of data other than by brute force, unless the data is scoped to a browser window (I often open and close tabs and windows, which would break this).

On the other hand, for truly persistent data in which the user interface is deeply involved (think ToU and consent), this seems like a less than crazy idea. These are also areas of particularly likely dependence on JavaScript as a matter of course. The IdP proper seems like something for which we want to avoid such a dependency.