Solutioning
Parking lot for various thoughts on design/implementation so I don’t lose them. “Shower thoughts” would be a good name for this I guess.
Tracking Headers to Clear
A nasty part of the original SP is the notion of having to know all the possible header variables that might be populated, independently of the specific attributes that are present in any given session, to prevent smuggling.
Currently shibd computes the set of headers from the attribute mapping layer, and adds in various “special” ones, and serves up the list in a special remoting call that the modules use to get the set. Because the modules don’t really know if it’s changed, and because there’s no complex signaling/interrupt mechanism to get it to realize that, I eventually made the attribute mappings non-reloadable by default to prevent them from changing without a conscious deployer step, and even then you’d get out of sync if you did things improperly. I never liked any of that and it won’t work for the new design anyway.
My thought is that this should move out of shibd and into the agents, divorcing the decoding shibd does to create IdPAttributes out of protocol claims and other data from the names of the headers used to decorate requests. The former is handled by shibd but the latter should be in the agents. I envision a simple mapping file of IdPAttribute ID to local variable names, essentially “renaming” them (or not) and enumerating all of the possible “controlled” variables in the agent so that it knows locally what to clear before the decoration step where they’re exported into the requests.
It would remain non-reloadable since the whole agent is I think not going to support any kind of dynamic configuration anyway, as Apache and IIS don’t generally do that as it is. Parsing a property file syntax isn’t fun but shouldn’t be that much code and is a simple enough format for people to understand. Most deployments of the SP are fairly static in this regard anyway.
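A minimal sketch of the clear-then-decorate idea, in Python purely for brevity (the real agents would not be Python). The file contents, the function names, and the choice to compare header names case-insensitively are all assumptions for illustration, not a settled format:

```python
# Hypothetical mapping file contents: IdPAttribute ID = exported header name.
ATTRIBUTES_TEXT = """
# attribute ID = header name
eduPersonPrincipalName = eppn
mail = mail
eduPersonScopedAffiliation = affiliation
"""

def parse_mappings(text):
    """Parse simple key = value lines, skipping blank lines and # comments."""
    mappings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed mapping line: {raw!r}")
        mappings[key.strip()] = value.strip()
    return mappings

def decorate(request_headers, mappings, session_attributes):
    """Clear every controlled header, then export what the session holds."""
    # The controlled set comes from the file alone, not from the session.
    controlled = {name.lower() for name in mappings.values()}
    cleaned = {k: v for k, v in request_headers.items()
               if k.lower() not in controlled}
    for attr_id, values in session_attributes.items():
        header = mappings.get(attr_id)
        if header is not None:  # unmapped attributes are never exported
            cleaned[header] = ";".join(values)
    return cleaned

mappings = parse_mappings(ATTRIBUTES_TEXT)
incoming = {
    "Host": "service.example.org",
    "EPPN": "evil@example.org",          # smuggled by the client
    "Affiliation": "staff@example.org",  # smuggled, and not in this session
}
session = {"eduPersonPrincipalName": ["jdoe@example.org"]}
result = decorate(incoming, mappings, session)
```

The point of the sketch is that the smuggled Affiliation header is removed even though the session carries no such attribute, because the set to clear is derived from the mapping file rather than from the session.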
Agents, Entities, Applications, Oh My
The configuration model as a whole is complicated by the desire to support joint use of the shibd plugin by multiple, independent web server agents protecting unrelated systems. That is, 10 agents protecting service.example.org is simple in comparison, but combining those 10 with another 10 protecting unrelated.example.org adds an additional “dimension” to the configuration and to the security model to think about. It is a “multi-personality” / virtualized way of looking at shibd that is not part of the SP to date, nor part of the IdP, so we have no precedent for it as a feature.
We have to account for:
Applications and application overrides that are unrelated to each other and currently not named uniquely.
The fact that each application may act as a static entityID or as a dynamically computed entityID that is derived from the virtual host or in the general case based on specific paths to content.
Settings per-profile and per-relying party (the eventual IdP in this case), including potentially the entityID itself.
The fact that in the future we want to support protocols like OpenID that have similar constructs to an entityID but that may not be the same value.
The desire to hide as much protocol machinery as possible from the agents in the first place.
Initially I considered rooting everything in an Application, but that falls apart due to naming (every SP deployment currently identifies its default Application as “default”). Then I moved toward rooting this in the entityID, with the Applications contained in that as a sort of multi-map, but that falls apart because of how the SP computes the entityID to use in a content-specific way, not to mention the long term need for other protocols.
It also doesn’t address the significant problem of limiting which objects a given set of related agents can actually make use of; that is, how does the system prevent one agent from associating itself with an entityID and Application(s) of its choosing, introducing risks. An example: if the plugin supports resolving additional data about the user that is only meant for use or to be seen by specific agents, other agents could impersonate them to get access to that data. There are likely more esoteric risks but that’s an obvious one.
My mental strawman is thinking about this from the perspective of how to identify and secure the agent access, which is an unrelated but eventually necessary problem to tackle for shibd to be shareable in the way people want. Not surprisingly, this looks a lot like dealing with secret-based OpenID clients in the manner I’ve been advocating for scaling out enterprise deployments via the unregistered client approach; that is, the agents need to be backed by a service account process that issues/manages the account secrets needed for the agents to authenticate themselves to the hub.
Putting the issues with secret-based authentication to the side (will cover that separately), if one imagines this is “enough” to verify an agent and its requests, then it follows we can associate the correct set of Applications to each agent identity. Thus we have a model:
Agent Identity, some form of local agent identifier unique to a deployment (this would be shared across a cluster, but could I suppose be a set of identities):
Application(s) identified by a non-unique ID specific to each agent identifier, likely cribbed directly from existing SP configs in many cases:
A RelyingPartyConfigurationResolver associated with an Application to provide default and per-IdP profile settings.
There’s a realization underlying this: agents don’t want to know about SAML or entityIDs and neither do their deployers. If we’re serious about offloading complexity, then it should be the hub’s job to figure out what to name things in specific protocols to comport with the metadata the hub operator is willing to provide to IdPs.
The point is that it turns out we probably need to rethink the notion of entityIDs and client_ids in this implementation as solely a hub-side consideration. The agent connects to its Application definitions and it’s the job of the configuration to allow the hub to figure out how to make SAML or OpenID work for those applications in the ways preferred by that operator. Thus, not only is there probably not a “rule” applied to decide if an agent can “be” a particular entityID, but there probably isn’t a need to even allow agents to influence the value at all.
This also limits the risk of impersonation in light of the Shibboleth skipEndpointValidationWhenSigned option since if any agent connected to a hub were able to act as an entityID given that consideration by an IdP, all agents connected to that hub could leverage it given that the signing happens at the hub. That alone makes it impossible to allow agents to influence the value.
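To make the model concrete, here is a rough sketch in Python (for illustration only; the hub is Java). The class names, the setting names, and the dictionary standing in for a RelyingPartyConfigurationResolver are all hypothetical. It only shows that Applications are reachable solely through an authenticated agent identity and that the entityID is a hub-side setting the agent never supplies:

```python
from dataclasses import dataclass, field

@dataclass
class Application:
    app_id: str  # non-unique across agents, e.g. "default"
    # Stand-ins for what a RelyingPartyConfigurationResolver would provide.
    default_profile: dict = field(default_factory=dict)
    per_idp_profile: dict = field(default_factory=dict)

    def resolve_profile(self, idp):
        """Default settings overlaid with any per-IdP overrides."""
        return {**self.default_profile, **self.per_idp_profile.get(idp, {})}

@dataclass
class Agent:
    agent_id: str
    applications: dict = field(default_factory=dict)

class Hub:
    def __init__(self):
        self._agents = {}

    def register(self, agent):
        self._agents[agent.agent_id] = agent

    def lookup(self, authenticated_agent_id, app_id):
        """Applications are scoped to the agent identity that was verified."""
        agent = self._agents.get(authenticated_agent_id)
        if agent is None or app_id not in agent.applications:
            raise LookupError("unknown agent or application")
        return agent.applications[app_id]

hub = Hub()
hub.register(Agent("service.example.org", {
    "default": Application(
        "default",
        {"entityID": "https://service.example.org/sp"},
        {"https://idp.example.org": {"signRequests": True}},
    ),
}))
hub.register(Agent("unrelated.example.org", {
    "default": Application("default", {"entityID": "https://unrelated.example.org/sp"}),
}))

profile = hub.lookup("service.example.org", "default").resolve_profile(
    "https://idp.example.org")
```

Both agents use the application ID “default” without colliding, and neither message format nor lookup gives an agent any way to name an entityID.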
Compatibility
I don’t think we can or should try to support the V2/3 configuration files in any direct sense. It’s very messy to deal with when most of it will be ignored, and there’s enough of a different model here that forcing an explicit conversion of the main file is probably warranted.
The most significant piece that crosses this boundary is the RequestMap and the intent would be to drop the namespace but not change much else, so copying it right out should usually work ok. So far the changes there include:
Namespace (it likely needs to go away due to how we may end up parsing it)
Some older settings will be removed so would be ignored if left
s/entityID/authority to allow for genericity in terms
The other portions of the configuration are largely not relevant to the new agents, except for a small subset of basic information that should be much more concisely expressed using an updated syntax, so making that simpler seems like the more important goal than direct compatibility. This mainly includes session settings, the applications, and their handler locations (at least the base URL, if not the individual paths, not sure about that yet).
I am leaning toward using INI-style properties for everything possible. Using Apache commands has appeal of course, but it’s not the simplest code to support dozens of commands, and I think it will lead to more complexity than simpler alternatives might.
Configuration
Can we make ApplicationDefaults/ApplicationOverride go away?
I think so.
Mentally assessing the configuration for the agents in depth, I reached an epiphany about the entire edifice of Application/ApplicationOverride, Sessions, and Handler elements and think it may be possible to dispense with all of it in favor of moving almost every important setting into the RequestMap space. That is, I think we may be able to and probably want to move all notion of an Application to the hub (TBD whether by that name though it is in the Java code now). Right now the Applications in the hub as previously discussed are a container for the profile configuration settings and relying party overrides and that’s fine as a means of handling unusually complex cases, and agents can continue to “connect” to them by supplying a matching identifier in their messages.
But from the agent point of view, if all the protocol-specific content is ignored, what is an Application? It’s primarily a container for the Sessions element and the Handlers, and also a session boundary (in edge cases where path-based overrides are used).
The Sessions element is a package of settings that, unless I’m overlooking an intractable problem, could be pulled from the RequestMap directly to control actively applied rules such as lifetime/timeout, cookie settings, redirect limiting, etc. It may be a bit odd in some edge cases to see those settings at the path level, but I don’t think it’s inherently impossible given that at any point in processing, the SP is operating in the context of some web request with a mappable URL. In the end, it’s just collapsing out an indirection today whereby the request maps to an applicationId that is used to fetch an Application and on to its Sessions properties. This just drops out the indirection, accounting for the vastly more common case of a single set applied globally anyway.
Notably, this includes the almighty handlerURL, the thing that trips everyone up when doing path-based overrides, and moves it right where it probably should have been all along.
The problem of imposing a session boundary could be addressed by including a new setting that influences the session cache/lookup process to isolate sets of sessions from each other via some identifier.
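As a rough sketch of what “collapsing the indirection” might look like, the following Python (illustration only) resolves session-related settings straight from a host/path hierarchy with inheritance. The nested dictionary stands in for the RequestMap, and the sessionGroup key is a hypothetical name for the session-isolation setting:

```python
# Stand-in for a parsed RequestMap: global settings, then hosts, then paths.
REQUEST_MAP = {
    "settings": {"lifetime": 28800, "timeout": 3600,
                 "handlerURL": "/Shibboleth.sso"},
    "hosts": {
        "service.example.org": {
            "settings": {},
            "paths": {
                "secure": {
                    "settings": {"timeout": 900},
                    "paths": {
                        "admin": {
                            "settings": {
                                "sessionGroup": "admin",  # hypothetical name
                                "handlerURL": "/secure/admin/Shibboleth.sso",
                            },
                            "paths": {},
                        },
                    },
                },
            },
        },
    },
}

def resolve_settings(request_map, host, path):
    """Walk host then path segments, letting deeper settings override."""
    settings = dict(request_map["settings"])
    node = request_map["hosts"].get(host)
    if node is None:
        return settings
    settings.update(node["settings"])
    for segment in filter(None, path.split("/")):
        node = node["paths"].get(segment)
        if node is None:
            break  # no more specific rule; keep what was inherited so far
        settings.update(node["settings"])
    return settings

admin = resolve_settings(REQUEST_MAP, "service.example.org", "/secure/admin/users")
public = resolve_settings(REQUEST_MAP, "service.example.org", "/index.html")
```

There is no applicationId lookup anywhere in this path: the request URL alone yields the lifetime, timeout, handlerURL, and session grouping.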
The handlers are more complex. To fully replace all that without replicating the existing hierarchical configuration will need a different way of looking at how handlers get defined and how to tell when they’re active/available. I think as a strawman, it’s plausible to imagine a handlers.ini file as a default, with each path being a section key, and the settings for the handler (including its type obviously) under those keys. As a default, it would apply automatically and would require no other machinery to “activate” for processing a given request.
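A quick sketch of the handlers.ini strawman, using Python’s configparser only to show that path-named sections parse cleanly. The handler type names and the acl property here are placeholders, not a proposed vocabulary:

```python
import configparser

# Hypothetical handlers.ini content: each section name is a handler path.
HANDLERS_INI = """
[/SAML2/POST]
type = SAML2

[/Login]
type = SessionInitiator

[/Logout]
type = LogoutInitiator

[/Status]
type = Status
acl = 127.0.0.1
"""

def load_handlers(text):
    """Return {path: properties}, insisting every handler names its type."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep property names case-sensitive
    parser.read_string(text)
    handlers = {}
    for path in parser.sections():
        props = dict(parser[path])
        if "type" not in props:
            raise ValueError(f"handler {path} has no type")
        handlers[path] = props
    return handlers

def find_handler(handlers, handler_url, request_path):
    """Match the part of the request path that follows the handlerURL."""
    if not request_path.startswith(handler_url):
        return None
    return handlers.get(request_path[len(handler_url):])

handlers = load_handlers(HANDLERS_INI)
acs = find_handler(handlers, "/Shibboleth.sso", "/Shibboleth.sso/SAML2/POST")
```

Nothing beyond the file itself is needed to “activate” a handler; presence of the section is the activation.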
To support the full range of insanity allowed now in which overrides can define different handler sets, it will be necessary to allow for “alternative” handler sets, probably in separate files, and then we would need some kind of connective setting to tie a request via the RequestMap to the right handler set. This means that in effect the original purpose of the applicationId (which used to be forcibly defaulted to the string “default”) becomes a means of connecting requests not to an Application but to a handler configuration.
Why is this better? Because 99% of deployments likely will never need more than one handler configuration. Most of what historically was tweaked with handler settings tends now to be handled via RequestMap anyway (e.g., SAML protocol options). So if we can move to a situation where the original need for overrides is limited solely to cases where the handler set has to be different, we can further reduce the probability people would ever need to do this kind of thing at all, and most agents will simply have a default set of handlers and a RequestMap, plus a few global sets of settings.
And voila, we have the makings of a new configuration.
One edge case was chaining, which showed up in two cases, the SessionInitiator and LogoutInitiator. (Yes, I should have called it LoginInitiator, but we didn’t have logout yet.) My intent right now is that the purposes behind chaining SessionInitiators (discovery and protocol precedence) would both become hub jobs. The latter already is implemented, and the former isn’t, but because SAML discovery requires knowledge of the SP’s entityID, it was ultimately going to have to move there anyway. Logout is much farther off to work out, but my assumption is that the basic mechanic of “Local or Global” can be implemented within a single handler, since the “global” behavior is going to be up to the hub anyway.
Similar to handlers, there is also potentially a need for multiple, independent sets of rules about how to name attributes, particularly when HTTP headers are used because of the smuggling risk.
So where does that leave the configuration?
As a strawman:
shibboleth.ini - the new, simple sectioned property file containing:
maybe some logging settings
hub connectivity settings (the client’s secret obviously would likely be in an additional file perhaps)
session cache settings
possibly the old ISAPI material for the IIS module, but that might end up in its own little ini file
a mapping of configuration IDs to handler files, defaulting to “default” → handlers.ini
a mapping of configuration IDs to attribute files, defaulting to “default” → attributes.ini
handlers.ini - the default handlers
each section would be the path (odd, but should work), with the handler properties underneath
attributes.ini - attribute name mappings
default list of mappings of attributes needed for header feature to inform the code of what headers to “protect”
can’t be sparse, because anything not listed would be unprotected, so even foo=foo would probably have to be included
would be optional in the case of server variable use but likely still useful as a remapping tool anyway
might be a place to also put “global” settings related to attribute handling, TBD
request-map.xml
the Big Kahuna that’s probably still XML because it’s just not practical to do it any other way
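To illustrate the ID-to-file mappings in the strawman, a small Python sketch follows (illustration only). The section and property names, the hub URL, and the file names other than handlers.ini and attributes.ini are invented for the example:

```python
import configparser

# Hypothetical shibboleth.ini content.
SHIBBOLETH_INI = """
[hub]
url = https://hub.example.org/agent
secret-file = agent-secret.txt

[session-cache]
directory = /var/cache/shibboleth

[handlers]
admin = admin-handlers.ini

[attributes]
"""

# Implied entries when a file mapping does not mention "default".
DEFAULT_FILES = {"handlers": "handlers.ini", "attributes": "attributes.ini"}

def load_config(text):
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # configuration IDs stay case-sensitive
    parser.read_string(text)
    return parser

def config_file_for(parser, kind, config_id="default"):
    """Resolve a configuration ID to a file for 'handlers' or 'attributes'."""
    section = parser[kind] if parser.has_section(kind) else {}
    if config_id in section:
        return section[config_id]
    if config_id == "default":
        return DEFAULT_FILES[kind]
    raise KeyError(f"no {kind} configuration named {config_id!r}")

config = load_config(SHIBBOLETH_INI)
default_handlers = config_file_for(config, "handlers")
admin_handlers = config_file_for(config, "handlers", "admin")
default_attributes = config_file_for(config, "attributes")
```

A deployment that never needs alternative sets could leave both mapping sections empty or omit them entirely.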
Sketching the Request Flow
Working on the initial security and request processing model, given these assumptions:
The request body is the serialization of a DDF either in my proposed form or later on as JSON. (Point being, not form encoded.)
The Authorization header would carry Basic auth for now, with the agent ID and a shared secret (WWW-Authenticate is the server’s challenge header, not the one that carries credentials). Agent IDs thus can’t contain colons, so they would not be URLs, but that was the plan anyway; hostnames are more likely and periods are OK. I did not see any length limitations inherent to basic-auth.
The agent would under some conditions understand cookies and be able to store and return one.
The basis of the request webflows then:
PRC gets created per usual as a pre-step by the flow, which then branches back into shared parent flow actions.
Extract the Authorization header based on known schemes, possibly just Basic and perhaps “something custom” at this point. I suspect custom types would present a challenge for some client libraries, so it may be simpler to just leverage Basic with a dummy password if needed.
Populate AgentRequestContext under PRC by resolving Agent based on the ID from the basic-auth header. For now, that’s our primary ID signal. Failure to resolve an Agent obviously fails.
Check IP address of request against Agent’s allowed ranges.
Check for Agent setting to determine requirement to authenticate (allowing for localhost deployments without the overhead).
If needed, check for a record of previous cached authentication in the Java session. This contains the agent ID, client address, and the expiration.
If the record is there and valid, the request is accepted for processing.
If not, any existing record is cleared, and then the credentials from the basic-auth header have to be run through a CredentialValidator chain and the resulting Java Subject would carry the agent ID in a UsernamePrincipal, which is to be compared to the incoming value, and then the request is accepted for processing.
Now we parse the body based on Content-Type to obtain the DDF, which is stored in the inbound MessageContext as the message.
Most incoming requests would carry an application ID in a standard place, so this would be resolved against the Agent to obtain the Application to store in the context (or fail the request).
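The steps above can be sketched roughly as follows. This is Python for brevity, not the webflow implementation; the dictionaries stand in for the Java session and the DDF message, a simple secret comparison stands in for the CredentialValidator chain, and the 300-second cache lifetime is an arbitrary assumption. The sketch also requires an application ID on every request, which the real flow may not:

```python
import base64
import hmac
import ipaddress
import time
from dataclasses import dataclass, field

CACHE_SECONDS = 300  # hypothetical lifetime of a cached authentication record

class RequestRejected(Exception):
    """Raised whenever a step in the flow fails the request."""

@dataclass
class Agent:
    agent_id: str
    secret: str
    allowed_ranges: list
    require_auth: bool = True
    applications: dict = field(default_factory=dict)

def parse_basic(header_value):
    """Split a Basic credential into (agent ID, secret)."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise RequestRejected("unsupported scheme")
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError as e:
        raise RequestRejected("malformed credentials") from e
    agent_id, sep, secret = decoded.partition(":")  # hence no colons in IDs
    if not sep or not agent_id:
        raise RequestRejected("malformed credentials")
    return agent_id, secret

def validate_credentials(agent, secret):
    """Stand-in for the validator chain; returns the principal name."""
    if not hmac.compare_digest(agent.secret.encode(), secret.encode()):
        raise RequestRejected("bad credentials")
    return agent.agent_id

def accept_request(agents, auth_header, client_ip, session, message, now=None):
    now = time.time() if now is None else now
    agent_id, secret = parse_basic(auth_header)
    agent = agents.get(agent_id)
    if agent is None:
        raise RequestRejected("unknown agent")
    address = ipaddress.ip_address(client_ip)
    if not any(address in ipaddress.ip_network(r) for r in agent.allowed_ranges):
        raise RequestRejected("address not allowed")
    if agent.require_auth:
        record = session.get("agent_auth")
        valid = (record is not None and record["agent_id"] == agent_id
                 and record["address"] == client_ip and record["expires"] > now)
        if not valid:
            session.pop("agent_auth", None)  # clear any stale record first
            if validate_credentials(agent, secret) != agent_id:
                raise RequestRejected("principal mismatch")
            session["agent_auth"] = {"agent_id": agent_id, "address": client_ip,
                                     "expires": now + CACHE_SECONDS}
    application = agent.applications.get(message.get("application"))
    if application is None:
        raise RequestRejected("unknown application")
    return agent, application
```

Note that while a cached record is valid the secret is not rechecked at all, which is the intended saving but also means the cookie carrying the Java session becomes the effective credential for that window.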
Some kind of MAC over parts of the request would be nice. OAuth 1.0 defines that, but it’s likely a bit overkill; OTOH it would at least be somewhat “known” as an approach. They define a way to carry all that in an Authorization header, so that may be attractive, but I don’t know how clean that would be for agents. Notably it can’t sign non-form bodies, and I’m not sure I want to worry about canonical form to make signing the body work.
The real value, I think, is that a MAC covering the URL path and the host, and possibly something from the TLS certificate, would achieve a weak kind of channel binding: if the server trusts that the agent won’t create and send a signed blob unless it has connected to a trusted server, the MAC is a form of guarantee that there was no MITM. It’s not full-on channel binding because it relies on that assumption about agent behavior rather than on a shared value from the TLS session.
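A minimal sketch of such a MAC, assuming HMAC-SHA256 keyed with the agent’s shared secret and a certificate fingerprint as the “something from the TLS cert”. None of this is a defined format; the field order and length-prefixing are choices made for the example, and Python is used only for brevity:

```python
import hashlib
import hmac

def request_mac(secret, host, path, cert_fingerprint):
    """HMAC over host, path, and the server certificate fingerprint."""
    parts = [host.lower().encode(), path.encode(), cert_fingerprint]
    # Length-prefix each field so field boundaries cannot be shifted.
    message = b"".join(len(p).to_bytes(4, "big") + p for p in parts)
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_mac(secret, host, path, cert_fingerprint, presented):
    """Hub side: recompute from its own view of the request and compare."""
    expected = request_mac(secret, host, path, cert_fingerprint)
    return hmac.compare_digest(expected, presented)

secret = b"shared-secret"
# Stand-in for a SHA-256 digest of the server certificate the agent saw.
fingerprint = hashlib.sha256(b"server certificate bytes").digest()
mac = request_mac(secret, "hub.example.org", "/agent/login", fingerprint)
```

The hub would verify against its own host name, the path it received, and its own certificate, so a value computed by an agent talking to some other endpoint would not validate.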
Indirecting HttpServletRequest/Response
To reuse a lot of the current code, particularly from OpenSAML, we have to have a means of indirecting access to the servlet layer into wrapper objects that source request data from and write response data to the DDF objects passed in from (or sent back to) the agent. This is the “tunnelling” aspect of the SP today, in which facades for HTTP request and response logic get exposed on top of the DDF objects so that the code is relatively understandable but the data is being remoted across a socket interface. The difference here is the use of HTTP to remote the data but in other respects it’s similar.
The challenge here is that the DDF objects backing the wrapped interfaces are part of the request context tree/state, but the getHttpServletRequest/Response methods on the various profile and messaging APIs are parameterless. That was a mistake, but it was necessary for getting singletons to be able to access servlet interfaces since there wouldn’t be any request state to pass into APIs that are that deep in the system without every method up the stack taking a servlet interface.
Without changing the APIs up at the action/handler layer, we need a way to allow a new pair of special Suppliers to be given access to the request state in order to walk down the tree to where the DDF objects will be to wrap them. The original design I tried didn’t work, so I had to fall back to more thread-locals. A different pair of thread-local-backed suppliers is required, but it turns out we only need the tunnelling semantic in relatively few places because we don’t actually interact with the servlet layer all that much, at least in my work so far. I was able to get message encoding to work by embedding the setup of the right thread-local objects in a try/finally block around the use of the MessageEncoder, and I think decoding should work similarly.
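The shape of that try/finally arrangement, sketched in Python rather than Java and with invented names throughout: parameterless suppliers read from thread-local state, and a wrapper binds facades over the DDF-like dictionaries only for the duration of the encoding call.

```python
import threading

_current = threading.local()

class DDFRequestFacade:
    """Request-like view over a DDF-style dictionary from the agent."""
    def __init__(self, ddf):
        self._ddf = ddf

    def get_method(self):
        return self._ddf["method"]

class DDFResponseFacade:
    """Response-like view that writes into the DDF going back to the agent."""
    def __init__(self, ddf):
        self._ddf = ddf

    def set_header(self, name, value):
        self._ddf.setdefault("headers", {})[name] = value

def request_supplier():
    """Parameterless on purpose, like the existing accessor methods."""
    try:
        return _current.request
    except AttributeError:
        raise RuntimeError("no tunnelled request bound to this thread") from None

def response_supplier():
    try:
        return _current.response
    except AttributeError:
        raise RuntimeError("no tunnelled response bound to this thread") from None

def encode_redirect(url):
    """Stand-in for a MessageEncoder deep in the stack with no request state."""
    response_supplier().set_header("Location", url)

def with_tunnel(input_ddf, output_ddf, operation, *args):
    """Bind the facades, run the operation, and always unbind afterwards."""
    _current.request = DDFRequestFacade(input_ddf)
    _current.response = DDFResponseFacade(output_ddf)
    try:
        return operation(*args)
    finally:
        del _current.request
        del _current.response

output = {}
with_tunnel({"method": "GET"}, output, encode_redirect, "https://idp.example.org/sso")
```

This version does not handle nested tunnels on one thread; it would need to save and restore the previous values if that ever comes up.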
We’ll likely hit more complex cases but I’ll cross that bridge later.
Session Thoughts
My inclination is to look at confining “formal” session management to the agents and not the hub. This makes the hub more stateless. It does not preclude using the Java StorageService as a back-end because we can remote the various CRUD operations implemented by the storage API within the agent remoting protocol (already done). Thus, where applicable an agent could implement a Java StorageService-backed session cache if it wanted and our initial set probably will. That will move us to Java for JDBC and off the clearly non-viable-for-Linux ODBC option.
In either case, the data being managed needs to be split between the data the agent has to understand and use and the data that could be left opaque. An example of the latter would be NameID matching information for SAML logout. The opaque data could be attached and then passed back to the hub during operations that might require it and then reparsed there for use.
So I envision the hub knowing about a subset of session data as objects to parse and understand but not actually implementing the prototypical cache of them natively.
For the cache itself, I am starting to lean towards hoping all the other Apache modules around are right and using the filesystem is “good enough” as a cross-process solution. If things are simplified to that degree, I think it may also be possible to relatively easily implement cross-server session migration by implementing an agent to agent call to fetch a session based on the cookie, then creating the copy of the file locally as needed. We could hopefully touch the files occasionally to get a coarse-grained timeout, which we already decided would be the right approach vs. the “update it on every request” fully consistent timeout behavior we have now. Just doesn’t need to be that exact.
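A rough sketch of the file-per-session idea with a coarse, touch-based timeout, in Python for brevity. The JSON layout, the split into “agent” and “opaque” parts, and the default intervals are assumptions made for the example only:

```python
import json
import os
import tempfile
import time
from pathlib import Path

class FileSessionCache:
    """One file per session; the file's mtime doubles as last-use time."""

    def __init__(self, directory, timeout=3600, touch_interval=300):
        self.directory = Path(directory)
        self.timeout = timeout
        self.touch_interval = touch_interval

    def _path(self, key):
        if not key.isalnum():  # keys come from cookies, so no path tricks
            raise ValueError("invalid session key")
        return self.directory / key

    def store(self, key, agent_data, opaque):
        path = self._path(key)
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps({"agent": agent_data, "opaque": opaque}))
        os.replace(tmp, path)  # atomic, so readers never see a partial file

    def load(self, key, now=None):
        now = time.time() if now is None else now
        path = self._path(key)
        try:
            idle = now - path.stat().st_mtime
            if idle > self.timeout:
                path.unlink()
                return None
            if idle > self.touch_interval:
                os.utime(path, (now, now))  # occasional touch, not per request
            return json.loads(path.read_text())
        except FileNotFoundError:
            return None  # never existed, or another process removed it

cache = FileSessionCache(tempfile.mkdtemp())
# The opaque part is never parsed by the agent, only handed back to the hub.
cache.store("abc123", {"eppn": ["jdoe@example.org"]}, "opaque-blob-for-hub")
session = cache.load("abc123")
```

Cross-server migration would then amount to fetching the same JSON from a peer agent and calling store locally.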
Error Handling
Almost certainly we will drop all of the custom error handling in the SP in favor of web server native approaches. The SP would return error codes and where possible set internal variables up to carry extended data but customizing the pages would be up to Apache/IIS config, and whatever other agents do in the future. The existing machinery is overblown, barely used properly, and a source of a lot of bugs.
To keep the redirection option intact and to avoid as much code change as possible, I’m thinking that we might keep some of the custom exception layer and move some of the parameter tracking that was handled as a hack with the TemplateEngine natively into the exception base class. We could maybe try to carry everything relevant on the exceptions so we can surface it in the sendError routines and do whatever we’re going to do with it all from there. Maybe also track an intended status code and possibly make some of the code values configurable somehow, not sure how yet.
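Something like the following is what I have in mind, sketched in Python rather than the agent’s own language. The class names, the property names, and the return shape of send_error are all hypothetical; the only point is that the exception carries its own parameters and intended status, and one routine decides between a redirect and a native error:

```python
from urllib.parse import urlencode

class AgentException(Exception):
    """Base class carrying extended data and an intended status code."""
    status_code = 500

    def __init__(self, message, status_code=None, **properties):
        super().__init__(message)
        if status_code is not None:
            self.status_code = status_code
        self.properties = dict(properties)

class SessionException(AgentException):
    """Hypothetical subclass with its own default status."""
    status_code = 403

def send_error(exc, status_overrides=None, redirect_url=None):
    """Return (status, headers, variables) for the web server to act on."""
    details = {"errorType": type(exc).__name__, "errorText": str(exc)}
    details.update(exc.properties)
    if redirect_url:
        # Redirection option: pass the details along as query parameters.
        return 302, {"Location": redirect_url + "?" + urlencode(details)}, {}
    # Otherwise hand a status and variables to the server's native handling.
    status = (status_overrides or {}).get(type(exc).__name__, exc.status_code)
    return status, {}, details

error = SessionException("session expired", entityID="https://idp.example.org")
native = send_error(error)
remapped = send_error(error, status_overrides={"SessionException": 401})
redirected = send_error(error, redirect_url="https://service.example.org/error")
```

The status_overrides mapping is one possible shape for “making some of the code values configurable”, keyed by exception type.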
Regular Expressions
It appears C++11 has a native regular expression API, with a number of different dialects, but per usual, they botched it: the implementation is templated, set in stone for ABI reasons, and so slow it’s viewed as useless.
Boost has an old implementation that is header-only on C++11 (all we need) and is similar to the STL’s but much, much faster.
It’s plausible we could abstract it enough to allow the build to select one of those options at compile time and leave some wiggle room. Both support POSIX and Extended POSIX, making things relatively consistent across them both.
Notably, Xerces/XML’s support was based on complete matching, which is not what Apache does and likely not as suited to our use cases as partial. We may need to support both with a runtime option for compatibility.
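The difference between the two semantics is easy to show. This uses Python’s re module purely as an illustration (its dialect is not POSIX); in C++ the analogous pair would be std::regex_match versus std::regex_search. The complete flag is a hypothetical stand-in for the runtime compatibility option:

```python
import re

def matches(pattern, value, complete=False):
    """complete=True requires the whole value to match, as Xerces did;
    complete=False accepts a match anywhere in the value."""
    compiled = re.compile(pattern)
    if complete:
        return compiled.fullmatch(value) is not None
    return compiled.search(value) is not None

partial = matches("/secure", "/app/secure/page")
full = matches("/secure", "/app/secure/page", complete=True)
anchored = matches("/secure(/.*)?", "/secure/page", complete=True)
```

An existing rule written for complete matching generally still works under partial matching but may match more than intended, which is why a compatibility switch seems necessary rather than just changing the default.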
Identity Protocol Support and Chaining
I envision adapting some of the existing notions of “chaining” of handlers into the hub somehow, as a means of stacking up protocols and allowing the agent to be oblivious to which protocol is used. Rather than implement this in an agent-aware way, as with the SessionInitiator chains in the SP today, there would instead be a package of options that the agent would need to be aware of to pick up at runtime from the RequestMap or from parameters, and then pass those in a “new login” request to the hub. The hub’s job would be to manufacture the appropriate response to the agent’s client to get the client over to an IdP with a suitable request message.
The SessionInitiator endpoint would be a master webflow that relies on subflows to initiate requests, allowing pluggability of new protocols, so this needs to be dynamically extensible out of the box. The hub deployer would probably control the precedence of these subflows being tried, but perhaps the agent could have a setting allowing it to be overridden.
Since each subflow would operate against the same input message, they would be able to pick and choose which options and settings to look for in the request to alter behavior, along with the usual relying-party- or metadata-driven approaches we have now.
A similar model could be used on the response side of this, and contrary to my original intention, I think we want to move towards having the agents manage the handlers/paths that they process and what functionality in the hub they map to. That allows the agent to construct the “right” self-referential URLs for response handling and just supply them to the hub to bake into requests (as in SAML where the ACS URL is typically in the request).
The advantage is that we would finally get to a model where a single response URL could be used for all protocols and bindings, and the hub can chain its response processor flows together so they can probe the buffered data to see if it’s applicable (e.g., HTTP method, parameters, etc.), allowing one endpoint to handle SAML and OIDC and all bindings.
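A small sketch of the probing idea on the response side, in Python for brevity. The processor names and the dictionary standing in for the buffered request are invented; the parameter names checked (SAMLResponse, SAMLart, code, state) are the ones those protocols actually use:

```python
def probe_saml_post(request):
    return request["method"] == "POST" and "SAMLResponse" in request["parameters"]

def probe_saml_artifact(request):
    return "SAMLart" in request["parameters"]

def probe_oidc_code(request):
    params = request["parameters"]
    return "code" in params and "state" in params

# Order reflects hub-controlled precedence; names are hypothetical.
PROCESSORS = [
    ("saml2-post", probe_saml_post),
    ("saml2-artifact", probe_saml_artifact),
    ("oidc-code", probe_oidc_code),
]

def select_processor(request, processors=PROCESSORS):
    """Return the first processor whose probe accepts the buffered request."""
    for name, probe in processors:
        if probe(request):
            return name
    raise LookupError("no response processor recognized the message")

saml = select_processor({"method": "POST",
                         "parameters": {"SAMLResponse": "...", "RelayState": "x"}})
oidc = select_processor({"method": "GET",
                         "parameters": {"code": "abc", "state": "xyz"}})
```

Because every probe sees the same buffered data, adding a protocol means adding an entry to the list rather than a new endpoint on the agent.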