DesignNotes

DRAFT - A lot of this was early thinking, some still valid, some superseded by design and implementation work that’s already done.

1 Background
2 High Level Design
3 Configuration
4 Remoting
5 Eliminated Components
6 Handlers
7 Session Cache
8 Library Content
9 Other Thoughts

Background

To better inform discussion about a next generation SP design, this is a high level summary of the current design, components, and some of the key decisions that have to be made if we go at this again.

To reiterate some of the main problems with the current design:

The volume of C/C++ code needs to be greatly reduced to be anything close to sustainable with the skill sets available today from people not near retirement.
Reliance on the currently used XML libraries has proven to be too big a risk going forward because they are essentially unmaintained. Alternatives may exist if a rewrite were done, but wouldn't address the first problem (the volume of C/C++ code).
The footprint is not amenable to a lot of "modern" approaches to web application deployment.
Packaging has been a significant effort, partly due to the sheer number of different libraries involved, and shipping lots of libraries on Windows means lots of maintenance releases.

That said, there are some key requirements we probably have some consensus on that a new design would have to maintain:

Support for Apache 2.4 and IIS 7+, ideally in a form that leads more easily to other options in the future.
Support for the current very general integration strategy for applications that relies on server variables or headers.
Some relatively straightforward options for clustering.

The new design being considered would offload the majority of "not every request" processing to a "hub" that would be implemented in Java on top of the existing code base used by the IdP, including its SAML support, XML Security, metadata, etc. Long term it could potentially expand to more generic support for other protocols that fit the same general pattern. The intention would be to maintain at least some degree of scalability by ensuring that post-login "business as usual" requests could be handled without the hub as much as possible, which is where a lot of the interesting design decisions lie.

It is one of the basic design questions as to whether the agent and hub would be a 1:1 relationship akin to the (intended) use of "shibd" today, or a more shared model of a single hub supporting many agents. The former has more potential for a "drop-in" replacement of the existing software, while the latter offers more opportunity for new design thinking and changes.

I have embedded a number of editorial comments and opinions in blockquotes (there are lots of opinions throughout, but these are more personal musings).

High Level Design

The SP is a set of web server modules that sit on top of a number of libraries created for the project, along with a large number of dependencies, many of which are rarely used by other projects. The modules are supplemented by a stand-alone daemon process that people gravitate to, thinking it is the SP, when in fact it's really a processing assistant and a state management tool for the work being done by the modules. In fact, the internal design is such that it's conceptually possible to build a version of the SP in a single binary that has no separate process, but that would create state management problems for sessions and would add significant symbol contamination such that conflicts inside Apache would be more common on some platforms because of how shared libraries work on most non-Windows systems.

So, in practice the system is split into those two pieces but they share a common configuration and the code on both sides of the boundary actually comes from the same set of libraries, built in two separate ways so that different code is included (or more accurately, one half contains everything and the other "lite" half contains much less code). At runtime the libraries are initialized with feature signals that inform the code which components to create and make available, and that includes "InProcess" and "OutOfProcess" designations that tell the code which half of the system it's in. Much of the functional code therefore contains conditional sections that evaluate at runtime how to behave based on these signals.
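
The feature-signal idea can be sketched roughly as follows. This is a minimal illustration, not the real initialization API: the flag names, the component names, and build_components are all made up for the example, and only the "InProcess"/"OutOfProcess" split comes from the design itself.

```python
from enum import Flag, auto

class Feature(Flag):
    # Hypothetical flag names modelled on the designations described above.
    IN_PROCESS = auto()
    OUT_OF_PROCESS = auto()
    HANDLERS = auto()
    METADATA = auto()

def build_components(features):
    """Return the names of the components this half of the system would create."""
    components = ["configuration", "logging"]  # both halves need these
    if Feature.HANDLERS in features:
        components.append("handlers")
    if Feature.OUT_OF_PROCESS in features:
        # Only the daemon half carries the heavyweight security code.
        components.append("saml_processing")
        if Feature.METADATA in features:
            components.append("metadata")
    if Feature.IN_PROCESS in features:
        # The module half reaches the daemon through the remoting client.
        components.append("remoting_client")
    return components

module_side = build_components(Feature.IN_PROCESS | Feature.HANDLERS)
daemon_side = build_components(
    Feature.OUT_OF_PROCESS | Feature.HANDLERS | Feature.METADATA
)
```

The same function runs on both sides, which is the point: the behavior is chosen at runtime from the signals rather than by compiling two unrelated code bases.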

The SP modules and "shibd" executable wrapper sit on top of the "libshibsp" library, which in turn sits on top of the "libsaml" and "libxmltooling" libraries, the latter two conventionally considered to make up the C++ version of OpenSAML. There's a substantial amount of code in the modules because that's where all the web server touchpoints are (Apache in particular requires a lot, though some for supporting Apache < 2.4), but the "shibd" executable is really just a thin wrapper that does a little bit of setup work and then starts a "Listener" plugin, the server half of the remoting layer that supports IPC between the two halves. The listener runs a socket select loop waiting for connections and starting job threads to handle them, all of which is handled by the "libshibsp" library itself and all its dependencies.

Notably, only the "shibd" process is linked to "libsaml" and the full version of "libxmltooling" and on to "libxml-security-c", "libcurl", and OpenSSL. Thus, all the security work, SAML processing, signature/encryption code, backchannel communication, etc. is all handled by the portions of library code linked to "shibd". The web server modules don't link directly to those libraries, and instead make remote calls to "shibd" to process messages. Both halves are linked to the Xerces XML parser and to a logging library. The XML linkage allows the module to support the XML-based configuration and to do some kinds of XML processing in isolated cases.

It doesn't matter all that much for discussion purposes which part of the job all the different libraries do other than to say that in the end, most of this code has to somehow be replaced by "something" in order to substantially reduce the native code footprint.

Like most modern code bases, there's a substantial amount of modular design, with C++ used to define abstract component interfaces that are implemented by supporting components. There is in fact a lot of boilerplate code in "libxmltooling" that provides for both portable dynamic loading of extensions and for a plugin management layer that allows components to register themselves with a plugin manager of a given type, and then allows the configuration to instantiate a plugin of a particular type at runtime. This plugin model extends to a range of components, both very large (whole "service" abstractions) and small (individual handlers that know how to process requests of a certain sort). With a few exceptions for the "mainline" processing of requests, XMLObject wrappers, and the touchpoints with Apache and IIS, most of the objects in the code tend to be plugins of one sort or another.
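
For readers who haven't seen the pattern, a plugin manager of this sort boils down to a registry of factories keyed by a type name taken from the configuration. The sketch below is an assumption-laden miniature: PluginManager, MemoryStorage, the "Memory" type name, and the cleanupInterval setting are illustrative names, not the actual libxmltooling interfaces.

```python
class PluginManager:
    """Maps a type name from the configuration to a factory callable."""

    def __init__(self):
        self._factories = {}

    def register(self, type_name, factory):
        # Components call this at load time to make themselves available.
        self._factories[type_name] = factory

    def new_plugin(self, type_name, config):
        # The configuration layer calls this to instantiate by name.
        try:
            factory = self._factories[type_name]
        except KeyError:
            raise LookupError(f"no plugin registered for type {type_name!r}")
        return factory(config)

class MemoryStorage:
    """Hypothetical plugin; the setting name is made up for the example."""

    def __init__(self, config):
        self.cleanup_interval = config.get("cleanupInterval", 900)

storage_plugins = PluginManager()
storage_plugins.register("Memory", MemoryStorage)
storage = storage_plugins.new_plugin("Memory", {"cleanupInterval": 60})
```

One manager exists per plugin type, so a storage plugin and a handler can share a type name without colliding.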

Configuration

The configuration is split between XML files and a bridge back to Apache (for that server module obviously). IIS initially had no practical support (known to me) for module configuration "integration". It does now, but it is quite horrendously designed and is also XML, so things that we might be able to avoid using XML for would probably end up reverting to XML and a lot of mechanisms that most deployers of IIS have never touched. It would be a major project to add full GUI support for extending the IIS configuration, though in a perfect world that would be the best outcome for a lot of people.

The configuration object is largely a big "scaffold" built by a couple of large, complex classes that store off a lot of information and build lots of plugins that do other work and hang them off the configuration class. It's basically the outside-in approach that dependency injection avoids in other systems. The noteworthy thing is that once this is built, apart from reloading changed configuration, the object tree is (with one recent exception) fully "processed" up front rather than constantly parsed in real time.

It should be possible (whether it's practical or not...) to actually pass a configuration file to a remote service to process and return some kind of processed result. Representing the result is clearly not simple, but would potentially allow some amount of temporary compatibility and maybe long term avoidance of XML parsing even if the format remains XML. It's attractive to consider the idea that the entire SP configuration could stay "local" for agility and local control and avoid any part of it being held by a central hub, even if the hub does the work to parse it.
Even if the hub stays local, there are reasons to consider how we might split the configuration apart, probably by dropping a lot of the current stuff on the floor and migrating to Spring-based configuration of the hub, likely by just reusing existing IdP formats for metadata, the attribute resolver, etc. We would hope, I think, to support but largely avoid the need for native Spring configuration, since that's much more onerous for every SP deployer to deal with. I think most of the core features we'd need from the current code are all primarily configured with custom XML syntax anyway.
Thus, a key question is how much of the low-level SAML details to leave to this configuration and how much to move to the hub. While compatibility might dictate parsing and somehow supporting a lot of those details, it's not clear this is hugely beneficial. In most deployments, either the same people would be operating and supporting both the SP and the hub, or it's just as likely that the people deploying the SP tend to have little or no SAML experience or real understanding of these options. Simply redoing much of those settings in a new hub-centric way and ignoring much of the legacy SP configuration related to SAML might make a lot more sense and require less code long term.
This in turn leads to the idea that the hub might be deployable via a pre-packaged container created and delivered by organizations for their SPs to use. "Local to the agent" doesn't mean that it has to be packaged and delivered with it.

Reloading of configurations is currently the responsibility of each component itself to manage. Typically these classes inherit from a common base class that mediates this with some complex locking semantics to allow for shared read access but single writer access to swap in new implementations of the underlying service. This allows reloads to be triggered in the background, with all processing done "out of band" of requests and then swapped in briefly with a pointer swap under the exclusive lock.
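
The essence of the reload pattern is "do the slow work outside the lock, then swap a reference under it". A minimal sketch follows, with one loud substitution: Python's standard library has no shared/exclusive lock, so a plain mutex stands in for the reader/writer semantics described above. Reloadable and its loader argument are hypothetical names.

```python
import threading

class Reloadable:
    """Holds the current implementation and swaps in a new one on reload."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()  # stand-in for a shared/exclusive lock
        self._current = loader()

    def get(self):
        # Readers only hold the lock long enough to copy the reference.
        with self._lock:
            return self._current

    def reload(self):
        fresh = self._loader()  # slow parsing happens "out of band"
        with self._lock:
            self._current = fresh  # the swap itself is brief

# Simulated loader that returns a new "parsed configuration" each time.
versions = iter([{"version": 1}, {"version": 2}])
service = Reloadable(lambda: next(versions))
before = service.get()
service.reload()
after = service.get()
```

A request that fetched the old object before the swap keeps using it safely, which is the property the real locking code works so hard to preserve in C++ where object lifetime is manual.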

Ideally I'd like all that code to go because it's extremely complex, but it may prove difficult to get rid of entirely, though I would hope/expect that very few configuration files would be directly parsed by the SP. Possibly the core of the code might remain but with less of the existing file management logic. As above, it would be hoped that much of the code the IdP uses now for things like metadata would simply be used instead.

Remoting

The current remoting layer between the server modules and "shibd" started life as a true ONC RPC interface but was replaced in V2 by a simple (and still unsecured) socket protocol with length-prefixed messages. There are plugins for both Unix domain and TCP sockets. The messages are actually passed as XML (this is an internal detail and not exposed to the rest of the code) and are serialized trees that are pretty similar to JSON in flavor (simple values of different types, arrays, and structures of named members), but predate JSON by many years.
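
Length-prefixed framing is simple enough to show in a few lines. The sketch below makes two assumptions that are not taken from the actual wire format: the prefix is a 4-byte big-endian integer, and the tree is serialized as JSON rather than the XML the real code uses.

```python
import json
import socket
import struct

def send_message(sock, tree):
    """Serialize a tree and send it with a 4-byte length prefix (assumed width)."""
    payload = json.dumps(tree).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, count):
    """Read exactly count bytes, since recv may return fewer than requested."""
    chunks = []
    while count:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the socket mid-message")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)

def recv_message(sock):
    """Read the prefix, then exactly that many bytes, and rebuild the tree."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))

# A connected socket pair stands in for the module/daemon connection.
client, server = socket.socketpair()
sent = {"operation": "ping", "input": {"values": [1, "two", True]}}
send_message(client, sent)
received = recv_message(server)
client.close()
server.close()
```

The receive loop matters more than it looks: without it, a large message split across several reads would be silently truncated.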

The programming interface to this is a bit messy (it was ported almost 25 years ago from C to C++ by a very inexperienced C++ programmer, and has some "odd" object reference vs. value semantics), but functionally it provides a very easy way to construct trees of data dynamically, and pass the trees across the process boundary. A structure convention of operation, input, and output is used to represent a remoted call. The tree is mutated by the call, and the caller and callee just create, manipulate, and read the tree to exchange data and signal work and results.

There is no actual remote "API" in a documented sense. Instead, components typically exist in runtime form in both the calling and callee processes and a component registers the message names it listens for at an "address" based on the type of component and on context supplied from the configuration when it builds the components to keep addresses unique (e.g., if there are two identical handlers living at different paths, the path is supplied from outside via the constructor to make the addresses unique). The code is thus "talking to itself" across the boundary and the client and server logic is usually in the same code unit, so changes to the interface are made "all at once" in a new version.
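
The "talking to itself" arrangement, combined with the operation/input/output convention, looks roughly like this in miniature. Everything named here is hypothetical (Listener, EchoHandler, the "path::type" address format), and the dispatch is an in-memory call standing in for the trip across the process boundary.

```python
class Listener:
    """Routes a message tree to whichever component registered its address."""

    def __init__(self):
        self._receivers = {}

    def register(self, address, receiver):
        if address in self._receivers:
            raise ValueError(f"address already in use: {address}")
        self._receivers[address] = receiver

    def dispatch(self, message):
        # In the real design this hop crosses the process boundary.
        self._receivers[message["operation"]](message)
        return message

class EchoHandler:
    """Client and server halves live in the same class."""

    def __init__(self, listener, path):
        # The path comes from the configuration and keeps the address unique.
        self.address = f"{path}::EchoHandler"
        listener.register(self.address, self.receive)

    def call(self, listener, text):
        # Client half: build the tree, send it, read the mutated output.
        message = {"operation": self.address, "input": {"text": text}, "output": {}}
        return listener.dispatch(message)["output"]

    def receive(self, message):
        # Server half: mutate the tree in place to return results.
        message["output"]["echo"] = message["input"]["text"].upper()

listener = Listener()
first = EchoHandler(listener, "/Echo")
second = EchoHandler(listener, "/Echo2")
result = second.call(listener, "ok")
```

Because both halves sit in one class, changing the shape of the tree is a single-file change, which is why no documented remote API ever had to exist.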

This design allows for any portion of a component's work to be farmed out to "shibd" by splitting it into client and server portions, and some creative class designs allow this to be modeled in ways that allow the code to "collapse" back into one process and cut out the remoting, though this has never really been used.

The handlers tend to be good examples of this pattern, so the classes in cpp-sp/shibsp/handler/impl are good samples for what this looks like. Notably, and particularly with handlers, there are base classes that provide support for actually remoting the bulk of an HTTP request and response so that it's possible to pass an entire request, cookies, headers, etc. to "shibd" for work and a response to be returned and then mapped back into Apache or IIS API calls to issue a response, without the calling half knowing exactly what's being done.

This has pretty clear applicability to doing the same kind of thing to offload entire processing steps to a remote service. For example, a SAML ACS handler can effectively implement itself without knowing anything about SAML, the parameter names, etc.
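
To make that concrete, here is a rough sketch of a handler "shell" that forwards a flattened request and maps the answer back without knowing the protocol. The function names, the tree layout, the "/acs" path, and the stand-in hub logic are all assumptions for illustration; the real base classes carry far more detail.

```python
def remote_request(method, path, headers, body):
    """Flatten an HTTP request into a plain tree that can cross the boundary."""
    return {"method": method, "path": path, "headers": dict(headers), "body": body}

def hub_process(request):
    """Stand-in for the remote side; only this half knows the protocol details."""
    if request["method"] == "POST" and request["body"].startswith("SAMLResponse="):
        return {"status": 302, "headers": {"Location": "/secure"}, "body": ""}
    return {"status": 400, "headers": {}, "body": "unrecognized message"}

def generic_handler(method, path, headers, body, forward=hub_process):
    """Agent half: forward everything, then map the result back to the server."""
    response = forward(remote_request(method, path, headers, body))
    # A real module would translate this into Apache or IIS API calls.
    return response["status"], response["headers"], response["body"]

status, out_headers, _ = generic_handler(
    "POST",
    "/acs",  # hypothetical endpoint path
    {"Content-Type": "application/x-www-form-urlencoded"},
    "SAMLResponse=abc",
)
bad_status, _, bad_body = generic_handler("GET", "/acs", {}, "")
```

Note that generic_handler never mentions SAML; the parameter name appears only in the stand-in for the remote side.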

A note about threads: the calling code uses a client socket pool to reuse connections. The server code has a very primitive approach: a bound socket is passed off to a job thread that runs for the life of the socket, and terminates if the socket is closed or stops working. The job threads are not pooled or limited, because the intent was that jobs would be completed fast enough to complete work and free up the connection for a client thread in the caller to use it again, limiting the number of job threads needed. This works well except for Apache pre-fork mode where hundreds of child processes exist, each one connecting and starting its own thread with a gigantic default stack size. This is why the SP doesn't work with pre-fork mode under load, which mattered way longer than it should have, and is ridiculous to use now.

If the hub were built into a shared web service, then the threading issues should be moot. Security would probably have to come from TLS and that's still a dependency challenge I don't have a good answer for. It may end up being the case that tunneling naked traffic over stunnel could be the best solution, but obviously adds moving parts.
Alternatively, a more direct conversion of the current design would inherit pros and cons of the existing approach unless more work was done to redesign aspects of the code, particularly on the server side. The obvious huge win of treating the hub as "local" is to continue to punt on security, and if it were needed by fewer deployers something like stunnel might be a more reasonable "if you need it..." suggestion.

Eliminated Components

There's a lengthy list of components that should be possible to eliminate from the C++ code by offloading the work:

All SAML specifics, including metadata, message handling, policy rules, artifact resolution. I wouldn't anticipate a single SAML reference in the code, ideally, other than perhaps paths for compatibility.
SAML Attribute processing, extraction, decoding, filtering, resolution of additional data, etc.

I would think one big win would be leveraging the IdP's AttributeRegistry, AttributeResolver (and filter engine of course) to do all this work for us, and in particular to instantly add the ability to supplement incoming data with database, LDAP, web service, etc. lookups and transformations, returning all of the results to the SP agent. For some deployments, that alone may be enough to incentivize converting to this approach, though it's possible that many of those deployments would (and have) just proxied SSO already anyway.

Credential handling and trust engines
SOAP client
Protocol and security policy "providers" that supply a lot of low-level configuration details
Replay cache

Most of this code would be either unnecessary to a redesign or already implemented in Java, modulo that configuring it may be very different or would have to be wired up in code based on the existing SP configuration syntax (if viewed as absolutely needed).

Handlers

Like Apache or IIS, there's a "handler" concept that acts as a plugin point for any code that has to fully process requests and issue responses to do work for the SP, as opposed to the more "filter"-oriented parts that operate during normal web server request handling to enforce sessions, export attributes, do access control, etc. Handlers obviously work portably across any SP-supporting environment.

It's very unlikely given logout and other requirements that the actual concept of a handler goes away. The downside of that fact is that it could leave behind a decent amount of supporting code to allow for pluggability since all of that is pretty cross-functional and is kind of the price of supporting extensibility.

I wonder how realistic it may be to just implement a generic handler "shell" that can forward request/response messaging to the hub, living at whatever path(s) are required, but leave all of the semantics to the hub. That seems more likely to be tractable with a local-ish hub simply because it might allow for handlers that have to manipulate sessions (i.e., logout).

SessionInitiators

These start the process of establishing sessions by either issuing login requests to an IdP or by doing discovery to identify the IdP. They typically ran in "chains" that allowed discovery to run if needed or drop through to protocol-specific handlers that issued requests, allowing different SAML versions and WS-Federation to co-exist in whatever precedence was desired.
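
The chaining behavior is a plain chain of responsibility: each initiator either handles the request or declines so the next one can try. In the sketch below the function names, the request fields, and the "redirect:..." result strings are invented for the example.

```python
def discovery_initiator(request):
    # Runs discovery only when no IdP has been chosen yet.
    if request.get("entityID"):
        return None  # decline, so the chain drops through
    return "redirect:discovery"

def saml2_initiator(request):
    if "saml2" in request.get("idp_protocols", ()):
        return f"redirect:saml2:{request['entityID']}"
    return None

def wsfed_initiator(request):
    if "wsfed" in request.get("idp_protocols", ()):
        return f"redirect:wsfed:{request['entityID']}"
    return None

def run_chain(chain, request):
    """Try each initiator in configured order; the first non-None result wins."""
    for initiator in chain:
        result = initiator(request)
        if result is not None:
            return result
    raise RuntimeError("no session initiator could handle the request")

# The order of the list is the precedence.
chain = [discovery_initiator, saml2_initiator, wsfed_initiator]
no_idp = run_chain(chain, {})
both = run_chain(
    chain, {"entityID": "https://idp.example.org", "idp_protocols": ["wsfed", "saml2"]}
)
wsfed_only = run_chain(
    chain, {"entityID": "https://idp.example.org", "idp_protocols": ["wsfed"]}
)
```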

I would expect these to be eliminated and the hub would perform all this work. There are some very old plugins for discovery with cookies and forms that got little use but ultimately that can all be farmed out to "something" doing the discovery process.

AssertionConsumerServices

All SAML-based at the moment, these are really generically any endpoint that handles a login response to establish a new session. They also do a lot of the work of the SP calling other components to extract, filter, resolve attributes, etc. Notably they also create the session with the client in conjunction with the Session Cache.

I would expect these to be eliminated and the hub to do most of this work, but the cross-over into a session is TBD depending on some of the trickier questions. The local hub design would be handling sessions, so it could be a pretty direct port of the current design.

LogoutInitiators

These start (and sometimes finish) a logout operation, and the initiator concept is separated out so as to allow for issuing SAML LogoutRequests to be separate from processing inbound LogoutRequest and LogoutResponse messages.

Clearly much of this code has to go (the SAML parts), but probably a core amount of it will have to stay to support logout if the hub were remote and not managing sessions. I suspect two big piles of code that will have to stay will be some form of session cache and logout. Probably the logout would be forwarded into the hub, where it would determine whether to "finish" or issue a LogoutRequest, and return either HTTP response to the SP.

Logout Handlers

I believe only the SAML 2 (and possibly WS-Federation) LogoutHandler is actually an extant example of this, but they would process some kind of formalized logout request coming from the IdP. It's extremely complex code with weird rules for dealing with the session, locating other applicable sessions because of how SAML SLO works, implementing application notification hooks, and finally having to issue LogoutResponses.

Again, there are clearly SAML portions that would have to be offloaded, but it will be some work to tease apart what aspects are which and maintain this kind of functionality, and the remaining code will be complex and a challenge for somebody to maintain.

Miscellaneous

The rest of the handlers are a grab bag of stuff.

AssertionLookup – this was a back channel hook to pass the full SAML assertion into applications that wanted it. Not clear to me this would be worth keeping.
DiscoveryFeed – the discovery feed, this would clearly go away though might have to migrate into the hub in some form if we intended to maintain the EDS.
AttributeChecker – basically a pre-session authorization tool, probably would need to stay in some form
ExternalAuth – this was a backdoor to create sessions as a trick, I doubt we'd keep it but it would take substantial offloading to do it
MetadataGenerator – gone, obviously, but probably replaced by something else, possibly somewhere else
SAML 2 ArtifactResolution – this is for resolving outbound-issued artifacts, I can't imagine we'd bother keeping it, offloaded or otherwise, but we could
SAML 2 NameIDMgmt – if we kept this, it would probably need to morph into some more generic idea of updating active session data via external input
SessionHandler – debugging, still needed
StatusHandler – debugging, still needed I imagine
AttributeResolver – this was a trick to do post-session acquisition of data using the SP AttributeResolver/etc. machinery and dump it as JSON; if we kept this it would have to be offloaded obviously, and we'd likely have to discuss the actual requirements with someone

Session Cache

This is the most complex piece of code in the SP (and not coincidentally the IdP). Partly this is because it's a component that tends to start life as a "self-contained" component but ends up having to solve so many problems that the final result isn't so modular anymore, and didn't get decomposed into smaller portions. Sessions in general are just the hardest part of implementing one of these modules and in some sense are the only reason to do it. In a web platform that handles sessions, it's going to make more sense to implement identity inside that platform and not generically, because the application is already stuck using that platform and will be an instant legacy debt nightmare regardless of how you do identity.

With Apache, and with IIS if you exclude .NET, you have nothing. Implementing identity pretty much means you have to implement sessions, and that's a really hard thing to do because of state. IIS and Apache (the latter more so) tend to use multiple, constantly cycling processes, so any sort of cache has to address that.

Anyway, for now, current state is that the SP primarily relies on a separate process to hold an in-memory session cache, though there are plugins that can extend that. The "shibd" process was meant to be colocated with every module instance to hold the in-memory sessions across web server process changes, with the assumption that application clustering would usually dictate sticky load balancing anyway. That has become less true now because people (who aren't me) have typically started putting application state into databases, something I don't know how to make work reliably because of (the lack of) database locking.

The cache is actually a two-level architecture, and like many of the components at the SP layer, is internally remoted between the module and "shibd" in some very complex ways. The lower layer of the cache is a StorageService plugin to hold the sessions "persistently", though in common usage "persistent" here just means in-memory until "shibd" is restarted. The higher layer of the cache is a second buffer for sessions held within each web server process by the "in process" portion of the cache component. The first time a session is used within a child process, it's queried across the remoting boundary and read out of storage, deserialized, and returned across the process boundary for caching in memory in a simple hash table. Eventually entries get pruned out if not used.
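
A stripped-down model of the two layers, leaving out pruning, locking, and the remoting itself. The class names echo the terms above, but the interfaces are invented for the sketch, and JSON stands in for whatever serialization the storage layer really uses.

```python
import json

class StorageService:
    """Lower layer: holds serialized session records (in memory here)."""

    def __init__(self):
        self._records = {}
        self.reads = 0  # counts trips across the (simulated) boundary

    def write(self, key, session):
        self._records[key] = json.dumps(session)

    def read(self, key):
        self.reads += 1
        record = self._records.get(key)
        return None if record is None else json.loads(record)

class InProcessCache:
    """Upper layer: live session objects held in one web server process."""

    def __init__(self, storage):
        self._storage = storage
        self._sessions = {}  # the "simple hash table"

    def find(self, key):
        session = self._sessions.get(key)
        if session is None:
            # First use in this process: fetch and deserialize from storage.
            session = self._storage.read(key)
            if session is not None:
                self._sessions[key] = session
        return session

storage = StorageService()
storage.write("abc", {"user": "alice"})
child = InProcessCache(storage)
first_hit = child.find("abc")
second_hit = child.find("abc")
```

Each child process would own its own InProcessCache over the one shared store, which is why a session can be "cached" many times over on a busy server.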

The sessions are not live objects within "shibd", so they're effectively stored in a serialized form or in-process as active objects. There's a trick involved in order to support session timeout at a very fine granularity. For that to work across process boundaries, every session access has to update a shared timestamp. This is actually done with a hack of leveraging the expiration of the session records in storage, subtracting off some knowable values, to recover the last time used. When the session is accessed, there is always a "touch" operation that remotes to "shibd" and updates the record expiration. That touch also applies a timeout check so that every child process shares that view of the session and enforces a consistent timeout.
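
The arithmetic behind the hack is easier to see with numbers. In this sketch the "knowable values" are assumed to be a single fixed record lifetime added at each touch; the real code may combine different values, and RECORD_LIFETIME, touch, and the record layout are all illustrative.

```python
RECORD_LIFETIME = 28800  # assumed: seconds storage keeps a record after last use

def touch(record, now, timeout):
    """Enforce the inactivity timeout, then push the record expiration forward."""
    # Recover the shared "last used" time from the expiration alone.
    last_used = record["expires"] - RECORD_LIFETIME
    if now - last_used > timeout:
        return False  # every process asking would reach the same verdict
    record["expires"] = now + RECORD_LIFETIME
    return True

# A session last used at t=1000, so storage holds expires = 1000 + lifetime.
record = {"expires": 1000 + RECORD_LIFETIME}

ok = touch(record, now=1500, timeout=600)        # idle for 500 seconds
recovered = record["expires"] - RECORD_LIFETIME  # last-used time after the touch
stale = touch(record, now=2200, timeout=600)     # idle for 700 seconds
```

No separate timestamp field is stored; the expiration does double duty, which is what makes it a hack and also what makes it work with any storage back-end that supports expirations.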

Astute readers will ask why on earth we would spend this much effort on timeouts. Reviewing the list, or talking to a few clueless security people, will give you a pretty good idea. You either deal with timeouts or spend half your waking moments justifying, explaining, or arguing with people about the lack of timeouts. The only difference between logout and timeout is that the former is just impossible to do reliably while the latter is merely very hard and painful.
It is a valid question why the timeout really needs to be this strict. The suggestion to support only less fine-grained timeout windows on the order of minutes is something we have to consider, to reduce the need for "touch on every access".
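
One way the coarser approach might look: the agent remembers when it last sent a touch for a session and skips the remote call inside a window. CoarseToucher, the 60-second window, and the callback signature are assumptions for illustration only.

```python
class CoarseToucher:
    """Sends a remote touch at most once per window for each session."""

    def __init__(self, remote_touch, window):
        self._remote_touch = remote_touch
        self._window = window
        self._last_sent = {}  # session key -> time of the last remote touch

    def access(self, key, now):
        last = self._last_sent.get(key)
        if last is None or now - last >= self._window:
            self._remote_touch(key, now)
            self._last_sent[key] = now
            return True   # a remote call was made
        return False      # served locally, no traffic

calls = []
toucher = CoarseToucher(lambda key, now: calls.append((key, now)), window=60)
results = [toucher.access("s1", t) for t in (0, 10, 59, 60, 61)]
```

The cost is precision: activity inside the window is never reported, so the enforced timeout can come up to one window earlier than a strict per-access touch would give.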

The session cache does a number of other, related things, such as actually managing the session cookie used to associate the session with the client, making that detail "internal" to the implementation of the cache, simplifying other sections of code at the expense of the cache code.

Another thing it does is implement various logout tracking requirements because of SAML's horrendous decision to implement logout by NameID value. A lot of code is needed to support that and actually track logout messages for weird "logout arrives ahead of assertion" race conditions. That would become a hub responsibility no matter how the hub is designed.
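
The race is roughly "remember logouts by NameID for a while, and refuse to build a session from an assertion that predates one". The sketch below is a guess at the shape of that logic, not a port of the real code; LogoutTracker, its retention setting, and the comparison rule are all assumptions.

```python
class LogoutTracker:
    """Remembers logouts by NameID so a late-arriving assertion can be refused."""

    def __init__(self, retention):
        self._retention = retention
        self._logged_out = {}  # NameID -> time the logout was seen

    def record_logout(self, name_id, now):
        self._logged_out[name_id] = now

    def allows_session(self, name_id, assertion_issued_at, now):
        seen = self._logged_out.get(name_id)
        if seen is None or now - seen > self._retention:
            return True  # no logout on record, or it has aged out
        # Refuse assertions issued before the logout was processed.
        return assertion_issued_at > seen

tracker = LogoutTracker(retention=300)
tracker.record_logout("alice@example.org", now=1000)

raced = tracker.allows_session("alice@example.org", assertion_issued_at=990, now=1005)
fresh_login = tracker.allows_session("alice@example.org", assertion_issued_at=1100, now=1101)
other_user = tracker.allows_session("bob@example.org", assertion_issued_at=990, now=1005)
```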

Finally, the session cache also implements a feature copied from the IdP to implement encrypted cookies to persist sessions and recover them across server nodes. The intention of that feature was that it would be used only occasionally, with sticky load balancing minimizing the need to do it, but it has the obvious potential to be part of a revamped design that could store sessions exclusively that way. That requires giving up on timeouts because it can't be tracked client-side (we can't decrypt, update, and encrypt on every request efficiently) and makes logout entirely at user-discretion unless some kind of revocation cache is used (one can just put the cookie back in place and voila, session's back).

There are two storage plugins that were designed mainly to offload session state from "shibd", memcache and ODBC. The memcache option always seemed strange since it's not really any more persistent, and adds unreliable network requests (the clients don't work well, to nobody's surprise) to the mix, but as a replacement for "shibd" entirely to track sessions, it probably makes more sense. Notably, there's no obviously maintained memcache client on Windows. The ODBC option has been totally unreliable because Linux has no stable and free ODBC driver manager or drivers anymore, but again as purely a session store with no "shibd" involved, there's some continued relevance at least on Windows where ODBC is reliable.

This is the heart of a lot of the tough decisions to make, in particular whether the actual non-localized session cache should be migrated to the hub service, but I doubt that's really practical if the hub is a shared service. It's likely too much state in a lot of cases if a shared hub is used, and it just means the hub has a state problem to address just as the IdP does now. The IdP solution to that is local storage; that might work but it's tricky because it requires Javascript to get at it and it has to be the SP web server mediating that (the hub is not a proxy because we already have proxies). I punted trying to do that with the SP to this point and stuck to cookies because I thought the size was manageable. I'm not sure it actually is, but it may be less additional work to just figure out a way to get at HTML local storage than other options, or to start with cookies and see how it goes.
The one advantage of moving the second-level cache to the hub would be the potential of implementing other storage options in Java rather than C++, which will open up a lot more options. A possible hybrid would be to just implement session cache store operations remotely but not actually implement a real cache itself in the hub, just have it passthrough data between storage back-ends and the SP, essentially just relaying between a REST API we design and the Java libraries to talk to Hazelcast or what have you.
OTOH, a more local hub design analogous to "shibd" can clearly continue to provide a session store as it does now, and would have the added advantage of additional storage options being in Java. Furthermore, a revamped model of only updating session use infrequently for more inexact timeouts would greatly reduce the traffic, making remote deployment of a hub for a cluster of agents much more likely to scale.
All that said, we may need to revisit the idea of a simple file-system-based cache. Lots of Apache modules do this today and people have indicated to me that it scales. If that’s the case, we may want to consider starting with that and bypassing shibd entirely other than for stateless processing. The files could just be the encrypted content that would otherwise be in cookies and creative use of stat and touch with file timestamps could do a lot of things. But that might not be portable to Windows either.
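
As a rough illustration of the stat/touch idea, the file's modification time can serve as the "last used" timestamp. The file name, the placeholder contents, and the helper names are made up, and nothing here addresses locking, cleanup, or the Windows question.

```python
import os
import tempfile

def touch(path, now):
    """Record a session access by setting the file's access/modify times."""
    os.utime(path, (now, now))

def is_active(path, now, timeout):
    """Treat the modification time as 'last used' and apply the timeout."""
    try:
        last_used = os.stat(path).st_mtime
    except FileNotFoundError:
        return False  # no file means no session
    return now - last_used <= timeout

BASE = 1_700_000_000  # arbitrary fixed epoch time so the example is repeatable

with tempfile.TemporaryDirectory() as cache_dir:
    session_file = os.path.join(cache_dir, "session-abc")
    with open(session_file, "wb") as handle:
        handle.write(b"<encrypted session blob>")  # placeholder contents

    touch(session_file, BASE)
    fresh = is_active(session_file, BASE + 300, timeout=600)
    stale = is_active(session_file, BASE + 1000, timeout=600)
    missing = is_active(os.path.join(cache_dir, "nope"), BASE, timeout=600)
```

Logout in this model would presumably just be deleting the file, which is one thing it has over the cookie-only approach.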

Library Content

This summarizes the general functionality from each dependent library and its likely future.

Boost

Boost is a giant grab-bag of advanced C++ libraries, mostly template-based so they just compile in rather than get linked in. Much of the usage now is of things that are part of C++11 but weren't available at the time, but even now there are a lot of things in Boost that apparently got implemented much more performantly than the standard versions.

While the goal was to eliminate this if possible, there are three major components in Boost that are basically impractical to get rid of, so while avoiding a linker dependency is still a good goal, a compile time dependency is a given.

log4shib

This is a fork of "log4cpp" to address a lot of threading bugs that may or may not have been fixed upstream but by then it became academic to go back to it.

At this point, given the history of startup/shutdown issues with it, I would be happier to see this replaced with native log calls. There's a bit of a shim around logging provided today in the SP request abstraction that provides a lot of the shared code between the Apache and IIS modules, so maintaining that approach and layering the logging directly on Apache and on the Windows Event Log may make sense to try and eliminate this piece, at the expense of a lot of logging flexibility (categories, etc.).
Most of the important logging tends to be in shibd anyway, and that would obviously be no problem. It's even conceivable that log messages from the agents could be remoted to the hub to create a single logging stream.
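To make the "thin shim over native logging" idea concrete, here is a minimal sketch, assuming a single sink interface with one implementation per platform. The names (LogSink, StreamSink, AgentLogger) are hypothetical and are not the current SP request abstraction. In a real agent, the sink would presumably wrap ap_log_rerror on Apache and ReportEvent on Windows; the stream-based sink here is only a stand-in so that the example is self-contained.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

enum class LogLevel { Debug, Info, Warn, Error };

inline const char* level_name(LogLevel level)
{
    switch (level) {
        case LogLevel::Debug: return "DEBUG";
        case LogLevel::Info:  return "INFO";
        case LogLevel::Warn:  return "WARN";
        case LogLevel::Error: return "ERROR";
    }
    return "UNKNOWN";
}

// The only thing a platform module has to implement.
class LogSink {
public:
    virtual ~LogSink() = default;
    virtual void write(LogLevel level, const std::string& msg) = 0;
};

// Stand-in for a native sink; writes "[LEVEL] message" lines to a stream.
class StreamSink : public LogSink {
    std::ostream& out_;
public:
    explicit StreamSink(std::ostream& out) : out_(out) {}
    void write(LogLevel level, const std::string& msg) override
    {
        out_ << '[' << level_name(level) << "] " << msg << '\n';
    }
};

// Shared agent-side logger: one threshold, no categories.
class AgentLogger {
    std::unique_ptr<LogSink> sink_;
    LogLevel threshold_;
public:
    AgentLogger(std::unique_ptr<LogSink> sink, LogLevel threshold)
        : sink_(std::move(sink)), threshold_(threshold)
    {
        assert(sink_ != nullptr);
    }

    void log(LogLevel level, const std::string& msg)
    {
        if (level >= threshold_)
            sink_->write(level, msg);
    }
};
```

A sink that forwards messages to the hub would fit behind the same interface, which is one way the single logging stream mentioned above could be layered in later. The cost is visible in the sketch: there is one threshold and no notion of categories.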

OpenSSL

This is providing the cryptography support to the other libraries as well as direct support for loading keys and certificates and so on.

For the effort to be successful, this ideally would be eliminated as a dependency, because any touchpoints to it would imply that a significant amount of complex native code remains. The most likely fallback is that it's left as a libcurl dependency, which could lead to problems on some systems due to symbol/version conflicts with Apache, but that's unlikely on all but Solaris or macOS these days.

libcurl

The most feature-rich HTTP client available, this handles any use of the back-channel as well as the transport layer for loading certain files remotely, though in practice that's only commonly done with SAML metadata.

While it would be very nice to eliminate this as a dependency, that may prove difficult if the design ends up relying on web service calls to offload work. It may be necessary to consider a lower-level socket interface vs. a full HTTP interface since that code already exists, but that would complicate securing the connection as well as actually implementing and deploying the offloading service.
It might work to leverage WinHTTP on Windows and libcurl everywhere else, but that also means more code to abstract the difference.
Obviously a more local hub design probably doesn't need this.

zlib

This is the low-level compression library common to Linux and ported to Windows. It's used internally by OpenSSL and libcurl to implement compression support, and is also used by OpenSAML to implement DEFLATE/INFLATE support in a couple of places.

I would hope all this goes away with the offloading of function. It is used by the encrypted cookie support, but if that's offloaded, then this doesn't matter.

Xerces

Xerces is the XML parser and DOM API we standardized on, mainly because the Xerces-based XML Security code available (eventually) was much more flexible and "fit" the design of the SP than the comparable code based on libxml2. Nevertheless, we bet wrong and Xerces is unmaintained in practice and full of known and unknown security vulnerabilities. To this point, most of them have tended to lie in the DTD support, which I added options around to disable fully in our builds.

Getting rid of Xerces is probably among the most important goals we have. I would rather not replace it with anything else, which means either replacing the configuration or offloading the consumption of it.

XML-Security-C

The "Santuario" library in C++ is the XML Signature and Encryption (and C14N) code. It's also unmaintained except by us, and while it's in better shape than Xerces, it's still somewhat hard to maintain and it has some serious performance issues on large files.

This must be eliminated as a dependency to be successful because of the complexity and the dependencies on Xerces and OpenSSL.

XMLTooling

This is the lowest layer of the SP "proper", and one of the libraries that makes up OpenSAML. It's a bit of a grab bag that started life as the main XML abstraction layer on top of the DOM and grew to include lots of other utilities, I/O interfaces, the SOAP layer around "libcurl", and other stuff, including security and trust abstractions modeled loosely after work done in Java. Because there's so much in it, it has to be used by both "shibd" and the server modules, so it builds as both full and "lite" versions, with the lite build excluding all the security-related code.

Below is a preliminary assessment of the future state of some of this code.

XMLObjects

While the designs diverged to some degree, the C++ code shares some common ideas with the Java code and has a layer of XMLObject interfaces and base classes that handle most of the XML processing in the code in a way that allows programmatic access to information at a higher level than the DOM but without breaking signatures and with a high degree of DOM fidelity.

All this code has to go away for a rewrite to be beneficial.

Utilities

There are a lot of utility functions and classes here, some of which are probably still going to be needed, but it will be the last pile of code to go. Some of the code is relatively simple abstraction classes to wrap locking, mutexes and condition variables for portability, and may be worth keeping. One of the big things in this layer are the various helpers that manage UTF-16 and UTF-8 conversions between single and double-byte encoding because Xerces is natively UTF-16.

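Since Xerces is the API through which those conversions are currently done, removing it removes the conversion path as well. The standard library's own conversion facilities (std::wstring_convert and the codecvt facets added in C++11) were deprecated in C++17, so they aren't a good foundation either, and we probably have to avoid UTF-16 entirely.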

The new version should be natively relying on ASCII and UTF-8 whenever possible and on C++ 11's support for native UTF-16 where necessary. As long as the Java code outputs data in UTF-8, I would assume we can avoid UTF-16 entirely within the agents.
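If some UTF-16 data does turn up at an edge (for example, from a Windows API), the conversion to UTF-8 is small enough to carry without a library. The following is a sketch under that assumption, not code from XMLTooling. The function name utf16_to_utf8 is hypothetical, and replacing unpaired surrogates with U+FFFD is one possible policy rather than a settled decision.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Converts UTF-16 code units to a UTF-8 byte string.
// Unpaired surrogates are replaced with U+FFFD (the replacement character).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                const std::uint32_t low = in[i + 1];
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

This uses only the char16_t and std::u16string types from C++11 and none of the standard conversion facets.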

Storage Service

The persistence layer used by the rest of the code base is defined here, and the basic in-memory implementation is here.

Many of the current use cases for this are clearly out of scope for a replacement and have to be offloaded (e.g. the ReplayCache). The big unknown here is sessions. If that piece stays, it could be rebased on some internal approach to storage rather than a shared interface, but that doesn't seem all that much better, and the basic implementation is quite simple anyway.
The hub would simply replace this outright with the Java equivalent.

DataSealer

There's a port of the IdP's DataSealer idea here that supports encrypted storage of data under a shared key for cross-node purposes.

Something probably needs to provide this functionality, but offloading it should be practical.

Template Engine

There's a template engine in this layer that's used to produce error pages and the interstitial pages needed to implement SAML handoffs. The latter is clearly going away and the former probably should, in favor of some kind of more systematic approach to relaying error information to custom code, which is the only thing that's practical to do anyway.

HTTP Request/Response Abstractions

The core interfaces that ended up being reused across the SP to model HTTP requests and responses for portability ended up in this library. I doubt they're going away but they would be part of an eventual "single" library supporting the SP in the future, i.e., would migrate into "libshibsp".

Key and Certificate Handling, Trust, Signatures, Encryption

The layer of code that implements loading of keys and evaluation of peer keys and certificates (generically, not in a SAML-specific way) is here, as is a lot of supporting code that bridges between Santuario and the XMLObject layer.

Notwithstanding that securing any remote channel to offload work is TBD, this code needs to go away and the functions have to be offloaded.

SOAP and HTTP Clients

A limited HTTP client that has additional support for simple SOAP usage needed by SAML is implemented here on top of "libcurl" for security, with some very low-level callbacks into OpenSSL-aware code to do trust.

This code all needs to go away, and any functionality needed has to be offloaded.

ReloadableXMLFiles

There's a very complex base class here that supports the managed reloading of configuration state used across a number of components in the code, including the SAML metadata support.

For complexity reasons alone, this should really go, but probably won’t.

OpenSAML

This is the "proper" OpenSAML library itself and does not have a "lite" build because it's used directly only by "shibd" today. All of the SAML XMLObject wrappers are here, along with metadata support, message encoders and decoders, and some policy objects that implement some of the SAML processing rules. Some of this was patterned after early Java approaches, but is now significantly different from it.

Going into details here is largely beside the point; all this code has to go away.

ShibSP Library

All the remaining non-Apache/IIS/etc. code is inside "libshibsp" in the top-level project along with a few other add-in libraries that implement some lesser-used features.

Some of these pieces are discussed earlier; some of the remaining code is noted below.

Transaction Logging

This would probably have to end up centralized on the hub to be of much value, but possibly with some minimal session lifecycle events logged on the far side. I believe that "shibd" actually handles this logging today.

Request Mapper

The RequestMapper is a portable way of mapping requests to settings, essentially a very stripped down version of the Apache Location feature (i.e. it's strictly URL-driven, never aware of physical files or directories). Originally it was used heavily with Apache also, but has since been reworked into a layered design that allows you to combine Apache commands with the XML syntax (Apache wins out) so that it's somewhat seamless.

This is a serious annoyance that's pretty much necessary everywhere but Apache because nothing else has the kind of maturity Apache does when it comes to attaching settings to requests. It's minimally needed for IIS if for nothing else unless we seriously entertain using IIS' internal XML configuration that nobody even knows exists. XML is a pretty natural syntax for this because it's already hierarchical, and Apache's weirdness often derives from the fact that its syntax isn't consistently hierarchical, but rather involves multiple bits overlapping.
This will be a significant issue to deal with.
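For a sense of how small the core of a strictly URL-driven mapper can be, here is a sketch that assumes settings are attached to a host and to nested path segments, with deeper levels overriding inherited values. The types (PathNode, RequestMapSketch) and the setting names used in the example are illustrative only. This does not model the current RequestMapper, its XML syntax, or the layering with Apache commands.

```cpp
#include <map>
#include <memory>
#include <sstream>
#include <string>

using Settings = std::map<std::string, std::string>;

// One level of the hierarchy: a host or a path segment.
struct PathNode {
    Settings settings;
    std::map<std::string, std::unique_ptr<PathNode>> children;

    // Returns the child for a segment, creating it if needed.
    PathNode& child(const std::string& segment)
    {
        std::unique_ptr<PathNode>& slot = children[segment];
        if (!slot)
            slot.reset(new PathNode());
        return *slot;
    }
};

// Copies src into dst; values from src replace inherited ones.
inline void overlay(Settings& dst, const Settings& src)
{
    for (const auto& kv : src)
        dst[kv.first] = kv.second;
}

class RequestMapSketch {
    std::map<std::string, PathNode> hosts_;
public:
    PathNode& host(const std::string& name) { return hosts_[name]; }

    // Walks the path one segment at a time and accumulates settings from
    // the host down to the deepest matching node. The path is expected to
    // be the URL path only (no query string).
    Settings resolve(const std::string& host, const std::string& path) const
    {
        Settings effective;
        const auto h = hosts_.find(host);
        if (h == hosts_.end())
            return effective;

        const PathNode* node = &h->second;
        overlay(effective, node->settings);

        std::istringstream segments(path);
        std::string seg;
        while (std::getline(segments, seg, '/')) {
            if (seg.empty())
                continue;  // skip the leading slash and doubled slashes
            const auto c = node->children.find(seg);
            if (c == node->children.end())
                break;
            node = c->second.get();
            overlay(effective, node->settings);
        }
        return effective;
    }
};
```

For example, if a host sets a default value, its "secure" path adds a second setting, and "secure/admin" overrides the first, then resolving "/secure/admin/page.html" yields the overridden value plus the inherited one. The lookup itself is the easy part; the hard parts noted above (where the configuration lives, and how it layers with native server configuration) are not addressed by a sketch like this.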

Access Control

This is another thing that exists primarily because not everything is Apache and has a real authorization layer, so this was a portable alternative. It's essentially tied into the Request Mapper, though implemented separately.

I would doubt this can be eliminated since it's not really enforced at session start, but per-request based on content. In practice, I never saw much value in static rules like this and it seems as though it never actually works since people don't want to just get an error page, they want to actually force a new session to try and get a different outcome. That's something that can't really be done with Apache's authorization model, but probably could be done if it were leveled up into the request processing steps.

Applications and Overrides

This is discussed pretty thoroughly in ApplicationModel, but is basically a way of layering segregation of content on top of an undifferentiated web document space. It's tied heavily into the Request Mapper component (which maps requests to the proper application ID). Since most of the settings in this layer are really more SAML-level details, the implication is that maintaining this feature would require that the hub know about it intimately to be able to segregate its own configuration in the same way.

I deeply want to dump this concept, which has not worked. People can't handle or understand the concept. But I don't know how to just ignore it. Maybe ignoring it at the path level would fly but I can't really see losing it at the host level. In some sense, every virtual host is truly virtual to the hub. Securing all of this probably has to rely on some kind of security mapping between the virtual host and a certificate to "authorize" requests to the hub for a particular entityID and vhost. So in a sense, it's the norm to think of every vhost as its own "thing" in this new world and it sort of doesn't matter which vhosts share a physical server.
Really too early to think about the implications yet, other than to say that it would be very difficult to emulate if we moved a lot of the SAML configuration to the Java code we have now. The IdP doesn't have the concept of "use this metadata for this request and this other metadata for this other request", nor does it support separate attribute rulesets in that manner.

Other Thoughts

Security

The security model around the use of a shared, remote hub seems interesting and subtle. Superficially, it feels like something you have to deeply lock down, but...is it? Clearly the agents have to strongly authenticate the hub, which is a relatively straightforward application of TLS server authentication, and both libcurl and WinHTTP can handle that sort of thing in some sane way. I'm more interested in exactly why the hub cares who the agent is and what the threats are.

One obvious risk is data exposure. If you offload XML decryption to the hub (maybe it has one key for all, maybe multiple, not sure it matters all that much), you're returning processed results after decrypting the data and turning them into raw attributes. Clearly that has to be protected to a point to prevent anybody from just handing off the XML and getting an answer, but a confidentiality risk (especially with the kind of data we tend to see with SAML) is somewhat less a catastrophe than an authentication risk.

I think a big argument against tying the hub into "real" session management for the agents is to reduce risk. If the agent still has to be the control point to issue a cookie and tie that back to the state it gets back from the hub, then an unauthorized party submitting a SAML (or OIDC) response/code to the hub gets back data, but not a session it can actually attack an SP with, and if it could "just" hand the same response off to the SP so that it calls the hub...that's, uh, how these protocols work. Bearer and all that.

So I'm interested in exactly what the security risks are from the view of the hub authenticating the agent beyond just exposure of data. That's enough, but is it enough to justify strong end to end TLS with keypairs? Or do we swim with the tide and just issue passwords (oh, excuse me, client secrets)? And how tight a binding does there have to be in terms of segregating which entityIDs/response URLs can authenticate within an enterprise? What damage is done inside the firewall by server A asking the hub to chew on a response issued to server B? I'm not really sure.

Of course, a non-shared, local hub is just "shibd", and likely assumes localhost traffic and/or deployer-applied security measures to protect things. There's nothing new there, and security is hard enough that trying to secure that is probably a bad idea.