Protecting data in a runtime environment: Part 2 - Transparent metadata wrappers
- What would be the pre-requisites for using such a concept?
- What is the problem I’d like to have a better solution for?
- Solution Overview
- So what does this solution look like to a developer?
- Closing thoughts
In the last post I wrote up some overview thoughts on data protection in a programming environment. In this one I want to zoom in on the last item from that post: custom “boxing” of the actual data, so that it carries metadata (which we already had at the database layer) describing which entities should consume it if we want to adhere to correct security and privacy practices.
Authorizing data access at runtime based on this metadata would be icing on the cake, but the basic goal is simply that the metadata travels with the data, and that any merge of datums (is that a word?) merges the metadata correctly.
What would be the pre-requisites for using such a concept?
If we take a software company that might use this type of approach, they would need to be at a certain level of technology maturity to be able to adopt it. Here are the requirements:
- We have a datastore that has data in it that has privacy annotations (i.e. metadata).
- There is some global authorization on how a given identity can access some piece of data based upon its metadata.
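To make these two prerequisites concrete, here is a minimal sketch of what they might look like in code. Everything in it is an assumption for illustration: the `Sensitivity` labels, the `ColumnMetadata` record, the role names and the `AccessPolicy` class are hypothetical and not taken from any particular datastore or authorization framework.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sensitivity labels; a real datastore would define its own.
enum Sensitivity { PUBLIC, INTERNAL, PII }

// Hypothetical privacy annotation attached to a column in the datastore.
record ColumnMetadata(String column, Sensitivity sensitivity) {}

// Hypothetical global policy: maps each role to the labels it may read.
final class AccessPolicy {
    private static final Map<String, Set<Sensitivity>> READABLE_BY_ROLE = Map.of(
        "support-agent", Set.of(Sensitivity.PUBLIC, Sensitivity.INTERNAL),
        "privacy-officer", Set.of(Sensitivity.PUBLIC, Sensitivity.INTERNAL, Sensitivity.PII));

    // Decide access purely from the identity's role and the data's metadata.
    static boolean canRead(String role, ColumnMetadata metadata) {
        // Roles the policy does not know about get no access at all.
        return READABLE_BY_ROLE.getOrDefault(role, Set.of()).contains(metadata.sensitivity());
    }
}
```

With this in place, `AccessPolicy.canRead("support-agent", new ColumnMetadata("email", Sensitivity.PII))` is denied, while the same call for `"privacy-officer"` is allowed. The point is only that the decision depends on the metadata, not on the raw value.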
What is the problem I’d like to have a better solution for?
A visual seems to work best for me here; the diagram is also duplicated in the code repo.
I’ve tried to use colours from green to red to indicate the data-exposure risk; hopefully that is ~intuitive.
Notice that when the data is in the DB on the left it is appropriately described via metadata attributes. However, because this metadata is dropped on the floor when the data is fetched, the downstream consumers (e.g. database2) do not benefit from it and hence can expose the data insecurely.
Therefore, suppose we instead required all data to be packaged up with its metadata, along with some rules for how metadata is combined when data is combined in the code. Then we could require that every place that needs raw data access (e.g. database2) knows how to reason about the attached metadata.
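As a rough sketch of that idea, the snippet below packages a value with a label, merges labels when two values are combined, and makes a sink consult the label before it touches the raw value. The names `Label`, `Boxed` and `LogSink`, and the "most sensitive label wins" merge rule, are assumptions made for illustration; they are not the data-boxes API.

```java
import java.util.function.BiFunction;

// Hypothetical labels, declared from least to most sensitive.
enum Label { PUBLIC, INTERNAL, PII }

// A value packaged with its label (illustrative only).
record Boxed<T>(T value, Label label) {

    // Combining two boxed values keeps the more sensitive of the two labels.
    <U, R> Boxed<R> combine(Boxed<U> other, BiFunction<T, U, R> fn) {
        Label merged = label.compareTo(other.label) >= 0 ? label : other.label;
        return new Boxed<>(fn.apply(value, other.value), merged);
    }
}

// A sink that needs raw data has to reason about the label before unwrapping.
final class LogSink {
    static String render(Boxed<?> boxed) {
        return boxed.label() == Label.PII ? "[REDACTED]" : String.valueOf(boxed.value());
    }
}
```

For example, combining a `PUBLIC` greeting with a `PII` name via `greeting.combine(name, String::concat)` yields a box labelled `PII`, so `LogSink.render` redacts it even though one of the inputs was harmless on its own.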
So what does this solution look like to a developer?
I began to create a toy project around this at github.com/joekir/data-boxes with the following ideas to increase developer adoption:
- The boxed object responds to all the same public methods as the contained type, therefore code changes are minimized.
- There is a negligible performance overhead through opting for compile-time “magic” over run-time “magic”.
The second point still needs a lot of work. manifold.systems is the Java framework I found that best achieves the above so far, but this technique would not need to be in Java. It’s just the language I’m more comfortable with, and it can accommodate language manipulation more easily than some other options.
Here’s an example of how it currently works:
```java
// Before
...
String id = "123456abcd";
Integer foo = fetchDataFromStore(id);
var bar = foo + 10;
System.out.println(bar);
...
```

```java
// After
...
String id = "123456abcd";
DataBox<Integer> foo = fetchDataFromStore(id);
var bar = foo + 10; // any operation on a DataBox<T> would also return a DataBox<T>
System.out.println(bar);
...
```
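For comparison, here is a sketch of the same behaviour in plain Java with no compile-time help, so the operation has to be written explicitly via a `map` call instead of `+`. This `DataBox` is an illustrative stand-in and not the implementation in the data-boxes repo; the string `tags`, the `unwrap` escape hatch and the choice to hide the value in `toString` are all assumptions.

```java
import java.util.Set;
import java.util.function.Function;

// Illustrative stand-in for DataBox<T>; not the data-boxes implementation.
final class DataBox<T> {
    private final T value;
    private final Set<String> tags; // hypothetical metadata, e.g. "pii"

    DataBox(T value, Set<String> tags) {
        this.value = value;
        this.tags = Set.copyOf(tags);
    }

    // Apply an operation to the contained value; the result stays boxed with the same tags.
    <R> DataBox<R> map(Function<? super T, ? extends R> fn) {
        return new DataBox<>(fn.apply(value), tags);
    }

    Set<String> tags() { return tags; }

    // Explicit escape hatch, so that raw access is easy to find in code review.
    T unwrap() { return value; }

    // Design choice for this sketch: printing shows the metadata, never the raw value.
    @Override
    public String toString() { return "DataBox" + tags; }
}
```

With this sketch, `var bar = foo.map(v -> v + 10);` plays the role of `foo + 10`: `bar` is still a `DataBox<Integer>` carrying the tags of `foo`, and passing it to `System.out.println` shows the tags rather than the number. Closing the gap between `foo.map(v -> v + 10)` and `foo + 10` is the part that needs the compile-time “magic”.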
I’ve observed that the tooling/framework changes that work well are the ones that can be progressively adopted. An example of this is Stripe’s Sorbet type checker for Ruby.
If adoption is optional, then each development team within an organization can migrate when it is appropriate given their other priorities. This is the antithesis of the “big-bang” approach (Big-Bang Adoption), which I’ve always seen fail.
Re-stating the key assumptions for why I think a technique like this should be the future for enterprise software development:
- 99.9999% of software developers are well-intentioned when it comes to security; if you make an approach easy to adopt (and not performance impacting), they will try it.
- It’s already becoming commonplace to require tagging of PII at a datastore level.
- The choice of authorization frameworks (a cool one I saw recently is Oso) is better than it used to be, and granular data access is a more popular topic than I can ever recall.
- In technology we see countless breaches of customer information that result from some data context (metadata) not being present when the data is evaluated, e.g. for logging. Yet the business as a whole did have that metadata present elsewhere in its systems.