September 20, 2023 · 10m read
Here's a non-controversial take: access management is hard. You want to provide people with just enough access to do their jobs, but no more and no less. By granting them too much access, you put your resources at risk. Over-provisioned access is a common misconfiguration that can give attackers the ability to do more damage. Limiting access is an important way to limit the blast radius of an account compromise, which is why least privilege is one of the core pillars of information security. On the other hand, with too little access, your users can’t get their work done.
At Material, each of our customers gets their own isolated, single-tenant instance in the form of a Google Cloud Platform (GCP) project. That’s a boon for security and privacy because it provides strong isolation between resources to prevent unauthorized access across the tenant boundary. However, it introduces an additional layer of complexity for our access management. Instead of a single production environment, we need to be able to support thousands. When I started working at Material, my first project was to figure out how we could improve this process.
The unwieldiness of managing access for so many GCP projects gave us the motivation to take a step back and look at this problem from all angles. We wanted a flexible solution to minimize persistent access, seamlessly automate elevated access, and capture everything in a single immutable audit trail.
Breaking Down Access
Today, we think about access management for employees in three facets:
- Base access: a persistent set of minimally-privileged roles that users (our employees) need to have all the time. Our engineers have the ability to view monitoring dashboards at all times, so they can determine what additional access they might need for troubleshooting problems.
- Just-in-time access: a lesser form of elevated access that is pre-approved by our security team. Our employees can use step-up authentication to obtain a specific set of roles on a temporary basis, and those roles vary based on that employee’s job function. The on-call engineer who gets up in the middle of the night to triage an alert can use just-in-time access to quickly obtain the roles to do things like view logs and query databases without having to wait hours for an access request from a separate team.
- Tailored access: a greater form of elevated access that requires explicit approval from our security team. This access may be temporary or permanent, and it requires justification in the form of a work-related ticket. If an engineer needs write access to a database to fix a migration problem, or a security architect needs to deploy a custom serverless function to solve a customer’s use case, then they’ll submit a tailored access request.
In GCP, you can easily implement base access for projects using built-in IAM policies. We had to build a system to implement the rest. This post focuses on how we implemented tailored access.
Automating Tailored Access Requests
Any tailored access request follows this lifecycle:
Submit request → Approve request → Grant access → Revoke access
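That lifecycle can be sketched as a small state machine (a toy model, not our actual implementation; a real system would also need to handle denial and expiry):

```python
from enum import Enum

class RequestState(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"
    GRANTED = "granted"
    REVOKED = "revoked"

# Legal transitions in the lifecycle above.
TRANSITIONS = {
    RequestState.SUBMITTED: {RequestState.APPROVED},
    RequestState.APPROVED: {RequestState.GRANTED},
    RequestState.GRANTED: {RequestState.REVOKED},
    RequestState.REVOKED: set(),
}

def can_transition(current: RequestState, target: RequestState) -> bool:
    """True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS[current]
```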
Before we had automation for this, dealing with access requests was a manual process that had too much room for human error. We had several goals for our tailored access system:
- Remove as much user friction as possible, both for our engineers and our security team. We want access requests to be secure without being difficult and time-consuming. The less manual work involved, the better.
- Make it easier for engineers to figure out what they need. GCP has over 1,000 predefined roles and almost 8,000 individual permissions. When you lack access to do something in the GCP web console, it will tell you what permission(s) you need, but you are left with a puzzle to figure out the best role for the task at hand.
- Retain an immutable audit trail. When we approve an access request, we want to keep a centralized record that can’t be tampered with.
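The role-discovery problem from the second goal is essentially a reverse lookup: given a permission, which roles grant it? A toy sketch, assuming a role-to-permissions table has already been fetched (in practice, from the IAM roles API):

```python
def roles_for_permission(permission: str,
                         roles: dict[str, set[str]]) -> list[str]:
    """Return the names of roles that grant `permission`,
    most tightly scoped role first."""
    matches = [name for name, perms in roles.items() if permission in perms]
    # Fewer total permissions is a rough proxy for least privilege.
    return sorted(matches, key=lambda name: len(roles[name]))
```

Sorting by role size is a crude heuristic; a real implementation would also weigh each role's intended purpose and any organizational policy.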
We knew we wanted to build a web UI to render an access request form. We started by prototyping what that looked like. First we wanted to auto-populate some of the information provided by GCP:
We also wanted to make it easier for users to find which role(s) they need:
We briefly considered using a traditional database to store access requests, but that would mean adding a ripe new attack surface to our pipeline. Instead, we realized that we already had a tool that’s perfect for this.
GitHub already provides a workflow to handle submission and approval of changes (pull requests), retains an immutable audit trail (commit history), provides strong authentication (mandated 2FA) and authorization (repository roles, code owners, and branch protection rules), and is easy to integrate with (apps, libraries).
The idea of using a Git repo to store your source of truth—and continuously reconcile system state against it—is known as GitOps, a form of infrastructure-as-code. We already use this pattern to build and deploy our application and the infrastructure that it runs on, and we realized that we could handle access management by leveraging the same strategy.
Here is a high-level view of the different components and data flows that we implemented:
We already had home-grown deployment tooling for continuous delivery (CD) of our application and infrastructure, but we needed to add new components to support the access management workflow:
- A separate GitHub repo to store tailored access configuration JSON. We chose to make this a separate repo for two reasons: (1) to have a discrete commit history for tailored access, and (2) to grant our GitHub App access to this without giving it access to the rest of our source code.
- A web app to facilitate tailored access requests. This allows users to quickly select who needs access to what resources, and for how long. This is important to reduce the chance for human error, and to make the process fast and easy.
- A GitHub App with permission to read the new GitHub repo and submit pull requests to it. Our web app uses a private key for this GitHub app to authenticate to GitHub.
We also needed to modify our deployment tooling to take the new GitHub repo into account. It needed to be able to read the tailored access configuration for each of our GCP projects, create new role bindings, and remove old role bindings.
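The reconciliation itself boils down to a diff between the desired bindings in the repo and the live bindings in GCP. A sketch, modeling each binding as a (role, member) tuple:

```python
def reconcile(desired: set[tuple[str, str]],
              actual: set[tuple[str, str]]) -> tuple[set, set]:
    """Return (bindings to add, bindings to remove) so that the
    live IAM policy converges on the configuration in Git."""
    return desired - actual, actual - desired
```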
Each GCP project has a globally unique ID, so we structured our Git repo around this. Each directory is named for a project, and it contains a security.config.json file that defines tailored access:
Each security.config.json file contains an array of “tailored access” entries that describe what role binding should be added in GCP, along with some additional metadata like the unique ID of the access request (also added to our audit trail) and the justification for the access:
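As a sketch, an entry might look like this (the field names are illustrative, not Material's actual schema):

```json
{
  "tailoredAccess": [
    {
      "requestId": "TA-1234",
      "member": "user:alice@example.com",
      "role": "roles/cloudsql.client",
      "expiry": "2023-10-01T00:00:00Z",
      "justification": "TICKET-5678: fix failed migration on customer DB"
    }
  ]
}
```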
We did our best to anticipate issues, but we still ran into a few things we didn’t expect.
Integrating Multiple Repos with Google Cloud Build
We use a monorepo for our application and our infrastructure code. In this monorepo, we added a git submodule that tracks the main branch of our new tailored access repo. Cloning the monorepo using the --recurse-submodules flag would have automatically initialized and cloned the tailored access repo submodule. However, there is a longstanding issue that Google Cloud Build does not support Git submodules.
To address this shortcoming, we implemented a workaround in the form of an additional build step that configures, initializes, and clones the submodule.
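A workaround along these lines (the repo URL and submodule name are illustrative) can be added as an extra step in cloudbuild.yaml:

```yaml
steps:
  # Workaround for Cloud Build's lack of submodule support:
  # configure, initialize, and clone the submodule by hand.
  - name: gcr.io/cloud-builders/git
    entrypoint: bash
    args:
      - -c
      - |
        git config -f .gitmodules submodule.tailored-access.url \
          https://github.com/example-org/tailored-access.git
        git submodule init
        git submodule update
```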
GCP Basic Role Limitations
GCP supports three different kinds of roles: basic, predefined, and custom. There are only three basic roles (Owner, Editor, and Viewer), which contain a broad swath of permissions. Google recommends that you avoid using basic roles because they are broadly-scoped, but some situations warrant them.
When building this system, we assumed that each role binding added for temporary tailored access would have an attached condition to make sure it expires at the appropriate time. Our deployment tooling runs multiple times a day on a set interval, so there would be some lag between when a role binding should expire and when it would actually be removed from GCP. With a condition attached, that lag is harmless: as soon as a binding expired, it would be disabled until it was eventually removed by our deployment tooling.
However, basic role bindings cannot have conditions attached, and attempting to add one will result in an error. Since we use basic roles very infrequently, we decided to modify our tooling to add basic role bindings without conditions, and we accepted the risk that sometimes they may not be removed until a few hours after they’re supposed to expire.
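Putting the two constraints together, the binding-construction logic might look like this (a sketch; the helper name is ours, and the expiry expression uses IAM's CEL condition syntax):

```python
from datetime import datetime, timezone

BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}

def build_binding(role: str, member: str, expires_at: datetime) -> dict:
    """Build an IAM role binding, attaching an expiry condition
    unless the role is a basic role (which rejects conditions)."""
    binding = {"role": role, "members": [member]}
    if role in BASIC_ROLES:
        # No condition allowed: rely on the next deployment run
        # (and the config's expiry metadata) to remove the binding.
        return binding
    binding["condition"] = {
        "title": "tailored-access-expiry",
        "expression": (
            f'request.time < timestamp("{expires_at.isoformat()}")'
        ),
    }
    return binding
```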
We started to implement this tailored access system with employees in mind, but we realized that it could solve a lot more access management use cases for us. We decided to focus on a minimum viable product and solve our most important use case first: facilitating employee access to GCP projects. Once that was finished and working smoothly, we extended it over time to support additional types of GCP members (non-employee users, groups, and service accounts), additional GCP resource types (folders and organizations), and custom GCP roles, and even to delegate approval for certain GCP resources to different groups.
This has made a big impact in several ways, including:
- Our 4-person security team has actioned over 300 tailored access requests in the past 10 months, keeping up with increasing volume as our company has more than doubled in size.
- We are now able to approve and implement access requests in minutes.
- Automatic expiration gives us more confidence that we’re removing access when it’s no longer needed.
- People are now requesting access for shorter periods of time, because it’s so easy to submit another tailored access request for more time if needed.
In our next blog post, we’ll cover how we implemented just-in-time access! Follow us on LinkedIn to look out for the next one.