February 08, 2024 · 7m read
Material Security’s unique deployment model eliminates many traditional sources of third-party risk by providing full infrastructure transparency and data isolation to our customers. Every Material customer gets their own Google Cloud Platform (GCP) project that is single-tenant and fully isolated from other customers. This supports a business relationship that requires less trust while allowing us to fully customize each environment for feature rollouts or customer needs.
The ability to support thousands of isolated production environments requires a firm commitment to infrastructure-as-code. Each copy of our architecture is largely event-driven and needs to operate reliably at scale without human intervention. These requirements inform Material’s technology choices and codebase primitives.
Let’s break it down.
Elements of Material’s Tech Stack
Material’s codebase is TypeScript, deployed in stateless containers on GKE, GCP’s managed Kubernetes offering.
Our primary reason for using GCP is BigQuery, which has been instrumental in our ability to experiment and remain agile in the data infrastructure space. Alongside BigQuery we make use of several other basic GCP services, including GCS, Cloud SQL (where we run Postgres), Firestore, and PubSub (which is a critical component of our event-driven architecture).
TypeScript is a versatile and expressive language which we use in our frontends, backends, and tools. Sharing code between client and server speeds up development, e.g. by sharing data interfaces between our storage systems and frontends and by sharing utility libraries. TypeScript has a large, well-supported ecosystem of 3rd party packages and a lot of developers are already familiar with it.
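As a small illustration of the shared-types benefit (the `Message` shape here is made up for the example, not Material’s actual data model):

```typescript
// A single interface can describe both what the server returns and what the
// frontend expects, so shape drift fails to compile rather than at runtime.
interface Message {
  id: string;
  sender: string;
  receivedAt: string; // ISO 8601 timestamp
}

// Server side: an endpoint handler returns Message[]...
function listMessages(): Message[] {
  return [
    { id: "m1", sender: "alice@example.com", receivedAt: "1970-01-01T00:00:00.000Z" },
  ];
}

// ...and frontend code consumes the exact same type.
function renderSubjectLine(m: Message): string {
  return `${m.sender} · ${m.receivedAt}`;
}
```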
The codebase is a monorepo, but it’s not deployed as a monolith.
Code is organized into files called “Workers,” which bundle a TypeScript function with a declarative description of its runtime requirements. A Worker is typically an API endpoint handler, a PubSub consumer, or a cron job. At runtime, two Workers are never allowed to assume that they will run inside the same process; i.e. all communication between Workers must use an external system like PubSub or Postgres.
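As a sketch of the Worker primitive (field names like `trigger` and `resources` are illustrative assumptions, not Material’s actual interface):

```typescript
// Declarative description of a Worker's runtime requirements.
interface WorkerSpec {
  name: string;
  // How the Worker is invoked: an HTTP endpoint, a PubSub topic, or a cron schedule.
  trigger:
    | { kind: "http"; path: string }
    | { kind: "pubsub"; topic: string }
    | { kind: "cron"; schedule: string };
  // Resource requests, later used to generate Kubernetes Deployments/HPAs.
  resources: { cpu: string; memory: string };
}

// A Worker bundles the spec with its handler. Workers never assume they share
// a process, so inter-Worker communication goes through external systems
// (PubSub, Postgres) rather than in-memory state.
interface Worker<In, Out> {
  spec: WorkerSpec;
  handler: (input: In) => Promise<Out>;
}

// Example: a PubSub consumer Worker (topic name is made up).
const exampleConsumer: Worker<{ messageId: string }, void> = {
  spec: {
    name: "ExampleConsumer",
    trigger: { kind: "pubsub", topic: "example-events" },
    resources: { cpu: "250m", memory: "256Mi" },
  },
  handler: async ({ messageId }) => {
    // Business logic goes here; side effects flow through external systems.
    console.log(`processing ${messageId}`);
  },
};
```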
Workers are organized together into modules called Apps which solve specific problems. Most Apps are either backend business logic, a React frontend, or a reusable platform “Lego” on which other Apps depend. Apps are composed into Loadouts which provide a complete package of functionality and serve as the interface between Material’s development and release/deployment processes. The GCP services which must be provisioned for a Loadout are dictated by the dependencies of that Loadout’s Apps.
Today, we have six different Loadouts built from 42 different Apps built from 80 different Workers.
Kharon is a global proxy that allows Material to provide Phishing Simulation domains “out-of-the-box” in addition to any custom simulation domains a customer might register. Kharon receives traffic on these Material domains and routes the simulation request to the appropriate tenant.
For example, the Kharon Loadout is built from the following Apps:

- Auth: a platform module for authentication plus token-based authorization (non-RBAC).
- Coordination: a platform module offering distributed-systems primitives like leasing, leader election, and a highly available (but low-durability) key-value store.
- GCloud: an infrastructure module for running on GCP.
- Kharon: the business logic for Kharon, containing two Workers:
  - KharonApi: an endpoint handler to programmatically configure Kharon’s behavior.
  - KharonProxy: an endpoint handler that actually performs the redirection.
- Model: a platform module offering a relational database.
Loadouts contain Apps which contain Workers which contain actual code 😎
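To make the composition concrete, here is a hypothetical sketch of how a Loadout like Kharon might be declared. All shapes are illustrative; per-App GCP service needs are mostly omitted, except Model’s CloudSQL dependency, which the article states:

```typescript
interface App {
  name: string;
  workers: string[];     // Worker names owned by this App
  dependsOn: string[];   // other Apps this App requires
  gcpServices: string[]; // GCP services this App needs
}

interface Loadout {
  name: string;
  apps: App[];
}

const kharonLoadout: Loadout = {
  name: "Kharon",
  apps: [
    { name: "Auth", workers: [], dependsOn: [], gcpServices: [] },
    { name: "Coordination", workers: [], dependsOn: [], gcpServices: [] },
    { name: "GCloud", workers: [], dependsOn: [], gcpServices: [] },
    {
      name: "Kharon",
      workers: ["KharonApi", "KharonProxy"],
      dependsOn: ["Auth", "Model"],
      gcpServices: [],
    },
    { name: "Model", workers: [], dependsOn: [], gcpServices: ["CloudSQL"] },
  ],
};

// The GCP services to provision for a Loadout are the union of what its Apps need.
function requiredServices(loadout: Loadout): Set<string> {
  return new Set(loadout.apps.flatMap((a) => a.gcpServices));
}
```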
Our Deployment Model
Every Material customer gets their own GCP project that is single-tenant and fully isolated from other customers. The customer’s GCP project wholly contains the Material application and data. As a result, no customer data is ever stored outside the project. All customer projects have GCP Security APIs enabled and monitored by Material’s Security Team.
Most customers allow us semi-managed access to provide on-call, operational support, although a fully “on prem” style is also possible. In all cases, the customer has direct GCP access.
These customer projects all run the same Loadout which provides Material’s core product. We also have additional GCP projects for instances corresponding to our other Loadouts, e.g. we have GCP projects for Kharon staging and Kharon prod.
This approach makes security, compliance, and auditing more straightforward in addition to allowing full customization of each project to suit customer needs. It does create challenges, especially with release engineering and data infrastructure.
Infrastructure As Code
Each GCP project contains exactly one Loadout. The infrastructure configuration for that GCP project is determined by the configuration of the Loadout. For example, the Kharon Loadout contains the Model App and so a GCP CloudSQL database will be provisioned in Kharon projects.
Each infrastructure component, including all telemetry, is configured statelessly by comparing the live setup in GCP with the desired configuration derived from the Loadout. Our approach keeps no state, and it is intentional that any changes made manually, e.g. by humans in the GCP console, are reverted by the automation.
It’s true that this basic functionality is redundant with what Terraform and Pulumi offer. However, we are quite happy to manage transitions between configurations using our own code. For example, determining whether Terraform will update a resource in place or destroy and recreate it (causing an outage) is error-prone and impossible to do from a code review of the configuration alone. The situation quickly gets out of hand when multiple interdependent resources must be modified jointly, and the stateful approach to convergence makes all of this harder to debug when things go wrong. We would rather ensure reliable configuration transitions ourselves.
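A minimal sketch of the stateless diff-and-apply loop described above, using in-memory stand-ins for the live GCP state (all shapes are illustrative):

```typescript
interface Resource {
  id: string;
  config: string; // serialized desired configuration
}

type Plan = { create: Resource[]; update: Resource[]; delete: string[] };

// Compare desired vs. live with no intermediate state file: everything is
// recomputed from scratch on every run.
function plan(desired: Resource[], live: Resource[]): Plan {
  const liveById = new Map(live.map((r) => [r.id, r]));
  const desiredIds = new Set(desired.map((r) => r.id));
  return {
    create: desired.filter((r) => !liveById.has(r.id)),
    update: desired.filter((r) => {
      const l = liveById.get(r.id);
      return l !== undefined && l.config !== r.config;
    }),
    // Anything live but not desired is purged -- including manual console edits.
    delete: live.filter((r) => !desiredIds.has(r.id)).map((r) => r.id),
  };
}
```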
Kubernetes In Particular
The GCloud App provides the most basic configuration for Kubernetes, e.g. node pool definitions and sanity-checks on auto-scaling limits. The bulk of the Kubernetes configuration is created dynamically from the Loadout as follows:
- Deployments and Horizontal Pod Autoscalers (HPAs) are created from Workers, based on each Worker’s declared runtime requirements.
- Ingresses, NetworkPolicies, Services, etc. are created from Deployments.
- Secrets are handled primarily by storing them in GCS and granting IAM access to the Workload Identity service account; the exception is bootstrapping secrets, which are actual Kubernetes Secret objects.
After creating and updating all Kubernetes resources that are desired, we then purge all extraneous resources from the cluster. This is possible because we are always operating based on the full specification of everything that should exist in the cluster instead of taking a piecemeal approach where chunks of the config are applied independently. This lets us avoid situations where Kubernetes resources end up abandoned when their configuration is deleted.
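The generate-then-purge flow can be sketched as follows (object shapes are trimmed and field names are hypothetical):

```typescript
interface WorkerDef {
  name: string;
  exposesHttp: boolean;
}

interface K8sObject {
  kind: "Deployment" | "HorizontalPodAutoscaler" | "Service";
  name: string;
}

// Each Worker yields a Deployment and an HPA; traffic-serving Workers also
// get Services (and in reality Ingresses, NetworkPolicies, etc.).
function objectsForWorker(w: WorkerDef): K8sObject[] {
  const objs: K8sObject[] = [
    { kind: "Deployment", name: w.name },
    { kind: "HorizontalPodAutoscaler", name: w.name },
  ];
  if (w.exposesHttp) objs.push({ kind: "Service", name: w.name });
  return objs;
}

// Because the full desired set is always known, anything in the cluster that
// isn't in it can be safely purged rather than left abandoned.
function extraneous(desired: K8sObject[], inCluster: K8sObject[]): K8sObject[] {
  const keys = new Set(desired.map((o) => `${o.kind}/${o.name}`));
  return inCluster.filter((o) => !keys.has(`${o.kind}/${o.name}`));
}
```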
Ultimately, Material’s deployment model puts infrastructure configuration out of the minds of most developers.
Our centralized deployment system continuously converges projects towards their desired commit SHAs as configured in GitHub. Modifications to what is “desired” for a project are done via GitHub PR like any other configuration change. In practice our projects tend to mostly be pointed to the stable release or the release candidate. When a project needs to be updated, the deployer launches the per-project infrastructure configuration to effect the change.
We have the capability to “pin” a particular project to a particular branch outside of the normal release process, but usage of this mechanism generally implies that something has been launched without proper feature gating or we are doing incident response.
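A sketch of the convergence check, with ref resolution stubbed out (in reality this would consult GitHub); names and shapes are hypothetical:

```typescript
interface Project {
  id: string;
  desiredRef: string; // configured via GitHub PR, usually "stable" or the RC
  pinnedRef?: string; // out-of-band pin, e.g. during incident response
  deployedSha: string;
}

// A pin overrides the normal desired ref; any project whose deployed SHA
// differs from its resolved target needs the deployer to act on it.
function projectsToConverge(
  projects: Project[],
  resolveRef: (ref: string) => string,
): Project[] {
  return projects.filter(
    (p) => p.deployedSha !== resolveRef(p.pinnedRef ?? p.desiredRef),
  );
}
```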
Our isolation model, which necessitates infrastructure-as-code across many isolated GCP projects, is generally beneficial. It does present some unique challenges for integrations with software that doesn’t contemplate our approach. For example, our Snowflake integration requires us to use data-ingestion APIs that are more convoluted than a native integration between two Snowflake tenants. When evaluating data infrastructure tools like Airflow or dbt, we have to ask how we would programmatically administer hundreds or thousands of instances. This approach also presents challenges for our observability capabilities, as we initially lacked a unified method for monitoring and alerting across all projects (more on this to come).
Our codebase layout is high quality overall, but we are experiencing growing pains in some areas such as the inability to formally express dependencies among our Apps. This makes the creation of new Loadouts a bit tedious and sometimes results in runtime circular dependency problems which ideally would be caught at compile time.
Finally, our release convergence tooling is adequate, but it lacks features and tooling for non-engineering users, and this is becoming increasingly painful as we scale up our Product, Sales, and Customer Experience organizations.
Stay tuned for more updates as our architecture continues to evolve and scale!