March 14, 2024 · 8m read

Navigating Google Drive Edge Cases

Gianluca Venturini 

We recently introduced Data Protection for Google Drive, a new product in our Cloud Office Security Suite, designed to control the sprawl of sensitive data loosely shared across your shared file repositories. This post is a deep dive into a number of technical challenges we faced, and the engineering decisions we made along the way to better serve our customers’ needs.

Tackling Google Drive is an effort that many of our customers asked for, in large part because of our unique ability to go deep on email data protection. If you’re new to Material, a core component of our underlying architecture is our data platform, an isolated data warehouse deployed within each customer’s single-tenant cloud environment. For email, we sync the contents of employee mailboxes and classify sensitive information that should be protected.

We carried many of our learnings from years of working with email into the Google Drive domain. However, there are many distinctions in the nature of the content, how it’s modeled, and how it’s used. For example, a list of customer names might be a bulleted list in email and a spreadsheet in Drive.

Let’s inspect three areas that make Google Drive interesting from a technical perspective: 

  1. Quickly syncing recent files across My Drives and Shared Drives
  2. Calculating file permission and ownership
  3. Remediating files that are shared too broadly

These areas are crucial to fully understand given that a key aspect of our product is the ability to pinpoint toxic combinations of sensitive data, excessive permissions, and improper sharing, so that security teams can quickly mitigate risk via automated remediations.

Maintaining a Fresh Sync Across My Drives and Shared Drives

The basic structure of Google Drive allows every user to have their own My Drive and the organization to have any number of Shared Drives. This quickly turns into a sprawl of files and folders that only gets messier over time with constant changes to content, ownership, and permissions. There’s no easy way for the Google Workspace Admin to gain a central view across the footprint without building custom tooling atop the Google Drive API or buying a dedicated solution.

When we first connect with a customer’s Google Workspace tenant, we run a sync process to model the footprint of file contents, metadata, permissions, and sharing settings. The Report Activity API sends a notification via a webhook for every new file activity in My Drive and Shared Drives, and it also allows querying for historical file changes.

However, the API comes with a couple of technical limitations. First, it returns file events in reverse chronological order, which requires running one query for historical events and multiple queries for the most recent ones, plus extra bookkeeping. Second, it only exposes activities from the past six months. To address the second limitation, we are currently implementing a Full Sync Engine that lists all the files contained in a Drive, regardless of their last change date.
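To illustrate the kind of bookkeeping that reverse-chronological results call for, here is a minimal sketch, not our production code. It assumes events have already been fetched in batches (one historical, several recent) and keeps only the newest event per file, so the order in which batches arrive stops mattering. The `FileEvent` type and its fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FileEvent:
    """Hypothetical, simplified shape of one file activity."""
    file_id: str
    timestamp: float  # seconds since epoch
    kind: str         # e.g. "create", "edit", "permission_change"


def latest_event_per_file(*batches: Iterable[FileEvent]) -> Dict[str, FileEvent]:
    """Merge any number of event batches, keeping the newest event per file.

    Batches may arrive in any order and may each be sorted newest-first;
    the comparison on timestamp makes the result order-independent.
    """
    latest: Dict[str, FileEvent] = {}
    for batch in batches:
        for event in batch:
            seen = latest.get(event.file_id)
            if seen is None or event.timestamp > seen.timestamp:
                latest[event.file_id] = event
    return latest
```

Because only the newest event per file survives, a file touched in both the historical and the recent queries is processed once, against its latest state.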

For every file activity, we extract file metadata and content, and there are a few subtle API differences between My Drive and Shared Drives. For example, the latter returns only permission IDs, without the permission content or the Drive name, so we execute extra calls to achieve information parity with My Drive.

As we do with email messages, file contents are scanned against our library of ML-based rules that look for sensitive content across a wide range of categories, including PII, PCI, PHI, and more. We download each file within the customer’s isolated cloud instance, parse the text from it, and run that text against the ruleset. We support most native document types and file types across text documents, spreadsheets, presentations, PDFs, and images.

We leverage Google’s DLP API within our detection pipeline, but we’re not solely reliant on its results. While it is a good starting point, we found the results to be mixed, so we decided to augment its capabilities with custom models, mostly manually crafted RegExps, to improve accuracy. For example, to flag a document as containing US Social Security numbers, we accept either “DLP thinks it’s LIKELY” together with a string in the format “xxx-xx-xxxx”, or “DLP thinks it’s POSSIBLE” together with stronger signals, such as an “SSN” or “Social Security” prefix in addition to the digits.
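The SSN rule described above can be sketched roughly as follows. This is an illustrative approximation rather than our actual rule: the function name, the exact regular expressions, and the choice to treat anything at or above LIKELY the same way are assumptions made for the example. The likelihood labels follow the levels the DLP API reports.

```python
import re

# Likelihood levels, ordered from weakest to strongest.
LIKELIHOOD_RANK = {
    "VERY_UNLIKELY": 0,
    "UNLIKELY": 1,
    "POSSIBLE": 2,
    "LIKELY": 3,
    "VERY_LIKELY": 4,
}

# Digits in the xxx-xx-xxxx layout (assumed pattern for this sketch).
SSN_DIGITS = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Stronger contextual signal: an explicit "SSN" or "Social Security" mention.
SSN_CONTEXT = re.compile(r"\b(?:SSN|Social Security)\b", re.IGNORECASE)


def contains_ssn(text: str, dlp_likelihood: str) -> bool:
    """Combine a DLP likelihood with regex checks (hypothetical helper)."""
    rank = LIKELIHOOD_RANK.get(dlp_likelihood, 0)
    if SSN_DIGITS.search(text) is None:
        return False  # both branches require the digit pattern
    if rank >= LIKELIHOOD_RANK["LIKELY"]:
        return True   # strong DLP signal plus matching digits
    if rank == LIKELIHOOD_RANK["POSSIBLE"]:
        # weaker DLP signal: also require a contextual keyword
        return SSN_CONTEXT.search(text) is not None
    return False
```

The design idea is that a weaker signal from one detector must be compensated by a stronger signal from another before a file is flagged.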

While the initial sync is a heavyweight process, ongoing operations are much faster as we’re subscribed to all events from the Drive Activity API and can quickly update the underlying data model against any changes such as new or edited files and changed permissions. Changes to file contents will trigger a fresh scan against our classification rules to ensure we’re up-to-date with our analysis. It’s critical to quickly detect and react to file changes, which is why we defined a file freshness metric as a benchmark to help guide architectural improvements. Next, let’s review how we handle file permissions.

Calculating File Permissions and Ownership

If you’ve ever clicked “Share” on a Google Doc, you’ve witnessed the complexity of their permissions model first-hand. On an individual file level, there are two primary forms of access:

  1. People with access: users, groups, and calendar events
  2. General access: allows sharing within the organization or with anyone via a public link  

Relative to Google Cloud IAM, the permissions for Google Drive are basic in terms of actions, but what makes them challenging to grok are the various forms of inheritance. Like any access control mechanism, the end results are deterministic – the end user, whether logged in or not, either has access to a specific file or not. If they do have access, they have one of three roles: Viewer, Commenter, or Editor.

File and folder ownership is another consideration, especially as people in the organization come and go. Owner is defined differently in the two Drive types:

  1. Owner role: the user has all permissions in My Drive; ownership can only be changed by transferring the file or directory to someone else
  2. Organizer role: the user has all permissions in Shared Drives and can add or remove any other permission

Knowing the owner(s) of the Drive is necessary during API authentication (through Domain Wide Delegation) for fetching the file. Using any account other than the owner may result in extracting a partial view of the file (e.g. some permissions could be missing because they are not visible to the caller). It’s challenging to identify the owner the first time we see changes on a file, since the user that made the last change may not own the file. To solve the issue, we impersonate the Google Workspace admin, fetch the Drive metadata, compute multiple candidate owners, and validate with the API whether they actually own the file. This process is complex because Workspace admins can only access the Drive metadata, not all the folder and file metadata.

To keep track of changing permissions and ownership, we rely on Google Drive to notify us when a file or parent folder permission changes. We store the computed set of permissions on every single file. We explicitly optimized for correctness over performance, avoiding separate permission inheritance bookkeeping because it proved to be brittle and error-prone, and could be subject to race conditions when multiple directories are moved within a short time.
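As a rough illustration of favoring correctness over incremental bookkeeping, the sketch below re-fetches and overwrites the stored permission set for a changed item and everything beneath it, instead of patching inherited permissions in place. It is a simplified model, not our implementation: `children`, `fetch_permissions`, and `store` are hypothetical stand-ins for a folder index, an API call, and a database.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Iterator, List, Tuple

Permission = Tuple[str, str]  # (principal, role), simplified for the sketch


def descendants(children: Dict[str, List[str]], root: str) -> Iterator[str]:
    """Yield `root` and every item below it, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(children.get(node, []))


def refresh_permissions(
    changed_id: str,
    children: Dict[str, List[str]],
    fetch_permissions: Callable[[str], Iterable[Permission]],
    store: Dict[str, FrozenSet[Permission]],
) -> int:
    """Overwrite the stored permission set of every affected item.

    Each item's full set is replaced from the source of truth, so no
    inheritance state has to be kept consistent between runs.
    Returns the number of items refreshed.
    """
    refreshed = 0
    for item_id in descendants(children, changed_id):
        store[item_id] = frozenset(fetch_permissions(item_id))
        refreshed += 1
    return refreshed
```

The trade-off is more API calls per change in exchange for a store that never drifts from what the API reports.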

Directory permission inheritance works slightly differently in My Drive and Shared Drives. Only in My Drive is it possible to remove permissions on an individual file independently of its folder. Every time a permission changes on the file or in a parent directory, Google Drive notifies our webhook. Unfortunately, Google doesn’t notify us when Drive permissions change. Because of this limitation, we need to perform a full Drive sync every time the owner changes in order to update the metadata on every file in the Drive. Beyond this, there are several other edge cases to handle. For example, a user in My Drive can own a file that’s not in their personal Drive, or an organization can have a Shared Drive with only owners from a different organization.

So far, we’ve seen how to fetch permissions and ownership. Next, we’ll review how our new product helps identify and secure files that are exposed to users who should not have access.

Revoking External Access

Users of an organization often collaborate on files internally, but sometimes it’s necessary to share files with people outside of an organization or create a link accessible to anyone. A file is shared externally if it’s shared publicly or with one or more users not enrolled in the same Material instance.
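The definition above can be expressed as a small predicate. This is a minimal sketch under stated assumptions: the `Permission` class is a simplified stand-in for a Drive permission record, `enrolled_emails` is a hypothetical lowercase set of users enrolled in the instance, and group or domain-wide grants are deliberately left out of scope.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Permission:
    """Simplified permission record (hypothetical shape for this sketch)."""
    type: str                    # "user", "group", "domain", or "anyone"
    email: Optional[str] = None  # set for "user" grants


def is_shared_externally(
    permissions: Iterable[Permission], enrolled_emails: Set[str]
) -> bool:
    """True if any grant is public or targets a non-enrolled user.

    `enrolled_emails` is expected to contain lowercase addresses.
    Group and domain grants are ignored here; a real check would
    need to resolve them.
    """
    for perm in permissions:
        if perm.type == "anyone":
            return True  # public link
        if perm.type == "user" and (perm.email or "").lower() not in enrolled_emails:
            return True  # individual outside the instance
    return False
```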

In early feedback from customers, they’ve asked for the ability to remediate files that are shared too broadly (e.g. files that contain sensitive data and are shared externally), so we made it possible to revoke file permissions directly in the product. We currently support individual and bulk manual remediations, where the administrator can select multiple files that violate an internal policy and revoke all or only the unwanted permissions.
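A bulk remediation of this kind can be thought of as building a plan first and applying it second. The sketch below only builds the plan. It is illustrative: the `Grant` type is hypothetical, "unwanted" is assumed to mean "external" for the sake of the example, and owner grants are skipped on the assumption that ownership is transferred rather than revoked.

```python
from typing import Dict, List, NamedTuple


class Grant(NamedTuple):
    """Hypothetical, simplified view of one permission on a file."""
    permission_id: str
    role: str       # e.g. "owner", "writer", "commenter", "reader"
    external: bool  # whether the grant reaches outside the organization


def plan_revocations(
    files: Dict[str, List[Grant]], only_unwanted: bool = True
) -> Dict[str, List[str]]:
    """Map each selected file to the permission IDs that should be revoked.

    With `only_unwanted=True`, only external grants are selected;
    otherwise every non-owner grant is. Files with nothing to revoke
    are omitted from the plan.
    """
    plan: Dict[str, List[str]] = {}
    for file_id, grants in files.items():
        ids = [
            g.permission_id
            for g in grants
            if g.role != "owner" and (g.external or not only_unwanted)
        ]
        if ids:
            plan[file_id] = ids
    return plan
```

Separating the plan from its execution makes it possible to show the administrator exactly what will change before any permission is removed, for example before issuing Drive API `permissions.delete` calls.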

Currently we only support manual remediation of file permissions, but below we cover how we plan to evolve the platform towards automated remediations that enforce policies.

What’s next?

We launched Google Drive Data Protection with an initial set of features that provide immediate visibility into recent policy-violating files and allow security teams to quickly remediate them, but we’re just getting started.

In the next few sprints, the team will focus on delivering an Auto-Remediation Engine that enforces complex policies defined by the security team. In particular, we’re implementing a more advanced remediation flow called quarantining, which moves the file in violation to a secure drive accessible only to the security team. Additionally, we will add a Drive Full Sync component to the architecture, allowing us to expand beyond the six-month sync barrier. Lastly, we are committed to delivering an observability tool that notifies the security team any time a policy violation occurs.

Want to learn more? Schedule a personal demo with our team today or watch our product showcase webinar.