By Aseem Shrey
Compliance can mean different things to different organisations. It is usually the process of conforming to a specification, policy, standard, or law. One thing is common, though: it is tedious and involves a lot of operational tasks. Engineers look for hacks to get such tasks done, or hope they just happen.
For example, if you are making your cloud infrastructure compliant with a certain standard, you follow the best practices defined by the committee that came up with it. Adhering to a widely acknowledged benchmark such as the Center for Internet Security (CIS) Benchmarks is likely to improve the security posture of the resource in question, because these benchmarks are written by experienced members of the community and are reviewed and revised continuously.
A technical standard is an established norm or requirement for a repeatable technical task.
It is usually a formal document that establishes uniform engineering or technical criteria, methods, processes, and practices.
Some well-known standards are ISO/IEC 27001, NIST, and PCI DSS.
A benchmark is a standard or point of reference against which things may be compared. For Android phones we have the AnTuTu benchmark, for GPUs we have 3DMark and GFXBench, and for mobile cameras we have DXOMARK.
Likewise, for cybersecurity we have the CIS Benchmarks. The Center for Internet Security (CIS) is a community-driven nonprofit responsible for the CIS Controls® and CIS Benchmarks™, globally recognised best practices for securing IT systems and data.
There are 7 core categories of CIS Benchmarks:
- Operating systems benchmarks cover security configurations of core operating systems, such as Microsoft Windows, Linux, and Apple OSX.
- Server software benchmarks cover security configurations of widely used server software, including Microsoft Windows Server, SQL Server, VMware, Docker, and Kubernetes.
- Cloud provider benchmarks address security configurations for Amazon Web Services (AWS), Microsoft Azure, Google, IBM, and other popular public clouds.
- Mobile device benchmarks address mobile operating systems, including iOS and Android, and focus on areas such as developer options and settings, OS privacy configurations, browser settings, and app permissions.
- Network device benchmarks offer general and vendor-specific security configuration guidelines for network devices and applicable hardware from Cisco, Palo Alto Networks, Juniper, and others.
- Desktop software benchmarks cover security configurations for some of the most commonly used desktop software applications, including Microsoft Office and Exchange Server, Google Chrome, Mozilla Firefox, and Safari Browser. These benchmarks focus on email privacy and server settings, mobile device management, default browser settings, and third-party software blocking.
- Multi-function print device benchmarks outline security best practices for configuring multi-function printers in office settings and cover such topics as firmware updating, TCP/IP configurations, wireless access configuration, user management, and file sharing.
An example check from the CIS Benchmark for the GCP cloud provider:
There are 57 checks in the CIS 1.1 benchmark, categorised into Level 1 and Level 2 for the GCP platform.
Level 1 benchmark profiles cover base-level configurations that are easier to implement and have minimal impact on business functionality.
Level 2 benchmark profiles are intended for high-security environments and require more coordination and planning to implement with minimal business disruption.
Now that we have an idea of what compliance, benchmarks, and CIS compliance are, let’s talk about the problem at hand.
The Gojek Scale
At Gojek, our GCP footprint spans:
- More than 350 active projects excluding the
- Firewall rules: more than 4000
- Storage buckets: more than 1000
All of these change constantly, and the changes come from multiple teams, because each project is owned by the team that uses it.
An Ideal State
Let’s see what an ideal state for GCP with respect to compliance would look like:
- No non-compliant resource
- Auto remediate any non-compliant resource
- Ability to whitelist resources
- Accountability for non-compliant resources: we should know the business justification behind each one and have a process to manage them
- Temporary whitelisting of resources
- Easy to maintain
But are we able to achieve all of this? Yes.
How are we able to do this?
There are multiple parts to the project:
- Checker: Cloud Functions
- Remediators: Cloud Functions
- Whitelist: whitelist.yaml
- Accountability: GitLab or any other version control system
Architecture of the system
- Cloud Scheduler: Used as a cron job to schedule messages to be sent to Pub/Sub
- Cloud Pub/Sub: Acts as the trigger for the Cloud Functions
- Cloud Functions: Execute the checker and remediator code for the different CIS checks
- Slack: Updates are sent to Slack
First, Cloud Scheduler sends a message to Cloud Pub/Sub at a fixed time, like a cron job.
This message triggers a Cloud Function, which runs the check and the remediation.
Once the check has finished running, it posts a summary on Slack.
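The flow above can be sketched as a Pub/Sub-triggered function. This is a minimal sketch under stated assumptions, not the actual code: `run_check`, `post_summary`, the check name, and the message payload shape are hypothetical, and the `(event, context)` signature assumes a background Cloud Function, where the Pub/Sub message body arrives base64-encoded in `event["data"]`.

```python
import base64
import json


def run_check(check_id):
    """Hypothetical stand-in for a checker/remediator module.

    A real check would call the GCP APIs; here we return canned results.
    """
    return {"check": check_id, "remediated": ["project-a"], "failed": []}


def post_summary(result):
    """Hypothetical stand-in for the Slack notification step."""
    return "{check}: {r} remediated, {f} failed".format(
        check=result["check"],
        r=len(result["remediated"]),
        f=len(result["failed"]),
    )


def main(event, context):
    """Entry point triggered by the Pub/Sub message that Cloud Scheduler sends."""
    # Pub/Sub delivers the message body base64-encoded under "data".
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    result = run_check(payload["check"])
    return post_summary(result)
```

Keeping the entry point this thin means the check itself can be exercised locally by passing a hand-built `event` dictionary.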
Walkthrough of the code
Here’s a benchmark that recommends restricting SSH access from the internet. It’s a Level 2 benchmark.
Let me walk you through the code structure and some sample code of the cloud function check for the above benchmark.
Every check, together with its remediator, is a module with its own folder.
The directory structure for one of these checks looks like this; the same pattern is repeated for all the checks.
main.py is where the magic happens. It is the checker and remediator packed into one: it creates a backup of the current config, takes whitelist.yaml into account, and then goes on to make changes.
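To make the backup, whitelist, and remediate sequence concrete, here is a minimal sketch for the SSH check. It is not the real main.py: the function names are hypothetical, the rules are plain dictionaries shaped like Compute Engine firewall resources (`direction`, `sourceRanges`, `allowed`, `disabled`), the backup is an in-memory copy rather than a bucket upload, and disabling the rule is an assumed remediation.

```python
import copy


def is_ssh_open_to_world(rule):
    """Return True if an enabled ingress rule allows TCP/22 from 0.0.0.0/0."""
    if rule.get("disabled") or rule.get("direction", "INGRESS") != "INGRESS":
        return False
    if "0.0.0.0/0" not in rule.get("sourceRanges", []):
        return False
    for allowed in rule.get("allowed", []):
        protocol = allowed.get("IPProtocol")
        if protocol == "all":
            return True
        if protocol != "tcp":
            continue
        ports = allowed.get("ports")
        if not ports:  # no ports listed means every port is allowed
            return True
        for port in ports:
            # Ports are strings: either "22" or a range such as "20-25".
            low, _, high = port.partition("-")
            if int(low) <= 22 <= int(high or low):
                return True
    return False


def check_and_remediate(rules, whitelisted_names):
    """Back up the rules, then disable offending rules that are not whitelisted."""
    backup = copy.deepcopy(rules)  # stand-in for writing to the backup bucket
    remediated = []
    for rule in rules:
        if rule["name"] in whitelisted_names:
            continue
        if is_ssh_open_to_world(rule):
            rule["disabled"] = True  # assumed remediation: disable, not delete
            remediated.append(rule["name"])
    return backup, remediated
```

Taking the backup before any change is what makes a wrong remediation reversible.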
config.py — Contains config specific to the check
The following config values are common to all of these checks:
- Check Metadata — Some info about the check itself
- Backup bucket name
- Backup filename
- Google API scope that the specific Cloud Function needs in order to work
Apart from these common values, it also contains check-specific configuration.
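As an illustration of the layout described above, a config.py for the SSH check might look roughly like this. Every name and value is a placeholder chosen for the example, not the actual configuration; only the Level 2 classification comes from the benchmark discussed earlier.

```python
# config.py (illustrative sketch; all names and values are placeholders)

# Check metadata: some info about the check itself.
CHECK_METADATA = {
    "name": "ssh-open-to-internet",
    "level": 2,
    "description": "SSH access from the internet should be restricted",
}

# Where the pre-remediation backup is written.
BACKUP_BUCKET = "example-cis-backups"
BACKUP_FILENAME = "firewall-rules-backup.json"

# OAuth scope the function requests; this one covers Compute Engine.
API_SCOPES = ["https://www.googleapis.com/auth/compute"]

# Check-specific configuration.
SSH_PORT = 22
OPEN_RANGE = "0.0.0.0/0"
```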
whitelist.yaml — The whitelist file
    <project_name>:
      <key_name>:
        MR-Link: https://<gitlab_instance_url>/security/cis-benchmark-work/-/merge_requests/1
        business-justification: For the test MR
        data: prod
        username: aseem.shrey
        owner: <user_email>
        type: development
        value: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIL+hIwK2q8/NtDuvzeOZ330JUPMFPYd2iKSzZx1R5zOc aseem.shrey
        expiry: <till_when_the_whitelist_is_valid>
project_name: The name of the GCP project for which the whitelist is being added
key_name: Key for this resource. Multiple resources that need to be added to the same project will be added with different
MR-Link: Link to the MR that added this entry. It will later be consumed in a dashboard to find out which MR is responsible for a whitelist entry
business-justification: Business Justification for adding the resource to whitelist
data: What kind of data this resource will be able to access once whitelisted. In this case, adding the SSH key to the project would give the user access to production data
owner: Who’s accountable for this whitelist
username, value: These keys are specific to this check, as SSH requires a username and an SSH key (which is the value here)
expiry: This is useful for temporary whitelisting.
Values can be :
- Maintained through a version control system, here GitLab. Every whitelisted rule is attributed to the person who raised the merge request (MR).
- This further requires approval from their manager on the MR.
The following MR whitelists one of the resources (here a firewall rule) so that it can open the SSH port to the world.
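The expiry field in whitelist.yaml is what makes temporary whitelisting enforceable: an entry whose expiry has passed can simply be ignored by the checker. A minimal sketch follows, assuming the file has already been parsed into nested dictionaries and that expiry is an ISO date (YYYY-MM-DD). The date format, the function name, and the sample entries are all assumptions.

```python
from datetime import date

# Mirrors the structure of whitelist.yaml after parsing; values are placeholders.
WHITELIST = {
    "example-project": {
        "temp-ssh-key": {"owner": "dev@example.com", "expiry": "2030-01-31"},
        "old-ssh-key": {"owner": "dev@example.com", "expiry": "2020-01-31"},
    }
}


def active_entries(whitelist, project, today):
    """Return the key names whose whitelist entry has not expired yet."""
    active = []
    for key_name, entry in whitelist.get(project, {}).items():
        expiry = date.fromisoformat(entry["expiry"])  # assumed YYYY-MM-DD
        if today <= expiry:
            active.append(key_name)
    return active
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the expiry logic easy to test.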
The whole codebase is deployed as Cloud Functions and auto-deployed using gitlab-ci.yml, which is GitLab’s CI automation, similar to GitHub Actions. Every time there’s a change in one of these checks (like adding a resource to whitelist.yaml), only that particular Cloud Function is redeployed.
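The selective redeploy boils down to mapping changed files to the check folders that contain them. GitLab CI can express this natively with path-based `changes` rules; the sketch below only illustrates the underlying logic in Python. The `checks/<check_name>/...` repository layout and the function name are assumptions made for the example.

```python
from pathlib import PurePosixPath


def changed_checks(changed_files, checks_root="checks"):
    """Map changed file paths to the check folders that need redeploying."""
    folders = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Expect at least: <checks_root>/<check_name>/<file>
        if len(parts) >= 3 and parts[0] == checks_root:
            folders.add(parts[1])
    return sorted(folders)
```

For instance, a merge request that touches only one check's whitelist.yaml yields a single folder, so only that function would be redeployed.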
After every run, the function sends a status update to Slack.
The first line mentions which CIS check the Slack alert is for.
Remediated: The projects where auto-remediation resolved the pending issue.
A green visual indicator further shows that the operation was successful.
Failed: The projects where auto-remediation failed to resolve the issue.
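A status update with this shape could be assembled as follows. This is a sketch only: the function name and the exact wording of the message are assumptions, and it builds just the JSON body (a `text` field, as accepted by Slack incoming webhooks) without sending anything.

```python
import json


def build_slack_payload(check_name, remediated, failed):
    """Build the JSON body for a status update on one CIS check."""
    lines = [
        "CIS check: {}".format(check_name),
        "Remediated: {}".format(", ".join(remediated) or "none"),
        "Failed: {}".format(", ".join(failed) or "none"),
    ]
    return json.dumps({"text": "\n".join(lines)})
```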
- Overall dashboard of the current state of compliance of our cloud — Push data to ELK
- Projects with the sys- prefix are created by default by GCP for every Apps Script run. Full doc here.
‘When a new Apps Script project is created, a default GCP project is also created behind the scenes.’
Also check : The Quirks of Apps Script and Google Cloud