I have created hundreds of development environments. Today, I want to review how to secure your production data in the staging environment.

A staging environment is used to test new features under development at the scale of production data, with production-level security enabled. It’s as close to the production environment as you can get without testing directly in production. To create that environment, you need the application code and infrastructure deployed through your CI/CD pipelines, plus a backup copy of your production database to represent your actual data traffic and storage requirements. I also replicate the data traffic by setting up a load testing solution such as the AWS Distributed Load Testing solution. Again, the goal is to fully test new features or infrastructure at production scale before they are published to the production environment.

This is a great concept, but the requirement to remove all sensitive data from the production copy has been challenging and unresolved for many of my clients. Ideally, you obfuscate or change the production data to protect the innocent. To help you solve this issue, let's walk through a typical solution I have used with a SaaS application using a Postgres RDS database:
- I created a scheduled nightly event in EventBridge that kicks off this process
- A Lambda function assumes a cross-account role to request a snapshot (backup) of the Production RDS database
- When the RDS snapshot completes, an event fires to tell the Lambda function that the snapshot is ready and available
- The Lambda function then assumes the cross-account role again to share the snapshot with the Staging AWS account (please put the staging environment in a different AWS account than production)
- Immediately after sharing the snapshot with the Staging AWS account, the Lambda function deletes the existing Staging RDS database
- Next, the Lambda function requests a new Staging RDS database, restored from the shared production snapshot
- Once the restore has completed, an event triggers a Lambda function
- That Lambda function runs the obfuscation SQL script against the newly created Staging RDS database. I have a base SQL script that scans every table and column in the database for SSNs, phone numbers, email addresses, first names, last names, physical addresses, driver's license numbers, and a few other PII elements. I typically extend this generic script with any project-specific data that has been identified. When the script finds such data, it replaces it with randomly generated fake data.
- Finally, once the data has been successfully obfuscated, the Lambda function deletes the Production RDS snapshot that was shared with the staging environment. Note: I assume you have a separate backup vault, backup strategy, and replication process for your DR environment, so this solution is not meant for backups.
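
To make the obfuscation step more concrete, here is a minimal Node.js sketch of how statements for such a script could be generated. It is a simplified illustration, not the script from my repository: the table and column names (`customers`, `email`, `ssn`) and the `kind` labels are hypothetical, and it assumes you already know which columns hold PII instead of scanning for them. It only builds Postgres `UPDATE` statements as strings; actually running them from the Lambda function with a Postgres client is left out.

```javascript
// Hypothetical rules: each PII "kind" maps to a Postgres expression that
// yields a random fake value, using only built-in functions
// (random, md5, substr, floor, lpad).
const FAKE_VALUE_SQL = new Map([
  ["email", "'user_' || substr(md5(random()::text), 1, 12) || '@example.com'"],
  // 555-0100 through 555-0199 are reserved for fictional use.
  ["phone", "'555-01' || lpad(floor(random() * 100)::int::text, 2, '0')"],
  // Area number 000 is never issued, so these cannot match a real SSN.
  ["ssn", "'000-00-' || lpad(floor(random() * 10000)::int::text, 4, '0')"],
  ["name", "'Name_' || substr(md5(random()::text), 1, 8)"],
]);

// Quote an identifier the way Postgres expects: wrap it in double quotes
// and double any embedded double quote.
function quoteIdent(name) {
  return '"' + String(name).replace(/"/g, '""') + '"';
}

// Build one UPDATE statement per target column. random() is volatile, so
// Postgres evaluates the expression separately for every row.
function buildObfuscationSql(targets) {
  return targets.map(({ table, column, kind }) => {
    const expr = FAKE_VALUE_SQL.get(kind);
    if (!expr) {
      throw new Error(`No fake-data rule for kind "${kind}"`);
    }
    const col = quoteIdent(column);
    return `UPDATE ${quoteIdent(table)} SET ${col} = ${expr} WHERE ${col} IS NOT NULL;`;
  });
}

// Example with hypothetical tables and columns.
const statements = buildObfuscationSql([
  { table: "customers", column: "email", kind: "email" },
  { table: "customers", column: "ssn", kind: "ssn" },
]);
console.log(statements.join("\n"));
```

Keeping the rules in one map makes it easy to add project-specific PII types without touching the statement builder.
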

The environment variables identified in the diagram are attached to the Lambda function and can be changed quickly to point to other sources for a different process.

This is a quick and simple blog post that will hopefully save you days of work securing sensitive data in your non-production environments. If you want a link to the GitHub repository with the IaC, NodeJS, and the obfuscation script, send me a message or write a comment below and I’ll share it out.
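
As a starting point in the meantime, the sketch below shows one way environment variables could drive the RDS calls in the steps above. The variable names (`PROD_DB_INSTANCE_ID`, `STAGING_ACCOUNT_ID`, and so on), the `staging-refresh-` snapshot prefix, and the helper functions are placeholders I made up for illustration; they are not necessarily what the diagram or the repository uses. The functions only build request parameters. The field names follow the RDS API actions named in the comments, and you would pass each object to the matching AWS SDK call after assuming the appropriate role.

```javascript
// Hypothetical environment variable names; adjust to your Lambda configuration.
const REQUIRED_VARS = [
  "AWS_REGION", // set automatically by the Lambda runtime
  "PROD_ACCOUNT_ID",
  "PROD_DB_INSTANCE_ID",
  "STAGING_ACCOUNT_ID",
  "STAGING_DB_INSTANCE_ID",
];

// Validate the environment once and return a plain config object.
function loadConfig(env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    region: env.AWS_REGION,
    prodAccountId: env.PROD_ACCOUNT_ID,
    prodDbInstanceId: env.PROD_DB_INSTANCE_ID,
    stagingAccountId: env.STAGING_ACCOUNT_ID,
    stagingDbInstanceId: env.STAGING_DB_INSTANCE_ID,
  };
}

// One snapshot per night, named after the UTC date.
function snapshotIdFor(date) {
  return `staging-refresh-${date.toISOString().slice(0, 10)}`;
}

// Request parameters for each RDS API action in the walkthrough.
function buildRdsRequests(cfg, snapshotId) {
  // A shared snapshot is referenced by its ARN when restoring in another
  // account. This assumes the standard "aws" partition.
  const sharedSnapshotArn =
    `arn:aws:rds:${cfg.region}:${cfg.prodAccountId}:snapshot:${snapshotId}`;
  return {
    // Production account: CreateDBSnapshot
    createSnapshot: {
      DBInstanceIdentifier: cfg.prodDbInstanceId,
      DBSnapshotIdentifier: snapshotId,
    },
    // Production account: ModifyDBSnapshotAttribute ("restore" lists who may restore)
    shareSnapshot: {
      DBSnapshotIdentifier: snapshotId,
      AttributeName: "restore",
      ValuesToAdd: [cfg.stagingAccountId],
    },
    // Staging account: DeleteDBInstance
    deleteStaging: {
      DBInstanceIdentifier: cfg.stagingDbInstanceId,
      SkipFinalSnapshot: true,
    },
    // Staging account: RestoreDBInstanceFromDBSnapshot
    restoreStaging: {
      DBInstanceIdentifier: cfg.stagingDbInstanceId,
      DBSnapshotIdentifier: sharedSnapshotArn,
    },
    // Snapshot owner (production account): DeleteDBSnapshot
    deleteSnapshot: { DBSnapshotIdentifier: snapshotId },
  };
}

// Example with made-up values; inside Lambda you would pass process.env.
const cfg = loadConfig({
  AWS_REGION: "us-east-1",
  PROD_ACCOUNT_ID: "111111111111",
  PROD_DB_INSTANCE_ID: "prod-db",
  STAGING_ACCOUNT_ID: "222222222222",
  STAGING_DB_INSTANCE_ID: "staging-db",
});
const requests = buildRdsRequests(cfg, snapshotIdFor(new Date("2024-01-31T02:00:00Z")));
console.log(JSON.stringify(requests, null, 2));
```

Because every identifier comes from the environment, pointing the same function at a different database or account is a configuration change instead of a code change.
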