Reliability and Security Principles

At Voltade, software reliability and security are critical. Our products and services are often used by our users' customers around the world to understand product offerings and make purchases. This means our services have to be up, fast, and secure, all the time and from every part of the world. We therefore cannot treat reliability and security lightly, and this document lays out the core principles that we enforce.

  • Secured Perimeter
  • Availability Monitoring
  • Alert Systems
  • Service Recovery
  • User Feedback
  • Logging
  • Decoupled End-to-End Checks
  • Isolated Services
  • Redundant Infrastructure
  • Backups
  • Test Setup
  • DDOS Protection & Caching
  • Incremental Rollout System

This is not meant to be an exhaustive list. It is the base framework for Voltade developers to implement to ensure a minimum standard for our applications and services.

Secured Perimeter

Having a service go down is bad, but having users compromised is fatal. The interfaces we explicitly expose to users should avoid obvious exploits, which requires us to:

  • Provide basic authentication around some endpoints and think about safeguards for even the publicly accessible ones. For example, limit the date ranges for read endpoints so that a single malformed request cannot consume the whole server’s resources (see the sketch after this list).
  • Prevent users from accessing other users’ data without the proper credentials.
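
As an illustration, a minimal sketch of such an endpoint guard is shown below. It assumes a Flask-style application; the route, the placeholder token check and the 31-day cap are illustrative, not a description of our actual code.

    # Minimal sketch: authenticate callers and cap the query window on a read endpoint.
    from datetime import date, timedelta

    from flask import Flask, abort, request

    app = Flask(__name__)
    MAX_RANGE_DAYS = 31  # assumed cap; stops one request scanning years of data


    @app.get("/api/orders")  # hypothetical read endpoint
    def list_orders():
        # Reject unauthenticated callers before doing any work (placeholder check).
        if request.headers.get("Authorization") != "Bearer expected-token":
            abort(401)

        try:
            start = date.fromisoformat(request.args["start"])
            end = date.fromisoformat(request.args["end"])
        except (KeyError, ValueError):
            abort(400, "start and end must be valid ISO dates")

        # Cap the date range so a single malformed request cannot consume
        # the whole server's resources.
        if end < start or (end - start) > timedelta(days=MAX_RANGE_DAYS):
            abort(400, f"date range must be at most {MAX_RANGE_DAYS} days")

        return {"orders": []}  # a real handler would query the datastore here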

After securing exposed endpoints, we close off access to the remaining resources.

  • At the most basic level, this means placing the remaining resources inside a private network or behind a firewall, closing off unused ports and external IP addresses. Even if the internal workings of the system are not perfectly secured, removing their exposure to the outside world drastically reduces the attack surface.

Thirdly, we secure developer access:

  • Account passwords are kept secret, with access limited to only the people who need them (the “principle of least privilege”)
  • 2-factor authentication is enforced
  • Recovery emails are regularly verified
  • Any unused services and access keys are detected and removed immediately (a sketch of an automated check follows this list)
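
A minimal sketch of the kind of automated check this implies is shown below. It assumes boto3 access to IAM; the 90-day threshold and the decision to merely print offenders are illustrative choices, not our exact policy.

    # Sketch: flag IAM access keys that have not been used for a long time.
    from datetime import datetime, timedelta, timezone

    import boto3

    iam = boto3.client("iam")
    STALE_AFTER = timedelta(days=90)  # assumed staleness threshold
    now = datetime.now(timezone.utc)

    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            keys = iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]
            for key in keys:
                last_used = iam.get_access_key_last_used(AccessKeyId=key["AccessKeyId"])
                # Keys that were never used fall back to their creation date.
                used_at = last_used["AccessKeyLastUsed"].get("LastUsedDate", key["CreateDate"])
                if key["Status"] == "Active" and now - used_at > STALE_AFTER:
                    # A real job would alert or deactivate the key rather than print.
                    print(f"Stale key {key['AccessKeyId']} for {user['UserName']}")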

Fourthly, we utilise threat detection tools such as AWS GuardDuty and audit trails such as CloudTrail and S3 access logs to detect anomalous behaviour within our systems.

Security is never “done”, but we do not want security to stop us from building a better product for you. Hence, the goal is not to be invulnerable but to protect against the most likely attacks.

Availability Monitoring

Systems need to be available around the clock once they are in production. So the most important thing we can do when running a production service is to know when the service goes down. We do so by subscribing to third-party services that repeatedly check whether our servers and APIs are responsive.

  • A pinging service that checks our servers every few seconds to make sure they are up, e.g. Better Uptime (a minimal sketch of such a check follows this list).

  • Transactional checks to verify that previous issues do not happen again. This is our “immune system”, which learns from previous incidents and adds checks to guard against their recurrence.

  • Monitoring services that are unconnected to the services they are monitoring. A service running on AWS should not be monitored by another AWS script; an availability monitor that fails together with the service it is watching is not very useful.
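
A minimal sketch of such an external check is shown below; the endpoint URLs and the five-second timeout are placeholders, and a real monitor would run from infrastructure unconnected to the services it watches.

    # Sketch: repeatedly ping public endpoints and report any that are down.
    import requests

    ENDPOINTS = [
        "https://example.voltade.com/health",    # hypothetical health endpoint
        "https://api.example.voltade.com/ping",  # hypothetical API endpoint
    ]

    def check_endpoints():
        failures = []
        for url in ENDPOINTS:
            try:
                response = requests.get(url, timeout=5)
                if response.status_code != 200:
                    failures.append((url, f"HTTP {response.status_code}"))
            except requests.RequestException as exc:
                failures.append((url, str(exc)))
        return failures

    if __name__ == "__main__":
        for url, reason in check_endpoints():
            print(f"DOWN: {url} ({reason})")  # a real monitor would page someone instead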

Alert Systems

Voltade enforces a systematic way of being contacted when monitoring systems detect that something has gone wrong. We use alerting tools such as OpsGenie and PagerDuty to activate a 24/7 on-call engineer to respond to issues. This engineer is not just equipped to triage, but can also perform technical tasks to bring the service back up. Crucially, when services go down, our first response is not to fix the system but to communicate with our users.
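
As a hedged illustration, a failed check could be wired into an alerting tool along the lines of the sketch below; ALERT_WEBHOOK_URL and the payload fields are placeholders for whatever integration our alerting provider exposes, not a real API.

    # Sketch: page the on-call engineer when a monitoring check fails.
    import requests

    ALERT_WEBHOOK_URL = "https://alerts.example.com/v1/incidents"  # hypothetical URL

    def page_on_call(service: str, reason: str) -> None:
        payload = {
            "message": f"{service} failed its availability check",
            "description": reason,
            "priority": "P1",  # assumed priority scheme
        }
        # Paging happens automatically; the engineer's first job is to update users.
        requests.post(ALERT_WEBHOOK_URL, json=payload, timeout=5).raise_for_status()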

Service Recovery

When a service goes down, we typically restore it not by writing a hotfix in production, but by identifying the last version of the code that worked and reverting to it. Once an on-call engineer detects a problem, he or she works through the following steps (sketched in code after the list):

  1. Restarts servers to attempt to bring them online.

  2. If restarting the servers doesn’t work, reverts code or infrastructure to a previous version. This is only applicable if rolling back does not break system or database constraints.

  3. If the above fails, declares extended downtime so that we can work on a fix and test it before releasing it into production.
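
The escalation order above can be summarised in code. This is an illustrative sketch only; the helper functions are stubs standing in for real runbook steps, not our actual tooling.

    # Sketch of the recovery escalation: restart, then roll back, then declare downtime.
    def restart_servers() -> bool:
        """Restart the servers and return True if health checks pass again."""
        return False  # stubbed for illustration

    def rollback_is_safe() -> bool:
        """Return True only if reverting will not break system or database constraints."""
        return True  # stubbed for illustration

    def roll_back_release() -> bool:
        """Revert code or infrastructure to the last known-good version."""
        return True  # stubbed for illustration

    def declare_extended_downtime() -> None:
        """Communicate extended downtime while a fix is written, tested and released."""

    def recover_service() -> None:
        if restart_servers():                            # step 1
            return
        if rollback_is_safe() and roll_back_release():   # step 2
            return
        declare_extended_downtime()                      # step 3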

We do not attempt a fix without rigorous automated testing. Fixing things hastily in production can, as several incidents at much more established companies have shown, result in cascading failures that blow the initial problem out of proportion.

User Feedback

Availability monitoring can catch obvious technical failures, but it is important to realize that products can fail even if the servers are still running. Individual APIs can fail even if others work. The builder interface can fail even if the APIs are running. Not everyone will bother giving feedback, but we provide direct channels for our users to ask questions and let us know when something is wrong. This could be a 24/7 manned email address, a WhatsApp number, or a support form integrated into our CRM.

Logging

Although we don’t log data that users enter into our APIs, and especially not personally identifiable information, we keep logs of how the system is operating.

This includes client-side logging of how users are using our service, as well as server-side logging of incoming requests and the internal workings of our application. This lets us know what features people find useful, what they are not using, and what might be problematic, which is critical for knowing what needs to be fixed and what is worth developing further. More importantly, logging gives us clues to diagnose the problem when something unexpectedly goes wrong, such as looking at the last requests that came in before a server crashed.

Client-side logging is done with services such as Google Analytics, Sentry and Datadog.

On the server side, we log incoming requests directly into AWS CloudWatch, where we have built dashboards to monitor HTTP errors (the number of web requests that ended in an error), logged exceptions (the number of unhandled errors logged by the application) and thrown exceptions (the total number of exceptions thrown). We also run regular batch jobs with Cronitor for anomaly detection and to ensure that database and server states stay consistent.

We don’t just log raw requests but also significant intermediate events, such as verification emails being sent to users or APIs being set up, which allows us to easily identify issues in key user flows.
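
A sketch of what such a structured event log could look like is shown below; the event names and fields are illustrative rather than our exact schema, and only internal identifiers (never personally identifiable information) are logged.

    # Sketch: emit one JSON object per significant event so they are easy to query.
    import json
    import logging
    import sys
    from datetime import datetime, timezone

    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("voltade.events")

    def log_event(event: str, **fields) -> None:
        logger.info(json.dumps({
            "event": event,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            **fields,  # internal identifiers only, never PII
        }))

    # Example intermediate events from key user flows (values are placeholders):
    log_event("verification_email_sent", user_id="u_123")
    log_event("api_setup_completed", api_id="api_456", duration_ms=2150)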

Decoupled End-to-End Checks

Code will have bugs. Even the most robustly tested and reviewed code inevitably contains some errors. This means that instead of assuming our system is doing what it is designed to do, we set up independent services to verify key outcomes and invariants. For example, we have a routine job that checks whether any APIs have got stuck in a “still launching” state, and whether a web calculator that is meant to compute a certain result is still doing so.

Because these systems function as checks, we decouple them from the main system as much as possible. They may read from the same database, but they live in a separate codebase and run as an independent job. That way, if there are any bugs in the main codebase, this independent system will not be affected and will still pick up the irregularities they cause.
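
A minimal sketch of such an independent check is shown below. The table name, column names and one-hour threshold are assumptions for illustration, and sqlite3 merely stands in for the real database driver; the point is that the job lives outside the main codebase and only reads from the database.

    # Sketch: an independent job that looks for APIs stuck in a "still launching" state.
    import sqlite3  # stand-in for the real database driver
    from datetime import datetime, timedelta, timezone

    STUCK_AFTER = timedelta(hours=1)  # assumed threshold

    def find_stuck_apis(conn: sqlite3.Connection) -> list:
        cutoff = (datetime.now(timezone.utc) - STUCK_AFTER).isoformat()
        rows = conn.execute(
            "SELECT id FROM apis WHERE status = 'launching' AND updated_at < ?",
            (cutoff,),
        ).fetchall()
        return [row[0] for row in rows]

    if __name__ == "__main__":
        connection = sqlite3.connect("readonly-replica.db")  # placeholder connection
        for api_id in find_stuck_apis(connection):
            print(f"API {api_id} appears stuck in 'launching'")  # a real job would alert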

Isolated Services

We strive to reduce mutual dependencies among services, which minimises the chances of cascading failures or “supply chain risks”. For instance, we keep our API servers as independent from our app servers as possible, ensuring that changes we push to our builder interface are unlikely to affect APIs that our users are already running. This could mean hosting servers in separate network environments with minimal code dependencies among them.

Redundant Infrastructure

Servers can go down at even the most established companies. We can spend ever-increasing effort for ever-diminishing returns on server reliability. At Voltade, we accept that any individual server will eventually fail, so we explicitly treat individual components as fallible and have failover plans among them.

The simplest way we do this is to run multiple stateless servers. By having servers store files and data in an external datastore, each individual server becomes disposable. Since the servers run our application code, they are the most likely place for crashes to occur. Running multiple servers behind a load balancer means that our service continues to function even if any individual machine goes down. We also use auto-scaling services like Amazon EKS to automatically restart failed servers and spin up additional resources in response to spikes in traffic and load.
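
A hedged illustration of keeping servers stateless is sketched below: uploads go to an external object store rather than the instance's local disk, so any server can handle any request and any server can be replaced. The bucket name is a placeholder.

    # Sketch: persist user uploads in an external object store, never on local disk.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "voltade-user-uploads"  # hypothetical bucket

    def save_upload(user_id: str, filename: str, data: bytes) -> str:
        key = f"{user_id}/{filename}"
        # Nothing touches the instance's own disk, so the instance stays disposable.
        s3.put_object(Bucket=BUCKET, Key=key, Body=data)
        return key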

We apply the same principles to the other layers of the infrastructure as well. Our databases have automated failover across multiple replicas and multiple availability zones. We have yet to use distributed file storage (e.g. GlusterFS) or distributed log systems (e.g. Kafka) because of the additional technical complexity they require, but they are options in our arsenal that we can deploy in the future.

At a higher level, just as individual servers fail, so can entire systems. AWS is built on robust distributed systems, but there have been times when the entirety of AWS has gone down. There is an element of organisational risk even in infrastructurally redundant systems, so we strive to maintain redundancy across different organisations. If AWS servers go down, we can switch to hosting on Google Cloud. If our AWS RDS database goes down, we can fetch backups from Microsoft Azure. If the Cloudflare CDN goes down, we can switch to Netlify. If the AWS SES email service goes down, we can switch to Twilio. Organisations represent a substantial source of failure correlations, and so we explicitly plan around them.

Backups

Data loss, though unlikely, can happen to even the most established companies. This risk is reduced at Voltade because we do not store personally identifiable information or the data that users enter into calculators. However, we still treat data loss seriously, and have developed a backup strategy to recover from potential loss of data.

Data loss can happen due to any number of issues - improper isolation of environments, faulty database migration scripts, erroneous database queries, bugs in application code, AWS zone/region failure, malicious code execution or account compromise.

It is crucial to have backups so that we can ensure service continuity and reliability. We measure Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO refers to the maximum acceptable duration of service downtime before recovery, while RPO refers to the maximum amount of data that can be lost, measured as the time between the last backup and the point of failure. A good backup strategy aims to minimize both RTO and RPO.

In order to minimize recovery time, we ensure the backup data is easily accessible. The fastest recovery plan is to use the automated backup service provided by our database service. This could be point-in-time recovery (PITR), a snapshot-based backup, or failover to a read replica. Failover to a read replica provides the lowest possible RTO and RPO. PITR and snapshots have a similar RTO of about 10 minutes. Snapshots can have a very high RPO: in the common case where a snapshot is taken during off-peak hours (after midnight) and a recovery is needed in the middle of office hours, most of that day’s data would be lost. In a disaster where AWS is completely unavailable and all data on AWS could potentially be lost irrecoverably, we use offline backups stored on another cloud service (e.g. Google Cloud) as a last resort.
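
A sketch of the kind of verification job this implies is shown below: it checks that a sufficiently recent automated snapshot exists, so we notice when backups silently stop. The instance identifier and the 26-hour threshold are placeholders, not our actual configuration.

    # Sketch: alert if the newest automated RDS snapshot is older than expected.
    from datetime import datetime, timedelta, timezone

    import boto3

    rds = boto3.client("rds")
    DB_INSTANCE = "voltade-production"      # hypothetical identifier
    MAX_SNAPSHOT_AGE = timedelta(hours=26)  # assumed daily snapshots, plus slack

    def latest_snapshot_age() -> timedelta:
        snapshots = rds.describe_db_snapshots(
            DBInstanceIdentifier=DB_INSTANCE, SnapshotType="automated"
        )["DBSnapshots"]
        available = [s for s in snapshots if s.get("Status") == "available"]
        if not available:
            return timedelta.max
        newest = max(s["SnapshotCreateTime"] for s in available)
        return datetime.now(timezone.utc) - newest

    if __name__ == "__main__":
        if latest_snapshot_age() > MAX_SNAPSHOT_AGE:
            print("No recent snapshot found; backups may be failing")  # a real job would page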

Test Setup

Writing code is hard, and it gets even harder for larger and more complex systems. Developing any system gets disproportionately more difficult as time goes by, as each seemingly insignificant code change has an exponentially increasing number of ways it can influence the rest of the system. This is especially true when we consider the diversity of environments and use cases in which people will be running our APIs. In order to minimize the number of bugs that make it into production while still being able to make progress, we set up an automated test environment.

A good testing system automates the verification of changes beyond a developer manually clicking around in the running application. At a basic level, we have integration tests that spin up the service, or large parts of it, and check whether the key interaction points respond as expected. This helps stop high-level failures from reaching production, but may not be very helpful in diagnosing the problem. We also set up unit tests that check individual components of the code in isolation and verify that they still behave as expected. These may not catch every error, but when they do, they allow developers to more quickly work out where things are going wrong.
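
The sketch below illustrates both levels with pytest; compute_quote and the /health endpoint are hypothetical stand-ins for real application code, and the integration test assumes a locally running service.

    # Sketch: unit tests check one component in isolation; integration tests hit the running service.
    import pytest

    def compute_quote(base_price: float, quantity: int) -> float:
        """Toy stand-in for a unit under test."""
        if quantity < 1:
            raise ValueError("quantity must be positive")
        return round(base_price * quantity, 2)

    def test_compute_quote_multiplies_price_by_quantity():
        assert compute_quote(10.0, 3) == 30.0

    def test_compute_quote_rejects_zero_quantity():
        with pytest.raises(ValueError):
            compute_quote(10.0, 0)

    @pytest.mark.integration  # assumes this marker is registered in the test config
    def test_health_endpoint_responds():
        # Integration test: spin up the service first; the URL is a placeholder.
        import requests
        response = requests.get("http://localhost:8000/health", timeout=5)
        assert response.status_code == 200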

Tests should exercise as much of the application code as possible, to catch problems regardless of where they are. We use code coverage tools to track the execution of application code as tests run and to produce a report of how much code was covered and what else should be tested. Good code coverage indicates how extensively an application is tested, and helps our developers identify important parts that have yet to be covered.

As our systems grow more complex, the ease of testing, more than individual ability, becomes the limit on developer productivity. It is not unusual for a complex system to have more than half of development time spent on writing tests rather than production code. We balance the time spent writing automated tests against shipping code quickly by maintaining sufficient test coverage even as our systems grow in complexity.

DDOS Protection & Caching

While we can spend significant time securing every last exploit of our system, a trivial way to bring services down is to just overwhelm them with traffic. You do not even need to be a serious hacker; there are websites where you can just pay to have someone “DDOS on demand” a desired target.

The only way to prevent that is to have more bandwidth than the attackers and to selectively filter out obvious non-users. This is what content delivery networks (CDNs) do. We use services such as Cloudflare, which set up “shared pools” of frontend resources for all their customers. This means that anyone trying to overwhelm an individual service has to contend with the entire pool’s resources, which, while still possible, is extremely hard to do, whereas overwhelming even a dozen servers behind a single digital service is fairly trivial.

Our CDNs can even use their breadth of information to intelligently detect and mitigate attacks. System wide traffic patterns can be used to identify when attacks are happening. Known botnets are recognized and quickly filtered out. Suspicious looking traffic is presented with a captcha that allows legitimate users to still access our service, while screening out the automated attackers.

Incremental Rollout System

Whenever we change our code, it is impossible to know exactly how it will affect users. Even if all the tests pass and all the user studies are positive, there is a good chance that something will go wrong or that users will simply react negatively to the change. Hence we avoid launching to everyone at once, and instead slowly introduce the change to more users as we grow confident that the code is robust.

We selectively reach out to small pools of users to test new features and gather feedback. We do so through a range of tools – such as manually letting users test features on our UAT servers, launching to a subset of users via beta flags, and running A/B tests with Optimizely.

The core concept of incremental rollout is to gradually increase the test surface of a new release. Developer testing is the smallest surface. Next, a pre-release to beta testers is a larger surface that incorporates real-world usage. Finally, a phased rollout to production exposes the release to the rest of the user base. Throughout this, any breaking issue discovered means the release is halted or rolled back. During the testing phases, the cost of making mistakes is low, so the preference is to move as fast as possible and find problems quickly. At the production phase, however, the potential damage to users is much higher, so the preference is to be slow and steady.
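
A hedged sketch of a percentage-based rollout is shown below: each user is hashed into a stable bucket, and the feature is enabled only for buckets below the current rollout percentage, so the same user keeps the same experience as the percentage ramps up. The feature name and percentages are illustrative.

    # Sketch: deterministic percentage rollout using stable hash buckets.
    import hashlib

    def rollout_bucket(user_id: str, feature: str) -> int:
        """Map a user to a stable bucket from 0 to 99 for a given feature."""
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def feature_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
        return rollout_bucket(user_id, feature) < rollout_percent

    # Example: start at 5% for beta testers, then ramp up as confidence grows.
    print(feature_enabled("user-42", "new-builder-ui", 5))
    print(feature_enabled("user-42", "new-builder-ui", 50))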

We treat testing seriously; it is not just another step towards launch. It is quite common for other companies to want to keep to a schedule and report progress, and so even though they run pilots or beta tests, they push through to the final launch despite clear signs of trouble. This completely nullifies the point of running tests. The goal of partial rollouts is to actively try to find problems, not to acquire users. This means using as few users as necessary to get meaningful feedback, and halting the rollout when new information presents itself.

Updated: May 2023