Building Multi-Region Active-Active Architecture on AWS using containerised micro services

In this article, we will look at building an AWS-based multi-region active-active architecture using micro services deployed in Docker containers.

Requirements and Assumptions

To facilitate the above, we will use some sample requirements and assumptions, which call for a multi-region active-active deployment.

  • Tier 1 Web Application: 99.99% availability. Annual Downtime: 52 mins.
  • RPO (Recovery Point Objective): Near Real-time.
  • RTO (Recovery Time Objective): n/a (Always active).
  • Application must have Multi-Region deployments to serve global user traffic.
  • User traffic must be served from their own, or the nearest, AWS region.
  • Components of the app:
  • Rich Internet Application (RIA) built using Angular 7+. Angular app will access micro services deployed in Docker containers.
  • ElasticSearch will be used as a search engine, and will be exposed to the app via a micro service.
  • Micro services will be developed using Java (Spring Boot), and deployed in Docker containers.
  • There is no advisory on infrastructure sizing for any of the components (as it will depend on the non-functional requirements of a specific application), so sizing is ignored in this article.

What is an Active-Active Architecture?

  • An active/active system is a network of independent processing nodes, each having access to a (..) replicated database such that all nodes can participate in a common application [4].
  • For all practical purposes, we will have two or more ‘sets’ of the application running in two or more geographically dispersed AWS regions. Both regions will have all the required components and data, so that if one region fails, another can automatically take over user traffic from the failed region without any downtime for the globally dispersed app users.

Active-Active vs Active-Passive Architectures

  • Active-active architecture requires that we have all components running in both regions at the same time.
  • Active-passive architecture suggests we have all components running only in the ‘active’ region/data centre, but have mechanisms in place to automatically (or semi-automatically/manually) fail over to the passive region and install/start components/applications there as quickly as possible.
  • From an AWS perspective, this means that you have already configured a VPC (security groups, etc.) in the passive region, have some sort of data backups in that region, and can quickly deploy and start your components there when failover occurs. There are a whole host of other issues to consider (such as availability of data snapshots) so that you can recover data and re-create the database and other necessary app data in the passive region and make it active; those are beyond the scope of this article.

Rationale for Multi-Region Active-Active Architecture

Below are some of the key reasons why you might consider building a multi-region active-active architecture:

  • Improved latency for end users – The closer your backend is to your end users, the better their experience will be.
  • Disaster recovery – Mitigate against a whole region failure.
  • Business Requirements – Tier 1 apps with 99.99%+ availability non-functional requirements.

Key AWS Services

Below are some of the key AWS services that can be leveraged to achieve some of the above requirements:

  • Route 53 – DNS service and traffic flow management across AWS regions
  • CloudFront – Global Content Delivery Network (CDN)
  • AWS Aurora Global Database – Designed for globally distributed applications, allowing a single Amazon Aurora database to span multiple AWS regions. Cross-region: Recovery Point Objective (RPO) of 1 second and Recovery Time Objective (RTO) of less than 1 minute in case of a region failure
  • AWS ElasticSearch (ES) – Managed service to deploy, secure, and operate Elasticsearch at scale with zero downtime
  • AWS Elastic Container Service (ECS) – Container orchestration service that supports Docker containers and allows you to easily run and scale containerised applications on AWS. With ECS, you can either use EC2 instances or AWS Fargate to deploy your containers.
  • AWS Elastic Container Registry (ECR) – Managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images
  • AWS Fargate – Compute engine for Amazon ECS that allows you to run containers without having to manage servers or clusters

Additionally, a global solution will benefit from the following services (not shown in the deployment diagram):

  • AWS Shield – Managed Distributed Denial of Service (DDoS) protection service
  • AWS CloudTrail – Enables governance, compliance, operational auditing, and risk auditing of your AWS account
  • Elasticsearch, Logstash, Kibana (ELK) Stack – Aggregates logs from different services in one central place for issue analysis
  • AWS ElastiCache – Managed, Redis or Memcached-compatible in-memory data store. You can use it to offload traffic from your RDS Aurora DB or ElasticSearch cluster.

AWS Route 53 Routing Policies

AWS Route 53 provides several routing policies, which can be leveraged to route traffic to a certain region based on different factors. Some of the key policies relevant for a multi-region active-active architecture are listed below [1], with a CLI sketch after the list:

  • Failover routing policy – Use when you want to configure active-passive failover. For example, if US region fails, route US traffic to EU region (failover region for US).
  • Geolocation routing policy – Use when you want to route traffic based on the location of your users.
  • Geoproximity routing policy – Use when you want to route traffic based on the location of your resources and, optionally, shift traffic from resources in one location to resources in another.
  • Latency routing policy – Use when you have resources in multiple AWS Regions and you want to route traffic to the region that provides the best latency.
  • Multivalue answer routing policy – Use when you want Route 53 to respond to DNS queries with up to eight healthy records selected at random.
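
As a minimal sketch, a latency-based record for the EU region could be created with the AWS CLI as below (the hosted zone ID, domain name, and ALB details are placeholders); a matching record with "Region" set to "us-east-1" would cover the US region.

$ aws route53 change-resource-record-sets \
    --hosted-zone-id Z1EXAMPLE \
    --change-batch file://latency-eu.json

Contents of latency-eu.json:

{
  "Comment": "Latency-based record pointing at the EU ALB",
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "api-eu",
      "Region": "eu-west-2",
      "AliasTarget": {
        "HostedZoneId": "ZALBEXAMPLE",
        "DNSName": "multi-alb-eu.eu-west-2.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}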

Multi-Region Active-Active Overall Architecture

  • You can route a user’s API requests based on their geolocation (Route EU users to EU region)
  • Define failover regions for each region (If EU region fails, route traffic to US region so that users can still access the app)
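
For illustration, a change batch for the geolocation part might look like the below (zone, DNS names, and health check ID are placeholders). The default record ("CountryCode": "*") serves locations with no specific match, and Route 53 also falls back to it if the health-checked EU record becomes unhealthy.

{
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "api-geo-eu",
      "GeoLocation": { "ContinentCode": "EU" },
      "HealthCheckId": "hc-eu-placeholder",
      "AliasTarget": {
        "HostedZoneId": "ZALBEU",
        "DNSName": "multi-alb-eu.eu-west-2.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }, {
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "api-geo-default",
      "GeoLocation": { "CountryCode": "*" },
      "AliasTarget": {
        "HostedZoneId": "ZALBUS",
        "DNSName": "multi-alb-us.us-east-1.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}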

Detailed Deployment Architecture for a Region

Each AWS region will have a deployment architecture similar to the below.

  • Bastion Host – There will be a bastion host in each region. Developers/admins can SSH to this host, and from there they can access RDS, the micro services and ElasticSearch. There is a Network Load Balancer (NLB) in front of the bastion hosts, so that you can have multiple low-spec bastion hosts across two Availability Zones (AZs). If a bastion host instance is replaced by AWS, it would not affect your developers; they can still SSH using the same NLB DNS address.
  • CloudFront – Global users will access the Angular app via CloudFront CDN for low latency downloads.
  • AWS S3 – The Angular based app will be deployed in S3.
  • Route 53 – All user traffic will go via Route 53, which will make routing decisions regarding which region a user’s requests should be served from.
  • AWS ALB (Application Load Balancers) – Each region will have its own ALB, which will route traffic to the different micro services running in the Docker containers. For example, all product related web service calls will go to the ‘product’ micro services container. To achieve this, you will define a Target Group for each micro service, and map different request paths to a target group. For example, the ‘/product/’ path could be mapped to a Target Group that contains all your containers running the Product micro service (see the CLI sketch after this list).
  • ECS Fargate – It will contain all your containers running different micro services.
  • AWS ES – It will contain for example product related indexed data, and will be used by the ‘search’ micro service.
  • RDS Aurora DB – It will contain all your product and user data. Each region will have its own primary and standby replica. If the primary goes down in a region, AWS will promote the standby as primary. If the whole region containing the primary goes down, you will have to promote another region’s RDS Aurora cluster as the primary for the Aurora Global DB.
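
As a sketch of the path-based routing setup with the AWS CLI (ARNs, VPC ID, and names are placeholders; --target-type ip is required for Fargate tasks, and the health check path assumes the Spring Boot Actuator endpoint is enabled):

$ aws elbv2 create-target-group \
    --name product-tg --protocol HTTP --port 8080 \
    --vpc-id vpc-0example --target-type ip \
    --health-check-path /actuator/health

$ aws elbv2 create-rule \
    --listener-arn <ALB-listener-ARN> \
    --priority 10 \
    --conditions Field=path-pattern,Values='/product/*' \
    --actions Type=forward,TargetGroupArn=<product-tg-ARN>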

Security Architecture

In each Availability Zone (AZ), there is 1 public subnet and 2 private subnets to segregate components. Load balancers are in the public subnet, while the rest (ECS containers, ElasticSearch, DB) are in private subnets.

Security Groups (SG)

Traffic flow between different components is controlled via Security Groups. Each security group allows traffic on a certain port only from another, named security group. The rules below can also be created via the AWS CLI, as sketched after this list.

  • Application Load Balancer (ALB) SG – Allow traffic on port 443 from the public internet
  • Network Load Balancer (NLB) SG – Allow traffic on port 22 from your office IP range
  • Bastion SG – Allow traffic on port 22 only from the NLB SG
  • ElasticSearch SG – Allow traffic on port 9300 from the Bastion SG and the ALB SG
  • ECS Fargate SG – Allow traffic on port 8080 (where Spring Boot listens) from the ALB SG
  • Database SG – Allow traffic on port 3306 from the Bastion SG and the ECS Fargate SG
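
For example, the ECS Fargate and Database rules could be created as below (security group IDs are placeholders):

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0ecsfargate --protocol tcp --port 8080 \
    --source-group sg-0alb

$ aws ec2 authorize-security-group-ingress \
    --group-id sg-0database --protocol tcp --port 3306 \
    --source-group sg-0ecsfargate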

Access to ElasticSearch on local machine

To access AWS ElasticSearch Service (ES) from a local machine, developers can create an SSH tunnel via the bastion host. That way, they can access Kibana and ElasticSearch REST API locally in their browser to diagnose any problems with the data contained in the search cluster.

For example, add the below to your ~/.ssh/config file to access Kibana and the ElasticSearch REST API locally on your machine.

Note: Please remember to replace the keypair file name, bastion host ALB public DNS name and private DNS name of your ElasticSearch cluster.

# Elasticsearch Tunnel
# https://www.jeremydaly.com/access-aws-vpc-based-elasticsearch-cluster-locally/
Host estunnel
HostName multi-bastion-lb-abc.elb.eu-west-2.amazonaws.com
User ec2-user
IdentitiesOnly yes
IdentityFile ~/.ssh/multi-keypair.pem
LocalForward 9200 vpc-multi-es-domain-abc.eu-west-2.es.amazonaws.com:443
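
With that entry in place, open the tunnel in a terminal. The -N flag keeps the session open purely for port forwarding, without running a remote command:

$ ssh estunnel -N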

After this you can access Kibana locally in the browser:

https://localhost:9200/_plugin/kibana/app/kibana#/management/kibana/index?_g=()

You can also test access to the ES API via curl, or use Postman to try different API requests.

$ curl -k https://localhost:9200/

The same kind of tunnel could be configured to access RDS Aurora locally (for dev environments), as sketched below.
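
For example, a similar ~/.ssh/config entry (the Aurora cluster endpoint is a placeholder; Aurora MySQL listens on port 3306 by default) would let you point a local MySQL client at 127.0.0.1:3306:

# RDS Aurora Tunnel
Host rdstunnel
HostName multi-bastion-lb-abc.elb.eu-west-2.amazonaws.com
User ec2-user
IdentitiesOnly yes
IdentityFile ~/.ssh/multi-keypair.pem
LocalForward 3306 multi-aurora.cluster-abc.eu-west-2.rds.amazonaws.com:3306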

AWS Route 53 Routing Policy

AWS provides a Traffic Flow visual editor that can be used to configure the routing policies mentioned earlier in this article.

  • For example, suppose you are using three regions: US, EU, and APAC.
  • You may direct all US users’ HTTP requests to your US region VPC using the geolocation routing policy.
  • To have automatic failover between regions, you can for example define EU as the failover region for US. This way, if the whole US region goes down, Route 53 will route traffic destined for the US region to your failover EU region’s VPC.
  • You can repeat the same for the EU and APAC regions.

Example Routing Policy
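
The Traffic Flow editor generates a traffic policy behind the scenes; combining geolocation and failover rules in one policy is exactly what its nested rules handle for you. As a rough, hand-rolled sketch of just the US-to-EU failover pair (zone ID, DNS names, and health check ID are placeholders):

{
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "api-us-primary",
      "Failover": "PRIMARY",
      "HealthCheckId": "hc-us-placeholder",
      "AliasTarget": {
        "HostedZoneId": "ZALBUS",
        "DNSName": "multi-alb-us.us-east-1.elb.amazonaws.com",
        "EvaluateTargetHealth": false
      }
    }
  }, {
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "api-eu-secondary",
      "Failover": "SECONDARY",
      "AliasTarget": {
        "HostedZoneId": "ZALBEU",
        "DNSName": "multi-alb-eu.eu-west-2.elb.amazonaws.com",
        "EvaluateTargetHealth": true
      }
    }
  }]
}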

Multi-Region Active-Active Architecture Costs

  • Running a redundant set of components across multiple regions is by no means a cheap option. It is costly!
  • That’s why the majority of applications are either deployed in a single region across multiple AZs, or use an active-passive architecture where the active region has all the required components while the passive region has just a bare-bones VPC and some data backups.
  • So, unless you need to absolutely meet the active-active multi-region requirements as outlined in the beginning of this article, you may be fine just deploying in a single region.

Summary

  • Multi-region active-active architectures are expensive to build and maintain, but they are essential if your app requires low latency for global users, is a Tier 1 app with 99.99% availability requirements, and requires built-in disaster recovery capability across regions.
  • AWS Aurora Global Database simplifies cross-region DB replication, and facilitates an active-active architecture.
  • Using AWS Fargate along with the Elastic Container Registry (ECR) simplifies running containers in AWS, and the deployment and updating of micro services containers. Developers can easily build and test a Docker container locally and push it to ECR. From there, images can easily be deployed to ECS without any downtime for existing micro services, as ECS gradually replaces the old containers with the new ones (a typical build-and-push sequence is sketched below).
  • There are some tech choices made in this article, but please do your own due diligence to choose the correct technologies for your own specific use case. For example, using ECS vs Kubernetes, using EC2 or Fargate with ECS, using RDS Global Database vs DynamoDB Global Tables.
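
For example, a typical build-and-push sequence for a hypothetical product-service image (account ID, region, cluster, and service names are placeholders) might look like:

$ docker build -t product-service .
$ aws ecr get-login-password --region eu-west-2 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-2.amazonaws.com
$ docker tag product-service:latest 123456789012.dkr.ecr.eu-west-2.amazonaws.com/product-service:latest
$ docker push 123456789012.dkr.ecr.eu-west-2.amazonaws.com/product-service:latest
$ aws ecs update-service --cluster multi-cluster --service product-service --force-new-deployment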

Author: Sajid Khan Niazi, experienced Enterprise Architect/Solutions Architect, Founder Axsar Contracts, 12x AWS Certified, MBA, M.Sc. Software Engineering.

References

  1. AWS Route53 Routing Policies
  2. AWS RDS Global Database
  3. AWS Fargate
  4. Active-Active Architecture
  5. AWS Elastic Container Service (ECS)
  6. AWS Elasticsearch Service (ES)
  7. AWS CloudFront
  8. AWS ElastiCache