Overview
Migrating from Google Cloud (Cloud Run + GKE) to AWS EKS while serving users from both environments requires careful planning. The goal is to gradually shift traffic to AWS without service interruption, confirm stability on EKS, then fully cut over – all while maintaining zero downtime. The domain’s DNS is hosted on Google Cloud DNS (with zones for prod, stage, uat), and this will remain unchanged. We need a strategy that allows hybrid traffic routing (to GCP and AWS) during the transition, smoothly migrates users to AWS, and provides instant failover if any backend is unhealthy.
Key requirements and challenges:
- Hybrid Serving: Both GCP and AWS instances must serve traffic simultaneously during migration.
- Gradual Traffic Shifting: Ability to start with most traffic on GCP and incrementally increase traffic to AWS (for canary testing on EKS).
- Zero Downtime: No outages or user-impacting cutovers – changes must be seamless.
- DNS Stays on GCP: We will use Google Cloud DNS for traffic steering (not moving to Route 53 or others).
- Consistent Endpoints: Users should keep using the same URLs. We’ll direct those URLs to the appropriate backends under the hood.
To meet these goals, we’ll explore DNS-based routing options, global load balancers, and service mesh/API gateway approaches. Each offers trade-offs in complexity, control, and reliability. Below is a comprehensive guide with recommendations, architecture considerations, and example configurations for each approach.
DNS-Based Traffic Steering
One of the simplest multi-cloud routing methods is to leverage DNS policies. Google Cloud DNS supports advanced routing policies like Weighted Round Robin (WRR) and Geolocation routing, similar to AWS Route 53. This allows the authoritative DNS server to decide which backend’s IP to return for a client’s query.
1. Weighted DNS (Canary/Gradual Cutover): With a weighted DNS policy, you create multiple DNS records for the same name, each pointing to a different backend (GCP or AWS) and assign a weight to each. Traffic is distributed in proportion to these weights – for example, 80% of DNS responses resolving to the GCP IP, 20% to the AWS IP (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog) (Best Practices for Zero Downtime Migration to AWS | ClearScale). By adjusting weights over time, you can smoothly shift load to AWS:
- Initial State: GCP weight 1.0 (or 100%), AWS weight 0.0 – all users resolve to GCP service endpoints.
- Canary Phase: Introduce AWS with a small weight (e.g. GCP 0.9, AWS 0.1 for ~10% traffic to AWS). Monitor AWS EKS performance.
- Gradual Increase: If stable, increment AWS weight (e.g. 30/70, 50/50, etc.) in steps, sending more traffic to EKS (Best Practices for Zero Downtime Migration to AWS | ClearScale). Google Cloud DNS will serve the right IP based on these weights for each query (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog).
- Full Cutover: Eventually set AWS to 1.0 (100%) and GCP to 0.0. At this point all new DNS lookups direct to AWS. GCP services can be turned down after existing TTLs expire.
Google Cloud DNS’s weighted round-robin policy makes this possible natively. For example, you could configure app.prod.example.com with two A records: one pointing to the GCP load balancer IP, one to the AWS load balancer’s IP, weighted say “0.8=GCP_IP;0.2=AWS_IP” to start (Configure DNS routing policies and health checks | Google Cloud). Cloud DNS will dynamically compute which IP to return on each query according to those ratios (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog). You can change weights via the API or gcloud CLI as you progress (these changes take effect at the DNS level immediately, though clients respect TTL).
TTL and Caching Considerations: DNS-based steering relies on clients periodically querying DNS. To minimize lag during changes, use a low TTL on these records (e.g. 30 seconds) during the migration (Best Practices for Zero Downtime Migration to AWS | ClearScale). Lower TTL ensures that when you adjust weights or switch traffic, clients will pick up new DNS answers quickly. (Be aware that some ISPs or resolvers might not strictly honor very low TTLs, and browsers/device caches could retain DNS entries for longer (Best Practices for Zero Downtime Migration to AWS | ClearScale).) It’s wise to lower the TTL well before the migration starts, so that by the time you make weight changes most clients are already using the low TTL setting (Best Practices for Zero Downtime Migration to AWS | ClearScale).
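As a minimal sketch of that preparation step (the managed zone name and the current GCP IP shown here are placeholders for your setup), lowering the TTL on the existing record might look like:
gcloud dns record-sets update app.prod.example.com. --type=A \
--zone="prod-zone" --ttl=30 \
--rrdatas="203.0.113.10"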
High-Level DNS Setup: In Cloud DNS, you would create a resource record set for the service domain with a WRR (Weighted Round Robin) routing policy. For example:
# Replace "prod-zone" with the name of your Cloud DNS managed zone
gcloud dns record-sets create app.prod.example.com. --type=A --ttl=30 \
--zone="prod-zone" \
--routing-policy-type=WRR \
--routing-policy-data="0.8=203.0.113.10;0.2=198.51.100.50"
In this hypothetical example, 203.0.113.10 could be the IP of a Google Cloud Load Balancer fronting Cloud Run/GKE, and 198.51.100.50 an IP of an AWS ALB/NLB fronting the EKS service. Cloud DNS will return the GCP IP ~80% of the time and the AWS IP ~20% (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog). Over time, you’d update the routing-policy-data weights to shift the percentages. (If using hostnames/CNAMEs – e.g. a Cloud Run custom domain or ALB DNS name – Cloud DNS can weight those via CNAME records similarly. However, a root/apex domain cannot use a CNAME, so an A/AAAA record with direct IPs or an alias/ANAME-like feature would be needed.)
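When it is time to shift more traffic, the same record set can be updated in place. A hedged sketch (the zone name is a placeholder, assuming the record created above) of moving to a 50/50 split:
# Rolling back is the same command with the AWS weight set back to 0
gcloud dns record-sets update app.prod.example.com. --type=A --ttl=30 \
--zone="prod-zone" \
--routing-policy-type=WRR \
--routing-policy-data="0.5=203.0.113.10;0.5=198.51.100.50"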
Health Checks & Failover: Basic DNS round-robin by itself doesn’t automatically detect outages – if the GCP service goes down while still in DNS, some clients might get that IP until the TTL expires. To mitigate this, Cloud DNS also supports a Failover routing policy and can integrate health checks for DNS endpoints (Global Load Balancer Approaches). One approach is to combine policies: for example, use weighted routing with health checks on each record. Google Cloud DNS health checking can detect if the GCP or AWS endpoint is down and stop returning its IP (Global Load Balancer Approaches). Another approach is to use a DNS failover policy once you reach the final cutover: designate AWS as primary and GCP as secondary (failover target). During the hybrid period, though, weighted policies with manual control are typically used (since you want both active). Keep the TTL low so that even if one backend fails, you can quickly set its weight to 0 (or rely on the health check to remove it).
Geo-Location DNS (optional): In addition to weighting, Google Cloud DNS allows geo-based policies (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog). If your user base is regionally divided or if the GCP and AWS clusters are in different regions, you could route users to the nearest cloud. For example, during migration you might direct EU customers to GCP and US customers to AWS (or vice versa) using Geo DNS, gradually expanding the geo coverage of AWS as confidence grows. Geo policies can also ensure optimal latency by keeping users on the closest service (Global Load Balancer Approaches) (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog). However, in this scenario (gradual migration) weighted routing is more straightforward for splitting traffic globally. Geo-DNS could be combined with weights (e.g. weighted within each region), but Cloud DNS does not allow combining geo and custom weights simultaneously on the same record set (Configure DNS routing policies and health checks | Google Cloud). So you’d typically choose one strategy or the other. Weighted routing is usually sufficient unless you have multi-region deployments in both clouds.
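If you did want geo-based steering instead of weights, a hedged sketch of a geolocation policy could look like the following (the zone name and the region-to-IP mapping are placeholders; Cloud DNS maps each query’s origin to the nearest listed GCP region):
gcloud dns record-sets create app.prod.example.com. --type=A --ttl=30 \
--zone="prod-zone" \
--routing-policy-type=GEO \
--routing-policy-data="europe-west1=203.0.113.10;us-east1=198.51.100.50"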
Pros & Cons of DNS Steering:
- Advantages: DNS-based routing is simple to implement with existing Cloud DNS. No new infrastructure is needed. It’s a proven technique for blue-green and canary migrations (How to Setup Blue Green Deployments with DNS Routing) (Best Practices for Zero Downtime Migration to AWS | ClearScale). By gradually changing DNS weights, you reduce risk and can rollback by reversing weights if issues arise. Also, DNS can distribute load globally without concentrating traffic through a single point (each user goes directly to whichever endpoint DNS gives them). This can improve latency if the DNS policy is geo-aware or if each user sticks to a nearby backend.
- Disadvantages: DNS changes aren’t instantaneous for all users due to caching. Some users may continue hitting the “old” service for up to the TTL duration (or longer, if their resolver ignores TTL) (Global Load Balancer Approaches). This is usually manageable by keeping both environments live during overlap, but it means you can’t perfectly control the exact cutover moment for every user – there’s a fuzzy period. Also, if an environment goes down unexpectedly, clients that cached its IP might fail until they retry DNS. Health-check integrated DNS can alleviate this but may not be as fast as a true load balancer. DNS load balancing is also stateless: it distributes DNS queries, not actual traffic flows. So there’s no concept of “session stickiness” beyond DNS caching. If your application is stateful (e.g. relying on session affinity), a user might get sent to AWS on one DNS lookup and then to GCP on a later lookup, which could be an issue if session data isn’t shared. (Mitigate by using a shared session store or sticky cookies with a common domain if needed.)
In practice, weighted DNS is a great low-complexity approach to achieve near zero-downtime migration. Many organizations use it for cloud migrations – for example, gradually shifting 5% of traffic at a time (Best Practices for Zero Downtime Migration to AWS | ClearScale). As long as both old and new services run in parallel and serve identical content/APIs, end-users will not notice the difference. Just be sure to monitor both environments closely during the shift (e.g. compare error rates, latencies) and plan for how to quickly react if the new environment has issues (e.g. set AWS weight back to 0 or remove that record).
Global Load Balancer Approach
An alternative (often more sophisticated) method is to put a global load balancing layer in front of your services. Instead of relying on DNS to make the routing decision, a global load balancer can accept user traffic at a single entry point and then proxy it to either GCP or AWS backends. This can provide faster failover, detailed traffic control (at the request level), and shielding users from any DNS propagation delays.
Example of a global load balancer splitting traffic 80/20 between two origin pools (e.g., one in a data center and one in the cloud) (Load Balancing with Weighted Pools). A similar approach can route users to GCP or AWS backends based on assigned weights.
Two primary options in this category are Google Cloud HTTP(S) Load Balancing (with hybrid backends) and AWS Global Accelerator. We’ll also mention third-party anycast networks (like Cloudflare) as an option.
Using Google Cloud External Load Balancer (Anycast Global LB)
Google Cloud’s external Application Load Balancer (HTTP(S) LB) is a global, Anycast load balancer that can distribute traffic across multiple regions – and even across different backend types. You can leverage it to route traffic to both your GCP services and AWS services during the migration:
- Single Endpoint: The LB provides a single IP address (anycast globally) or a single domain that clients connect to. You would update DNS once to point app.prod.example.com to this load balancer’s IP or CNAME. After that, you no longer need to change DNS; all traffic goes to the LB.
- Multiple Backends: The LB is configured with backend services representing your environments. For example, one backend might be a Serverless NEG pointing to the Cloud Run service or a GKE Ingress in GCP, and another backend could be an Internet NEG pointing to the AWS service endpoint (AWS ALB or a public IP of an AWS NLB). Google’s load balancer supports Internet Network Endpoint Groups, which means it can send traffic to arbitrary external addresses, like an AWS load balancer, as if they were just another backend (Helping A Business Incrementally Migrate From AWS and Cloudflare to Google Cloud | DoiT). This setup effectively bridges the two clouds at the load-balancer level.
- Weighted Traffic Splitting: With the load balancer in place, you can configure weight-based traffic splitting among backend services. Google’s global HTTP LB supports advanced traffic management – you can define a URL map where a given path or host is served by multiple backend services with specified weights (Traffic management overview for global external Application Load Balancers | Load Balancing | Google Cloud). For instance, you create a single frontend (say app.example.com/*) and attach two backend services to that route: Backend A (GCP) with weight 95, Backend B (AWS) with weight 5 to start. The LB will then route 5% of requests to AWS and 95% to GCP, at the HTTP request level (Application Load Balancer overview | Load Balancing | Google Cloud). This is analogous to weighted DNS, but the balancing is done by the LB on each request, not by DNS responses. You can gradually adjust these weights over time using gcloud or the GCP console, just like with DNS policies; the difference is that the LB makes the decision for each incoming request in real time. (A sketch of such a URL map appears after this list.)
- Health Checks and Failover: The LB continuously health-checks each backend. If the AWS backend becomes unhealthy, the LB will stop sending traffic to it entirely within seconds, regardless of weight (essentially failing over to the healthy backend automatically) (Introduction to AWS Global Accelerator – Whizlabs Blog). This provides near-instantaneous failover – something DNS alone cannot guarantee due to caching. Similarly, if the GCP backend had an outage, the LB could send all traffic to AWS. This ensures the zero-downtime requirement is met even in the face of issues.
- Latency and Geo-Distribution: An anycast LB will typically route users to the nearest point of presence. Google’s global LB has worldwide edge nodes; users hit the closest Google front-end, which then forwards to the chosen backend. If your GCP and AWS backends are in different geographic regions, the LB could be configured with routing rules to prefer the nearest backend by latency or geography (this would be a more complex “latency-based routing” policy at the LB level, or using multiple LB frontends). However, since we control weights manually in this scenario, you might keep both backends active globally and rely on weighted split + the LB’s own network intelligence to handle performance.
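As a rough illustration of the weighted split described above, here is a hedged sketch of importing a URL map that sends 95% of requests to a GCP backend service and 5% to an AWS-facing backend service. The project ID, URL map name, and backend service names (“gcp-backend”, “aws-backend”) are placeholders; the AWS backend service would be the one backed by an Internet NEG (see the next section).
cat > url-map.yaml <<'EOF'
name: app-url-map
defaultService: projects/PROJECT_ID/global/backendServices/gcp-backend
hostRules:
- hosts:
  - app.example.com
  pathMatcher: app-matcher
pathMatchers:
- name: app-matcher
  defaultRouteAction:
    weightedBackendServices:
    - backendService: projects/PROJECT_ID/global/backendServices/gcp-backend
      weight: 95
    - backendService: projects/PROJECT_ID/global/backendServices/aws-backend
      weight: 5
EOF
gcloud compute url-maps import app-url-map --source=url-map.yaml --global
Re-importing the URL map with new weights is how you would shift the split over time.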
Architecture Diagram – GCP LB Hybrid: Imagine this setup: the DNS for app.prod.example.com resolves to a Global LB IP (anycast). A user’s request goes to the LB, which then decides where to forward it:
- GCP path: LB → Cloud Run/GKE (within GCP, via the serverless or instance group NEG).
- AWS path: LB → AWS ALB/NLB (via Internet NEG over the internet). The LB here acts like a reverse proxy; the user’s connection terminates at the Google front-end, then the LB opens a new connection to the AWS endpoint.
From the client perspective, they are always talking to one host/IP (the LB). This indirection adds a bit of overhead (requests to AWS now go through Google’s infrastructure first), but it gives strong control. Google’s LB also supports features like Cloud CDN, Cloud Armor (WAF), etc., which you could use to enhance security/performance during the transition.
Traffic Shifting with LB: Initially, you configure the LB to send 0% to AWS (all traffic to Cloud Run/GKE). Then as EKS comes online, start with a small percentage to the AWS backend service. Google’s traffic management supports very fine-grained splits (even 1% if desired) (Application Load Balancer overview | Load Balancing | Google Cloud). Increase AWS share gradually until it’s 100%. At that point, you could even remove the GCP backend from the LB. The DNS doesn’t need to change at cutover at all – it was already pointing to the LB, so users notice nothing. Essentially, the cutover happens inside the LB configuration.
Zero Downtime and Testing: During this process, the LB ensures no downtime: it will only send traffic to healthy backends and you can adjust weights without interrupting existing connections. You can test AWS in production with a small trickle of real traffic. If any problem is detected, simply dial the AWS backend weight down (even to 0%) and the LB will immediately stop sending new requests there. This offers a very fast rollback mechanism (faster than waiting for DNS TTLs to expire) (Introduction to AWS Global Accelerator – Whizlabs Blog).
Costs and Complexity: Introducing a global load balancer has some overhead. There are GCP costs for LB bandwidth/requests, and configuring the LB (especially with an Internet NEG to AWS) is a bit more work than just adding DNS records. You also need to ensure the AWS service is exposed publicly in a way the GCP LB can reach – likely through an AWS ALB or NLB with a public IP. One common pattern is to use an AWS Network Load Balancer with a static Elastic IP, so you have a stable IP for the NEG target. Alternatively, use the AWS ALB’s hostname in a “fully qualified domain name (FQDN) NEG” (the GCP Internet NEG can point to a domain name and will resolve it). Make sure to allow the LB’s health check IPs and traffic through any firewalls (the GCP LB uses Google Front Ends that will connect from Google IP ranges).
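A hedged sketch of wiring the AWS side in as an Internet NEG backend follows; the resource names are placeholders and the ALB hostname is purely illustrative:
# Create a global Internet NEG whose endpoint is the AWS ALB hostname
gcloud compute network-endpoint-groups create aws-eks-neg \
--global --network-endpoint-type=internet-fqdn-port

# Register the ALB's FQDN and port as the NEG endpoint
gcloud compute network-endpoint-groups update aws-eks-neg --global \
--add-endpoint="fqdn=eks-prod-123.us-east-1.elb.amazonaws.com,port=443"

# Create a backend service for the AWS side and attach the NEG to it
gcloud compute backend-services create aws-backend \
--global --load-balancing-scheme=EXTERNAL_MANAGED --protocol=HTTPS

gcloud compute backend-services add-backend aws-backend --global \
--network-endpoint-group=aws-eks-neg --global-network-endpoint-group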
Lifecycle: Once the migration is done and AWS is serving 100%, you have a choice: you could keep the GCP LB in place permanently (still directing everything to AWS). Some teams do this for a period to allow an easy fallback. Eventually, though, you might decide to simplify by pointing DNS directly to the AWS ALB and removing the GCP LB from the path (to reduce an extra network hop). That final DNS change can be done at a convenient time since the AWS backend is already handling all traffic – or you might even continue to use the GCP LB as a layer of indirection if it offers value (e.g., using Cloud Armor WAF in front of AWS). It’s up to your architecture preferences.
Summary of Pros: The global LB approach provides fine-grained control and fast failover. Weight changes take effect immediately on new requests (no waiting for DNS). Health checks make it safer – failing backends are automatically removed from rotation (Introduction to AWS Global Accelerator – Whizlabs Blog). You also get centralized logging and monitoring of all traffic in one place (the LB), which can simplify observing the cutover. And clients only ever see one IP/endpoint, which can avoid certain DNS sticking issues or cross-origin concerns.
Cons: The main downsides are the added complexity and potential performance impact for cross-cloud calls. For example, if a user and the AWS cluster are in the same region (say both in us-east) but the GCP LB node handling the request is in a different region (or routes inefficiently), you could introduce a slight latency penalty. In practice, Google’s network is very optimized, and any extra latency is usually small (tens of milliseconds). Another consideration is stateful sessions: if your LB does not have session affinity and you are switching traffic gradually, a user might bounce between GCP and AWS across requests (unless you enable session affinity on the LB by cookie or IP – though note that Google’s weighted traffic splitting does not combine with session affinity; if you set affinity, it might override the weights (Traffic management overview for global external Application Load Balancers | Load Balancing | Google Cloud)). If session stickiness is needed, you might use a different strategy (like route all users of a certain cohort to one side using a header or path). For mostly stateless services or API calls, this isn’t an issue.
In short, using the Google Cloud global load balancer for migration is a powerful approach that essentially gives you “cloud-agnostic” traffic management: you decouple the user-facing endpoint from the underlying cloud. It requires setup, but it ensures absolutely minimal disruption during the migration.
AWS Global Accelerator
AWS Global Accelerator (GA) is another global traffic management service, but it operates at the network layer. GA provides you with a pair of stable anycast IP addresses that edge locations announce globally. It then routes traffic from those edges to designated endpoint groups in AWS (which can be regional load balancers, EC2 instances, etc.). GA supports weighting traffic between AWS regions using a feature called the traffic dial – for example, splitting 70/30 between two AWS regions (Introduction to AWS Global Accelerator – Whizlabs Blog). It also monitors health and will fail over if an endpoint goes unhealthy.
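For reference, adjusting a traffic dial is a one-line operation with the AWS CLI; a hedged sketch (the endpoint group ARN is a placeholder):
aws globalaccelerator update-endpoint-group \
--endpoint-group-arn "arn:aws:globalaccelerator::123456789012:accelerator/EXAMPLE/listener/EXAMPLE/endpoint-group/EXAMPLE" \
--traffic-dial-percentage 70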
In the context of a GCP-to-AWS migration, AWS GA could be useful after most traffic is in AWS (especially if you plan multi-region deployments in AWS for high availability). However, GA by itself can’t directly split traffic between AWS and GCP, because its endpoints must be AWS resources. One theoretical approach would be to have one GA endpoint group in an AWS region for the EKS cluster, and another endpoint group that points to an AWS resource which forwards to GCP (e.g., an EC2 instance proxying to GCP). This is generally not worth the complexity – essentially it means hairpinning GCP traffic through AWS.
So, while Global Accelerator is great for multi-region AWS traffic management (and could be part of your end-state architecture for AWS-only, ensuring low latency globally and quick failover across regions), it’s not typically used to manage a hybrid cloud cutover. We mention it for completeness because it’s an example of an anycast load balancer similar in concept to Google’s, but tied to AWS. If in the final state you need global IPs for your service and multi-region resilience, you might deploy GA once you’re fully on EKS, but during the migration, other methods (DNS or GCP’s LB) are more straightforward for cross-cloud balancing.
Third-Party Global Load Balancers (Cloudflare, etc.)
Beyond cloud-native solutions, there are providers like Cloudflare, Akamai, Fastly, or F5/Citrix ADC that offer global load balancing as a service (Global Load Balancer Approaches). For example, Cloudflare’s Load Balancer can sit at the DNS/proxy level and distribute traffic between multiple origins (which could be GCP and AWS) with weights, health checks, geo-steering, etc. This can be very effective: Cloudflare’s network will direct users to whichever origin you configure (they support session affinity and fine routing rules as well).
To use Cloudflare in this way, you would typically delegate your DNS to Cloudflare or at least configure your domain to proxy through Cloudflare’s CDN. Since the question states DNS stays on Google Cloud, switching to Cloudflare DNS may not be desired. However, you could still use Cloudflare by making app.example.com a CNAME to a Cloudflare-managed domain that does the load balancing (Cloudflare allows weighted pools as we saw). Similar capabilities exist in other DNS services like NS1 or Dyn Traffic Director – they sit between the user and your origin servers.
Pros: Third-party solutions can be cloud-agnostic and very feature-rich. For instance, you could set up health checks from multiple continents, do latency-based routing (serve each user from whichever cloud is faster for them), or even do per-user sticky routing (like send a particular user ID consistently to one backend). Cloudflare’s example in the embedded diagram above shows how weights can be adjusted to quickly shift load when one origin pool is scaled up (80/20 split) (Load Balancing with Weighted Pools).
Cons: The downside is you’re adding another external dependency and potentially cost. Also, using a third-party means your traffic flows through their network (for Cloudflare in proxy mode, traffic goes through Cloudflare POPs). This can actually improve performance (due to caching and faster routes), but it’s a change to consider. Since our primary focus is using the existing cloud providers, a third-party LB is an option if neither Cloud DNS nor GCP/AWS native solutions meet a requirement you have (for example, if you needed true latency-based routing across clouds, a service like Cloudflare LB or Cedexis would be needed, as Cloud DNS doesn’t do latency measurements).
In summary, a global load balancer approach adds an abstraction layer that can greatly smooth out the migration. It’s often used in enterprise multi-cloud deployments. If your team is comfortable setting it up, it provides the most control and safety (at the cost of some complexity).
Service Mesh / API Gateway Approach
A third approach involves the application layer routing rather than DNS or a global LB. This typically means deploying either a shared API gateway or using a service mesh that spans both environments.
Multi-Cloud Service Mesh
Service mesh technologies (like Istio, Linkerd, or Consul mesh) can be used to route traffic between services across clusters. If you have the same application deployed on GKE and EKS, you could establish a mesh that includes both clusters and then use mesh routing features (layer 7 routing) to control traffic splitting. For example, Istio’s VirtualService resource can be configured to send X% of requests to one service version and Y% to another – even if those “versions” live in different clusters. Projects like Istio Multi-Cluster or Gloo Mesh allow tying two Kubernetes clusters together in one logical mesh. You’d typically need network connectivity between the clusters (VPN or VPC peering across clouds) so that services can talk to each other. With that in place, you can deploy a common control plane or a federated service mesh configuration.
How it would work: You might expose the service on GKE (mesh ingress gateway) and also on EKS (mesh gateway). You then configure the mesh so that when requests hit the ingress, it can split them: e.g., 90% to local service (GKE pods) and 10% forwarded to the EKS service (via the mesh’s cross-cluster communication). As you gain confidence, you adjust the weights in the VirtualService to send more to EKS. Eventually, you send 100% to EKS, and you could even switch the DNS to point directly to the EKS ingress at that time. This is essentially a layer 7 load balancing done by the service mesh sidecar proxies.
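As a rough sketch of what that weight adjustment looks like, here is a hedged example of an Istio VirtualService splitting 90/10. It assumes the GKE ingress gateway is named app-ingress-gateway and that the EKS copy of the service is reachable from the mesh under the host app.eks.global (for example via a ServiceEntry and east-west gateway); all of those names are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-split
spec:
  hosts:
  - app.prod.example.com
  gateways:
  - app-ingress-gateway
  http:
  - route:
    - destination:
        host: app.default.svc.cluster.local   # local service in the GKE cluster
      weight: 90
    - destination:
        host: app.eks.global                  # EKS service exposed to the mesh (assumption)
      weight: 10
EOF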
Pros: This approach keeps the traffic management in the application layer, which means you have full context of requests (you can do routing based on HTTP headers, etc., beyond just percentages). It also doesn’t rely on public DNS or public load balancers – the clusters could be connected privately. It’s a very powerful technique if you already use a service mesh, because you can leverage the same tools for canarying that you use within one cluster but now across clusters. For instance, Istio can even mirror traffic to the new deployment or do gradual rollouts with rich telemetry.
Cons: However, implementing a multi-cloud service mesh is non-trivial. You need to set up secure connectivity between GCP and AWS (like a direct VPN or use Istio’s mesh VPN capabilities) and ensure service discovery works across clouds. There is also a learning curve and operational overhead to running a service mesh across two environments. If you do not already have a mesh, introducing one just for the migration may be overkill. Meshes also typically assume a relatively stable set of connectivity; using them for a one-time migration might be more work than benefit, unless you want to adopt a mesh long-term for multi-cloud operations.
In most cases, DNS or global LB solutions are simpler for a short-term migration. That said, if your architecture is microservices-heavy and you foresee staying hybrid for a while, a service mesh could provide a consistent way to manage traffic splitting, security (mTLS between clouds), and monitoring.
API Gateway
Another application-layer approach is to use an API Gateway as the unified front-end. This could be a cloud-managed gateway or a self-hosted one:
- Cloud-managed: For example, Google Cloud Endpoints/Apigee or Amazon API Gateway could front the service. But those solutions typically work best when the backends are in the same cloud or accessible publicly. You could configure an Apigee gateway (running in GCP) to have targets for GCP service and AWS service, and do weighted routing between them. Or similarly, an AWS API Gateway could point to an AWS Lambda that proxies to GCP… however, these become Rube Goldberg machines, with complexity and cost.
- Self-hosted: You might run a gateway like Kong, NGINX, or HAProxy on a VM or container that has network access to both environments. That gateway then becomes the entry point (you’d point DNS to it), and it forwards requests either to GCP or AWS. Essentially, this is like running your own global load balancer. You could even run such gateways in both clouds for redundancy (and use DNS round-robin between the two gateways, each of which splits traffic internally).
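As a minimal sketch of the self-hosted option, the nginx fragment below splits traffic 80/20 between the two clouds. The upstream hostnames, TLS details, and port are placeholders, and note that nginx resolves upstream hostnames at startup unless you configure a resolver.
upstream app_backends {
    server gcp-origin.example.com:443 weight=8;   # GCP load balancer / Cloud Run domain (placeholder)
    server aws-origin.example.com:443 weight=2;   # AWS ALB in front of EKS (placeholder)
}

server {
    listen 443 ssl;
    server_name app.prod.example.com;
    # ssl_certificate / ssl_certificate_key omitted for brevity

    location / {
        proxy_pass https://app_backends;
        proxy_set_header Host $host;
        proxy_ssl_server_name on;   # send SNI so the upstream serves the right certificate
    }
}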
The API gateway approach, like the service mesh, gives you a lot of flexibility (you can do things like auth, transformations, etc., in one place during the migration). But again, you are introducing a new component that must be highly available itself. It can become a bottleneck or single point of failure if not done carefully.
When to consider mesh/gateway: If your system already uses an API gateway layer, then extending it to handle multi-cloud can make sense. Or if you require advanced routing logic (say only certain users go to the new environment – e.g., internal beta testers – which could be done by gateway inspecting a header or cookie), an app-layer solution is needed. Otherwise, for pure load distribution, DNS or LBs are typically easier.
Ensuring Zero Downtime During Cutover
Regardless of which approach you choose, here are some best practices to ensure zero or minimal downtime:
- Use Blue/Green Principles: Always have the new environment up and running in parallel with the old one before shifting traffic (How to Setup Blue Green Deployments with DNS Routing). This way, users are always hitting a working version. Our strategies above all adhere to this: they route to both old and new in parallel.
- Lower TTLs Ahead of Time: If using DNS changes (weighted or not), reduce the TTL well in advance (Best Practices for Zero Downtime Migration to AWS | ClearScale). For final cutover DNS changes (like eventually repointing the domain directly to AWS), a low TTL (e.g. 60s) ensures quick propagation. After the migration, you can raise TTLs back to normal.
- Implement Health Monitoring: Continuously monitor the health of both environments. If you use global LB or DNS health checks, they will do this for you and take action. If not, set up your own synthetic checks. For instance, once AWS is stable you might configure a Cloud DNS failover record with AWS as primary and GCP as the failover target. If you’re not using that, be prepared to manually adjust DNS or LB settings in case of a failure. The key is to catch any issue before it affects users widely. Use logging, APM, etc., in both clouds.
- Gradual Transition with Monitoring: Treat the migration like a canary release. Start by sending a small percentage to AWS EKS and verify: are error rates low? Is performance good? Compare it to the baseline on GCP. Only increase traffic when metrics look healthy. Use dashboards to watch both sets of servers. This minimizes risk – if something goes wrong at 10% traffic, you can roll back quickly with minimal impact.
- Data Consistency: Ensure both environments have access to the same data sources or have data synchronized. For example, if there’s a database, you might keep it in one place (perhaps still in GCP) during the transition, or use a cross-cloud replication. If one environment had stale data, users could see inconsistent results when they switch. Ideally, the user experience is identical no matter which backend served them. (This typically means using a single DB or synchronized databases, and careful handling of any caches, etc.)
- Session Management: As noted, if your application maintains session state in-memory (say in GKE pods), a user who bounces between clouds might lose their session. Solutions include using a shared session store (Redis, etc.) accessible from both, or enabling sticky session features. If using a load balancer, you could stick sessions to the first backend they hit (though that complicates the gradual migration since some users would never move). Another solution some adopt is migrating users in “batches” – e.g., based on user hash or region, which a service mesh or gateway could do. In general, prefer stateless handling during the migration if possible.
- Rollback Plan: For each stage, have a quick rollback plan. With weighted DNS, rollback = set AWS weight to 0 (or lower it). With LB, rollback = route 100% back to GCP. With mesh, rollback = flip the weight back. These can happen very fast (seconds) if automated or a simple config change. Also, ensure engineers are ready during changes to address any surprise (perhaps do changes during a low-traffic period initially, though with proper canarying you can even do it during normal hours).
- Final Cutover and Cleanup: Once AWS is handling all traffic smoothly and you’ve run like that for some time (to ensure stability), you can decommission the GCP side. This might involve deleting the weighted DNS policy (or removing the old IP), or removing the GCP backend from the LB, etc. Do this only after you’re confident – you might choose to leave the dual setup running for a “bake-in” period (e.g., a week of 100% on AWS but GCP still on standby). That way, if you unexpectedly need to fail back, you can just reintroduce the weights. When fully done, turn off the GCP services to avoid incurring cost. Also raise DNS TTLs if you lowered them.
By following these practices, you can achieve a zero-downtime migration. In fact, users should not even notice the transition if done correctly. Many companies have done cloud-to-cloud migrations in this fashion (weighted DNS or L7 splitting) without their users ever being aware of the backend move. The combination of careful traffic management and comprehensive monitoring is key to success (Best Practices for Zero Downtime Migration to AWS | ClearScale) (Application Load Balancer overview | Load Balancing | Google Cloud).
Recommendation and Example Strategy
Considering the scenario (DNS in Google Cloud, services in Cloud Run/GKE moving to EKS), a two-phase approach might work best:
- Phase 1: Weighted DNS Cutover (Quick Win). Set up weighted DNS records for your prod, stage, uat domains to start introducing AWS. This leverages your existing Cloud DNS setup with minimal overhead. For example, in staging or UAT, you could begin sending a portion of traffic to EKS to test it under real load. This is straightforward to implement and requires no new components – ideal for early testing. Make sure the AWS environment’s domain/IP is configured in Cloud DNS with a small weight and gradually increase it (Best Practices for Zero Downtime Migration to AWS | ClearScale). Monitor results. This phase gives you confidence in AWS and is easy to roll back by adjusting DNS weights.
- Phase 2: Consider Global Load Balancer for Prod (if needed). For production, where zero downtime and fast reactions are paramount, you might introduce the GCP global load balancer in front of prod traffic. This adds more control – for instance, if during prod migration a problem occurs, the LB will automatically fail back to GCP in milliseconds (due to health checks) rather than waiting for DNS. You could either switch prod DNS to the LB from the start (and then use LB splitting), or continue with weighted DNS but with very aggressive TTLs and perhaps script-based health check adjustments. The LB approach could be implemented in parallel: you can pilot it with one service or domain first. If implementing the LB is too time-consuming or not feasible, staying with Weighted DNS for prod is still a valid strategy (just make sure to have those health checks and low TTL).
In either case, planning and testing are crucial. Test the weighted routing in a lower environment (e.g., use stage.example.com with 50/50 weights and see how the traffic flows). Test failure scenarios (e.g., what happens if the AWS service is down – does DNS or LB correctly keep traffic on GCP?). Also, test the performance when some users are served from AWS – ensure your CDN, if any, or client-side logic, works the same against both.
For a concrete example, suppose api.prod.example.com is currently a DNS record pointing to a Cloud Run custom domain mapping (which resolves to a Google front end). You want to introduce the AWS EKS service, which is exposed via an Amazon ALB (say URL eks-prod-123.us-east-1.elb.amazonaws.com). Here’s how you might proceed:
1. Set up the AWS Endpoint: Ensure the AWS ALB is up and has the EKS service registered; it will have a DNS name. For weighted routing via Cloud DNS, you could either use an A record if the load balancer has static IPs (ALBs usually don’t, but NLBs do), or use a CNAME approach. Cloud DNS allows weighted CNAMEs as well – you would create two CNAME records for api.prod.example.com, one pointing to the Cloud Run domain and one to the ALB domain, with weights (a hedged gcloud sketch follows this list).
2. Lower TTL: Set the api.prod.example.com TTL to 30s (from perhaps 300 or 3600) at least an hour before starting.
3. Add AWS with 0 weight: Initially, add the AWS record with weight 0 (or a very small fraction). This ensures the record is in place but essentially nobody (or almost nobody) will get it until you raise it. Or start with a token 5% if you’re feeling confident enough to test.
4. Gradually Increase Weight: Over a period of hours or days, raise AWS to 10%, then 25%, 50%, etc. At each step, use metrics from both sides to verify system behavior.
5. 100% and Monitor: Eventually set 100% AWS, 0% GCP. Keep GCP instances running but receiving no traffic (they’re effectively on hot standby). After a stable period, you can remove the weighted policy (replace with a simple CNAME or A to AWS) or keep a failover record (primary AWS, secondary GCP) as a safety net.
6. Post-Cutover: Increase DNS TTL to normal (to improve cache efficiency). Decommission GCP resources if no longer needed.
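A hedged gcloud sketch of the weighted CNAME record described in step 1, shown here with a token 5% going to AWS (the managed zone name is a placeholder, and the Cloud Run target assumes the typical ghs.googlehosted.com domain-mapping CNAME):
gcloud dns record-sets create api.prod.example.com. --type=CNAME --ttl=30 \
--zone="prod-zone" \
--routing-policy-type=WRR \
--routing-policy-data="0.95=ghs.googlehosted.com.;0.05=eks-prod-123.us-east-1.elb.amazonaws.com."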
This approach would achieve the goal with essentially no downtime. Even the final step of going 100% AWS is not a “hard cut” – by that point, most users were already on AWS; it’s just the last portion.
If absolute instantaneous failover is required, adding the global LB in step 3 could replace steps 3-5: you’d point DNS to the LB, and let the LB handle the gradual routing. That might be more complex initially but gives more confidence for mission-critical prod services.
Both methods (DNS vs LB) can even be combined: you could use weighted DNS to split between a GCP LB and an AWS LB if you wanted to double layer it. However, that’s usually unnecessary.
Conclusion: Start with what is simplest and meets your needs. Weighted DNS is often sufficient for a controlled migration and is directly supported by Google Cloud DNS (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog). If your use case demands tighter control (or you want to minimize reliance on DNS caching behavior), then introduce a global load balancer. In either case, careful incremental rollout and monitoring will ensure you achieve zero downtime and a successful migration to AWS EKS. The end result will be that users are smoothly transitioned to AWS with no disruption, and you can shut down the GCP services once confidence is high that AWS is running perfectly.
References: Weighted DNS routing and failover techniques (DNS routing policies for geo-location & weighted round robin | Google Cloud Blog) (Best Practices for Zero Downtime Migration to AWS | ClearScale), advanced load balancers and traffic splitting (Application Load Balancer overview | Load Balancing | Google Cloud) (Traffic management overview for global external Application Load Balancers | Load Balancing | Google Cloud), and real-world zero-downtime migration practices (Best Practices for Zero Downtime Migration to AWS | ClearScale) were all considered in devising this plan. Following these best practices will help ensure a seamless hybrid operation and cutover. Good luck with your migration!