Plan a cloud migration without importing old operational risk

Many cloud migrations fail quietly. The VMs move, the invoice changes, and the organisation keeps the same manual provisioning habits, unclear ownership, access sprawl, and untested recovery paths it had before.

The objective is not relocation. The objective is to arrive with a platform that engineering teams can operate, audit, and hand off.

When this applies

Use this framework when moving from on-prem, colo, managed hosting, or another cloud into AWS, Azure, Google Cloud, or Alibaba Cloud across multiple environments and teams.

It is especially relevant when:

application teams depend on manual provisioning or undocumented runbooks
production access is broader than current risk justifies
cost growth is already a concern before the migration starts
the target cloud is expected to become the long-term operating model, not a temporary lift-and-shift destination

What usually goes wrong

Migrations tend to import old risk when teams optimise for speed of movement instead of clarity of ownership.

Common failure modes:

Lift-and-shift without a landing zone. Workloads arrive in a default VPC or subscription with no shared standards for tagging, logging, backups, or network boundaries.
Identity copied forward. Broad admin access from the old world is recreated in the new cloud because it unblocks the cutover window.
No decommission plan. Old environments stay running "just in case," doubling cost and preserving two operating models.
Application-first sequencing. Teams migrate apps before DNS, CI, artifact storage, secrets, or observability foundations exist.
Cutover without rollback criteria. The migration window becomes the plan, instead of one controlled step in a larger runbook.

Decision framework

Before moving application workloads, decide these platform boundaries explicitly.

1. Account or subscription structure

Separate production, staging, shared services, security, and logging where the cloud provider and operating model allow it. The goal is blast-radius control and clearer cost attribution, not account sprawl for its own sake.

2. Network topology

Define VPC or VNet layout, ingress and egress paths, private service access, cross-environment connectivity, and whether workloads need direct internet egress or centralised egress controls.
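
One part of the layout that is easy to check mechanically is address-range overlap, since overlapping ranges tend to complicate cross-environment connectivity later. The following is a minimal sketch using Python's standard ipaddress module. The environment names and CIDR blocks in the usage note are hypothetical placeholders, not recommendations.

```python
import ipaddress
from itertools import combinations


def overlapping_ranges(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of environment names whose address ranges overlap."""
    # Parse each CIDR string once; ip_network raises ValueError on bad input.
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]
```

Called with a plan such as {"prod": "10.0.0.0/16", "staging": "10.1.0.0/16", "shared": "10.0.128.0/20"}, the function reports the prod and shared ranges as overlapping, which is the kind of finding that is cheaper to fix before any workload moves.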

3. Identity and access

Decide who can deploy, who can change production infrastructure, how emergency access works, and how access is reviewed after migration. This should be defined before the first production cutover, not after the first incident.
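
A post-migration access review can start as a simple filter over exported grants. The sketch below assumes a hypothetical Grant record and hypothetical role names ("admin", "owner"); real exports from a cloud provider will have a different shape. It flags broad production grants that either never expire or are already past their expiry date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical role names; map these to the broad roles in the target cloud.
BROAD_ROLES = frozenset({"admin", "owner"})


@dataclass(frozen=True)
class Grant:
    principal: str
    role: str
    environment: str
    expires: Optional[date] = None  # None means the grant never expires


def grants_needing_review(grants: list[Grant], today: date) -> list[Grant]:
    """Return broad production grants that are permanent or past their expiry."""
    flagged = []
    for grant in grants:
        broad_in_prod = grant.environment == "production" and grant.role in BROAD_ROLES
        stale = grant.expires is None or grant.expires < today
        if broad_in_prod and stale:
            flagged.append(grant)
    return flagged
```

Giving emergency access an expiry date by default means a cutover-window exception shows up in this review instead of quietly becoming standing access.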

4. Infrastructure as Code baseline

Standardise modules, tagging, state management, environment promotion, and approval paths. Manual exceptions created during migration have a habit of becoming permanent.
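
A tagging standard is only useful if something checks it. As an illustration, the sketch below validates resources against a required tag set and lists resources that were not created through the IaC pipeline. The tag keys and the "managed-by" convention are assumptions made for this example, not a provider standard.

```python
from dataclasses import dataclass, field

# Hypothetical tag policy for illustration only.
REQUIRED_TAGS = ("owner", "environment", "cost-centre", "managed-by")


@dataclass
class Resource:
    name: str
    tags: dict[str, str] = field(default_factory=dict)


def missing_tags(resource: Resource, required: tuple = REQUIRED_TAGS) -> list[str]:
    """Return required tag keys that are absent or blank on the resource."""
    return [key for key in required if not resource.tags.get(key, "").strip()]


def manual_exceptions(resources: list[Resource]) -> list[str]:
    """Return names of resources not marked as created by the IaC pipeline."""
    return [r.name for r in resources if r.tags.get("managed-by") != "iac"]
```

Running a check like this in the promotion path keeps the list of manual exceptions visible, which makes it harder for them to become permanent unnoticed.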

5. Observability minimum

Every migrated service needs a named owner, log destination, health signal, and alert path before it is considered production-ready in the target cloud.
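
The four requirements above translate directly into a readiness gate. This is a minimal sketch; the ServiceRecord type and its field names are hypothetical and would normally come from a service catalogue.

```python
from dataclasses import dataclass
from typing import Optional

# The four fields the observability minimum asks for.
REQUIRED_FIELDS = ("owner", "log_destination", "health_signal", "alert_path")


@dataclass
class ServiceRecord:
    name: str
    owner: Optional[str] = None
    log_destination: Optional[str] = None
    health_signal: Optional[str] = None
    alert_path: Optional[str] = None


def readiness_gaps(service: ServiceRecord) -> list[str]:
    """Return the observability fields that are still unset or empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(service, name)]


def is_production_ready(service: ServiceRecord) -> bool:
    """A service is ready only when every required field is filled in."""
    return not readiness_gaps(service)
```

A service with an owner and a log destination but no health signal or alert path would report two gaps and stay out of the production-ready list.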

6. Cutover and rollback criteria

Document data sync expectations, DNS or traffic switch steps, rollback triggers, and the validation checklist that must pass before old capacity is retired.
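
Rollback triggers are easier to honour during a cutover when they are written down as thresholds rather than judgement calls. The sketch below shows one possible shape. The metrics chosen (error rate, p95 latency) and the check names are assumptions for illustration; real thresholds should come from the service's own baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCriteria:
    # Hypothetical thresholds, agreed before the cutover window opens.
    max_error_rate: float       # fraction of failing requests, e.g. 0.02 for 2%
    max_p95_latency_ms: float
    required_checks: tuple      # validation checks that must pass


def rollback_reasons(criteria, error_rate, p95_latency_ms, passed_checks):
    """Return the reasons to roll back; an empty list means the cutover holds."""
    reasons = []
    if error_rate > criteria.max_error_rate:
        reasons.append(
            f"error rate {error_rate:.2%} exceeds {criteria.max_error_rate:.2%}"
        )
    if p95_latency_ms > criteria.max_p95_latency_ms:
        reasons.append(
            f"p95 latency {p95_latency_ms:.0f} ms exceeds "
            f"{criteria.max_p95_latency_ms:.0f} ms"
        )
    for check in criteria.required_checks:
        if check not in passed_checks:
            reasons.append(f"validation check not passed: {check}")
    return reasons
```

Returning reasons instead of a bare boolean gives the person running the cutover something concrete to record in the runbook when a rollback is triggered.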

Recommended migration sequence

A repeatable sequence reduces rework:

Stand up the landing zone with guardrails, logging, backups, and identity patterns.
Migrate shared services first. DNS, CI runners, artifact storage, secrets, and monitoring often unblock application teams more than moving one app early.
Run one bounded pilot end to end through build, deploy, observe, and recover.
Turn the pilot into a runbook. Capture prerequisites, validation steps, rollback, and decommission actions.
Batch similar workloads by architecture pattern, not by organisational convenience alone.
Retire old provisioning paths as soon as traffic and data are stable in the target environment.
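
The sequence above is in effect a dependency ordering: foundations first, applications last. One way to make that explicit is a topological sort, sketched here with Python's standard graphlib module. The dependency map is a hypothetical example; each organisation's real dependencies will differ.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each key depends on every item in its set.
DEPENDENCIES = {
    "landing-zone": set(),
    "dns": {"landing-zone"},
    "secrets": {"landing-zone"},
    "artifact-storage": {"landing-zone"},
    "monitoring": {"landing-zone"},
    "ci-runners": {"landing-zone", "secrets"},
    "pilot-app": {"dns", "secrets", "ci-runners", "artifact-storage", "monitoring"},
}


def migration_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every item comes after its dependencies.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())
```

With this map the landing zone always comes first and the pilot application last. The relative order of independent shared services is not significant, and a circular dependency surfaces as an error at planning time.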

Infrastructure as Code should begin in the first non-production environment. Waiting until "after the migration" usually means never standardising.

Cutover patterns

Choose the cutover model deliberately:

Rehost when speed matters and the application can tolerate the same operating assumptions in the new environment.
Replatform when managed services reduce operational load without rewriting the application.
Refactor selectively only where the business case and team capacity justify the extra migration cost.

For data-bearing systems, define replication lag tolerance, write freeze rules, verification queries, and the point at which the old write path is disabled.
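
These conditions can be combined into a single gate that decides when the old write path may be disabled. The sketch below is illustrative only: it assumes verification queries have already been run on both sides and their results collected into dictionaries keyed by table name (row counts, checksums, or similar). The function and parameter names are hypothetical.

```python
def replication_within_tolerance(lag_seconds: float, tolerance_seconds: float) -> bool:
    """True when the measured replication lag is inside the agreed tolerance."""
    return 0 <= lag_seconds <= tolerance_seconds


def verification_mismatches(source: dict, target: dict) -> list[str]:
    """Return tables whose verification result differs or exists on one side only."""
    tables = sorted(set(source) | set(target))
    return [table for table in tables if source.get(table) != target.get(table)]


def safe_to_disable_old_writes(
    lag_seconds: float,
    tolerance_seconds: float,
    writes_frozen: bool,
    source: dict,
    target: dict,
) -> bool:
    """All three conditions must hold: frozen writes, low lag, matching data."""
    return bool(
        writes_frozen
        and replication_within_tolerance(lag_seconds, tolerance_seconds)
        and not verification_mismatches(source, target)
    )
```

Keeping the three conditions separate makes it clear which one blocked the switch when the gate returns False.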

Cost and FinOps hooks

Migration is an expensive moment. Build cost visibility into the plan from the start:

enforce tagging before production cutover
set budgets or alerts per environment
identify idle legacy capacity with a fixed decommission date
review right-sizing after the first month of real traffic, not before traffic exists
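
Two of these hooks, budget alerts and decommission dates, reduce to small checks that can run on a schedule. The following is a minimal sketch; the 80% alert ratio and the environment names are assumptions chosen for the example, and spend figures would come from the provider's billing export.

```python
from datetime import date
from typing import Optional


def budget_status(spend: float, budget: float, alert_ratio: float = 0.8) -> str:
    """Classify spend against a budget as 'ok', 'alert', or 'over'."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if spend >= budget:
        return "over"
    if spend >= budget * alert_ratio:
        return "alert"
    return "ok"


def decommission_gaps(legacy: dict[str, Optional[date]], today: date) -> list[str]:
    """Return legacy environments with no decommission date or an overdue one."""
    return sorted(
        name for name, when in legacy.items() if when is None or when < today
    )
```

A legacy environment with no date at all is reported alongside overdue ones, because an open-ended "just in case" environment is the case most likely to be forgotten.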

Cost control should not block migration, but unmanaged spend is not evidence that the migration succeeded.

Verification before sign-off

A migration is not complete when traffic moves. It is complete when the target platform is operable without the old environment.

Confirm:

production access is least privilege and attributable
deployments do not require manual host changes
backup and restore have been tested in the target cloud
on-call ownership and runbooks exist for migrated services
cost tags, budgets, and service owners are assigned
old environments have named decommission dates

What to avoid

migrating first and standardising later
granting broad admin access to protect a cutover window
treating network and identity design as follow-up work
copying cron jobs, firewall rules, and SSH patterns without review
leaving legacy environments running without cost and security owners

Related work

This note supports Cloud migration and platform standardisation.