2024-06-18

The Hidden Complexity of Cross-Cloud Queue Migration

AWS SQS, OCI Queues, Migration, Distributed Systems

Queue migration sounds simple: point producers at the new queue, drain the old one, done. In practice, it is one of the most dangerous infrastructure changes a team can make, because queues are the memory of a distributed system.

The Ordering Problem

Many cloud queue services do not guarantee strict ordering. But applications often evolve to depend on approximate ordering because they have a single producer and messages usually arrive in sequence. When switching to a new provider, different partitioning strategies can shuffle messages. Downstream consumers that assumed approximate ordering start corrupting state.

The fix is not in the queue. It is in the application: have the producer stamp every message with a sequence number, and give the consumer a reordering buffer that releases messages in sequence. This should be done from day one, not during the migration.
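To make the reordering buffer concrete, here is a minimal Python sketch. It assumes the producer stamps each message with a gap-free integer sequence number starting at 0; the class name `ReorderBuffer` and its `push` method are hypothetical, not part of any queue SDK.

```python
import heapq
from typing import Any


class ReorderBuffer:
    """Hold back early arrivals and release messages in sequence order."""

    def __init__(self, first_seq: int = 0) -> None:
        self.next_seq = first_seq                  # next sequence number we may release
        self._pending: list[tuple[int, Any]] = []  # min-heap keyed by sequence number
        self._buffered: set[int] = set()           # sequence numbers currently in the heap

    def push(self, seq: int, payload: Any) -> list[Any]:
        """Accept one message; return every payload that is now deliverable in order."""
        # Ignore duplicates and messages that were already released.
        if seq < self.next_seq or seq in self._buffered:
            return []
        heapq.heappush(self._pending, (seq, payload))
        self._buffered.add(seq)

        # Release the contiguous run that starts at next_seq.
        released: list[Any] = []
        while self._pending and self._pending[0][0] == self.next_seq:
            _, ready = heapq.heappop(self._pending)
            self._buffered.discard(self.next_seq)
            self.next_seq += 1
            released.append(ready)
        return released
```

A message that arrives early is held until the gap before it is filled. This sketch waits indefinitely for a missing sequence number; a production buffer would likely also need a timeout or size limit so that one lost message cannot stall the consumer.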

The Dual-Write Window

During migration, producers and consumers cannot all switch simultaneously. There is always a window where some producers write to the old queue and some to the new one. A common pattern is a "bridge consumer" that reads from the old queue and forwards to the new one. The bridge typically runs for weeks until all producers have switched.
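The core loop of such a bridge can be sketched as follows. This is an illustration only: `Message`, `SourceQueue`, `TargetQueue`, and `bridge_once` are hypothetical names standing in for whatever client libraries the two providers offer, and the sketch assumes each message carries a stable ID assigned by the producer.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Message:
    message_id: str  # assumed: stable ID assigned by the producer
    body: str


class SourceQueue(Protocol):
    """Hypothetical interface over the old provider's queue client."""

    def receive(self, max_messages: int) -> list[Message]: ...
    def ack(self, message: Message) -> None: ...


class TargetQueue(Protocol):
    """Hypothetical interface over the new provider's queue client."""

    def send(self, message: Message) -> None: ...


def bridge_once(source: SourceQueue, target: TargetQueue, batch_size: int = 10) -> int:
    """Forward one batch from the old queue to the new one; return the count forwarded."""
    forwarded = 0
    for message in source.receive(batch_size):
        # Forward first, acknowledge second: a crash between the two steps
        # produces a duplicate on the new queue rather than a lost message.
        target.send(message)
        source.ack(message)
        forwarded += 1
    return forwarded
```

Forwarding before acknowledging is a deliberate choice in this sketch. It trades the risk of losing a message for the risk of delivering it twice, which is the safer failure mode as long as consumers tolerate duplicates.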

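The bridge itself must be safe to restart. If it crashes after forwarding a message but before acknowledging it on the old queue, it will forward that message again when it comes back up. Every consumer therefore needs to handle duplicates, and building that idempotency is often the hardest part of the entire migration.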
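One common way to make a consumer tolerate duplicates is to track the IDs of messages it has already processed. The Python sketch below assumes every message carries a stable, unique ID that survives forwarding; `IdempotentConsumer` is a hypothetical name, and the in-memory set stands in for the durable store a real system would need.

```python
from typing import Callable


class IdempotentConsumer:
    """Wrap a handler so that each message ID is processed at most once."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        # In-memory only for illustration; a real consumer would persist this
        # (for example in a database) so it survives restarts.
        self._seen: set[str] = set()

    def handle(self, message_id: str, body: str) -> bool:
        """Process the message once; return False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeds, so a failed
        # attempt is retried instead of being silently dropped.
        self._seen.add(message_id)
        return True
```

For example, wrapping `processed.append` as the handler and calling `handle("m1", "x")` twice appends `"x"` only once. The set here grows without bound; a real deployment would usually expire IDs after the longest plausible redelivery window.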

What Teams Should Do Differently

Design and test the dual-write bridge from day one of migration planning, rather than bolting it on as an afterthought. Build idempotent consumers before the migration starts, not during the first incident.