When we migrated hundreds of millions of event records from Amazon Elasticsearch to AWS OpenSearch, we expected the hardest part to be the scale: multiple terabytes of data, thirteen monthly indices, and years of call history.
The hardest part turned out to be a single line of JSON, and what it was silently about to do to 13.5 million documents.
This never became an incident. We caught it during a routine pre-flight check that the migration plan builds in before every reindex — well before a single production document was touched. This is the story of what we found, why it would have been invisible, and how the discipline of the plan itself is what surfaced it.
Why we migrated
Aircall is a cloud-based phone system handling millions of calls a day — every call event (assignment, completion, enrichment) flows through this pipeline, indexed for search and analytics.
That pipeline ran on Elasticsearch 7.10, which was nearing the end of support. We migrated to AWS OpenSearch 3.5 for dedicated domains per workload, better compression, and improved GDPR data residency.
How dual-write works — and the plan it's part of
Dual-write is the safety net that makes a zero-downtime migration possible: once enabled, every write the application makes — a new call event, a status change, an enrichment update — goes to both clusters at once. Elasticsearch stays authoritative until cutover; OpenSearch receives a real-time mirror of all new traffic. The application generates one version number per write and sends the same value to both clusters, so every dual-written document carries an identical _version on ES and OpenSearch — a detail that turns out to shape the entire conflict pattern we'd see later.
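A minimal sketch of that write path, for illustration only. `VersionedStore` and `InMemoryStore` are hypothetical stand-ins for the real cluster clients, and the millisecond timestamp is an assumed versioning scheme; the actual scheme is not described here.

```python
import time
from typing import Protocol


class VersionedStore(Protocol):
    """Hypothetical stand-in for a cluster client (not a real library API)."""

    def index(self, doc_id: str, body: dict, version: int) -> None: ...


class InMemoryStore:
    """Toy store that records (version, body) per document ID."""

    def __init__(self) -> None:
        self.docs: dict[str, tuple[int, dict]] = {}

    def index(self, doc_id: str, body: dict, version: int) -> None:
        self.docs[doc_id] = (version, body)


def dual_write(primary: VersionedStore, mirror: VersionedStore,
               doc_id: str, body: dict) -> int:
    # One version per logical write, generated once by the application.
    # Assumption: a millisecond timestamp; the real scheme may differ.
    version = time.time_ns() // 1_000_000
    primary.index(doc_id, body, version)  # Elasticsearch stays authoritative
    mirror.index(doc_id, body, version)   # OpenSearch receives the same version
    return version


es, opensearch = InMemoryStore(), InMemoryStore()
written_version = dual_write(es, opensearch, "call-1", {"status": "completed"})
print(es.docs["call-1"][0] == opensearch.docs["call-1"][0])  # same version on both
```

Because the version is computed once and passed to both writers, the two copies of a dual-written document can never disagree on `_version`.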
That dual-write step sits inside a larger plan, run in two phases, smallest first: the dashboard cluster (users, teams, numbers) as a low-risk pilot, then the large call-history cluster across thirteen monthly indices once the pattern was proven. For each cluster, the sequence is identical — enable dual-write, backfill history via remote reindex, validate parity, then move reads progressively behind a feature flag (5% → 25% → 50% → 100%). Every stage is reversible: if dual-write or read traffic misbehaves, the flag rolls back with no data loss, and the source clusters stay alive through a grace period.
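One common way to implement a progressive read rollout like this is stable hash bucketing. The sketch below assumes a hypothetical routing key (for example an account ID); how the real feature flag selects traffic is not specified here.

```python
import hashlib

ROLLOUT_STAGES = (5, 25, 50, 100)  # the percentages named in the plan


def bucket(key: str) -> int:
    """Map a key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def reads_from_opensearch(key: str, rollout_percent: int) -> bool:
    """True if reads for this key should go to the new cluster."""
    return bucket(key) < rollout_percent


# Hypothetical key; any stable identifier works.
for percent in ROLLOUT_STAGES:
    print(percent, reads_from_opensearch("account-42", percent))
```

Since a key's bucket never changes, a key that moves to OpenSearch at one stage stays there at every later stage, and setting the percentage back to 0 routes all reads to the source again.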
The implication for backfill: by the time we run the reindex, the target is no longer empty — it already holds weeks of live production writes. Any migration tool designed for an empty target is now operating outside its design envelope, silently. Which is exactly why a pre-flight count check runs before every reindex.
The pre-flight check that changed everything
Before reindexing the May 2026 index (~85M documents), we ran a count check to confirm the target was empty. It wasn't:
13.5 million documents were already there. A date histogram showed sparse early-month coverage, then full days from the end of May — dual-write had been active for over a week. These were live production writes, and we were about to run a reindex on top of them.
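A pre-flight check of this kind can be very small. The sketch below is an assumption about its shape, not the actual script: `fetch_count` calls the standard `_count` endpoint without authentication (a real cluster would need it), and `assert_target_empty` is a hypothetical helper name.

```python
import json
import urllib.request


def fetch_count(base_url: str, index: str) -> int:
    """Call GET /<index>/_count and return the document count."""
    with urllib.request.urlopen(f"{base_url}/{index}/_count") as resp:
        return int(json.load(resp)["count"])


def assert_target_empty(target_count: int) -> None:
    """Stop the run if the destination already holds documents."""
    if target_count > 0:
        raise RuntimeError(
            f"target already holds {target_count:,} documents; "
            "do not run a reindex that assumes an empty destination"
        )
```

In use, the count returned by `fetch_count` for the destination index would be passed to `assert_target_empty` before any reindex request is sent.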
The hidden risk in a 'clean' migration tool
The original reindex script was built for empty targets. Its destination had no version_type, which means OpenSearch defaults to internal versioning — source always wins on conflict, unconditionally.
But some of those 13.5M documents had been updated on OpenSearch after their original Elasticsearch write — a call completed, enrichment added, an assignment changed — without being replicated back to ES. Running the unpatched script would have streamed all 85M docs from ES and, for every overlapping doc, silently overwritten the newer OpenSearch version with the older Elasticsearch copy. It would have returned Failures: 0 and looked completely successful.
No error. No alert. The only signal would have been customer support tickets, days later, about missing or incorrect call data.
The fix: version-aware reindex
Behind the migration was a small Python script that wrapped OpenSearch's reindex API — it built the request body, fired the call, polled the _tasks API for progress, and printed a config header at startup. The --dry-run and header-check guards are behavior of this script, not OpenSearch features.
The patch added an opt-in --merge-mode flag that sets two fields on the reindex request: version_type=external on the destination, and conflicts: proceed.
With version_type=external, the source wins only if its version is strictly greater — genuinely newer. Otherwise, the write is rejected and counted as version_conflicts. conflicts: proceed keeps the task running. Default behaviour is unchanged; the flag is opt-in per run.
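As a rough sketch, the request body with and without the flag might be built like this. The function name, host, and index name are hypothetical; `conflicts` and `dest.version_type` are the two reindex fields described above.

```python
import json


def build_reindex_body(remote_host: str, index: str, merge_mode: bool) -> dict:
    """Build a remote reindex body; merge mode adds the two safety fields."""
    body = {
        "source": {"remote": {"host": remote_host}, "index": index},
        "dest": {"index": index},
    }
    if merge_mode:
        # Keep the task running when a write is rejected.
        body["conflicts"] = "proceed"
        # Source overwrites the target only if its version is strictly greater.
        body["dest"]["version_type"] = "external"
    return body


# Hypothetical host and index name, for illustration only.
print(json.dumps(
    build_reindex_body("https://es.example.internal:9200", "calls-2026-05", True),
    indent=2,
))
```

Without the flag, the body carries neither field, which matches the unchanged default behaviour.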
| Setting | Layer | Purpose |
|---|---|---|
| | READ | Connect to the ES 7.x source cluster (without it, the reindex moves 0 docs, silently) |
| | WRITE | Reject stale overwrites on version conflict |
Same word, two different layers — easy to conflate, and both were required in production.
This never turned into a data-loss incident
What happened
The fix was written, but the safeguards around running it didn't exist yet. While the patch was still in review, we launched the live run and pasted from the wrong command block in the terminal. The unpatched version started. We caught it within 45 seconds, canceled at 45,000 documents scanned, before the reindex reached the overlap region. No data damaged — but only by luck of timing.
The patch prevented the silent overwrite. It did nothing to prevent the launch of the wrong command in the first place. Those are two different failure modes, and the second one needed its own answer.
The response: two mandatory rails
After that, the runbook gained two mandatory rails for any version-aware migration run:
Rail 1 — Mandatory dry-run before every live launch
Same command, same flags, add --dry-run. The script prints its configuration header and baseline counts, then exits before any reindex starts. Cost: five seconds. Catches wrong flags, wrong endpoints, and missing indices.
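A dry-run guard of this kind could look roughly like the sketch below. The `--index` flag, the header layout, and the `start_reindex` callback are assumptions made for illustration; only `--dry-run` and `--merge-mode` come from the script described above.

```python
import argparse
from typing import Callable


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="remote reindex wrapper (sketch)")
    parser.add_argument("--index", required=True)  # hypothetical flag
    parser.add_argument("--merge-mode", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser.parse_args(argv)


def config_header(args: argparse.Namespace) -> str:
    # Hypothetical layout; the real script's header is not reproduced here.
    version_type = "external" if args.merge_mode else "internal (default)"
    return "\n".join([
        f"index        : {args.index}",
        f"merge_mode   : {args.merge_mode}",
        f"version_type : {version_type}",
        f"dry_run      : {args.dry_run}",
    ])


def main(argv: list[str],
         start_reindex: Callable[[argparse.Namespace], None]) -> bool:
    """Print the header, then start the reindex unless --dry-run is set."""
    args = parse_args(argv)
    print(config_header(args))
    if args.dry_run:
        return False  # exit before any reindex request is sent
    start_reindex(args)
    return True
```

The point of the design is that the dry run and the live run share one code path up to the header, so what the dry run prints is what the live run will use.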
Rail 2 — Mandatory header check within 5 seconds of live launch
After starting the live run, verify the log immediately:
If either line is wrong, there's a ~10-second window to cancel cleanly before the reindex has processed more than ~50,000 documents.
The dry-run catches construction errors. The header check catches paste errors. Both failure modes showed up during this near-miss. Together they cost 10 seconds per launch. Multiple cheap safety rails beat one elaborate one.
What the live run revealed
We expected version_conflicts to spike at 15–25% scanned. At 51.83%, it sat at just 1,875. Was the patch broken? No. We sampled the dual-write window and found that every document had the same _version on both clusters — because the app generates a single version per write. With external versioning, identical versions are rejected as conflicts, which is correct.
The spike hadn't arrived because remote reindex scrolls by segment order, not date. The dense dual-write docs were most recently written, so they sit in the last-scanned segments. The conflict count rose as a step function late in the run.
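The external-versioning rule that rejects identical versions can be modeled in a few lines. This is a toy in-memory model of the rule, not OpenSearch code, and the function name is hypothetical.

```python
def apply_external_write(target: dict, doc_id: str, version: int, body: dict) -> str:
    """Apply one reindexed document under external-versioning semantics."""
    existing = target.get(doc_id)
    if existing is None:
        target[doc_id] = (version, body)
        return "created"
    if version > existing[0]:  # strictly greater: source is genuinely newer
        target[doc_id] = (version, body)
        return "updated"
    return "version_conflict"  # equal or older: target is preserved


target = {"call-1": (7, {"status": "completed"})}
print(apply_external_write(target, "call-1", 7, {"status": "assigned"}))   # equal version
print(apply_external_write(target, "call-1", 8, {"status": "enriched"}))   # newer source
print(apply_external_write(target, "call-2", 1, {"status": "assigned"}))   # not on target
```

A dual-written document arrives from the source with the same version the target already holds, so it falls into the `version_conflict` branch and the target copy is left untouched.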
When you build a safety rail, build the instrument that proves it's working. Thirty seconds and five document IDs replaced a full run of gut anxiety with grounded confidence.
The run completed in 5 hours:
| Counter | Count | % | Meaning |
|---|---|---|---|
| Total | 84,986,805 | 100% | Scanned from source |
| Created | 70,798,760 | 83.3% | Inserts into empty regions |
| Updated | 634,962 | 0.7% | Source provably newer |
| VersionConflicts | 13,553,083 | 15.9% | Target preserved by --merge-mode |
| Failures | 0 | 0% | None |
70,798,760 + 634,962 + 13,553,083 = 84,986,805 — exactly the source total. Zero documents lost.
Under version-aware reindex, the counter meanings shift: updated now means "source provably newer — overwrite was correct," and version_conflicts means "target preserved — the exact documents the rail saved." Validation confirmed parity: all 15 indices were within 0.013% of the source, and 65/65 sample documents were found.
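The reconciliation itself is simple enough to automate. The sketch below assumes the input is the `status` object of a reindex task as returned by the _tasks API, and that only these three outcome counters are in play for the run; `unaccounted_docs` is a hypothetical helper name.

```python
def unaccounted_docs(status: dict) -> int:
    """Return how many scanned documents have no recorded outcome."""
    accounted = (
        status["created"]               # inserted into empty regions
        + status["updated"]             # source was strictly newer
        + status["version_conflicts"]   # target copy preserved
    )
    return status["total"] - accounted


# Final counters from the run described above.
final_status = {
    "total": 84_986_805,
    "created": 70_798_760,
    "updated": 634_962,
    "version_conflicts": 13_553_083,
}
print(unaccounted_docs(final_status))
```

A non-zero result would mean some scanned documents were neither written nor deliberately preserved, which is the case worth investigating before declaring parity.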
Lessons we're taking forward
The 'empty target' assumption decays the moment dual-write starts. The application's versioning contract extends to every writer of that index, including our migration tool.
Silent overwrites are the worst failure mode. A migration that errors is debuggable; one that completes cleanly while regressing data ships undetected. Build for explicit conflict counts and sample validation.
The progress bar is the wrong dashboard for safety. It counts scanned docs, not written outcomes. The real signal is the (created, updated, version_conflicts) breakdown from the _tasks API.
The final report is part of the migration. The validation log and runbook update are the migration's receipts: how you answer "did we have issues with that window?" six months later.
Where we are now
The full migration was completed with zero failures across all phases. Cutover is done. The migration touched hundreds of millions of documents — and the most consequential change was 32 lines of Python and a process discipline that costs 10 seconds per launch.
How many migration scripts have this exact bug, waiting for the day someone introduces dual-write? The fix is two lines of JSON. The cost of not fixing it is 13.5 million documents.
Published on June 26, 2026.


