November 21, 2017

What it takes to keep Expedia.com up and running


At Expedia, the revenue lost to unplanned site unavailability is significant: moving from 99% to 99.9% site availability can make a difference of roughly $80M. Here's what Expedia's journey toward site resiliency looks like.

Expedia has a "test-and-learn" culture, and innovation is about constantly iterating on products and features. Resilience is not always treated as a first-class citizen: there are often too many competing priorities, there are major misconceptions about resilience, and team autonomy can make it challenging to spread learnings and tooling effectively.

To address these issues, Willie Wheeler, principal application engineer at Expedia, described how a shared learning space was created within Expedia. It facilitated the sharing of information about resilience and led to the creation of "resilience champions". Considerable effort also went into collecting and presenting baseline resiliency data so that teams could track improvements.

A large organisation such as Expedia has a plethora of tooling and platforms in use, and it can be a challenge to steer adoption. Wheeler explained that a focus on core principles was more valuable than a focus on individual tools, and shared how his team defined a "resilience engineering lifecycle."

Get the full story at InfoQ