Susan Potter

Dynamically scaling a news and activism hub (5x traffic in 20 mins)

C-U Cloud Meetup / April 26, 2019 - Champaign, IL

Keywords

  • AWS
  • Auto Scaling
  • EC2
  • Infrastructure

Changelog

  • Since 2019, when this talk and its material were created, AWS has deprecated Launch Configurations. From my recent dynamic auto scaling forays in AWS, I can recommend Launch Templates as the replacement for LCs (as AWS also recommends).
  • May 2023: Almost all of the information here, aside from Launch Configurations, remains relevant to building auto scaling infrastructures on EC2 cloud primitives. I have since used serverless primitives in production and have recently written about The Pitfalls of Serverless.
  • June 2023: A former coworker I hired onto my team for this engagement to dynamically scale the news discussion hub recently wrote up the pre-baked AMI work he did in Deploying NixOS Disk Images At Speed. He is currently between roles, so this is a rare opportunity to snap him up for your infrastructure engineering openings. I would if I could.

Abstract

On any given day we can receive traffic peaks of up to five times our base traffic, sometimes requiring us to scale out to double our backend app server capacity within a 10-20 minute window (sometimes at unpredictable times). In this talk, Susan Potter will discuss our use of auto scaling in EC2, from the essential components to some gotchas learned along the way.

Notes

In 2016, I was tasked with scaling a political news discussion website that experienced wild fluctuations in traffic throughout the day, driven by breaking news. The site delivered news, fostered discussions, and mobilized campaigns for over two million daily users.

The backend system was read-heavy (in terms of the number of backend operations), but content publishing (the writes) was slow and expensive. This was complicated by user surges during breaking news events, no separation (at the time) of the read and write paths, and a convoluted, monolithic codebase.

The original news site had its moments of glory, but it faltered when the intensity of breaking news flooded our servers. Our backend infrastructure couldn't keep up with demand during breaking-news traffic spikes. Capistrano, our deployment method, had horrific failure modes during scale-out and scale-in events, leaving us desperately seeking a more reliable alternative. The host-level Chef configuration carefully crafted by a prior team member failed in increasingly epic ways, requiring ever more effort to maintain for dwindling returns.

As I surveyed our cloud infrastructure, I realized that dynamic auto scaling with pre-baked AMIs held the key to our salvation. We needed our services to adapt and flex in real time, like the ever-shifting tides of post-truth US politics causing the traffic spikes. It was time to embrace the cloud's dynamic scaling primitives.

In the early days of the journey, in September 2016, there was only one service in a static auto scaling group (ASG). Scaling policies were nonexistent, and manual modifications required human babysitting—a gasp-worthy revelation in the age of automation. To make matters worse, our services relied on aging base images that failed to converge because external APT source dependencies updated out-of-band. The bootstrapping process was agonizingly slow, often exceeding the critical 15-minute mark.

Today, in April 2019, all of our services run in active dynamic auto scaling groups. We leverage AWS's high-level primitives to construct a resilient infrastructure. Our frontend caching and routing services, both content publishing backends, and even internal systems such as logging and metrics are deployed this way.

The secret of our success so far is a combination of AWS primitives:

  • Auto Scaling Groups (ASGs)
  • Launch Configurations (LCs)
  • Scaling Policies
  • Lifecycle Hooks
  • pre-baked AMIs (though a large part of the secret here is due to NixOS' reproducibility)

These cloud primitives now enable us to scale dynamically, predictably, and efficiently.
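
To make these pieces concrete, here is a minimal boto3 sketch (Python) of how they fit together: a Launch Configuration that describes what to boot, an Auto Scaling Group that grows and shrinks the fleet, and a lifecycle hook that gates new instances. Every name and identifier below is a placeholder for illustration, not our production configuration.

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Launch Configuration: which AMI and instance type new instances boot from.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="frontend-lc-v1",
        ImageId="ami-0123456789abcdef0",   # pre-baked NixOS AMI (placeholder id)
        InstanceType="c5.large",
        SecurityGroups=["sg-0123456789abcdef0"],
    )

    # Auto Scaling Group: the fleet that scales out and in.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="frontend-asg",
        LaunchConfigurationName="frontend-lc-v1",
        MinSize=4,
        MaxSize=20,
        DesiredCapacity=4,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
        HealthCheckGracePeriod=90,
    )

    # Lifecycle hook: hold each new instance in Pending:Wait until it reports
    # ready (or the heartbeat times out and the instance is abandoned).
    autoscaling.put_lifecycle_hook(
        LifecycleHookName="frontend-launch-gate",
        AutoScalingGroupName="frontend-asg",
        LifecycleTransition="autoscaling:EC2_INSTANCE_LAUNCHING",
        HeartbeatTimeout=300,
        DefaultResult="ABANDON",
    )

An instance (or an external agent watching it) signals readiness with complete_lifecycle_action, at which point the ASG moves it to InService.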

Along our journey, we discovered the art of fine-tuning. Auto Scaling Group properties became our playground, and we adjusted them to match the demands of our traffic. Our delivery pipeline creates Launch Configurations for each service type and builds pre-baked images for each change, following our software development lifecycles.
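
As a rough illustration of the kind of tuning involved (the values and names here are placeholders rather than our production settings), a group's capacity bounds, cooldown, and health check grace period can all be adjusted in place:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Tune how aggressively the group scales: capacity bounds, the cooldown
    # between simple scaling activities, and how long a fresh instance gets
    # before health checks count against it.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="frontend-asg",
        MinSize=4,
        MaxSize=24,
        DefaultCooldown=120,
        HealthCheckGracePeriod=60,   # pre-baked AMIs boot quickly, so this can be short
        TerminationPolicies=["OldestLaunchConfiguration", "ClosestToNextInstanceHour"],
    )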

The heart of our dynamic scaling lies in the magic of scaling policies. These allowed us to cast a protective shield around our infrastructure. We defined relevant metrics per service type, chose their adjustment types, and watched with satisfaction as they responded to the ebb and flow of our traffic without human interference.
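
As a hedged sketch of two policy flavors this involves (the numbers are illustrative, not ours): a target tracking policy that follows a predefined metric, and a step scaling policy with an explicit adjustment type, which is triggered by a CloudWatch alarm whose alarm action is the policy's ARN.

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Target tracking: keep average CPU near a target and let AWS work out
    # the capacity adjustments in both directions.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="frontend-asg",
        PolicyName="frontend-cpu-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 55.0,
        },
    )

    # Step scaling: explicit step sizes relative to a CloudWatch alarm's
    # threshold; the alarm invokes this policy as its alarm action.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="frontend-asg",
        PolicyName="frontend-surge-steps",
        PolicyType="StepScaling",
        AdjustmentType="ChangeInCapacity",
        EstimatedInstanceWarmup=60,
        StepAdjustments=[
            {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 20.0,
             "ScalingAdjustment": 2},
            {"MetricIntervalLowerBound": 20.0, "ScalingAdjustment": 4},
        ],
    )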

But it wasn't just AWS primitives that we optimized. We eliminated as much of the bootstrapping process from runtime as possible and embraced fully baked AMIs with a Linux distribution that enables congruent configuration (NixOS). Our time-to-liveness dropped by an order of magnitude (~900s to ~50s), ensuring faster scale-ups and less downtime. We achieved what once seemed impossible—a congruence of configuration across all nodes.

Our quest for efficiency didn't end there. We set out to right-size each service, choosing the perfect instance type based on its unique resource requirements.

No longer shackled by arbitrary choices, we analyzed resource usage in peak, typical, and overnight resting states. By understanding each service's needs, we optimized not only our performance but also our costs.
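
For instance, a week of hourly CPU statistics per group from CloudWatch is enough to compare peak, typical, and overnight usage before settling on an instance type (the group name below is a placeholder, and the same query works for any metric you publish yourself):

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    # Hourly average and maximum CPU for one group over the past week.
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "frontend-asg"}],
        StartTime=now - timedelta(days=7),
        EndTime=now,
        Period=3600,
        Statistics=["Average", "Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))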

Purely CPU-based policies no longer sufficed. We whispered custom metrics to AWS CloudWatch and responded to them. Now, even during the most intense traffic spikes, our site's reliability doesn't waver.
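
In sketch form (the namespace, metric name, and values are placeholders I made up for illustration): a service publishes its own metric to CloudWatch, and a target tracking policy scales the group on it instead of, or alongside, CPU.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Publish a service-level metric, e.g. per-instance request queue depth.
    cloudwatch.put_metric_data(
        Namespace="NewsHub/Frontend",
        MetricData=[{
            "MetricName": "RequestQueueDepth",
            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "frontend-asg"}],
            "Value": 17.0,
            "Unit": "Count",
        }],
    )

    # Scale the group to hold that custom metric near a target value.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="frontend-asg",
        PolicyName="frontend-queue-depth-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "CustomizedMetricSpecification": {
                "Namespace": "NewsHub/Frontend",
                "MetricName": "RequestQueueDepth",
                "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "frontend-asg"}],
                "Statistic": "Average",
                "Unit": "Count",
            },
            "TargetValue": 10.0,
        },
    )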

Today to deploy we:

  • build, test, and register AMIs for each service (in parallel)
  • create fresh Launch Configurations (again for each service)
  • update our ASGs to use the new LCs

This allows us to transition between old and new deploys with failure handling such that common failures do not impact the live site.
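
The sketch below corresponds roughly to those three steps for a single service; every identifier is a placeholder, and the real pipeline wraps these calls in the build, test, and failure handling described above.

    import time

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    version = int(time.time())  # in practice, a release identifier from the pipeline

    # 1. Register the freshly built disk image as an AMI (the snapshot id comes
    #    from the image build step; this one is a placeholder).
    ami = ec2.register_image(
        Name=f"frontend-nixos-{version}",
        Architecture="x86_64",
        VirtualizationType="hvm",
        EnaSupport=True,
        RootDeviceName="/dev/xvda",
        BlockDeviceMappings=[{
            "DeviceName": "/dev/xvda",
            "Ebs": {"SnapshotId": "snap-0123456789abcdef0", "DeleteOnTermination": True},
        }],
    )

    # 2. Create a fresh Launch Configuration pointing at the new AMI.
    lc_name = f"frontend-lc-{version}"
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=lc_name,
        ImageId=ami["ImageId"],
        InstanceType="c5.large",
        SecurityGroups=["sg-0123456789abcdef0"],
    )

    # 3. Point the ASG at the new LC; instances launched from now on boot the
    #    new image, and old instances are cycled out as the group scales.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="frontend-asg",
        LaunchConfigurationName=lc_name,
    )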

Our tale doesn't end here though. We continue to evolve and improve based on our production usage, data, and experiments.

If you enjoyed this content, please consider sharing via social media, following my accounts, or subscribing to the RSS feed.