Building resilient systems

The “UX” of infrastructure

Jason Gedge

Staff Production Engineer

Gadget

an ability to recover from or adjust easily to misfortune or change

Merriam-Webster
A sparse map of Springfield with just an icon for the nuclear reactor, and the text "Network error" flashing and the caption "*alarm blaring*"
A character from The Simpsons, holding a newspaper clipping of "Old man yells at cloud", showing Abe Simpson with a clenched fist at a cloud.
Homer Simpson, holding a burning Christmas tree, with the caption "Why do the things I love always burn?". The MySQL logo overlies the tree.

if it can break,

it will break

degraded

is better than

down

begin / rescue / ensure

rescue_from

rescue_from "BadInputError" do |exception|
  render(
    json: { message: exception.to_s },
    status: 422,
  )
end

Retries

try and try again

def with_retries(num = 3)
  yield
rescue
  num -= 1
  retry if num > 0
  raise
end

Consider using Fiber

if workload is complex

def try
  yield
rescue
  Fiber.yield
  retry
end

def work
  [
    try { … },
    try { … },
    try { … },
  ]
end

fiber = Fiber.new { work }
result = nil
(1..3).each do |_|
  result = fiber.resume
  return result unless fiber.alive?
end
"could not finish work"

Timeouts

but not the “in the corner” type

require "timeout"

begin
  Timeout.timeout(1) do
    do_work
  end
rescue Timeout::Error
  "timed out"
else
  "success result"
end

Circuit breakers

short circuiting failures

three circuit breakers of varying sizes

if failure is expected,

fail fast

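Closed -> Closed: success
Closed -> Closed: fail (under threshold)
Closed -> Open: fail (threshold reached)
Open -> Open: call (raise error)
Open -> Half Open: reset timeout
Half Open -> Closed: success
Half Open -> Open: fail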

See martinfowler.com for more info
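
The three states above can be sketched in a few lines of Ruby. This is a minimal illustration, not a production breaker: the class name `CircuitBreaker`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example, and the sketch is not thread-safe.

```ruby
# Minimal circuit breaker sketch (illustrative names, not from the talk).
class CircuitBreaker
  class OpenError < StandardError; end

  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  attr_reader :state

  def initialize(failure_threshold: 3, reset_timeout: 30, clock: MONOTONIC_CLOCK)
    @failure_threshold = failure_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def call
    if @state == :open
      # Open: fail fast until the reset timeout has elapsed.
      raise OpenError, "circuit open" if @clock.call - @opened_at < @reset_timeout
      # Reset timeout elapsed: allow one trial call through.
      @state = :half_open
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    record_success
    result
  end

  private

  # Any success closes the circuit and clears the failure count.
  def record_success
    @failures = 0
    @state = :closed
  end

  # A failed trial call, or reaching the threshold, opens the circuit.
  def record_failure
    @failures += 1
    return unless @state == :half_open || @failures >= @failure_threshold

    @state = :open
    @opened_at = @clock.call
  end
end
```

Wrapping a call looks like `breaker.call { fetch_things }`. While the circuit is open the block is never run, so a struggling dependency gets room to recover instead of more traffic.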

Bulkheads

creating the unsinkable

a diagram of a ship with bulkheads, and the various flooding conditions of the Titanic
Peter Griffin, from Family Guy, struggling to close a set of blinds
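
One way to read the bulkhead idea in code is a cap on concurrent calls into a single dependency, so one slow dependency cannot use up every worker. The sketch below assumes a hypothetical `Bulkhead` class that rejects calls once `max_concurrent` are in flight. Queueing instead of rejecting is an equally valid choice.

```ruby
# Bulkhead sketch: caps concurrent calls into one dependency.
# Class and error names are illustrative, not from the talk.
class Bulkhead
  class FullError < StandardError; end

  def initialize(max_concurrent)
    @max_concurrent = max_concurrent
    @in_flight = 0
    @lock = Mutex.new
  end

  def call
    acquire
    begin
      yield
    ensure
      # Always free the slot, even if the block raised.
      release
    end
  end

  private

  def acquire
    @lock.synchronize do
      raise FullError, "bulkhead full" if @in_flight >= @max_concurrent

      @in_flight += 1
    end
  end

  def release
    @lock.synchronize { @in_flight -= 1 }
  end
end
```

Giving each dependency its own instance (for example one for MySQL and one for a third-party API) keeps a flood in one compartment from sinking the rest.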

Caching

who wants to do the same thing twice?

Rails.cache.fetch(
  cache_key,
  expires_in: 2.hours
) do
  do_expensive_work
end

Fallbacks

when all else fails

Dipper Pines, from Gravity Falls, falling backwards down a staircase
def with_fallback(value)
  yield
rescue
  value
end
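
As a usage sketch, a fallback composes naturally with retries: retry a few times, then return a safe default. Both helpers are repeated from the slides so the example runs on its own. The failing block and the empty-array default are made up for illustration.

```ruby
def with_retries(num = 3)
  yield
rescue
  num -= 1
  retry if num > 0
  raise
end

def with_fallback(value)
  yield
rescue
  value
end

attempts = 0

# Try twice, then fall back to an empty list instead of raising.
result = with_fallback([]) do
  with_retries(2) do
    attempts += 1
    raise "service unavailable"
  end
end
```

Here `result` ends up as the fallback value after both attempts fail. Note that a bare `rescue` only catches `StandardError`, so the fallback will not hide things like `NoMemoryError` or `Interrupt`.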

Redundancy

a spinning image of my face, repeated many times across the slide

Pooling


Health checking

Idempotency keys

three plots of grass, a cartoon cow, a milk bottle
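
A rough sketch of idempotency keys: the caller sends a unique key with each request, and the server runs the work at most once per key, returning the stored result on repeats. `IdempotentRunner` and its in-memory hash are assumptions for the example. A real store would be shared and durable, and would need an expiry policy.

```ruby
# Idempotency key sketch with a hypothetical in-memory store.
class IdempotentRunner
  def initialize
    @results = {}
    @lock = Mutex.new
  end

  # Runs the block at most once per key; repeats return the stored result.
  # If the block raises, nothing is stored, so the caller can safely retry.
  def run(key)
    @lock.synchronize do
      return @results[key] if @results.key?(key)

      @results[key] = yield
    end
  end
end
```

This is what makes retries safe for non-idempotent work such as charging a card: `runner.run(key) { charge }` can be called again after a timeout without charging twice.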

Throttling

a hand squeezing a water hose, creating a kink
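
Throttling can take many forms. The sketch below shows one of the simplest, a fixed-window limiter that allows `limit` calls per `period` seconds. The `Throttle` class, its parameters, and the injectable `clock` are illustrative assumptions. Token buckets and sliding windows are common alternatives, and this version is not thread-safe.

```ruby
# Fixed-window throttle sketch (illustrative names, not from the talk).
class Throttle
  MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

  def initialize(limit:, period:, clock: MONOTONIC_CLOCK)
    @limit = limit
    @period = period
    @clock = clock
    @window_start = nil
    @count = 0
  end

  # Returns true if the call is allowed, false if it should be rejected.
  def allow?
    now = @clock.call
    # Start a fresh window on first use or once the current one has expired.
    if @window_start.nil? || now - @window_start >= @period
      @window_start = now
      @count = 0
    end
    return false if @count >= @limit

    @count += 1
    true
  end
end
```

A caller would check `throttle.allow?` before doing work and reject or delay the request when it returns false.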

Resiliency matrix

the red pill of failure modes

                                  Endpoint A    Endpoint B    ...
MySQL unavailable                 Unavailable   Degraded      Available
Redis unavailable                 Unavailable   Available     Available
No disk space                     Available     Degraded      Degraded
Third-party service unavailable   Unavailable   Available     Unavailable

Automated testing

resilient resiliency

stub_request(:post, "example.com")
  .with(body: {data: [1, 2, 3]})
  .to_return(status: [503, ""])

Toxiproxy[:mysql_master].down do
  assert_equal([], fetch_things)
end

Chaos engineering

regularly break things


Game days

so your tools are always ready to go

Simulate your failure modes

in a controlled fashion

Root cause analysis

when things do go wrong...

Every disruption

should be unique

Fancy tech

buzzwords, bandwagons, et al

envoy logo
aspen mesh logo
openshift logo
linkerd logo
istio logo
kubernetes logo
Marge and Homer Simpson walking into a store with the caption "So many things, and so many things of each thing"

Useful resources

there's only so much we can do in a single presentation...

front cover of the second edition of the book "Release It!"

Thanks for listening!

❤️ ❤️ ❤️ ❤️ ❤️ ❤️