an ability to recover from or adjust easily to misfortune or change
Merriam-Webster



if it can break,
it will break
degraded
is better than
down
begin / rescue / ensure
rescue_from
rescue_from "BadInputError" do |exception|
  render(
    json: { message: exception.to_s },
    status: 422,
  )
end
Retries
try and try again
def with_retries(num = 3)
  yield
rescue
  num -= 1
  retry if num > 0
  raise
end
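Retrying immediately can pile load onto a dependency that is already struggling. A minimal sketch of the same idea with exponential backoff between attempts; `with_backoff_retries`, `base_delay`, and `sleeper` are hypothetical names introduced here, not part of the slides:

```ruby
# Hypothetical variant of with_retries that waits between attempts.
# `sleeper` is injectable so the wait can be stubbed out in tests.
def with_backoff_retries(attempts = 3, base_delay: 0.1, sleeper: ->(s) { sleep(s) })
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= attempts # out of attempts: surface the last error
    sleeper.call(base_delay * (2**(tries - 1))) # base, 2x base, 4x base, ...
    retry
  end
end

# Usage: a block that fails twice and then succeeds.
calls = 0
result = with_backoff_retries(3, base_delay: 0.01) do
  calls += 1
  raise "flaky" if calls < 3
  "ok after #{calls} calls"
end
```

Only retry work that is safe to repeat; anything with side effects needs extra protection first.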
Consider using Fiber
if workload is complex
def try
  yield
rescue
  Fiber.yield
  retry
end

def work
  [
    try { … },
    try { … },
    try { … },
  ]
end
fiber = Fiber.new { work }
result = nil
(1..3).each do |_|
  result = fiber.resume
  return result unless fiber.alive?
end
"could not finish work"
Timeouts
but not the “in the corner” type
require "timeout"

begin
  Timeout.timeout(1) do
    do_work
  end
rescue Timeout::Error
  "timed out"
else
  "success result"
end
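Where a client library exposes its own timeouts, those are often a gentler option than wrapping the whole call. A minimal sketch using the standard library's `Net::HTTP`; the helper name `build_client`, the host, and the timeout values are placeholders chosen for illustration:

```ruby
require "net/http"

# Hypothetical helper: builds an HTTP client with explicit timeouts.
# Net::HTTP.new does not open a connection until a request is made.
def build_client(host, port = 80)
  http = Net::HTTP.new(host, port)
  http.open_timeout = 1 # seconds allowed to establish the connection
  http.read_timeout = 2 # seconds allowed to wait for each read
  http
end

client = build_client("example.com")
```

Requests made through this client should raise `Net::OpenTimeout` or `Net::ReadTimeout` when a limit is exceeded, which can be rescued like any other error.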
Circuit breakers
short circuiting failures

if failure is expected,
fail fast
See martinfowler.com for more info
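A minimal, single-threaded sketch of the idea: after a number of consecutive failures the breaker opens and rejects calls immediately, then lets a trial call through once a cool-off period has passed. The class name, `threshold`, `cool_off`, and `clock` are assumptions made for this example, not a specific library's API:

```ruby
# Illustrative circuit breaker; not thread-safe, and not a real gem's interface.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, cool_off: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold # consecutive failures before opening
    @cool_off = cool_off   # seconds to stay open before allowing a trial call
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open" if open?
    result = yield
    @failures = 0 # a success closes the circuit again
    @opened_at = nil
    result
  rescue OpenError
    raise # rejections are not counted as new failures
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @threshold
    raise
  end

  def open?
    !@opened_at.nil? && (@clock.call - @opened_at) < @cool_off
  end
end

# Usage with a fake clock so the cool-off can be simulated.
now = 0.0
breaker = CircuitBreaker.new(threshold: 2, cool_off: 10, clock: -> { now })
outcomes = []
3.times do
  begin
    breaker.call { raise IOError, "backend down" }
  rescue CircuitBreaker::OpenError
    outcomes << :rejected
  rescue IOError
    outcomes << :failed
  end
end
now = 11.0 # cool-off has elapsed, so one trial call is allowed
recovered = breaker.call { "pong" }
```

The third call never reaches the backend, which is the point: a known-bad dependency stops costing time and threads.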
Bulkheads
creating the unsinkable
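One way to read the bulkhead idea in code is a hard cap on how much concurrent work any single dependency may hold, so a slow one cannot consume every worker. A minimal sketch, assuming a fixed pool of permits; `Bulkhead`, `FullError`, and the limit are hypothetical names for this example:

```ruby
# Illustrative bulkhead: at most `limit` callers may be inside `call` at once.
class Bulkhead
  class FullError < StandardError; end

  def initialize(limit)
    @permits = Queue.new # thread-safe queue used as a pool of permits
    limit.times { @permits << :permit }
  end

  def call
    begin
      permit = @permits.pop(true) # non-blocking; raises ThreadError when empty
    rescue ThreadError
      raise FullError, "bulkhead full"
    end
    begin
      yield
    ensure
      @permits << permit # always hand the permit back
    end
  end
end

# Usage: with a limit of 1, a nested call is rejected instead of waiting.
reports = Bulkhead.new(1)
inner = nil
outer = reports.call do
  begin
    reports.call { :never_runs }
  rescue Bulkhead::FullError
    inner = :rejected
  end
  :done
end
```

Rejecting instead of queueing keeps the failure contained to the callers of that one dependency.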



Caching
who wants to do the same thing twice?
Rails.cache.fetch(
  cache_key,
  expires_in: 2.hours
) do
  do_expensive_work
end
Fallbacks
when all else fails

def with_fallback(value)
  yield
rescue
  value
end
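A short usage sketch of the helper above. `fetch_recommendations` is a hypothetical stand-in for a flaky dependency; an empty list is assumed to be an acceptable degraded result:

```ruby
def with_fallback(value)
  yield
rescue StandardError
  value
end

# Hypothetical dependency that is currently failing.
def fetch_recommendations
  raise IOError, "recommendation service unavailable"
end

# The page still renders, just without recommendations.
recommendations = with_fallback([]) { fetch_recommendations }

# When the block succeeds, its own value is returned untouched.
answer = with_fallback(0) { 42 }
```

The fallback value should be something the caller can use as-is, otherwise the failure has only been moved downstream.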
Redundancy
Pooling

Health checking
Idempotency keys
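A minimal sketch of the idea: the caller sends a unique key with each logical operation, and a repeated key returns the stored result instead of running the work again. The in-memory hash is an assumption for illustration; a real system would need a durable, shared store. `IdempotentExecutor` and the key format are hypothetical:

```ruby
# Illustrative only: results are kept in process memory.
class IdempotentExecutor
  def initialize
    @results = {}
    @lock = Mutex.new
  end

  # Runs the block at most once per key; later calls reuse the first result.
  # The lock is held while the block runs, which serializes all callers.
  def perform(key)
    @lock.synchronize do
      return @results[key] if @results.key?(key)
      @results[key] = yield # nothing is stored if the block raises
    end
  end
end

# Usage: a retried request with the same key does not charge twice.
charges = []
executor = IdempotentExecutor.new
first = executor.perform("order-42") { charges << 100; "charged" }
second = executor.perform("order-42") { charges << 100; "charged" }
```

This is what makes retries safe for operations that have side effects.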





Throttling
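One common way to throttle is a token bucket: each request spends a token, tokens refill at a fixed rate, and requests are refused while the bucket is empty. A minimal, single-threaded sketch; `TokenBucket`, `capacity`, `refill_per_second`, and `clock` are names chosen for this example:

```ruby
# Illustrative token bucket; not thread-safe.
class TokenBucket
  def initialize(capacity:, refill_per_second:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @capacity = capacity.to_f
    @rate = refill_per_second.to_f
    @clock = clock
    @tokens = @capacity # start full so short bursts are allowed
    @last = @clock.call
  end

  def allow?
    now = @clock.call
    # Refill for the time elapsed since the last check, capped at capacity.
    @tokens = [@capacity, @tokens + (now - @last) * @rate].min
    @last = now
    return false if @tokens < 1
    @tokens -= 1
    true
  end
end

# Usage with a fake clock: a burst of two is allowed, the third is refused.
now = 0.0
bucket = TokenBucket.new(capacity: 2, refill_per_second: 1, clock: -> { now })
burst = 3.times.map { bucket.allow? }
now = 1.0 # one second later, one token has been refilled
after_wait = bucket.allow?
```

Refused requests can be answered right away, for example with an HTTP 429, instead of queueing behind work the system cannot keep up with.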

Resiliency matrix
the red pill of failure modes
| | Endpoint A | Endpoint B | ... |
|---|---|---|---|
| MySQL unavailable | Unavailable | Degraded | Available |
| Redis unavailable | Unavailable | Available | Available |
| No disk space | Available | Degraded | Degraded |
| Third-party service unavailable | Unavailable | Available | Unavailable |
Automated testing
resilient resiliency
stub_request(:post, "example.com")
  .with(body: {data: [1, 2, 3]})
  .to_return(status: [503, ""])

Toxiproxy[:mysql_master].down do
  assert_equal([], fetch_things)
end
Chaos engineering
regularly break things

Game days
so your tools are always ready to go
Simulate your failure modes
in a controlled fashion
Root cause analysis
when things do go wrong...
Every disruption
should be unique
Fancy tech
buzzwords, bandwagons, et al
Useful resources
there's only so much we can do in a single presentation...

Thanks for listening!
❤️ ❤️ ❤️ ❤️ ❤️ ❤️