Monday, September 10, 2012

How resilient is it anyway?

As engineers and scientists, we worry a lot about how well the things we build hold up.  Anything that goes out into the real world will suffer all sorts of buffets from unexpected interactions with its environment, strange behaviors by its users, idiosyncratic failures of components, and myriad other differences between theory and reality.  So we care a lot about knowing how resilient a system is, but don't currently have any particularly good way of measuring it.

Oh, there are lots of ways to measure resilience in particular aspects of particular systems.  Like if I'm building a phone network, I might want to know how frequently a call fails---either by getting dropped or failing to connect in the first place.  I might also measure how call failures increase when there are too many people in one place (like a soccer match) or when atmospheric conditions degrade (like a thunderstorm) or when a phone goes haywire and starts broadcasting all the time.

But these sorts of measures leave a lot to be desired, since they only look at particular aspects of a system's behavior and don't have anything to say about what happens when we link systems together to form a bigger system.  That's why I'm interested in generic ways to measure the resilience of a system.  My hope is that if we can design highly resilient components, then when they're connected together to form bigger components, we will more easily be able to ensure that those larger components are resilient as well.

Even better is if we can get compositional proofs, so that we know that certain types of composition are guaranteed to produce resilient systems---just as there are compositions of linear systems that produce linear systems and digital systems that produce digital systems, etc.  This is the type of foundation that lays the groundwork  for explosions in the complexity and variety of artifacts that we can engineer, just like we've seen previously in digital computers or clockwork mechanical systems.  I want to see the same thing happen for systems that live in more open worlds, so that we can have an infrastructure for our civilization that helps to maintain itself and that can tolerate more of the insults that we crazy humans throw at it.

But first, small and humble steps.  In order to be able to even formulate these problems of resilience sanely, we need to better quantify what this "resilience" thing might mean.  In my paper in the Workshop on Evaluation for SASO at IEEE SASO, I take a crack at the problem, proposing a way to quantify "graceful degradation" using dimensionless numbers.  The notion of graceful degradation is an important one for understanding resilience, because it gets at the notion of margins of error in the operation of a system.  When you push a system that degrades gracefully, you start seeing problems in its behavior long before it collapses.  For example, on an overloaded internet connection that shows graceful degradation, things start going slower and slower, rather than going directly from fast communication to none at all.

In my paper, I propose that we can measure how gracefully a system degrades in a relatively simple manner.  Consider the space formed by all the parameters describing the structure of a system and of the environment in which it operates.  We break that space into three parts: the acceptable region where things are going well, the failing region where things have collapsed entirely, and the degraded region in between.

If we draw a line slicing through this space, then we get a sequence of intervals of acceptable, degraded, and failing behavior.  We can then compare the length of each acceptable interval with the lengths of the degraded intervals on its borders.  The longer the degraded intervals that separate acceptable and failing intervals, the better the system is.  So in order to know the weakest point of a system, we just look for the lowest ratio of degraded length to acceptable length on any line through the space.
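
To make that concrete, here is a rough sketch in Python of the ratio along one line, not the exact formulation from the paper.  It assumes the line has already been sampled at evenly spaced points and that each sample has been labeled acceptable, degraded, or failing.  The names (intervals, weakest_ratio, the "A"/"D"/"F" labels) are made up for illustration, and so is the choice to score an acceptable interval that touches a failing one directly as having a degraded margin of zero.

```python
from itertools import groupby

# Hypothetical labels for the three regions described above.
ACCEPTABLE, DEGRADED, FAILING = "A", "D", "F"


def intervals(labels, step=1.0):
    """Collapse per-sample labels along a line into (label, length) runs."""
    return [(label, sum(1 for _ in run) * step) for label, run in groupby(labels)]


def weakest_ratio(labels, step=1.0):
    """Lowest degraded/acceptable length ratio along one sampled line.

    Assumption: if an acceptable interval borders a failing interval
    directly, the degraded margin on that side has length zero.
    Returns None when no acceptable interval has a neighbor at all.
    """
    runs = intervals(labels, step)
    ratios = []
    for i, (label, length) in enumerate(runs):
        if label != ACCEPTABLE:
            continue
        # Look at the interval on each side of this acceptable interval.
        for j in (i - 1, i + 1):
            if 0 <= j < len(runs):
                neighbor_label, neighbor_length = runs[j]
                margin = neighbor_length if neighbor_label == DEGRADED else 0.0
                ratios.append(margin / length)
    return min(ratios) if ratios else None


# Four acceptable samples, two degraded, then failure.
print(weakest_ratio(list("AAAADDFF")))
# Acceptable running straight into failure, with no warning zone.
print(weakest_ratio(list("AAAAF")))
```

To get at the weakest point of the whole system, you would repeat this over many lines through the parameter space and keep the smallest value; how to choose those lines is left open here.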

What this metric really tells us is how painful the tradeoff between speed of adaptation and safety of adaptation is.  The lower the number, the easier it is for changes to drive the system into failure before it can effectively react, or for the system to accidentally drive itself off the cliff.  The higher the number, the more margin there is for error.

So, here's a start.  There are scads of open questions about how to apply this metric, how to understand what it's telling us, etc., but it may be a good point to start from, since it can pull out the weak points of a system and tell us what they are...