Mistakes Were Made

Selena Deckelmann

Slides at: Slideshare.

  • “Success Engineering” - that clearly will work, then.

  • Plan for the worst. Minimise risk. Fail. Recover, gracefully.

  • “You can’t eliminate risk.”

  • alt.sysadmin.recovery shoutout.

  • Failure is an option. Admit it.

  • The open source world has failure and recovery as a core competency, but perhaps not systematically enough.

  • Dr. Jerker Denrell publishes fantastic papers on the topic from a business perspective, such as “Predicting the Next Big Thing: Success as a Signal of Poor Judgment.” It looked at people who had predicted Black Swan events, and found that doing so correlated negatively with their general quality of judgement.

  • Also try Duncan Watts’ “Everything Is Obvious: Once You Know the Answer.”

  • Whatever, science, blah, onto the entertaining anecdotes!

  • Rats like fibre optic. And we can use stories about this to help inform our planning.

  • Document, Test, Verify is like Stop, Drop, and Roll.

Documentation

  • Documentation tools are mostly pretty terrible, and there’s good work that could be done here.
  • Make time to update documentation when you do stuff.

Testing

  • Verify your success criteria. What does success look like? What are you trying to achieve?
  • Make sure you actually write tests, however simple, and have a buddy sanity check your work.
  • Have a plan: make sure you involve other people with it, too.
  • There is no shortage of testing tools; whatever you use, the tests should be repeatable.
  • Do stuff in repeatable shell scripts.
  • Have staging environments.
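
As a minimal sketch of the testing advice above (a repeatable step, a written-down success criterion, and a staging area to try it in), something like the following might do. Everything here is hypothetical illustration rather than anything shown in the talk: the `ensure_line` and `verify` helpers, the `app.conf` file name, and the setting being added are all assumptions, and a temporary directory stands in for a real staging environment.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def ensure_line(path: Path, line: str) -> bool:
    """Add `line` to the file unless it is already present.

    Returns True if the file was changed and False if it was already
    correct, so running the step a second time is harmless.
    """
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False
    if text and not text.endswith("\n"):
        text += "\n"  # avoid gluing the new line onto the last one
    path.write_text(text + line + "\n")
    return True


def verify(path: Path, line: str) -> None:
    """The success criterion as a check: the line appears exactly once."""
    count = path.read_text().splitlines().count(line)
    if count != 1:
        raise AssertionError(f"expected 1 occurrence of {line!r}, found {count}")


def run_demo() -> Tuple[bool, bool]:
    # Hypothetical stand-in for a staging environment: a throwaway directory.
    with tempfile.TemporaryDirectory() as staging:
        conf = Path(staging) / "app.conf"
        setting = "max_connections = 100"  # illustrative value only
        first = ensure_line(conf, setting)   # makes the change
        second = ensure_line(conf, setting)  # re-run: should do nothing
        verify(conf, setting)
        return first, second


if __name__ == "__main__":
    print(run_demo())
```

The point of returning whether anything changed is that a buddy can re-run the script and see for themselves that the second run is a no-op.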

Verify

  • What does pg_dump -d actually do? Well, it depends. Check rather than assume.
  • You need a plan for what to do if things go wrong. Use a staging environment. And test your rollbacks, not just the implementation.
  • People are really important. Have a buddy.
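
One way to test a rollback, sketched under loud assumptions: snapshot the schema, apply the change, roll it back, and check that the schema matches the snapshot again. The in-memory SQLite database is only a stand-in so the example runs anywhere (the talk's context was PostgreSQL), and the `audit_log` table, `apply_change`, and `roll_back` are hypothetical names invented for illustration.

```python
import sqlite3


def schema(conn):
    """Snapshot the schema as a list of (name, sql) pairs, sorted by name."""
    return conn.execute(
        "SELECT name, sql FROM sqlite_master ORDER BY name"
    ).fetchall()


def apply_change(conn):
    # Hypothetical change under test: a new table plus an index on it.
    conn.execute("CREATE TABLE audit_log (id INTEGER PRIMARY KEY, note TEXT)")
    conn.execute("CREATE INDEX audit_log_note ON audit_log (note)")


def roll_back(conn):
    # Hypothetical rollback: dropping the table also drops its index.
    conn.execute("DROP TABLE audit_log")


def rollback_restores_schema() -> bool:
    conn = sqlite3.connect(":memory:")  # stand-in for a staging database
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    before = schema(conn)
    apply_change(conn)
    changed = schema(conn) != before  # the change really did something
    roll_back(conn)
    restored = schema(conn) == before  # the rollback really undid it
    conn.close()
    return changed and restored


if __name__ == "__main__":
    print(rollback_restores_schema())
```

Checking that the change actually altered the schema matters as much as checking the rollback: a rollback test that passes because nothing happened proves nothing.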

Failure to Imagine

  • Telling externals they need to tell you when you have a problem is not going to work. Trust no-one.
  • Share your stories of failure and talk to a diverse group of people, people who are different to you.
  • Sharing lets you head failure off at the pass.
  • “People who are different to you” means people outside IT: business, musicians, the construction industry.
  • Go and physically look at things you might need to do, don’t just sit in a room.

Reflection

  • The post-mortem/debrief.
  • Keep a notebook of your work, learn from it.
  • Plan to have a post-mortem, even if there’s success.
  • Document your plan with a timeline, allocate time, and actually test the plan.
  • IRC is great, speaking is better. A headset is great.
  • Have a timekeeper and alert people to when you’ve hit your drop-dead point.
  • Limit improvements to 1-2 things. An endless list will never be worked upon.

Read the DailyWTF.
