Slides at: Slideshare.
- “Success Engineering” - that clearly will work, then.
- Plan for the worst. Minimise risk. Fail. Recover, gracefully.
- “You can’t eliminate risk.”
- alt.sysadmin.recovery shoutout.
- Failure is an option. Admit it.
- The open source world has failure and recovery as a core competency, but perhaps not systematically enough.
- Dr. Jerker Denrell publishes fantastic papers on the topic from a business perspective, e.g. “Predicting the Next Big Thing: Success as a Signal of Poor Judgment.” It looked at people who had predicted Black Swan events, and found that doing so was negatively correlated with general quality of judgement.
- Try “Everything is Obvious Once You Know The Answer”
Whatever, science, blah, onto the entertaining anecdotes!
Rats like fibre optic. And we can use stories about this to help inform our planning.
Document, Test, Verify is like Stop, Drop, and Roll.
- Documentation tools are mostly pretty terrible, and there’s good work that could be done here.
- Make time to update documentation when you do stuff.
- Verify your success criteria. What does success look like? What are you trying to achieve?
- Make sure you actually write tests, however simple, and have a buddy sanity check your work.
- Have a plan: make sure you involve other people with it, too.
- There is no shortage of testing tools. Tests should be repeatable.
- Do stuff in repeatable shell scripts.
- Have staging environments.
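- What does pg_dump -d actually do? Well, it depends on the version: in older releases -d was short for --inserts, in newer ones it is short for --dbname.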
- You need a plan for what to do if things go wrong. Use a staging environment. And test your rollbacks, not just the implementation.
- People are really important. Having a buddy.
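The points above about repeatable steps, verified success criteria, and tested rollbacks can be sketched in a few lines. This is only a minimal illustration, not anything shown in the talk: run_plan, the Step tuple, and the verify callback are hypothetical names, and a real change would wrap actual commands rather than Python callables.

```python
from typing import Callable, List, Tuple

# Hypothetical shape for a step: (name, apply action, rollback action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_plan(steps: List[Step], verify: Callable[[], bool]) -> bool:
    """Apply steps in order, then check the success criteria.

    If a step raises or verify() returns False, every step that
    completed is rolled back in reverse order. A step that fails
    part-way is not rolled back here; it is assumed to clean up
    after itself.
    Returns True on success, False if a rollback happened.
    """
    done: List[Step] = []
    try:
        for step in steps:
            name, apply, _ = step
            print(f"applying: {name}")
            apply()
            done.append(step)
        if not verify():
            raise RuntimeError("success criteria not met")
        return True
    except Exception as exc:
        print(f"failed: {exc}; rolling back")
        for name, _, rollback in reversed(done):
            print(f"rolling back: {name}")
            rollback()
        return False
```

Running the same plan in staging with a verify callback that always returns False is a cheap way to exercise the rollback path before the real change.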
Failure to Imagine
- Telling externals they need to tell you when you have a problem is not going to work. Trust no-one.
- Share your stories of failure and talk to a diverse group of people, people who are different to you.
- Sharing lets you head failure off at the pass.
- People who are different to you means outside IT - business, musicians, the construction industry.
- Go and physically look at the things you might need to work on; don’t just sit in a room.
- The post-mortem/debrief.
- Keep a notebook of your work, learn from it.
- Plan to have a post-mortem, even if there’s success.
- Document your plan with a timeline, allocate time, and actually test the plan.
- IRC is great, speaking is better. A headset is great.
- Have a timekeeper and alert people to when you’ve hit your drop-dead point.
- Limit improvements to 1-2 things. An endless list will never be worked upon.
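The timekeeper idea can be made concrete with a small calculation. The sketch below rests on an assumption the talk notes do not spell out: that the drop-dead point is the end of the change window minus the time a rollback needs. The function names and the ten-minute warning margin are hypothetical.

```python
from datetime import datetime, timedelta


def drop_dead_point(window_end: datetime, rollback_time: timedelta) -> datetime:
    """Latest moment at which a rollback can still finish inside the window.

    Assumption: the drop-dead point is window end minus rollback time.
    """
    return window_end - rollback_time


def timekeeper_status(
    now: datetime,
    window_end: datetime,
    rollback_time: timedelta,
    warn_before: timedelta = timedelta(minutes=10),  # hypothetical margin
) -> str:
    """Return "continue", "warn", or "abort" for the current time."""
    deadline = drop_dead_point(window_end, rollback_time)
    if now >= deadline:
        return "abort"  # drop-dead point reached: stop and roll back
    if now >= deadline - warn_before:
        return "warn"  # alert people that the deadline is close
    return "continue"
```

With a window ending at 06:00 and a 30-minute rollback, the drop-dead point is 05:30, so the timekeeper starts warning at 05:20.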
Read the DailyWTF.