The now (in)famous Joel Test asks 12 questions to determine the maturity of an organisation. Nearly 2 decades later the development ecosystem has evolved and many practises and techniques have become commonplace.
The DevOps Ladder is an attempt to document the best practises in the form of an information radiator that provides a guided path for continous improvement.
The vertical (Capablity) ladder includes 10 steps that are required throughout a project or product’s lifecycle. The horizontal (Maturity) ladder has 4 levels or steps. The goal of any project should be to implement all capabilities and then move them to of Level 1 ASAP.
Adding a capability or improving the maturity level is called climbing the ladder, There are many styles of climbing:
With a combination of styles being common:
Many DevOps and “agile” anti-patterns are highlighted by the outliers.
Capability | Underwater | Level 1 | Level 2 | Level 3 |
---|---|---|---|---|
Leadership | Command and Control | Servant | Transformational | |
Teams | Dysfunctional | Functional | Cross Functional | Empowered |
Safety | Physical | Job | Psychological | |
Failure | Feared | Embraced | Celebrated | |
Westrum | Pathological | Bureaucratic | Generative | |
Architecture | Ivory Tower | - Just Enough - Last responsible moment |
Loosly Coupled | |
Source Code / Version Control | Shared Drive / Zip | CVS Artifact Repository |
versioning strategy e.g. semver |
Branching Strategies e.g. Git Flow OR Trunk Based |
Build | Manual / IDE | Snowflake | Phoenix | Reproducible Builds |
Continous Integration | Long lived branches | Trunk Based | Test Data Management | |
Test Automation | None | Nightly | Per Commit / PR | - Matrix - Epemeral testing instance per PR |
Testing | None | Integration OR Unit Testing |
Integration AND Unit Testing |
- Fuzzy Testing - Matrix - Downstream |
Code Review | None | Pair Programming OR Code Review |
Static Analysis | - Security scanning - Dependency scanning - Architecture compliance |
Deliver | Using a checklist | 1-step to production | Manual release strategies e.g. canary or blue/green feature toggles |
Automatic rollout strategies based on business metrics e.g. using Spinnaker |
Deploy | Heavyweight change control | Lightweight change control | Business driven | |
Run | Snowflake | Phoenix | Run offline | |
Monitor | System Metrics | App Metrics | Business Metrics | |
Docs | Getting Started / README | Architecture Decision Records | - Playbook - Cultural Manifesto |
Accountability and Responsibility
Failure and Safety
DevOps has it’s roots in the lean and Agile movements where the concept of failure takes on new meaning:
Rather than viewing failure as the enemy, the agile view is that failure is a vital and necessary part of learning and expirementation - if you are not failing than you are not trying hard enough.
Embracing failure entails accepting the inherent nature of all systems to fail and build systems and processes that are more resilent to change and uncertainty - focusing more on mean time to recover (MTTR) than mean time to failure (MTBF).
How people react (and more importantly how they respond to other people) during and after failure is critical. Failure is an unplanned investment, the only thing you can control is the ROI.
The modern agile framework includes this concept of failure in 3 of it’s 4 pillars: - Learning and Experimentation - Safety and a requisition - Make people awesome
Continuous Integration (CI)
CI is not about build and test automation, they are both core components of CI, but they are not CI.
CI is about ensuring that the different parts of a system are tested to ensure compatibilty as early as possible (ideally daily).
Continuous Delivery (CD)
CD is not about continously deploying to production, rather about the state of being that allows deployment into production at any time - this is possible due to the software always being delivered in a stable and tested state.
Focus on rapid recoverability / deployment
When building and running systems always focus first on the ability to recover at the cost of almost everything else.
Style | Database | Sesson Management | Deployment |
---|---|---|---|
Stateless | Event Sourcing | JWT | Immutable Infrastructure |
Active / Active (Replicated State) | Data Guard Streaming Replication |
Database / InMemory session replication | GitOps |
Active / Passive (Shared State) | Oracle RAC SQL Server Clustering |
Database / InMemory session clustering | Configuration Management |
Active (Snowflakes) | Session Persistence | Manual |
Failure design
Make the right thing, the easy thing
“If something is hard – do it more often and you will get better!” – Mary Poppendieck
Find and eliminate snowflakes
Find and eliminate information silo’s
Every team and project’s ladder should be unique - which levels you are targeting will reflect the architectural decisions and trade-offs being made.
Mature ladders will have most if not all capabilties at Level 2 and a few at Level 3 - but never everything at Level 3. If everything is at Level 3 you are not stretching yourself - reconsider what L3 means for you. e.g. If you are already conducting code reviews then maybe stretch to having the reviewers automatically selected by git history or an OWNERS file - There is always something more you can do.
One or two L1 capabilties may also be OK on a mature ladder. e.g. Open Source projects rarely have planning above issue tracking - there is nothing inherently wrong with that. Likewise your deployment target may be very static and a L1 CI system is sufficient.