The DevOps Ladder

The now (in)famous Joel Test asks 12 questions to determine the maturity of an organisation. Nearly two decades later, the development ecosystem has evolved and many practices and techniques have become commonplace.

The DevOps Ladder is an attempt to document the best practices in the form of an information radiator that provides a guided path for continuous improvement.

The vertical (Capability) ladder includes 10 steps that are required throughout a project or product’s lifecycle. The horizontal (Maturity) ladder has 4 levels or steps. The goal of any project should be to implement all capabilities and then move them to Level 1 ASAP.
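One way to picture the two axes is as a mapping from capability to maturity level. The sketch below is only an illustration: the `Level` enum, the `next_climbs` helper, and the sample project state are hypothetical names and data, not part of the ladder itself. It lists the capabilities that are either missing or still underwater, which are the ones the stated goal says to address first.

```python
from enum import IntEnum


class Level(IntEnum):
    """Maturity levels, ordered from lowest to highest."""
    UNDERWATER = 0
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3


def next_climbs(ladder, capabilities):
    """Return capabilities that are missing or still below Level 1."""
    climbs = []
    for name in capabilities:
        level = ladder.get(name)  # None means the capability is not implemented yet
        if level is None or level < Level.LEVEL_1:
            climbs.append(name)
    return climbs


# Hypothetical project state, for illustration only.
capabilities = ["Build", "Test Automation", "Monitor", "Docs"]
ladder = {
    "Build": Level.LEVEL_1,
    "Test Automation": Level.UNDERWATER,
    "Monitor": Level.LEVEL_2,
}

print(next_climbs(ladder, capabilities))
```

Here "Docs" is reported because it is not implemented at all, and "Test Automation" because it is still underwater.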

Climbing Styles

Adding a capability or improving its maturity level is called climbing the ladder. There are many styles of climbing.

A combination of styles is common.

Anti Patterns

Many DevOps and “agile” anti-patterns are highlighted by the outliers.

The Ladder

| Capability | Underwater | Level 1 | Level 2 | Level 3 |
| --- | --- | --- | --- | --- |
| Leadership | Command and Control | | Servant | Transformational |
| Teams | Dysfunctional | Functional | Cross Functional | Empowered |
| Safety | Physical | Job | Psychological | |
| Failure | Feared | Embraced | Celebrated | |
| Westrum | Pathological | Bureaucratic | Generative | |
| Architecture | Ivory Tower | Just Enough; Last responsible moment | Loosely Coupled | |
| Source Code / Version Control | Shared Drive / Zip | CVS; Artifact Repository | Git; Versioning strategy e.g. semver | Branching strategies e.g. Git Flow OR Trunk Based |
| Build | Manual / IDE | Snowflake | Phoenix | Reproducible Builds |
| Continuous Integration | Long lived branches | Trunk Based | Test Data Management | |
| Test Automation | None | Nightly | Per Commit / PR | Matrix; Ephemeral testing instance per PR |
| Testing | None | Integration OR Unit Testing | Integration AND Unit Testing | Fuzz Testing; Matrix; Downstream |
| Code Review | None | Pair Programming OR Code Review | Static Analysis | Security scanning; Dependency scanning; Architecture compliance |
| Deliver | Using a checklist | 1-step to production | Manual release strategies e.g. canary or blue/green; feature toggles | Automatic rollout strategies based on business metrics e.g. using Spinnaker |
| Deploy | Heavyweight change control | Lightweight change control | Business driven | |
| Run | | Snowflake | Phoenix | Run offline |
| Monitor | Twitter | System Metrics | App Metrics | Business Metrics |
| Docs | | Getting Started / README | Architecture Decision Records | Playbook; Cultural Manifesto |

Culture and Leadership

Accountability and Responsibility

Failure and Safety

DevOps has its roots in the Lean and Agile movements, where the concept of failure takes on new meaning:

Rather than viewing failure as the enemy, the agile view is that failure is a vital and necessary part of learning and experimentation - if you are not failing then you are not trying hard enough.

Embracing failure entails accepting the inherent nature of all systems to fail and building systems and processes that are more resilient to change and uncertainty - focusing more on mean time to recover (MTTR) than mean time between failures (MTBF).
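The trade-off between the two metrics can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below uses made-up numbers purely for illustration; the `availability` function name and all figures are assumptions, not measurements from any real system.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, for illustration only.
baseline = availability(mtbf_hours=500, mttr_hours=4)
fewer_failures = availability(mtbf_hours=1000, mttr_hours=4)    # fail half as often
faster_recovery = availability(mtbf_hours=500, mttr_hours=0.5)  # recover 8x faster

for label, value in [
    ("baseline", baseline),
    ("fewer failures", fewer_failures),
    ("faster recovery", faster_recovery),
]:
    print(f"{label}: {value:.4%}")
```

In this particular scenario, cutting recovery time improves availability more than doubling the time between failures does. The result depends on the numbers chosen, so treat it as a way to reason about your own figures rather than a general rule.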

How people react (and more importantly how they respond to other people) during and after failure is critical. Failure is an unplanned investment; the only thing you can control is the ROI.

The Modern Agile framework includes this concept of failure in 3 of its 4 pillars:

- Learning and experimentation
- Safety as a prerequisite
- Make people awesome

CI/CD

Continuous Integration (CI)

CI is not about build and test automation. Both are core components of CI, but they are not CI itself.

CI is about testing the different parts of a system together to confirm their compatibility as early as possible (ideally daily).

Continuous Delivery (CD)

CD is not about continuously deploying to production, but rather about the state of being that allows deployment into production at any time. This is possible because the software is always delivered in a stable and tested state.

Run & Operate

Focus on rapid recoverability / deployment

When building and running systems, always focus first on the ability to recover, even at the cost of almost everything else.

| Style | Database | Session Management | Deployment |
| --- | --- | --- | --- |
| Stateless | Event Sourcing | JWT | Immutable Infrastructure |
| Active / Active (Replicated State) | Data Guard; Streaming Replication | Database / InMemory session replication | GitOps |
| Active / Passive (Shared State) | Oracle RAC; SQL Server Clustering | Database / InMemory session clustering | Configuration Management |
| Active (Snowflakes) | | Session Persistence | Manual |

Failure design

Getting Started

Make the right thing, the easy thing

“If something is hard – do it more often and you will get better!” – Mary Poppendieck

Find and eliminate snowflakes

Find and eliminate information silos

Maturity

Every team and project’s ladder should be unique - which levels you are targeting will reflect the architectural decisions and trade-offs being made.

Mature ladders will have most if not all capabilities at Level 2 and a few at Level 3, but never everything at Level 3. If everything is at Level 3 you are not stretching yourself; reconsider what Level 3 means for you. For example, if you are already conducting code reviews, then maybe stretch to having the reviewers automatically selected by git history or an OWNERS file. There is always something more you can do.
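As a rough idea of what automatic reviewer selection might look like, the sketch below picks reviewers from a path-prefix to owners mapping. The mapping format, the `pick_reviewers` function, and the names in it are all hypothetical; real OWNERS-style tooling has its own file formats and rules, which this does not attempt to reproduce.

```python
def pick_reviewers(changed_files, owners, author):
    """Pick reviewers for a change from a path-prefix -> owners mapping.

    For each changed file the longest matching path prefix wins, and the
    author is never asked to review their own change.
    """
    reviewers = set()
    for path in changed_files:
        matches = [prefix for prefix in owners if path.startswith(prefix)]
        if not matches:
            continue  # no owner registered for this path
        best = max(matches, key=len)
        reviewers.update(owners[best])
    reviewers.discard(author)
    return sorted(reviewers)


# Hypothetical ownership data; a real setup might load this from OWNERS files.
owners = {
    "": ["alice"],                # repository-wide fallback
    "deploy/": ["bob", "carol"],
    "deploy/db/": ["dave"],
}

print(pick_reviewers(["deploy/db/schema.sql", "README.md"], owners, author="alice"))
```

The longest-prefix rule is a design choice made for this sketch: it sends a change to the most specific owners and falls back to broader ones only when nothing closer matches.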

One or two Level 1 capabilities may also be OK on a mature ladder. For example, open source projects rarely have planning above issue tracking, and there is nothing inherently wrong with that. Likewise, your deployment target may be very static, in which case a Level 1 CI system is sufficient.
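The rules of thumb in this section can be turned into a small self-check. The sketch below is one possible reading, not a definition: the `looks_mature` function is hypothetical, "most" is assumed to mean more than half, "one or two" is taken as at most two, and any underwater capability is assumed to rule out maturity.

```python
from collections import Counter


def looks_mature(levels, max_level_1=2):
    """Apply the rough heuristic to a list of per-capability levels (0 to 3)."""
    counts = Counter(levels)
    total = len(levels)
    if total == 0 or counts[0] > 0:
        return False  # nothing tracked, or something is still underwater (assumption)
    if counts[3] == total:
        return False  # everything at Level 3: the targets need stretching
    if counts[1] > max_level_1:
        return False  # more than "one or two" capabilities at Level 1
    # Assumption: "most" means more than half at Level 2 or above.
    return counts[2] + counts[3] > total / 2


# Hypothetical ladders, for illustration only.
print(looks_mature([2, 2, 2, 3, 1]))
print(looks_mature([3, 3, 3]))
```

The first ladder passes, while the second fails because every capability already sits at Level 3.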

More Reading