About this blog

'Going Spatial' is my personal blog. The views on this site are entirely my own and should in no way be attributed to anyone else or taken as the opinion of any organisation.

Tuesday, 4 March 2014

What the hell is DevOps?



For the last year or so, I have been trying to enable 'DevOps' in my team. I bit hard and swallowed the whole DevOps sandwich a couple of years ago at Velocity Conference. I wanted to experience the benefits of DevOps, which included:

  • The ability to deploy often and without fear,
  • Faster recovery,
  • Infrastructure as code (all Ops guys secretly want to be software gurus),
  • Reduced complexity.


Here's the virtuous 'DevOps' Möbius loop:

[Figure: the DevOps Möbius loop]

What have I found since that intoxicating first bite? It has been a tricky process: I am nowhere near achieving even half of my tasks and am experiencing only a smidgen of the benefits. Now, you may ask: 'does that mean that my team and I are not trying?'

Oh no, we are definitely trying but it appeared to me that we were spending a lot of time down little rabbit holes and dead-ends. Tinkering with this and that.

So I asked my team: what do you understand DevOps to be, and what do you think we are trying to achieve? Out of a team of four, I actually got SIX different answers.

Granted, there was a bit of overlap between the answers but it was obvious to me that it was not crystal clear. It amused me that we had more answers than we had practitioners. One respondent thought DevOps was a specific skillset that had to be attained while another was pretty convinced it was a new role we were migrating into.

I haven't even bothered to ask the wider software developer team (you know, the 'Dev' bit of DevOps?) for their thoughts, as I am pretty sure that I would get several more answers. I will ask them, I will.

So, as I said, I was trying to enable DevOps (without input from the 'Dev' side!). Just what was I trying to do, and not succeeding at?

DevOps - the holy grail?

My own definition of DevOps (notice I said my 'own' definition) is broadly similar to the one on Wikipedia. I see DevOps as being 'the development of better tools and communications between software developers and operations staff to ensure that applications and services are delivered quickly, safely and reliably'.



DevOps: the guys (and gals) who develop the software, and the guys and gals who handle the operations side of the environment and ensure that everything in production/live stands up and is properly monitored, protected and load balanced.

What a great idea.

From my own operational side of things, I saw that DevOps should deliver improved control over infrastructure, configuration management that works and is consistent and, most attractive of all, infrastructure as code. That is, instead of a Visio diagram and a (huge) Word document describing the entire environment stack, the environment would be rendered down into (re)usable and understandable code. It would be handled as code (versioned, stored in a repository, automatically deployed) and would also be an opportunity to upskill the Ops team.
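
To make 'infrastructure as code' a little more concrete, here is a minimal sketch in Python. It is not how Chef, Puppet or any real tool works; the `Server` class, the `DESIRED` dictionary and the `plan` function are all names I have made up for illustration. The idea is simply that the environment is described as data in a file you can version, and a small function works out what would need to change on live.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """One host in the environment. All fields are hypothetical examples."""
    name: str
    role: str
    instance_count: int


# The desired state lives in a version-controlled file, not a Word document.
DESIRED = {
    "web": Server(name="web", role="web server", instance_count=2),
    "db": Server(name="db", role="database", instance_count=1),
}


def plan(desired, actual):
    """Return the changes needed to turn `actual` into `desired`."""
    changes = []
    for name in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(name), actual.get(name)
        if have is None:
            changes.append(f"create {name}")
        elif want is None:
            changes.append(f"remove {name}")
        elif want != have:
            changes.append(f"update {name}")
    return changes


if __name__ == "__main__":
    # Pretend this was discovered by inspecting the live environment.
    live = {
        "web": Server(name="web", role="web server", instance_count=1),
        "cache": Server(name="cache", role="cache", instance_count=1),
    }
    for change in plan(DESIRED, live):
        print(change)
```

Because the description is plain code, a change to the environment becomes a commit that can be reviewed and rolled back, and the difference between 'what we think is live' and 'what is live' becomes something you can print rather than argue about.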

What I have seen in practice

Let me give you an example of a hypothetical situation involving a web operations team and a number of hosted applications. We have a live environment that has been steadily built up from disparate components over a number of years. We know that the sum of all knowledge of this precious environment is split unequally among five to seven people in the entire company. A document describing the environment (the run book) was created during the initial stages of design, and it has now ballooned to include configuration information, test scripts, actual deployment instructions and even copies of email threads of varying importance. Every time there's a deployment, the document is opened up and referred to, page by page, as we follow the sacred instructions laid down by our forefathers on how to update/deploy/renew/remove/tweak the application.

However, over time, errors and omissions have crept into the run book. Not everything important is noted down in it. Worse, some of the ops guys end up taking a copy of the run book onto their own machines and using it as their own master copy, since they don't trust the company's CMS. The intention is to eventually check the document back in. This probably doesn't happen.

The developers sometimes have to deploy new code in response to an emergency, for example a just-discovered exploit in the application. A quick patch is introduced into the environment and everyone breathes easier. Of course, the patch changed some of the instructions in the run book, and no-one had the sense to check or update the book except one of the ops guys, who made the update in his own local copy. He neglected to synchronise it with the master copy (remember: he didn't trust the company's CMS or, worse, didn't know how to use it properly). Then another deployment is carried out and the live environment goes BANG. It worked in test, so why did it blow up on live?


Efforts at fixing the application fail and everyone is frantic. There's no option: we have to roll back to the last known snapshot... from three weeks back!



There are howls of protest, as this would reverse a lot of new functionality and not everything is broken. But then, why did everyone bunch up all the releases into one big software drop anyway?

Fingers get pointed: operations blame the developers for lack of testing and shoddy code, and developers blame operations for poor configuration management (it worked in dev after all!) and lack of in-depth application knowledge. This reinforces the barrier between the two groups to the point where operations are reluctant to update the live environment, limiting releases to two or three a year, each with weeks of testing. The developers are frustrated as they need to wait in line to deliver new code, so they decide to work around the operations team and start working in their own AWS account, totally ungoverned.

Chaos ensues.


The result

The developers and the operations team end up even further apart, not closer together. The environment is not documented or managed properly, so its redeployment is in doubt. God help us if there's a scenario where the disaster recovery plan requires the entire environment to be rebuilt in another location. Meanwhile, the developers are using a whole load of new applications (Elastic Beanstalk! Uhuru! OpenShift!).
 

So what is 'DevOps'? 

Well, what I can confidently say is that DevOps is not about whether we should choose Chef or Puppet as the configuration management tool - though yes, infrastructure as code, dammit. Nor is it automated testing or some nifty tool that can deploy golden copies of your environment to a number of hosts. It isn't creating a framework for self-provisioning or better tracking of resource use. While the cloud is elastic and flexible, I find that the humans working in it, and their processes, are anything but. The humans are the bottlenecks because there is resistance to change, the process required is unclear, and there's comfort in what we know already works, even if it works horribly and slowly.


DevOps is about better communication and sharing. It is about challenging yourself, your team and your processes and always asking 'can we do things better, faster and safer?' It is not a role or a job title but a state of mind. You need to be inclusive, to be able to share information and good ideas. DevOps also cannot survive in a vacuum: there is no point in just one person believing in it. The team needs to buy into it and, ultimately, so does the business.

Failing to achieve this will mean that the same old processes and issues keep repeating themselves until the whole structure breaks down in front of you.