I am often asked what I do for a living, and somehow I cannot give a clear answer. Not because my work is fuzzy, but because it relies on concepts that most people are not familiar with. The least technical way to explain what I’m working on is:
« I’m in charge of putting software into production ».
Certainly, but what is « production »? It is not a complicated notion but it is based on many other concepts that make its explanation quite difficult. In short, I take care of production in the broad sense. This includes the following areas:
• Solution Architecture
• Transition from development to production (deployment pipeline)
• Maintenance in Operational Conditions (MCO)
• Disaster Recovery
Each of these areas deserves further development to explain the ins and outs. Let’s have an overview of these different notions throughout this article.
To make it accessible, I will start from the presentation of a common case: an application containing a back-end and a front-end. In our example, the app will be used to find restaurants.
To do this, the company will need to develop several elements:
• A database to store the addresses and specificities of each restaurant
• A back-end: the heart of the application, which will run searches, register new restaurants, etc.
• A front-end: the visible part for the end user, often in the web browser or a mobile application
Architecture is the part of the work that will allow us to draw the following diagram:
We see that even for this small application, we already need 7 machines, and we have to configure network layers and provide redundant services. This remains a very simple example: it is quite common to be faced with twenty services spread over 30 to 40 machines.
Of course, not all companies implement all this complexity from the first line of code. You have to make trade-offs and know what matters for the service to be properly delivered to the end user. Just by looking at this diagram, I already have a dozen ideas to complete the architecture. So here is the first part of my job: thinking about the system as a whole to see the strengths and weaknesses of the solution.
Once we have our architecture diagram, we have 30 machines running 20 different services. How do you maintain all this? How do you make it evolve?
From a technical point of view, it would be enough to connect to the machines and configure them properly, one by one. I have seen this approach at several companies. Unfortunately, it is not the right solution.
Connecting to the machines to configure them manually results in snowflake configurations. Since it’s done manually, and even if you want all your machines to be configured in a similar way, there are always small differences between them.
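To make the snowflake problem concrete, here is a minimal Python sketch that compares the settings of several machines and reports whatever deviates from the majority. The machine names, the setting keys and the `find_drift` helper are all made up for the illustration; a real inventory would come from your own tooling.

```python
from collections import Counter


def find_drift(configs: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return, for each machine, the settings that differ from the majority value."""
    # Collect every setting name seen on at least one machine.
    keys: set[str] = set()
    for settings in configs.values():
        keys.update(settings)

    drift: dict[str, dict[str, str]] = {}
    for key in sorted(keys):
        # A setting absent from a machine is counted as the value "<missing>".
        values = Counter(s.get(key, "<missing>") for s in configs.values())
        majority, _ = values.most_common(1)[0]
        for machine, settings in configs.items():
            actual = settings.get(key, "<missing>")
            if actual != majority:
                drift.setdefault(machine, {})[key] = actual
    return drift


if __name__ == "__main__":
    # Hypothetical fleet: web-3 was configured by hand and has drifted.
    fleet = {
        "web-1": {"nginx": "1.24", "tls": "1.3", "timezone": "UTC"},
        "web-2": {"nginx": "1.24", "tls": "1.3", "timezone": "UTC"},
        "web-3": {"nginx": "1.22", "tls": "1.3"},
    }
    print(find_drift(fleet))
```

With three machines the differences are easy to spot by eye; with thirty, a check like this is usually the only way to notice them.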
The consequences are disastrous: going to production is difficult as hell in such conditions. Each machine needs a different treatment, a production deployment can take days, and you cannot be sure of the result.
Industrialization is therefore the second and most important part of my work. This is a fairly long process, which requires an investment of both time and resources. The added value is not immediate, so it is not always a priority for companies.
By industrializing, you automate the creation of images. Since connecting to machines in order to update them involves risks, you create images instead.
An image is an « instant picture » of a machine already configured and ready to use. When you deploy a machine with your image, it is ready to deliver the expected service without requiring manual operation.
Then you automate the replacement of old images with new ones to ensure a smooth transition in production, with as little downtime as possible.
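One common way to replace images with little downtime is a rolling replacement: swap one machine at a time and stop as soon as a replacement looks unhealthy. The sketch below only simulates this idea in Python. The `rolling_replace` function, the machine names and the `is_healthy` callback are assumptions made for the example, not the API of any real deployment tool.

```python
from typing import Callable


def rolling_replace(
    fleet: dict[str, str],
    new_image: str,
    is_healthy: Callable[[str, str], bool],
) -> dict[str, str]:
    """Move machines to new_image one by one, stopping at the first failure.

    fleet maps a machine name to the image it currently runs.
    is_healthy(machine, image) is a hypothetical check supplied by the caller.
    """
    updated = dict(fleet)
    for machine in fleet:
        previous = updated[machine]
        updated[machine] = new_image  # swap a single machine
        if not is_healthy(machine, new_image):
            updated[machine] = previous  # put the old image back
            break  # leave the remaining machines untouched
    return updated


if __name__ == "__main__":
    fleet = {"web-1": "app-v1", "web-2": "app-v1", "web-3": "app-v1"}
    # Pretend the new image fails its checks on web-2 only.
    print(rolling_replace(fleet, "app-v2", lambda machine, image: machine != "web-2"))
```

Because only one machine changes at a time, the other machines keep serving customers during the whole rollout.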
Above all, this lets you press the « go for production » button with peace of mind. That confidence comes from the tests run on each of your images, and this is precisely where your production pipeline matters.
The pipeline is a production line, similar to a car assembly line. Our chain starts with the work of developers who propose new features. Those features are tested and approved by other developers. They then go through the image creation step, and the image is used to create a qualification environment. This environment allows customers to validate the feature before the release is approved.
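The chain described above can be sketched as an ordered list of stages, where a feature stops at the first stage it fails. This is only an illustration in Python: the stage names follow the paragraph above, while `run_pipeline` and the lambda checks are placeholders for real reviews, builds and validations.

```python
from typing import Callable

# A stage is a name plus a check that returns True when the feature passes it.
Stage = tuple[str, Callable[[str], bool]]


def run_pipeline(feature: str, stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run the stages in order and return (released, names of the stages passed)."""
    passed: list[str] = []
    for name, check in stages:
        if not check(feature):
            return False, passed  # the feature never reaches production
        passed.append(name)
    return True, passed


if __name__ == "__main__":
    # Hypothetical checks: every stage accepts everything except the last one.
    stages: list[Stage] = [
        ("peer review", lambda feature: True),
        ("image build", lambda feature: True),
        ("qualification environment", lambda feature: True),
        ("customer validation", lambda feature: feature != "broken-search"),
    ]
    print(run_pipeline("restaurant-filters", stages))
    print(run_pipeline("broken-search", stages))
```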
The main idea behind industrialization is therefore to create a production line that allows the company to:
• Deliver to customers faster
• Improve reliability of deliveries
• Make delivery routine (more frequent, faster)
• Be able to validate or invalidate new features more quickly
Even if, at first glance, setting up this software factory may seem complicated, the added value is undeniable. I have never seen a company regret having invested in the industrialization of its platform. In most cases, the company is simply not aware that it could invest in industrialization.
MAINTENANCE IN OPERATIONAL CONDITIONS
Going to production is fine, but life does not stop there. Despite several years of dealing with different software platforms, I am still amazed by the number of things that can go wrong, especially in a production environment.
Once the application is in production, you must maintain it in working order. A stopped production is a loss for the company. The application can only bring money to the company if it is available to its customers. We must therefore distinguish several possible states:
• Optimal operational conditions: customers have no trouble connecting
• Deteriorated conditions: involving slowness, disconnections and a lot of frustration for the user
• Shutdown: production is no longer accessible, customers cannot connect anymore
This is why maintenance in operational conditions is so essential. It’s a fancy name, though a misleading one in my opinion: in practice, you cannot really guarantee that you will stay in optimal conditions all the time.
However, what we can do is recover an optimal state when an alert is raised. If the architecture is solid, the service remains available when a machine stops, but in degraded mode.
Degraded mode is better than no service at all. The goal is then to return to a fully operational state to ensure a good experience for users.
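As a rough illustration, the three states can be derived from how many machines behind a service still answer. The Python sketch below makes a strong simplifying assumption: it only counts machines, whereas real degraded conditions also include slowness and disconnections. The `service_state` name is invented for the example.

```python
def service_state(healthy: int, total: int) -> str:
    """Classify a service from the number of healthy machines behind it."""
    if total <= 0 or healthy <= 0:
        return "shutdown"  # nobody can connect anymore
    if healthy < total:
        return "degraded"  # still available, but with reduced capacity
    return "optimal"


if __name__ == "__main__":
    for healthy in (3, 2, 0):
        print(healthy, "of 3 machines healthy:", service_state(healthy, 3))
```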
The good news is that by industrializing production, creating new machines to replace broken ones is easy and fast. However, you need to be informed that something is wrong in order to take action.
Hence the third aspect of my work: keeping production in the best possible condition, which relies on what is called monitoring or supervision. It’s about setting up tools that send alerts if something goes wrong in our system.
Still, you can go further than just receiving an alert and waiting for a technician to do something, by automating the repair:
• You take the « sick » machine out of service (you keep it to understand the error)
• You start a new machine that works
• You run all the health checks
Health checks are the equivalent of unit tests in development. They are small programs that verify that the service is in good health. This part is quite difficult to automate, but if the industrialization went well, monitoring and supervision will go well too.
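The three repair steps above can be sketched in a few lines of Python. Everything here is a simulation under stated assumptions: machines are plain names in lists, `repair` is a made-up function, and each health check is a small callable that returns True when the machine looks fine. In real life the checks would query the service itself.

```python
from typing import Callable

# A health check takes a machine name and returns True when the machine is fine.
HealthCheck = Callable[[str], bool]


def repair(
    active: list[str],
    quarantine: list[str],
    sick: str,
    replacement: str,
    health_checks: list[HealthCheck],
) -> bool:
    """Swap a sick machine for a replacement; return True if the swap succeeded."""
    # 1. Take the sick machine out of service, but keep it for investigation.
    active.remove(sick)
    quarantine.append(sick)
    # 2 and 3. Start the replacement and run every health check against it.
    if all(check(replacement) for check in health_checks):
        active.append(replacement)
        return True
    # The replacement is unhealthy too: keep it aside and let a human look.
    quarantine.append(replacement)
    return False


if __name__ == "__main__":
    active, quarantine = ["web-1", "web-2"], []
    checks: list[HealthCheck] = [lambda machine: True]  # placeholder check
    print(repair(active, quarantine, "web-2", "web-3", checks))
    print(active, quarantine)
```

When the function returns False, the service keeps running on the remaining machines, which is exactly the degraded mode that should raise an alert for a human.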
Our product is deployed in an industrial way and it self-repairs in case of problems. Everything is awesome!
But there are things I did not talk about. It is difficult to write a separate section on security, because security is not a project that can be bolted on afterwards. There is no miracle recipe to secure what already exists.
You must have security in mind in all the previous steps. You must always have an eye on this aspect. From development to production, every effort must be made to ensure that important data are protected, encrypted and isolated.
Let’s start with encryption. It is about transforming a message so that nobody can read it except your application, which is the only component able to decipher it. If an attacker manages to intercept the data, they will not be able to read it and will therefore have no way to exploit it. If this topic interests you, I recommend this article.
Then there is something important that few companies put in place: an arrival procedure and a departure procedure. When a new colleague arrives, how do they get access to all the tools? And what do you do when they leave the company?
Many security flaws arise from this part. Few companies are able to list precisely who has access to the different tools they use and some people are able to act on their production environments without the company having explicitly agreed.
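A first step towards knowing who has access to what is to compare the accounts on each tool with the list of current employees. The Python sketch below assumes you can export both lists; the tool names, the people and the `stale_access` helper are invented for the example.

```python
def stale_access(
    grants: dict[str, set[str]], employees: set[str]
) -> dict[str, set[str]]:
    """Return, for each tool, the accounts that belong to no current employee."""
    report: dict[str, set[str]] = {}
    for tool, users in grants.items():
        leftovers = users - employees
        if leftovers:
            report[tool] = leftovers
    return report


if __name__ == "__main__":
    # Hypothetical exports: bob left the company but kept two accounts.
    employees = {"alice", "carol"}
    grants = {
        "git": {"alice", "bob", "carol"},
        "cloud-console": {"alice", "bob"},
        "monitoring": {"carol"},
    }
    print(stale_access(grants, employees))
```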
Finally, you have to keep the tools that you use up to date. One of the benefits of industrializing releases is that you can apply updates more often and be less vulnerable.
However, the problem with computer security is that it always remains partly an illusion. You can never be really sure that you are safe. You can only reduce the risks.
SURVIVE CATACLYSMIC EVENTS
I told you about what we can put in place to have a solid production. And that’s how we get to the last aspect of my work: disaster recovery, or the art of surviving the apocalypse.
Well, if a falling meteorite threatens to end humanity, your production may not be your first priority. But when an excavator rips out the optical fiber feeding your datacenter, your production will no longer be accessible.
Concretely, you can overcome this kind of event easily if you have done your industrialization properly. It will just take time to restore normal operating conditions. The goal here is to stay proactive. Predict the worst-case scenarios you can imagine and be aware of how likely each kind of event is. I have lost a data center twice in a single year, and that does not count the outages that did not affect the services I was working on.
There is not much to do when the infrastructure itself breaks under your feet. But instead of giving up and starting to pray, you can prepare for it:
• Imagine potential disasters
• Assess the risks of each one of them
• Assess the potential impact on your production
• Prepare for impact
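A simple way to decide where to spend preparation time is to score each scenario by likelihood times impact and sort the list. The Python sketch below uses an arbitrary scale from 1 to 5 for both values; the scenarios and their scores are illustrative guesses, not measurements.

```python
def rank_risks(scenarios: dict[str, tuple[int, int]]) -> list[tuple[str, int]]:
    """Sort scenarios by likelihood * impact, highest score first.

    scenarios maps a name to (likelihood, impact), both on a 1 to 5 scale.
    """
    scored = [
        (name, likelihood * impact)
        for name, (likelihood, impact) in scenarios.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical scenarios with made-up scores.
    scenarios = {
        "fiber cut at the datacenter": (3, 5),
        "database disk failure": (4, 3),
        "meteorite": (1, 5),
    }
    for name, score in rank_risks(scenarios):
        print(score, name)
```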
These painful events are theoretically rare, but believe me: they happen more often than you might think. Preparing for them is a bit like taking first aid training: you hope that you will never have to use it, but you are relieved to have taken the course when the situation requires it.
I recently realized that this whole universe of skills was completely foreign to the people I met. They kept asking me, even after several years, what my job concretely was. This recurring question motivated me to write this article. I think that I have given a quick overview of my work here and clarified some of the things I have been working on.
Every point that I mentioned in this article could be the subject of its own article, not counting the notions that I did not address, such as the DevOps culture, the methods applied in development teams or network-related issues.
I will write other articles in this style to clarify the concepts I am working on every day.
See you soon!