Continuous Delivery At Scale
Haroon will share the story of how Glovo's engineering culture and mindset evolved over time with the introduction of continuous delivery practices that enabled the team to adopt zero-downtime continuous deployment. Follow the journey from a monolith architecture to a growing set of microservices, and the benefits along the way.
Hello, everyone, and thank you to UXDX for providing this opportunity to present a case study we went through implementing continuous delivery at scale. The high-level agenda today: we'll introduce the problem we were trying to solve and why we chose continuous delivery for that specific problem, then the process we went through as a team, and finally the outcomes and where we are heading next.
So let me introduce myself. My name is Haroon Rashid. I've recently joined Glovo as an engineering manager. I'm new to Barcelona, and I'm looking forward to exploring the city; for now, we haven't been able to because of the COVID situation. For the last few years, my focus has been on promoting and adopting continuous delivery and engineering productivity practices in different organisations. A little about Glovo as an organisation: our vision is to create a super app that makes everything in your city accessible at your fingertips. We have two mobile apps, one for Android and one for iOS, and we are a three-sided marketplace. We have users and customers; and we have Glovers, our couriers, who have the flexibility to choose their working hours on our platform and can work to their own schedules.
We have partners, so we enable partners to connect to new and existing customers, bringing incremental revenue to their business. And you can see the growth we went through in the last 18 months or so; it's phenomenal. Last year, at least every four days we were opening up in a new city. Two years ago, we were operating in three countries and eight cities; now we are operating in around 22 countries and more than 300 cities.
So the problem we had was around one of the key components in our technology stack, which we call the Monolith. The Monolith is the back end service that powers everything we do at Glovo. To make you understand how critical this system is for us: if the Monolith goes down, then Glovo stops. Our customers cannot place orders, couriers cannot see the orders they need to deliver, and our partners won't be able to receive orders or prepare them. So imagine a situation where something goes down on the partner side: our couriers will be at the partner restaurants, but they won't be able to pick up the order that the customer placed, right? These kinds of situations are very critical to our business. And the Monolith is a single code repository shared across many different teams.
People who are developing features for couriers, for our partners, and for our customers are all contributing to the same code base. And the sheer volume of code, 560K lines, is huge as well, right? So if something goes wrong there, we lose a lot of money in revenue, which is quite impactful for the organisation.
So the problem was our deployments to production for the Monolith: until late 2019, our deployment frequency was once per day. Imagine a scenario where, as a business, you come to our engineering folks and say, okay, we want this feature developed, and our engineers have developed that feature, but we are not able to release because our deployment window has gone by, so the business and the development team have to wait 24 hours for the next deployment window to be available so they can release the feature.
And whenever we deployed, the key thing was that our change failure rate was around 10 to 20%: out of 100 deployments, 10 to 20 would result in rollbacks, and rollbacks are not a good thing to have in any engineering organisation, right? When you have a rollback, you have to do the root cause analysis, you have to involve people from many different teams, you need to conduct post mortems, and then come up with action plans and action items for team members to make sure that kind of situation does not happen again.
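To make the metric concrete, here is a minimal sketch, our own illustration rather than Glovo's tooling, of how a change failure rate like the 10-20% above is computed from deployment and rollback counts:

```python
def change_failure_rate(deployments: int, rollbacks: int) -> float:
    """Percentage of deployments that ended in a rollback (the
    'change failure rate' metric). Illustrative helper only."""
    if deployments <= 0:
        raise ValueError("need at least one deployment")
    return 100.0 * rollbacks / deployments

# e.g. 100 deployments with 15 rollbacks -> 15.0 (%)
```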
So there was a lot of waste in our process, and it was preventing us from moving faster. We also had a strict policy that we don't deploy on Fridays and weekends, because Fridays and weekends are traditionally the busiest time for us, and if we deploy then, we can potentially impact the business, and that impact can be a lot higher. So this was the main problem we were trying to solve at Glovo.
So we decided to adopt continuous delivery practices to resolve these specific issues. Most of the things you will see on the following slides are not new to Glovo, and they're not specific to Glovo either. What we have done is read about what continuous delivery is and take these practices from industry experts; a lot of smart people have written books about these things, and we believe we can leverage their experience.
So for us, continuous delivery is not just about CI/CD. And incidentally, Dave Farley, who has written a book on continuous delivery, agrees with us: continuous delivery is much more than just automating your deployments, and in his opinion it is actually broader than DevOps, and we tend to agree with him. So when we say continuous delivery, it's not just about automating your CI and your CD; implementing a tool like Jenkins or Spinnaker does not by itself mean that we have adopted continuous delivery practices.
So for that we defined continuous delivery with this kind of definition. The idea is: our business can come to us at any given point in time and say they want this idea implemented, and as an engineering team we should be able to develop it and go through the whole build, test, and deploy lifecycle as quickly as possible, maybe in an incremental way, and ship our software whenever our business is ready. So our main branch should be releasable at any time. And this often requires not just technical changes; it might require organisational change, which luckily we are going through right now, so the structure is there already. Technical changes are required, and cultural changes might be required as well. And we'll go through what kind of changes we adopted on this journey.
So again, the question we asked ourselves: why do we want to adopt continuous delivery? Obviously, we wanted to fix the problem we had around the Monolith, which was reducing time to market while making sure our quality is not compromised as we go faster. But what else do we want to do, and what can we learn from the industry?
So this is quite key for us. We said we want to go faster, so we want to shorten the lead time, improve the feedback loop, and enable our business to experiment a lot more when it comes to new features on the platform. The idea is that our business can come up with an idea, we can develop it as quickly as possible and roll it out to our customers, and our operations folks can get feedback from our customers and, based on that feedback, iterate on their feature requirements. This is quite key because, without experimentation, you cannot enable your operations or your business to get meaningful insights on a day-to-day basis. And John Allspaw has a famous diagram which highlights how slow delivery cycles can impact your business as well as your development teams, compared to fast delivery cycles.
On the left hand side, the graph shows large batch sizes going into production. If it's not a daily release, it might be weekly, monthly, or quarterly. The more you delay your release, the more lines of code you are releasing into production; it will take longer to release, and at the same time there are much higher chances that something will go wrong because the batch size is too big. On the other hand, if your changes are small and you deploy frequently, then the chances are your code changes will not cause a system outage, right? Your business will be happy, and in turn, your development teams will be happy as well.
One of the key things for us was that we looked at different surveys done by companies, and one of the key surveys, done by Puppet Labs, was quite close to what we were trying to achieve. It highlights a number of key KPIs or metrics, like deployment frequency, lead time for changes, mean time to recovery, and change failure rate. As per that survey, depending on an organisation's answers, it can be deemed a low, medium, or high IT performer. When we answered these questions, we were around a medium IT performer.
We were releasing every day, our change failure rate was around 20 to 30%, and our mean time to recover was less than one day. And the motivation was that by adopting continuous delivery practices, we could move to being a high IT performer organisation, where we can deploy multiple times per day, our lead time would ideally be less than one hour (though that was not possible for the Monolith, and I'll explain why), and our change failure rate should be between zero and 15%.
So to resolve that problem, we initiated a project called Valkyrie. In project Valkyrie we gathered the different people who were contributing to the Monolith: developers from the courier side, from the partner teams, and from the customer teams. All these back end developers were tasked with one objective: adopt continuous delivery practices for the Monolith. And what are those practices? One of the key things was to enable continuous deployment to production on each commit, and for that we had to change how we manage our source code.
We use GitHub for our source code repository, but the branching and merging strategy we used to follow was Git flow. And Git flow, although it's a good branching strategy, did not enable us to do continuous deployment. So we moved to trunk based development, which means every commit by a developer to master goes to production automatically, and there is only one source of truth for our code base, which is master. So with continuous deployment you can say: whatever is in master is in production. If you want to read more about trunk based development, there is a very good website maintained by Paul Hammant, who has spent a lot of time making sure all these strategies are defined properly and are clear, so the adoption path is very clear by following that website.
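The "every commit to master deploys" rule can be sketched as a tiny gate on push events; the event shape below is a simplified, made-up GitHub-style payload for illustration, not Glovo's actual pipeline trigger:

```python
TRUNK = "refs/heads/master"

def should_auto_deploy(push_event: dict) -> bool:
    """Trigger the production pipeline only for commits landing on
    the trunk; pushes to other branches and branch deletions are
    ignored. Illustrative sketch, not real CI configuration."""
    return (push_event.get("ref") == TRUNK
            and not push_event.get("deleted", False))
```

In trunk based development there is no separate release branch to cut, so a gate like this is all the "release decision" that remains.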
Again, one of the key things for us was to define what testing means to us. Traditionally, what we have seen in the software industry, and in many organisations, is that testing is a phase in the software development lifecycle: as a developer, when you finish something, you move your story from development to testing, and someone else, either in your team or in a separate department, picks up that story and starts testing, right? The risk of doing this is that you throw your feature over the wall to some other department or team member, and then you get into a cycle where they report bugs, you fix them, and that can go on for many iterations. We think that is counterproductive.
We think testing is a cross functional activity that involves the whole team and happens throughout the project: it starts when the project starts, but it never finishes. And this diagram, from Dan Ashby, highlights that if you want to adopt DevOps or continuous delivery practices, then testing is involved in each stage, phase, or activity, whatever you want to call it: planning, branching, coding, merging, building, releasing. And if you are deploying, post-deployment verification is something you want to test as well.
When you release, you're testing there too; your monitoring is also contributing to validation and verification. So all these things are tested in some form or other, and the whole team is responsible for that. When it comes to automated deployments and releases, we adopted an open source tool, Spinnaker, a cloud-native continuous delivery tool. Spinnaker came out of Netflix, which contributed a lot to it. In 2015, I think, Google joined Netflix to develop it further, and in 2016 they open sourced the tool. Since then, many companies have been contributing to Spinnaker and using it to orchestrate their continuous deployment workflows.
In terms of our deployment strategies, Spinnaker supports all these different types of strategies out of the box. We've gone for blue/green; in Spinnaker terms, they call it red/black. I'll explain why we went for blue/green and how this is helping us in our continuous deployment efforts. What we do is deploy a new cluster with the new changes and start rolling traffic out to it 50/50: 50% of the traffic goes to the new cluster while 50% goes to the old cluster, and we monitor for half an hour. If within that half hour nothing goes wrong, the deployment is deemed successful and we destroy the old cluster.
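The flow just described can be sketched roughly as follows. The hook functions (deploy_new, shift_traffic, healthy, destroy) are placeholders standing in for what Spinnaker actually does, not its real API, and the clock/sleep hooks are injectable so the half-hour watch window can be simulated:

```python
import time
from typing import Callable

def blue_green_deploy(
    deploy_new: Callable[[], str],
    shift_traffic: Callable[[str, int], None],
    healthy: Callable[[str], bool],
    destroy: Callable[[str], None],
    old_cluster: str,
    watch_seconds: int = 30 * 60,
    poll_seconds: int = 60,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Blue/green (Spinnaker 'red/black') rollout sketch: bring up the
    new cluster, split traffic 50/50, watch for the verification
    window, then either promote the new cluster or roll back."""
    new_cluster = deploy_new()
    shift_traffic(new_cluster, 50)        # 50% new / 50% old
    deadline = clock() + watch_seconds
    while clock() < deadline:
        if not healthy(new_cluster):
            shift_traffic(new_cluster, 0) # send all traffic back to old
            destroy(new_cluster)
            return False                  # rolled back
        sleep(poll_seconds)
    shift_traffic(new_cluster, 100)       # promote the new cluster
    destroy(old_cluster)
    return True
```

The key property of blue/green is visible in the sketch: the old cluster stays alive for the whole watch window, so a rollback is just a traffic shift rather than a redeploy.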
Obviously, there are other strategies as well, and we are evaluating Canary deployments. We are evaluating Canary because we think that by adopting canary analysis, we will be able to reduce the verification time window and do automated rollbacks. Right now our rollback strategy is: if something goes wrong, developers can initiate a rollback using Spinnaker, which obviously we want to automate as soon as possible.
One of the key things for us, and this is more of a philosophical point than a practice, I would say: continuous delivery says that you build software, and then you are responsible for running and operating that software in production as well. We wanted to adopt this practice, but to do so we had to develop some tooling, a lot of release process documentation, and a lot of runbooks, to make sure that when something goes wrong, developers know what they need to do. The runbooks are there to help them understand, if a specific situation happens, what the next steps are to either go back to a stable state or resolve the problem as quickly as possible.
This highlights the tooling we built and how developers are utilising it. The bot, we call it internally the Valkyrie Police, sends developers notifications when their changes are about to go live. So this example highlights that a developer has made some changes and his feature is about to go live: we have sent him a personal message on Slack that says, okay, here are your changes that are about to go live, and this is the change set, and now he's responsible for making sure the deployment is successful. He's responsible for monitoring the systems for the half hour window after his changes have been deployed, and if something goes wrong, he would initiate a rollback. All of these are events that we send to Datadog, which is our observability platform, so we can monitor over time how many successful deployments we had and how many rollbacks we initiated. This helps us identify the change failure rate of our deployments.
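The notification itself can be as simple as a structured message; the payload shape below is made up to illustrate the idea and is not the actual Valkyrie Police bot:

```python
def deployment_notification(author: str, commits: list,
                            watch_minutes: int = 30) -> dict:
    """Build a Slack-style direct message telling the author that
    their change set is going live and that they own the watch
    window. Hypothetical format, for illustration only."""
    change_set = "\n".join(f"- {c}" for c in commits)
    return {
        "channel": f"@{author}",
        "text": (
            "Your changes are about to go live:\n"
            f"{change_set}\n"
            f"Please monitor the deployment for the next {watch_minutes} "
            "minutes and initiate a rollback if anything looks wrong."
        ),
    }
```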
So post Valkyrie, our deployments are automated, and we are deploying 8.7 times per day. Our goal was around 10 deployments per day when we started the project, but because the Monolith is so huge, it takes a lot of time to build and test and make sure the quality is not compromised. The deployments are fast, but the verification is still manual for that half hour window; that's why our turnaround time is 2.5 hours. But if you compare it to before, going from 24 hours to 2.5 hours is a big improvement. The key thing is that our change failure rate has gone down from 10-20% to 1.53%, so our business is much happier, our platform is much more stable, and we can release with much more confidence. A release now is a non-event.
Previously, at Glovo, the daily release used to be an event: the platform team would gather, the developer would hand over their code, we would try to deploy it manually, and if something went wrong, the whole team would start looking for issues and debugging them. Now it's a non-event; it happens automatically. The only thing developers notice is a message in Slack via the Valkyrie Police that the change has been rolled out, and if something goes wrong, they have the tooling available to initiate a rollback. And then we have processes in place to do post mortems and come up with action items if we need to improve the tooling or the process. But that rarely happens now. So that's a big win for us as an engineering organisation and for Glovo in general.
So, what this has enabled us to do is plan our next project. This figure I've shown you before, so I won't go into detail, but the main thing here is that because the Monolith is shared, nobody really owns it, so there is an ownership problem. If something goes wrong, we scramble people and ask, "Okay, what went wrong, what was the change, who's responsible, who needs to roll back", and so on. And because it's so huge, it's brittle in nature: if you change something as a developer, you don't know what the impact is going to be on other teams, other departments, or other features in the app. That's why we have come up with a lot of processes where, if you're making impactful changes, you need to initiate these processes and a lot of discussions. We think we can eliminate these kinds of inefficiencies from our processes. And obviously, it's very complex.
It's a Monolith, it's a legacy system, and it's very complex to operate, build, test, and deploy. So we have initiated a new project, called the Darwin Project: our journey from Monolith to microservices. We believe that by having specific microservices owned by different teams based on their domains, we will solve the ownership problem. These services will also be more resilient, because they are small in nature and it's easy to reason about the changes and about how we are building, testing, and deploying them. It will also help us be more agile: if our business comes to us with a new feature, we should be able to respond to their request much more easily and quickly.
At the same time, if something goes wrong in production, we should be able to isolate the problem to a specific microservice or set of microservices and resolve it quickly. Our first microservice went live in production, and using the same mechanism, Spinnaker as the deployment tool and a lot of automated testing in place, we were able to do 15 deployments per day for that first microservice, and our feature turnaround time for it went down from 2.5 hours: if a change is ready to be deployed, we can go through the whole automated pipeline within half an hour. That's quite a big milestone for us.
So the key takeaways from project Valkyrie and project Darwin so far: it always helps to have focused teams delivering very focused objectives. For whatever project you are going to initiate, if it is a technical project, you need to define the business KPIs and the technical KPIs that make up the success criteria for that specific project.
Then there are some key principles that we adopted along the way, and immutable infrastructure is one of them. Immutable infrastructure means that you run some scripts and your infrastructure is there; the source of truth for your infrastructure is those scripts. There is no config done manually, no setting applied manually on any part of your infrastructure. And the idea is that you throw away your old infrastructure: when you deploy a new version, you bring up a new set of servers, load balancers, or whatever is required to run your piece of software, and the old infrastructure is discarded.
It gives you repeatability in your process, and repeatability and consistency are key when something goes wrong. Obviously, when everything is successful, you don't notice consistency and repeatability as a big bonus, but if something goes wrong and you can reliably reproduce the issue, it's a big win. We have seen this many times: if something goes wrong and you can run the same script and get the same result, it's a lot easier to debug and find out what the root cause is. On the other hand, if you don't have this kind of repeatability and consistency in your infrastructure, or in anything you do, especially around testing, then you introduce flakiness, and flakiness can contribute to a loss of confidence in your infrastructure and your practices.
If something goes wrong, and you run the same script and get a different result, it's very difficult to find out what the problem might be. Again, think about what kind of deployment strategies you want to adopt; we went for blue/green because we thought it was well suited for us. Now, retrospectively, Spinnaker has been in production for the last three months, and for that half hour window the developer is still proactively looking at the logs and the monitors to make sure nothing is wrong with our production systems, and we think we can reduce that half hour.
By adopting canary analysis with automated rollbacks, we think we can bring a lot more efficiency to the process: we can reduce that half hour to maybe 5 to 10 minutes. And with automation in place, we can free up the developer: if something goes wrong, we should be able to automatically roll back and send a notification to the developer, "hey, you deployed something, we had to roll back, these are the logs, these are the issues you need to investigate", rather than the developer proactively watching: "okay, I have shipped my changes, now I need to monitor for half an hour". So automation and deployment strategies are key to freeing up your developers' time to focus on innovation and on new features going into production.
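The core of such an automated check can be as small as comparing the canary's error rate to the baseline's. Real canary analysis (Spinnaker's Kayenta, for instance) compares many metrics statistically, so this is only a toy version of the idea, with a made-up tolerance:

```python
def should_rollback(baseline_error_rate: float,
                    canary_error_rate: float,
                    tolerance: float = 0.005) -> bool:
    """Roll back automatically when the canary's error rate exceeds
    the baseline's by more than the allowed tolerance (0.5% here,
    an illustrative value, not a recommendation)."""
    return canary_error_rate > baseline_error_rate + tolerance
```

Wired into the pipeline, a True result would trigger the traffic shift back to the old cluster and the "we had to roll back" notification, with no developer in the loop.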
Operational integration is quite important: all the data you generate from your CI/CD processes and from your testing, we send to Datadog, our observability platform, and that in turn helps us identify where the bottlenecks are. Same example: the half hour window we have right now was highlighted by our Datadog integration. The time to market is 2.5 hours, and out of those 2.5 hours, for half an hour we just wait; although the feature is being exercised in production, we are just waiting. To reduce that time, what we can do is canary analysis and automated rollbacks.
Again, a lot of tooling is required to make sure developers are informed of changes going into production, the documentation is available, the release process is very clearly documented (especially for onboarding new people), and runbooks are there for people to understand what they are supposed to do if something goes wrong. All of these contribute to the very healthy engineering culture we already have at Glovo, and we are striving to improve it a lot more based on the feedback we are getting from all these systems and initiatives. Thank you.