Reducing Risk With Continuous Delivery

Talk


Enabling the Team
UXDX APAC 2021

Dan Harper and his teams at Xero have just wrapped up big changes within the business. With involvement required from 40+ teams, changes to many critical legacy systems and the need to build several new platforms, they carried massive risk throughout the project, particularly when "going live" to customers.
In this talk, Dan will share what worked, what didn't, and the continuous delivery practices the teams implemented that helped them reduce risk and incrementally deliver changes to production across global regions and different customer segments.

Dan Harper

Dan Harper, Senior Engineering Manager, Xero

Hi, my name's Dan Harper. I'm a Senior Engineering Manager at Xero, and today I'm going to lead you through an example of a project where a whole group of teams realised they could use continuous delivery principles to reduce risk and deliver a successful outcome for customers.

We're talking about a highly complex project where knowledge of the systems that needed to change was very low, and what knowledge did exist was spread thinly across Xero's employees. Because of that complexity and that low knowledge, confidence in the project was very low; stakeholders weren't convinced we would be able to deliver successfully and on time. And we had the biggest challenge of all: releasing this whole group of changes with zero downtime. Xero has about two and a half million small business customers spread around the world, all in different time zones, so delivering a project of this scale with zero downtime and no impact to customers was a serious challenge.

So we started with a hypothesis. We had a highly complex piece of work: could we break it down into small chunks to reduce risk, deliver it to production in small increments, and measure and get feedback quickly? Could we do this? What would it look like? And how would we grow?

Where did we start? First of all, we hired some key champions. These engineers had done continuous delivery on previous projects, so they had a lot of experience of what worked and what didn't. There were some key openings within our team where we could bring those people in, and we could use them to influence others around them, to teach them and grow them, and to show them how continuous delivery worked and how it would reduce risk for the whole project. They started by influencing a small group of people around them, and then a bigger team as time went on.

There were some challenges, though. If you bring in new people with a lot of different experience, they are going to clash, if you like, with the people who are already there. You have people coming from two different worlds: people who have no experience with continuous delivery and people who do. How do you reconcile the two? How do you help them work together and resolve their differences?

What we did to start with was build a new culture within the team. How did we do that? We started with some core values, practices and principles. These are borrowed from ThoughtWorks; they're not designed by us, but they're the ones we adopted for the team. The core values are the four in the middle there: clean code, fast feedback, simplicity and repeatability. Around the outside we have a set of practices and principles that act as a kind of contract within the team: this is how we will build software together, this is how we will behave, these are the practices we will do regularly as a team. So we would split work into vertical slices. We would have a simple design.
We would have collective code ownership, pair programming, refactoring, test-driven design, continuous integration, and automation of repetitive tasks.

I want to touch on one particular example that covers fast feedback, just to give you a glimpse of what we did and how some of the systems worked. We didn't only want fast feedback from customers and fast feedback to build knowledge of how the system worked; we also valued fast feedback in the developer tooling experience. We started with a principle: the same tests, the same checks and balances that run in our CI environment, should be 100% replicated and available in a developer's local environment. So whatever we created for CI, the integration tests, the unit tests and the scripting around them, we made available as a pre-check-in script that developers would run before pushing their code to master. That meant they got feedback relatively quickly: instead of waiting for CI to spin up agents and start a build, they could run it all directly in their local environment and get the confidence they needed to push their code to master. One important element came after the pre-check-in script: once the commit and the push were made, everything was 100% automated, true continuous deployment, from the push all the way through to deployment and verification in production. The other thing that helped us enable this was containerised environments. We used scripting quite heavily, of course, but we also had containerised environments all the way through to setting up dependent services, integration environments and integration-level testing, and all of that would run on developers' local machines.
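As a rough illustration of that idea, a pre-check-in script that mirrors the CI pipeline might look something like the sketch below. The commands, npm scripts and container service names here are placeholders rather than Xero's actual setup; the point is simply that the local script and the CI build run the same steps.

```typescript
// pre-checkin.ts: illustrative sketch of a pre-check-in script that runs the
// same containerised checks as CI before code is pushed to master.
// Command names and services are hypothetical, not Xero's real pipeline.
import { execSync } from "node:child_process";

// Each step mirrors a stage in the CI pipeline, so a green local run gives
// roughly the same confidence as a green CI build.
const steps: Array<[name: string, command: string]> = [
  ["Lint", "npm run lint"],
  ["Unit tests", "npm test"],
  ["Start dependent services", "docker compose up -d database queue"],
  ["Integration tests", "npm run test:integration"],
];

let failed = false;
for (const [name, command] of steps) {
  console.log(`--- ${name}: ${command}`);
  try {
    execSync(command, { stdio: "inherit" });
  } catch {
    console.error(`${name} failed. Fix this before pushing to master.`);
    failed = true;
    break;
  }
}

// Always tear the containerised dependencies back down, pass or fail.
execSync("docker compose down", { stdio: "inherit" });
process.exit(failed ? 1 : 0);
```

Because the local script and CI share the same steps, drift between "works on my machine" and the pipeline stays low.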
The other thing we looked to do was track metrics. We thought it was important to track where we are now, where we want to be, and how we're tracking towards the target. We wanted to make changes in very small and fast increments. The Accelerate book, which you may have heard of or read before (I recommend reading it if you haven't), talks about best practices across the industry in terms of DevOps, and it describes four key metrics that teams are encouraged to measure. We particularly focused on two of them. One was deployment frequency: how many deployments we made per day. One team hit a target of 20 deployments a day, which is great, and we averaged around that number. The other was delivery lead time, which we got down to 12 minutes. I've noticed that delivery lead time can be interpreted differently across teams, so I'll clarify how we actually measured it: from a developer commit into master to a production release. Getting that down to 12 minutes was great.

I also want to mention cycle time. When I talk to engineers or groups of engineers about cycle time, it's sometimes regarded as the bottom rung of metrics to track, and people question how much value there is in it. There can be dependencies outside your control that affect cycle time; if you're dependent on another team to get some work finished, then maybe it's not a measure of a high-performing team. I do agree with that sentiment, but there are some important things that can come from measuring cycle time and setting targets to reduce it.

The particular team I was working with had a cycle time of around four days. Actually, when I first started the conversation with them, they weren't measuring cycle time at all, but I encouraged them to start, and they came up with an average of four days. I gave them a challenge: what would it mean to bring cycle time down to one day? That's quite an aggressive target. Personally, as an engineer in a team, I would love a one-day cycle time as a target. If you're able to ship code at the end of the day, straight to production, you feel good about the work you're doing and the achievements you've made; you end your day with actual software in production, which is a great outcome. So I put the challenge to them: what would it look like to go to one day? One day was a little too aggressive for them, so we negotiated to two days.

This particular team had some challenges, because the work they were doing was mainly within a legacy codebase, a monolith shared by a number of different teams. Because of the age of the code and the practices around it during the life of the monolith, there wasn't a great setup or culture of test automation, so a lot of testing was manual. One element of working in these kinds of areas is that you need to schedule releases and coordinate with a whole bunch of other teams, because multiple teams are making changes at once, but you generally only want your own changes to be released at any one time: you want to reduce risk and not have other things being released at the same time, otherwise you have unknowns, and if things go wrong they can be very difficult to fix. In the challenge to get to a two-day cycle time, the team had some very important discussions about what that would look like, how they would structure their work and how they would break it down even further. It introduced a new level of discipline in the team. What they ended up doing was shifting their release timeframes to be more frequent. A lot of teams were booking releases around four weeks apart, and that means what you're deploying can be substantial: a whole team's work over four weeks being deployed at once. Instead, they booked releases three, four or five days apart, around a working week. That meant their changes were a lot smaller, the risk went down, they spent less time testing, they got faster feedback, and they were able to get changes into production faster. And generally speaking, culturally, it increased the feeling of flow within the team, which helped them see themselves as a higher-performing team, making improvements and feeling that happen in their day-to-day work, which is a great outcome.
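To make those definitions concrete, here is a minimal sketch of how the measurements discussed above, delivery lead time, deployment frequency and cycle time, could be derived from simple records. The record shapes and the cycle-time definition (work started to live in production) are assumptions for illustration; in practice the numbers come from your CI/CD and work-tracking tools.

```typescript
// metrics.ts: minimal sketch of the three measurements discussed above.
// Record shapes are hypothetical examples.

interface Deployment {
  commitToMasterAt: Date; // developer commit merged into master
  verifiedInProdAt: Date; // deployment verified in production
}

interface WorkItem {
  startedAt: Date;      // work began on the item
  inProductionAt: Date; // change live for customers
}

const MS_PER_MINUTE = 60_000;
const MS_PER_DAY = 86_400_000;

// Delivery lead time: commit to master -> running in production (minutes).
function averageLeadTimeMinutes(deploys: Deployment[]): number {
  const totalMs = deploys.reduce(
    (sum, d) => sum + (d.verifiedInProdAt.getTime() - d.commitToMasterAt.getTime()),
    0,
  );
  return totalMs / deploys.length / MS_PER_MINUTE;
}

// Deployment frequency: deployments per working day over the sample period.
function deploymentsPerDay(deploys: Deployment[], workingDays: number): number {
  return deploys.length / workingDays;
}

// Cycle time (one common definition): work started -> in production (days).
function averageCycleTimeDays(items: WorkItem[]): number {
  const totalMs = items.reduce(
    (sum, w) => sum + (w.inProductionAt.getTime() - w.startedAt.getTime()),
    0,
  );
  return totalMs / items.length / MS_PER_DAY;
}
```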
So what about working with other teams? In an ideal world you can go for continuous delivery with a team that's isolated and fully autonomous, able to do whatever they like, but in reality you probably have some constraints. One of them may be that other teams are working at a different cadence and aren't ready to move to continuous delivery, and if you have to collaborate closely with them, how do you reconcile that? We had exactly that challenge on this project. For starters, dependencies in themselves carry risk, particularly around delivery timeframes: if a team you depend on takes longer than expected, that flows on to all the dependencies downstream of that team and pushes out timeframes, and things taking a bit longer is quite common across software delivery, as we know. So that can be a big issue. The other challenge is a high-risk release situation. If you have multiple teams hitting a target all at once, or a big-bang approach when you launch into production, with multiple teams deploying at the same time or changing more than one thing in production at once, all of those things add risk.

Could we have a situation where teams run in parallel? Each creates their own delivery targets, which are communicated to everyone else, but they're able to run their own game. If you could get rid of dependencies altogether, that would be wonderful. We couldn't do it 100% in the ideal sense, but we were able to do it somewhat successfully using contracts. Early in the project's life cycle we started developing contracts to send out to the dependent teams, to give them notice, and we included documentation to say: this is how the system will behave, this is how the contract will look, this is the kind of data you can expect. We tried to give teams enough confidence that they could start building, knowing there weren't going to be major changes coming down the line. They could start their build and do things autonomously, in parallel with other teams, rather than being fully dependent on our work going from start to finish before they could begin.

Developing these contracts ended up being more difficult than we thought. We had representatives from teams all across Xero, because the project ran across a number of domains, and we were all in rooms together thrashing out what the contracts should look like. When you're talking about multiple domains, you're talking about teams and history across Xero: who works on what, and who owns what, were things we had to decide. If we're going to define these domains well, and how they interact with each other, where does the ownership model land? What belongs in which domain? All of these were challenges.
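As an illustration of the kind of artefact involved, a contract shared with dependent teams might look something like the sketch below. The event name, fields and values are hypothetical, not the actual Xero contracts; the point is that consumers get a documented, versioned shape they can build against before the producing system is finished.

```typescript
// contract.ts: illustrative sketch of a contract shared with dependent teams.
// The event name, fields and values are hypothetical examples.

/** Published when a subscription changes; consumed by downstream teams. */
export interface SubscriptionChangedEvent {
  eventId: string;        // unique, so consumers can de-duplicate
  occurredAt: string;     // ISO-8601 timestamp
  subscriptionId: string;
  region: "AU" | "NZ" | "UK" | "US";
  status: "trial" | "active" | "cancelled";
  schemaVersion: 1;       // versioned so additive changes don't break consumers
}

// An example payload like this can ship with the documentation sent out,
// so consuming teams can build and test against it in parallel.
export const examplePayload: SubscriptionChangedEvent = {
  eventId: "evt-0001",
  occurredAt: "2021-01-15T03:00:00Z",
  subscriptionId: "sub-1234",
  region: "AU",
  status: "trial",
  schemaVersion: 1,
};
```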
One of the biggest challenges we had is captured by the question on the slide there: what is a feature? That's one example showing that phrases and terms, what you call something, can mean different things across domains. They carry different contexts, and teams can interpret them differently. If you use a term based on what you understand a feature to be, but another team understands it to be something else, that can be really difficult to reconcile. We had some of those challenges, and we invested a lot of time working through these kinds of issues, communicating through them and trying to come to a shared understanding of how the systems would interact.

We also had some challenges around changing behaviour for the better, for the future of Xero. One example was moving from synchronous API calls to asynchronous behaviour. Changes like that have big flow-on effects on both sides of the contract: the teams that build the systems have inherent assumptions about how systems are going to behave, how they'll interact with them, what they can expect to get back and in what kind of timeframe. When you shift to an asynchronous model, those things can move around quite a lot, so that required an extra level of communication to explain to teams how things were going to change and what they could expect from the systems as this was deployed into production.

I want to talk to you about feature flags. Feature flags are commonly used in continuous delivery processes, but I think we took them to the next level, so I wanted to cover exactly what that looked like, to give you an idea of the depth we went to. Here's an example of how we segmented customers. It's very simplified, but we looked at what a customer is, the metadata around them, how you interpret the different types of customers and how we could treat them differently. Some examples here: the global region they're in, whether they're new or existing subscribers, whether they're on a free trial or fully signed up, and what source they came from, meaning the upstream systems that fed into our system. That was one view of how customer segmentation worked.

Then there's functionally breaking things up and feature flagging them, which is how flags are commonly used. In the example here, on the left-hand side we have some channels, which describes the systems upstream of the platform we were building; we also have a monolith and microservices, which are downstream of our platform. All of these are affected by the feature flags we have on or off, and in green I've highlighted one execution path. We used feature flags to enable only certain functional slices to be active in production at any one time. That meant we could limit the number of code paths actually being executed in production, which reduced risk, and we could do it on the fly. As soon as a transaction from a customer hit the system at the upstream level, everything downstream would follow the feature flags according to what we had set, and we had teams monitoring it to make sure things were working as expected.
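To make that concrete, a segment-based flag check along those lines might look roughly like the sketch below. The segment fields mirror the ones mentioned above, but the rule shape, names and example values are illustrative; in practice this usually sits behind a feature-flag service rather than hand-rolled code.

```typescript
// flags.ts: minimal sketch of segment-based feature flag evaluation.
// Rule shape, names and values are illustrative, not a real flag service API.

interface CustomerSegment {
  region: string;           // global region, e.g. "AU" or "UK"
  isNewSubscriber: boolean; // new vs existing subscriber
  onFreeTrial: boolean;     // free trial vs fully signed up
  source: string;           // upstream channel the transaction came from
}

interface RolloutRule {
  regions: string[];        // regions currently switched on
  sources: string[];        // upstream sources currently switched on
  includeTrials: boolean;
  newSubscribersOnly: boolean;
}

// Decide, per transaction, whether this customer goes down the new code path.
function isEnabled(segment: CustomerSegment, rule: RolloutRule): boolean {
  if (!rule.regions.includes(segment.region)) return false;
  if (!rule.sources.includes(segment.source)) return false;
  if (segment.onFreeTrial && !rule.includeTrials) return false;
  if (rule.newSubscribersOnly && !segment.isNewSubscriber) return false;
  return true;
}

// Example rollout: new AU subscribers from one channel, trials excluded.
const rollout: RolloutRule = {
  regions: ["AU"],
  sources: ["signup-web"],
  includeTrials: false,
  newSubscribersOnly: true,
};

// isEnabled({ region: "AU", isNewSubscriber: true, onFreeTrial: false,
//             source: "signup-web" }, rollout) evaluates to true.
```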
One thing we did very early on in the project was to come up with the idea of running things in parallel: code execution running in parallel, with the legacy system and the new platform both live. We designed it intentionally so that we could run them in parallel in production without any adverse impact. That meant we could compare apples to apples in production, comparing the data output of both code paths, to confirm we hadn't missed any edge cases, because our knowledge of the systems and their edge cases was quite low. We wanted to make sure things worked as expected. Sometimes, when we received new types of metadata on customers, our code would need to respond in different ways, so we wanted to verify that our code was doing what it was supposed to do.

One thing this gave us was the ability to flick it on and off quite quickly. With one change to a feature flag, we could direct customer requests and transactions through the system into the new platform or the legacy platform. I said we were running them side by side, which was true, but as we gained more and more confidence and started to deploy our systems to take over production, we would do it on a transaction-by-transaction basis, and that greatly reduced the risk. Instead of switching it on, leaving it on for days or weeks at a time and measuring, we would switch it on, wait for a customer transaction to happen, and as soon as we saw one, switch it back to the legacy platform. That meant we had a real live transaction we could go in and immediately verify: did it do what we expected it to do? That enabled us to gain more and more confidence that what we were building was working correctly, and to show that confidence, to be confident ourselves that we could put this into production without adverse impacts for customers. Doing it this way was a big win.
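A rough sketch of that parallel-run idea is below: the same transaction is processed by both code paths, the outputs are compared, and a flag decides which result is actually used. The types and function names are illustrative rather than the real implementation.

```typescript
// parallel-run.ts: rough sketch of running legacy and new code paths side by
// side, comparing their outputs, and serving only one result to the customer.
// Types and names are illustrative.

interface Transaction { id: string; payload: unknown; }
interface Result { status: string; amount: number; }

type Handler = (tx: Transaction) => Promise<Result>;

async function handleTransaction(
  tx: Transaction,
  legacy: Handler,
  newPlatform: Handler,
  serveFromNewPlatform: boolean, // the feature flag for this segment
): Promise<Result> {
  // Run both paths for the same transaction.
  const [legacyResult, newResult] = await Promise.all([legacy(tx), newPlatform(tx)]);

  // Log any divergence so engineers can verify edge cases against real traffic.
  if (JSON.stringify(legacyResult) !== JSON.stringify(newResult)) {
    console.warn(`Output mismatch for transaction ${tx.id}`, { legacyResult, newResult });
  }

  // Only one result is returned to the customer, controlled by the flag, so
  // the comparison itself has no customer impact.
  return serveFromNewPlatform ? newResult : legacyResult;
}
```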
We also had a big discussion and ended up deciding how we would go live and what that would look like. We challenged ourselves to go live with the very smallest thing we could. What was the smallest thing we could do that had end-to-end value? What could we go live with, and when? We aimed to get it live as soon as possible. That brought up a discussion within the team: what does that look like in terms of production support? What does it look like in terms of being on call, monitoring production, and even the customer support issues that would come through and disrupt the team if things went wrong? That was a valid concern, but we decided to go forward and do it anyway, partly because of the number of unknowns we had, the amount of risk we were carrying and the level of confidence we started with. We really needed a win on the board; we needed something live in production that was actually working for customers, to address some of those unknowns and that lack of confidence. That ended up being a great success. The big impact on the team didn't really happen. We did have some issues we had to attend to and jump on, but having some of those issues in production was also a win in itself, because the engineers got used to how their system behaved in production with a very small volume of customer transactions, so the impact was very low. The couple of issues we needed to jump on weren't critical, but they were things we had to adjust and tweak along the way, and having that feedback from real customer transactions was a massive value win for the team.

Your architecture is also something that's affected by continuous delivery, so I want to touch briefly on some of the things we did that worked well. In the diagram, on the left we have upstream systems, on the right we have downstream systems, and everything in between is what the teams were building. You would think that we'd build out the domain and that everything starts with the domain API, but in actual fact our system started with a group of adapters. The reason is that it enabled us to decouple ourselves from legacy systems, and it helped wherever upstream systems were lagging in change or couldn't fulfil some of the contract expectations we had, things like asynchronous behaviour, or where we couldn't get changes made in time or had mismatches between domains. We used an adapter layer, roughly one adapter for each upstream system, to translate into the domain API where needed. Where we were able to document, influence and line up priorities with the teams building the upstream systems, they could come straight to the domain API, which was fine and a big win, but we used the adapters as temporary placeholders so that we could keep the domain API pure, if you like. Then we had our internal domain, which is modelled there, and an example of how we used async: we used queues heavily so we could have asynchronous behaviour and make things quite fast.

One of the challenges was that we moved to a model that was eventually consistent, and that was a change. We had to have a lot of discussion with upstream and downstream systems, because upstream systems in the old model were using synchronous calls, where everything would go through and they would get high confidence back: yes, it's consistent, yes, it's been done, you can go ahead with the next stage. When we moved to async, that created some issues. We had to collaborate across upstream and downstream, because we were now feeding off the downstream, getting a response back, and then sending that back to the upstream system to say it had been done. That was a real challenge. But one of the advantages of building the architecture this way was that the separate components gave us a lot of flexibility. We did have a situation where one of our upstream systems was experiencing a delay, not a huge one, but enough to impact customer experience. What had happened was that two different components within our domain weren't quite interacting properly, and because of the architecture we had built, we were able to swap one of those components out for something else. The impact was quite low; we did it within a sprint and everything was good. One of the main challenges was really for the team to figure out how to measure this for the future: how do we make sure other upstream systems get the speed they need, and that we have confidence in the service level we're delivering?
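A minimal sketch of that adapter idea is below: one adapter per upstream system translates that system's shape into the domain API, so the domain model stays clean. The types and field names here are hypothetical, not the actual Xero interfaces.

```typescript
// adapter.ts: sketch of an adapter translating one legacy upstream system's
// request shape into the domain API. Types and names are hypothetical.

// The "pure" domain API that sits behind the adapter layer.
interface ProvisionSubscriptionCommand {
  subscriptionId: string;
  region: string;
  plan: string;
}

interface DomainApi {
  provisionSubscription(command: ProvisionSubscriptionCommand): Promise<void>;
}

// The shape one particular legacy upstream system sends today.
interface LegacyBillingRequest {
  SubId: string;
  CountryCode: string;
  PlanCode: string;
}

// One adapter per upstream system. If that team later calls the domain API
// directly, the adapter can be deleted without touching the domain itself.
class LegacyBillingAdapter {
  constructor(private readonly domain: DomainApi) {}

  async handle(request: LegacyBillingRequest): Promise<void> {
    await this.domain.provisionSubscription({
      subscriptionId: request.SubId,
      region: request.CountryCode,
      plan: request.PlanCode,
    });
  }
}
```

Keeping the translation at the edge is what made it cheap to swap a component out later without disturbing the domain.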
Figuring that out actually took more discussion than fixing the architecture itself. So what was the outcome? Overall, it was a great success. I want to share a few reflections, thinking back on this project: what went well, what didn't, and maybe what we would change next time.

First of all, the decoupled architecture was a huge win. We spent a lot of time working through the architecture and a lot of time talking about edge cases, what would work and what wouldn't. Dealing with a highly complex system required us to put a lot of work up front into the architecture, but that work had a huge payoff and really enabled the system to scale quite easily once it was fully delivered, which is great. Feature flagging, and the depth we went to with it, was also amazing. Being able to switch things on and off at such a fine-grained level, and to measure the results, was a huge win; it enabled us to take very, very small steps in production, and every step increased our confidence. We could bring a new customer segment into production, verify that it was good, communicate that out and move to the next stage: not only this region, but now this region too. That was a big win for the team. And going live to production early and often was also a huge win. I don't think we could easily have anticipated what the production support load would have been, so it could have been a loss, but it ended up being a huge win: our assumptions about the kind of impact it would have on the team, whether they would be distracted or spend their time on production support issues, didn't eventuate. So we gained a lot of the benefits without much of the projected cost of the things that could have gone wrong, which was great.

So how would we improve for next time? Looking back over the project, we immediately started delivering and shifting the teams into a continuous delivery way of working at the same time. In hindsight, it might have been better to negotiate with stakeholders, executives and the people around the team and say: let's take a couple of months just to embed these new practices in the team, to teach engineers and show them how it works, and to realise some of the benefits, before we move on to actually delivering. Trying to deliver and make the change at the same time added an element of stress for the team. It would also have been great to teach the 'why' more. In some cases, when we were doing a lot of architecture work, we had to say to the engineering teams: look, this is how we're doing it and this is what we're doing. Teaching the why, and the principles and thinking behind it, was a lot more difficult because we were executing at the same time; we felt time-poor. It would have been great to go back and do that as more of a learning opportunity for the engineers working on the project. And I think we could have over-communicated even more than we did. We did communicate a lot: we wrote a lot of documentation, we had a lot of meetings, we had a lot of Slack comms. But I think we could have communicated even more, above and beyond that.
Sometimes that would have eliminated some of the challenges we had collaborating with other teams. This whole project had around 40 different teams in play at once, so there were a lot of moving parts, and taking over-communication up to the next level would have been a great thing to introduce. So I hope you enjoyed that and got a lot out of it, and I hope it showed an example of how we reduced risk through continuous delivery. Thank you.