Data Driven Engineering Team - Changing The Culture

Talk


Continuous Delivery
UXDX Europe 2020

Since joining Hootsuite, Greg Bell has been leading change across his engineering teams, not only within the teams themselves but in how they work with other teams to focus on customers' needs.
In this session, Greg will look at how he has changed the culture within his team over the past few years to become a data-driven engineering team.

Greg Bell

Greg Bell, VP, Software Development, Hootsuite

After reading most books on strategic leadership, organizational change or cultural change, you'll likely have a mental model that looks something like this. Along the time axis, you build a vision and mission, create some values, and then collaboratively work with your teams to build goals and assign some measurable metrics and voila: leadership complete, clear sailing from here on out. It will be results, results, results, up and to the right. My experience is that it feels a lot more like this while we're in it. While we're deep in it, progress is not always clear. It may or may not feel like we're heading in the right direction, the whirlwind of every day is very real, and progress feels very slow. Are we making progress on strategic objectives? Who knows? I need to deal with this specific issue right here, right now. So, cultural change is not a linear path.
Hi, I'm Greg Bell. I'm the VP of Software Development at Hootsuite, where I lead a team of 200 engineers across six offices to deliver on a mission to make social the highest performing customer engagement channel. Our software delivers 30 million posts a month and processes over 50 million social events a day for large and small businesses around the world.
So, cultural change, data-driven engineering. These are some lofty topics. They're loaded terms, full of different meanings to each of us, and I'm not possibly going to do them justice in the next 25 minutes. I do, however, want to share our actual journey, not just the glossy, successful version, but some of the challenges that we encountered, where we made change and what worked. At Hootsuite, we have a core value of working out loud and today I'd like to do just that. For me, cultural change is about changing behavior. In my role leading software development, I have peers like our Head of Design and our Head of Product Management that I collaborate with every day, and the culture of our engineering team is deeply connected to the culture of our design and product management teams. We work together to build the product and our culture is really just one.
I've learned that we must change behaviors together. And data-driven engineering: what do I mean by that? Well, for me, I'm thinking about running an entire department and not just specific teams. So, when I'm talking about data-driven engineering, I'm worried about what measures and metrics we need across 200 engineers and a hundred designers and product managers to increase the quality and value of our products in-market over the long term. At Hootsuite, we're split into teams of six or eight people, fairly standard in our industry now, with representation from product, design and development. We have about 30 of these teams that make up our entire product and development organization, and like many organizations, especially right now, we're either split across offices or working from home a lot. Each office ends up having its own culture that has evolved over the years. So, today's journey, sort of by design, is from the perspective of a VP of Engineering. Hopefully the story is helpful no matter where you sit within an organization.
But to get us started, I'd like to rewind to January 2018, which seems like a long time ago now. Many of you, I'm sure, have these kickoffs at the beginning of the year; Hootsuite is no different. I've been at the company for over five years now and we've done it every year. Yearly strategic planning ends up taking a lot of time, even with a company of our size; we're about a thousand people worldwide. One of the biggest challenges is what I call the Desk Drawer Strategy. Desk Drawer Strategy plans are the ones that our leadership team puts together at the beginning of the year to rally the troops at some type of kickoff event, without really doing the hard work of planning how the organization will actually deliver on them and the results they expect.
So, the plan is put together and then put in a drawer, and everyone goes back to their regularly scheduled work. At the end of the year, in preparation for the next year's planning cycle, the leadership team opens that drawer again and finds the plan from last year. At this point, it's been a year: do we even remember what we said last year, and how did we actually do on our plans? And as you can imagine, and as maybe some of you have actually experienced, this is not the most effective way to make change in an organization.
So, circa 2017, this was generally the style of strategic planning I saw in our organization. When 2018 planning came about, we decided to build goals for the year that we could look back at and decide whether we'd actually delivered on them, for better or worse. We were looking for a behavior change. We needed all 30 of our teams to build a cadence of delivering on their team roadmaps and departmental or corporate initiatives, and we needed to hold ourselves accountable to the results we desired, not just a set of projects that we wanted to do. So, we made the decision to structure these goals as objectives and key results. Of course, OKRs were all the rage, and they continue to be, and we really liked them. The key property we liked about OKRs was that they forced us both to clarify what we were trying to accomplish and to create measurable results that we would be able to track to know how we were doing against them.
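As a tiny illustration of that structure (our actual 2018 objectives aren't listed here, so the objective, metrics and numbers below are made up), an OKR pairs one qualitative objective with a handful of measurable key results:

```python
# Illustrative shape of a single OKR; the objective, metrics and numbers are
# made-up examples, not Hootsuite's 2018 plan.
okr = {
    "objective": "Improve the quality customers experience in the product",
    "key_results": [
        {"metric": "open escaped bugs", "baseline": 400, "target": 200},
        {"metric": "automated smoke tests", "baseline": 50, "target": 150},
        {"metric": "hours of platform degradation per quarter", "baseline": 40, "target": 20},
    ],
}


def key_result_progress(kr: dict, current: float) -> float:
    """Fraction of the way from baseline to target, clamped to [0, 1]."""
    span = kr["target"] - kr["baseline"]
    return max(0.0, min(1.0, (current - kr["baseline"]) / span))


# e.g. 300 open escaped bugs means the first key result is 50% done.
print(key_result_progress(okr["key_results"][0], current=300))  # -> 0.5
```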
So, we went away as an engineering leadership team and ran a process to figure out what we wanted to accomplish in 2018. We did the hard work to prioritize the objectives down to three. I'm glossing over it here, and I don't want to underestimate how challenging it is to do this well. You'll probably hear me mention it a few times through this presentation, but prioritization is hard work. To do it well and create a sense of direction and alignment for a large team takes a lot of effort, but in the end we had our three department objectives along with three key results for each. We were incredibly excited that finally we would be able to measure the success of our yearly plans. Our product management team went away and did the same thing; instead of three objectives, they prioritized four, which in and of itself wasn't that big of a deal. Then our operations team did the same thing, and we found ourselves with 10 objectives and 30 key results across three departments that all work together to deliver our product to our customers. On the positive side, we did the hard work to come up with measures that we could hold ourselves accountable for, and we built communication plans to ensure that everyone knew the objectives we were trying to accomplish for the year and how they could help. Everyone in our department could read about them and get clarity about what it was we were trying to accomplish. The process actually created a game for our teams, and it was a game we could win. Individual team members could plan how they could contribute to these goals, so for some, it was incredibly engaging. And I'm certain, by the slide you're looking at, that you can already guess some of the challenges we ran into. We went from zero measurable targets in 2017 to 30 in 2018, and while we did our best to build a cadence of reporting, we had none of the organizational discipline, structures or tools in place to really measure ourselves well. The big challenge we ran into is that we didn't have a single bullseye; instead we had 30 bullseyes that were all competing for mindshare and attention. Teams didn't know if they should be working on departmental objectives or their own teams' roadmaps. Teams did a great job with what they had, but by midway through the year it became clear that we were missing the boat. We were making progress against some of our OKRs, but teams were confused about what they should be working on. And the net result of all of this was that quality in our product started to go down.
This is a chart of our bugs over the last half of 2017 and the first half of 2018. I'm not particularly proud of this, but this is what being more strategic actually got us. We were super clear on the major things we wanted to accomplish; however, we did it at the cost of everything else. The blue line is new bugs in our system, and you can see a big backlog of bugs growing.
So, fast forward to July 2018. We were in this situation where the backlog of bugs was growing, production incidents were a problem for us, and the performance of the platform was getting worse. On top of this, Cambridge Analytica happened, which meant that APIs that we rely on for our base business were in a state of rapid change. We knew we had to act and we had to act quickly. On July 18th, we had a Hackathon planned. Now, Hackathons are a big deal at Hootsuite. Every six months, we spend three days working collaboratively across dev, design, product and ops to build something innovative or fun, or just to learn something new. So, three weeks before the Hackathon, I made the very hard decision to rebrand the Hackathon as the Quality-a-Thon. There was a collaborative group that got together to finally make the call: our product managers and our operations leaders too. To this day, I have the sticker on my laptop, and it's a reminder to me and our teams to be humble, never to forget that we had to rebrand one of our Hackathons, which was supposed to be nothing but fun, into the Quality-a-Thon.
So, we clarified three goals for a week-long Quality-a-Thon: reduce the number of open bugs, increase the number of automated smoke tests, and increase the observability of our microservices architecture. We spent the last couple of weeks before the Quality-a-Thon building a scoreboard that would allow us to know if and when we were making progress during the five-day Quality-a-Thon, and to all of our surprise, the Quality-a-Thon was actually a huge hit. We had a ton of fun, we gave out amazing prizes and we made a huge difference for our customers. There's a lot of pride in building great stuff, and that week we built a stronger culture and actually cultivated some of that pride.
Another benefit that we saw was that we built a bunch of dashboards to actually run the Quality-a-Thon, in particular a daily dashboard for the number of escaped bugs. It was an integration between Zendesk and JIRA, two systems that didn't talk to each other before, and these were tools that we could actually use over the long run. However, we knew that the scoreboard alone wasn't enough. We needed to give teams a system to actually prioritize work, something that would allow local teams to make the right choice based on their context. Again, we had about 30 scrum teams, so we couldn't micromanage all of them. We needed them to feel empowered to make the right choice, whether to work on new features or to fix bugs.
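The talk doesn't describe the mechanics of that integration, but as a rough sketch of the idea, a daily escaped-bugs count can be pulled from both systems' standard search APIs. The instance URLs, JQL filter, Zendesk tag and credentials below are illustrative assumptions, not Hootsuite's actual dashboard:

```python
# Sketch of a daily "escaped bugs" count pulled from JIRA and Zendesk.
# All URLs, filters and environment variables here are hypothetical.
import os
import requests

JIRA_BASE = "https://example.atlassian.net"      # hypothetical JIRA Cloud instance
ZENDESK_BASE = "https://example.zendesk.com"     # hypothetical Zendesk instance


def jira_escaped_bug_count() -> int:
    # JIRA's search endpoint returns a `total` for any JQL query;
    # the "escaped" label is an assumed team convention.
    resp = requests.get(
        f"{JIRA_BASE}/rest/api/2/search",
        params={
            "jql": 'type = Bug AND labels = "escaped" AND resolution = Unresolved',
            "maxResults": 0,
        },
        auth=(os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"]),
    )
    resp.raise_for_status()
    return resp.json()["total"]


def zendesk_open_bug_tickets() -> int:
    # Zendesk's search endpoint returns a `count`; the "bug" tag is an assumption.
    resp = requests.get(
        f"{ZENDESK_BASE}/api/v2/search.json",
        params={"query": "type:ticket status<solved tags:bug"},
        auth=(os.environ["ZENDESK_USER"] + "/token", os.environ["ZENDESK_TOKEN"]),
    )
    resp.raise_for_status()
    return resp.json()["count"]


if __name__ == "__main__":
    print("Open escaped bugs (JIRA):", jira_escaped_bug_count())
    print("Open bug tickets (Zendesk):", zendesk_open_bug_tickets())
```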
And so, we designed a very simple tool, a prioritization framework that went along with the health scorecard. The prioritization framework gave teams a decision-making tool. It's incredibly simple. It says that if your health scorecard is red, then that's priority number one, at the very top. Second is your committed roadmap items: what have we committed to our internal or external customers that we're actually going to build? Third is longer-term or strategic projects. After rolling this out along with the health scorecard, I received immediate feedback on how valuable it was, and it was a real learning experience for me. The prioritization framework seems simple and obvious, but it removed a lot of guesswork for teams. They had a tool they could point to for priorities, where they didn't have to go and speak to managers, directors or other teams to try to find out what they should actually be working on. They could point to the prioritization framework and away they'd go.
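The talk only describes the framework at this level of detail, but the decision logic itself is simple enough to sketch in a few lines of code (the names and the red/yellow/green scorecard shape below are assumptions):

```python
# Sketch of the prioritization framework described above; the enum names and
# the red/yellow/green scorecard shape are illustrative assumptions.
from enum import Enum


class Priority(Enum):
    FIX_HEALTH = 1          # any red health-scorecard metric comes first
    COMMITTED_ROADMAP = 2   # commitments made to internal or external customers
    STRATEGIC = 3           # longer-term or strategic projects


def priority_order(health_scorecard: dict[str, str]) -> list[Priority]:
    """Ordered list of where a team should pull its next work from."""
    order = [Priority.COMMITTED_ROADMAP, Priority.STRATEGIC]
    if any(status == "red" for status in health_scorecard.values()):
        order.insert(0, Priority.FIX_HEALTH)
    return order


# Example: a team with a red escaped-bugs metric fixes health before roadmap work.
print(priority_order({"escaped_bugs": "red", "availability": "green"}))
```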
So, the prioritization framework is only as useful as the set of metrics behind that first layer of it, which was the health scorecard. We designed a very simple health scorecard that was visible to everybody and reported on monthly, and at first it was almost completely manual. We got started very simply, with a Google Sheet that drove the reporting, and the health scorecard and the prioritization framework meant that we had data and tools in place to allow the teams themselves to actually prioritize and stay aligned with where we were trying to go as an organization. This step was gold, and we have essentially been iterating on this same concept ever since. I won't go through each of the different metrics here, but I'm certain this will be available after the fact so you can review each of them, and I'm happy to chat about them more afterwards.
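The talk doesn't spell out the scorecard's exact metrics or thresholds, but conceptually each entry is just a metric, a couple of thresholds, and a red/yellow/green status; a hypothetical sketch might look like this:

```python
# Hypothetical shape of a health-scorecard entry; metric names and thresholds
# are made up, not Hootsuite's actual scorecard.
from dataclasses import dataclass


@dataclass
class ScorecardEntry:
    metric: str
    value: float
    green_at: float   # at or below this value the metric is green
    red_at: float     # above this value the metric is red

    @property
    def status(self) -> str:
        if self.value <= self.green_at:
            return "green"
        return "red" if self.value > self.red_at else "yellow"


scorecard = [
    ScorecardEntry("open escaped bugs", value=42, green_at=25, red_at=50),
    ScorecardEntry("P1 incidents this month", value=3, green_at=1, red_at=2),
]
for entry in scorecard:
    print(f"{entry.metric}: {entry.status}")   # yellow, red
```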
So, what was the result? Well, the result of the Quality-a-Thon and then the prioritization framework was a massive decline in the number of escaped bugs month over month. During this time we tried a bunch of other tactics to get better quality, but the most impactful thing that really stuck was this idea of clear metrics, visible scorecards and systems to prioritize and keep teams accountable. These are the tools that helped us change the behavior of our teams in a repeatable way.
January 2019. We were able to speak directly to how we performed in 2018 because we had the OKRs that we set at the beginning of the year, and it was humbling: out of the 30 key results that we designed at the beginning of 2018, we only hit a few, and we came to realize that the OKRs we'd set out weren't actually the most important work for us to be doing. We did, however, see just how important our health scorecard became in the last half of the year. We made amazing progress against our bug numbers and were making some progress against our reliability targets. For us, 2019 would become the year where we would focus on reliability.
So, we set out, along with product and design, to create a system like we had for bugs, but for reliability: think availability, correctness and latency. It would allow us to create the same behavioral change on teams as we had with bugs, but for reliability. We realized that we couldn't centrally manage a set of reliability projects, although there were hundreds on the table to consider. Instead, we wanted every team to own the reliability of their own products in production. We didn't know how, but we knew that we needed a way to encode metrics within a system of accountability that enabled teams to prioritize reliability work against all the other work they could possibly be doing, of which, just like in your day to day, there are a ton of different types.
So enter SLOs, SLIs and Error Budgets. We, like many other organizations, have taken inspiration from the site reliability engineering practices that Google has shared with our industry, and these terms are all from the SRE world, as it's known. I'll briefly define these terms for you, but for a thorough description, I highly recommend reading both of the SRE books. They're available for free online, they're very well written, and they give a ton of insight into how Google does SRE. I'll use one of our actual examples to define them quickly, so you get an idea of how these all work together.
So, first are the SLIs. SLIs are service level indicators, and you can think of these as the actual metric. The important aspect of these metrics is that they're expressed as ratios. In this example, we're saying that the SLI is the proportion of sufficiently fast requests, where sufficiently fast is defined as less than 1.5 seconds. It's a very easily readable language; anybody on a team, not just engineers but program managers, product managers and designers, can all participate in creating these. The SLOs are service level objectives, and they define our targets, or objectives, for any given SLI. On the right, you can see the SLO we actually agreed on: we want our system to respond to 99.9% of requests sufficiently fast. And you can think of the error budget as the inverse of the SLO. We want 99.9% of our responses to be sufficiently fast, which means that during any given 30-day period, we have 0.1% of our traffic to actually play with to try out new stuff. We can be slow on those and that's okay; we've all agreed that we can be slow on those requests.
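To make the arithmetic concrete, here is a small sketch of how that latency SLI and its error budget could be computed from request data. Only the 1.5-second threshold and the 99.9% target come from the example above; the request counts and function names are illustrative:

```python
# Sketch of the latency SLI / SLO / error-budget arithmetic from the example above.
# The request counts are made up; only the 1.5 s threshold and the 99.9% target
# come from the talk.
SLO_TARGET = 0.999        # 99.9% of requests should be sufficiently fast
FAST_THRESHOLD_S = 1.5    # "sufficiently fast" means under 1.5 seconds


def latency_sli(latencies_s: list[float]) -> float:
    """SLI: proportion of requests that were sufficiently fast."""
    fast = sum(1 for latency in latencies_s if latency < FAST_THRESHOLD_S)
    return fast / len(latencies_s)


def error_budget_remaining(total_requests: int, slow_requests: int) -> float:
    """Fraction of the 30-day error budget still unspent (negative means blown)."""
    budget = (1 - SLO_TARGET) * total_requests   # 0.1% of traffic may be slow
    return (budget - slow_requests) / budget


print(latency_sli([0.2, 0.4, 2.1, 0.9]))           # toy sample -> 0.75
# 10,000,000 requests in the window allow 10,000 slow ones; 4,000 slow requests
# leave roughly 60% of the budget.
print(error_budget_remaining(10_000_000, 4_000))   # -> ~0.6
```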
Okay, now that we have those high-level concepts out of the way, let's talk about what actually attracted us to them in the first place, and it's the contract. The thing that caught our attention most was this idea of a contract. The SLIs, the SLOs and the Error Budgets all work together to create a contract that we can all point to. It's another system of prioritization and accountability. Yes, it's based on data, but it also touches on many other aspects of how to shift a culture; it's not just about the data. And most importantly, to the point of this entire conference, SLOs are not an engineering decision tool; they are agreed upon by the design, product and engineering leadership teams together.
Here's an actual example of the top matter of an SLO at Hootsuite, and you can see that there's a list of reviewers and approvers. This list is made up of the product manager, the design lead, the engineering manager and a staff engineer. We generally call that the triad, and the triad actually owns the SLO. They get to decide what the expectations of our customers are, and this allows us to connect the reliability of our systems all the way back to our customers, agreed upon across product, engineering and design.
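The slide itself isn't reproduced here, but structurally such a definition is just a document with owners, an indicator, an objective and an error-budget policy. A hypothetical sketch of that top matter (the field names and values are assumptions, not Hootsuite's actual format) might look like:

```python
# Hypothetical shape of an SLO definition's top matter; every field name and
# value here is illustrative, not Hootsuite's actual document.
slo_definition = {
    "name": "Dashboard load latency",
    "reviewers_and_approvers": {          # the triad that owns the SLO
        "product_manager": "<name>",
        "design_lead": "<name>",
        "engineering_manager": "<name>",
        "staff_engineer": "<name>",
    },
    "sli": "proportion of requests served in under 1.5 s",
    "objective": 0.999,                   # 99.9% over a rolling 30-day window
    "error_budget_policy": (
        "when the 30-day budget is exhausted, pause feature work and "
        "prioritize reliability work until the SLO is met again"
    ),
}
```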
So, at the start of 2019, we had zero of these SLOs defined, and we set a Q1 OKR to have one SLO in production per portfolio. It's not that important, but at Hootsuite our teams are sort of split up into portfolios; usually three to five teams are grouped into a portfolio. At the time, this felt like a ridiculously small goal. I knew that our organization needed to learn how to do this stuff, but I was also pretty sad that we were only committing to building out what was under 10 SLOs. It's a good reminder for me of just how incremental this process feels while you're in it. In hindsight, lots of progress can be made, but while you're in it, it can feel very slow.
So, leaving 2019, we actually had over 120 documented SLOs; we're up closer to 150 now. They had been negotiated across product, design and engineering, and they documented their error budgets and the policies for what we would do if they tripped. This has been incredibly useful for our teams to proactively decide on non-functional requirements and then to prioritize reliability work against all the other work they could have on their plates. Again, it's a tool that helps us systematically change behavior. In tandem, in 2019, we updated our health scorecard, and you can see that SLOs became a part of it, down at the bottom right. We modified some of the different measures that were on our health scorecard and continued to iterate on it throughout the year. This is basically what it looked like; again, I won't go through all the details, but I'm happy to talk about any of these specific measures later. So, what were the results? Well, in 2019, we saw a 56% decrease in the total hours that our system was in a degraded state year over year, and we did this by focusing on clear metrics and scoreboards that supported the behavior change that we all wanted to see across product, design and engineering, to drive the quality of our product for our customers.
So, we now find ourselves in 2020 and we're expanding on these ideas. We actually spent the last half of 2019 building out a three-year product and technology vision for Hootsuite. Hundreds of people contributed and it was a huge effort. We did one of these look-under-every-stone activities where we considered everything, and in the end, we had a really compelling plan for the next few years. We wanted to make it extremely accessible to everyone across all of our teams. Our goal was to have an engineer, designer or product manager be able to read a single page and understand immediately what our long-term technology vision was, and we did just that. We built what we call the technology master plan. The technology master plan is a one-page document, but in it we define 10 different metrics that we're accountable for delivering in the next three years. On the left-hand side, you can see reliability-type metrics; this is the performance and availability of our systems. On the right-hand side, you have more internal metrics that we're holding ourselves accountable to. These 10 metrics, along with the strategy to deliver them, will guide our decision making over the next three years, and I do hope to be able to share more of what's working and not working on this journey as we make progress. Today, just for context, we've seen another 60% decrease in the total hours that our system was in a degraded state year over year. So, all signs are pointing in the right direction that using these different tools is really paying off. The culture has shifted: we've become a more customer-obsessed team and we have better tools to help us prioritize, using data, all the different types of work we need to do on a large system.
When I look back over the past four years, here is a summary of what I think has worked for us: we've done a lot of hard work around prioritization; we've built compelling scorecards; we've created systems of accountability that create alignment and create the opportunity for teams to be empowered to actually deliver on their own missions while being aligned to the organization; and we've measured both health and strategic metrics at the same time. Probably one of the biggest things I've come to realize is that it's not actually about having just one piece of this pie, it's about having the entire pie, and the only way to make progress towards the entire system is to do it very incrementally. It's not dissimilar to how we build software: we have to iteratively shift the culture, quarter after quarter and year after year.
So, I'll end where I began, which is that this is not easy work. Even in telling the story of the past few years at Hootsuite, it puts a bit of a glossy tinge on it and makes it seem like we knew the path more than we actually did. The truth is that we had a vague idea of where we wanted to go, but we didn't have the roadmap to get there. And I'd expect that many of you are on similar but slightly different journeys within your organizations.
My hope is that sharing this has inspired you to iterate on your internal processes and systems to create more direction, alignment and commitment, and, in the end, to do the hard work to shift the culture. Thank you.