Handling Sudden Growth

Continuous Delivery
UXDX APAC 2021

2020 brought us massive growth in usage, but big challenges came along with it (as they say, a good problem to have!).
In this talk, I will share the challenges we encountered and how we responded to them, including the decisions we had to make and the processes we developed.

Hi, everyone. Good day. I hope you're enjoying the talks and workshops here at UXDX so far. Today, I'm going to share our experience at Quipper handling sudden growth.

I am Kristine Joy Paas, and you can reach me by this handle on GitHub and on social media. I'm currently an Engineering Manager at Quipper, and I mostly work with Ruby, Ruby on Rails, JavaScript, Unix systems, and Kubernetes. My hobbies during the pre-COVID era were traveling and swimming, and they helped me relax from the very stressful rigors of the job. Unfortunately, we cannot do those now because of the restrictions, so recently I've been reading a lot and watching series and movies on Netflix. This photo was taken in Busan, South Korea, a couple of years back with my friends. Cats love Busan, and I love cats, by the way, so I hope to go back soon.

My company is Quipper. We are an ed-tech company, and one of the services we offer is a learning management system used by teachers, students, and even parents for education.

Our agenda for today: first, the state of our company in early 2020; then the challenges we had last year; our response; the results of our efforts; the future; and some takeaways from the experience.

In the first couple of months of 2020, we had just finished our restructuring. This was a very big decision for us: we decided to split the Japanese business from the rest of the global businesses. The implication for the product was that we had to do a code split, meaning we made a copy of the code base and basically went our own way from there. As part of the global product team, our goal was to develop features for the global markets, and we had the product roadmap ready for this.

Why did we have to make this big decision? Mainly because of the difference in academic calendars. What makes the ed-tech business complicated, especially when your product is used by schools, is that you cannot release very big changes to the platform while the school year is ongoing. The different academic calendars in different countries make that a lot more complicated, and that affects our business sustainability. That's one of the main reasons we had to do this.

But then, in March 2020 (and I think this affected every one of us in the world), the COVID situation worsened, and many countries had to go into lockdown, including the Philippines and Indonesia, where we have our operations. We had to switch to a work-from-home setup, and everything had to be done online. This was not a big change for the product team, as we already worked from home once in a while, but for the other departments and our clients it was a very big change.

In effect, our users also had to adapt to the situation. Before, our services were mostly used as a supplemental service: things like assignments given to students to answer after class. Now, they use our services for synchronous education. And where teachers previously only sent content prepared by Quipper to their students, now they also create their own, for example exams. For departmental exams, they have to align the contents of the exam with what they actually teach their students, so they have to customize the content, which means they have to create their own content on our platforms.
To illustrate this change in behavior, let's look at the usage patterns in 2019 and 2020. For the teachers, there's really not much change. There were a few spikes here and there, and in 2020 there's a drop which I assume is lunchtime, but the trend is basically similar. However, if you look at the students' pattern, there's a really big difference. In 2019, students were more active in the evening, which supports the behavior of answering their assignments on Quipper after their classes. Come 2020, the students' usage pattern closely mirrors the teachers': they are using it at the same time, synchronously.

We also observed a change in the number of classes. The top graph is for teachers and the bottom one is for students. The lockdown started around the middle of March, and as you can see, almost overnight we had a drastic increase in usage. It dwindled around May, but the Philippine school year ends around April or May, so this change was expected. For the students it was the same pattern, though it looks very small compared to the teachers'. That's because at this time only the existing clients were using the platform, finishing their contracts for the year. Around August and September, new clients came in and started using more of our services.

These changes brought a lot of challenges, and they had effects on our product. The first is that there was higher demand for features to facilitate online education. At the start of the year, after the code split, we had a product roadmap ready, but because of this change we had to revisit that roadmap, reprioritize, and check which items to prioritize to match the demand from teachers and students.

The second is that there was higher traffic. As explained before, it's not just because of the increase in the number of users; they are also spending a lot more time on our platforms. For example, where teachers were previously just sending assignments with content prepared by Quipper, now they also have to author their own content, so in turn they have to use our systems for a longer time.

The third is that more bugs were discovered with increased usage, because users were now covering a lot more use cases. Frankly, some of these were things we didn't design for, but we had to fix them, because our users were adapting and we had to adapt to how they use our systems too. And since the usage is synchronous, we also had to address these issues with higher urgency, because if users experience an issue while they are holding class, it disrupts the entire flow of the class.

At the same time, the new school year started around August. Some of the private schools started in August, but the public schools, which have a much larger population, were scheduled to start in October. However, in September we already had incidents in which it was hard to use our platforms. The challenge for us was to address this: from September until October, we had less than a month to make the system stable. The main symptoms during those incidents were, first, slow access on some pages, which eventually led to errors, and second, delayed delivery of activities to students.
For the second symptom, imagine a situation where you have a class from 9:00 AM to 11:00 AM and the teacher sends a quiz to the students. When the teacher sends the quiz, the students expect to receive it right away, but it was taking a long time, and in some cases they were not able to receive it within the scheduled class. It was a really big disruption to their class.

What we found caused this: first, there was slow performance on the DB, so we had to optimize it further. There was also no CDN (content delivery network) set up on some platforms, so we had to set one up. Next, the auto-scaling setup didn't work because it was based on assumptions that were no longer true, so we improved the scaling based on the observed patterns in the data, taking the peak times into account.

Some more details on our response. This was our reaction: everyone was really panicking, because we were getting a lot of feedback from users. We felt we were not ready for this, and we didn't expect to grow as fast. If your business grows too fast, the problems grow too fast as well, so we really had to do something about it.

On the platform side, we started what we call the platform optimization efforts. We had to do this because we realized that by focusing on developing features, we had paid less attention to scalability, and in doing so we had also accumulated technical debt, which had now started to hurt. And lastly, new features are useless, no matter how flashy, if users cannot use our core features. This led to our decision to put a freeze on developing new features and focus on improving our systems in the meantime.

I was chosen to lead the effort, and we broke it down into three main focus points. As an engineering manager, my main focus was on the technical part, but we also had to maintain communications and operations.

For the technical part, our purpose was to prevent these incidents from happening again, and we had to do a lot of things. First, we cleaned up code and removed unnecessary requests: we observed that our web clients and some mobile clients were sending requests to the API that only added load to our systems, while the client apps didn't even use the returned data, so we removed those. Next, we improved performance with various techniques such as caching, N+1 query fixes, algorithm improvements, and others. Next, we studied our data and set up auto-scaling and scheduled scaling based on it, and we also made sure the apps had enough resources so they wouldn't crash. And lastly, we had to improve the database for higher capacity. We had to do all of this within a month. To be honest, it wasn't finished and is still an ongoing effort, but we did what we could with the time given to us.
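To make the caching and N+1 point a bit more concrete, here is a minimal, hypothetical sketch of the kind of change involved, assuming a typical Rails endpoint. The controller, model, and attribute names are illustrative only, not our actual code.

```ruby
class AssignmentsController < ApplicationController
  # Hypothetical example: eager-load the association to avoid an N+1 query,
  # and cache the serialized payload briefly so repeated requests during a
  # live class hit the cache instead of the database.
  def index
    payload = Rails.cache.fetch(["assignments", params[:class_id]], expires_in: 5.minutes) do
      # includes(:teacher) loads all teachers in one query instead of one
      # query per assignment (the N+1 problem).
      Assignment.where(class_id: params[:class_id])
                .includes(:teacher)
                .map { |a| a.as_json(only: [:id, :title, :due_at]).merge("teacher" => a.teacher.name) }
    end
    render json: payload
  end
end
```

Whether caching like this is safe depends on how quickly the underlying data changes; for content that teachers edit during class, a short TTL or explicit cache invalidation on update would be needed.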
The second focus is communication. The goal there was to keep everyone on the same page. First, before we proceeded with the technical work, we needed to agree on the timeline with everyone: product, business, and support. At the same time, we also communicated to the users what we were doing, so they were assured that we were working to improve their experience.

Third, we set up a weekly check-in with the platform teams so we could maintain cross-platform communication while we were changing a lot of code. What had been happening before was that the platform teams were doing their own thing, and most of the time communication was forgotten.

Fourth, we compiled the common issues encountered so that the support officers could respond to users faster. These issues take a lot of time to investigate; some took an entire day, or sometimes even multiple days, and since our focus was on improving the system, that time was better spent there. We did this so the devs could focus on enhancing our systems.

Lastly, we created an incident management flow so that everyone is up to date on the status of any incident that happens. Basically, we have a single source of truth when there's an incident, so it's not chaotic with people communicating on different channels.

For operations, our goal was to recover from incidents as quickly as possible. The main thing we did was have developers on standby on a rotation basis from 6:30 AM, because that's when the teachers start to get active. This was a big sacrifice for the devs, especially because we usually start work at 9:00 AM or 10:00 AM Philippine time, so starting at 6:30 is really, really early for most of us.

So, what were the results? First, there were far fewer crashes in our applications. The red bars here are the error occurrences. When we had incidents in September, as you can see, it was crashing a lot, but before October came we were able to stabilize the system, and when the big schools came in in October, it was already stable.

Second, we improved the overall response time. This is a result of everything we did to optimize the system. It has been on a downtrend since we started, and we are pushing to maintain that trend; it's a continuous improvement.

Next, there were no major incidents despite increasing usage. We were actually crossing our fingers when October came, hoping the same thing wouldn't happen, and thankfully it didn't, at least not to the extent we had in September. Our scaling system worked well with the new configurations because it was based on data from the current situation (I'll sketch roughly what I mean by that at the end of these results).

Another result is the cross-platform communication. The platform teams like the weekly updates, because aside from giving the other teams a heads-up on what they are doing, they also learn from each other and adapt some processes to their own teams.

And fourth, we raised awareness of how developments in different departments affect each other. It's not just the product team educating business and support on how we do things; we also learn from them how they respond to a crisis.

So, it was really a big win for us, and we managed the traffic at peak times, but the war is not yet over. It's an ongoing improvement.
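As promised, here is a rough, hypothetical sketch of what I mean by deriving a scaling schedule from observed data. The request numbers and the per-replica capacity below are invented for illustration, and our real setup lives in our infrastructure configuration rather than in a script like this; the point is only that the schedule comes from measured peak times, not from old assumptions.

```ruby
# Toy example: turn observed requests per hour into a scheduled minimum
# replica count, with headroom, so capacity is in place before the peak.
REQUESTS_PER_REPLICA_PER_HOUR = 50_000 # assumed capacity of one app replica

observed_hourly_requests = {
  5 => 20_000, 6 => 180_000, 7 => 320_000, 8 => 400_000,
  9 => 380_000, 12 => 150_000, 15 => 260_000, 20 => 60_000
}

schedule = observed_hourly_requests.map do |hour, requests|
  # Provision for the observed load plus ~30% headroom, never below 2 replicas.
  replicas = [(requests * 1.3 / REQUESTS_PER_REPLICA_PER_HOUR).ceil, 2].max
  [hour, replicas]
end.to_h

schedule.each { |hour, replicas| puts format("%02d:00 -> at least %d replicas", hour, replicas) }
```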
Actually, in October there was still a small issue. There were degradations at some points of the day, and we found that we had hit our limit of concurrent usage: so many students and teachers were using the platform at that exact moment that our system's capacity was reached. We had to do something to increase that capacity. It has already been increased, but we are still improving it further, because if we want to grow our business, we have to allow more people to access our systems at the same time.

Another thing we need to do is establish workflows and processes. Some of this we had to figure out on our feet to solve the problems, but moving forward that's not enough; we have to establish proper workflows and processes.

The second is to improve our incident response process. The first time we used the incident management flow we developed, it was a bit chaotic and people were still confused. We found some points of improvement, and the next time we have to use it (hopefully we won't), I hope it will be more organized.

The next is to switch from being reactive to proactive. We survived the crisis because of the efforts we made, but it was a very reactive response, because we waited for things to happen before thinking of ways to solve them. In the future, we would like to be more proactive, in the sense that we will anticipate things that can go wrong and already have plans in place to resolve them when they happen.

And fourth, we want to make changes at an organizational level from the lessons learned. As I said earlier, we learned from different teams how they handle a crisis. That opened the communication lines, not just within the product team but with other departments, and we want to improve our systems based on the things we've learned from each other.

To summarize everything, here are some big takeaways from the experience.

First, data is your friend. Always analyze the data to make sound decisions, especially when you're in unfamiliar territory like we were; no one really had the exact same experience. It's like we were blind, but checking and analyzing the data let us see what was really happening, and that helped us prioritize and decide on things more confidently.

Second, do not procrastinate on technical issues. They will come back to bite you when you are less prepared, so don't wait for them to snowball. We're a young company, so growth is very important, and many startups also focus on growth, but we shouldn't forget this, because later on, when we get the big users that we want, it's going to bite us back. So don't procrastinate, and at least allocate some resources to paying back the technical debt.

Third, we can prepare all we want. We can have roadmaps, different workflows, processes, and plans, but there will always be situations we are not prepared for, and we will be caught off guard. So we must learn to think on our feet and, basically, learn to adapt.

Fourth, be decisive for the users. At Quipper, one of our values is "user first", so we think about what is good for the users. In the company we like order, but sometimes cascading decisions and discussions through that process can be costly. In our situation, we had less than a month to solve the problems; if we had taken one or two weeks just to decide, that's time we could have used to solve a lot of issues. So we had to be decisive, make the call, and own whatever the result was. This is not always applicable, but in our experience this was one of those situations in which you need to decide quickly.

I hope you have learned a lot from the experiences I've shared. This is not the only challenge we face, and as I said earlier, we are continuously learning and continuously improving. If you want to improve and grow with us, be a distributor of wisdom and join us. We are looking for members of our product team in Japan, Indonesia, and the Philippines. If you are interested, please visit our career page at career.quipper.com. Thank you, everyone, and I hope you've learned a lot.