Yesterday we had an unscheduled outage of Customer.io that lasted for 11 hours.
For those 11 hours, we failed you, our customers. I’m sorry for the downtime and the delays in sending out your emails.
The service is back up and fully functional. Your emails are sending again. There’s a post-mortem with technical details.
Beyond the technical description, I wanted to offer you some insight into our business.
The requirements of an infrastructure business
We made a decision to go into a business that requires us to be always on. When you implement Customer.io, we become part of your infrastructure.
We never want the service to be down.
Downtime is a poison pill for our business and we spend a lot of time working to avoid it.
John, our CTO, had been working for a couple of months on a solution to the underlying issue responsible for the outage. He was a few days away from rolling it out to production for our account. Imagine running a marathon and getting punched in the stomach a few yards from the finish line. The timing was unfortunate.
We can’t promise that we’ll never go down, but we’re making tremendous progress toward making sure that doesn’t happen.
If it does, my commitment to you is that we’ll be transparent and open with you about what happened and what we’re doing to fix it.
In the event of an issue, we post updates on our status page. If you aren’t yet subscribed, I’d encourage you to subscribe there.
How we’re scaling the back end
As we’ve grown the business, keeping up with your data has been a challenge.
We’ve scaled from 0 to 50 million emails a month, 150 million messages processed per day, and many terabytes of analytics data.
We’ve invented queueing infrastructure to make sure other people on the system don’t block your work from getting done.
We’re in the middle of our hiring process for a senior distributed systems engineer who will join John to work on problems like the one that caused the outage. We have been blown away by how a global hiring process can give you access to some extremely talented people. We plan to make this hire within the next 2-4 weeks.
Product improvements are around the corner
We’re making dramatic improvements to the experience of using the Customer.io interface with a ground-up rebuild. Candidly, this has taken too long, but we’re nearing completion.
Michael and Henry are working hard to get the new release out. We’re targeting a release that you can use alongside the current product at the end of the month.
At the beginning of the week, I switched to the new app to help iron out the kinks.
Looking to the future
John and I have always thought about Customer.io as a business we want to build and grow over 10 years. I think of yesterday’s outage as growing pains - a small blip as we build a stable, long-lasting company to serve you.
Over the next few weeks, we’ll be sharing more with you about our progress in reducing the chance of an outage.
In the meantime, if there’s anything else we can do, feel free to email me at email@example.com. I’m happy to address any concerns or answer any questions you have about the outage or your service.