Lessons learned growing API traffic by 1,900%
November 22, 2023
Our API traffic at Databento grew a staggering 1,900+% between June and August 2023! We handled this exponential growth entirely on self-hosted infrastructure, without Kubernetes clusters or sophisticated cloud-based autoscaling. Here are the lessons we learned along the way:
- Keep it simple. In the words of Dan McKinley, "Choose boring technology." We have a simple Docker Swarm cluster and MySQL database doing a lot of heavy lifting. It's easier to invest developer time to scale and optimize your stack when it's simple tech that everyone's familiar with.
- Take control of your traffic by open sourcing and maintaining high-quality, well-documented official client libraries. Some companies allocate substantial resources and have over a hundred engineers working on their public API. Paradoxically, most end up with a limited number of unofficial client libraries on GitHub that are often poorly implemented. Having your own client libraries makes it easier to ensure your customers are handling compression, pagination, rate limits, polling, etc., efficiently.
- Going from clustering and redundancy to high availability takes a monumental effort. Sadly, most of our downtime this quarter came from systems we designed to be redundant, e.g., a Galera cluster.
- It takes about ten times more marginal effort for a clustered system to achieve the same uptime as a monolithic system. Anyone who's spent a long time in DevOps or systems can tell you stories of a single ZFS filer, monolithic web server, or bare-metal host staying up for years. We've seen far more cases of catastrophic failure in a poorly configured Gluster cluster, an HA Traefik ingress, you name it.
- Strobe your own APIs and report uptime transparently. We're the single biggest user of our own APIs, through all of our alerting and monitoring tools.
- Optimizing can be easier than scaling. Several optimizations make our APIs blazing fast and efficient: zero-copy transfer through sendfile(2), zero-copy messaging, kernel bypass, and application-level zstd compression instead of gzip. Some of these took just 4 lines of code, a few $300 NICs, and a few lines of configuration, which is much cheaper than setting up an in-house Kubernetes cluster and scaling horizontally.
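To make the client-library point concrete, here is a minimal Python sketch of the kind of logic an official client can bake in so customers don't have to reinvent it: walking a cursor-paginated endpoint and backing off when rate limited. It is not Databento's client. `iter_records`, `RateLimited`, and the `fetch_page(cursor) -> (records, next_cursor)` contract are hypothetical names assumed for illustration, and the transport (HTTP, auth, decompression) is left to the injected fetcher.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple


class RateLimited(Exception):
    """Raised by a page fetcher when the server replies 429 Too Many Requests."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limited; retry after {retry_after}s")
        self.retry_after = retry_after


# Hypothetical contract: fetch_page(cursor) -> (records, next_cursor).
# cursor is None for the first page; next_cursor is None after the last page.
PageFetcher = Callable[[Optional[str]], Tuple[List[dict], Optional[str]]]


def _fetch_with_retry(fetch_page: PageFetcher, cursor: Optional[str],
                      max_retries: int, sleep: Callable[[float], None]):
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(cursor)
        except RateLimited as exc:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Honor the server's hint, but never wait less than an exponential floor.
            sleep(max(exc.retry_after, 0.5 * 2 ** attempt))
    raise AssertionError("unreachable")


def iter_records(fetch_page: PageFetcher, max_retries: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield every record across all pages, retrying rate-limited requests."""
    cursor: Optional[str] = None
    while True:
        records, next_cursor = _fetch_with_retry(fetch_page, cursor, max_retries, sleep)
        yield from records
        if next_cursor is None:
            return
        cursor = next_cursor
```

Because the sleep function is injectable, the retry policy can be unit-tested without actually waiting. In a real client, `fetch_page` would wrap the HTTP call and translate a 429 response (and its `Retry-After` header) into `RateLimited`.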
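Strobing your own API can start as simply as a loop of synthetic requests. The sketch below assumes a health-check URL that returns 2xx when the service is up; `http_check`, `strobe`, and the example URL are hypothetical, and a production setup would feed each result into alerting and an uptime report rather than just returning a percentage.

```python
import time
import urllib.request
from typing import Callable, List


def http_check(url: str, timeout: float = 5.0) -> bool:
    """One synthetic probe: True if the endpoint answers 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), and socket timeouts are all OSError subclasses.
        return False


def strobe(check: Callable[[], bool], probes: int, interval_s: float,
           sleep: Callable[[float], None] = time.sleep) -> float:
    """Run `probes` checks spaced `interval_s` apart and return uptime in percent."""
    if probes <= 0:
        raise ValueError("probes must be positive")
    results: List[bool] = []
    for i in range(probes):
        results.append(check())
        if i < probes - 1:  # no need to wait after the final probe
            sleep(interval_s)
    return 100.0 * sum(results) / probes
```

For example, `strobe(lambda: http_check("https://api.example.com/health"), probes=60, interval_s=60.0)` would sample a hypothetical endpoint once a minute for an hour. Keeping `check` as a plain callable makes it easy to probe other paths, such as an authenticated historical query, with the same loop.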
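The sendfile(2) optimization is easy to illustrate. Python's `os.sendfile` wraps the system call, letting the kernel move file data straight to a socket instead of reading it into a user-space buffer and writing it back out. This is a minimal sketch assuming Linux (or another Unix that provides `os.sendfile`) and an already-connected stream socket; `send_file_zero_copy` is a hypothetical helper, not Databento's code.

```python
import os
import socket


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Stream a file to a connected socket via os.sendfile (sendfile(2)).

    The kernel transfers the data from the file to the socket directly,
    so the payload never passes through a user-space buffer in this process.
    """
    sent_total = 0
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while sent_total < size:
            # os.sendfile(out_fd, in_fd, offset, count) returns bytes sent,
            # which may be fewer than requested, so keep looping.
            sent = os.sendfile(sock.fileno(), f.fileno(), sent_total, size - sent_total)
            if sent == 0:  # file shrank underneath us; stop instead of spinning
                break
            sent_total += sent
    return sent_total
```

The loop matters because sendfile(2) may transfer fewer bytes than requested. Python's higher-level `socket.socket.sendfile()` does the same job and falls back to plain `send()` where the system call isn't available.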