Public Notes on
View Public Collections

Highlights

A straightforward way to get our availability number is to add a bunch of these machines/nodes into a cluster.
The two most common scaling strategies are vertical or horizontal scaling.
For many systems, the four nines availability (99.99%, or about 50 minutes downtime per year) is considered high availability.
Systems usually have a lot of noisy requests, hence the p95 and p99 latencies are more practical usage in the real world.
To make sure we build the right thing, a system that is "better" than it's predecessor, we used SLAs to define expectations.
Horizontal scaling is about adding more machines (or nodes) to the system, to increase capacity. Horizontal scaling is the most popular way to scale distributed systems, especially, as adding (virtual) machines to a cluster is often as easy as a click of a button.
Vertical scaling is basically "buying a bigger/stronger machine" - either a (virtual) machine with more cores, more processing, more memory. With distributed systems, vertically scaling is usually less popular as it can be more costly than scaling horizontally.
Before diving into planning a system, I have found the most important thing to decide what a system that is "healthy" means. "Healthy" should be something that is actually measurable. The common way to measure "healthy" is with SLAs: service level agreements.