It starts with a simple question – if you are building a website, how do you count the number of visitors for the past 1 minute?
The “Design hit counter” problem has recently been asked by many companies, including Dropbox, and it is harder than it seems. This week, we’ll uncover all the mysteries of the problem and discuss a couple of topics, including basic data structure design, various optimizations, concurrency and distributed counters.
What’s special about this problem?
I always like to tell our readers why we select a question to analyze, so that you’ll know exactly whether it’s worth your time to read. As an interviewer, I have a strong preference for questions that are not hard to solve in the simplest case, but whose discussion can go deeper and deeper as specific conditions are removed or added. This question is exactly that kind.
Also, the question doesn’t come from nowhere; it has real use cases. Many systems today need to track not only visitor counts, but also the counts of different types of requests, in real time.
If you haven’t thought about this problem, spend some time working on it before reading the following sections.
Simple case
Forget about the hard problems like concurrency and scalability for now. Let’s say we only have a single machine with no concurrent requests: how would you get the number of visitors for the past minute?
The simplest solution is to store every visitor with a timestamp in the database. When someone asks for the visitor number of the past minute, we just go over the database, filter, and count. A small optimization is to order records by timestamp so that we don’t scan the whole table.
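As a rough sketch of this naive approach (SQLite is used here purely for illustration, and the table and column names are made up):

import sqlite3
import time

conn = sqlite3.connect("hits.db")
conn.execute("create table if not exists visits (user_id text, ts integer)")

def record_visit(user_id):
    # store every visit with its Unix timestamp
    conn.execute("insert into visits values (?, ?)", (user_id, int(time.time())))
    conn.commit()

def visitors_last_minute():
    # filter and count: cost grows with the number of visits scanned
    cutoff = int(time.time()) - 60
    return conn.execute("select count(*) from visits where ts >= ?",
                        (cutoff,)).fetchone()[0]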
The solution is not efficient, as the time complexity is O(N) where N is the number of visitors. If the website has a large volume of traffic, the function won’t be able to return the number immediately.
Let’s optimize
There are a couple of ways to think about this problem. The above approach can return not only the visitor number, but also the actual visitors from the past minute, which the question doesn’t require, so this is something we can optimize. From a different angle, we only need the number for the past minute rather than for an arbitrary time range, which is another area we can potentially improve. In a nutshell, by removing unnecessary features, we can optimize our solution.
A straightforward idea is to keep only the users from the past minute; as time passes, we keep updating the list and its length. This allows us to get the number instantly. In essence, we reduce the cost of fetching the number, but have to keep the list up to date.
We can use a queue or linked list to store only the users from the past minute. We keep all the elements in order, and when the last user (the earliest user) is more than a minute old, we just remove it from the list and update the length.
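Here is a minimal sketch of that idea in Python. The class and method names are my own, and a deque plays the role of the queue/linked list:

from collections import deque
import time

class HitCounter:
    def __init__(self, window=60):
        self.window = window          # size of the sliding window in seconds
        self.visits = deque()         # (timestamp, user) pairs, oldest at the head

    def record(self, user):
        self._evict()
        self.visits.append((time.time(), user))

    def count(self):
        self._evict()
        return len(self.visits)       # len() on a deque is O(1)

    def _evict(self):
        # drop entries older than the window from the head of the queue
        cutoff = time.time() - self.window
        while self.visits and self.visits[0][0] < cutoff:
            self.visits.popleft()

Note that the eviction work is amortized: each visit is appended once and removed at most once.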
Space optimization
There’s little room to improve the speed as we can return the visitor number in O(1) time. However, storing all the users from the past minute can be costly in terms of space. A simple optimization is to keep only the user timestamp in the list rather than the user object, which can save a lot of space, especially when the user object is large.
If we want to further reduce the space usage, what approach would you take?
A good way to think about this is to ask what we should sacrifice to improve the space complexity. Since we still want to keep the time complexity O(1), one thing we can compromise on is accuracy. If we don’t have to return the most accurate number, can we use less space?
Instead of tracking users from the past minute, we can track only users from the past second. By doing this, we know exactly how many visitors came in the last second. To get the visitor number for the past minute, we keep a queue/linked list of 60 spots representing the past 60 seconds. Each spot stores the visitor number of that second. So every second, we remove the last (the earliest) spot from the list and add a new one with the visitor number of the past second. The visitor number of the past minute is the sum of the 60 spots.
The minute count can be off by the requests of the most recent second, and you can control the trade-off between accuracy and space by adjusting the unit, e.g. you can store users from the past 2 seconds per spot and keep 30 spots in the list.
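As a hedged sketch (again with my own names), one way to implement the 60-spot idea is a fixed array of per-second buckets that get reused as time wraps around, rather than literally shifting a queue every second:

import time

class BucketedHitCounter:
    """Approximate counter: 60 one-second buckets covering the past minute."""

    def __init__(self, buckets=60, bucket_seconds=1):
        self.bucket_seconds = bucket_seconds
        self.counts = [0] * buckets    # visitor number per bucket
        self.stamps = [0] * buckets    # which second each bucket currently holds

    def _slot(self, now_bucket):
        i = now_bucket % len(self.counts)
        if self.stamps[i] != now_bucket:   # slot holds data from an old cycle; reset it
            self.stamps[i] = now_bucket
            self.counts[i] = 0
        return i

    def record(self):
        self.counts[self._slot(int(time.time()) // self.bucket_seconds)] += 1

    def count(self):
        now_bucket = int(time.time()) // self.bucket_seconds
        # sum only buckets whose stamp falls inside the past minute
        return sum(c for c, s in zip(self.counts, self.stamps)
                   if now_bucket - s < len(self.counts))

The space usage is now constant (120 integers) no matter how much traffic the site gets, at the cost of the one-second inaccuracy described above.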
How about concurrent requests?
In production systems, concurrency is one of the most common problems people face. If multiple users can visit the site simultaneously, does the previous approach still work?
Partially. The basic idea still holds. However, when two requests update the list simultaneously, there can be a race condition: it’s possible that the request that updated the list first won’t be included in the end.
The most common solution is to use a lock to protect the list. Whenever someone wants to update the list (by either adding new elements or removing the tail), a lock will be placed on the container. After the operation finishes, the list will be unlocked.
This works pretty well when you don’t have a large volume of requests or performance is not a concern. Acquiring the lock can be costly, and when there are too many concurrent requests, the lock may block the system and become the performance bottleneck.
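For illustration, here is how the earlier HitCounter sketch could be wrapped with a standard-library lock (the subclass is my own, not a prescribed design):

import threading

class ThreadSafeHitCounter(HitCounter):
    def __init__(self, window=60):
        super().__init__(window)
        self.lock = threading.Lock()

    def record(self, user):
        with self.lock:               # only one thread may touch the list at a time
            super().record(user)

    def count(self):
        with self.lock:
            return super().count()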
Distribute the counter
When a single machine gets too much traffic and performance becomes an issue, it’s the perfect time to think of a distributed solution. Distributing the counter significantly reduces the burden on a single machine by scaling the system to multiple nodes, but at the same time adds complexity.
Let’s say we distribute visit requests to multiple machines equally. I’d like to emphasize the importance of equal distribution first. If particular machines get much more traffic than the rest, the system isn’t used to its full capacity, and it’s very important to take this into consideration when designing the system. In our case, we can take a hash of the user’s email and distribute requests by that hash (it’s not a good idea to use the email directly, as some letters may appear much more frequently than others).
To count the number, each machine works independently to count its own users from the past minute. When we request the global number, we just need to add all counters together.
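A toy sketch of the idea, reusing the HitCounter class from earlier and modeling each “machine” as an in-process shard (a real system would route over the network):

import hashlib

class ShardedHitCounter:
    def __init__(self, num_shards=4, window=60):
        self.shards = [HitCounter(window) for _ in range(num_shards)]

    def _shard_for(self, email):
        # hash the email so traffic spreads evenly across shards
        digest = hashlib.md5(email.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def record(self, email):
        self._shard_for(email).record(email)

    def count(self):
        # the global number is simply the sum of the per-shard counters
        return sum(shard.count() for shard in self.shards)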
Summary
One of the reasons I like this question is that the simplest solution can serve as a coding question, and once you tackle concurrency and scalability, it becomes a system design question. Also, the question itself is widely applicable in production systems.
Again, the solution itself is not the most important thing in this post. What we focus on is illustrating how to analyze the problem. For instance, the trade-off is a great concept to be familiar with: when we try to optimize one area, think about what else has to be sacrificed. Thinking like this opens up a lot of doors for you.
Another approach:
Scaling tends to make even simple things, like counting, seem difficult. In the past, businesses used specialized databases for particular tasks, including high-speed, high-throughput event counters. Due to the constraints of legacy systems, some people still assume that relational databases cannot handle high-throughput tasks at scale. However, due to advances like in-memory storage, high-throughput counting no longer requires a specialized, single-purpose database.
Why do we even need counters?
Before we get into the implementation, you might be asking why we need counters at all. Why not just collect event logs and compute counts as needed?
In short, querying a counter is much faster than counting log records, and many applications require instant access to this kind of data. Counting logs requires a large table scan and an aggregation to produce a count; with an updatable counter, it is a single record lookup. The hard part of high-throughput counters is that building a stateful, fault-tolerant distributed system is challenging. Fortunately, MemSQL solves those hard problems for you, so you can focus on building your application.
In the rest of this article we’ll design a simple robust counter database running on a modest MemSQL cluster, and benchmark how it performs.
Counters are records
Let’s start by creating the following schema:
create database test;
use test;

create table counters_60 (
    time_bucket int unsigned not null,
    event_type int unsigned not null,
    counter int unsigned not null,
    primary key (time_bucket, event_type)
);

create table event_types (
    event_type int unsigned not null primary key,
    event_name varchar(128),
    owner varchar(64),
    status enum ('active', 'inactive')
);
The column time_bucket is the event’s timestamp bucketed by minute (the Unix timestamp divided by 60). Making (time_bucket, event_type) the primary key allows us to easily index events by time and type.
insert into counters_60
select unix_timestamp() / 60, 1234, 1
on duplicate key update counter = counter + 1;
If a primary key value does not exist, this query will insert a new record into MemSQL. If the primary key value exists, the counter will be incremented. This is informally called an “upsert.” The management of event_types is outside the scope of this article, but it’s trivial (and fast) to join the counter table to a table containing event metadata such as its human-friendly name.
Let’s also insert some data into the event_types table:
insert into event_types values (1234, 'party', 'memsql', 'active');
Querying Counters
Now you have the counts of each event type bucketed by minute. This counter data can easily be aggregated and summarized with simple SQL queries:
-- all-time historical counts of various event types
select e.event_type, e.event_name, sum(c.counter)
from counters_60 c, event_types e
where c.event_type = e.event_type
  and e.event_type in (1234, 4567, 7890)
group by 1, 2;

-- total number of events in the last hour
select sum(counter), sum(counter)/60 as 'avg per min'
from counters_60
where event_type = 1234
  and time_bucket >= unix_timestamp() / 60 - 60;

-- total number of events in time series, bucketed in 10-minute intervals
select floor((unix_timestamp()/60 - time_bucket)/10) as interval, sum(counter)
from counters_60
where event_type = 1234
  and time_bucket >= unix_timestamp() / 60 - 60
group by 1;
1.6 Million increments per second
Inserting naively into the counters table, one record at a time, actually gets you pretty far. In our testing this resulted in a throughput of 200,000 increments per second. It’s nice to get impressive performance by default. Then we tried to see how much farther we could go.
In this simulation we processed 1,000 different event types. We created a threaded Python script to push as many increments per second as possible. We made three changes to the naive version: multi-insert batches, disabling cluster-wide transactions, and sorting the records in each batch to avoid deadlocking.
insert into counters_60 values
    (23768675, 1234, 1),
    (23768675, 4567, 1),
    (23768675, 7890, 1),
    ...
on duplicate key update counter = counter + 1;
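The load-generation script itself isn’t shown here; below is a hedged sketch of the batching and sorting logic, assuming the pymysql driver, a hypothetical aggregator hostname, and the counters_60 schema above (the transaction setting is omitted):

import random
import time
import pymysql  # any MySQL-protocol driver works; pymysql is an assumption here

conn = pymysql.connect(host="aggregator-host", user="root", password="", db="test")

def increment_batch(events, batch_size=500):
    """events: list of (time_bucket, event_type) tuples, each incremented by 1."""
    for start in range(0, len(events), batch_size):
        # sort each batch so concurrent writers touch rows in the same order,
        # which helps avoid deadlocks between multi-row upserts
        batch = sorted(events[start:start + batch_size])
        values = ",".join("(%d, %d, 1)" % (tb, et) for tb, et in batch)
        with conn.cursor() as cur:
            cur.execute("insert into counters_60 values " + values +
                        " on duplicate key update counter = counter + 1")
        conn.commit()

# example: a few thousand increments spread over 1,000 event types
bucket = int(time.time()) // 60
increment_batch([(bucket, random.randint(1, 1000)) for _ in range(5000)])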
We used a 6-node AWS cluster with 2 aggregators and 4 leaves to simulate the workload. Each node was an m3.2xlarge with 8 cores and 15 GB of RAM, for an hourly cost of $2.61 for the entire cluster. Running this script on both aggregator nodes, we achieved a throughput of 1.6M upserts per second.
Data Collection
In this simulation we used a Python script to simulate the data ingest. In the real world, we see our customers use technologies like Storm, Kafka, and Spark to collect events in a distributed system for higher throughput. For more information on MemSQL integration with stream processing engines, see how Pinterest uses MemSQL and Spark Streaming to track real-time event data.