Ever since Server Check.in started using multiple, geographically-distributed servers running a small Node.js app to perform server checks, we've been monitoring the number of checks per server on an hourly basis, and calculating the standard deviation based on the numbers for each server.
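A minimal sketch of that kind of calculation, not the actual monitoring code: the hypothetical `spread` function below reads one `server count` pair per line and prints the mean, the population standard deviation, and the deviation as a percentage of the mean. The input format and the choice of population (rather than sample) deviation are assumptions made for illustration.

```shell
#!/bin/bash
# spread: read "server count" lines on stdin and print the mean, the
# population standard deviation, and the deviation as a percent of the mean.
# (Hypothetical helper; the input format is an assumption.)
spread() {
  awk '{ n++; sum += $2; sumsq += $2 * $2 }
       END {
         if (n == 0) exit 1
         mean = sum / n
         var = sumsq / n - mean * mean
         if (var < 0) var = 0   # guard against floating-point rounding
         sd = sqrt(var)
         printf "mean=%.1f sd=%.1f pct=%.1f\n", mean, sd, (mean ? 100 * sd / mean : 0)
       }'
}
```

Feeding it an hour's worth of counts, for example `printf 'atlanta 120\ndallas 95\n' | spread` with made-up numbers, gives a single figure that can be compared from hour to hour.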
We noticed an alarming trend: some servers were checking more than 40% more servers than others! The main queue server schedules checks using timestamps, so we tried introducing a little jitter by adding +/- 10 seconds to the timestamps every time they were checked. Unfortunately, this did nothing at all to spread the checks among the different servers.
Then we noticed something peculiar: the main queue server is in New York, and the server with the greatest consistent number of checks was the server geographically closest, in Atlanta. Then came the server in Dallas, then one in Seattle, etc. until we reached the stragglers, all located in Europe.
We finally realized the problem: all our servers are synchronized via NTP to within a few ms of each other, and our custom Node.js app queries the master queue server every minute, at the beginning of the minute, for a list of servers to be checked. Because some servers were geographically closer than others, their requests almost always arrived a few ms sooner than others. Because of this, the closer the server was to the master, the more likely it would get a full chunk of servers to check every minute.
Servers that were further away and had slower ping times (~70 ms vs. ~20 ms for the closest server) were more likely to be cleaning up the tail end of the list of servers to be checked in a given minute.
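The effect is easy to reproduce with a toy model. The sketch below is an illustration under invented assumptions, not the real queue logic: four workers with made-up latencies, a queue of 100 checks per minute, and chunks of 40 handed out in order of arrival. The `simulate` function and all of its numbers are hypothetical.

```shell
#!/bin/bash
# Toy model of the race: every minute the queue holds 100 checks and hands
# them out in chunks of 40, first come, first served. All numbers are invented.
names=(atlanta dallas seattle europe)
latency_ms=(20 35 50 70)   # hypothetical network delay to the queue server

# simulate MINUTES MAX_JITTER_MS: print the total checks claimed per worker.
simulate() {
  local minutes=$1 max_jitter_ms=$2
  local -a totals=(0 0 0 0)
  local m i jitter remaining take order
  for (( m = 0; m < minutes; m++ )); do
    # Arrival time = network latency + random start delay (0 when no jitter).
    local -a arrivals=()
    for i in "${!names[@]}"; do
      jitter=0
      if (( max_jitter_ms > 0 )); then
        jitter=$(( RANDOM % (max_jitter_ms + 1) ))
      fi
      arrivals+=("$(( latency_ms[i] + jitter )) $i")
    done
    # Serve workers in order of arrival until the queue is empty.
    order=$(printf '%s\n' "${arrivals[@]}" | sort -n | awk '{ print $2 }')
    remaining=100
    for i in $order; do
      take=$(( remaining < 40 ? remaining : 40 ))
      totals[i]=$(( totals[i] + take ))
      remaining=$(( remaining - take ))
    done
  done
  for i in "${!names[@]}"; do
    echo "${names[i]} ${totals[i]}"
  done
}
```

With `simulate 60 0` the same two workers claim a full chunk every minute and the farthest one is left with nothing, while `simulate 60 15000` (up to 15 seconds of random delay) shuffles the arrival order so the totals should come out much closer together.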
Solution: Add jitter to cron jobs
The solution for us was to add jitter to the cron/periodic jobs on the distributed servers. Instead of all the servers running the same command at the exact same time (every minute), the servers run the job with a little jitter—variation in the precise time the job runs.
There are a few ways to accomplish jitter, and the easiest is to run cron itself with the -j [0-60] option, which uses cron's built-in jitter. However, this (a) applies to all cron jobs, not just the one you want to have jitter, and (b) only works with vixie-cron and its derivatives (so FreeBSD, CentOS 5.x, etc., but not most modern Linux distributions).
The solution we're using involves calling an intermediary shell script from cron (instead of the original command directly), which adds its own jitter. Here's an example of the script:
#!/bin/bash
# Run shell-script.sh after a few seconds of delay (jitter).
# @see http://stackoverflow.com/a/16873829/100134
# Add jitter by sleeping for a random amount of time (between 0-15 seconds).
WAIT=$(( RANDOM % 16 ))
sleep "$WAIT"
# Run the original command/script.
/path/to/shell-script.sh
Instead of invoking the original command/script directly from crontab (crontab -e), we call the shell script:
# Contents of crontab.
* * * * * /path/to/jitter-script.sh
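If several cron entries need jitter, one possible variation is a wrapper that takes the maximum delay and the command as arguments. This is a sketch, not the script we run: the `run_with_jitter` function and the argument order are hypothetical.

```shell
#!/bin/bash
# Hypothetical generic wrapper. Usage: MAX_SECONDS COMMAND [ARGS...]

# run_with_jitter MAX COMMAND [ARGS...]: sleep for a random whole number of
# seconds between 0 and MAX (inclusive), then run COMMAND with its arguments.
run_with_jitter() {
  local max=$1
  shift
  local delay=$(( RANDOM % (max + 1) ))
  sleep "$delay"
  "$@"
}

# Only act when arguments were given, so the file can also be sourced.
if (( $# >= 2 )); then
  run_with_jitter "$@"
fi
```

Saved under a name of your choosing, it would be called from crontab with the delay first and the original command after it, for example with the arguments `15 /path/to/shell-script.sh`. A one-line alternative inside the crontab itself is possible, but cron treats an unescaped % as a newline and usually runs commands with /bin/sh, where RANDOM may not exist, so a wrapper tends to be less fragile.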
We've been running with this setup for a few days now, and the standard deviation is down to within about 5%, which is fine by us. Our server check load is now spread evenly among all our servers, and our capacity and data reliability are improved as well.
It's enough to make us want to dance a little jitterbug :)
Besides server checking, there are many other situations where adding a little jitter can help. When sending backups or grabbing data to or from a particular server, jitter can save that server from getting slammed! Maybe it's time to introduce a little uncertainty into your periodic tasks.