Blog

Helping you keep your server online and your website fast.

1 minute checks for all accounts

After over a year of testing and expansion, Server Check.in is now able to handle 1-minute checks for everyone, regardless of the plan you're using.

Our check infrastructure has not been the bottleneck for some time (we have now expanded to six servers throughout the world); the real constraint was the database server, which couldn't gracefully handle the load of (potentially) 10,000+ checks per minute. The database structure has been improved and frequent writes have been optimized, and we hope Server Check.in continues to be one of the most inexpensive, simple, and robust services for monitoring your servers and websites!

Server Check.in, one year later

It's been a year since my original Show HN post announcing Server Check.in, so I thought I'd post a reflection after my first year running the service.

Server Check.in is an inexpensive website uptime monitoring service that I started out of my own need—if I had no users, I'd still keep it going. Fortunately, there are enough paid customers to keep me interested, and I've learned a lot from their feedback.

I had a good initial batch of signups after the announcement post on HN, and some other posts on Reddit, Low End Talk, and other forums. These early customers gave me a lot of feedback, and some suggested fairly major revisions to the service. It took a lot of discipline early on to make sure I only worked on the most valuable features first.

I decided early on to focus on stability, scalability, and performance before working on some of the requested features. This has paid off many times: I have not had to revisit Server Check.in's architecture or basic functionality since a couple of months in. The distributed Node.js-based server checking is working very well, and the master app server running Drupal with PHP/MySQL (which also runs this blog) still has plenty of room for growth.

Below are some of the major lessons I've learned in the past year; I hope you can learn something from my experience.

1. Your priorities are not your customer's priorities.

This is a continual struggle. I have some pretty neat features I'd like to work on to make the site look and function better, but most of the features my customers (both existing and potential) need are more invisible, or a little less fun to implement. I find that if I develop in a tick-tock cycle, building a feature I want and then one a customer wants, I keep the development interesting and keep my customers happy.

2. Sales is an uphill battle (for a developer).

I understand why salespeople seem to be pushy: it takes a lot to convince some people that a product would benefit them, even when deep down they already know it. Especially with paid services, you need to make a hard sell to win most of your customers. (This assumes you have a good product in the first place, of course!)

I'm a developer. I like creating cool things, and I don't like 'wasting' my time selling these things. The reality is, though, that development is meaningless without sales, and time spent getting new customers is never wasted.

3. Contributing back to OSS improves you and your code.

Not only is contributing back (patches, documentation, testing, modules, etc.) a Good Thing™ in general, it also helps you build rock-solid components. Instead of spending an hour on a bit of code and coming up with an inflexible solution, you'll end up with a flexible, stable, and much more useful tool if you decide to contribute it back to a community.

I don't do sloppy work in any of my projects, but I am especially thorough when I write code for an OSS community. Through my work on Server Check.in, I've been able to supply a Node.js module, some Drupal contrib module patches, and some blog posts on the process of using different OSS platforms together to build Server Check.in.

Finally, if I hadn't been following best practices in the different coding communities in which I participate, my code would not be as secure, my test coverage as nearly complete, or my infrastructure as highly organized.

4. Keeping a good work-life balance prevents burnout.

In the past, some of my side projects became burdens due to the time I devoted to them, and my excitement for them fizzled after a few months. Most weren't profitable anyway, but I learned this: if you spread yourself too thin, you become a lot less effective at everything you do. You only have enough bandwidth for a certain number of things.

I develop Server Check.in and other projects in my spare time, and I keep my priorities straight: family first, full-time job second, side projects third. I still ensure my side projects are reliable (Server Check.in has had more than 99.9% uptime this year, as measured by Pingdom), but I split my time in a way that keeps me sane. This helps everyone I have an obligation to serve: wife, kids, co-workers, and customers.

5. Experimenting makes you a better developer.

Without Server Check.in's HTTP request scaling issues, I probably wouldn't have spent much time learning Node.js or learning how to code for asynchronous functionality. I went from being able to check a few hundred servers per minute to being able to check thousands (and more, by simply adding more Node.js check servers).

I also learned (and have adopted for other infrastructure needs) Ansible after realizing my hacked-together shell script deployment strategy wouldn't scale past a few servers. Now instead of taking an hour or two to get a new server spun up, I can have one in a couple minutes.

Finally, I've worked quite a bit on Stripe and Twilio integration, and now know parts of their APIs pretty thoroughly. This has already helped me in my current full-time job, and will continue to pay off—both literally and figuratively!

You aren't allowed to take risks or go off in crazy new directions in most jobs (especially in the 'enterprise' or corporate arena). Flex your development muscles and increase your enjoyment of developing with a side project or two. Who knows? Some of the things you learn might become extremely valuable for your next job—or help you in your current one.

6. It's harder than you think to build a simple tool like Server Check.in.

If you're building a one-off utility to check whether a server is up or down, for a few servers, and you're the only one that will view this utility, it may be simple. But add in historical tracking, latency monitoring, SMS and email notifications, thousands of concurrent requests, a user and billing system, and other bits and pieces, and the project is no longer as simple as it seems.

Many developers (myself included) think to themselves, "this sounds really simple—I'll just build my own". Almost always, it takes more time than you'd think to get everything running. What is your hourly rate? If you're working on a project for 12 hours, at $70/hour, was it really worth the almost $1,000 to have a product that isn't as good as one you could've paid for at a fraction of the cost?

Sometimes it's worthwhile because of point 5 above—but other times, what's the point? Don't waste time on side projects that won't hold your interest. Often paying for something or using an open source application that fits most of your requirements is better than spending a lot of time on a bespoke application.

Stay Hungry

These are just a few of the lessons I've learned. I could probably think of many more, but I need some material for next year's retrospective! I hope something you read here can help you when you're deciding what to do next, or how to improve your current projects. If you have any more questions, please let me know on Hacker News or in the comments below.

A little jitter can help (evening out distributed cron-based tasks)

Ever since Server Check.in started using multiple, geographically-distributed servers running a small Node.js app to perform server checks, we've been monitoring the number of checks per server on an hourly basis, and calculating the standard deviation based on the numbers for each server.
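The per-server comparison described above can be sketched with a small awk pipeline; the check counts below are hypothetical sample numbers, not real data, and the actual monitoring pipeline is more involved:

```shell
#!/bin/bash
# Mean and population standard deviation of per-server check counts,
# read from stdin one per line.
check_stats() {
  awk '{ sum += $1; sumsq += $1 * $1; n++ }
  END {
    mean = sum / n
    sd = sqrt(sumsq / n - mean * mean)
    printf "mean=%.1f sd=%.1f\n", mean, sd
  }'
}

# Hypothetical hourly check counts from six check servers:
printf '%s\n' 1020 980 1400 950 890 760 | check_stats
```

A high standard deviation relative to the mean (as in this sample) is the signal that the check load is unevenly distributed.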

We noticed an alarming trend: some check servers were performing more than 40% more checks than others! The main queue server that controls when servers get checked uses timestamps, so we tried introducing a little jitter by adding +/- 10 seconds to the timestamps each time they were checked. Unfortunately, this did nothing at all to spread the checks among the different check servers.

Then we noticed something peculiar: the main queue server is in New York, and the check server that consistently performed the most checks was the one geographically closest, in Atlanta. Then came the server in Dallas, then one in Seattle, etc., until we reached the stragglers, all located in Europe.

We finally realized the problem: all our servers are synchronized via NTP to within a few ms of each other, and our custom Node.js app queries the master queue server every minute, at the beginning of the minute, for a list of servers to be checked. Because some servers were geographically closer than others, their requests almost always arrived a few ms sooner than others. Because of this, the closer the server was to the master, the more likely it would get a full chunk of servers to check every minute.

Servers that were further away and had slower ping times (~70 ms vs. ~20 ms for the closest server) were more likely to be cleaning up the tail end of the list of servers to be checked in a given minute.

Solution: Add jitter to cron jobs

The solution for us was to add jitter to the cron/periodic jobs on the distributed servers. Instead of all the servers running the same command at the exact same time (every minute), the servers run the job with a little jitter—variation in the precise time the job runs.

There are a few ways to accomplish jitter, and the easiest is to run cron itself with the -j [0-60] option, which uses cron's built-in jitter. However, this (a) applies to all cron jobs, not just the one you want to have jitter, and (b) only works with vixie-cron and its derivatives (so FreeBSD, CentOS 5.x, etc., but not most modern Linux distributions).

The solution we're using involves calling an intermediary shell script from cron (instead of the original command directly), which adds its own jitter. Here's an example of the script:

#!/bin/bash
#
# Run shell-script.sh after a few seconds of delay (jitter).
# @see http://stackoverflow.com/a/16873829/100134
#

# Add jitter by sleeping for a random amount of time (between 0-15 seconds).
WAIT=$(( RANDOM % 16 ))
sleep "$WAIT"

# Run the original command/script.
/bin/bash /path/to/shell-script.sh

And instead of invoking the original command/script from crontab (crontab -e), we call the shell script instead:

# Contents of crontab.
* * * * * /path/to/jitter-script.sh
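If you'd rather not maintain a separate wrapper script, the sleep can also be inlined in the crontab entry itself. Two caveats: `%` has special meaning in crontab lines and must be escaped with a backslash, and `$RANDOM` is a bash feature, so the crontab's `SHELL` must be set to bash:

```shell
# Contents of crontab: inline jitter, no wrapper script needed.
SHELL=/bin/bash
* * * * * sleep $((RANDOM \% 16)); /path/to/shell-script.sh
```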

We've been running with this setup for a few days now, and the standard deviation is down to within about 5%, which is fine by us. Our server check load is now spread evenly among all our servers, and our capacity and data reliability are improved as well.

It's enough to make us want to dance a little jitterbug :)

Besides server checking, there are many other situations where adding a little jitter can help. When sending backups or grabbing data to or from a particular server, jitter can save that server from getting slammed! Maybe it's time to introduce a little uncertainty into your periodic tasks.

New Server Check.in Features - October, 2013


We've been relatively quiet for the summer months, working hard to improve our infrastructure and many behind-the-scenes aspects of Server Check.in. But there are a few features we're very excited to announce today:

  • 1-minute check intervals for the Premium plan, 5-minute intervals for Standard
    This is the most often requested feature we've had in the past year, and now that our architecture has been improved to the point where more frequent checks won't affect the quality of Server Check.in's core service—notifying you when your servers are up or down—we have added it! Stay tuned, though; there's more to come...
  • More check servers (U.S. and Europe) added
    As mentioned in our blog post about reducing technical debt, we are now checking your servers from four geographically-distributed servers, and will be adding servers as time and budget allows. Please see the Check Servers page for a listing of the servers, their IP addresses, and their geographical locations (login required).
  • Infrastructure improvements (better uptime, even fewer false positives)
    Through the summer, instead of focusing on features, we've been hard at work identifying false positives (there have been very few, but we strive for perfection!), finding bottlenecks, and improving the site's UI and response times.

Server Check.in has continued to improve in reliability, speed, and features since day one; we're proud to report over 99.9% uptime since launch (as reported by Pingdom)! Please continue to post comments here and contact us to let us know what we can do to make Server Check.in better for you, personally!

And, if you're reading this and aren't yet a customer, tell us what you'd like to see from Server Check.in to entice you to sign up!

Improving architecture and adding features by containing technical debt

Server Check.in uses a small Node.js application to perform checks on servers and websites (to determine whether and how quickly they're responding to pings and http requests), while the main website and backend is built on Drupal and PHP.

One of my main goals in building Server Check.in with a modular architecture (besides being able to optimize each part using the best technologies available) was to ensure that the infrastructure and backend could easily scale as the service grows.

I'm happy to announce that Server Check.in now has multiple servers performing status checks (there are currently three check servers in the USA; more will be added soon!), and I'd like to explain how the Server Check.in architecture works and why it's built this way.

Priorities, priorities

Some people have asked why Server Check.in only used one server to do status checks initially—especially in light of the fact that most other uptime monitoring services tout tens or hundreds of probe servers as a headline feature. There are two reasons I chose to wait to expand to multiple check servers until recently:

Simplicity and stability trump features.

Now, this is not true for every situation, everywhere, but I've found through the years that people prefer a dependable, reliable, and simple solution (as long as it meets their basic needs) over a flaky solution, no matter how many features it has.

Server Check.in's core product is sending a notification when a server goes down and when it comes back up. If Server Check.in fails to notify you when your server is down, it failed. And if Server Check.in tells you your server is down when it's not, it failed even worse, because you start ignoring notifications.

Therefore, for the first few months of Server Check.in's existence, most of the development time was spent refining the accuracy and speed of our server checks, and making sure every notification was delivered successfully. As a direct result of these efforts, check frequencies have been increased and a new premium plan was added with even more value for the price. Server Check.in has also been up more than 99% of the time most months since launch!

Get the architecture right, or die slowly.

A major factor in choosing what new features go into Server Check.in is the limited development time available. To be honest, since Server Check.in is one of Midwestern Mac's side projects (a selfish one at that—I just want to be able to easily and inexpensively monitor my own clients' servers!), there aren't a ton of resources when it comes to developing new features or making large architectural changes.

Therefore, before I embarked on a specific architecture for distributing server checking across many servers, I wanted to make sure my architecture was sound, and would be easy to scale horizontally.

Other projects I've worked on have died a long, slow death because small architectural decisions made at the beginning of the project slowed down or halted development at a later time. Not only did development time have to be reallocated to maintenance and bug fixes, developers themselves were demoralized as they didn't get to spend much time on creating shiny new features or improving more interesting parts of the system. It's a vicious cycle.

While it does mean features come at a slower pace, keeping a lid on technical debt has allowed me to stay interested in building new features for Server Check.in and making it faster.

New Server Checking Infrastructure

Server Check.in now has multiple Node.js servers running our server checking application, and all of these servers communicate with our central application server to see if there are servers that need to be checked, then post back the check results. All data is transferred via a simple internal RESTful API, and with the current architecture, I'm confident Server Check.in can handle at least a few hundred check servers and thousands more clients without any extra work on the backend.
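As a rough sketch of what a check server's poll/report cycle against such an API could look like (the endpoint paths, parameters, and JSON fields here are hypothetical; Server Check.in's actual internal API is not public):

```shell
#!/bin/bash
# Sketch of a check server's cycle against the master's internal REST API.
# All endpoint names and payload fields below are made up for illustration.

API="https://master.example.com/api"

# Format one check result as a JSON object for the report POST.
result_json() {  # args: server_id, status (1=up, 0=down), latency_ms
  printf '{"server_id":%d,"status":%d,"latency_ms":%d}' "$1" "$2" "$3"
}

# 1. Fetch the batch of servers due for a check:
#      curl -s "$API/checks/pending"
# 2. Perform the checks and collect results, then post them back:
#      curl -s -X POST -d "[$(result_json 42 1 87)]" "$API/checks/report"
```

Keeping the exchange to two stateless calls (fetch a batch, report results) is what lets new check servers be added without any coordination beyond knowing the master's address.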

The internal API communication between the Node.js servers and the main application server is extremely simple, and this is what makes it so powerful. Put another way: by keeping the distributed part of our architecture simple, I avoid making an already complex situation (multiple servers communicating over private LANs or the public Internet) even more complex.

Additionally, our application servers (VPSes hosted in geographically disparate locations with different hosting providers for better redundancy) are using centralized code and configuration management (more on this in a future blog post, I hope!), so boring server management and deployments are trivial.

This architecture will also be very helpful when new types of checks and other new features are added, since everything will be distributed among all our check servers automatically. More time can be spent on developing new features rather than managing infrastructure and architecture.

Please let me know what you think of this post below, or on Hacker News or Reddit.

RODE smartLav Contest - Winner announced!

This week, we ran a contest for a RØDE smartLav lavaliere microphone. Twenty-nine people commented, and we're happy to announce that Junior King is the winner!

Junior will be recording podcasts with an iPhone 4 and the smartLav; he has a podcast on his website Cultured Hooligan, on which he interviews people who interest him—usually arts-related (musicians, artists, writers, etc.). We wish him the best of luck with his podcast, and hope the smartLav can help him record clean, audible podcasts!

To all other entrants

Thank you to everyone who commented on the contest blog post; we'll be sending a quick notification to you confirming you didn't win this time, along with a very nice coupon in case you still want to sign up for Server Check.in!

We hope to have more contests in the future—keep your eyes open, and subscribe to our blog, or follow us on social media!
