Helping you keep your server online and your website fast.

Improving architecture and adding features by containing technical debt

Server uses a small Node.js application to perform checks on servers and websites (to determine whether, and how quickly, they're responding to pings and HTTP requests), while the main website and backend are built on Drupal and PHP.

One of my main goals in building Server with a modular architecture (besides being able to optimize each part using the best technologies available) was to ensure that the infrastructure and backend could easily scale as the service grows.

I'm happy to announce that Server now has multiple servers performing status checks (there are currently three check servers in the USA; more will be added soon!), and I'd like to explain how the Server architecture works and why it's built this way.

Priorities, priorities

Some people have asked why Server only used one server to do status checks initially—especially in light of the fact that most other uptime monitoring services tout tens or hundreds of probe servers as a headline feature. There are two reasons I chose to wait to expand to multiple check servers until recently:

Simplicity and stability trump features.

Now, this is not true for every situation, everywhere, but I've found through the years that people prefer a dependable, reliable, and simple solution (as long as it meets their basic needs) over a flaky solution, no matter how many features it has.

Server's core product is sending a notification when a server goes down, and another when it comes back up. If Server fails to notify you when your server is down, it has failed. And if Server tells you your server is down when it's not, it has failed even worse, because you start ignoring notifications.

Therefore, for the first few months of Server's existence, most of the development time was spent refining the accuracy and speed of our server checks, and making sure every notification was delivered successfully. As a direct result of these efforts, check frequencies have been increased, and a new premium plan has been added with even more value for the price. Server has also been up more than 99% of the time most months since launch!

Get the architecture right, or die slowly.

A major factor in choosing what new features go into Server is the limited development time available. To be honest, since Server is one of Midwestern Mac's side projects (a selfish one at that—I just want to be able to easily and inexpensively monitor my own clients' servers!), there aren't a ton of resources when it comes to developing new features or making large architectural changes.

Therefore, before I embarked on a specific architecture for distributing server checking across many servers, I wanted to make sure my architecture was sound, and would be easy to scale horizontally.

Other projects I've worked on have died a long, slow death because small architectural decisions made at the beginning of the project slowed down or halted development later on. Not only did development time have to be reallocated to maintenance and bug fixes, but developers themselves were demoralized, since they didn't get to spend much time creating shiny new features or improving the more interesting parts of the system. It's a vicious cycle.

While it does mean features come at a slower pace, keeping a lid on technical debt has allowed me to stay interested in building new features for Server and making it faster.

New Server Checking Infrastructure

Server now has multiple Node.js servers running our server checking application, and all of these servers communicate with our central application server to see if there are servers that need to be checked, then post back the check results. All data is transferred via a simple internal RESTful API, and with the current architecture, I'm confident Server can handle at least a few hundred check servers and thousands more clients without any extra work on the backend.

The internal API communication between the Node.js servers and the main application server is extremely simple, and this is what makes it so powerful. Put another way: by keeping the distributed part of our architecture simple, I avoid making an already complex situation (multiple servers communicating over private LANs or the public Internet) even more complex.
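As an illustration of that simplicity, here's a hypothetical sketch of the two exchanges a check server makes. The endpoint paths and field names below are invented for illustration, not Server's actual internal API:

```javascript
// Hypothetical shapes for the internal API exchange (endpoint paths and
// field names are illustrative only, not Server's actual API).

// 1. A check server asks the central application server for work:
//      GET /api/checks/pending  ->  JSON array of servers to check
const pendingChecks = [
  { id: 42, url: 'https://example.com', type: 'http' },
  { id: 43, url: '203.0.113.10', type: 'ping' }
];

// 2. After running its checks, it posts the results back:
//      POST /api/checks/results
const checkResults = [
  { id: 42, status: 'up', responseTime: 182 },   // response time in ms
  { id: 43, status: 'down', responseTime: null } // no ping reply
];

console.log(pendingChecks.length + ' checks -> ' + checkResults.length + ' results');
```

Because each check server only ever pulls work and pushes results over plain JSON, adding another check server requires no coordination beyond pointing it at the central API.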

Additionally, our application servers (VPSes hosted in geographically disparate locations with different hosting providers for better redundancy) are using centralized code and configuration management (more on this in a future blog post, I hope!), so boring server management and deployments are trivial.

This architecture will also be very helpful as new types of checks and other new features are added, since everything will be distributed among all our check servers automatically. More time can be spent developing new features rather than managing infrastructure and architecture.

Please let me know what you think of this post below, or on Hacker News or Reddit.

RODE smartLav Contest - Winner announced!

This week, we ran a contest for a RØDE smartLav lavaliere microphone. Twenty-nine people commented, and we're happy to announce that Junior King is the winner!

Junior will be recording podcasts with an iPhone 4 and the smartLav; he has a podcast on his website Cultured Hooligan, on which he interviews people who interest him—usually arts-related (musicians, artists, writers, etc.). We wish him the best of luck with his podcast, and hope the smartLav can help him record clean, audible podcasts!

To all other entrants

Thank you to everyone who commented on the contest blog post; we'll be sending you a quick notification confirming you didn't win this time, along with a very nice coupon in case you still want to sign up for Server!

We hope to have more contests in the future—keep your eyes open, and subscribe to our blog, or follow us on social media!

Contest - Win a RØDE smartLav! [Contest over]

[Update: The contest is closed; congratulations to winner Junior King!]

Server was built to help people know when their servers and websites are down (more about us). Our primary audience is small businesses, development shops, and individual website owners. Many of our customers are podcasters or publish videos online, and are always looking for new tools to help them get better recordings more easily!

Since I do a lot of recording and podcasting myself, I've written about many different microphones and recording solutions for iPhones and other smartphones and tablets, and I heartily recommend the new (and hard-to-get!) RØDE smartLav Microphone.

Rode smartLav product picture by Jeff Geerling

The smartLav is a little clip-on lavaliere mic that connects directly to your iPhone or Android phone, and lets you record lectures, podcasts, etc. with greater ease and clarity than most other solutions. It costs about $60, and is available from many different retailers.

We happen to have a brand new smartLav we'd like to give away to one lucky winner!

How to Enter

  1. Leave ONE comment below. Tell us why you'd like the smartLav.
    Use your real name and email address so we can contact you if you win!
  2. Tweet or Like this page (not a requirement, but we'd appreciate it!).
  3. Check the Server Blog on Saturday, June 1 (that's this Saturday!) in the afternoon to see who won.

Read the official contest rules at the bottom of this post.

More Info about the smartLav

And don't forget to Sign Up for Server to keep tabs on your servers and websites; it's only $15/year for peace of mind. Even if you don't sign up, read more about us, and follow us on Twitter (@servercheckin) or like us on Facebook.

Rules: No purchase necessary. We'll choose one of the comments on this post at random and contact the winner on Saturday morning. Please make sure you check your email and/or Twitter on Saturday! If the chosen person responds to the email or Twitter message, they will be confirmed as the winner. The winner will be announced on this blog and on Server's social media accounts on Saturday, June 1, in the afternoon. We will ship the smartLav, free of charge, to the winner.

Approximate value of the prize is $60 (plus shipping). Winner will be notified by email. Odds of winning depend on the number of comments on this post.

We will not use any email addresses for any purpose besides contacting the winner. Check out our privacy policy for more information about what we do with paid customers' information. We respect your privacy—especially if you sign up for our service ;-)

Moving functionality to Node.js increased per-server capacity by 100x

Node.js vs PHP and Drupal Queue Comparison - asynchronous vs synchronous

When Server was introduced, it was only slightly beyond the MVP stage, and we've been iterating on it ever since, making things more stable and adding small but effective features.

One feature we just finished deploying is a small Node.js application that runs in tandem with Drupal, allowing an incredibly large number of servers and websites to be checked in a fraction of the time it took using only PHP, cron, and Drupal's Queue API.

The Old Way - Drupal's Queue API

The original pipeline for server checks was as follows:

  1. On each cron run, we found all servers due to be checked and loaded them into a DrupalQueue (stored in the queue table in Server's MySQL database).
  2. A queue worker callback, registered via hook_cron_queue_info(), let Drupal manage the queue and pass items to the worker whenever cron ran.
  3. The worker callback received each server in the queue, one by one, checked it (using cURL for HTTP requests and ping for server pings), then posted the results back to the database.
  4. If a server went down, we sent it to a separate confirmation queue for a second check; if a server came back up, we sent out the proper notifications and updated its status.

My own tests showed that this pipeline would scale to a few hundred or a thousand servers, but would quickly become a bottleneck beyond that. Therefore, we limited the number of users allowed to sign up for Server. We didn't want users to sign up and get the nasty surprise of their servers only being checked every 20 or 30 minutes! (We check servers every 5-10 minutes, depending on your plan.)

The major flaw with this method is that it's a synchronous, serial process: each server check has to wait for the previous check to complete. This means that, when checking a hundred or so servers, the queue could take 10-15 minutes to empty! (That's assuming the worst case, where all servers are down or unresponsive and every check hits a 10-second timeout.)
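The difference is easy to quantify with back-of-the-envelope math. The numbers below are illustrative, matching the worst case described above (in practice, not every check hits the full timeout, which is why the real-world drain time lands around 10-15 minutes):

```javascript
// Worst-case drain time for the old serial queue (illustrative numbers:
// one hundred servers, each hitting the full 10-second timeout).
const serverCount = 100;
const timeoutSeconds = 10;

// Serial: each check waits for the previous one to finish, so the
// queue drain time grows linearly with the number of servers.
const serialSeconds = serverCount * timeoutSeconds; // 1000 s, ~17 minutes

// Asynchronous: all checks run concurrently, so the whole batch takes
// roughly as long as the single slowest check.
const parallelSeconds = timeoutSeconds; // ~10 s

console.log('serial: ' + serialSeconds + 's, parallel: ~' + parallelSeconds + 's');
```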

Additionally, scaling the checks beyond one web server would be difficult at best. The built-in Drupal Queue API can do some great things, but it's not built to run through thousands of long-running tasks in very short amounts of time. (For some tasks that require a lot of processing, we have used the cron queue along with Drush concurrency to run things in parallel, but that approach is overkill for simple, non-CPU-intensive HTTP requests, and gobbles up server memory.)

As a final note: we had looked into using curl_multi_exec() in PHP to get some amount of concurrency in our server checks, but this still would've required a bunch of additional work refactoring the Queue API and cron code. That's probably more work than building something in Node, and still not as scalable!

Building a Node.js server checking application

We decided to take a look at Node.js because it's a great platform for running tons of asynchronous (parallel) tasks that aren't CPU-bound. We built a simple Node.js app that gets a list of servers to check, runs through them in parallel (either checking for a 200 OK status, checking for content on a page, or pinging a URL/IP address), then sends the results back to our main server.
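To make the check logic concrete, here's a minimal sketch of how a single HTTP response might be classified as up or down. The function name, result shape, and reason strings are invented for illustration; the real app wires this kind of logic up to actual HTTP requests:

```javascript
// Hypothetical classifier for one HTTP check (shape is illustrative).
function classifyHttpCheck(statusCode, body, searchString) {
  // A plain status check passes on 200 OK.
  if (statusCode !== 200) {
    return { status: 'down', reason: 'HTTP ' + statusCode };
  }
  // A content check additionally looks for a string in the page markup.
  if (searchString && body.indexOf(searchString) === -1) {
    return { status: 'down', reason: 'content not found' };
  }
  return { status: 'up', reason: null };
}

console.log(classifyHttpCheck(200, '<h1>All systems go</h1>', 'systems go'));
console.log(classifyHttpCheck(500, '', null));
```

Because the classification is a pure function, the same code path handles both plain status checks and content checks, and it's trivial to run hundreds of these concurrently as responses come in.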

We ended up using the request library instead of Node's built-in http library, simply because it makes HTTP requests much easier to manage, especially when redirects or other craziness is involved. We also used a modified version of the ping library for server pings, and have released the improved version as jjg-ping on NPM.

For communication between Node.js and the Server Drupal site/database, we built a couple of simple connectors:

  • Custom JSON callback in Drupal: To get a list of servers to check, we call a custom page callback that grabs the servers, marks them as 'checked out' so other check servers won't pick them up while they're in process, and prints a structured JSON array of server objects.
  • Send results back via CSV: In the Node app, we write results into a CSV file using fs.appendFile (from Node's built-in fs library), then post that CSV file back to the main server.
  • Processing the results in Drupal: We wrote a simple drush script that bootstraps Drupal just enough to access the database/API. It grabs all the result files, processes all the results (sending notifications or adding an item to the confirmation queue as needed), and clears out the result files.

In the future, we're planning on using a central database, perhaps Amazon RDS or another cloud-based DB, to store and retrieve raw server check data. But for now, passing CSV files back to the main server and clearing them out once processed works fine. Using CSV files this way requires a little extra work, but it was quick to get going—I'll iterate later :)

This move to using Node.js for asynchronous requests and pings resulted in three great benefits:

  1. We can easily scale out horizontally, adding more check servers (and more regional diversity) as needed. Just spin up a new Node.js server and let it do its thing.
  2. We can now process ~100 server checks per second, per Node server, whereas with PHP/Drupal alone, we were only able to do about one per second, on average.
  3. Node.js used about 50% less RAM than the Drupal stack that had to run for each cron queue processing run. The maximum amount of memory used by Node was about 20 MB, while each httpd thread on the server uses around 45-50 MB.

Node for Asynchronous network or IO-bound tasks

Our takeaway: if you need to run some potentially slow tasks very often, and they're either network- or IO-bound, consider moving those tasks to a Node.js app. Your server and your overloaded queue will thank you!

New Server Features - April 2013

We worked hard this month to add some great new features to Server. In addition to some interface tweaks that make adding and managing servers much more enjoyable, we've added a new subscription plan!

  • New Premium Plan available (25 servers, 5-minute checks)
    You can now upgrade to the Premium plan and get 25 servers (up from 5) and 5-minute check intervals (up from 10), as well as more allowed SMS messages per month. The price for the Premium plan is just $48/year (that's just $4/month!). If you've run out of additional servers in your account, you can upgrade for a pro-rated amount by logging into Server and clicking the Upgrade link under the 'Edit Account' tab.
  • New 'Global Website Latency' Tool
    We're slowly adding more tools to help you gauge your site's performance over time and against other websites. Our new Global Latency Graphs tool lets you see how the entire Internet (as seen by Server) is performing.
  • The Server Blog!
    Because we care about performance, we thought we'd start writing about how you can make your own sites and servers perform better, and stay up longer. Check out our blog and subscribe to our Blog's RSS feed to keep up to date.

We have even more exciting new features we're hoping to reveal in the coming summer months, so keep an eye on your inbox for more! As always, please let us know if you have any ideas or suggestions for improving Server by emailing us or replying to this note. Many of the features we've implemented in our short existence have been directly requested by you—our awesome customers!

New Server Features - March 2013

Here are the latest features we've added to Server since January:

  • Monthly Server Summary Emails
    One of the most requested features was a monthly summary email showing how your servers are doing at the end of a month. Server will now send you a summary email on the first day of every month, showing the previous month's stats for all the servers in your account.

    This feature is enabled by default, but you can opt out of these emails by unchecking the 'Monthly Summary Emails' checkbox in your account on Server.

  • HTTP/HTTPS Content Checks
    This is a feature we've wanted since day one, and many of you have requested it: you can now have Server check for the presence of a bit of text in the markup of a web page.

    For our launch, we focused on simplicity, and made sure that everything ran well with just two kinds of checks: HTTP '200 OK' status checks, and server pings. Now that we've been able to spend more time ensuring we can handle the extra load of content checks, we've enabled this service for all your servers.

    One advanced way to use this feature is to create a page somewhere on your site that prints a certain message if a service on your site is running correctly, then check for that text in Server.

We're hard at work making even more improvements to Server. What would you like to see us do? Let us know by contacting us or posting a comment below.
