When Server Check.in was introduced, it was only slightly beyond MVP stage, and we've been iterating on it, making things more stable, adding small but effective features.
One feature that we just finished deploying is a small Node.js application that runs in tandem with Drupal to allow for an incredibly large number of servers and websites to be checked in a fraction of the time that we were checking them using only PHP, cron, and Drupal's Queue API.
The Old Way - Drupal's Queue API
The original pipeline for server checks was to do the following:
- On a cron run, find all servers that are due to be checked, and load them into a DrupalQueue (which we were storing in the queue table in Server Check.in's MySQL database).
- Build a queue worker callback, and use hook_cron_queue_info() to let Drupal manage the queue and pass items to the worker callback whenever cron runs.
- The worker callback received each server in the queue, one by one, checked it (using cURL for http requests, and Ping for server pings), then posted the results back to the database.
- If a server went down, we would send it to another confirmation queue for a second check. If a server went back up, we would send out the proper notifications and update the server's status.
My own tests showed that this pipeline would handle a few hundred to a thousand servers, but beyond that it would quickly become a bottleneck. Therefore, we limited the number of users that were allowed to sign up for Server Check.in. We didn't want users to sign up and get the nasty surprise of their servers only being checked every 20 or 30 minutes! (We check servers every 5-10 minutes, depending on your plan.)
The major flaw with this method is that it is a synchronous, serial process: Each server check would have to wait for the previous check to complete. This means that, when checking a hundred or so servers, the queue would take up to 10-15 minutes to empty! (This is assuming the worst case; all servers are down/unresponsive and they hit a 10 second timeout).
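To put rough numbers on that bottleneck: worst-case wall-clock time for a serial queue grows linearly with the number of servers, while a fully parallel run stays near a single timeout. A quick back-of-the-envelope sketch (the function and figures are illustrative, not from the actual codebase):

```javascript
// Worst-case wall-clock time (in seconds) to drain a check queue,
// assuming every check hits the full timeout.
// concurrency 1 = the old serial Drupal queue; concurrency = server
// count approximates a fully parallel Node.js run.
function worstCaseSeconds(servers, timeoutSec, concurrency) {
  return Math.ceil(servers / concurrency) * timeoutSec;
}

// Old way: 100 unresponsive servers, checked one at a time,
// 10-second timeout each.
console.log(worstCaseSeconds(100, 10, 1));   // 1000 seconds of waiting

// Same 100 checks fanned out fully in parallel.
console.log(worstCaseSeconds(100, 10, 100)); // 10 seconds
```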
Additionally, scaling beyond one web server for the checks would be difficult at best. The built-in Drupal Queue API can do some great things, but it's not built to run through thousands of long-running tasks in very short amounts of time. (Note that, for some processes, like those that require a lot of processing, we have used the cron queue along with drush concurrency to process things in parallel, but this is overkill for simple (and non-CPU-intense) HTTP requests, and gobbles up server memory.)
As a final note: we had looked into using curl_multi_exec() with PHP to get some amount of concurrency into our server checks, but that still would have required a bunch of additional work refactoring the Queue API and cron code. Probably more work than building something in Node, and still not as scalable!
Building a Node.js server checking application
We decided to take a look at Node.js because it's a great platform for running tons of asynchronous (parallel) tasks that aren't CPU-bound. We built out a simple Node.js app that gets a list of servers to check, runs through them in parallel (checking for a 200 OK status, checking for content on a page, or pinging a URL/IP address), then sends the results back to our main server.
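The fan-out described above can be sketched with nothing but callbacks and a counter; each check starts immediately instead of waiting for the previous one to finish. The names here (checkAll, fakeCheck) are illustrative, not Server Check.in's actual code:

```javascript
// Run every check concurrently; invoke done(results) once all finish.
function checkAll(servers, checkOne, done) {
  var results = [];
  var pending = servers.length;
  if (pending === 0) return done(results);
  servers.forEach(function (server, i) {
    checkOne(server, function (result) {
      results[i] = result;          // keep results in input order
      if (--pending === 0) done(results);
    });
  });
}

// Simulated check: report "up" after a short, random delay.
function fakeCheck(server, callback) {
  setTimeout(function () {
    callback({ server: server, status: 'up' });
  }, Math.random() * 50);
}

checkAll(['a.example.com', 'b.example.com'], fakeCheck, function (results) {
  console.log(results.length + ' checks finished');
});
```

Because the checks are network-bound, the total time is roughly that of the slowest single check, not the sum of all of them.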
We ended up using the request library instead of Node's built-in http library, simply because it makes HTTP requests much easier to manage, especially when redirects or other craziness is involved. We also used a modified version of the ping library for server pings, and have released the improved version as jjg-ping on NPM.
For connection between Node.js and the Server Check.in Drupal site/database, we built a couple simple connectors:
- Custom JSON callback in Drupal: To get a list of servers to check, we simply call a custom page callback that grabs the servers due for a check, marks them as 'checked out' (so other check servers won't pick them up while they're in process), and prints a structured JSON array of server objects.
- Send results back via CSV: In the Node app, we simply write results into a CSV file using fs.appendFile (from Node's built-in fs library), then post that CSV file back to the main server.
- Processing the results in Drupal: We wrote a simple drush script that bootstraps Drupal just enough to access the database/API. It grabs all the result files, processes all the results (sending notifications or adding an item to the confirmation queue as needed), and clears out the result files.
In the future, we're planning on using a central database, maybe something using Amazon RDS or another cloud-based DB, and using that to store and retrieve raw server check data. But for now, passing CSV files back to the main server, and clearing them out once processed, seems to work fine. Using CSV files in such a manner requires a little bit of extra work, but it was quick to get going—I'll iterate later :)
This move to using Node.js for asynchronous requests and pings resulted in three great benefits:
- We can easily scale out horizontally, with more servers to do the server checking (with more regional diversity, and with more servers as needed). Just spin up a new Node.js server, and let it do its thing.
- We can now process ~100 server checks per second, per Node server, whereas with PHP/Drupal alone, we were only able to do about 1 per second, on average.
- Node.js used about 50% less RAM than the Drupal stack that had to run for each cron queue processing run. The maximum amount of memory that was used by Node was about 20 MB. Each httpd thread on the server is around 45-50 MB.
Node for asynchronous network- or IO-bound tasks
My takeaway: If you need to do some potentially slow tasks very often, and they're either network or IO-bound, consider moving those tasks off to a Node.js app. Your server and your overloaded queue will thank you!