Something strange has been happening to my Catalyst applications: they’ve been running fine (more than fine, actually), but every 7 days or so, the FastCGI server seems to lock up. I’m not sure where the failure is at the moment, but the apps need to be available all of the time, so they need monitoring and restarting if there’s an error. I’m already using monit to keep an eye on the system and other running processes, so here’s how I’ve got it checking the FastCGI servers as well. (I could write a Perl script using the FCGI::Client module to do the checking, but then I’d have to write a monitor to make sure the monitor was still running :D)

monit does not, by default, come with a protocol handler for FastCGI. I could have a known-good endpoint that I could call using the HTTP protocol, but that also relies on having an HTTP server running that can connect to the FastCGI server. I wanted to test the FastCGI server directly.

It is possible – you just have to do your own binary send/expect string, like so:

# Empty FastCGI request
if failed port 8101
  # Send FastCGI packet: version 1 (0x01), type FCGI_GET_VALUES (0x09),
  # request ID 0, content length 0, padding length 8 (0x08), reserved 0,
  # followed by the 8 NUL padding bytes
  send "\0x01\0x09\0x00\0x00\0x00\0x00\0x08\0x00\0x00\0x00\0x00\0x00\0x00\0x00\0x00\0x00"
  # Expect FastCGI packet: version 1 (0x01), resp FCGI_GET_VALUES_RESULT (0x0A)
  expect "\0x01\0x0A"
  timeout 5 seconds
then restart

Basically, we’re sending an empty FastCGI management packet (FCGI_GET_VALUES with no variables requested) to the server, and getting the equally empty response back. From what I’ve experienced, when it locks up the FastCGI server is still accepting connections, just not returning any data, so I’ve put the timeout clause on there as well. Sending 16 bytes out and matching on the first 2 bytes that come back is light enough a test for me. :)
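For anyone who wants to see where those 16 bytes come from, here is a rough Python sketch that builds the same record from the 8-byte FastCGI header layout (version, type, request ID, content length, padding length, reserved) and prints it in monit's `\0xHH` notation. The function names `build_record` and `to_monit_string` are made up for illustration; they aren't part of monit or any FastCGI library.

```python
import struct

FCGI_VERSION_1 = 1
FCGI_GET_VALUES = 9          # management record: ask the server for values
FCGI_GET_VALUES_RESULT = 10  # the record type expected in the reply


def build_record(rec_type, content=b"", padding_length=0, request_id=0):
    """Build one FastCGI record: 8-byte header, content, then padding.

    Header fields, in network byte order: version (1 byte), type (1 byte),
    request ID (2 bytes), content length (2 bytes), padding length (1 byte),
    reserved (1 byte).
    """
    header = struct.pack(
        "!BBHHBB",
        FCGI_VERSION_1,
        rec_type,
        request_id,
        len(content),
        padding_length,
        0,  # reserved
    )
    return header + content + b"\x00" * padding_length


def to_monit_string(data):
    """Render bytes in the \\0xHH escape notation used in the monit rule."""
    return "".join("\\0x%02X" % byte for byte in data)


# The probe from the monit rule: empty FCGI_GET_VALUES with 8 padding bytes.
probe = build_record(FCGI_GET_VALUES, padding_length=8)

if __name__ == "__main__":
    print(len(probe), "bytes")
    print(to_monit_string(probe))
```

The printed string should be the one in the `send` line above, and the first two bytes of a healthy reply would be version 1 followed by `FCGI_GET_VALUES_RESULT`, which is all the `expect` line looks at.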

I’ve got two identical servers running the same app. One is running with a “restart” action on test failure, and the other just has “alert”. The front-end nginx servers are set to fall back to the other server in the event of a FastCGI failure. In about a week, I’m expecting both app servers to die, but one should restart and the other should stay dead (and fill up the syslog :)). With nginx doing the failover, site visitors shouldn’t be affected by this, but it’ll give me a chance to do more investigation into why these app servers are dying.

Given that this test uses nothing specific to any FastCGI implementation (I double-checked it against the FastCGI spec), there’s no reason why it can’t be used to keep PHP or Python FastCGI servers running as well. Maybe one day there’ll be a native FastCGI protocol test in monit, but until then this seems to be the best there is.
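Before trusting the rule against a different FastCGI backend, it may be worth trying the same probe by hand once to confirm that the server actually answers it. The sketch below is one way to do that as a one-off check, not a replacement for monit. It assumes the server listens on a TCP port; `fastcgi_alive` and its parameters are hypothetical names of my own.

```python
import socket

# Same 16 bytes as the monit "send" string: empty FCGI_GET_VALUES record
# with 8 bytes of padding.
PROBE = bytes([0x01, 0x09, 0x00, 0x00, 0x00, 0x00, 0x08, 0x00]) + b"\x00" * 8
# Same 2 bytes as the monit "expect" string: version 1, FCGI_GET_VALUES_RESULT.
EXPECTED_PREFIX = b"\x01\x0a"


def fastcgi_alive(host, port, timeout=5.0):
    """Return True if the server answers the probe with the expected prefix.

    Any connection error, timeout, or early close counts as a failure,
    mirroring the "if failed ... timeout 5 seconds" behaviour of the rule.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(PROBE)
            data = b""
            # Read only as many bytes as are needed to compare the prefix.
            while len(data) < len(EXPECTED_PREFIX):
                chunk = sock.recv(len(EXPECTED_PREFIX) - len(data))
                if not chunk:
                    return False  # server closed without replying
                data += chunk
            return data == EXPECTED_PREFIX
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
```

Calling `fastcgi_alive("127.0.0.1", 8101)` against the app server from the rule above should return `True` while it is healthy, and `False` once it stops returning data.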