Saturday, 12 May 2012

System monitoring

I hate having to configure monitoring. Especially in a complex interconnected system with 100s of servers and 1000s of processes. I have used Nagios in the past; this might be why I'm bald.
When monitoring a system all I really want to know is what should be up, what is up, what is it connected to, what are the inputs / outputs, and how busy is it.
A while ago I wrote a Python script to take a list of hosts, ssh to them, issue a ps and lsof, parse the outputs and connect all the dots. This worked well, but it took a long time to run and output a massive file giving a lot of data (network connections, stdin/out, executables, command line arguments, CPU / memory usage) but no real useful information. Apart from the lengthy execution and the lack of information it worked a treat. So, having failed to achieve anything useful I naturally put this Python code into my back-burner directory and got on with something else.
A few months ago I read somewhere that the Processing environment was now available in the browser via processing.js. This was good news. I now had the tool to display the data as meaningful information. I just had to get the server scan down to a sensible time so that the information displayed was up to date. I also noticed that using Python to monitor my system's performance was causing system performance issues!
Enter Go.
I ported the Python code to Go. Not a straight port; I used goroutines to issue the ssh commands and channels to gather / process the output and push the information to browser clients.
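The goroutine-and-channel arrangement can be sketched roughly as follows. This is not the code from the tool itself, just a minimal illustration of the scatter / gather pattern: the names `result`, `runFunc` and `scan` are made up for the example, and the ssh call is replaced by a fake function so the sketch runs anywhere. In practice `runFunc` would presumably wrap something like `exec.Command("ssh", host, ...)`.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// result carries the raw output of one remote command back to the collector.
type result struct {
	Host   string
	Output string
	Err    error
}

// runFunc is a hypothetical stand-in for "ssh to the host and run ps / lsof".
type runFunc func(host string) (string, error)

// scan starts one goroutine per host (scatter) and collects every
// result from a single channel (gather).
func scan(hosts []string, run runFunc) map[string]result {
	ch := make(chan result)
	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			out, err := run(host)
			ch <- result{Host: host, Output: out, Err: err}
		}(h)
	}
	// Close the channel once every worker has reported, which ends the
	// range loop below.
	go func() {
		wg.Wait()
		close(ch)
	}()
	gathered := make(map[string]result, len(hosts))
	for r := range ch {
		gathered[r.Host] = r
	}
	return gathered
}

func main() {
	// Fake runner used instead of a real ssh call (assumption for the demo).
	fake := func(host string) (string, error) {
		return "ps output from " + host, nil
	}
	res := scan([]string{"web1", "db1"}, fake)

	// Sort the host names so the printout is stable.
	names := make([]string, 0, len(res))
	for h := range res {
		names = append(names, h)
	}
	sort.Strings(names)
	for _, h := range names {
		fmt.Println(h, "->", res[h].Output)
	}
}
```

Because the slow part is waiting on the network, the scan takes about as long as the slowest host rather than the sum of all of them.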
On a loop, ssh commands are issued and the output processed into meaningful results which are then stored in a map. Process signatures are parsed so as to give sensible names and display arguments. This map is then compared to the previous loop's map to calculate each process's status. Is the process alive? Has the PID changed? Have the socket connections changed, or dropped? How much CPU and memory are being used? That sort of thing. The results are then pushed as JSON to any connected browsers via websockets.
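The map comparison might look something like the sketch below. The `Proc` struct, the `host/name` keys and the status strings are all assumptions made for illustration, not the tool's real data model; the point is only to show one loop's map being compared with the previous one and the result encoded as JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Proc is a hypothetical record for one monitored process.
type Proc struct {
	Name   string   `json:"name"`
	PID    int      `json:"pid"`
	Conns  []string `json:"conns"`
	Status string   `json:"status"`
}

// sameConns reports whether two connection lists hold the same entries,
// ignoring order.
func sameConns(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[string]int, len(a))
	for _, c := range a {
		seen[c]++
	}
	for _, c := range b {
		if seen[c] == 0 {
			return false
		}
		seen[c]--
	}
	return true
}

// compare sets a status on every process in curr by checking it against
// prev, and carries over anything that has vanished as "missing".
func compare(prev, curr map[string]Proc) map[string]Proc {
	out := make(map[string]Proc, len(curr))
	for key, p := range curr {
		old, known := prev[key]
		switch {
		case !known:
			p.Status = "new"
		case old.PID != p.PID:
			p.Status = "restarted"
		case !sameConns(old.Conns, p.Conns):
			p.Status = "conns-changed"
		default:
			p.Status = "ok"
		}
		out[key] = p
	}
	for key, old := range prev {
		if _, alive := curr[key]; !alive {
			old.Status = "missing"
			out[key] = old
		}
	}
	return out
}

func main() {
	prev := map[string]Proc{
		"web1/nginx":  {Name: "nginx", PID: 100, Conns: []string{"db1:5432"}},
		"db1/postgres": {Name: "postgres", PID: 200},
	}
	curr := map[string]Proc{
		"web1/nginx": {Name: "nginx", PID: 101, Conns: []string{"db1:5432"}},
	}
	// This JSON is what would be pushed to the browsers.
	payload, err := json.Marshal(compare(prev, curr))
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(string(payload))
}
```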
In the browser, the processing.js environment loops at a given frame rate drawing the nodes and connections as detailed in the provided JSON. The screen gives lots of visual information: status, type of process, resource usage, connections made etc. Processing and jQuery allowed me to code user interactions with the nodes. The user can position, filter, detail, and maintain the nodes. As the websocket is bidirectional and a node details the process's standard output, I was able to allow the users to hit a key to tail the process's log file in the browser. I intend to expand this further to interact with the underlying process (start, stop etc.), but I haven't added any authentication or authorisation yet.

Once the user has positioned the nodes to their liking they're able to save the screen by sending the current display context, again as JSON, back to the server over the websocket. This then allows me to save a known good state of the system and use this as a default process map when starting the monitoring process. This solves the problem of monitoring missing processes.
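As a rough illustration of how a saved layout could double as the list of what should be up, here is a small sketch. The JSON shape, the `Node` struct and the `missing` function are invented for the example; the real display context will certainly carry more than a key and a position.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Node is a hypothetical saved node: a process key plus its screen position.
type Node struct {
	Key string  `json:"key"`
	X   float64 `json:"x"`
	Y   float64 `json:"y"`
}

// missing returns the keys that appear in the saved layout but were not
// found by the live scan.
func missing(saved []byte, live map[string]bool) ([]string, error) {
	var nodes []Node
	if err := json.Unmarshal(saved, &nodes); err != nil {
		return nil, err
	}
	var gone []string
	for _, n := range nodes {
		if !live[n.Key] {
			gone = append(gone, n.Key)
		}
	}
	sort.Strings(gone)
	return gone, nil
}

func main() {
	// Saved layout as it might arrive from the browser (made-up format).
	saved := []byte(`[{"key":"web1/nginx","x":10,"y":20},{"key":"db1/postgres","x":200,"y":40}]`)
	// Keys the current scan actually found.
	live := map[string]bool{"web1/nginx": true}

	gone, err := missing(saved, live)
	if err != nil {
		fmt.Println("bad layout:", err)
		return
	}
	fmt.Println("missing:", gone)
}
```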

Processing enabled me to present the system information in a palatable way and deserves a lot of credit.

Go, though, deserves unending credit. All the heavy lifting of issuing 1000s of system calls, parsing the output and collating the results is handled with ease. The job that took my Python script four to five minutes to complete is done in under a second. I even restricted the Go process to a single core to make the comparison with Python fair. The Go garbage collector is clearly doing its job, as throughout the day the memory footprint holds steady. Sar stats across all machines show no adverse effect of using this scatter / gather approach.
Go, Processing (along with jQuery for the dialog inputs) and the good old shell utils gave me the ability to write my own bespoke monitoring software and ditch Nagios.

My one complaint, though, is that no matter how much I rub the source code on my head my hair still won't grow back.