If you have been an admin for any length of time, you have certainly run into situations where a server spikes in CPU use, memory utilization, or load level. Running `top` won't always give you the answer, either. So how do you find those sneaky processes that are chewing up your system resources so you can kill 'em?

The following script might be able to help. It was written for a web server, so some parts of it look specifically for httpd processes and others deal with MySQL. Depending on your server deployment, simply comment out or delete those sections and add others. Treat it as a starting point.

The one prerequisite for this version of the script is mytop (available at http://jeremy.zawodny.com/mysql/mytop/), a free tool released under the GNU General Public License that is fantastic for checking how MySQL is performing. It is getting old, but still works great for our purposes here.

Additionally, I use mutt as the mailer - you may want to change the script to simply use the built-in Linux `mail` utility. I run it via cron every hour; adjust as you see fit. Oh - and this script needs to run as root, since it reads from some protected areas of the server.
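
Since the script has to run as root, it can be worth failing fast when someone else starts it. The following is a minimal sketch and not part of the original script: the function name `require_root` and the install path in the crontab comment are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): succeed only for uid 0.
# The optional argument exists so the check can be exercised without being root.
require_root() {
    local uid=${1:-$(id -u)}
    [ "$uid" -eq 0 ]
}

# Typical use at the top of the script:
#   require_root || { echo "must be run as root" >&2; exit 1; }
#
# Hourly entry for root's crontab (the script path is an assumed location):
#   0 * * * * /usr/local/sbin/checkSystemLoad.sh
```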

So let's get started, shall we?

First, set your script variables:

        #!/bin/bash
        #
        # Script to check system load average levels to try to determine
        # what processes are taking it overly high...
        #
        # 07Jul2010 tjones
        #
        # set environment
        dt=`date +%d%b%Y-%X`
        # Obviously, change the following directories to where your log files actually are kept
        tmpfile="/tmp/checkSystemLoad.tmp"
        # Second temp file for the pared down cell phone report (this path is an assumption; adjust as needed)
        topfile="/tmp/checkSystemLoad.top.tmp"
        logfile="/tmp/checkSystemLoad.log"
        msgLog="/var/log/messages"
        mysqlLog="/var/log/mysqld.log"
        # the first mailstop is standard email for reports. Second one is for cell phone (with a pared down report)
        mailstop="sysadmin@mydomain.com"
        mailstop1="15555555555@mycellphone.com"
        machine=`hostname`
        # The following three are for mytop use - use a db user that has decent rights
        dbusr="username"
        dbpw="password"
        db="yourdatabasename"
        # The following is the load level to check on - 10 is really high, so you might want to lower it.
        levelToCheck=10
    

Next, check your load level to see if the script should continue:

        # Set variables from system:
        loadLevel=`cat /proc/loadavg | awk '{print $1}'`
        loadLevel=$( printf "%0.f" $loadLevel )
        # if the load level is greater than you want, start the script process. Otherwise, exit 0
        if [ $loadLevel -gt $levelToCheck ]; then
            echo "" > $tmpfile
            echo "**************************************" >>$tmpfile
            echo "Date: $dt " >>$tmpfile
            echo "Check System Load & Processes " >>$tmpfile
            echo "**************************************" >>$tmpfile
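
The script rounds the load average to an integer because `[ -gt ]` only compares whole numbers, so a load of 10.4 does not trip a threshold of 10. If you would rather compare the unrounded value, something like the following could work. It is a sketch only: `read_load` and `load_exceeds` are hypothetical names, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original script.

# Print the 1-minute load average (first field of the given file,
# which defaults to /proc/loadavg).
read_load() {
    awk '{ print $1 }' "${1:-/proc/loadavg}"
}

# Return success when the load ($1) is strictly greater than the limit ($2).
# awk compares the two values as numbers, so no rounding is needed.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load > limit) }'
}
```

With these in place, the test could be written as `if load_exceeds "$(read_load)" "$levelToCheck"; then`, which also lets you use a fractional threshold such as 7.5.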
    

And continue through the checks, writing the results to the temporary file. Add or delete items from here where applicable to your situation:

            # Get more variables from system:
            httpdProcesses=`ps -def | grep httpd | grep -v grep | wc -l`
            # Show current load level:
            echo "Load Level Is: $loadLevel" >>$tmpfile
            echo "*************************************************" >>$tmpfile
            # Show number of httpd processes now running (parent and children):
            echo "Number of httpd processes now: $httpdProcesses" >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Show process list:
            echo "Processes now running:" >>$tmpfile
            ps f -ef >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Show current MySQL info:
            echo "Results from mytop:" >>$tmpfile
            /usr/bin/mytop -u $dbusr -p $dbpw -b -d $db >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
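
Every check follows the same pattern: a title, the command output, a separator, and a blank line. If you plan to add many checks of your own, a small wrapper can cut down on the repetition. This is a sketch under my own naming; `report_section` is a hypothetical helper, not something the original script defines.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Usage: report_section FILE TITLE COMMAND [ARGS...]
# Appends the title, the command's output (stdout and stderr),
# a separator line, and a blank line to FILE.
report_section() {
    local file=$1 title=$2
    shift 2
    {
        echo "$title"
        "$@" 2>&1
        echo "*************************************************"
        echo ""
    } >>"$file"
}
```

A check then becomes a single line, for example `report_section "$tmpfile" "disk space:" /bin/df -k`.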
    

Notice that the output of the `top` command is written to two temp files. The second one holds the much smaller message for the cell phone. If you don't want the urgency of cell phone alerts at three in the morning, you can take this out (and take out the second mailing routine later in the script).

            # Show current top:
            echo "top now shows:" >>$tmpfile
            echo "top now shows:" >>$topfile
            /usr/bin/top -b -n1 >>$tmpfile
            /usr/bin/top -b -n1 >>$topfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
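
Running `top` twice means the two reports can show slightly different snapshots. One way around that, sketched here with a hypothetical helper named `append_to_both`, is to run the command once and let `tee` split the output.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Reads stdin once and appends it to both files given as arguments.
append_to_both() {
    tee -a "$1" >>"$2"
}
```

In the script this would look like `/usr/bin/top -b -n1 | append_to_both "$tmpfile" "$topfile"`.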
    

More checks:

            # Show current connections:
            echo "netstat now shows:" >>$tmpfile
            /bin/netstat -p >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Check disk space
            echo "disk space:" >>$tmpfile
            /bin/df -k >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
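
On a busy web server the raw `netstat` listing can run to thousands of lines, so a per-state summary may be easier to read in an email. The following is only a sketch: `count_tcp_states` is a hypothetical helper, and it assumes the net-tools output layout in which the TCP state is the sixth column.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Reads `netstat -ant` style output on stdin and prints one "STATE COUNT"
# line per TCP state. Header lines are skipped because they do not start
# with "tcp". Assumes the state is the sixth whitespace-separated field.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (state in count) print state, count[state] }'
}
```

It could be added alongside the full listing with `/bin/netstat -ant | count_tcp_states >>$tmpfile`. The order of the output lines is not guaranteed; pipe through `sort` if that matters.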
    

Then append the temporary file contents to a more permanent log file and email the results to the appropriate parties. The second mailing is the pared down report, consisting simply of the standard output of `top`:

            # Send results to log file:
            /bin/cat $tmpfile >>$logfile
            # And email results to sysadmin (current mutt needs "--" between attachments and recipients):
            /usr/bin/mutt -s "$machine has a high load level! - $dt" -a $mysqlLog $msgLog -- $mailstop <$tmpfile
            /usr/bin/mutt -s "$machine has a high load level! - $dt" $mailstop1 <$topfile
            echo "**************************************" >>$logfile
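
If you would rather use the plain `mail` utility instead of mutt, the mailing step might look roughly like this. It is a sketch, not the original author's code: `send_report` is a hypothetical helper and `MAILER` is an assumed override variable that defaults to `mail`. Note that this form sends the report body only; the log files are not attached.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# MAILER is an assumed override hook; it defaults to the standard `mail` command.
# Usage: send_report SUBJECT RECIPIENT BODY_FILE
send_report() {
    local subject=$1 recipient=$2 body=$3
    "${MAILER:-mail}" -s "$subject" "$recipient" <"$body"
}
```

For the cell phone message this would be called as `send_report "$machine has a high load level! - $dt" "$mailstop1" "$topfile"`.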
    

And then some housekeeping and exit:

            # And then remove the temp file:
            rm $tmpfile
            rm $topfile
        fi
        #
        exit 0
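
Because the script runs as root, fixed file names under /tmp are worth a second look: they are predictable, and they are left behind if the script stops before reaching the `rm` lines. One possible alternative, sketched here with a hypothetical function name `make_temp_files`, is to create the files with `mktemp` and register the cleanup with `trap`.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Creates both temp files with unpredictable names and registers their
# removal for when the shell exits.
make_temp_files() {
    tmpfile=$(mktemp "${TMPDIR:-/tmp}/checkSystemLoad.XXXXXX") || return 1
    topfile=$(mktemp "${TMPDIR:-/tmp}/checkSystemLoad.XXXXXX") || return 1
    trap 'rm -f "$tmpfile" "$topfile"' EXIT
}
```

Calling `make_temp_files || exit 1` near the top of the script would replace the two fixed paths, and the explicit `rm` lines at the end would no longer be needed.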
    

Hopefully this helps someone out there. The fully assembled script is:

        #!/bin/bash
        #
        # Script to check system load average levels to try to determine what processes are
        # taking it overly high...
        #
        # set environment
        dt=`date +%d%b%Y-%X`
        # Obviously, change the following directories to where your log files actually are kept
        tmpfile="/tmp/checkSystemLoad.tmp"
        # Second temp file for the pared down cell phone report (this path is an assumption; adjust as needed)
        topfile="/tmp/checkSystemLoad.top.tmp"
        logfile="/tmp/checkSystemLoad.log"
        msgLog="/var/log/messages"
        mysqlLog="/var/log/mysqld.log"
        # the first mailstop is standard email for reports. Second one is for cell phone (with a pared down report)
        mailstop="sysadmin@mydomain.com"
        mailstop1="15555555555@mycellphone.com"
        machine=`hostname`
        # The following three are for mytop use - use a db user that has decent rights
        dbusr="username"
        dbpw="password"
        db="yourdatabasename"
        # The following is the load level to check on - 10 is really high, so you might want to lower it.
        levelToCheck=10
        # Set variables from system:
        loadLevel=`cat /proc/loadavg | awk '{print $1}'`
        loadLevel=$( printf "%0.f" $loadLevel )
        # if the load level is greater than you want, start the script process. Otherwise, exit 0
        if [ $loadLevel -gt $levelToCheck ]; then
            echo "" > $tmpfile
            echo "**************************************" >>$tmpfile
            echo "Date: $dt " >>$tmpfile
            echo "Check System Load & Processes " >>$tmpfile
            echo "**************************************" >>$tmpfile
            # Get more variables from system:
            httpdProcesses=`ps -def | grep httpd | grep -v grep | wc -l`
            # Show current load level:
            echo "Load Level Is: $loadLevel" >>$tmpfile
            echo "*************************************************" >>$tmpfile
            # Show number of httpd processes now running (parent and children):
            echo "Number of httpd processes now: $httpdProcesses" >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Show process list:
            echo "Processes now running:" >>$tmpfile
            ps f -ef >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Show current MySQL info:
            echo "Results from mytop:" >>$tmpfile
            /usr/bin/mytop -u $dbusr -p $dbpw -b -d $db >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Show current top:
            echo "top now shows:" >>$tmpfile
            echo "top now shows:" >>$topfile
            /usr/bin/top -b -n1 >>$tmpfile
            /usr/bin/top -b -n1 >>$topfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Show current connections:
            echo "netstat now shows:" >>$tmpfile
            /bin/netstat -p >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Check disk space
            echo "disk space:" >>$tmpfile
            /bin/df -k >>$tmpfile
            echo "*************************************************" >>$tmpfile
            echo "" >>$tmpfile
            # Send results to log file:
            /bin/cat $tmpfile >>$logfile
            # And email results to sysadmin (current mutt needs "--" between attachments and recipients):
            /usr/bin/mutt -s "$machine has a high load level! - $dt" -a $mysqlLog $msgLog -- $mailstop <$tmpfile
            /usr/bin/mutt -s "$machine has a high load level! - $dt" $mailstop1 <$topfile
            echo "**************************************" >>$logfile
            # And then remove the temp file:
            rm $tmpfile
            rm $topfile
        fi
        #
        exit 0