Diagnose Linux Server Load Problems with a Simple Script

If you have been an admin for any length of time, you have certainly seen a server spike in CPU use, memory utilization, or load average. Running `top` won’t always give you the answer, either. So how do you find those sneaky processes that are chewing up your system resources so you can kill ’em?

The following script might be able to help. It was written for a web server, so some parts of it look specifically for httpd processes and some parts deal with MySQL. Depending on your server deployment, simply comment out or delete those sections and add others. Treat it as a starting point.

The one prerequisite for this version of the script is mytop (available at http://jeremy.zawodny.com/mysql/mytop/), free software released under the GNU General Public License and a fantastic tool for checking how MySQL is performing. It is getting old, but still works great for our purposes here.

Additionally, I use mutt as the mailer; you may want to change the script to simply use the `mail` utility built into most Linux systems. I run it via cron every hour; adjust as you see fit. Oh, and this script needs to run as root, since it reads from some protected areas of the server.

So let’s get started, shall we?

First, set your script variables:

#!/bin/bash
#
# Script to check system load average levels to try to determine
# what processes are taking it overly high...
#
# 07Jul2010 tjones
#

# set environment
dt=$(date +%d%b%Y-%X)

# Obviously, change the following directories to where your log files actually are kept
tmpfile="/tmp/checkSystemLoad.tmp"
# topfile holds the pared down report (top output only) for the cell phone message
topfile="/tmp/checkSystemLoad.top"
logfile="/tmp/checkSystemLoad.log"
msgLog="/var/log/messages"
mysqlLog="/var/log/mysqld.log"

# the first mailstop is standard email for reports. Second one is for cell phone (with a pared down report)
mailstop="sysadmin@mydomain.com"
mailstop1="15555555555@mycellphone.com"
machine=$(hostname)

# The following three are for mytop use - use a db user that has decent rights
dbusr="username"
dbpw="password"
db="yourdatabasename"

# The following is the load level to check on - 10 is really high, so you might want to lower it.
levelToCheck=10

Next, check your load level to see if the script should continue:

# Set variables from system:
loadLevel=$(awk '{print $1}' /proc/loadavg)
# Round to a whole number, since the shell's -gt test only compares integers
loadLevel=$(printf "%.0f" "$loadLevel")

# if the load level is greater than you want, start the script process. Otherwise, exit 0
if [ "$loadLevel" -gt "$levelToCheck" ]; then
    echo "" >"$tmpfile"
    echo "**************************************" >>"$tmpfile"
    echo "Date: $dt " >>"$tmpfile"
    echo "Check System Load & Processes " >>"$tmpfile"
    echo "**************************************" >>"$tmpfile"
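Because the load average is rounded to a whole number before the test, a load of 10.4 does not trip a threshold of 10. If that matters to you, the comparison can be done on the fractional value instead. This is only a minimal sketch, assuming `awk` is available; `load_exceeds` and `current_load` are hypothetical names, not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) when load $1 is strictly greater than threshold $2.
# awk does the floating-point comparison, since [ ... -gt ... ] only handles integers.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load > limit) }'
}

# Read the 1-minute load average; fall back to 0 if /proc/loadavg is unavailable.
current_load=$(awk '{print $1}' /proc/loadavg 2>/dev/null)
current_load=${current_load:-0}

if load_exceeds "$current_load" 10; then
    echo "load $current_load is above the threshold"
fi
```

With this in place, the threshold itself could also be fractional (for example 2.5 on a small machine).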

And continue through the checks, writing the results to the temporary file. Add or delete items from here where applicable to your situation:

    # Get more variables from system:
    httpdProcesses=$(ps -def | grep httpd | grep -v grep | wc -l)

    # Show current load level:
    echo "Load Level Is: $loadLevel" >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"

    # Show number of httpd processes now running (parent and children):
    echo "Number of httpd processes now: $httpdProcesses" >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Show process list:
    echo "Processes now running:" >>"$tmpfile"
    ps f -ef >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Show current MySQL info:
    echo "Results from mytop:" >>"$tmpfile"
    /usr/bin/mytop -u "$dbusr" -p "$dbpw" -b -d "$db" >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

Notice that with the top command we write to two files: the full report and a much smaller one (`$topfile`) that is used for the cell phone message. If you don’t want the urgency of cell phone alerts at three in the morning, you can take this out (and take out the second mailing routine later in the script).



    # Show current top:
    echo "top now shows:" >>"$tmpfile"
    echo "top now shows:" >>"$topfile"
    /usr/bin/top -b -n1 >>"$tmpfile"
    /usr/bin/top -b -n1 >>"$topfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"
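Running `top` twice gives two slightly different snapshots. If you would rather have the same snapshot in both files, one run can be split with `tee`. A minimal sketch, assuming `tee` is available; `append_to_both` is a hypothetical helper name, not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper: run a command once and append its output to two files.
# Usage: append_to_both FILE1 FILE2 COMMAND [ARGS...]
append_to_both() {
    local first="$1" second="$2"
    shift 2
    # tee appends to the first file and passes the same text on to the second.
    "$@" 2>&1 | tee -a "$first" >>"$second"
}

# Example with the files from the script (uncomment inside the real script):
# append_to_both "$tmpfile" "$topfile" /usr/bin/top -b -n1
```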

More checks:



    # Show current connections:
    echo "netstat now shows:" >>"$tmpfile"
    /bin/netstat -p >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Check disk space
    echo "disk space:" >>"$tmpfile"
    /bin/df -k >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"
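Every check so far follows the same pattern: a heading, a command, a separator, and a blank line. If you add many checks of your own, that pattern could be factored into one function. The following is just a sketch of the idea; `report_section` and the `REPORT` variable are hypothetical names, not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper: append a titled section with a command's output to $REPORT.
# Usage: report_section "title" COMMAND [ARGS...]
report_section() {
    local title="$1"
    shift
    {
        echo "$title"
        "$@" 2>&1
        echo "*************************************************"
        echo ""
    } >>"$REPORT"
}

# Example: write one section to a scratch file and show it.
REPORT=$(mktemp)
report_section "disk space:" df -k
cat "$REPORT"
rm -f "$REPORT"
```

In the real script, `REPORT` would simply point at the same file as `$tmpfile`.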

Then write the temporary file contents to a more permanent log file and email the results to the appropriate parties. The second mailing is the pared down report, consisting simply of the standard output of `top`:

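    # Send results to log file:
    /bin/cat "$tmpfile" >>"$logfile"

    # And email results to sysadmin (report as the message body, logs attached).
    # The "--" separates the attachment list from the recipient.
    /usr/bin/mutt -s "$machine has a high load level! - $dt" -a "$mysqlLog" -a "$msgLog" -- "$mailstop" <"$tmpfile"

    # And send the pared down report (top output only) to the cell phone:
    /usr/bin/mutt -s "$machine has a high load level! - $dt" -- "$mailstop1" <"$topfile"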
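If mutt is not installed, the plain `mail` utility mentioned earlier can send the report body. The sketch below assumes a `mail` command that accepts `-s SUBJECT RECIPIENT` and reads the message from standard input; it leaves out the log attachments, because attachment flags differ between `mail` implementations. `send_report` is a hypothetical helper name, not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper: send the contents of a file as the message body using mail.
# Usage: send_report SUBJECT RECIPIENT BODYFILE
send_report() {
    local subject="$1" recipient="$2" body="$3"
    # Refuse to send if the body file is missing or empty.
    [ -s "$body" ] || return 1
    mail -s "$subject" "$recipient" <"$body"
}

# Example call inside the script (uncomment there; the variables are set at the top):
# send_report "$machine has a high load level! - $dt" "$mailstop" "$tmpfile"
```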

And then some housekeeping and exit:

    # And then remove the temp files:
    rm -f "$tmpfile"
    rm -f "$topfile"
fi
#
exit 0
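Since the script runs as root, fixed file names under /tmp are worth a second look: another user could create those files first. One possible alternative, shown here only as a sketch and assuming `mktemp` is available, is to create unpredictable names and let a trap clean them up. `make_report_files` is a hypothetical helper name, not part of the original script:

```shell
#!/bin/bash
# Sketch: create unpredictable temp files and remove them on any exit path.
make_report_files() {
    tmpfile=$(mktemp /tmp/checkSystemLoad.XXXXXX) || return 1
    topfile=$(mktemp /tmp/checkSystemLoad.top.XXXXXX) || return 1
    # The EXIT trap runs when the shell exits, so no explicit rm is needed later.
    trap 'rm -f "$tmpfile" "$topfile"' EXIT
}

make_report_files || exit 1
echo "report goes here" >>"$tmpfile"
```

The rest of the script can keep using `$tmpfile` and `$topfile` exactly as before.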

Hopefully this helps someone out there. The fully assembled script is:

#!/bin/bash
#
# Script to check system load average levels to try to determine what processes are
# taking it overly high...
#

# set environment
dt=$(date +%d%b%Y-%X)

# Obviously, change the following directories to where your log files actually are kept
tmpfile="/tmp/checkSystemLoad.tmp"
# topfile holds the pared down report (top output only) for the cell phone message
topfile="/tmp/checkSystemLoad.top"
logfile="/tmp/checkSystemLoad.log"
msgLog="/var/log/messages"
mysqlLog="/var/log/mysqld.log"

# the first mailstop is standard email for reports. Second one is for cell phone (with a pared down report)
mailstop="sysadmin@mydomain.com"
mailstop1="15555555555@mycellphone.com"
machine=$(hostname)

# The following three are for mytop use - use a db user that has decent rights
dbusr="username"
dbpw="password"
db="yourdatabasename"

# The following is the load level to check on - 10 is really high, so you might want to lower it.
levelToCheck=10

# Set variables from system:
loadLevel=$(awk '{print $1}' /proc/loadavg)
# Round to a whole number, since the shell's -gt test only compares integers
loadLevel=$(printf "%.0f" "$loadLevel")

# if the load level is greater than you want, start the script process. Otherwise, exit 0
if [ "$loadLevel" -gt "$levelToCheck" ]; then
    echo "" >"$tmpfile"
    echo "**************************************" >>"$tmpfile"
    echo "Date: $dt " >>"$tmpfile"
    echo "Check System Load & Processes " >>"$tmpfile"
    echo "**************************************" >>"$tmpfile"

    # Get more variables from system:
    httpdProcesses=$(ps -def | grep httpd | grep -v grep | wc -l)

    # Show current load level:
    echo "Load Level Is: $loadLevel" >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"

    # Show number of httpd processes now running (parent and children):
    echo "Number of httpd processes now: $httpdProcesses" >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Show process list:
    echo "Processes now running:" >>"$tmpfile"
    ps f -ef >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Show current MySQL info:
    echo "Results from mytop:" >>"$tmpfile"
    /usr/bin/mytop -u "$dbusr" -p "$dbpw" -b -d "$db" >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Show current top:
    echo "top now shows:" >>"$tmpfile"
    echo "top now shows:" >>"$topfile"
    /usr/bin/top -b -n1 >>"$tmpfile"
    /usr/bin/top -b -n1 >>"$topfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Show current connections:
    echo "netstat now shows:" >>"$tmpfile"
    /bin/netstat -p >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Check disk space
    echo "disk space:" >>"$tmpfile"
    /bin/df -k >>"$tmpfile"
    echo "*************************************************" >>"$tmpfile"
    echo "" >>"$tmpfile"

    # Send results to log file:
    /bin/cat "$tmpfile" >>"$logfile"

    # And email results to sysadmin (report as the message body, logs attached).
    # The "--" separates the attachment list from the recipient.
    /usr/bin/mutt -s "$machine has a high load level! - $dt" -a "$mysqlLog" -a "$msgLog" -- "$mailstop" <"$tmpfile"

    # And send the pared down report (top output only) to the cell phone:
    /usr/bin/mutt -s "$machine has a high load level! - $dt" -- "$mailstop1" <"$topfile"

    # And then remove the temp files:
    rm -f "$tmpfile"
    rm -f "$topfile"
fi
#
exit 0