Michael Doran Home Page
This page is deprecated: please read archives disclaimer.

Unix Sysadmin 101: What newbies need to know, but nobody tells them.

Troubleshooting

A server crashes or slows to a crawl, or a critical application stops working. Your phone starts ringing off the hook and you know that your boss (and perhaps the whole library) is looking to you to make things right. You may have no earthly idea what the heck is wrong, but you had better have a plan for figuring it out.

A troubleshooting plan

If a server suddenly bursts into flames, you should put out the fire and call vendor hardware support. But for less dramatic hardware or software problems, you need a more measured approach:
  1. Ask yourself, "What has changed since everything was working OK?"
    Nine times out of ten, the answer to this question is the key to solving the problem.
    • Was new software installed, or older software upgraded?
    • Was a configuration file edited?
    • Who had access to the server, and what did they do?
  2. Read the relevant documentation.
    (This should really be done before problems show up. But better late than never, eh?)
    • Is the problem really a bug, or is it a feature?
    • Can the problem be solved by editing a configuration file?
    • Are there utilities (e.g. fsck) that can fix the problem?
  3. Search symptom/resolution databases (e.g. SunSolve Online, newsgroups, archives).
    • Is this a known bug? If so, is there a work-around?
    • Is there a patch that fixes the problem?
    • Is there a newer driver for the misbehaving device?
  4. Get outside assistance.
    • Ask for help from a more knowledgeable colleague or coworker.
    • Contact vendor technical support (for covered hardware/software).
    • Post a message to a listserv or newsgroup (for open-source software).
Note that steps 1-3 stress self-reliance. This is not a coincidence. Barring an emergency, you need to take the initiative to help yourself before you ask other people for help. Even if steps 1-3 do not directly provide a solution, what you learn in the process better enables you to articulate your request for outside support or assistance.
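
Part of step 1 can be automated. The following Python sketch lists files under a directory that were modified within the last few days. It assumes a Python 3 interpreter is available on the server; the function name recently_changed, the /etc starting point, and the seven-day window are illustrative choices, not part of the original plan. A modification time only shows that a file changed, not who changed it or why.

```python
import os
import time
from pathlib import Path


def recently_changed(root, days=7, now=None):
    """Return (mtime, path) pairs for files under root that were
    modified within the last `days` days, newest first."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.lstat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime >= cutoff:
                changed.append((mtime, str(path)))
    return sorted(changed, reverse=True)


if __name__ == "__main__":
    # Hypothetical usage: which configuration files changed this week?
    for mtime, path in recently_changed("/etc", days=7):
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        print(stamp, path)
```

Run with enough privileges to read the directory tree; unreadable subdirectories are silently skipped.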

Note also that a troubleshooting plan is reactive. It goes without saying that system administrators should be proactively monitoring their systems in order to head off potential problems (e.g. full disks). Sysadmins should also familiarize themselves with normal server processes, performance, etc., so that they can recognize anomalous behavior when it occurs.
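
As one example of proactive monitoring, a full-disk check could be sketched in Python as below. The function names, the 90 percent threshold, and the list of mount points are assumptions for illustration. The percentage is simply used/total as reported by shutil.disk_usage, so it may differ slightly from what df prints.

```python
import shutil


def percent_used(total, used):
    """Percentage of a filesystem in use; 0.0 if total is zero."""
    return 100.0 * used / total if total else 0.0


def full_filesystems(mount_points, threshold=90.0, usage=shutil.disk_usage):
    """Return (mount_point, percent) for each mount point whose usage
    is at or above `threshold` percent."""
    alerts = []
    for mount in mount_points:
        try:
            stats = usage(mount)
        except OSError:
            continue  # mount point missing or unreadable; skip it
        pct = percent_used(stats.total, stats.used)
        if pct >= threshold:
            alerts.append((mount, round(pct, 1)))
    return alerts


if __name__ == "__main__":
    # Hypothetical mount points; adjust for the server at hand.
    for mount, pct in full_filesystems(["/", "/var", "/export/home"]):
        print(f"WARNING: {mount} is {pct}% full")
```

A script like this could be run from cron, with its output mailed to the administrator.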

Troubleshooting story #1

I was hired as a Systems Librarian just as we were coming up on Voyager. One of my first duties was ensuring that backups were being done. I did the first couple manually, and then wrote some shell scripts to run as cron jobs. Everything worked fine the first week, and then I got a ufsdump "write" error. I made sure I was using the correct type and size tape cartridges. I cleaned the tape heads. The next backup went OK, and I breathed a sigh of relief. That relief was short-lived, as I soon started getting additional error messages. Uh-oh, intermittent problem.

I read the manual for our tape drive; the installation instructions said to modify the /kernel/drv/st.conf file. When I checked that file, I saw that the modification had never been made. So I edited that configuration file and did a reconfigure reboot. Problem solved? No, I was still getting errors -- now almost every backup. I was taking the Solaris System Administration I course around that time, and I asked the instructor for suggestions. His recommendations (regarding ufsdump parameters) were no help.

At that point, I decided that the tape drive was bad and I called Sun Support. A technician came out and replaced the drive. Guess what? That didn't solve the problem. I called Sun Support again, and asked for software support. A helpful support rep searched SunSolve, determined that it was a software driver problem, and sent me a patch. The patch overwrote st.conf (which I had to re-edit), but other than that, it solved the problem.

First moral of the story: Use available symptom/resolution databases to diagnose problems.

Second moral of the story: Rule out bad software before you replace "bad" hardware.
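
The overwritten st.conf in this story suggests a simple habit: keep a saved copy of each edited configuration file and compare it with the live file after applying a patch. The Python sketch below shows one way to do that with the standard difflib module. The function names and the file paths in the usage section are hypothetical, and the story does not say that this was done at the time.

```python
import difflib


def config_diff(saved_text, current_text,
                saved_name="saved copy", current_name="current file"):
    """Return unified-diff lines between a saved copy of a configuration
    file and its current contents. An empty list means no difference."""
    return list(difflib.unified_diff(
        saved_text.splitlines(keepends=True),
        current_text.splitlines(keepends=True),
        fromfile=saved_name,
        tofile=current_name,
    ))


def diff_files(saved_path, current_path):
    """Read both files and return their unified diff as a list of lines."""
    with open(saved_path) as f:
        saved = f.read()
    with open(current_path) as f:
        current = f.read()
    return config_diff(saved, current, saved_path, current_path)


if __name__ == "__main__":
    # Hypothetical paths: a copy saved before patching vs. the live file.
    try:
        changes = diff_files("/var/tmp/st.conf.saved", "/kernel/drv/st.conf")
    except OSError as err:
        print("Could not compare files:", err)
    else:
        print("".join(changes) or "No differences found.")
```

Lines starting with "-" exist only in the saved copy, so they are the edits that would need to be reapplied.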

Troubleshooting story #2

About a year later, I took a networking course in which I learned commands for monitoring network performance. Of course, when I returned to work, one of the first things I did was run the commands. Whereupon I discovered that one of my servers had a malfunctioning ethernet card (well, not really a card -- ethernet is integral to the system board on a Sun server). But having learned my lessons (see above story), I did not replace any hardware. I searched SunSolve Online and discovered that the problem had previously been described and that a patch was available. I downloaded and applied the patch, after which the card worked like a charm.

Moral of the story: If you apply lessons learned from past mistakes, your sysadmin life gets easier.
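
Story #2 does not name the monitoring commands involved. As an assumption, the Python sketch below takes the text output of netstat -i, one common choice, and computes error and collision percentages per interface. It expects a header row with the column names Name, Ipkts, Ierrs, Opkts, Oerrs and Collis, which may not match every platform's output. The sample numbers are made up for illustration.

```python
def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to 2 places; 0.0 if empty."""
    return round(100.0 * part / whole, 2) if whole else 0.0


def interface_error_rates(netstat_text):
    """Map interface name -> (input error %, output error %, collision %).

    Columns are located by header name, so their order does not matter.
    Collisions are expressed as a percentage of output packets.
    """
    lines = [ln.split() for ln in netstat_text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    col = {name: i for i, name in enumerate(header)}
    rates = {}
    for row in rows:
        if len(row) != len(header):
            continue  # skip rows that do not line up with the header
        ipkts, ierrs = int(row[col["Ipkts"]]), int(row[col["Ierrs"]])
        opkts, oerrs = int(row[col["Opkts"]]), int(row[col["Oerrs"]])
        collis = int(row[col["Collis"]])
        rates[row[col["Name"]]] = (
            pct(ierrs, ipkts), pct(oerrs, opkts), pct(collis, opkts))
    return rates


if __name__ == "__main__":
    # Made-up sample in the assumed column layout, not real measurements.
    sample = """
    Name Mtu  Net/Dest Address   Ipkts  Ierrs Opkts  Oerrs Collis Queue
    lo0  8232 loopback localhost 1000   0     1000   0     0      0
    hme0 1500 server   server    200000 4000  100000 0     500    0
    """
    for name, (in_err, out_err, coll) in interface_error_rates(sample).items():
        print(f"{name}: input errors {in_err}%, "
              f"output errors {out_err}%, collisions {coll}%")
```

An interface whose error percentages stand well above those of its peers is a candidate for the database search described in the story.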