Using a Standardized Troubleshooting Model
Being a good troubleshooter is a key part of being an effective Linux
system administrator. I’ve been teaching new system administrators for
nearly two decades now, and this is one of the hardest skills for
some to master. Some new admins just seem to have an intrinsic sense
for how to troubleshoot problems; others don’t. The reason for this,
in my opinion, is that troubleshooting is part art form. Just as it’s
difficult for some of us (me included) to learn how to draw, sculpt,
or paint, it’s also difficult for some of us to learn how to
troubleshoot.
However, I’ve noticed that, with a little training and a lot of
practice, most new administrators can eventually learn how to
troubleshoot effectively. There are three keys to doing this:
• Using a solid troubleshooting procedure
• Obtaining a working knowledge of troubleshooting tools
• Gaining a lot of experience troubleshooting problems
The last point is beyond the scope of this book. The only way to gain
troubleshooting experience is to spend a couple of years in the
field. However, we can work with the first two points. In the last
part of this chapter, we’ll focus specifically on troubleshooting
network issues, but the procedure we will discuss here can be
broadly applied to any system problem.
Network problems can be caused by a wide array of issues, and I can’t
even begin to cover them all here. Instead, I want to focus on using a
standardized process for troubleshooting network issues. By using a
standardized process, you can adapt to, confront, and resolve a broad
range of network problems. The model I’m going to present here is by
no means all-inclusive. You may need to add, remove, or reorganize
steps to match your particular situation. However, I hope it gives you
a good base to start from.
Many new system administrators make a key mistake when they
troubleshoot system or network problems. Instead of using a
methodical troubleshooting approach, they go off half-cocked and start
trying to implement fixes before they really know what the problem is.
I call it “shotgun troubleshooting.” The administrator tries one fix
after another, hoping that one of them will repair the problem.
This is a very dangerous practice. I’ve watched system administrators
do this and cause more problems than they solve. Sometimes they even
cause catastrophic problems. Case in point: Several years ago I was
setting up several servers in a network. One of the servers was
misconfigured and was having trouble synchronizing information with
the other systems. While I was trying to figure out the source of the
problem, my coworker (let’s call him Syd) started implementing one fix
after another in shotgun fashion trying to get the server to sync with
the other servers. In the process, he managed to catastrophically mess
up all of them! The actual issue was relatively minor and would have
required only about 20 minutes to fix. Instead, we had to spend the
rest of the day and part of the night reinstalling each server from
scratch and restoring their data.
Instead of using shotgun troubleshooting, you should use a
standardized troubleshooting model. The goal of a troubleshooting
model is to concretely identify the source of the problem before you
start fixing things. I know that sounds simple, but many system
administrators struggle with this concept. Here’s a suggested
troubleshooting model that you can use to develop your own personal
troubleshooting methodology:
Step 1. Gather information. This is a critical step. You need to
determine exactly what has happened. What are the symptoms? Were any
error messages displayed? What did they say? How extensive is the
problem? Is it isolated to a single system, or are many systems
experiencing the same problem?
Step 2. Identify what has changed. In this step, you should identify
what has changed in the system. Has new software been installed? Has
new hardware been installed? Did a user change something? Did you
change something?
Step 3. Create a hypothesis. With the information gathered in the
preceding steps, develop several hypotheses that could explain the
problem. To do this, you may need to do some research. You should
check FAQs and knowledge bases available on the Internet. You should
also consult with peers to validate your hypotheses. Using the
information you gain, narrow your results down to the one or two most
likely causes.
Step 4. Determine the appropriate fix. The next step is to use peers,
FAQs, knowledge bases, and your own experience to identify the steps
needed to fix the problem. As you do this, be sure to identify the
possible ramifications of implementing the fix and account for them.
Many times, the fix may have side effects that are as bad as or worse
than the original problem.
Step 5. Implement the fix. At this point, you’re ready to implement
the fix. Notice that in this troubleshooting model, we did a ton of
research before implementing a fix! Doing so greatly increases the
likelihood of success. After implementing the fix, be sure to verify
that the fix actually repaired the problem and that the issue doesn’t
reappear.
Step 6. Ensure user satisfaction. Skipping this step is a key mistake made by many
system administrators. I like to teach students the adage “If the user
ain’t happy, you ain’t happy.” We system admins are notoriously poor
communicators. If the problem affects users, you need to communicate
the nature of the problem to them and make sure they are aware that
it has been fixed. If applicable, you should also educate them as to
how to keep the problem from occurring in the future. You should also
communicate with your users’ supervisors and ensure they know that the
problem has been fixed.
Step 7. Document the solution. Finally, you need to document the
solution to your problem. That way, when it occurs again a year or two
down the road, you or other system administrators can quickly identify
the problem and how to fix it.
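One way to keep yourself honest about the order of these steps is to record each problem in a small structure that refuses to accept a fix before any information or hypothesis has been written down. The following Python sketch is only an illustration of that idea. The class name TroubleshootingRecord, its fields, and its methods are all hypothetical and are not part of any standard tool. Adapt or discard them to suit your own methodology.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class TroubleshootingRecord:
    """Hypothetical record of one problem, tracked through the seven steps."""

    title: str
    opened: date = field(default_factory=date.today)
    symptoms: List[str] = field(default_factory=list)      # Step 1
    changes: List[str] = field(default_factory=list)       # Step 2
    hypotheses: List[str] = field(default_factory=list)    # Step 3
    planned_fix: Optional[str] = None                      # Step 4
    side_effects: List[str] = field(default_factory=list)  # Step 4
    fix_verified: bool = False                             # Step 5
    users_notified: bool = False                           # Step 6

    def plan_fix(self, fix: str, side_effects: List[str]) -> None:
        """Step 4: refuse to plan a fix until Steps 1 and 3 have produced something."""
        if not self.symptoms:
            raise ValueError("gather information (Step 1) before planning a fix")
        if not self.hypotheses:
            raise ValueError("create a hypothesis (Step 3) before planning a fix")
        self.planned_fix = fix
        self.side_effects = list(side_effects)

    def mark_verified(self) -> None:
        """Step 5: call only after confirming the planned fix repaired the problem."""
        if self.planned_fix is None:
            raise ValueError("determine the fix (Step 4) before implementing it")
        self.fix_verified = True

    def summary(self) -> str:
        """Step 7: plain-text write-up for whoever sees the problem next."""
        lines = [
            f"Problem: {self.title} (opened {self.opened.isoformat()})",
            "Symptoms: " + "; ".join(self.symptoms),
            "Recent changes: " + ("; ".join(self.changes) or "none identified"),
            "Hypotheses: " + "; ".join(self.hypotheses),
            f"Fix: {self.planned_fix}",
            "Side effects considered: "
            + ("; ".join(self.side_effects) or "none identified"),
            f"Fix verified: {self.fix_verified}",
            f"Users notified: {self.users_notified}",
        ]
        return "\n".join(lines)
```

Calling plan_fix() on a record that has no symptoms or hypotheses raises ValueError, which is the point: the structure makes shotgun troubleshooting awkward. Once the fix has been verified and the users have been told, the output of summary() can be pasted into whatever documentation system you already use.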
If you use this methodology, you can learn to be a very effective
troubleshooter as you gain hands-on experience in the real world.