Troubleshooting is not an art; it is a science, and despite what you may have been told, it can be taught. Most engineers and technicians work from past experience alone. If they have seen a problem before and have been shown how to solve it, they can fix it. When they face a problem they have not seen before, it escalates out of control and management starts looking for someone to fire.
Using the Scientific Method when troubleshooting may not be the swiftest path to a resolution, but it is the most certain path to resolving complex problems and finding a permanent fix. The Scientific Method can be applied to any troubleshooting situation. It reduces all troubleshooting to a standard set of common steps that can be adapted to suit your needs. It is not specific to any technology, so no specific tools are covered here. The idea is to learn the principles of troubleshooting. You can learn how to use specific computer and network troubleshooting tools elsewhere on this site.
To get the most out of this tutorial, it is highly recommended that you either know or learn the OSI Model. Combine the Scientific Method with the OSI Model and your troubleshooting will be far more effective. You will succeed more often, and when the failure really is due to a vendor problem and not your own work, you will be able to say so with confidence.
The Scientific Method is an investigative process that uses logic to formulate and test theories through observation and methodical experimentation. It is the basis of how mankind derives knowledge from the natural world around him. The Scientific Method has been around since mankind first started asking "Why?" and "How?" and shows up as early as 3,000 years ago in India's and Egypt's historical records.
How does the Scientific Method apply to Information Technologies and specifically to troubleshooting? If you want to solve a technical problem, you need a logical and systematic procedure that can be used to sift through the available information, discard what is irrelevant, discover other useful facts and make logical conclusions in order to arrive at the source of the problem. In most cases, you will use the Scientific Method not once but several times to arrive at the source of the problem.
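The Scientific Method is the key to troubleshooting your computer and network problems. It burns away irrelevancies and brings you to the root cause. There are six steps in the scientific method:
1. Gather information
2. State the problem
3. Form a hypothesis
4. Test the hypothesis
5. Observe the results and draw conclusions
6. Repeat as necessary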
"It is a capital mistake to theorize before you have all the evidence. It biases the judgment." -- Sherlock Holmes, A Study in Scarlet, Ch. 3, p. 27
You must gather reliable information about what problem is occurring in order to discover what is not functioning properly. It is absolutely critical that you gather as much information as possible. The most common cause of extended problems and outages is a lack of information.
When gathering information:
The information you gather can and should come from multiple sources. There are several ways to gather information about the problem.
The early information collection phase is where knowing how a system works from the bottom up becomes useful. These days, very few systems are monolithic and totally self-contained. Everything is built in layers, with each simple layer supporting more advanced or complex functions until the result is a working system. Understanding your environment and understanding the OSI Model are absolutely critical at this phase because they tell you which indicators to check. Unless you have good input pinpointing the source of the problem, start from the lowest layer and check your indicators on the way up.
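The bottom-up approach described above can be sketched in a few lines of Python. This is only an illustration: the layer names and the placeholder checks are assumptions made up for the example, and in practice each check would be replaced by a real indicator from your own environment.

```python
from typing import Callable, List, Optional, Tuple

# A check pairs a layer description with a zero-argument function
# that returns True when that layer looks healthy.
Check = Tuple[str, Callable[[], bool]]

def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run the checks in order (lowest layer first) and return the first one that fails."""
    for layer, check in checks:
        if not check():
            return layer
    return None  # every layer passed

# Hypothetical checks, hard-coded so the example runs anywhere.
checks = [
    ("physical: link light on", lambda: True),
    ("data link: interface up", lambda: True),
    ("network: gateway reachable", lambda: False),
    ("transport: port open", lambda: True),
]

print(first_failing_layer(checks))  # prints: network: gateway reachable
```

Stopping at the first failure keeps attention on the lowest broken layer, since anything above it cannot be trusted until that layer works.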
There is a wealth of good information in the system and application logs, including error messages, crash notifications and exit codes. Collect these, because you may need to provide them to the vendor when you contact them for support.
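As a rough sketch of collecting this kind of evidence, the Python snippet below pulls the lines that mention errors, failures, crashes or exit codes out of a log. The keyword pattern and the sample log lines are assumptions invented for the example; real logs will need a pattern tuned to the application.

```python
import re
from typing import Iterable, List

# Illustrative pattern only: adjust the keywords to match your own logs.
ERROR_PATTERN = re.compile(
    r"\b(error|fail(ed|ure)?|crash(ed)?|exit code \d+)\b", re.IGNORECASE
)

def collect_evidence(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention an error, failure, crash or exit code."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]

# Made-up sample log lines, used only to show the function working.
sample_log = [
    "12:00:01 service started\n",
    "12:00:05 ERROR connection refused\n",
    "12:00:06 worker exited with exit code 3\n",
]

for line in collect_evidence(sample_log):
    print(line)
```

Because the function accepts any iterable of strings, an open file object can be passed in place of the sample list.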
When you are working with a system or application that has always worked perfectly well in the past, you have to determine what changed to cause the current problem to appear. Knowing what changed, and when, is the reason you need some sort of Change Management and Change Notification process within your organization.
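When no change record exists, file modification times can offer a partial answer to "what changed?". The following Python sketch lists files under a directory that were modified within a given number of hours. The function name and the directory used in the usage example are assumptions for illustration, and modification times are not a substitute for a real change log.

```python
import os
import time
from typing import List

def recently_changed(root: str, hours: float) -> List[str]:
    """Return the files under root whose modification time falls within the last `hours` hours."""
    cutoff = time.time() - hours * 3600
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) >= cutoff:
                    changed.append(path)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return sorted(changed)

if __name__ == "__main__":
    # Hypothetical location; point this at the configuration directory you care about.
    for path in recently_changed("/etc", hours=24):
        print(path)
```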
"In solving a problem of this sort, the grand thing is to be able to reason backward." -- Sherlock Holmes, A Study in Scarlet
Ask:
Ask the user what they are experiencing, but treat this information source with extreme caution. Most users are not technical people and so draw unwarranted conclusions about what is wrong. Users also lie on occasion, especially when they think they might be held responsible for whatever is broken or inoperative.
Sometimes the key to fixing a problem is to observe the actual failure as it occurs. It is often a good idea to turn on additional logging or diagnostic modes, run the command in verbose mode or use other diagnostic tools to gather information.
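As one example of turning on additional logging, the sketch below raises a Python logger to DEBUG level using the standard `logging` module. The helper name `enable_diagnostics` and the logger name "myapp" are assumptions for the example; other languages and tools have their own verbose or debug switches.

```python
import logging

def enable_diagnostics(logger_name: str) -> logging.Logger:
    """Switch the named logger to DEBUG and make sure it has one console handler."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach a handler only once so repeated calls do not duplicate output.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger

# "myapp" is a hypothetical logger name.
log = enable_diagnostics("myapp")
log.debug("connecting to database")  # now visible while you reproduce the failure
```

Remember to turn the extra logging back off once the failure has been captured, since debug output can be large.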
This is the process of reviewing all available information and getting a clear understanding of the perceived failure or dysfunction. Putting the problem into words clarifies exactly what the problem is. The Problem Statement should be very clear about what the problem is, and is not.
The problem statement should include as much of the following information as possible. If you are missing one or more of these, you have not gathered enough information.
Troubleshooting is the science of figuring out the why.
Examples of good Problem Statements:
After collecting information and clearly stating exactly what the problem is, formulate a hypothesis as to a possible cause. This should take the form of a question.
NOTE: One roadblock to coming up with a good problem statement is not understanding the hardware, technologies and protocols in use. Training is critical to providing superior support and swift troubleshooting.
Once you have stated the problem and formed a hypothesis, devise a method to test that hypothesis. Each test you perform should follow these simple principles:
After each test, note whether the change you made did or did not solve the problem. You must note the results of your test, gather any new information from the system, application or user, and draw a conclusion as to whether the problem is solved or whether the change you made had any effect on the problem. Once you have drawn conclusions, you can devise new tests to eliminate other possible causes.
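The test-and-record cycle can be sketched as a small loop. This is a minimal illustration, not a prescribed tool: the `TrialRecord` type, the `run_tests` function and the three example hypotheses are all made up for the sketch, and each test is assumed to make one change and report whether the problem went away.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class TrialRecord:
    hypothesis: str      # the suspected cause that was tested
    fixed_problem: bool  # whether the single change resolved the problem

def run_tests(
    hypotheses: List[Tuple[str, Callable[[], bool]]],
) -> Tuple[Optional[str], List[TrialRecord]]:
    """Test one hypothesis at a time, record every result, and stop when one fixes the problem."""
    records: List[TrialRecord] = []
    for name, test in hypotheses:
        fixed = test()
        records.append(TrialRecord(name, fixed))
        if fixed:
            return name, records
    return None, records  # nothing confirmed; gather more information

# Hypothetical hypotheses with hard-coded outcomes so the example runs anywhere.
hypotheses = [
    ("cable is unplugged", lambda: False),
    ("DNS server address is wrong", lambda: True),
    ("firewall blocks the port", lambda: False),
]

cause, records = run_tests(hypotheses)
print(cause)         # prints: DNS server address is wrong
print(len(records))  # prints: 2
```

Keeping a record of the failed tests matters as much as the successful one, because it documents which causes have already been eliminated.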
To quote the great Sherlock Holmes:
"Eliminate all other factors, and the one which remains must be the truth."
-- "The Sign of the Four" (1890), Chapter 1, p. 92
Translating this adage to modern geek-speak:
...when you've checked the basics and eliminated all configuration and operations-related possible causes, whatever remains is a vendor bug.
The entire troubleshooting process feeds into itself and must be repeated until a solution is found. This troubleshooting method relies on identifying possible causes and systematically eliminating each one until the true root cause of the problem is found. You cannot find and fix the true root cause of the problem unless you apply the scientific method to your troubleshooting.
All content Copyright © 1995-2012, InetDaemon Enterprises