Phase #2: Searching the Internet Effectively
There are three good reasons to move to this phase of investigation. The first is
that your boss and/or customer needs immediate resolution of a problem. The
second reason is that your patience has run out, and the problem is going in a
direction that will take a long time to investigate. The third is that the type of
problem is such that investigating it on your own is not going to build useful
skills for the future.
Using what you’ve learned about the problem in the first phase of
investigation, you can search online for similar problems, preferably finding
the identical problem already solved. Most problems can be solved by searching
the Internet using an engine such as Google, by reading frequently asked
questions (FAQ) documents, HOW-TO documents, mailing-list archives,
USENET archives, or other forums.
Google
When searching, pick out unique keywords that describe
the problem you’re seeing. Your keywords should contain the application name
or “kernel” + unique keywords from actual output + function name where problem
occurs (if known). For example, keywords consisting of “kernel Oops sock_poll”
will yield many results in Google.
There is so much information about Linux on the Internet that search
engine giant Google has created a special search specifically for Linux. This is
a great starting place to search for the information you want:
http://www.google.com/linux.
There are also some types of problems that can affect a Linux user but
are not specific to Linux. In this case, it might be better to search using the
main Google page instead. For example, FreeBSD shares many of the same
design issues and makes use of GNU software as well, so there are times when
documentation specific to FreeBSD will help with a Linux-related problem.
USENET
USENET consists of thousands of newsgroups or
discussion groups on just about every imaginable topic. USENET has been
around since the beginning of the Internet and is one of the original services
that molded the Internet into what it is today. There are many ways of reading
USENET newsgroups. One of them is by connecting a software program called
a news reader to a USENET news server. More recently, Google provided Google
Groups for users who prefer to use a Web browser. Google Groups is a searchable
archive of most USENET newsgroups dating back to their infancy. The search
page is found at http://groups.google.com, or it can be reached from the main
Google page.
Google Groups can also be used to post a question to USENET, as can most
news readers.
Linux Web Resources
There are several Web sites that store
searchable Linux documentation. One of the more popular and comprehensive
documentation sites is The Linux Documentation Project:
http://tldp.org. The Linux Documentation Project is run by a group of volunteers who
provide many valuable types of information about Linux including FAQs and
HOW-TO guides.
There are also many excellent articles on a wide range of topics available
on other Web sites as well. Two of the more popular sites for articles are:
☞ Linux Weekly News – http://lwn.net
☞ Linux Kernel Newbies – http://kernelnewbies.org
The first of these sites has useful Linux articles that can help you get a better
understanding of the Linux environment and operating system. The second
Web site is for learning more about the Linux kernel, not necessarily for fixing
problems.
Bugzilla Databases
Bugzilla, created by the Mozilla project, has become one of the most
widely used bug tracking systems for GNU software projects such as the GNU
Compiler Collection (GCC). Bugzilla is also used by some distribution companies to track
bugs in the various releases of their GNU/Linux products.
Most Bugzilla databases are publicly available and can, at a minimum,
be searched through an extensive Web-based query interface. For example,
GCC’s Bugzilla can be found at
http://gcc.gnu.org/bugzilla, and a search can
be performed without even creating an account. This can be useful if you think
you’ve encountered a real software bug and want to search to see if anyone
else has found and reported the problem. If a match is found to your query, you
can examine and even track all the progress made on the bug.
If you’re sure you’ve encountered a real software bug, and searching does
not indicate that it is a known issue, do not hesitate to open a new bug report
in the proper Bugzilla database. Open source software is community-based,
and reporting bugs is a large part of what makes the open source movement
work. Refer to investigation Phase #4 for more information on opening a bug
report.
Mailing Lists
Mailing lists are closely related to USENET
newsgroups and in some cases are used to provide a more user-friendly front
end to the lesser known and less understood USENET interfaces. The advantage
of mailing lists is that interested parties explicitly subscribe to specific lists.
When a posting is made to a mailing list, everyone subscribed to that list will
receive an email. There are usually settings available to the subscriber to
minimize the impact on their inboxes, such as getting a daily or weekly digest
of mailing list posts.
The most popular Linux related mailing list is the Linux Kernel Mailing
List (lkml). This is where most of the Linux pioneers and gurus such as Linus
Torvalds, Alan Cox, and Andrew Morton “hang out.” A quick Google search will
tell you how you can subscribe to this list, but that would probably be a bad
idea due to the high volume of traffic. To avoid the need to subscribe and deal
with the high traffic, there are many Web sites that provide friendlier interfaces
and searchable archives of the lkml. The main one is http://lkml.org.
There are also sites that provide summaries of discussions going on in
the lkml. A popular one is the kernel page at Linux Weekly News:
http://lwn.net/Kernel.
As with USENET, you are free to post questions or messages to mailing
lists, though some require you to become a subscriber first.
Phase #3: Begin Deeper Investigation (Good Problem
Investigation Practices)
If you get to this phase, you’ve exhausted your attempts to find the information
using the Internet. With any luck you’ve picked up some good pointers from
the Internet that will help you get a jump start on a more thorough
investigation.
Because this is turning out to be a difficult problem, it is worth noting
that difficult problems need to be treated in a special way. They can take days,
weeks, or even months to resolve and tend to require much data and effort.
Collecting and tracking certain information now may seem unimportant, but
three weeks from now you may look back in despair, wishing you had done so. You
might get so deep into the investigation that you forget how you got there. Also
if you need to transfer the problem to another person (be it a subject matter
expert or a peer), they will need to know what you’ve done and where you left
off.
It usually takes many years to become an expert at diagnosing complex
problems. That expertise includes technical skills as well as best practices.
The technical skills are what take a long time to learn and require experience
and a lot of knowledge. The best practices, however, can be learned in just a
few minutes. Here are six best practices that will help when diagnosing complex
problems:
1. Collect relevant information when the problem occurs.
2. Keep a log of what you’ve done and what you think the problem might be.
3. Be detailed and avoid qualitative information.
4. Challenge assumptions until they are proven.
5. Narrow the scope of the problem.
6. Work to prove or disprove theories about the problem.
The best practices listed here are particularly important for complex problems
that take a long time to solve. The more complex a problem is, the more
important these best practices become. Each of the best practices is covered in
more detail as follows.
Best Practices for Complex Investigations
Collect the Relevant Information When the Problem Occurs
Earlier in this chapter we discussed how changes can cause certain
types of problems. We also discussed how changes can remove evidence for
why a problem occurred in the first place (for example, changes to the amount
of free memory can hide the fact that it was once low). In the former situation,
it is important to collect information because it can be compared to information
that was collected at a previous time to see if any changes caused the problem.
In the latter situation, it is important to collect information before the changes
on the system wipe out any important evidence. The longer it takes to resolve
a problem, the better the chance that something important will change during
the investigation. In either situation, data collection is very important for
complex problems.
Even reproducible problems can be affected by a changing system. A
problem that occurs one day can stop occurring the next day because of an
unknown change to the system. If you’re lucky, the problem will never occur
again, but that’s not always the case.
Consider a problem that occurred many years ago where an application trap
occurred in one xterm (a type of terminal window) but not in another.
Both xterm windows were on the same system and were identical in every way
(well, so it seemed at first) but still the problem occurred only in one. Even the
list of environment variables was the same except for the expected differences
such as PWD (present working directory). After logging out and back in, the
problem could not be reproduced. A few days later the problem came back
again, only in one xterm. After a very complex investigation, it turned out that
the PWD environment variable was the difference that caused the problem to
occur. This isn’t as simple as it sounds. The contents of the PWD environment
variable were not the cause of the problem, although the difference in size of
the PWD variable between the two xterms forced the stack (a special memory
segment) to shift slightly up or down in the address space. Sure enough,
changing PWD to another value made the problem disappear or recur depending
on the length. This small difference caused the different behavior for the
application in the two xterms. In one xterm, a memory corruption in the
application landed without issue on an inert part of the stack, causing no side-
effect. In the other xterm, the memory corruption landed on a pointer on the
stack (the long description of the problem is beyond the scope of this chapter).
The pointer was dereferenced by the application, and the trap occurred. This is
a very rare problem but is a good example of how small and seemingly unrelated
changes or differences can affect a problem.
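To see this stack effect first-hand, here is a minimal C sketch (an illustration only, not the application from the story) that prints the address of a local variable. On Linux, the environment strings sit near the top of the stack, so changing the length of an environment variable such as PWD shifts where the stack frames land. The commands below assume a Linux system with the util-linux setarch command, whose -R option disables address space randomization so that runs are comparable.

/* stackaddr.c: print the address of a variable that lives on the stack. */
#include <stdio.h>

int main(void)
{
    int local = 0; /* allocated in the current stack frame */
    printf("&local = %p\n", (void *)&local);
    return 0;
}

ion 100% cc -o stackaddr stackaddr.c
ion 101% setarch $(uname -m) -R env PAD=x ./stackaddr
ion 102% setarch $(uname -m) -R env PAD=xxxxxxxxxxxxxxxx ./stackaddr

The printed address typically differs between the last two runs by roughly the change in environment size (rounded to the stack alignment), which is exactly the kind of small shift that moved the memory corruption on and off the stack pointer in the xterm problem.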
If the problem is serious and difficult to reproduce, collect and/or write
down the information from Phase #1: “Initial Investigation Using Your Own
Skills.”
For quick reference, here is the list:
☞ The exact time the problem occurred
☞ Dynamic operating system information
☞ What you were doing when the problem occurred
☞ A problem description
☞ Anything that may have triggered the problem
☞ Any evidence that may be relevant
The more serious and complex the problem is, the more you’ll want to start
writing things down. With a complex problem, other people may need to get
involved, and the investigation may get complex enough that you’ll start to
forget some of the information and theories you’re using. The data collector
included with this book can make your life easier whenever you need to collect
information about the OS.
Use an Investigation Log
Even if you only ever have one
complex, critical problem to work on at a time, it is still important to keep
track of what you’ve done. This doesn’t mean well-written, grammatically correct
explanations of everything you’ve done, but it does mean enough detail to be
useful to you at a later date. Assuming that you’re like most people, you won’t
have the luxury of working on a single problem at a time, which makes this
even more important. When you’re investigating 10 problems at once, it
sometimes gets difficult to keep track of what has been done for each of them.
You also stand a good chance of hitting a similar problem again in the future
and may want to use some of the information from the first investigation.
Further, if you ever need to get someone else involved in the investigation,
an investigation log can prevent a great deal of unnecessary work. You don’t
want others unknowingly spending precious time redoing your hard-earned
steps and finding the same results. An investigation log can also point others
to what you have done so that they can make sure your conclusions are correct
up to a certain point in the investigation.
An investigation log is a history of what has been done so far for the
investigation of a problem. It should include theories about what the problem
could be or what avenues of investigation might help to narrow down the
problem. As much as possible, it should contain real evidence that helps lead
you to the current point of investigation. Be very careful about making
assumptions, and be very careful about qualitative proofs (proofs that contain
no concrete evidence).
The following example shows a very structured and well laid out
investigation log. With some experience, you’ll find the format that works best
for you. As you read through it, it should be obvious how useful an investigation
log is. If you had to take over this problem investigation right now, it should be
clear what has been done and where the investigator left off.
Time of occurrence: Sun Sep 5 21:23:58 EDT 2004
Problem description: Product Y failed to start when run from a cron
job.
Symptom:
ProdY: Could not create communication semaphore: 1176688244 (EEXIST)
What might have caused the problem: The error message seems to indicate
that the semaphore already existed and could not be recreated.
Theory #1: Product Y may have crashed abruptly, leaving one or more IPC
resources. On restart, the product may have tried to recreate a semaphore
that it already created from a previous run.
Needed to prove/disprove:
☞ The ownership of the semaphore resource at the time of
the error is the same as the user that ran product Y.
☞ That there was a previous crash for product Y that
would have left the IPC resources allocated.
Proof: Unfortunately, there was no information collected at the time of
the error, so we will never truly know the owner of the semaphore at the
time of the error. There is no sign of a trap, and product Y always
leaves a debug file when it traps. This is an unlikely theory, which is
fortunate given that we don’t have the information required to make progress on
it.
Theory #2: Product X may have been running at the time, and there may
have been an IPC (Inter Process Communication) key collision with
product Y.
Needed to prove/disprove:
☞ Check whether product X and product Y can use the same
IPC key.
☞ Confirm that both product X and product Y were actually
running at the time.
Proof: Started product X and then tried to start product Y. Ran “strace”
on product X and got the following semget:
ion 618% strace -o productX.strace prodX
ion 619% egrep “sem|shm” productX.strace
semget(1176688244, 1, 0) = 399278084
Ran “strace” on product Y and got the following semget:
ion 730% strace -o productY.strace prodY
ion 731% egrep “sem|shm” productY.strace
semget(1176688244, 1, IPC_CREAT|IPC_EXCL|0x1f7|0666) = -1 EEXIST (File exists)
The IPC keys are identical, and product Y tries to create the semaphore
but fails. The error message from product Y is identical to the original
error message in the problem description above.
Notes: productX.strace and productY.strace are under the data directory.
Assumption: I still don’t know whether product X was running at the
time when product Y failed to start, but given these results, it is very
likely. IPC collisions are rare, and we know that product X and product
Y cannot run at the same time the way they are currently configured.
Note: A semaphore is a special type of inter-process communication
mechanism that provides synchronization between
processes (and/or threads). The type of semaphore used here requires a
unique “key” so that multiple processes can use the same semaphore. A
semaphore can exist without any processes using it, and some applications
expect and rely on creating a semaphore before they can run properly.
The semget() shown in the strace output above is a system call (a special type
of OS function) that, as the name suggests, gets a semaphore.
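To make the EEXIST failure concrete, here is a minimal C sketch of the same semget() pattern that the strace output shows. It is an illustration only (the source for products X and Y is never seen in this investigation); the key value is simply the one from the log above.

/* semtest.c: exclusively create a semaphore set, as product Y tries to do. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
    key_t key = 1176688244; /* the IPC key from the investigation log */
    int semid;

    /* IPC_CREAT|IPC_EXCL: create the set, but fail if the key is taken */
    semid = semget(key, 1, IPC_CREAT | IPC_EXCL | 0666);
    if (semid < 0) {
        if (errno == EEXIST)
            fprintf(stderr, "semaphore key %ld already in use\n", (long)key);
        else
            fprintf(stderr, "semget failed: %s\n", strerror(errno));
        return 1;
    }
    printf("created semaphore set, semid = %d\n", semid);
    return 0;
}

Running this twice in a row reproduces the symptom: the first run creates the set, and the second fails with EEXIST until the set is removed (for example, with the ipcrm command). The ipcs -s command lists existing semaphore sets along with their owners, which is exactly the ownership evidence that Theory #1 was missing.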
Notice how detailed the proofs are. Even the commands used to capture
the original strace output are included to eliminate any human error. When
entering a proof, be sure to ask yourself, “Would someone else need any more
proof than this?” This level of detail is often required for complex problems so
that others will see the proof and agree with it.
The amount of detail in your investigation log should depend on how
critical the problem is and how close you are to solving it. If you’re completely
lost on a very critical problem, you should include more detail than if you are
almost done with the investigation. The high level of detail is very useful for
complex problems given that every piece of data could be invaluable later on
in the investigation.
If you don’t have a good problem tracking system, here is a possible
directory structure that can help keep things organized:
<problem identifier>/inv.txt
                    /data/
                    /src/
The problem identifier is for tracking purposes. Use whatever is appropriate
for you (even if it is 1, 2, 3, 4, and so on). The inv.txt is the investigation log,
containing the various theories and proofs. The data directory is for any data
files that have been collected. Having one data directory helps keep things
organized, and it also makes it easy to refer to data files from your investigation
log. The src directory is for any source code or scripts that you write to help
investigate the problem.
The problem directory is what you would show someone when referring
to the problem you are investigating. The investigation log would contain the
flow of the investigation with the detailed proofs and should be enough to get
someone up to speed quickly.
You may also want to save the problem directory for the future or, better
yet, put the investigation directories somewhere others can search
through them as well. After all, you worked hard for the information in your
investigation log; don’t be too quick to delete it. You never know when you’ll hit
a similar (or the same) problem again. The investigation log can also be used
to help educate more junior people about investigation techniques.
Be Detailed (Avoid Qualitative Information)
Be very detailed
in your investigation log or whenever you discuss the problem. If you prove
a theory using an error record from an error log file, include the error record
and the name of the error log file as proof in the investigation log. Avoid
qualitative proofs such as, “Found an error log that showed that the suspect
product was running at the time.” If you transfer a problem to another person,
that person will want to see the actual error record to ensure that your
assumption was correct. Also, if the problem lasts long enough, you may
start to second-guess yourself (which is actually a good thing) and may
appreciate that quantitative proof (a proof with real data to back it up).
Another example of a qualitative proof is a relative term or description.
Descriptions like “the file was very large” and “the CPU workload was high”
will mean different things to different people. You need to include details for
how large the file was (using the output of the ls command if possible) and
how high the CPU workload was (using uptime or top). This will remove any
uncertainty that others (or you) have about your theories and proofs for the
investigation.
Similarly, when you are asked to review an investigation, be leery of any
proof or absolute statement (for example, “I saw the amount of virtual memory
drop to dangerous levels last night”) without the required evidence (that is, a
log record, output from a specific OS command, and so on). If you don’t have
the actual evidence, you’ll never know whether a statement is true. This doesn’t
mean that you have to distrust everyone you work with to solve a problem; it is
simply a recognition that people make mistakes. A quick cut and paste of an
error log file or the output from an actual command might be all the evidence
you need to agree with a statement. Or you might find that the statement is
based on an incorrect assumption.
Challenge Assumptions
There is nothing like spending a week
diagnosing a problem based on an assumption that was incorrect. Consider an
example where a problem has been identified and a fix has been provided ...
yet the problem happens again. There are two main possibilities here. The
first is that the fix didn’t address the problem. The second is that the fix is
good, but you didn’t actually get it onto the system (for the statistically inclined
reader: yes, there is a chance that the fix is bad and it didn’t get on the system,
but the chances are very slim). For critical problems, people have a tendency to
jump to conclusions out of desperation to solve a problem quickly. If the group
you’re working with starts complaining about the bad fix, you should encourage
them to challenge both possibilities. Challenge the assumption that the fix
actually got onto the system. (Was it even built into the executable or library
that was supposed to contain the fix?)
Narrow Down the Scope of the Problem
Solution-level (that is, at the level of a
complete IT solution) problem determination is difficult enough, but to
make matters worse, each application or product in a solution usually requires
a different set of skills and knowledge. Even following the trail of evidence can
require deep skills for each application, which might mean getting a few experts
involved. This is why it is so important to try to narrow down the scope of a
solution-level problem as quickly as possible.
Today’s complex heterogeneous solutions can make simple problems very
difficult to diagnose. Computer systems and the software that runs on them
are integrated through networks and other mechanisms to work together to
provide a solution. A simple problem, even one that has a clear error message,
can become difficult given that the effect of the problem can ripple throughout
a solution, causing seemingly unrelated symptoms.