Sysadmin 101: Troubleshooting

I typically keep this blog strictly technical, keeping observations, opinions
and the like to a minimum. But this post, and the next few, will be about
basics and fundamentals for starting out in system administration/SRE/systems
engineering/sysops/devops (whatever you want to call yourself) roles.
Bear with me!

“My web site is slow”

I picked the type of issue for this article at random; the approach can be
applied to pretty much any sysadmin-related troubleshooting.
It’s not about showing off the cleverest one-liners to find the most
information. It’s also not an exhaustive, step-by-step “flowchart” with the
word “profit” in the last box.
It’s about the general approach, by means of a few examples.
The example scenarios are solely for illustrative purposes. They sometimes
rest on assumptions that don’t apply to all cases all of the time, and I’m
positive many readers will go “oh, but I think you will find…” at some point.
But that would be missing the point.

Having worked in support, or within a support organization, for over a decade,
one thing strikes me time and time again, and it’s what made me write this:
The instinctive reaction many techs have when facing a problem is
to start throwing potential solutions at it.

“My website is slow”

  • I’m going to try upping MaxClients/MaxRequestWorkers/worker_connections
  • I’m going to try to increase innodb_buffer_pool_size/effective_cache_size
  • I’m going to try to enable mod_gzip (true story, sadly)

“I saw this issue once, and then it was because X. So I’m going to try to fix X
again, it might work.”

This wastes a lot of time and sends you on a wild goose chase. In the dark. Wearing greased mittens.
InnoDB’s buffer pool may well be at 100% utilization, but that’s just because
there are remnants of a large one-off report someone ran a while back in there.
If there are no evictions, you’ve just wasted time.
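
Before touching innodb_buffer_pool_size, for example, a quick look at the
relevant counters tells you whether the pool is actually under pressure. A
minimal sketch, assuming a local mysql client with access; the counters are
standard MySQL status variables:

    # Logical read requests vs. reads that actually had to hit disk:
    mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"
    # If Innodb_buffer_pool_reads is tiny compared to
    # Innodb_buffer_pool_read_requests, the pool is coping fine,
    # and resizing it won't fix anything.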

Quick side-bar before we start

At this point, I should mention that while it’s equally applicable to many
roles, I’m writing this from a general support system administrator’s point of
view. In a mature, in-house organization, or when working with larger, fully managed or
“enterprise” customers, you’ll typically have everything instrumented,
measured, graphed, thresholded (not even a word) and alerted on. Then your approach
will often be rather different. We’re going in blind here.

If you don’t have that sort of thing at your disposal, read on.

Clarify and first look

Establish what the issue actually is. “Slow” can take many forms. Is it time to
first byte? That’s a whole different class of problem from sluggish JavaScript
pulling down 15 MB of static assets on each page load.
Is it slow, or just slower than it usually is? Two very different plans of
attack!

Make sure you know what the issue reported/experienced actually is before you
go off and do something. Finding the source of the problem is often difficult
enough, without also having to find the problem itself.
That is the sysadmin equivalent of bringing a knife to a gunfight.
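
curl can settle the “which kind of slow” question in seconds. A rough sketch;
the URL is a stand-in for the site in question:

    # -w prints curl's built-in timing variables, phase by phase
    curl -o /dev/null -sS -w 'lookup:  %{time_namelookup}s
    connect: %{time_connect}s
    ttfb:    %{time_starttransfer}s
    total:   %{time_total}s
    bytes:   %{size_download}
    ' https://example.com/
    # A large ttfb points at the server side; a small ttfb with a huge
    # total and byte count points at heavy assets or the network path.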

Low hanging fruit / gimmies

You are allowed to look for a few usual suspects when you first log in to a
suspect server. In fact, you should! I tend to fire off a smattering of commands
whenever I log in to a server, just to very quickly check a few things: Are we
swapping (free/vmstat)? Are the disks busy (top/iostat/iotop)? Are we dropping
packets (netstat, /proc/net/dev)? Is there an undue number of connections in an
undue state (netstat)? Is something hogging the CPUs (top)? Is someone else on
this server (w/who)? Any eye-catching messages in syslog and dmesg?
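
A minimal version of that smattering might look like this; exact tool
availability varies by distro (iostat comes with sysstat), and the syslog
path may be /var/log/messages on yours:

    free -m                       # are we swapping?
    vmstat 1 5                    # run queue, swap in/out, I/O wait
    iostat -x 1 3                 # per-disk utilization and wait times
    netstat -s | grep -i -e drop -e retrans    # drops and retransmits
    ss -s                         # connection-state summary
    w                             # who else is on this box?
    dmesg -T | tail -n 20         # anything scary from the kernel?
    tail -n 50 /var/log/syslog    # recent system messages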

There’s little point to carrying on if you have 2000 messages from your RAID
controller about how unhappy it is with its write-through cache.

This doesn’t have to take more than half a minute.
If nothing catches your eye – continue.

Reproduce

If there indeed is a problem somewhere, and there’s no low hanging fruit to be
found:

Take all steps you can to try and reproduce the problem. When you can
reproduce, you can observe. When you can observe, you can solve.
Ask the person reporting the issue for the exact steps to reproduce it, if that
isn’t already obvious or covered by the first section.

Now, for issues caused by solar flares and clients running exclusively on
OS/2, it’s not always feasible to reproduce. But your first port of call
should be to at least try!
In the very beginning, all you know is “X thinks their website is slow”. For
all you know at that point, they could be tethered to their GPRS mobile phone and
applying Windows updates. Delving any deeper than we already have at that
point is, again, a waste of time.

Attempt to reproduce!

Check the log!

It saddens me that I felt the need to include this. But I’ve seen escalations
that ended mere minutes after someone ran tail /var/log/..
Most *NIX tools these days
are pretty good at logging. Anything blatantly wrong will manifest itself quite
prominently in most application logs. Check it.
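
Even the most unsophisticated look pays off. A sketch; the paths and unit
names here are assumptions, so substitute whatever your stack actually logs to:

    tail -n 100 /var/log/apache2/error.log     # or nginx, or your app's log
    journalctl -u mysql -p warning --since "1 hour ago"   # on systemd boxes
    grep -iE 'error|timeout|refused' /var/log/syslog | tail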

Narrow down

If there are no obvious issues, but you can reproduce the reported problem,
great.
So, you know the website is slow.
Now you’ve narrowed things down to: Browser rendering/bug, application
code, DNS infrastructure, router, firewall, NICs (all eight+ involved),
ethernet cables, load balancer, database, caching layer, session storage, web
server software, application server, RAM, CPU, RAID card, disks.
Add a smattering of other potential culprits depending on the set-up. It could
be the SAN, too. And don’t forget about the hardware WAF! And.. you get my
point.

If the issue is time-to-first-byte, you’ll of course start applying known fixes
to the webserver; it’s the one responding slowly and the one you know the most
about, right? Wrong!
You go back to trying to reproduce the issue. Only this time, you try to
eliminate as many potential sources of issues as possible.

You can eliminate the vast majority of potential culprits very
easily:
Can you reproduce the issue locally from the server(s)?
Congratulations, you’ve
just saved yourself having to try your fixes for BGP routing.
If you can’t, try from another machine on the same network.
If you can – at least you can move the firewall down your list of suspects (but
do keep a suspicious eye on that switch!).
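
The same curl timing from earlier makes the comparison trivial. A sketch; the
addresses are stand-ins:

    # On the webserver itself, bypassing most of the network path:
    curl -o /dev/null -sS -w 'ttfb: %{time_starttransfer}s\n' http://localhost/
    # From another machine on the same network:
    curl -o /dev/null -sS -w 'ttfb: %{time_starttransfer}s\n' http://192.0.2.10/
    # Fast locally but slow from outside? The web stack just moved way
    # down your suspect list; look at the path in between instead.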

Are all connections slow? Just because the
server is a web server, doesn’t mean you shouldn’t try to reproduce with another
type of service. netcat is very useful in these scenarios
(but chances are your SSH connection would have been lagging
this whole time, as a clue)! If that’s also slow, you at least know you’ve
most likely got a networking problem and can disregard the entire web
stack and all its components. Start from the top again with this knowledge
(do not collect $200).
Work your way from the inside-out!
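
A quick sketch of ruling the network in or out with netcat; hosts and ports
are examples, and flags vary a little between netcat flavors:

    # Is a bare TCP connect slow too, or only HTTP?
    time nc -vz server.example.com 80
    # Raw throughput, no web stack involved. On the server:
    #   nc -l 9999 > /dev/null
    # Then on the client:
    dd if=/dev/zero bs=1M count=100 | nc server.example.com 9999
    # Slow here as well? You most likely have a network problem.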

Even if you can reproduce locally – there’s still a whole lot of “stuff”
left. Let’s remove a few more variables.
Can you reproduce it with a flat-file? If i_am_a_1kb_file.html is slow,
you know it’s not your DB, caching layer or anything beyond the OS and the webserver
itself.
Can you reproduce with an interpreted/executed
hello_world.(py|php|js|rb..) file?
If you can, you’ve narrowed things down considerably, and you can focus on
just a handful of things.
If hello_world is served instantly, you’ve still learned a lot! You know
there aren’t any blatant resource constraints, any full queues or stuck
IPC calls anywhere. So it’s something the application is doing or
something it’s communicating with.
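
A sketch of both tests; the docroot and the choice of PHP are assumptions,
so use whatever your stack actually serves:

    # 1 KB of static content: nothing but the OS and the webserver involved.
    head -c 1024 /dev/urandom | base64 | head -c 1024 \
        > /var/www/html/i_am_a_1kb_file.html
    time curl -sS -o /dev/null http://localhost/i_am_a_1kb_file.html

    # The same, but interpreted:
    echo '<?php echo "hello world";' > /var/www/html/hello_world.php
    time curl -sS -o /dev/null http://localhost/hello_world.php
    # Flat file fast, hello_world fast, real app slow? The problem is in
    # the application or in something it talks to.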

Are all pages slow? Or just the ones loading the “Live scores feed” from a
third party?

What this boils down to is: what’s the smallest amount of “stuff” that you
can involve and still reproduce the issue?

Our example is a slow web site, but this is equally applicable to almost
any issue. Mail delivery?
Can you deliver locally? To yourself? To <common provider here>? Test
with small, plaintext messages. Work your way up to the 2MB campaign
blast. STARTTLS and no STARTTLS.
Work your way from the inside-out.
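
swaks (the “Swiss Army Knife for SMTP”, packaged in most distros) turns those
steps into one-liners. A sketch with placeholder addresses:

    # Local delivery, tiny plaintext message:
    swaks --to you@localhost --server localhost
    # Same message to an external mailbox:
    swaks --to someone@example.com --server localhost
    # And again, this time negotiating STARTTLS:
    swaks --to someone@example.com --server localhost --tls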

Each of these steps takes mere seconds, far quicker than
implementing most “potential” fixes.

Observe / isolate

By now, you may already have stumbled across the problem by virtue of being unable to
reproduce when you removed a particular component.

But if you haven’t, or you still don’t know why:
Once you’ve found a way to reproduce the issue with the smallest amount of
“stuff” (technical term) between you and the issue, it’s time to start
isolating and observing.

Bear in mind that many services can be run in the foreground, and/or have
debugging enabled. For certain classes of issues, it is often hugely helpful to do this.

Here’s also where your traditional armory comes into play: strace, lsof, netstat,
GDB, iotop, valgrind, language profilers (cProfile, xdebug, ruby-prof…).
Those types of tools.

Once you’ve come this far, you rarely end up having to break out profilers or
debuggers, though.

strace is often a very good place to start.
You might notice that the application is stuck on a particular read() call
on a socket file descriptor connected to port 3306 somewhere. You’ll know
what to do.
Move on to MySQL and start from the top again. Low hanging
fruit: “Waiting_for * lock”, deadlocks, max_connections.. Move on to: All
queries? Only writes? Only certain tables? Only certain storage
engines?…

You might notice that there’s a connect() to an external API resource that
takes five seconds to complete, or even times out. You’ll know what to do.

You might notice that there are 1000 calls to fstat() and open() on the
same couple of files as part of a circular dependency somewhere. You’ll
know what to do.

It might not be any of those particular things, but I promise you, you’ll
notice something.
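
A sketch of that first strace pass; the PID is a placeholder for whatever
your web or application server worker is, and the flags are standard
strace/lsof options:

    # Attach to a worker, follow forks, show time spent in each syscall:
    strace -f -T -p 12345
    # Only interested in the network side? Narrow the trace down:
    strace -f -T -e trace=network -p 12345
    # Which file or socket is behind the descriptor read() blocks on?
    lsof -p 12345
    # A line ending in something like ...:3306 tells you it's MySQL.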

If you’re only going to take one thing from this section, let it be: learn
to use strace! Really learn it, read the whole man page. Don’t even skip
the HISTORY section. man any syscall you don’t already know. 98% of
troubleshooting sessions end with strace.
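
If you want a gentle way in, trace something trivial and read every line,
then let -c summarize. A sketch:

    # Full trace of a harmless command; read it line by line:
    strace -o /tmp/trace.out ls /
    # Per-syscall summary: counts, errors, time spent:
    strace -c -f curl -sS -o /dev/null http://localhost/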
