Alhamdo lilah, and alhamdo lilah alone, I was able to fix a serious problem on one of our company servers, which hosts one of the top 500 websites according to Alexa.
I've written this report so that others can make use of it; maybe it will help someone, somewhere, at some point. It was originally written to be sent to my manager, so it may sometimes sound harsh or too official. I've made only a few changes to it before publishing.
This report is divided into 3 parts:
-the 1st part contains the symptoms of the problem I found
-the 2nd part contains the steps I took, in detail, to diagnose and solve the problem
-the 3rd part contains some scientific comments regarding this issue, many of which I've only just learned
*Note: In our company there is really too much bureaucracy, especially regarding some servers, and the problem occurred on one of them. So I wasn't allowed to reboot the server (and I didn't really want to in this case) or to make any significant changes without asking for permission and waiting a very long time, maybe a month or so, for them to say "what did you ask for??"
*Notes: 1- I haven't made any irreversible changes
2- most of the changes I made were temporary and had no effect on the running services. Most of them I've already reverted, unless mentioned otherwise
3- I didn't change any configuration file of any running service, i.e. the services the server exists to run
4- I didn't reboot the server
5- I didn't install or remove any software except clamav (installed, then removed) and zsh, for testing purposes
1st section: Symptoms of the problem:
1- Login failures: you usually had to try ten times or more to get a running shell. As soon as you logged in you got "connection closed", and nothing about it appeared in the logs.
2- Most commands, including ls, w, top and ps, just exited as soon as they ran, with no output at all. You had to run a command many times until it ran successfully once.
3- Even shell scripts didn't run successfully until you had tried many times. This includes init scripts, multi-check, even the service command itself, etc.
4- Normal interactive shells ran better than login shells. This applies to bash, csh, zsh and ash. But the normal shells were not 100% normal either, only about 60%.
= I think these were enough to show how serious it was.
2nd section: Steps I've done in detail:
I cannot remember everything, as I worked continuously for 3 days in a row, but I'm writing down all that I can remember, insha'Allah.
Note: when I say "this attempt failed", I mean "the problem didn't get fixed after the attempt", not "I couldn't carry out the attempt itself".
= I checked all the logs many times while working.
1- I cannot remember why I logged in there, but that's what happened. I logged in and noticed many suspicious things: commands didn't run as they should, among other symptoms. So I downloaded the latest version of clamav, updated its virus database with the "freshclam" command, then started scanning: first the home directories, then the main system directories (/usr/, /bin/, /sbin, etc.), and then a full system scan. The only notable result was a weak php.shell script which, practically, could not cause the harm I was seeing even if an attacker managed to use it.
2- I downloaded the rkhunter and chkrootkit tools and scanned for rootkits. Both gave negative results: "no bad files found".
3- I tried to run "bash" and many other shells from the command prompt, but these attempts failed.
4- Then I ran the command "exec bash". The "exec" command replaces the current shell with the command it executes, which keeps the same process ID and inherits the environment variables and so on. I was kicked out of the session as soon as I ran it. The command was supposed to start a new normal interactive shell, not a login shell like the one already running. A normal interactive shell doesn't read its settings from "profile*" but only from "bashrc".
5- I then compared the installed packages on this server with another server of the same configuration. Both have the same system version, both are 64-bit, and the versions of the basic packages were the same. I thought that maybe a hacker had broken into the system and changed many binary files and the shell settings, without wanting to stop any services or cause serious corruption, so as to use the system for as long as possible without being noticed. So I took a backup of /bin, /sbin, /usr/bin, /usr/sbin and the shell settings files, both personal and system-wide, on both servers, then overwrote the files on this server with those from its twin. This attempt failed too.
I even compared the shell's local and environment variables with other servers one by one, but found nothing different.
6- Then I thought: OK, let's suppose there is wider corruption in the base packages. Why not forcibly reinstall them? So I downloaded the RPMs for bash, procps, coreutils and even rpm itself, of course the same versions and architecture as installed, and forcibly installed them. This attempt failed too.
7- Then I thought that maybe the applications couldn't deal properly with the hardware or with /proc for some reason. So I unmounted and re-mounted /proc and restarted the haldaemon service. This attempt failed too.
8- Then: OK, the system has been running for a very long time. Maybe the kernel has a limit on processes or threads that the system has reached. I couldn't run ps or its variants (ps -ef, ps aux) reliably to see the PIDs and the total number of processes, so I examined /proc and found that
/proc/sys/kernel/pid_max = 32768
and
/proc/sys/kernel/threads-max = 81920
So I did some googling and found that these limits can be raised much higher on 64-bit systems, but also that values that are too high can eat memory. So I changed them to the following limits with echo:
echo 100000 > /proc/sys/kernel/pid_max
echo 120000 > /proc/sys/kernel/threads-max
This attempt failed too.
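When ps and ls keep dying, the number of running processes can still be compared against pid_max by counting the numeric directories under /proc with shell builtins only. The following is a minimal sketch under that assumption; the function name count_pids and its optional directory argument are made up for illustration.

```shell
# Count process entries (all-digit directory names) under a proc-style tree,
# using only shell builtins so it does not depend on ls or ps.
# count_pids and its optional directory argument are illustrative names.
count_pids() {
    dir="${1:-/proc}"
    n=0
    for entry in "$dir"/[0-9]*; do
        name="${entry##*/}"
        case "$name" in
            *[!0-9]*) continue ;;   # skip names that are not purely numeric
        esac
        # An unmatched glob stays literal and is skipped by the case above.
        [ -d "$entry" ] && n=$((n + 1))
    done
    echo "$n"
}

# Compare against the current limit (Linux only):
#   read limit < /proc/sys/kernel/pid_max
#   echo "$(count_pids) processes, pid_max is $limit"
```

Note that pid_max limits the PID numbers, not directly the process count, so this only shows whether the system is anywhere near the ceiling.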
9- On the next day, the last day, my brand-new manager talked to some people on the CentOS IRC channel. They tried to troubleshoot a bit with us, all attempts failed, and then they declared that "you must update the kernel". I felt that the problem would surely go away if we just rebooted the system, even without a kernel update. Then my manager made a plan and told me about it: build a redundant server, so that if we restarted this one and it failed we could just switch IPs. I asked him to give me a little more time and went on.
10- Throughout the previous attempts I had been running "exec csh" (sometimes zsh too) to get a shell I could work in. So I tried changing root's login shell to csh in /etc/passwd, but the csh login shell acted exactly like the bash login shell. Both behaved badly.
11- The normal csh shell started by "exec csh" acted better, but sometimes it said "Broken pipe". I took an idea from this and started to re-investigate:
what about bash? why doesn't it say the same?
can it be a broken library?
can it be a kernel bug in pipefs?
12- What about bash? Why doesn't it say the same? It really does, but silently.
I found that all the unsuccessful commands exited with the same exit code (also called exit status), 141. You can see the exit status of any command by running "echo $?" right after it exits. So I googled and found that in bash this is the exit status for "Broken pipe": 128 plus 13, the signal number of SIGPIPE. I googled much more to find out what could cause this "Broken pipe" but found nothing. Every result was about "how to write a C program that can ignore or handle a broken pipe" or "somebody running a perl or shell script with a real problem that produced `Broken pipe'".
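The link between exit status 141 and "Broken pipe" can be checked with a little arithmetic. This is a small sketch, assuming the usual shell convention that a status above 128 means "killed by signal number status minus 128" (a program can also return such a value on its own, so treat the result as a hint). The helper name status_to_signal is made up for illustration.

```shell
# Translate an exit status above 128 into the signal that ended the command.
# Shells report "128 + signal number" for a command killed by a signal.
# status_to_signal is an illustrative helper name.
status_to_signal() {
    status="$1"
    if [ "$status" -gt 128 ]; then
        echo $((status - 128))
    else
        echo 0   # the command exited on its own
    fi
}

# Usage right after a failing command:
#   ls; sig=$(status_to_signal "$?"); [ "$sig" -ne 0 ] && kill -l "$sig"
# For status 141 this yields signal 13, which "kill -l 13" names PIPE.
```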
13- Can it be a kernel bug in pipefs? I didn't find anything useful about this, and the newest reported bug in the kernel's pipefs was very old, in the 2.2.x kernel series as far as I can remember.
14- Can it be a broken library? OK, maybe, yes. So I ran "rpm -aV", which verifies all files from all installed packages. Then I took the list of changed files it reported, queried which RPMs they belong to, downloaded all of them (about 138 packages) in the same versions from official mirrors, and forcibly installed them all, batch by batch. I noticed some improvement for a few moments, but it was fake. This attempt failed too, even though these packages included many of the main system libraries.
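Going from the verify report to the list of owning packages can be sketched roughly as follows. It assumes that each line of "rpm -aV" output ends with the file path and that paths contain no spaces; the helper name verify_paths is made up for illustration.

```shell
# Read "rpm -aV" style lines on stdin and print only the file paths.
# Assumes the path is the last field on the line and contains no spaces.
# verify_paths is an illustrative helper name.
verify_paths() {
    awk '$NF ~ /^\// { print $NF }'
}

# Map the changed files back to the packages that own them:
#   rpm -aV | verify_paths | xargs rpm -qf | sort -u
```

Many changed files are configuration files that are expected to differ, so the resulting package list is an upper bound on what might really be damaged.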
15- After a few minutes I ran rpm -aV again and found that some of the just-installed files had changed. So I started thinking: "can there be a sudoer with root privileges who is playing with me?" And yes, I found one: the "zabbix" user was a sudoer who could run ALL system commands without even being asked for a password. I compared /etc/sudoers with the files on other servers running the zabbix agent or server and found that zabbix isn't a sudoer there. So I ran "visudo", the command used to edit /etc/sudoers, and commented out the zabbix line. I hope this solved a security problem, but it didn't help with the problem at hand at all.
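A quick way to spot passwordless sudo rules like the zabbix one is to filter the sudoers file for active NOPASSWD lines. This is only a sketch: the helper name nopasswd_rules is made up, and it does not follow any include directives the file may contain.

```shell
# Print the non-comment lines of a sudoers-style file that grant
# passwordless sudo. nopasswd_rules is an illustrative helper name.
nopasswd_rules() {
    grep -v '^[[:space:]]*#' "$1" | grep 'NOPASSWD'
}

# Usage (as root):
#   nopasswd_rules /etc/sudoers
# On reasonably recent sudo versions, "sudo -l -U zabbix" lists what
# that account is allowed to run.
```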
16- So I started to read and search deeper and deeper on pipefs, SIGPIPE and "Broken pipe". Then I caught something: "pipe", from the kernel's point of view, doesn't mean only pipes but can also mean sockets and FIFO files. So, OK: can there be a maximum limit for sockets? Do the sockets that appear in "netstat -a" count toward this limit? Can there be a process using so many sockets that it affects other processes?
17- Then I ran "netstat -a" and noticed a process called nscd using a great many sockets. I thought "Yes, that's a service, why not restart it?", restarted it, and found that the problem was fixed. Yes, it was fixed. I tried again and again, and yes, it was fixed. The `Broken pipe' was gone :), alhamdo lilah.
"the nscd caches name service lookups. It can dramatically improve performance with NIS+ and may help with DNS as well.", the official description says.
18- I tried stopping the service and watching the /proc/net/sockstat file: the "sockets: used" value was more than 1100.
I started it again and watched /proc/net/sockstat: the "sockets: used" value was still nearly the same.
So I stopped it, watched "netstat -a", and found that nscd was still using a great many sockets even with the service stopped. I checked the running processes and found some nscd processes still alive. So I ran "killall -9 nscd", and then it worked: the "sockets: used" value doesn't exceed 120 when nscd is really stopped. So I've stopped it and disabled it at startup. I think it's nearly useless on that server. Anyway, that's a reversible change.
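The two checks in this step, the system-wide "sockets: used" counter and the sockets held by one process, can be sketched as small shell helpers. The function names sockets_used and count_socket_fds are made up for illustration, and the second one relies on the Linux convention that each open socket appears in /proc/<pid>/fd as a symlink to "socket:[inode]".

```shell
# Print the "sockets: used" counter from a sockstat-style file
# (defaults to /proc/net/sockstat on Linux).
sockets_used() {
    awk '$1 == "sockets:" && $2 == "used" { print $3 }' "${1:-/proc/net/sockstat}"
}

# Count the socket descriptors in an fd directory such as /proc/<pid>/fd,
# where each open socket is a symlink pointing at "socket:[inode]".
count_socket_fds() {
    n=0
    for fd in "$1"/*; do
        case "$(readlink "$fd" 2>/dev/null)" in
            socket:*) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Usage (as root):
#   sockets_used
#   for pid in $(pidof nscd); do count_socket_fds "/proc/$pid/fd"; done
```

Running the per-process count before and after stopping a service shows whether leftover processes are still holding sockets.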
19- I ran many different tests continuously for more than 2 hours to make sure the problem was fixed and, alhamdo lilah, none failed.
= It's now many days since the problem, and it hasn't occurred again.
3rd section: Some scientific comments regarding this issue:
1-Regarding maximum threads and maximum pid ..
-threads-max:
In 2.3.x, it is a tunable parameter which defaults to size-of-memory-in-the-system / kernel-stack-size / 2. Suppose you have 512MB of RAM; then, the default upper limit of available processes will be 512*1024*1024 / 8192 / 2 = 32768. Now, 32768 processes might sound like a lot, but for an enterprise-wide Linux server with a database and many connections from a LAN or the Internet, it is a very reasonable number. I have personally seen UNIX boxes with a higher number of active processes. It might make sense to adjust this parameter in your installation. In 2.3.x, you can also increase the maximum number of tasks via a sysctl at runtime. Suppose the administrator wants to increase the number of concurrent tasks to 40,000. He will have to do only this (as root):
echo 40000 > /proc/sys/kernel/threads-max
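The arithmetic in the quoted formula can be reproduced in the shell. This sketch only restates the formula as quoted for the 2.3.x series (other kernel versions may compute the default differently). The function name default_threads_max is made up, and 8192 bytes is the stack size used in the quoted example.

```shell
# Default threads-max per the quoted formula:
#   memory size / kernel stack size / 2
# default_threads_max is an illustrative name; the stack size defaults
# to the 8192 bytes used in the quoted example.
default_threads_max() {
    ram_mb="$1"
    stack_bytes="${2:-8192}"
    echo $(( ram_mb * 1024 * 1024 / stack_bytes / 2 ))
}

default_threads_max 512   # prints 32768, matching the quoted example
```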
-pid_max:
/proc/sys/kernel/pid_max
This file (new in Linux 2.5) specifies the value at which PIDs wrap around (i.e., the value in this file is one greater than the maximum PID). The default value for this file, 32768, results in the same range of PIDs as on earlier kernels. On 32-bit platforms, 32768 is the maximum value for pid_max. On 64-bit systems, pid_max can be set to any value up to 2^22 (PID_MAX_LIMIT, approximately 4 million).
2-Regarding `Broken pipe':
-What's a pipe?
In Unix-like computer operating systems, a pipeline is the original software pipeline: a set of processes chained by their standard streams, so that the output of each process (stdout) feeds directly as input (stdin) of the next one. Each connection is implemented by an anonymous pipe. Filter programs are often used in this configuration.
This can also be used by a process to communicate with its children. And here, I think, was the problem we had: the shell couldn't communicate with the commands it runs (its children), and the commands couldn't talk back to the shell, so they got SIGPIPE and terminated.
-Broken pipe:
Broken pipe is the message a process gets when it tries to write to a pipe with no readers. A process can handle this signal or ignore it, but the default is to be terminated.
"The kernel will send the SIGPIPE signal when the remote end closes or shuts down the socket and you try to send/write. The default signal handler will terminate your program."
SIGPIPE 13 Broken pipe: write to pipe with no readers
This signal is sent to processes using network connections and network sockets, and to normal processes using pipes for internal communication or I/O redirection.
3-Regarding nscd:
I've found that it had an old bug, about which they say: "Note that you can't use nscd with 2.0 kernels because of bugs in the kernel-side thread support. Unfortunately, nscd happens to hit these bugs particularly hard."
I have no idea whether this bug can be related to ours or not. Finally, alhamdo lilah, I worked on this report for more than 3 and a half hours.