RAID: Rebuilding a Foreign disk by hand

Posted by kire on September 3, 2010

I recently replaced a bad drive with one that used to be part of another RAID array. The controller refused to rebuild automatically, on the theory that I might want to import the configuration from this disk (or that there’s data on it that I might need).

Simply inserting the drive doesn’t make the controller rebuild the array with that disk. Here’s how to manually make the drive get along with the rest of the new array:

(Note: this is a 64-bit server, so the MegaCli client I’m using is called “MegaCli64”. If you are not running x64, simply substitute the path to your own megacli binary in the commands below.)

server:~# MegaCli64 -PDList -a0
[...]
Enclosure Device ID: 32
Slot Number: 4
[...]
Firmware state: Unconfigured(bad)
[...]
Secured: Unsecured
Locked: Unlocked
Foreign State: Foreign
[...]

Based on the information obtained above, I now know that the disk drive I just replaced is [32:4] ([enclosureid:slotnumber]) and is currently being reported as ‘Unconfigured(bad)’.
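On a box with a lot of disks, picking the bad one out of the listing by eye gets old fast. Here is a rough sketch of a filter for it. The function name bad_drives is my own invention, and it assumes the field labels look exactly like the listing above:

```shell
# List drives whose firmware state is Unconfigured(bad), as [enclosure:slot].
# Reads "MegaCli64 -PDList -a0"-style text on stdin.
bad_drives() {
    awk -F': *' '
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/ && $2 ~ /Unconfigured\(bad\)/ {
            printf "[%s:%s]\n", enc, slot
        }
    '
}

# Sample input copied from the listing above (trimmed).
bad_drives <<'EOF'
Enclosure Device ID: 32
Slot Number: 4
Firmware state: Unconfigured(bad)
Foreign State: Foreign
EOF
```

On a live system you would feed it the real thing: MegaCli64 -PDList -a0 | bad_drives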

To bring this drive back online run:

server:~# MegaCli64 -PDMakeGood -PhysDrv[32:4] -a0
Adapter: 0: EnclId-32 SlotId-4 state changed to Unconfigured-Good.

The controller will now recognize the disk as being a “foreign” one. This does not mean it was made in Japan (though it likely was). It means the controller has detected some RAID configuration/data on the disk and therefore treats it as part of an array that may be imported into the current controller configuration. Because of this, it will not automatically rebuild until you force it to.

Now you can ask the controller to scan for foreign configurations and remove them:

server:~# MegaCli64 -CfgForeign -Scan -a0
There are 1 foreign configuration(s) on controller 0.

server:~# MegaCli64 -CfgForeign -Clear -a0
Foreign configuration 0 is cleared on controller 0.

The disk should now be available for rebuilding into your new RAID array. To confirm, run this:

server:~# MegaCli64 -PDList -a0
[...]
Enclosure Device ID: 32
Slot Number: 4
[...]
Firmware state: Unconfigured(good), Spun Up
Foreign State: None
[...]

Excellent. We have a good, recognized (yet still unconfigured) drive now. Now we have all we need to add the disk back into the new array, and rebuild:

Get the disk [32:4] back into array 1, as disk 4:

server:~# MegaCli64 -PdReplaceMissing -PhysDrv[32:4] -array1 -row4 -a0
Adapter: 0: Missing PD at Array 1, Row 4 is replaced

And finally start rebuilding it:

server:~# MegaCli64 -PDRbld -Start -PhysDrv[32:4] -a0
Started rebuild progress on device(Encl-32 Slot-4)
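If you do this more than once, it may help to have the whole sequence in one place. The sketch below only prints the commands from this post for a given drive, array and row; it deliberately runs nothing. The function name rebuild_plan and its argument order are my own, not part of MegaCli:

```shell
# Print (do not run) the MegaCli64 commands from this post for one drive.
# Arguments: enclosure:slot, array number, row number, adapter (default 0).
rebuild_plan() {
    drive="$1"
    array="$2"
    row="$3"
    adapter="${4:-0}"
    cat <<EOF
MegaCli64 -PDMakeGood -PhysDrv[$drive] -a$adapter
MegaCli64 -CfgForeign -Scan -a$adapter
MegaCli64 -CfgForeign -Clear -a$adapter
MegaCli64 -PdReplaceMissing -PhysDrv[$drive] -array$array -row$row -a$adapter
MegaCli64 -PDRbld -Start -PhysDrv[$drive] -a$adapter
EOF
}

# The drive from this post: enclosure 32, slot 4, going into array 1, row 4.
rebuild_plan 32:4 1 4
```

Read the output over before pasting anything: clearing the foreign configuration throws away whatever RAID metadata was on that disk.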

Now, sit back, relax, grab a smoke and wait for it to rebuild itself into your new RAID array. Not so foreign anymore, huh?

-E

process accounting

Posted by kire on August 21, 2010

An excellent program for monitoring users and applications is psacct. This program will work in the background of your system recording what all users are doing on your system as well as the resources that are being consumed. I use it daily for resource abuse tracking, statistics generation, CPU usage trending, process identification and more.

Since I administer both Ubuntu/Debian and CentOS/RHEL servers, I’ll provide both methods of installation here:

CentOS:

yum install psacct

Ubuntu:

sudo apt-get install acct

The most useful command that now exists on your box is in /usr/bin and called “sa”.

sa has the following output fields:

cpu – sum of system and user time in cpu minutes
re – real (wall-clock) time in minutes
k – cpu-time averaged core usage, in 1k units
k*sec – cpu storage integral (kilo-core seconds)
u – user cpu time in cpu minutes
s – system time in cpu minutes

Start off by allowing process accounting to collect some data from your system. This should really be left for 24-48 hours, though if you’re too excited to start parsing the results, let’s continue.

This will show you a per-user summary of all the activity recorded on this server over time: how many processes each user ran and how many CPU minutes they used. The accounting log grows as more commands run and the longer you allow it to collect data.

# sa -m

This will list every recorded command along with the user who ran it and its CPU time (system plus user, in cpu minutes); the grep narrows that down to one user.

# sa -u | grep username

This will give you a combined total for a specific user:

# sa -u | grep username | awk 'BEGIN{TOTAL=0}{TOTAL=TOTAL+$2}END{print TOTAL}'
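One catch with grep here is that “erik” also matches “erika”, or a command that happens to contain the name. A minimal sketch of a stricter version follows, matching the user column exactly. The function name user_cpu_total is made up, and the sample lines are invented values laid out the way sa -u prints them on my systems:

```shell
# Sum the CPU column (field 2) of "sa -u"-style lines for one exact user name.
user_cpu_total() {
    awk -v user="$1" '$1 == user { total += $2 } END { print total + 0 }'
}

# Illustrative input only; the numbers are not real measurements.
user_cpu_total erik <<'EOF'
erik       0.25 cpu      595k mem      0 io bash
root       1.00 cpu     1024k mem      0 io cron
erik       0.50 cpu      700k mem      0 io vim
erika      9.00 cpu      700k mem      0 io php
EOF
```

Against live data it would be: sa -u | user_cpu_total username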

This option lists every program individually, including ones that only ran once (which sa would otherwise lump together), so you can evaluate real time, memory usage and which programs have been running:

# sa -a

Hint: append the “-c” flag to also print each entry’s share of the total time as a percentage, which makes the heaviest users easy to spot.

I know that this can be confusing, and it took me and my organization some time to master the technology due to the lack of documentation available. I hope this article will help someone else’s resource management lightbulb illuminate.

-e

How to disable SSH host key checking

Posted by kire on July 30, 2010

Remote login using the SSH protocol is a common activity in my line of work. With the SSH protocol, the responsibility is on the SSH client to verify the identity of the host to which it is connecting. The host identity is established by its SSH host key. Typically, the host key is auto-created during initial SSH installation setup.

By default, the SSH client verifies the host key against a local file containing known, trustworthy machines. This provides protection against possible Man-In-The-Middle attacks. However, there are situations in which you want to bypass this verification step. This article explains how to disable host key checking using OpenSSH, a popular Free and Open-Source implementation of SSH.

When you login to a remote host for the first time, the remote host’s host key is most likely unknown to the SSH client. The default behavior is to ask the user to confirm the fingerprint of the host key.

$ ssh erik@192.168.0.100
The authenticity of host '192.168.0.100 (192.168.0.100)' can't be established.
RSA key fingerprint is 3f:1b:f4:bd:c5:aa:c1:1f:bf:4e:2e:cf:53:fa:d8:59.
Are you sure you want to continue connecting (yes/no)?

If your answer is yes, the SSH client continues login, and stores the host key locally in the file ~/.ssh/known_hosts. You only need to validate the host key the first time around: in subsequent logins, you will not be prompted to confirm it again.

Yet, from time to time, when you try to remote login to the same host from the same origin, you may be refused with the following warning message:

$ ssh erik@192.168.0.100
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
3f:1b:f4:bd:c5:aa:c1:1f:bf:4e:2e:cf:53:fa:d8:59.
Please contact your system administrator.
Add correct host key in /home/erik/.ssh/known_hosts to get rid of this message.
Offending key in /home/erik/.ssh/known_hosts:3
RSA host key for 192.168.0.100 has changed and you have requested strict checking.
Host key verification failed.

There are multiple possible reasons why the remote host key changed. A Man-in-the-Middle attack is only one possible reason. Other possible reasons include:

OpenSSH was re-installed on the remote host but, for whatever reason, the original host key was not restored.
The remote host was replaced legitimately by another machine.

If you are sure that this is harmless, you can use either of the two methods below to trick OpenSSH into letting you log in. But be warned that doing so leaves you vulnerable to man-in-the-middle attacks.

The first method is to remove the remote host from the ~/.ssh/known_hosts file. Note that the warning message already tells you the line number in the known_hosts file that corresponds to the target remote host. The offending line in the above example is line 3 (“Offending key in /home/erik/.ssh/known_hosts:3”).

You can use the following one liner to remove that one line (line 3) from the file.

$ sed -i 3d ~/.ssh/known_hosts

Note that with the above method, you will be prompted to confirm the host key fingerprint when you run ssh to login.
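If deleting lines from known_hosts by number makes you nervous, a small wrapper that keeps a backup is one option. This is only a sketch: the function name remove_host_line is made up, and it leans on GNU sed’s -i just like the one-liner above. OpenSSH can also do the removal by hostname with ssh-keygen -R 192.168.0.100.

```shell
# Delete one line (by number) from a known_hosts-style file, keeping a .bak copy.
remove_host_line() {
    file="$1"
    line="$2"
    case "$line" in
        ''|*[!0-9]*) echo "line must be a number" >&2; return 1 ;;
    esac
    cp "$file" "$file.bak" && sed -i "${line}d" "$file"
}

# Demo on a throwaway file rather than the real known_hosts.
tmp=$(mktemp)
printf 'host-a key1\nhost-b key2\nhost-c key3\n' > "$tmp"
remove_host_line "$tmp" 3
cat "$tmp"
rm -f "$tmp" "$tmp.bak"
```

For the example in this post the call would be: remove_host_line ~/.ssh/known_hosts 3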

The second method uses two OpenSSH parameters: StrictHostKeyChecking and UserKnownHostsFile.

This method tricks SSH by configuring it to use an empty known_hosts file, and NOT to ask you to confirm the remote host identity key.

$ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no erik@192.168.0.100
Warning: Permanently added '192.168.0.100' (RSA) to the list of known hosts.
erik@192.168.0.100's password:

The UserKnownHostsFile parameter specifies the database file to use for storing the user host keys (default is ~/.ssh/known_hosts).

The /dev/null file is a special system device file that discards anything and everything written to it, and when used as the input file, returns End Of File immediately.

By configuring the null device file as the host key database, SSH is fooled into thinking that the SSH client has never connected to any SSH server before, and so will never run into a mismatched host key.

The StrictHostKeyChecking parameter specifies whether SSH will automatically add new host keys to the host key database file. By setting it to no, the host key is added automatically, without user confirmation, for every first-time connection. Because of the null key database file, every connection looks like the first one to any SSH server host. Therefore, the host key is always added to the host key database with no user confirmation. Writing the key to /dev/null discards the key and reports success.

Please refer to this excellent article about host keys and key checking.

By specifying the above 2 SSH options on the command line, you can bypass host key checking for that particular SSH login. If you want to bypass host key checking on a permanent basis, you need to specify those same options in the SSH configuration file.

You can edit the global SSH configuration file (/etc/ssh/ssh_config) if you want to make the changes permanent for all users.

If you want to target a particular user, modify the user-specific SSH configuration file (~/.ssh/config). The instructions below apply to both files.

Suppose you want to bypass key checking for a particular subnet (192.168.0.0/24).

Add the following lines to the beginning of the SSH configuration file.

Host 192.168.0.*
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null

Note that the configuration file should have a line like Host * followed by one or more parameter-value pairs. Host * means that it will match any host. Essentially, the parameters following Host * are the general defaults. Because the first matched value for each SSH parameter is used, you want to add the host-specific or subnet-specific parameters to the beginning of the file.

As a final word of caution, unless you know what you are doing, it is probably best to bypass key checking on a case by case basis, rather than making blanket permanent changes to the SSH configuration files.

Change a process’s priority with renice

Posted by kire on July 21, 2010

You can change the default priority that an application runs with by starting it with the nice command, but if you want to change the priority of a process that is already running, the command to use is renice.

Renice can be used to change the priority of a single process, or of all the processes owned by a specified user. As with the nice command, the values range from -20 to +19; negative numbers raise the priority of a task while positive numbers lower it. Yes, that seems a little backwards, but the higher the nice value, the “nicer” the process is to everything else and the less CPU time it gets. Only the superuser can specify negative numbers (thus raising the priority of a process).

renice 5 -p PID

This command will change the priority of the process with that PID to 5. Note that renice takes process IDs, not process names.

renice -5 -u erik

will change the priority of all processes owned by user erik to -5.

renice -5 -u erik -p 699

will change the priority of all processes owned by erik, plus the process with PID 699, to -5.
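To watch renice act on something that is already running without touching a real workload, here is a small self-contained sketch. It starts a throwaway child shell, has that shell renice itself, and prints the niceness it ends up with. The function name renice_demo is made up, and it assumes GNU coreutils, where nice with no command prints the current value:

```shell
# Run a throwaway child shell that renices itself, then reports its niceness.
# Argument: the target nice value (positive values need no special privileges).
renice_demo() {
    sh -c '
        renice "$1" -p "$$" >/dev/null   # change the priority of this running shell
        nice                             # print the niceness now in effect
    ' renice_demo "$1"
}

renice_demo 5
```

Because only the child shell is reniced, your own login shell keeps its original priority.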

Out of memory error on horde

Posted by kire on July 20, 2010

This one’s here because I spent the better part of an hour troubleshooting this for a customer.

The error:

Fatal error: Out of memory (allocated 47448064) (tried to allocate 13472326 bytes) in
/usr/local/cpanel/3rdparty/lib/php/Net/SMTP.php on line 821

The cause:

Sending messages through Webmail with attachments larger than 10MB.

Incorrect diagnosis:

If you’ve stumbled upon this post as a result of furious internet searching for similar issues, you’ve likely found thousands of posts claiming the problem lies in the php.ini file of cPanel or Horde. While the PHP memory limit for cPanel and Horde is defined in their own php.ini, changing it does not actually alleviate the error seen here. cPanel itself will kill off applications consuming more than what is defined in Tweak Settings.

The Fix:

WHM (as root) > Tweak Settings > The maximum memory a cPanel process can use before it is killed off.

Change this to 512M, or to 0 to disable memory limits altogether.

Done.

Easily remove the last character of every line

Posted by kire on July 10, 2010

So, often I’ll come across something that I need done but don’t want to waste time doing by hand. In this particular instance, I had a list of hostnames from PTR records and needed to remove the trailing “.” (dot) from each. Normally, if these hostnames were all simple name.TLD domains, this would easily be accomplished with cut:

cut -d. -f1-2

This basically means cut by delimiter (-d), where the delimiter is defined as a period (.) and then print fields 1 and 2. Easy enough.

But what if you have hostnames and subdomains with a varying number of fields, such as something.domain.com. and something.else.domain.com.?

So, here are two easy ways to get it done.

USING VI:
This one’s for you vi fans (and once again it pains me to learn another way that vi is more capable than nano/pico):

:%s/.$//g

Run that in vi and, voilà, the last character of every line disappears, whatever it is, because an unescaped . matches any single character. To strip only a trailing dot, escape it (\.$); to strip some other specific character, put that character in place of the dot.

But wait, there are still more ways to skin the dot.

USING SED:

x=$(echo "$i" | sed 's/.$//')
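For a whole list at once, something along these lines should do it. It is a sketch rather than gospel: the function name strip_trailing_dot is made up, and unlike the commands above it escapes the dot so that only a literal trailing “.” is removed:

```shell
# Remove one trailing dot from each line on stdin; other lines pass through unchanged.
strip_trailing_dot() {
    sed 's/\.$//'
}

printf '%s\n' something.domain.com. something.else.domain.com. | strip_trailing_dot

# For a single variable, shell parameter expansion does the same job without sed.
i="something.domain.com."
echo "${i%.}"
```

With the hostnames in a file (say a hypothetical hosts.txt), that becomes: strip_trailing_dot < hosts.txt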

Either way, this leaves you with an easy way to save yourself boatloads of time (depending on how long the list you’re trying to trim is!)

Hope this helps!

-E

The difference between Megabytes and Megabits

Posted by kire on June 23, 2010

For those of you who are unaware of the difference between a Megabyte (MB) and a Megabit (Mb), here’s a quick 101:

In my line of work, I often hear people confuse the two or even think that they’re the same.

There are 8 bits that go to make up a byte, so 10Mb != 10MB.

10MB/s actually = 80Mb/s just as 800Mb/s = 100MB/s

If a particular switch or port is transferring 3.2MB per second, you’re actually moving 25.6Mbit per second.
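If you would rather not do the multiplication in your head, here is a quick sketch of the conversion both ways. The function names are my own:

```shell
# Convert megabytes per second to megabits per second (x8), and back (/8).
mbytes_to_mbits() { awk -v v="$1" 'BEGIN { printf "%g\n", v * 8 }'; }
mbits_to_mbytes() { awk -v v="$1" 'BEGIN { printf "%g\n", v / 8 }'; }

mbytes_to_mbits 3.2   # prints 25.6
mbits_to_mbytes 800   # prints 100
```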

Quite a difference, indeed.

Error: Failed opening ‘Net/SMTP.php’ for inclusion

Posted by kire on June 3, 2010

Warning: send(Net/SMTP.php) [function.send]: failed to open stream: No such file or directory in /usr/local/lib/php/Mail/smtp.php on line 206
Warning: send() [function.include]: Failed opening 'Net/SMTP.php' for inclusion (include_path='/usr/local/cpanel/base/horde/lib:.:/usr/lib/php:/usr/local/lib/php') in /usr/local/lib/php/Mail/smtp.php on line 206

Annoying, yeah?

# pear install Net_SMTP

voilà!

imapd: Error: Input/output error

Posted by kire on May 25, 2010

So your IMAP mail is failing. Your e-mail client disconnects with an error related to bad authentication or simply “connection closed by remote server”. Check your mail logs, and you find:

May 25 17:52:43 vps imapd: Failed to create cache file: maildirwatch (someone@somewhere.com)
May 25 17:52:43 vps imapd: Error: Input/output error
May 25 17:52:43 vps imapd: Check for proper operation and configuration
May 25 17:52:43 vps imapd: of the File Access Monitor daemon (famd).

Found this obscure error in the system logs for IMAP. The server does not run “famd”. Make any sense? Not really. Though verbose, the output is indicative of nothing related to famd, I/O, or the cache file.

I first restarted the courier-auth daemon, which alleviated the failed login issue (seen as authentication failed or connection closed on most IMAP clients).

Second, edit the file:

/usr/lib/courier-imap/etc/imapd

and make sure IMAP_USELOCKS and IMAP_ENHANCEDIDLE are both set to 0. After that, restart courier-imap:

/etc/init.d/courier-imap restart
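To double-check the two settings without opening an editor, a grep along these lines may help. This is a sketch that assumes the file uses plain NAME=value lines; the function name show_imapd_flags is made up:

```shell
# Print the IMAP_USELOCKS and IMAP_ENHANCEDIDLE lines from an imapd config file.
show_imapd_flags() {
    grep -E '^(IMAP_USELOCKS|IMAP_ENHANCEDIDLE)=' "$1"
}

# Demo against a throwaway file rather than the live config.
demo=$(mktemp)
printf 'IMAP_USELOCKS=1\nIMAP_ENHANCEDIDLE=0\nIMAP_CAPABILITY="IMAP4rev1"\n' > "$demo"
show_imapd_flags "$demo"
rm -f "$demo"
```

On the server from this post the call would be: show_imapd_flags /usr/lib/courier-imap/etc/imapd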

Done, and done.

when apache simply won’t start, check the semaphores!

Posted by kire on May 15, 2010

I came across this strange issue from a Tier II escalation today. A Virtuozzo based virtual server had a problem with the apache web server refusing to start:

# service httpd restart
[Sat May 15 16:41:13 2010] [warn] NameVirtualHost x.x.x.x:80 has no VirtualHosts
httpd not running, trying to start

Let’s see if it’s actually started:

# service httpd status
Looking up localhost
Making HTTP connection to localhost
Alert!: Unable to connect to remote host.

:-(

Let’s figure out what’s going on… Here’s some information on the kernel and OS:

-bash-3.00# uname -a
Linux xxxx 2.6.9-023stab046.2-enterprise #1 SMP Mon Dec 10 15:22:33 MSK 2007 i686 i686 i386 GNU/Linux

Normal troubleshooting ensued from there, and you can use this as a basis for determining what’s actually going on with your own server.

1. ALWAYS check the Apache error logs

Take a look at the error logs (usually “/usr/local/apache/logs”) and see if you can find what’s causing the problem.

In this case, the error_log gave me some valuable information and a place to start.

[Sat May 15 17:03:19 2010] [notice] suEXEC mechanism enabled (wrapper: /usr/local/apache/bin/suexec)
[Sat May 15 17:03:19 2010] [crit] (28)No space left on device: mod_rewrite: could not create rewrite_log_lock
Configuration Failed

On some server environments, you may also see a similar error that says:

[emerg] (28)No space left on device: Couldn’t create accept lock

Seems pretty obvious, yeah? Not so much….

2. Check available disk space.

-bash-3.00# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vzfs 77G 21G 57G 27% /

So we have plenty of disk space. Why can’t apache create the lockfile, then? Next step, (especially on virtual environments):

3. Check your available inodes

-bash-3.00# df -ih
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/vzfs 489K 167K 323K 34% /

**Scratches Head** So, the filesystem and available inodes are fine too. Now this gets interesting. The “No space left on device” here has nothing to do with the disk: apache didn’t shut down properly and has left a myriad of semaphore arrays behind, owned by the apache user (nobody), so there is no room left to create new ones.

A semaphore is a programming concept that is frequently used to solve multi-threading problems. Think of semaphores as bouncers at a nightclub. Only a fixed number of people are allowed in the club at once. If the club is full, no one is allowed to enter, but as soon as one person leaves, another person may enter.

To see if this is your problem, run:

ipcs -s | grep nobody

If you see a “wall” of these stragglers listed, you’ve found your problem. Removing these semaphores should allow apache to start.

To do this, simply execute this command:

ipcs -s | grep nobody | perl -e 'while (<STDIN>) { @a=split(/\s+/); print `ipcrm sem $a[1]`}'

You will see all of them being removed sequentially, and you can now go ahead and start up your apache service successfully.
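If perl is not handy, the same idea can be sketched with awk. The function below only extracts the semaphore IDs for one owner, so you can look at the list before removing anything. The name semids_for_owner is made up, and the sample input is illustrative text in the usual ipcs -s column layout (key, semid, owner, perms, nsems):

```shell
# Print the semaphore IDs owned by one user from "ipcs -s"-style text on stdin.
semids_for_owner() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Illustrative input only, not output from a real server.
semids_for_owner nobody <<'EOF'
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 98304      nobody     600        1
0x00000000 131073     nobody     600        1
0x0052e2c1 0          postgres   600        17
EOF
```

Once the list looks right, the removal itself would be: ipcs -s | semids_for_owner nobody | xargs -n 1 ipcrm -s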

PROTIP: Hitting the reset switch is NEVER the solution.