Wednesday, March 23, 2011

Building DBD::mysql on Solaris 10 Sparc

Having problems building the Perl DBD::mysql module on 64-bit Solaris 10 Sparc? The Perl 5.8.4 binary that ships with Solaris 10 is a 32-bit application. You are probably running the 64-bit version of MySQL and trying to build DBD::mysql against it, which fails because a 32-bit Perl cannot link against 64-bit client libraries. What you actually need to do is download the 32-bit version of MySQL and link DBD::mysql against that. I run the 64-bit MySQL database in /opt/mysql/mysql, so I unpacked the 32-bit MySQL as /opt/mysql/mysql32. Then start a CPAN shell, run "look DBD::mysql" to drop into the unpacked source, and build the module:
/usr/perl5/5.8.4/bin/perlgcc Makefile.PL \
  --libs "-R/usr/sfw/lib -R/opt/mysql/mysql32/lib -L/usr/sfw/lib -L/opt/mysql/mysql32/lib -lmysqlclient -lz -lposix4 -lcrypt -lgen -lsocket -lnsl -lm" \
  --cflags "-I/usr/sfw/include -I/usr/include -I/opt/mysql/mysql32/include"
Then
gmake install UNINST=1
and you're done.
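Before downloading anything, it may be worth confirming that you really have a 32-bit/64-bit mismatch. Here is a minimal sketch, assuming file(1) reports the ELF class as "ELF 32-bit" or "ELF 64-bit" (it does on Solaris and Linux). The helper name elf_bits is hypothetical, and the library file name in the usage comment is an assumption about how your MySQL tarball is laid out.

```shell
# elf_bits: print 32 or 64 given one line of file(1) output.
# Hypothetical helper; assumes the text contains "ELF 32-bit" or "ELF 64-bit".
elf_bits() {
    case "$1" in
        *"ELF 32-bit"*) echo 32 ;;
        *"ELF 64-bit"*) echo 64 ;;
        *)              echo unknown ;;
    esac
}

# Compare the Perl binary with the client library you intend to link against.
# The library file name below is an assumption; adjust it for your install.
# elf_bits "$(file /usr/perl5/5.8.4/bin/perl)"
# elf_bits "$(file /opt/mysql/mysql32/lib/libmysqlclient.so)"
```

If the two numbers differ, the build will not link, no matter what you pass to Makefile.PL.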

Tuesday, December 7, 2010

logging shell commands to syslog on secure systems

I recently came across a blog post describing methods for capturing commands entered on the command line and recording them to syslog, either with a shell function or by patching the shell itself. I found the article because my boss had asked me to find a way to add CLI logging to some hosts on our network, to support audits and accountability.

Some of the environments I work on are more secure than usual. In a typical corporate environment, whether internet-connected or not, there is generally no need or requirement to use system auditing to track all user actions. Some government systems, whether classified or not, do require this, and some commercial systems in regulated industries, or at companies that service government agencies, also require this level of auditing and accountability. In some cases it can be a smart idea for non-regulated systems, too. For instance, if you're a managed services company that uses a team of operators to manage multiple customer environments, there may be some value in tracking user activity.

Most operating environments these days come with some sort of auditing facility. However, I have found that these are usually fairly unintuitive, and most people who implement auditing do so by following a How-To and end up without an SME on staff when things go wrong. Audit logs can also consume a lot of space, so many sysadmins simply delete old audit logs to reclaim disk space.

One easy, and portable, way to quickly and intuitively audit user activity involves using patched shells to send all the commands run to syslog.  I should note that there are a few weaknesses in logging all commands to syslog, such as password exposure.  Some people do put passwords in the command line of such tools as ldapsearch, or mysql, or sqlplus.  Those passwords will then be recorded in plain-text in your system logs.

And there are always ways to work around being logged, like running a shell that doesn't log to syslog, which can be done as simply as by uploading your own, non-logging, shell and running it.

Aside from the weaknesses outlined above, though, in an environment where users are not malicious, and where team members can find themselves on any of a number of systems at any particular time, logging user commands can provide very valuable context in a familiar way. And quickly, too.

Using syslogging shells is simply a tool like any other. It doesn't replace real system auditing, but it definitely has its place.
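For shells you can't patch, the function-based approach can be approximated with a prompt hook. This is only a rough sketch, not the method from the article I read: it assumes bash (for PROMPT_COMMAND and the history builtin) and the standard logger(1) command, and the function names format_entry and log_last_command are made up for illustration. It is also trivially bypassed by unsetting PROMPT_COMMAND, and it re-logs the last command if the user just presses Enter.

```shell
# format_entry USER HISTORY_LINE
# Strip the leading history number and emit "[user] command".
format_entry() {
    printf '[%s] %s\n' "$1" "$(printf '%s\n' "$2" | sed 's/^ *[0-9][0-9]* *//')"
}

# Hook for bash: runs before each prompt and logs the most recent command.
# logger -t sets the tag and -p sets the facility.level of the entry.
log_last_command() {
    logger -t "bash[$$]" -p user.info "$(format_entry "$LOGNAME" "$(history 1)")"
}
PROMPT_COMMAND=log_last_command
```

Dropped into a system-wide profile, this produces entries in roughly the "[user] command" shape, without rebuilding the shell.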

GNU Bash 4.1 has all the code to enable command logging, simply by editing the config-top.h file.
Just change:

/* Define if you want each line saved to the history list in bashhist.c:
   bash_add_history() to be sent to syslog(). */
/* #define SYSLOG_HISTORY */ 
#if defined (SYSLOG_HISTORY)
#  define SYSLOG_FACILITY LOG_USER
#  define SYSLOG_LEVEL LOG_INFO
#endif
to:
/* Define if you want each line saved to the history list in bashhist.c:
   bash_add_history() to be sent to syslog(). */
#define SYSLOG_HISTORY 
#if defined (SYSLOG_HISTORY)
#  define SYSLOG_FACILITY LOG_USER
#  define SYSLOG_LEVEL LOG_INFO
#endif

And all commands in interactive shells are logged. (Don't forget to add the other 9 official bash patches to get to code level 4.1.9.)

One thing I did notice is that on Solaris the PID was logged by syslog as part of the log entry, but this was not the case on Linux. There, the PID appeared only in the message text, because the stock code puts it in the format string itself:

  if (strlen(line) < SYSLOG_MAXLEN)
    syslog (SYSLOG_FACILITY|SYSLOG_LEVEL, "HISTORY: PID=%d UID=%d %s",
            getpid(), current_user.uid, line);
Resulting in log entries like:
Dec  7 23:13:02 linux bash: HISTORY: PID=1752 UID=1001 ls
I don't really like that format, either. I'd rather see usernames and commands, and have the pid over on the left with the 'bash:'. This was a pretty simple change in code to:
  openlog ("bash", LOG_PID, SYSLOG_FACILITY);
  if (strlen(line) < SYSLOG_MAXLEN)
    syslog (SYSLOG_LEVEL, "[%s] %s", current_user.user_name, line);
This results in log entries that look like:
Dec  7 23:26:39 linux bash[1846]: [tkennedy] ls
 
To me, this is a much more readable log file. Perhaps that's because I'm used to the log format of the BOFH-patched tcsh, which we also use. Now bash and tcsh log in identical formats. Our users have been informed that bash and tcsh are the only acceptable interactive shells on Linux, with no exceptions. On Solaris we encourage the use of bash or tcsh for interactive shells, in the hope that consistency lends itself to stability, although we use the RBAC-aware pf- shells for role accounts like 'oracle', which pushes those accounts toward ksh.
Here's my patch to bashhist.c that logs entries the way I like them:

Friday, November 19, 2010

A generic perl script to scan a CIDR subnet for listeners on a specific port.

Ever had a customer ask you where a particular service was running in their network? I have. And usually this involves an environment that doesn't have Nmap installed, or any other common port scanning tool. Fortunately, these days almost every *nix OS comes with Perl, even Solaris.

Since I work for a managed services company, and we manage a multitude of different environments, each with its own set of restrictions and requirements, I try to write the most portable code that I can, so that it has the best chance of actually working in any given environment.

This script uses a couple of standard Perl modules that are included as part of the default installation and don't require any CPAN-Fu. It takes a couple of options: a switch for verbosity, an IP address with or without a CIDR mask, and a TCP port. The CIDR mask defaults to /32, and the port defaults to 22. Here's an example of the output.

tcsh-[101]$ ./scan-port.pl 208.64.63.39/30 80
================================================================================
Request to scan 208.64.63.39/30 on Port 80 (http)
Scanning 4 IP Addresses:
--------------------------------------------------------------------------------
208.64.63.36    : rats.entic.net                   : listening on 80 (http)
208.64.63.37    : corn.entic.net                   : listening on 80 (http)
208.64.63.38    : ice.hostsonfire.co.uk            : listening on 80 (http)
208.64.63.39    : jeep.sugarat.net                 : listening on 80 (http)
--------------------------------------------------------------------------------
Found 4 hosts listenening on port 80.
================================================================================
And here's the script itself.
#!/usr/bin/env perl
#===============================================================================#
#
# scan_port.pl (c) tkennedy - 2010 November 19 Version 1.1
#
#-------------------------------------------------------------------------------#
#
# Description: Allows scanning a IP Subnet for TCP listeners on designated port.
# Syntax: $0 [-v] IPADDRESS/CIDR-NETMASK [Port]
#
#-------------------------------------------------------------------------------#
#
# Purpose:
# This script provides a means of scanning a subnet for TCP listeners, using 
# only commonly available Perl functions.  This should make the script portable
# to any system with Perl installed.  The script can be passed 3 arguments. 2 of
# the arguments are optional, and include a '-v', which increases the verbosity
# of script output, and 'P', which is a TCP port presented as an integer between
# 1 - 65535.  If a port is not supplied on the command line, the script assumes
# a default of 22, which is the common port for Secure Shell traffic (ssh). The
# mandatory argument is an IP Address, either with, or without, a CIDR netmask.
# If a CIDR netmask is not supplied on the command line, then we assume you only
# intend to scan a single IP (ie, a /32).
#
#-------------------------------------------------------------------------------#
#
# History:
#
# 20101123 1.1 tkennedy - removed Switch module for Sol8/Perl5.003 compatibility
#
# 20101119 1.0 tkennedy - initial revision
#
#===============================================================================#
#
# The modules we're using are standard perl modules, so this script should 
# work on any operating system with Perl installed.
#
use strict;
use IO::Socket;

my ($ip,$cidr,$port,$target);
my $VERBOSE = 0;

# We need a regex to match IP addresses.  This is only used to validate
# the command-line options to verify that one option is an IP.
#
my $ipr = qr/^((?:(?:2(?:5[0-5]|[0-4][0-9])\.)|(?:1[0-9][0-9]\.)|(?:(?:[1-9][0-9]?|[0-9])\.)){3}(?:(?:2(?:5[0-5]|[0-4][0-9]))|(?:1[0-9][0-9])|(?:[1-9][0-9]?|[0-9])).*$)/x;

# I wanted to keep things as free-form as possible, and so opted to just 
# parse the command line to extract our options.  This also gives us some
# leeway to ignore bad options.
#
# In our parser, we will match a '-h' for help info, a '-v' to extend 
# our output a bit, including printing lines for hosts that are scanned
# but not listening.  We'll also match a single number, which we'll 
# interpret as a TCP port number, and lastly, we'll match an IP address
# or CIDR-masked sub-net.
#
foreach my $arg (@ARGV) {
 for ($arg) {
  if ( $arg eq '-h' )  { &usage; }
  elsif ( $arg eq '-v' )  { $VERBOSE = 1; }
  elsif ( $arg =~ m/^\d+$/ )  { $port  = $arg; }
  elsif ( $arg =~ m/$ipr/ )  { $target = $arg; } 
 }  
}

# if the user forgot to put in a target subnet, goto usage();
#
&usage("Error: no target supplied!") unless ( defined($target) && $target ne "" );

# Here we'll set the default port to "22", which is SSH, and then we'll
# check to see if a port was submitted on the command line.  If it was,
# we'll override the default with the user submitted value.  We'll do 
# the same for CIDR mask, assuming that if a mask was not passed in, then
# the user wants a single host scanned and use /32 as the default.
#
($ip,$cidr) = split( /\//, $target);
$cidr  = ( $cidr ? $cidr : "32" );
$port  = ( $port ? $port : "22" );

# die if we get an invalid CIDR mask!
#
&usage("Error: Invalid CIDR mask [$cidr]") if ( $cidr > 32 );

# Get the TCP Service name for our port...
#
my $svcname = getservbyport($port,"tcp");

# Calculate the number of addresses in the supplied target, and from
# that the host mask (the inverse of the netmask).
#
# my $mask = 0xffffffff >> $cidr;  # this fails on sol8/perl-5.003
my $size = 1 << ( 32 - $cidr );
my $mask = $size - 1;

#===============================================================================#
# script body below here.
#===============================================================================#

&hr2();
print "Request to scan ${ip}/${cidr} on Port $port ($svcname)","\n";
print "Scanning $size IP Addresses:","\n";

# calculate the lowest IP in the range, by AND-ing the supplied IP 
# address and the netmask, and convert to dotted quad notation.
#
my $lowest      = unpack('N', pack('C4', split '\.', $ip)) & ~$mask;
my $count = 0;
my $scanned = 0;

# based on the $lowest IP in the subnet, let's enumerate all of the
# IP addresses in the supplied $target subnet.
#
my @ips  = map {$lowest++} 0 .. $mask;

# a simple loop to scan each of the ips we mapped.
#
foreach my $addr(@ips) {
 scan_ip($addr);
}

&hr();
print "Found $count hosts listenening on port $port.","\n";
&hr2();

#===============================================================================#
# only subroutines are below here.
#===============================================================================#

sub scan_ip {
 #
 # from the foreach loop above, we've passed in our address.
 # this address is an integer, so we'll need to convert it to
 # a dotted-quad format IP address. then try hostname lookup,
 # and then try opening a socket.
 #
 my $addr = shift;
 my $ipaddr = join( '.', unpack( "C4", pack( "N", $addr ) ) );

 # gethostbyaddr returns undef when there is no reverse DNS entry.
 my $hostname = gethostbyaddr(inet_aton($ipaddr), AF_INET);

 if (!defined($hostname) || $hostname eq "") {
  $hostname = ( $VERBOSE ? "NXDOMAIN" : " " );
 }

 my $sock = IO::Socket::INET->new(
  PeerAddr => $ipaddr,
  PeerPort => $port,
  Timeout => 1,
  Proto => "tcp",
  );

 # no listener: report it (in verbose mode) and move on to the next IP.
 unless ($sock) {
  nosocket($ipaddr, $hostname);
  return;
 }

 if($scanned < 1) { &hr(); }

 printf("%-15s : %-32s : %s\n", $ipaddr, $hostname, "listening on $port ($svcname)");

 close($sock);
 $count++;
 $scanned++;
}

sub nosocket { 
 # we reach this sub on unsuccessful socket attempts.
 #
 if($scanned < 1) { &hr(); }
 my $ipaddr = shift;
 my $hostname = shift;

 if($VERBOSE > 0) {
  printf("%-15s : %-32s : %s\n", $ipaddr, $hostname, "closed on $port ($svcname)");
 }

 $scanned++;
}

sub usage { 
 if (@_) {
  print "\n@_\n";
 }
 # this sub() just prints the standard usage information.
 #
 print <<EOT;
Usage: $0 [-h] [-v] IPADDRESS/CIDR-NETMASK [port]
 -h  this message
 -v  verbose output
 IPADDRESS/CIDR-NETMASK  format IP/CIDR: 1.1.1.1/32, /32 is default CIDR mask
 [port]  a tcp port between 1 and 65535, 22 is default 

Example: $0 -v 192.168.0.1/24 80
 will scan the 256 addresses in the 192.168.0.0 subnet on port 80

EOT
exit(1);
}

sub hr { 
 # print a row of ---s
 print "-" x 80, "\n";
}

sub hr2 { 
 # print a row of ===s
 print "=" x 80, "\n";
}

# EOF
So far this has worked successfully on Solaris 10 Sparc & x64, CentOS 5.5 x64, Solaris 8 Sparc, and Solaris 9 Sparc, and Mac OS X 10.6.
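The address arithmetic is the only subtle part of the script, so here is the same calculation as a small shell sketch: the range size is 1 << (32 - cidr), the host mask is size - 1, and the lowest address is the IP AND-ed with the inverted mask. The function names are hypothetical, and it assumes a shell whose $(( )) arithmetic is 64-bit, which is true of bash and of most current /bin/sh implementations.

```shell
# ip_to_int: convert a dotted quad to a 32-bit integer.
# Runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# int_to_ip: convert a 32-bit integer back to a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# cidr_hosts IP [CIDR]: print every address in the block, lowest first.
# CIDR defaults to 32, like the Perl script.
cidr_hosts() {
    cidr=${2:-32}
    size=$(( 1 << (32 - cidr) ))
    mask=$(( size - 1 ))
    addr=$(( $(ip_to_int "$1") & ~mask ))
    i=0
    while [ "$i" -lt "$size" ]; do
        int_to_ip $(( addr + i ))
        i=$(( i + 1 ))
    done
}
```

Running cidr_hosts 208.64.63.39 30 lists the same four addresses, .36 through .39, that appear in the sample output above.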

Friday, May 16, 2008

diskread: reading beyond end of ramdisk (& how I recovered)

We had to do a maintenance to replace a NEM module in a Sun Blade 8000 Modular System.

Two of my teammates went down to the datacenter on other business and graciously offered to swap the NEM for me. They pulled the old one out and stuck the new one in.

That's as simple as it should have been.

Should have been. I wish. Instead, the chassis started to freak out, cycling its power over and over, and somehow taking the CMM with it. In between one set of cycles, I was able to connect to the CMM via console and paste in a bunch of commands to shut down chassis power. I let it sit for a moment, then began to power up the system. First the chassis, then the individual blades. One blade came up, no problem. The next two, though, were very much less than happy, spitting out errors like:

diskread: reading beyond end of ramdisk
	start = 0x2000, size = 0x2000
failed to read superblock
diskread: reading beyond end of ramdisk
	start = 0x2000, size = 0x2000
failed to read superblock
panic: cannot mount boot archive
Press any key to reboot
The GRUB menu was coming up OK, though, so I pressed the trusty any key and booted into Solaris 10 Failsafe mode. This was no picnic either.
SunOS Release 5.10 Version Generic_120012-14 32-bit
Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Configuring devices.
Searching for installed OS instances...
NOTICE: /a: unexpected free inode 5825, run fsck(1M)
/dev/dsk/c2t0d0s0 is under md control, skipping.
To manually recover the boot archive on a root mirror,mount the first
side (the one that the system boots from) and run:

        bootadm update-archive -R 

umount: /a busy

No installed OS instance found.

Starting shell.
#
My immediate thought was "WTF? No installed OS instance found?" Closer inspection revealed that it had in fact found two possibilities, but one, c2t1d0s0, was inconsistent and needed an fsck, and the second, c2t0d0s0, was under md control and so was being skipped.

An fsck of /dev/dsk/c2t0d0s0 revealed a few inconsistencies. Here's an example. I think this was actually the 3rd of 4 fscks I ran on this dev:

bash-3.00# fsck /dev/dsk/c2t0d0s0
** /dev/rdsk/c2t0d0s0
** Last Mounted on /
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3a - Check Connectivity
** Phase 3b - Verify Shadows/ACLs
** Phase 4 - Check Reference Counts
UNREF FILE  I=1457  OWNER=root MODE=100644
SIZE=657 MTIME=May 15 18:01 2008
RECONNECT? y

UNREF FILE  I=1458  OWNER=root MODE=100644
SIZE=675 MTIME=May 15 18:06 2008
RECONNECT? y

** Phase 5 - Check Cylinder Groups

CORRECT BAD CG SUMMARIES? y

CORRECTED SUMMARY FOR CG 0
FRAG BITMAP WRONG
FIX? y

FRAG BITMAP WRONG (CORRECTED)
CORRECTED SUMMARY FOR CG 4
CORRECTED SUMMARY FOR CG 12
CORRECTED SUMMARY FOR CG 30
CORRECTED SUMMARY FOR CG 70
CORRECT GLOBAL SUMMARY
SALVAGE? y 

Log was discarded, updating cyl groups
46737 files, 1720899 used, 24099860 free (21460 frags, 3009800 blocks, 0.1% 
fragmentation) 

***** FILE SYSTEM WAS MODIFIED *****
So far, so good. Let's reboot, and see if we can come up in a multi-user state. So ... reboot ... wait ... wait ...

CRAP! Same panic as our previous boot. We're missing something. A further delve into Google reveals that I need to recreate the ramdisk used for boot. Another boot into failsafe mode lets me fsck c2t0d0s0, which is mounted on /a, and remount it -o rw. bootadm update-archive fails because of the filesystem inconsistency. So, another fsck: we're in single-user mode and nothing else is using that disk, so I just ran the fsck without remounting -o ro. Now, let's skip bootadm and move straight along to /boot/solaris/bin/create_ramdisk.

bash-3.00# /boot/solaris/bin/create_ramdisk -R /a
Creating ram disk for /a
updating /a/platform/i86pc/boot_archive...this may take a minute
That's it! That's the little piece of magic that fixed it. After that, I was able to reboot, and the server came right up into runlevel 3. Not without a few minor errors, but at least it was up.
SunOS Release 5.10 Version Generic_127112-11 64-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: generic
NOTICE: /: unexpected free inode 9193, run fsck(1M) -o f
NOTICE: /: unexpected free inode 5961, run fsck(1M) -o f
WARNING: /: unexpected allocated inode 9637, run fsck(1M) -o f
Loading smf(5) service descriptions: 1/1
/dev/md/rdsk/d60 is clean
/dev/md/rdsk/d30 is clean
/dev/md/rdsk/d20 is clean

generic console login:
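Condensed, the failsafe-mode recovery above amounts to roughly the following steps. Treat this as a sketch rather than a transcript of what I typed: the order is tidied up, I actually ran fsck several times, and the run wrapper and the DRY_RUN, ROOT_SLICE, and ALT_ROOT variables are made up for illustration. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the failsafe-mode boot archive recovery. Dry run by default:
# commands are printed, not executed, unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
ROOT_SLICE=${ROOT_SLICE:-c2t0d0s0}   # boot side of the root mirror in this post
ALT_ROOT=${ALT_ROOT:-/a}             # where failsafe mode mounts the root

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

recover_boot_archive() {
    run umount "$ALT_ROOT"                               # release the slice
    run fsck -y "/dev/rdsk/$ROOT_SLICE"                  # repeat until clean
    run mount -o rw "/dev/dsk/$ROOT_SLICE" "$ALT_ROOT"
    run /boot/solaris/bin/create_ramdisk -R "$ALT_ROOT"  # rebuild boot_archive
    run umount "$ALT_ROOT"
}

# recover_boot_archive
```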
At this point it was pretty simple to complete the fix. Not wanting to reboot into failsafe mode and fsck a bunch more to recover from the unexpected free and allocated inodes, I wrote a script that, by turns, detaches each half of the root mirror, clears the detached metadevice, runs newfs on the raw device, re-creates the metadevice, and attaches it to the mirror once again. It then waits for the resync to complete and repeats the same steps on the other half of the mirror.
#!/bin/sh
#
# fix-mirror.sh
#
# 05-16-2008 Tim Kennedy 
#
# This script will take one argument, which should be the 
# metadevice of the mirror you want to rebuild.  This script
# will determine the Submirrors, and one at a time, detach,
# clear, newfs, re-init, and reattach them.
# For me this has solved problems with ailing filesystems,
# while replacement storage is procured.
#
# YMMV.  Use at your own risk.  This is not in any way to
# be considered a Sun Microsystems product, and is not in
# any way supported by Sun Microsystems.
#
 
PATH=/usr/bin:/usr/sbin
export PATH
 
MIRROR=$1
 
check_return () {
        RETURN=$1
        if [ $RETURN = 0 ]; then
                printf "%-6s\n" "[ok]"
        else
                printf "%-6s\n" "[err]"
                echo 
                echo "please check the last step manually to see why it failed."
                echo
                exit 1
        fi
}
 
for m in `metastat $MIRROR | grep "Submirror of $MIRROR" | cut -d: -f1`; do
        echo "Found Submirror $m"
        DEVICE=`metastat -p $m | awk '{print $NF}'`
        printf "%-72s" "    -- metadetach $MIRROR $m"
        metadetach $MIRROR $m >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- metaclear $m"
        metaclear $m >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- newfs /dev/rdsk/$DEVICE"
        echo y | newfs /dev/rdsk/$DEVICE >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- metainit $m 1 1 /dev/dsk/$DEVICE"
        metainit $m 1 1 /dev/dsk/$DEVICE >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- metattach $MIRROR $m"
        metattach $MIRROR $m >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- checking resync status before continuing "
        while [ 1 ]; do
                STATE=`metastat -c $MIRROR | head -1 | grep resync`
                if [ "x${STATE}" = "x" ]; then
                        printf "%-6s\n" "[ok]"
                        break;
                else    
                        sleep 60
                fi
        done
done
Now these blades are happy once again. We'll see how long that lasts or if they continue to have problems of any sort. My hope is for the former.
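One thing fix-mirror.sh does not do is give up: if a resync never finishes, the while loop polls forever. A bounded version of that wait could look something like the sketch below. The helper names wait_while and resyncing are hypothetical, and d0 in the usage comment is just an example mirror name.

```shell
# wait_while TRIES INTERVAL COMMAND...
# Re-run COMMAND every INTERVAL seconds for as long as it succeeds.
# Returns 0 once COMMAND fails (the condition has cleared),
# or 1 if TRIES checks pass and it is still succeeding.
wait_while() {
    tries=$1; interval=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" || return 0
        tries=$(( tries - 1 ))
        sleep "$interval"
    done
    return 1
}

# Same test the script uses: is "resync" on the first line of metastat -c?
resyncing() {
    metastat -c "$1" | head -1 | grep resync >/dev/null
}

# Example: check once a minute, give up after two hours.
# wait_while 120 60 resyncing d0 || echo "d0 is still resyncing, giving up"
```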

Have a good weekend.

Tuesday, July 31, 2007

OUCHIES! I broke my big toe this morning!

I broke my big toe. I went to the hospital and had them X-RAY it. It's broke!

I was carrying my son (15 months old) down the steps, and I slipped. My only thought was "Don't let Jason get hurt."

So I grabbed him and wrapped my arms around him as my left foot missed a step and my right foot slipped off the step it was on. It hit the step below toes first, and the toes folded underneath the foot just as they became the primary weight bearers for all 250 lbs of me.

Jason was not hurt. I think he was scared that daddy was screeching like his 19-month-old cousin Jade when they're fighting over a toy (actually he's the screecher, not her), but he was fine.

Here's a pic of the X-RAY:

After I hurt myself, I took about 5 minutes to gather my wits, then I took Jason to daycare, and drove myself to the hospital, which is quite pleasant at 8:45 AM.

A few X-RAYS, and a silly post-op shoe later, and here I am on a diet of Advil and ice-packs. Hopefully the bones won't need to be pinned in place, but I won't find out till the end of the week, when I have a follow-up with the orthopedic specialist.

Wednesday, June 27, 2007

Using Solaris 10 Update 3, Sun Cluster 3.2, Zones, & ZFS in a Multi-Node Cluster of Sun Fire T-2000s

It all started with a conference call with one of our customers. We wanted a way to set up some highly available systems that could be used for various beta or QA purposes, or production services, or anywhere in between as needed. We also wanted a way to maximize the resources. We had 4 servers available to us, all Sun Fire T-2000s. If we used them as straight servers, they'd be great at anything they do, right? 8 cores, 4 threads per core, 32GB of RAM. Nice. Capable of running dozens of zones without skipping a beat. Perhaps even hundreds of zones.

Zones make perfect development boxes, right? You can blow them away and re-install them in a matter of minutes, or even seconds on ZFS. Zones also do pretty well as production environments. We're currently using a large number of zones in production, to supply a variety of services.

Zones on ZFS make particularly good dev boxes because you can take frequent snapshots and roll back as desired.

** Zones with their zoneroot on ZFS do encounter bug #6356600, which relates to how the live upgrade scratch zone used for installing packages into local zones can't access ZFS filesystems to upgrade zones with a ZFS zoneroot.

Sun Cluster 3.2 introduced support for ZFS as a failover filesystem, and for failover zones as well. We decided to make use of both of these features.

We built a 4-node cluster out of the 4 T-2000s, and began exporting individual disks from our SAN. We put 3 disks in each ZFS pool as a raidz filesystem, and installed the zone roots at a ratio of 1 zone per zpool. (We're still doing some testing with our SAN, comparing the performance of ZFS on individual disks against ZFS on a RAID5 LUN exported by the SAN, but so far the way we're doing it is working nicely.)

So we built the cluster, and got it all configured and running. We then installed the first zone onto the ZFS pool. Then I copied the relevant portions of the zone's configuration (in /etc/zones/*) to the other nodes in the cluster.

We then created a resource group in Sun Cluster 3.2, and added the zone into the resource group. We also added the ZFS pool into the resource group as an HA-Storage resource, and created a quick set of control scripts to start and stop the zone. The zone itself takes care of bringing up its IP addresses, and starting the various applications installed within.
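The control scripts don't need to be much more than a wrapper around zoneadm. The following is an illustrative sketch, not our actual scripts: the function name zone_ctl and the zone name qa1 are hypothetical, it defaults to a dry run, and how the script gets registered with the cluster is left out entirely.

```shell
# zone_ctl ACTION ZONE: boot or halt a zone on the node that currently
# owns the resource group. Dry run by default (prints the command only).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

zone_ctl() {
    case "$1" in
        start) run zoneadm -z "$2" boot ;;
        stop)  run zoneadm -z "$2" halt ;;
        *)     echo "usage: zone_ctl start|stop ZONE" >&2; return 2 ;;
    esac
}

# zone_ctl start qa1
```

The zpool has to be imported on the node before the zone boots, which is why the storage resource and the zone live in the same resource group.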

End result: Highly available servers, on a failover basis, that take less than 30 seconds to fail over from one host to another.

So far it's working really well. We're already getting more requests to build more of these multi-node clusters, with zone/zpool combos as the resource group. It's been a great solution for us.

Monday, November 6, 2006

Don't overlook the simple answers!

Today, I spent a good part of the day troubleshooting an Oracle 10g database whose db_recovery_file_dest kept filling up. Now, I'm not a DBA by trade, just a technical generalist with a penchant for Googling. I increased the size of the db_recovery_file_dest, and 4 hours later, it was full again. I could not for the life of me figure out why the archiving and log rotation RMAN scripts weren't working. I ran them manually, and voila! Problem fixed again, for a limited time.

That's when it occurred to me to look in /var/cron/log. Sure enough, I found the answer to all my problems. Well, not ALL of my problems, but enough of the ones I was dealing with today that I rated today a success.

The oracle user's password had expired.

That was it. The root cause of two database outages due to the recovery log destination filling up, and the database refusing connections, and hours of troubleshooting.

An expired password.

This brings me to a lesson I know well, but often forget.

Never overlook the simple answers!

Sadly, I often forget that, and try to solve a complex problem that's not complex.
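One way to catch this particular simple answer before cron does is to check how many days a service account's password has left. This is only a sketch, assuming the usual /etc/shadow layout in which the third field is the day of the last change and the fifth is the maximum age in days. The function name shadow_days_left is made up, and the sample entry in the comment is not from a real system.

```shell
# shadow_days_left LINE TODAY
# LINE is one /etc/shadow entry; TODAY is the current day number
# (days since 1970-01-01). Prints the days left until the password
# expires, or "never" when no maximum age is set.
shadow_days_left() {
    echo "$1" | {
        IFS=: read -r name pw lastchg min max rest
        if [ -z "$lastchg" ] || [ -z "$max" ]; then
            echo never
        else
            echo $(( $lastchg + $max - $2 ))
        fi
    }
}

# Example (reading /etc/shadow needs root):
# today=$(perl -e 'print int(time / 86400)')
# shadow_days_left "$(grep '^oracle:' /etc/shadow)" "$today"
# A made-up entry like 'oracle:x:13000:0:90:7:::' checked on day 13080 has 10 days left.
```

A negative number means the password has already expired, and with it every cron job that account owns.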