Post

diskread: reading beyond end of ramdisk (& How I Recovered)

diskread: reading beyond end of ramdisk (& How I Recovered)

This post documents a system recovery after hardware maintenance on a Sun Blade 8000 went wrong. During a NEM module replacement, power cycling issues corrupted boot archives on multiple blades, causing panic errors during startup.

The Problem

After technicians replaced a NEM module and the chassis experienced unexpected power cycles, two blades failed to boot with errors:

1
diskread: reading beyond end of ramdisk

and

1
panic: cannot mount boot archive

Recovery Steps

I booted into Solaris 10 Failsafe mode and discovered filesystem corruption. The solution involved several steps.

Step 1: Run fsck

I ran fsck multiple times on /dev/dsk/c2t0d0s0 to repair inconsistencies:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
bash-3.00# fsck /dev/dsk/c2t0d0s0
** /dev/rdsk/c2t0d0s0
** Last Mounted on /
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3a - Check Connectivity
** Phase 3b - Verify Shadows/ACLs
** Phase 4 - Check Reference Counts
UNREF FILE  I=1457  OWNER=root MODE=100644
SIZE=657 MTIME=May 15 18:01 2008
RECONNECT? y

UNREF FILE  I=1458  OWNER=root MODE=100644
SIZE=675 MTIME=May 15 18:06 2008
RECONNECT? y

** Phase 5 - Check Cylinder Groups

CORRECT BAD CG SUMMARIES? y

CORRECTED SUMMARY FOR CG 0
FRAG BITMAP WRONG
FIX? y

FRAG BITMAP WRONG (CORRECTED)
CORRECTED SUMMARY FOR CG 4
CORRECTED SUMMARY FOR CG 12
CORRECTED SUMMARY FOR CG 30
CORRECTED SUMMARY FOR CG 70
CORRECT GLOBAL SUMMARY
SALVAGE? y

Log was discarded, updating cyl groups
46737 files, 1720899 used, 24099860 free (21460 frags, 3009800 blocks, 0.1% fragmentation)

Step 2: Recreate Ramdisks

1
/boot/solaris/bin/create_ramdisk -R /a
1
2
Creating ram disk for /a
updating /a/platform/i86pc/boot_archive...this may take a minute

Step 3: Rebuild the Root Mirror

I wrote a script to automate the mirror rebuild process, which cyclically detaches each submirror, clears metadata, creates new filesystems, reinitializes metadevices, and reattaches them while monitoring resync status.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
#!/bin/sh
#
# fix-mirror.sh
#
# 05-16-2008 Tim Kennedy
#
# This script will take one argument, which should be the
# metadevice of the mirror you want to rebuild.  This script
# will determine the Submirrors, and one at a time, detach,
# clear, newfs, re-init, and reattach them.
# For me this has solved problems with ailing filesystems,
# while replacement storage is procured.
#
# YMMV.  Use at your own risk.  This is not in any way to
# be considered a Sun Microsystems product, and is not in
# any way supported by Sun Microsystems.
#

PATH=/usr/bin:/usr/sbin
export PATH

MIRROR=$1

check_return () {
        RETURN=$1
        if [ $RETURN = 0 ]; then
                printf "%-6s\n" "[ok]"
        else
                printf "%-6s\n" "[err]"
                echo
                echo "please check the last step manually to see why it failed."
                echo
                exit 1
        fi
}

for m in `metastat $MIRROR | grep "Submirror of $MIRROR" | cut -d: -f1`; do
        echo "Found Submirror $m"
        DEVICE=`metastat -p $m | awk '{print $NF}'`
        printf "%-72s" "    -- metadetach $MIRROR $m"
        metadetach $MIRROR $m >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- metaclear $m"
        metaclear $m >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- newfs /dev/rdsk/$DEVICE"
        echo y | newfs /dev/rdsk/$DEVICE >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- metainit $m 1 1 /dev/dsk/$DEVICE"
        metainit $m 1 1 /dev/dsk/$DEVICE >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- metattach $MIRROR $m"
        metattach $MIRROR $m >/dev/null 2>&1
        check_return $?
        printf "%-72s" "    -- checking resync status before continuing "
        while [ 1 ]; do
                STATE=`metastat -c $MIRROR | head -1 | grep resync`
                if [ "x${STATE}" = "x" ]; then
                        printf "%-6s\n" "[ok]"
                        break;
                else
                        sleep 60
                fi
        done
done

After these steps, both blades successfully booted to runlevel 3, though minor inode warnings remained.

This post is licensed under CC BY 4.0 by the author.