Thread: Unplanned outage

  1. #61

    WHOOPS!

    Significant outage tonight due to filesystem corruption. We lost two or three of the latest ozmps posts, apologies. I had to repair the operating system and then restore a day-old database.

    What happened?

    Well, I investigated a memory issue Saturday evening. Shut down, tested memory configurations, made the intended change and brought the server back online. No problems apparent then, except... after the maintenance the array controller (server storage, like your laptop hard disk) reported a battery/capacitor failure; its cable had detached. The fault occurred while correcting the disconnected battery/capacitor.

    What's the array battery/capacitor do?

    It's there to protect the hard disk storage in the event of a hard crash like an uncontrolled power outage.
    The array controller spreads the storage across a whole bunch of hard disks (unlike your laptop) and applies "parity" so the storage can survive hard disks dying.
    In our case up to two hard disks can die and be replaced without losing any access, shutting down or damaging data.
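A rough sketch of the parity idea, for the curious. This is the simplest version (one XOR parity block, which survives one dead disk). Surviving two dead disks like ours needs a second, differently computed parity block on top of this, which is not shown here. The disk contents and names are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte chunk of a stripe.
data_disks = [b"ozmp", b"s fo", b"rum!"]

# The parity disk stores the XOR of all the data chunks.
parity_disk = xor_blocks(data_disks)

# Disk 1 dies: XOR the surviving chunks with parity to rebuild what it held.
survivors = [data_disks[0], data_disks[2], parity_disk]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_disks[1])
```

The rebuild works because XOR-ing a value in twice cancels it out, so whatever is left over after combining the survivors with parity is the missing chunk.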
    The array controller has several gigabytes of memory on it. When the server saves data to disk, it's sent to the memory, not directly to the disks.
    So if a power outage wiped those few GB of memory before the data was written to the disks, that data would be lost.
    The battery saves the memory contents during a power outage and the controller writes the data to disk when the system powers back on.
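Here is a toy model of that write-back behaviour, as a sketch only. The class and method names are hypothetical, and a real controller does this in firmware with far more bookkeeping. It just shows why the battery matters.

```python
class WriteBackController:
    """Toy array controller with a write-back cache (illustrative only)."""

    def __init__(self, battery_ok=True):
        self.battery_ok = battery_ok
        self.cache = {}  # block number -> data still waiting to reach disk
        self.disks = {}  # block number -> data safely on disk

    def write(self, block, data):
        # The server is told "saved" as soon as the data lands in cache.
        self.cache[block] = data

    def flush(self):
        # Move everything pending from cache onto the disks.
        self.disks.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        # Without a working battery the cache contents are simply gone.
        if not self.battery_ok:
            self.cache.clear()

    def power_on(self):
        # Whatever the battery preserved gets written out on startup.
        self.flush()


protected = WriteBackController(battery_ok=True)
protected.write(7, b"new post")
protected.power_loss()
protected.power_on()

unprotected = WriteBackController(battery_ok=False)
unprotected.write(7, b"new post")
unprotected.power_loss()
unprotected.power_on()

print(protected.disks, unprotected.disks)
```

With the battery working, block 7 reaches disk after the power comes back. Without it, the write the server believed was saved never arrives.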

    Yeah, so...how did that break shit?

    Well, everything was fine on shutdown. I investigated the battery/capacitor fault and found the cable was unplugged. Easy fix, yeah? But the cable is impossible to plug in while the controller is installed, so I removed the controller, attached the cable, and reinstalled it. This really should be just fine, but... when the battery was re-attached to the memory and the card reinstalled, it appears to have scrambled the cache memory (which should have been blank after a clean shutdown), and the controller then behaved as though there had been a power outage. Which means that when the server was restarted, it wrote the scrambled data to disk, which corrupted the array filesystems.
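My best guess at the failure, as a sketch. It assumes the controller replays whatever is in cache at boot whenever it believes writes were pending. The function name, the block layout and the bit-flipping used to stand in for "scrambled" are all made up for illustration, not taken from the controller.

```python
def replay_cache_on_boot(disk, cache, cache_marked_dirty):
    """Write cached blocks to disk if the controller thinks they are pending."""
    if cache_marked_dirty:
        for block, data in cache.items():
            disk[block] = data
    return disk

# State after the clean shutdown: disks are good, nothing should be pending.
disk = {0: b"boot", 1: b"fsys", 2: b"data"}
good_copy = dict(disk)

# Hypothetical: reattaching the battery leaves garbage in cache for blocks
# 0 and 1 (inverted bits stand in for the scrambling), and the controller
# treats that garbage as unflushed writes.
scrambled_cache = {
    block: bytes(b ^ 0xFF for b in good_copy[block]) for block in (0, 1)
}

disk = replay_cache_on_boot(disk, scrambled_cache, cache_marked_dirty=True)
damaged = [block for block in disk if disk[block] != good_copy[block]]

print(damaged)
```

Parity does not help here, because the controller itself wrote the bad data and would have updated parity to match it. The array is consistent, just consistently wrong.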

    It did significant damage, taking out 7 of the 13 operating systems running on this server. I'll probably be sorting out the fallout for a week.

    AARGH. That's easily the worst fault we've had since 2016.
    Last edited by Nexus; 13-09-2021 at 01:49 AM.
    "Blue Meanie" 2007 Aurora Blue MPS 3 - 18x8.5+44 SSR GTX01 - 235/40R18 Federal RS-RR - 3.5" ETS TMIC - CPE stg 2 mount - HKS/CPE BPV - 2XS inlet - 2XS short shift - 2XS turbo manifold - Hypertech tune - Leather/Aluminium handbrake - Momo shifty knob - 7" touchscreen - JDM Mazda Retractable dashtop screen assembly - PC based GPS and instrumentation - 36AH reserve battery and C-TEK isolator - TEIN Street Advanced coilovers 1" drop - Superpro bushings - 220Kw/410Nm.

    "Lipstick" 2013 Velocity Red MPS 3 - 18x7.5+48 Enkei RPF1 -225/40R18 Federal RS-RR - CPE TMIC - COBB inlet - CPE stg 2 mount - COBB Stage 1 98 octane tune - COBB shifty knob - 2XS short shift.
