Server Help

Announcements - Network outages explained

Mine GO BOOM - Mon May 26, 2008 11:51 pm
Post subject: Network outages explained
The outages last night and the previous week were both caused by the datacenter losing power and its battery backup systems failing.

Update: The servers did not come back online the moment power returned because of the funky RAID-LVM tricks I do with them. I had kept fsck from automatically fixing errors, mostly because I didn't trust it to run safely. Now that I've had a few trial runs (three datacenter shutdowns, one power supply pop), I have set it to fix errors upon detection, so the servers should come back online as soon as power is restored.
Mine GO BOOM - Tue May 27, 2008 11:00 am
Here is the email I received last week for the previous failure. Waiting for another email for last night.
Quote:
During a routine transfer of power from generator to utility power, one of our UPS units failed and caused the other two units in the UPS system to go offline. Normally this wouldn't be a problem as we have bypass power that would take over and be transparent to our customers when switching to it. Unfortunately, the bypass breaker failed when it was needed and would not reset.

Power was restored after human intervention, which required a complete shutdown of the UPS system, a transfer to the other bypass power source, and then bringing the UPS system back up. This sounds like a simple, straightforward thing to do, but it has to be done very carefully and in the proper sequence. All the feed breakers to the PDUs have to be opened to prevent the massive inrush current required to saturate the transformers in the PDUs from tripping the bypass breaker. Once the bypass power is brought up, the PDUs are brought on line one at a time, and finally the UPS system is brought on line and the power transferred to it.

In the above situation, the customer would have power as soon as the bypass is available and the PDU breakers are closed. The PDU power ramp up only needs to occur if the power to it has failed completely; once the transformer core is saturated there is no longer the problem of the inrush current when switching between UPS and bypass power.

The bypass breaker that failed was a hard fail and the breaker (1200 amp at 480v) has been replaced. We initially thought that the breaker settings were incorrect and caused it to trip prematurely, but that is not the case.

The UPS unit that failed has been repaired with new parts by the equipment manufacturer and is back on line. All systems are normal at this time.

All failed parts have been sent back to the manufacturer for analysis; as soon as the results come back I will notify you.

Procedurally, we now will keep the generator running until all tests are complete and systems are normal.

We are very sorry for any inconveniences this may have caused and if there are any additional questions you may have please don't hesitate to contact me.

Sincerely,

Paul

Mine GO BOOM - Tue May 27, 2008 6:19 pm
And here is the one for Monday at midnight:
Quote:
On 5-25-2008 at 11:58 PM MarquisNet lost commercial power. The UPS systems and the generator engaged. Unfortunately, UPS "Y" failed due to excessive load. Upon further examination it was found that the two fuses that failed last week had failed yet again. The cause of this failure will be investigated and the results released as soon as we have them. UPS "Y" remains off line. However, this did not explain why the UPS batteries did not take the load and keep the equipment up. Power was up and on generator within 2 minutes; unfortunately, because the batteries did not hold the load, we experienced an outage.

There are 8 strings of 40 batteries (320 batteries in total) available to the system. Each string is separate, and all are in parallel with each other to provide the backup. Each string provides approximately 100 KVA, for a total of 800 KVA. Our load at the time of the outage was approximately 610 KVA. With all 8 strings working this would have been fine. We then decided to check the battery strings for any problems. To check the strings, each battery has to be tested individually. This means pulling out 4 batteries at a time and testing them, all 320. We tested all the batteries and found 2 bad ones: one bad battery in each of two different strings. If there is one bad battery in a string, then the entire string is bad.

What this means is we had our capacity lowered to 600 KVA at the time of the outage. With the inrush of power to the system the batteries could not hold the load, thus an outage until the generator came on line.

To lessen the chance of this happening again, one of the PDUs originally on this UPS string of batteries was moved to our new UPS and batteries. This lowers the load on the first system by approximately 125 KVA, allowing the degraded string to handle the load until we can get replacement batteries. All of this took from 12:00 AM to 9:30 AM to accomplish.

If you have any questions feel free to call me on my cell number below.

Sincerely,

Paul
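
For anyone who wants to check the numbers in the second email, here is a rough sketch of the arithmetic in Python. The figures (8 strings, roughly 100 KVA each, a load of roughly 610 KVA, 2 bad strings, roughly 125 KVA moved off) come straight from the email. The function names are made up for illustration, and treating string capacity as simply additive is my assumption, not something the datacenter stated.

```python
# Approximate figures quoted in the datacenter's email.
TOTAL_STRINGS = 8
KVA_PER_STRING = 100      # "approximately 100 KVA" per string
LOAD_KVA = 610            # load at the time of the outage
BAD_STRINGS = 2           # one bad battery takes out its whole string
MOVED_LOAD_KVA = 125      # load shifted to the new UPS afterwards


def usable_capacity_kva(total_strings, bad_strings, kva_per_string):
    """Capacity of the healthy strings (hypothetical helper).

    Assumes strings in parallel simply add up, which is a simplification.
    """
    return (total_strings - bad_strings) * kva_per_string


def headroom_kva(capacity_kva, load_kva):
    """Positive means the batteries can carry the load; negative means they cannot."""
    return capacity_kva - load_kva


full = usable_capacity_kva(TOTAL_STRINGS, 0, KVA_PER_STRING)
degraded = usable_capacity_kva(TOTAL_STRINGS, BAD_STRINGS, KVA_PER_STRING)
load_after_move = LOAD_KVA - MOVED_LOAD_KVA

print("full capacity:", full)                                        # 800
print("degraded capacity:", degraded)                                # 600
print("headroom at outage:", headroom_kva(degraded, LOAD_KVA))       # -10
print("headroom after move:", headroom_kva(degraded, load_after_move))  # 115
```

So with two strings out, the batteries were about 10 KVA short of the load before any inrush is counted, and moving the PDU leaves about 115 KVA of margin on the degraded system.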

All times are -5 GMT