  1. #1
    Join Date
    Nov 2013
    Posts
    1

    Question Help me understand

    As an IT professional, I am struggling to understand how a power outage completely cripples all the game servers. Was there no backup plan? Today's sophisticated data centers handle mission-critical operations and processes, and it's not feasible to shut them down -- even for a short duration. This means that power needs to be available continuously. A properly designed and regularly tested emergency power system will ensure that critical data center operations are protected and continuously operational. Your customers deserve no less.

    Extremely bad timing, considering many of us pre-paid for Helms Deep. There was certainly no delay in the transfer of funds from my bank to Turbine.

  2. #2
    The Facebook post, I believe, noted that the backup power failed as well. I deal with IT too (several servers, battery backups, etc.), and no matter what gets put in place, things happen beyond our control. In this case you would think the datacenter had a beefy enough generator to kick in and handle the load, but if the power outage is fairly widespread, networking/internet gear outside your control will go down and everyone will lose access anyway. Thankfully I've never had that big of an issue *knock on wood*.

  3. #3
    Quote Originally Posted by Smaugnakh View Post
    As an IT professional, I am struggling to understand how a power outage completely cripples all the game servers. Was there no backup plan? Today's sophisticated data centers handle mission-critical operations and processes, and it's not feasible to shut them down -- even for a short duration. This means that power needs to be available continuously. A properly designed and regularly tested emergency power system will ensure that critical data center operations are protected and continuously operational. Your customers deserve no less.

    Extremely bad timing, considering many of us pre-paid for Helms Deep. There was certainly no delay in the transfer of funds from my bank to Turbine.
    Turbine doesn't own the datacenter. The datacenter's backup power failed to work, and all servers, including non-Turbine ones, were affected. Maybe they'll switch to a different host, or maybe not.

  4. #4
    In case the OP does not realise, this is a GAME...hardly mission critical. Backups fail, and there were far worse things happening in the States last night than the loss of a game!

  5. #5
    Quote Originally Posted by Ceejay90 View Post
    In case the OP does not realise, this is a GAME...hardly mission critical. Backups fail, and there were far worse things happening in the States last night than the loss of a game!
    You obviously have no clue what you're talking about. It is a SERVICE that their customers pay for. Keeping the service up and running is absolutely, positively mission critical. Flaky servers under stressful conditions are a great way to lose customers. Turbine is in no position to lose customers. Your apathy towards this problem just because the planet is rife with other bigger problems in no way gives Turbine a free pass to host this game on shoddy gear.

  6. #6
    Join Date
    Jun 2011
    Location
    In the Ninky Nonk
    Posts
    4,398
    Quote Originally Posted by Ceejay90 View Post
    In case the OP does not realise, this is a GAME...hardly mission critical. Backups fail, and there were far worse things happening in the States last night than the loss of a game!
    I think you'll find that as far as Turbine are concerned their games are mission-critical. It may be a game for us but to them it's what generates their revenue, pays their salaries and feeds their families.

    I too work in IT. Even the best-laid plans can go wrong. I've seen systems with extremely well-rehearsed recovery plans go awry when the 0.0001% chance of something occurring actually occurs. What most people don't appreciate when they use their IT (whether corporate or cloud services) is just how reliable these things are nowadays, so when a failure does occur it's as often as not the result of a problem that could not be foreseen.

    Put it this way: back in 1999 I worked for a company and we spent the latter half of that year testing & re-testing the critical systems to make sure that things would work on 01/01/2000. The number of issues we found and fixed meant that when the date cut over everything went smoothly to plan, with the result that our director of IT was told, in effect, that "Y2K was a waste of money and it could have been better spent elsewhere". Sometimes people just take for granted that when they flick a switch the light turns on.

    So, rather than dwell on this issue, let's be grateful that because Turbine treat their systems as mission-critical the games are back up and running and we've got a new date scheduled for HD. Otherwise, if they had been more lax, things could be a lot worse.
    <A sig goes here>

  7. #7
    Join Date
    Mar 2007
    Posts
    12,981
    Yes, I have the feeling that there are going to be INN-teresting conversations between Turbine's
    people and the guys running the datacenter. Like, why didn't their backups kick in?

    We'll probably never know the details, though I would like to be a fly on the wall to hear the
    conversation.

    There's an old, old, joke among engineers, which has been adapted (as a metaphor) by programmers:

    "But the automatic stop didn't stop!"

    "Well, why weren't you WATCHING the automatic stop?"
    Eruanne - Shards of Narsil-1 - Elendilmir -> Arkenstone

  8. #8
    Join Date
    Jul 2011
    Location
    Germany
    Posts
    612
    The datacenter probably outsourced their Admins to India.
    VoIP Admins rule!

    You "real" IT Pros know I´m only half joking..

  9. #9
    Join Date
    Jun 2011
    Location
    In the Ninky Nonk
    Posts
    4,398
    Quote Originally Posted by Aldeld View Post
    You obviously have no clue what you're talking about. It is a SERVICE that their customers pay for. Keeping the service up and running is absolutely, positively mission critical. Flaky servers under stressful conditions are a great way to lose customers. Turbine is in no position to lose customers. Your apathy towards this problem just because the planet is rife with other bigger problems in no way gives Turbine a free pass to host this game on shoddy gear.
    Don't think the issue is shoddy hardware. One can have the most up-to-date servers in the best data centres with all the redundant power one could ever need - but if someone fails to regularly test the failover to the backup generator then all that money has gone down the drain.
    <A sig goes here>

  10. #10
    Quote Originally Posted by Smaugnakh View Post
    As an IT professional, I am struggling to understand how a power outage completely cripples all the game servers. Was there no backup plan? Today's sophisticated data centers handle mission-critical operations and processes, and it's not feasible to shut them down -- even for a short duration. This means that power needs to be available continuously. A properly designed and regularly tested emergency power system will ensure that critical data center operations are protected and continuously operational. Your customers deserve no less.

    Extremely bad timing, considering many of us pre-paid for Helms Deep. There was certainly no delay in the transfer of funds from my bank to Turbine.
    Yes I also agree. To me this is very unprofessional.
    There are always .... Possibilities.

  11. #11
    Join Date
    Mar 2007
    Posts
    12,981
    Quote Originally Posted by Startrekman1of9 View Post
    Yes I also agree. To me this is very unprofessional.
    As others have pointed out, it was not unprofessional behavior on Turbine's part.

    It may have been unprofessional behavior on their data center's part, if it was due to
    someone's negligence that their backups didn't come up; we may never find out.

    When power drops like a stone for a large number of computers, some of them will suffer damage
    to their data -- as witnessed by the fact that several servers are/have been unable to come up or
    stay up.

    This used to be called "an act of God." Nowadays, many people substitute "Murphy" for "God."
    Eruanne - Shards of Narsil-1 - Elendilmir -> Arkenstone

  12. #12
    Quote Originally Posted by Smaugnakh View Post
    As an IT professional, ...
    No, you're obviously not.

  13. #13
    Quote Originally Posted by Flatfoot789 View Post
    The datacenter probably outsourced their Admins to India.
    VoIP Admins rule!

    You "real" IT Pros know I´m only half joking..
    LOL, the company I work for is currently 'experimenting' with what they can outsource. So far, "so good": everything they've tried has resulted in poor results and late projects, when/if they were completed at all.

  14. #14
    Join Date
    Feb 2007
    Location
    Sarasota, FL, USA
    Posts
    3,223
    Quote Originally Posted by Aldeld View Post
    You obviously have no clue what you're talking about. It is a SERVICE that their customers pay for. Keeping the service up and running is absolutely, positively mission critical. Flaky servers under stressful conditions are a great way to lose customers. Turbine is in no position to lose customers. Your apathy towards this problem just because the planet is rife with other bigger problems in no way gives Turbine a free pass to host this game on shoddy gear.
    The 'shoddy gear' is provided by a third party who also handles other companies and their processes. When the power goes down, not all hardware is brought back online at the same time; it is done in a specific order. I would guess that despite the size of the client, Turbine, hardware dedicated to running games is not at the top of the list.
    << Co-founder of The Firebrands of Caruja on Landroval >>
    Ceolford of Dale, Dorolin, Tordag, Garberend Bellheather, Colfinn Belegorn, Garmo Butterbuckles, Calensarn Nimlos, Langtiriel, Bergteir


  15. #15
    Quote Originally Posted by Smaugnakh View Post
    As an IT professional, I am struggling to understand how a power outage completely cripples all the game servers.
    I believe what occurred is that the power company stopped supplying power, and the data center had to switch over to battery backup and/or generators. One problem we have here in South Florida is heat, and very few data centers have a robust backup for their cooling. One company I work with has three generators but only needs two. There is enough fuel to power each generator for a week, and the capacity is high enough for both the data center and the cooling systems. The battery plant is designed for 24 hours of operation in case none of the generators start.
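
    Just to put very rough numbers on that kind of sizing, the back-of-the-envelope arithmetic looks something like this (every figure below is invented for illustration; it is not the spec of any real facility):

        # Toy sizing check for backup power at a data center.
        # All numbers are made up for illustration only.
        it_load_kw      = 800.0                              # servers, storage, network
        cooling_load_kw = 400.0                              # cooling needs backup power too
        total_load_kw   = it_load_kw + cooling_load_kw       # 1200 kW

        battery_kwh   = total_load_kw * 24                   # battery plant for 24 hours at full load
        gen_rating_kw = 750.0
        gens_needed   = -(-total_load_kw // gen_rating_kw)   # ceiling division -> 2

        burn_l_per_hour = 180.0                              # rough diesel burn per generator
        fuel_per_gen_l  = burn_l_per_hour * 24 * 7           # a week of fuel per generator

        print(f"Battery plant for 24 h: {battery_kwh:,.0f} kWh")
        print(f"Generators needed: {int(gens_needed)} (install {int(gens_needed) + 1} for N+1)")
        print(f"Diesel per generator for one week: {fuel_per_gen_l:,.0f} litres")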

    Other companies have no backup for the cooling system and only one hour of battery life. In that case the battery backup is enough to last through a very short power outage before a controlled shutdown begins.

    We are dealing with a game. It is not a critical service like an energy control center for a power company. A game is a low-profit product on a per-unit basis, so there are very few spare dollars to put into backup systems or duplicated hardware. These kinds of applications are designed to go offline when a sub-server fails or the power company stops supplying power.

    What is interesting is that all this trouble restarting the service is most likely due to a failure to perform a controlled shutdown. Looking on from the outside, it appears that some or all of the units in the LotRO server complex powered off. Restarting complex systems in this situation is very painful.

    We ran into this situation at one of the companies I worked at. Our engineering servers were handled by the engineering department employees. When the power went off, they had enough sense to properly power down the servers, but it never occurred to them that the huge UPS has a big computer in it. They let the UPS turn off when its battery ran out, and they did not have any clue how to properly cold-start a discharged UPS. They brought it back online, got it charged a little bit, and turned on the servers. As the servers started booting up, that completely drained the little charge out of the UPS, and the UPS shut down again, taking the servers with it. I have both a software engineering and an electrical engineering background, so I ended up going in there and restarting the UPS properly for them. It took them hours to get the servers back up after a few power cycles, and even more time to resolve data mismatches between machines that shared a workload. All because they had just enough knowledge to be dangerous.
    Last edited by Yula_the_Mighty; Nov 18 2013 at 04:58 PM.
    Unless stated otherwise, all content in this post is My Personal Opinion.

  16. #16
    Join Date
    Oct 2010
    Posts
    491
    Quote Originally Posted by Smaugnakh View Post
    Extremely bad timing, considering many of us pre-paid for Helms Deep.
    I fail to see what this has to do with anything.

    The fact that you pre-paid for Helms Deep didn't carry with it any sort of promise that you'd be able to play it on the 18th, any more than pre-ordering any other software means you'll be able to play it on release day. This is especially true when you consider that this is an expansion to an existing game.

    LotRO is still there, you can play it the same as you always did before. So the fact that you paid in advance for Helms Deep means pretty much nothing...

    Unless you honestly think they'd be better off trying to force such a massive upgrade on systems that aren't even running correctly with the stable version of the code, because that would work so much better....

  17. #17
    Join Date
    Jun 2011
    Location
    Local cluster
    Posts
    522
    Quote Originally Posted by Smaugnakh View Post
    Extremely bad timing, considering many of us pre-paid for Helms Deep. There was certainly no delay in the transfer of funds from my bank to Turbine.
    Good lord. Are you really saying that you feel the value of your purchase has been decreased because a 48h delay was introduced in delivery? Do you use instant availability as your only criterion for everything you buy, or do you only expect that when you are buying a digital product? Because as an IT professional you should probably know that #### happens...

  18. #18
    Join Date
    Oct 2010
    Posts
    491
    Quote Originally Posted by Yula_the_Mighty View Post
    In this case the battery backup is enough to last thru a very short power outage before a controlled shut down begins.
    This is the part I don't get... I work in IT infrastructure, and when you lose power the first thing you do is start to power down the servers, because the battery backups don't last very long. Yet it seems like the LotRO servers went down hard, and that's why they're having issues getting them back up.

    I've also seen reports of lost data, and rollbacks, which further points to an uncontrolled shutdown.

    No reasonable person should expect the game to stay up and running if the datacenter lost power. It's unlikely they'd have the resources to keep the whole thing running on backup generators. But I think it's very reasonable to ask why the systems didn't go down gracefully, if that is in fact what happened.
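
    For anyone curious, "start to power down the servers" usually isn't anything fancier than a small watcher that polls the UPS and kicks off a clean, ordered shutdown once you're running on battery. A rough sketch of the idea, assuming a NUT-style upsc client is on the box (the UPS name, the threshold and the service being stopped are placeholders I made up, not anything Turbine actually runs):

        #!/usr/bin/env python3
        # Sketch of a power-loss watcher: poll the UPS via NUT's upsc and
        # trigger a clean shutdown before the batteries run flat.
        import subprocess
        import time

        UPS = "myups@localhost"      # placeholder NUT UPS name
        MIN_CHARGE_PCT = 60          # shut down well before the batteries die
        POLL_SECONDS = 30

        def ups_var(name):
            """Read one UPS variable, e.g. ups.status or battery.charge."""
            out = subprocess.run(["upsc", UPS, name], capture_output=True, text=True)
            return out.stdout.strip()

        while True:
            status = ups_var("ups.status")                   # "OL" = on line, "OB" = on battery
            charge = float(ups_var("battery.charge") or 100)
            if "OB" in status and charge < MIN_CHARGE_PCT:
                # Ordered shutdown: stop the application first, then the host.
                subprocess.run(["systemctl", "stop", "game-world.service"])  # placeholder unit
                subprocess.run(["shutdown", "-h", "now"])
                break
            time.sleep(POLL_SECONDS)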

  19. #19
    Quote Originally Posted by rannion View Post
    Good lord. Are you really saying that you feel the value of your purchase has been decreased because a 48h delay was introduced in delivery? Do you use instant availability as your only criterion for everything you buy, or do you only expect that when you are buying a digital product? Because as an IT professional you should probably know that #### happens...

    Not fair!! You beat me to the rant.

  20. #20
    Quote Originally Posted by Aldeld View Post
    You obviously have no clue what you're talking about. It is a SERVICE that their customers pay for. Keeping the service up and running is absolutely, positively mission critical. Flaky servers when presented with stressful conditions is a great way to lose customers. Turbine is in no position to lose customers. Your apathy towards this problem just because the planet is rife with other bigger problems in no way gives Turbine a free pass to host this game on shoddy gear.
    Lighten up Francis.

    I've worked in IT professionally since 1986, all of that time at data centres. I still work in IT at a data centre today.

    It is a physical impossibility to give an absolute 100% guarantee that a data centre can never completely lose power. There is ALWAYS a chance it can happen. If preparations and planning are done right, that chance is small, but that chance still exists and it is always there. 99.9x % of the time, the planning and preparations will work, but sometimes that 0.0x % of the time, you get bitten in the ### instead.
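
    To put that "99.9x %" into concrete terms (rough arithmetic only, not anyone's actual uptime agreement):

        # What a given uptime percentage still allows in downtime per year.
        hours_per_year = 24 * 365
        for availability in (99.9, 99.95, 99.99):
            downtime_hours = hours_per_year * (1 - availability / 100)
            print(f"{availability}% uptime still allows ~{downtime_hours:.1f} hours of downtime a year")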

    Here's one example I personally witnessed (and not the only example).

    The data centre in question has electronically locked doors as part of its security. Staff use pass cards to open the doors, but in an emergency such as a fire, the doors have a button to unlock them behind a small "break glass" panel.
    Inside the main control room for the data centre is another "break glass" panel with a button behind it, but this button turns off ALL OF THE POWER to the data centre and DISABLES THE BACKUP GENERATOR from starting.

    One day a technician was doing maintenance work (on contract) on the door locking system, including the break glass buttons that unlock the doors. He mistook the emergency power shutdown button for a door release button, even though the button was clearly labelled as an emergency power shutdown and was in a clearly different type of break glass box. The entire data centre shut down in seconds. It took many hours to get all the equipment in the data centre running again. That technician was banned from ever coming to that data centre ever again. Extra measures were taken to make sure it would be impossible to trip the master power shutdown button by accident a second time.

    Yes, some of us, including me, pay real world money to play this game. A certain level of reliability is expected. 100% reliability with absolutely no possibility of an unexpected interruption just isn't possible while keeping the costs of running the game at a reasonable level. Even with good planning and preparation, problems sometimes happen that are outside the scope of what has been prepared and planned for. When that happens, you do the best you can under the circumstances you're faced with.

    I'm not saying I know the exact details of the data centre where the LOTRO servers run from, but in all probability, neither do you. What I can say is that I've been in similar situations, more than once, in multiple data centres. This isn't a perfect world; sometimes things go wrong. Do yourself a favour and learn to live with that little fact.
    Therina - Hobbit Guard Rongo - Hobbit Warden
    Frood - Man Minstrel Garmun - Man Captain
    Zorosi - Dwarf Champ Froodaroon - Elf Hunter
    Southern Defenders - Arkenstone (formerly Elendilmir)

  21. #21
    Join Date
    Mar 2007
    Posts
    12,981
    Is everybody familiar with Clarke's Third Law?

    "Any sufficiently advanced technology is indistinguishable from magic."

    There are many people who use computers every day for whom, essentially, those
    computers run by magic.

    Heck, there are people for whom a light switch is magic. Anything you don't understand is magic.

    If someone who had all the information were to explain to me what actually went wrong last night,
    I *might* understand it. Whh certainly would. But in the absence of such information, let us assume
    that the data center didn't sacrifice enough chickens to Murphy last night, and get on with life, the
    game, and everything.
    Eruanne - Shards of Narsil-1 - Elendilmir -> Arkenstone

  22. #22
    Join Date
    Oct 2010
    Posts
    491
    Quote Originally Posted by GarethB View Post
    but this button turns off ALL OF THE POWER to the data centre and DISABLES THE BACKUP GENERATOR from starting.
    As soon as I saw this, I saw where it was going.

    100% reliability with absolutely no possibility of an unexpected interruption just isn't possible while keeping the costs of running the game at a reasonable level.
    I remember once my boss telling me that we had a 100% data recovery policy: that 100% of the data on the systems was recoverable to within the last 6 weeks. I told her she was a fool if she was making that kind of promise. Backup tapes can be lost or damaged, so there's no way you can say that 100% of data can be recovered.

    But clearly something weird happened here, because IMO the issue isn't the power outage, or even the failure of the backup systems. What I can't figure out is how the systems went down so hard, if that is in fact what happened.

  23. #23
    Join Date
    Nov 2008
    Location
    Utah
    Posts
    15
    I'm just returning after a four year hiatus. Got software installed Sat, played a little, played a little Sun, then servers fall down go boom. I think the most surprising thing to me is how hard things went down and that Turbine seemed as poleaxed as the rest of us. It seems it took an hour or two for them just to get in contact with the data center. I wonder if they're in the market for a different center now... I'd sure hate to be the guys going through all the debris in lost+found and elsewhere trying to get things largely back together.

    Edit: I see the post above me wonders the same thing. There was a serious protocol failing of some sort at the data center.

    Edit2: Forgot to mention another newbie/returnee impression - I don't know what bounder's tokens are, but the rage over them has been crazy nuts.
    Last edited by biodegraded; Nov 18 2013 at 06:33 PM.

  24. #24
    Quote Originally Posted by Solarfox View Post
    But clearly something weird happened here, because IMO the issue isn't the power outage, or even the failure of the backup systems. What I can't figure out is how the systems went down so hard, if that is in fact what happened.
    My hunch is there are multiple physical servers for each "server". If there was a sudden power outage and zero battery backup, then they aren't just going to go down hard; it would be like losing RAID integrity.

    THAT sucks big time.

    My hunch is that the batteries that should be there as a stopgap between the power failure and the backup generator starting failed miserably.

    Someone probably forgot to put diesel in the generator. I've actually seen that happen before.

  25. #25
    Join Date
    Mar 2007
    Posts
    10,510
    There are more things that can go wrong than most people imagine, even with systems that are expected to be "always on".

    After the Loma Prieta earthquake in 1989, the phone system in the area stayed up and functioning, except for one central office (CO) that went down and stayed down for a few days after the 'quake. (The relevance here is that COs were--and are--all run on computers.)

    So what happened? To follow the story, you have to know that COs have battery backups that are in turn backed up by generators powered by internal combustion engines, usually diesels. The CO doesn't actually ever run directly off the generator. When the external power goes down (as it did throughout the affected area), the CO runs off the batteries. When the batteries get low, the generator kicks in to recharge the batteries. When the batteries are recharged, the generator shuts down until the next time it needs to recharge. COs are designed to operate for a minimum of two weeks without external power. The generators are tested regularly and fuel supplies are checked as well.

    In the CO that went down, when the batteries got low, the generator failed to start. When the batteries went flat, the CO shut down.
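
    If it helps, the scheme described above is essentially this little control loop. A toy simulation with invented numbers, just to show why a generator that refuses to start eventually takes the CO dark once the batteries go flat:

        # Toy model of the CO power scheme: the office always runs off the
        # batteries; the generator only starts to recharge them. Numbers invented.
        def simulate(hours, generator_starts=True):
            charge = 100.0                      # battery charge, percent
            LOW, FULL = 30.0, 95.0              # generator start/stop thresholds
            DRAIN, RECHARGE = 2.0, 6.0          # % per hour; recharge is net of the load
            generator_on = False
            for hour in range(hours):
                if generator_on:
                    charge = min(100.0, charge + RECHARGE)
                    if charge >= FULL:
                        generator_on = False    # batteries topped up, generator shuts down
                else:
                    charge -= DRAIN             # running off the batteries
                    if charge <= LOW:
                        generator_on = generator_starts   # the step that failed in that one CO
                if charge <= 0:
                    return f"CO dark after {hour + 1} hours"
            return "CO still up"

        print(simulate(24 * 14, generator_starts=True))    # the designed case: two weeks, still up
        print(simulate(24 * 14, generator_starts=False))   # generator never starts: lights out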

 

 