SmugMug down @ 4:20pm Pacific

Comments

  • anderiv Registered Users Posts: 80 Big grins
    edited January 18, 2007
    And....we're back up! Here's hoping for no more hardware failures.

    Thanks for the good work, gents.
    Erik Anderson
    http://andersonfam.org
    http://andersonfam.smugmug.com
    D70 | SB-600 | Nifty Fifty | Tamron 17-50 f/2.8 | Nikon 70-300 f/4-5.6G
  • Andy Registered Users Posts: 50,016 Major grins
    edited January 18, 2007
    Back up at 5:30 PT
  • DJFriar Registered Users Posts: 19 Big grins
    edited January 18, 2007
    What, me worry?
    I'm not worried. I too have been here for over 2 years, and outages are rare. I agree it sucks when you have something major going on, but overall I still don't think there is a better service than SmugMug.

    To SmugMug: Your candor on these issues is by far the best customer service I've seen, and a very large part of why I remain a power subscriber. I don't even really shoot pics at the hobby level (although I'm trying to get there), but I choose to pay for a service I can do for free or myself because of how incredible the people and service are here. Don't change a thing. Well, fix that hardware issue, but don't change anything else. lol.
  • thegrepper Registered Users Posts: 25 Big grins
    edited January 18, 2007
    Manny wrote:
    Don,

    This is of course a bad time for this question :-) But I had an awful bad day at work where I got bashed by my customers for having poor redundancy on one of our mission critical apps. So I ask you, why is there no redundancy built into Smugmug? I mean, not even the RAID is redundant? There is a single RAID controller ? no hot backup? :-)

    This is a loaded question and I realize I am putting you on the spot, but you have been candid with us and you make it easy for us to talk directly with you guys and you make it easy for us (the customer) to make requests that can improve the service. I think the question is not totally unfair.

    I ask also because in the past few days, I have had more downtime on my Smugmug account than I care to... I just sold about 5 people at my job on getting rid of their crappy Kodak accounts and switch to Smugmug. Now they are looking at me as if I misled them... Laughing.gif

    In any case, maybe when this is all over you guys can tell us more about plans to have more redundancy? if possible?

    Cheers and good luck with your current challenges.

    Manny,

    Great question; you are exactly where I was earlier in the week on the redundancy issue.

    Don,

    As Manny mentioned in his post, I'm surprised that a RAID controller would be a single point of failure. Your business is mainly about storage and I would have expected that you maintain a tier1 solution like HDS or EMC. Perhaps this is cost prohibitive. Can you provide any details on your storage configuration?
  • DJKennedy Registered Users Posts: 555 Major grins
    edited January 18, 2007
    yay its back up !clap.gif
    http://www.djkennedy.com

    What did Cinderella say when she left the photo shop? "One day my prints will come."

  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    onethumb wrote:
    I would guess we'll be down for an hour, but you never know. Sorry about that.

    Don

    Missed my estimate by about 5 minutes. Sorry about that.

    We're back up.

    Don
  • DJKennedy Registered Users Posts: 555 Major grins
    edited January 18, 2007
    onethumb wrote:
    Missed my estimate by about 5 minutes. Sorry about that.

    We're back up.

    Don

    5 mins, $5 credit? mwink.gif
    http://www.djkennedy.com

    What did Cinderella say when she left the photo shop? "One day my prints will come."

  • Scott_Quier Registered Users Posts: 6,524 Major grins
    edited January 18, 2007
    Z06Nut wrote:
    Please hurry I need to upload some pictures. :bluduh
    NOOOOO Don't hurry. Take your time and do it right.

    It is always cheaper to spend a little more and do it right the first time than to pay to do it wrong once and then pay again to do it right.

    So, SmugMug Heroes, take your time and do the job with style and know that we appreciate the effort you are making to get the system up and running correctly rather than trying to apply a patch that may/will fail without notice.
  • Manny Registered Users Posts: 148 Major grins
    edited January 18, 2007
    Thanks Chris!!

    Cheers
    MG
    Baldy wrote:
    Hi Manny (and everyone else),

    Ouch, we're really sorry for the outages we've seen over the last few days. It appears to be the same hardware failure happening repeatedly, and the guys think they've isolated it now.

    We've been working hard on redundancy for some years now (we've always been redundant on storage) to eliminate almost all single points of failure. However, there are a few things that can still go wrong, and things we're working on currently to reduce the chances of an outage like this one.

    I just suffered the same embarrassment as many of you: pointing the press to my galleries and having them get an error. :cry Hopefully we'll get this put to bed quickly.

    Thanks,
    Chris
  • SysConsultant Registered Users Posts: 1 Beginner grinner
    edited January 18, 2007
    Don - do you need a hand?
    In an earlier post you mentioned that the problem you were addressing was likely related to a faulty disk controller and that you would require site downtime to replace it.

    I design enterprise storage systems for a living and if you're not comfortable with your current situation I'd be glad to discuss some of your options. While your post was a bit sketchy, it sounds like you maybe haven't been getting the best advice on storage solutions. It's unlikely that a properly architected storage system would suffer an outage under the circumstances you describe.

    BTW, I'm not trying to sell you something. Unless you're hosting your site in Minnesota, it's unlikely that I would benefit from your purchase of any of the equipment my company sells.

    I can be contacted at the email or phone number listed in my SmugMug account.

    SysConsultant
  • carver6 Registered Users Posts: 1 Beginner grinner
    edited January 18, 2007
    Up / Down ???
    So the site appears up. But has the problem been resolved? I too am a new customer. I have been trying to upload and sort over 5000 images. Duplicate detection is a nightmare. Where did each of my uploads stop? Which ones failed and which ones finished? It's a mystery.

    On a side note, may I suggest dual GSLB load balancers for production failover? Data replication in real time, etc.

    Otherwise... everything I've heard so far is very promising.
  • Andy Registered Users Posts: 50,016 Major grins
    edited January 18, 2007
    carver6 wrote:
    So the site appears up. But has the problem been resolved? I too am a new customer. I have been trying to upload and sort over 5000 images. Duplicate detection is a nightmare. Where did each of my uploads stop? Which ones failed and which ones finished? It's a mystery.

    On a side note, may I suggest dual GSLB load balancers for production failover? Data replication in real time, etc.

    Otherwise... everything I've heard so far is very promising.
    Hi Carver, and welcome wave.gif

    As many customers have said, this is not the norm for us. We do have a weekly scheduled maintenance window on Thursday nights, please see the sticky in this forum... we don't always use it but I'm told we will be tonight.
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    Manny wrote:
    Don,

    This is of course a bad time for this question :-) But I had an awful bad day at work where I got bashed by my customers for having poor redundancy on one of our mission critical apps. So I ask you, why is there no redundancy built into Smugmug? I mean, not even the RAID is redundant? There is a single RAID controller ? no hot backup? :-)

    This is a loaded question and I realize I am putting you on the spot, but you have been candid with us and you make it easy for us to talk directly with you guys and you make it easy for us (the customer) to make requests that can improve the service. I think the question is not totally unfair.

    I ask also because in the past few days, I have had more downtime on my Smugmug account than I care to... I just sold about 5 people at my job on getting rid of their crappy Kodak accounts and switch to Smugmug. Now they are looking at me as if I misled them... Laughing.gif

    In any case, maybe when this is all over you guys can tell us more about plans to have more redundancy? if possible?

    Cheers and good luck with your current challenges.

    We have redundancy across all of our systems, but redundancy falls into a few categories. Here's some more details:

    - Automatic failover. These pieces of our datacenter include core routers, core switches, load balancers, image storage servers and disks, web servers, upload servers, image processors, database slaves, and the like. These pieces have multiple pieces of hardware in place, hot and ready to go, and they auto-detect failures and automatically work around them. >99.99% of the time, when something in this category fails, the customer never notices. Occasionally the failover will take a second, and for that second, there may be some slowness or brokenness, but it never lasts longer than one or two seconds.

    - Manual failover. These pieces of our datacenter are mostly database related, and specifically, database masters. The technology exists to do automatic failover here, but we don't use it. Neither does Yahoo, Google, or any other major internet company I know of (I'm friends with people in most of the systems groups there). Why? Because with this stuff it is vitally important that data integrity is maintained. Without going into technical details, there are cases where multiple database masters might THINK the other has failed, when they really haven't. If and when that happens, data gets randomly written to two or more masters and data gets corrupted. In this case, that means your images, albums, billing information, orders, etc. get screwed up.

    Every piece of our database master architecture has redundancy built in. We use multiple disks in RAID arrays, with multiple arrays on each box, and the list goes on. But when we have a failure, we *must* manually verify the state of all database machines, ensure corruption or confusion hasn't occurred, and then bring them back into production.

    In this particular case (the last 3 or so days), this has been difficult to do because it's a strange problem we've never experienced before, and we needed to narrow down the problem. Having a knee-jerk reaction and rapidly changing hardware out would possibly mask the problem, making it harder to find, and at worst, could jeopardize your data. It's like elementary science class: change one variable, observe what happens, then change the next.

    The bottom line is that while we value uptime, we value your data more. Given a choice between a short outage and possible data corruption, I'll always err on an outage to ensure the data is there when we come back up.

    We're actively working on making this process faster and less error-prone, but it's an extremely expensive proposition, and we're still learning how to best accomplish it. Our friends at Yahoo, Google, etc, have been sharing their expertise, but we're just not there yet.

    I should note that our friends at those other companies have large outages from time to time, too, so this isn't a problem that's unique to us. The truth is, I'm not aware of any major internet brand that doesn't suffer outages like these - often more frequently than we do. That's not an excuse, just an observation.

    I hope that helps. When I get a chance, I plan on outlining (to the extent that I can without potentially exposing security issues) our datacenter architecture on the wiki so you can get a good feel for where the weak and strong points are.

    Don
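A note on the split-brain case described in the post above: here is a minimal Python sketch, assuming two hypothetical masters (`db-a` and `db-b`) that each promote themselves as soon as the other's heartbeat goes quiet. The names, the `Master` class, and the failover rule are illustrative assumptions, not SmugMug's actual setup. It only shows why a cut link between two healthy masters can leave both accepting writes.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical database master, used only for illustration."""
    name: str
    heartbeats_seen: set = field(default_factory=set)
    accepting_writes: bool = False
    rows: dict = field(default_factory=dict)


def naive_should_promote(node: Master, peer_name: str) -> bool:
    # Naive automatic failover: "I cannot hear my peer, so it must be dead."
    return peer_name not in node.heartbeats_seen


def split_brain_demo():
    # Both machines are healthy, but the link between them is down,
    # so neither sees the other's heartbeat.
    a = Master("db-a")
    b = Master("db-b")

    a.accepting_writes = naive_should_promote(a, "db-b")
    b.accepting_writes = naive_should_promote(b, "db-a")

    # Clients that reach different masters now write conflicting data.
    if a.accepting_writes:
        a.rows["album:42:title"] = "Wedding"
    if b.accepting_writes:
        b.rows["album:42:title"] = "Wedding (final)"

    both_active = a.accepting_writes and b.accepting_writes
    diverged = a.rows != b.rows
    return both_active, diverged


if __name__ == "__main__":
    both_active, diverged = split_brain_demo()
    print("both masters accepting writes:", both_active)
    print("data diverged:", diverged)
```

Having a person verify the state of every database machine before bringing one back into production is one way to rule this case out, which matches the manual step the post describes.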
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    carver6 wrote:
    So the site appears up. But has the problem been resolved? I too am a new customer. I have been trying to upload and sort over 5000 images. Duplicate detection is a nightmare. Where did each of my uploads stop? Which ones failed and which ones finished? It's a mystery.

    On a side note, may I suggest dual GSLB load balancers for production failover? Data replication in real time, etc.

    Otherwise... everything I've heard so far is very promising.

    We do have dual GSLB load balancers. :)

    Data replication, though, is never "real time". The speed of light makes sure of that, so it's not quite that simple. It never is, is it? :)

    Don
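To put a rough number on the speed-of-light remark above, here is a small Python sketch that computes the lower bound on one-way propagation delay between two sites. The 4,000 km distance and the 0.67 velocity factor for optical fiber are assumptions chosen for illustration. Real replication lag is higher, since this ignores routing, queuing, and the time to commit the write.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # in vacuum


def min_one_way_delay_ms(distance_km: float, velocity_factor: float = 1.0) -> float:
    """Lower bound on one-way propagation delay, in milliseconds.

    velocity_factor is the fraction of c at which the signal travels
    (1.0 for vacuum; optical fiber is roughly two thirds of that).
    """
    if distance_km < 0 or not 0 < velocity_factor <= 1:
        raise ValueError("distance must be >= 0 and 0 < velocity_factor <= 1")
    return distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor) * 1000.0


if __name__ == "__main__":
    distance_km = 4000.0  # hypothetical distance between two datacenters
    print(f"vacuum: {min_one_way_delay_ms(distance_km):.1f} ms")
    print(f"fiber (assumed 0.67c): {min_one_way_delay_ms(distance_km, 0.67):.1f} ms")
```

Any write acknowledged before the remote copy confirms it can be lost in a failure, and any write that waits for confirmation pays at least a round trip of this delay.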
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    In an earlier post you mentioned that the problem you were addressing was likely related to a faulty disk controller and that you would require site downtime to replace it.

    I design enterprise storage systems for a living and if you're not comfortable with your current situation I'd be glad to discuss some of your options. While your post was a bit sketchy, it sounds like you maybe haven't been getting the best advice on storage solutions. It's unlikely that a properly architected storage system would suffer an outage under the circumstances you describe.

    BTW, I'm not trying to sell you something. Unless you're hosting your site in Minnesota, it's unlikely that I would benefit from your purchase of any of the equipment my company sells.

    I can be contacted at the email or phone number listed in my SmugMug account.

    SysConsultant

    Our outage state wasn't related, per se, to hardware failure but rather a double whammy of not knowing where the failure was occurring (hardware or software, and which piece of what) and ensuring that the problem in question hadn't done anything to the data that would compromise it.

    Thanks for the offer, but I think I'm fairly well versed in all of the commercially available storage architectures. Unfortunately, my experience, and everyone I've spoken to verifies this is the case, suggests that as you add extra pieces to your storage infrastructure to prevent and work around storage outages and errors, those very technologies tend to cause more problems than they solve.

    Our storage infrastructure over the last 5 years has been far more solid than any of my friends' high-end EMC installations at places like Microsoft, for example. Their failover and multipathing is supposed to work, but often doesn't.

    I'll ping you if we get in a bind, though. :)

    Don
  • Seneca Registered Users Posts: 1,661 Major grins
    edited January 18, 2007
    kygarden wrote:
    While I too will admit that this outage/downtime is not the norm...I'll also have to admit that I can specifically recall a couple other unplanned downtimes (maybe 2 or 3) in the last several months. I remember because I remember stating that I left Pbase for this reason - unreliable service. While I'm not looking to make enemies (I love SmugMug), I also have to agree with others that are saying there just has to be some way to prevent this from happening this often. I don't have to rely on this service for my business, but I AM paying for it. Whether I'm simply posting photos in forums or selling something on ebay (that would be my biggest complaint - photo not available for ebay shoppers), I do expect my photos to be available.

    I work for a multi-billion dollar software company so I hear many many conversations about uptime at work (I'm in the IT dept, but work on the telecom side though - converged voice and data though - VoIP). I realize there are challenges sometimes to providing trouble-free service. But at some point you reach a breaking point and it gets to be too much. Would I leave SmugMug (and would anyone care if I did anyway)?....if this continues to happen sporadically, yes I could and will. Especially if my photo work becomes more important to me and I need a rock solid hosting service. I guess at this point I'm officially "spooked."

    Again, I don't doubt for one second that the folks running smugmug and trying to take care of us all are very nice people. Everything I've seen indicates they are. But ultimately, business is business and someone being frustrated about something they are paying for not being available is a big problem. Check the dpreview forums and other photography forums. You don't think there are complaints (loud complaints) about smugmug's service being shaky here lately? You better believe it. That complaining only scares off more business. Justly or unjustly.

    I guess I'll hush now :) I hope it's back soon...and here it is internet rush hour time for the U.S. That's bad.....

    And your post proves what? rolleyes1.gif Sorry...I think you guys are doing a great job...in the World Wide Web many internet hosting companies have their problems, heck even YAHOO has their problems from time to time. My husband works for a company that has offices in Germany, Australia, Norway, Russia...and lately they've had their problems with their internet and connecting lines. Stuff like this happens - and it happens more than we think. Sometimes it's hardware problems...Sometimes it's just fluke stuff.
  • Seneca Registered Users Posts: 1,661 Major grins
    edited January 18, 2007
    I’ve been smugmug’n for almost two years now, and have had nothing shy of a great experience. For those of you who have just joined our community this is way out of the norm. I'm sure things will be back up and running better than ever (I can’t wait for the big changes Smugmug has planned) very soon. MM

    Meeee tooo...service is great...thumb.gif
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    Seneca wrote:
    And your post proves what? rolleyes1.gif Sorry...I think you guys are doing a great job...in the World Wide Web many internet hosting companies have their problems, heck even YAHOO has their problems from time to time. My husband works for a company that has offices in Germany, Australia, Norway, Russia...and lately they've had their problems with their internet and connecting lines. Stuff like this happens - and it happens more than we think. Sometimes it's hardware problems...Sometimes it's just fluke stuff.

    Thanks, Seneca.

    It's a dirty little secret, but the truth is that no-one, not Microsoft, Google, Yahoo, eBay, or Amazon, has solved this problem. They have billion-dollar budgets for this sort of thing, but they all have outages all the time. None of them have 5-9s (99.999%) uptime for a year.

    We try hard, all of us, but sometimes things are inevitable. It can be rough on us to hear people screaming for 100% uptime, which is impossible, for $40/year when Microsoft can't do it for $1B/year.

    Thanks so much to all of you who are so supportive, understanding, and patient. I know it's frustrating. Having such great customers makes this job worth it.

    Now, I'm hoping we've solved the problem for good, but it's entirely possible we were barking up the wrong tree. So keep those fingers crossed. :)

    Don
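For a sense of scale on the "5-9s" figure in the post above, this short Python sketch converts an availability target into a yearly downtime budget. The function name is made up for illustration, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Downtime budget per year, in minutes, for a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, availability in targets:
        budget = allowed_downtime_minutes(availability)
        print(f"{label} ({availability:.3%}): {budget:.1f} minutes/year")
```

Five nines works out to a little over five minutes of downtime per year, so a single outage of roughly an hour, like the one in this thread, already exceeds that budget several times over.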
  • thegrepper Registered Users Posts: 25 Big grins
    edited January 18, 2007
    onethumb wrote:
    Thanks, Seneca.

    It's a dirty little secret, but the truth is that no-one, not Microsoft, Google, Yahoo, eBay, or Amazon, has solved this problem. They have billion-dollar budgets for this sort of thing, but they all have outages all the time. None of them have 5-9s (99.999%) uptime for a year.

    We try hard, all of us, but sometimes things are inevitable. It can be rough on us to hear people screaming for 100% uptime, which is impossible, for $40/year when Microsoft can't do it for $1B/year.

    Thanks so much to all of you who are so supportive, understanding, and patient. I know it's frustrating. Having such great customers makes this job worth it.

    Now, I'm hoping we've solved the problem for good, but it's entirely possible we were barking up the wrong tree. So keep those fingers crossed. :)

    Don

    Don,

    Thanks for putting up with arm chair quarterbacks (me included) and for the open, honest communications.
  • kened11 Registered Users Posts: 18 Big grins
    edited January 18, 2007
    Whilst this is all very nice, it isn't giving me the confidence I need to have in your service. If the same errors occur over the weekend and next week, I won't be inconvenienced, I will be losing business, simple as that. Saying the likes of Microsoft / Google have issues is NOT the point. During my 17 year career in IT, if I had told one of my customers that:

    1. I don't know what the issue is
    2. I can only guess when it will be fixed
    3. Hey, everyone else has issues too, so that's OK

    ... I would have very quickly lost that customer. Yes, outages happen, but outages should not entail total loss of service multiple times in a week.

    It's great you have a loyal customer base, you are obviously doing a lot right, but I am now seriously worried that when I need the service most, it won't be there. It's not like elementary science class; people are relying on your service to run their businesses.
  • Andy Registered Users Posts: 50,016 Major grins
    edited January 18, 2007
    kened11 wrote:
    It's not like elementary science class; people are relying on your service to run their businesses.
    And we know it. We're doing everything humanly possible to provide great uptime. Click on onethumb's name and read his last few postings, that should give you a good idea.
  • PBolchover Registered Users Posts: 909 Major grins
    edited January 18, 2007
    Quick suggestion: in cases where you are worried about possible database issues, would it be possible to run a "read only" mode of the website, using a slightly-out-of-date database. This mode would allow the general public to access the websites, but prevent the account holder from uploading or modifying any details.

    This read-only mode could be flagged by copious warnings saying something like the following "Due to database issues, we are displaying an archive version of the website. Normal service will resume as soon as possible".
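The read-only idea above could look something like the following Python sketch. Everything here is hypothetical: the `Site` class, the dictionaries standing in for a master database and a slightly-out-of-date replica, and the `read_only` flag are assumptions for illustration, not a description of how SmugMug works.

```python
READ_ONLY_BANNER = (
    "Due to database issues, we are displaying an archive version of the "
    "website. Normal service will resume as soon as possible."
)


class Site:
    """Toy site that can fall back to a stale copy of its data."""

    def __init__(self, master: dict, replica: dict):
        self.master = master      # authoritative, writable store
        self.replica = replica    # slightly-out-of-date copy
        self.read_only = False    # flipped by an operator during an incident

    def view(self, key: str):
        # Visitors are served from the replica while the master is being checked.
        source = self.replica if self.read_only else self.master
        banner = READ_ONLY_BANNER if self.read_only else None
        return source.get(key), banner

    def update(self, key: str, value: str) -> None:
        # Uploads and edits are refused rather than risk writing to a bad master.
        if self.read_only:
            raise PermissionError("site is in read-only mode")
        self.master[key] = value


if __name__ == "__main__":
    site = Site(master={"album:1": "Beach 2007"}, replica={"album:1": "Beach"})
    site.read_only = True
    print(site.view("album:1"))
    try:
        site.update("album:1", "Beach trip")
    except PermissionError as exc:
        print("write refused:", exc)
```

The hard part in practice is everything this sketch leaves out, such as deciding when the replica itself can be trusted and handling sessions, orders, and uploads that were in flight when the switch happened.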
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    kened11 wrote:
    Whilst this is all very nice, it isn't giving me the confidence I need to have in your service. If the same errors occur over the weekend and next week, I won't be inconvenienced, I will be losing business, simple as that. Saying the likes of Microsoft / Google have issues is NOT the point. During my 17 year career in IT, if I had told one of my customers that:

    1. I don't know what the issue is
    2. I can only guess when it will be fixed
    3. Hey, everyone else has issues too, so that's OK

    ... I would have very quickly lost that customer. Yes, outages happen, but outages should not entail total loss of service multiple times in a week.

    It's great you have a loyal customer base, you are obviously doing a lot right, but I am now seriously worried that when I need the service most, it won't be there. It's not like elementary science class; people are relying on your service to run their businesses.

    I was very careful to say that Google/Microsoft/etc crashing was not an excuse. And it's not.

    But the fact remains, this is an *unsolvable* problem. At best, it's a manageable & rare problem. So there will be outages.

    I feel your pain, and you're more than welcome to try other services out, but I promise you - they'll have outages too. Just to be clear, though, we aren't taking a "well, it's inevitable, so we won't work hard to make it less frequent or less painful" approach. Far from it.

    Don
  • onethumb Administrators Posts: 1,269 Major grins
    edited January 18, 2007
    PBolchover wrote:
    Quick suggestion: in cases where you are worried about possible database issues, would it be possible to run a "read only" mode of the website, using a slightly-out-of-date database. This mode would allow the general public to access the websites, but prevent the account holder from uploading or modifying any details.

    This read-only mode could be flagged by copious warnings saying something like the following "Due to database issues, we are displaying an archive version of the website. Normal service will resume as soon as possible".

    We've thought about this before, and it's a great suggestion. I just need to figure out how to do it. :)

    Don
  • blackholewon Registered Users Posts: 21 Big grins
    edited January 19, 2007
    Oh stop it!
    kened11 wrote:
    Whilst this is all very nice, it isn't giving me the confidence I need to have in your service. If the same errors occur over the weekend and next week, I won't be inconvenienced, I will be losing business, simple as that. Saying the likes of Microsoft / Google have issues is NOT the point. During my 17 year career in IT, if I had told one of my customers that:

    1. I don't know what the issue is
    2. I can only guess when it will be fixed
    3. Hey, everyone else has issues too, so that's OK

    ... I would have very quickly lost that customer. Yes, outages happen, but outages should not entail total loss of service multiple times in a week.

    It's great you have a loyal customer base, you are obviously doing a lot right, but I am now seriously worried that when I need the service most, it won't be there. It's not like elementary science class; people are relying on your service to run their businesses.

    You are absolutely out of touch. :flush

    There isn't one IT that hasn't had system problems where they didn't know what the issue was until it was discovered, that gave an educated guess as to when it would be fixed and that hasn't said that everyone else has issues too. If not then that IT is working on an elementary system. I have direct experience working in a Fortune 100 company with 85,000 employees worldwide, and guess what, the staff of IT guys said the same things. There were outages that went on and off and needed step by step diagnosis.

    It's just electronics. yelrotflmao.gif


    These guys at SmugMug are great and up front and honest. I just can't figure out why so many are OVER-REACTING with worry. If any business' success is predicated on a few hours of outage then the business can't be all that in the first place. So the strength of Smugmug is in us loyal customers to wade through the ups and downs of the internet together.

    So, if you want to blame anyone, try Gore, after all he invented the internet!!!!!wings.gif

    blackholewon
  • kygarden Registered Users Posts: 1,060 Major grins
    edited January 19, 2007
    Hmmm...much like a fresh reload of Windows on a PC, SmugMug seems to be running much faster than it ever has (for me). Pages and images load much faster than before. :cheeburga
  • TomaS Registered Users Posts: 314 Major grins
    edited January 19, 2007
    kygarden wrote:
    Hmmm...much like a fresh reload of Windows on a PC, SmugMug seems to be running much faster than it ever has (for me). Pages and images load much faster than before. :cheeburga

    I agree! The site seems to be smokin' (in a good way) this morning.

    Can we get an update on the repairs?

    Thanks for all the hard work and for keeping us informed.
  • mrcoons Registered Users Posts: 653 Major grins
    edited January 19, 2007
    You are absolutely out of touch. :flush

    There isn't one IT that hasn't had system problems where they didn't know what the issue was until it was discovered, that gave an educated guess as to when it would be fixed and that hasn't said that everyone else has issues too. If not then that IT is working on an elementary system. I have direct experience working in a Fortune 100 company with 85,000 employees worldwide, and guess what, the staff of IT guys said the same things. There were outages that went on and off and needed step by step diagnosis.

    It's just electronics. yelrotflmao.gif


    These guys at SmugMug are great and up front and honest. I just can't figure out why so many are OVER-REACTING with worry. If any business' success is predicated on a few hours of outage then the business can't be all that in the first place. So the strength of Smugmug is in us loyal customers to wade through the ups and downs of the internet together.

    So, if you want to blame anyone, try Gore, after all he invented the internet!!!!!wings.gif

    I agree completely. I've been in IT for over 35 years and I've never seen a company as good as Smugmug about how they deal with issues like this. All IT companies have these problems. They don't want to have them but the bigger you get the more complex your infrastructure becomes and when that happens so do outages.

    Having started my IT career on an IBM Model 25 and looking at where things are today I'm continually amazed that any of it works! rolleyes1.gif
  • Baldy Registered Users, Super Moderators Posts: 2,853 moderator
    edited January 19, 2007
    TomaS wrote:
    I agree! The site seems to be smokin' (in a good way) this morning.

    Can we get an update on the repairs?

    Thanks for all the hard work and for keeping us informed.
    It was all-hands testing until after 3:00 a.m. PST this a.m. :whip

    A few of youz joined us on irc and we were pretty punchy by then. rolleyes1.gif

    Anyway, what Don didn't say when he installed new hardware yesterday is new database servers went in that are BLAZING. Hopefully the site should be fixed and flying. :ivar:ivar:ivar
  • kygarden Registered Users Posts: 1,060 Major grins
    edited January 19, 2007
    Baldy wrote:
    Hopefully the site should be fixed and flying.


    Certainly is fixed and certainly is flying. mwink.gif

    Now back to our regularly scheduled program...
