I keep getting an "Http/1.1 Service Unavailable" page.
I understand that things do happen, but since I signed up we have had 3 different sets of downtime.
It makes using SmugMug for business very, very frustrating.
Yep and we know it. We're really sorry. I'm guessing it's related to the previous issue, and I'm sure that onethumb and wireless will have it up and going in no time.
Don't mean to sound ungrateful, as I do appreciate you all catching it very quickly and your punctuality in notifying people. It really sounds like you all don't normally have these kinds of problems. But with me just signing up, it's just frustrating. I hope you all can get things stabilized soon.
Thanks again for reacting so quickly.
Our engineers are aware, and working on it. Will post updates here. Sorry for the hassle.
Eureka!
I know this is difficult for everyone to understand, but we actually *do* need to keep letting the site crash like this.
Why? Because we can't fix the problem without knowing what causes it. So every time it goes down, we change one thing we think might help, and wait to see if it made a difference.
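For anyone curious, the one-change-at-a-time approach described above can be sketched roughly as follows. This is only an illustration: the function name `isolate_fault`, the change labels, and the simulated `real_cause` are all hypothetical and are not SmugMug's actual tooling.

```python
def isolate_fault(changes, crashes_again_after):
    """Apply one candidate change per outage and stop at the first one that helps.

    changes: candidate fixes, tried in order, one per outage (hypothetical labels).
    crashes_again_after: callable returning True if the site still crashed
        after that single change was made.
    Returns (change_that_helped_or_None, list_of_changes_tried).
    """
    tried = []
    for change in changes:
        tried.append(change)
        if not crashes_again_after(change):
            return change, tried
    return None, tried


# Simulated run: pretend only "change C" addresses the real cause.
real_cause = "change C"  # assumption for the demo only
helped, tried = isolate_fault(
    ["change A", "change B", "change C"],
    lambda change: change != real_cause,
)
print(helped, tried)
```

Changing only one thing per outage is what lets the result be pinned on that change. The cost is that each wrong guess means sitting through another outage.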
This time, finally, we hit the jackpot. We know definitively what piece of hardware is failing, and we have spares standing by. It is a core piece of hardware, so fixing it permanently will result in the site being down for a while.
We're getting the site back up right now, and hope that it will limp along until maintenance tonight, but we're starting to prepare just in case it can't.
More as I get it.
Don
It is a core piece of hardware, so fixing it permanently will result in the site being down for a while.
Don
Can you please expand upon this statement? What is the estimated window you will need for repair? As an IT professional, I understand things happen, and I sympathize with it taking time to identify the problem. But repair work should have an estimated duration. What is that duration? I'm in the middle of posting results from a shoot this past weekend, and I'm getting complaints from customers that the site is not available. It's not those people that concern me, though; it's the people from the event that DON'T say anything and just don't come back. So, if you could provide more information about the recovery window, I can pass it along to my customers.
Now that the problem is identified, a generic "we're working on it and doing the best we can" isn't enough for business clients of mine. Just like the IT department in my company must provide our business clients with estimates, I expect the same from SmugMug, my business partner. I realize you're doing the best you can to identify and fix things. But you also have to help us plan and manage our clients.
Thanks,
John
As an IT professional, you probably realize that any estimate I give you is simply a best guess, right?
Because I've worked with some of the largest IT organizations on the planet, and have friends working at them to this day, and they rarely meet their estimates. It's sorta like software development that way - Murphy's law strikes almost every time.
In all honesty, I think and hope the problem can be fixed in less than 30 minutes if everything went perfectly, but let's double that to an hour just to be on the safe side. That way, when it takes two hours, I'll only be off by 50%.
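To spell out the arithmetic in that last line, here is a small sketch. It assumes "off by 50%" means the miss is measured against the actual duration, which is one reading of the remark; the function name `relative_error` is just an illustrative choice.

```python
def relative_error(estimate_min, actual_min):
    """Fraction by which the estimate missed, measured against the actual duration."""
    return abs(actual_min - estimate_min) / actual_min


best_case_min = 30               # "less than 30 minutes if everything went perfectly"
padded_min = best_case_min * 2   # doubled to an hour "to be on the safe side"
actual_min = 120                 # the joked-about two-hour outcome

print(relative_error(padded_min, actual_min))     # padded estimate vs. two hours
print(relative_error(best_case_min, actual_min))  # unpadded estimate vs. two hours
```

Under this reading, the padded one-hour estimate misses a two-hour repair by half, while the unpadded 30-minute figure would have missed by three quarters. Measured against the estimate instead of the actual time, the one-hour figure would be off by 100%.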
FYI, we've narrowed it down to one of three things at this point: either an optical fibre channel cable (unlikely, since the error rate is so low), a bad disk inside of a RAID array (unlikely, since we're not getting any errors from the controller), or a bad RAID controller. We think it's the latter, and that piece of hardware isn't hot-swappable and doesn't have a hot-standby. It is a tool-less swap, though, so theoretically it should be very fast - but we'll just have to see.
After it's been replaced, we then need to bring the data back online and do an integrity check. This will theoretically take the bulk of the time (15-30 minutes) since swapping the card is so easy and we have a relatively large amount of recently touched data to verify. So much of the repair downtime will just be waiting for data to spool off disk.
The site is back up, and the error rate is relatively low for something that we're pushing many GBs through, so I'm hoping we can last until 10pm Pacific tonight without another crash, but should we crash again, we'll start implementing this repair process immediately.
More as I get it.
Don
Thanks for sharing. Yep, I understand about estimates. But to summarize your reply, it sounds like you are implementing the fix now and the estimated time to recovery is no more than 2 hours. So if I tell my clients the site will be available and working by 5 pm, that should work.
Thanks for the quick reply - it will really help me out.
EDIT - looks like I read too fast and you're not planning on fixing it right away. When do you plan on taking the system down for the planned fix if it doesn't crash again?
Our weekly scheduled maintenance window begins at 10pm Pacific tonight, so hopefully we'll last until then, but if we don't, we'll do it immediately upon failure.
Don
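For passing a rough window along to clients, the rule above (repair at the 10pm maintenance window, or immediately if the site crashes first) can be sketched like this. It is a rough illustration only: the function name `repair_window` is hypothetical, the date is a placeholder, times are treated as naive Pacific local times, and the one-hour duration is the padded best guess from earlier in the thread, not a guarantee.

```python
from datetime import datetime, timedelta

PADDED_ESTIMATE = timedelta(hours=1)  # the padded one-hour guess from the thread


def repair_window(maintenance_start, crash_time=None, estimate=PADDED_ESTIMATE):
    """Return (start, expected_end) for the repair.

    The repair starts at the scheduled maintenance time, or at the crash
    time if a crash happens before the window opens.
    """
    if crash_time is not None and crash_time < maintenance_start:
        start = crash_time
    else:
        start = maintenance_start
    return start, start + estimate


# Placeholder date; only the 10pm start time comes from the thread.
tonight_10pm = datetime(2000, 1, 1, 22, 0)

print(repair_window(tonight_10pm))
print(repair_window(tonight_10pm, crash_time=datetime(2000, 1, 1, 18, 30)))
```

With no crash, the sketch gives a 10pm start and an 11pm expected end. With a hypothetical 6:30pm crash, both move up to 6:30pm and 7:30pm.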
Site is up again at 9pm. Thanks