[RESOLVED] Grid Issues
– [RESOLVED 9/24/2018 @ 1:48pm EST]
The new database is online – Thank you all for your patience.
– [UPDATED 9/23/2018 @ 6:30pm EST]
The new database has been loaded and is completely caught up with all of the changes since our issues began, but we’ve decided to do the “official” switchover tomorrow morning (9/24/2018) at approx. 7am grid time. This gives everyone 12 hours’ advance notice, and since we’ll need to restart all of the regions anyway, it’s also a good time to install Microsoft updates on all of the servers. We expect downtime to be approx. 3-4 hours.
Again, thank you all for your patience.
– [UPDATED 9/23/2018 @ 11:30am EST]
The loading of the new database is almost complete. I estimate approx. 3 hours remain, but once it’s done we still need to wait for the new database to load all of the changes which have happened since the backup was made. Depending on how long it takes to “catch up”, we may take the grid offline later tonight and switch over to the new database, but if it’s too late (my time) it will have to wait until tomorrow AM grid time. We’ll be sure to keep everyone up to date in this article as we go, and in our community chat inside the grid.
Thanks again for your patience.
– [UPDATED 9/21/2018 @ 9:30am EST]
Our new server was delivered yesterday morning and work began to rebuild our core database.
If you know anything about creating a backup of a MySQL database, you know it can take some time. Our database is just slightly over 2 TB in size, and our backup is currently approx. 1/3 complete.
We are currently backing up at a rate of 35 GB/hr. Once the backup has completed, we’ll need to transfer it to our new core server, and then we can start loading it.
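For anyone curious how long that leaves to go, the remaining backup time at a steady rate is simple arithmetic. Here is a small sketch using the figures above (the ~2 TB size, 1/3-complete progress, and 35 GB/hr rate are from this post; the exact byte counts are approximations):

```python
# Rough ETA for the remaining backup, using the figures from this update.
def hours_remaining(total_gb, done_fraction, rate_gb_per_hr):
    """Hours left to back up the remaining data at a steady rate."""
    remaining_gb = total_gb * (1.0 - done_fraction)
    return remaining_gb / rate_gb_per_hr

# ~2 TB database, roughly 1/3 already backed up, at ~35 GB/hr.
eta = hours_remaining(total_gb=2048, done_fraction=1 / 3, rate_gb_per_hr=35)
print(f"Approx. {eta:.0f} hours of backup time remaining")
```

At these numbers that works out to roughly a day and a half of dump time alone, before the transfer and reload steps even begin, which is why rebuilds of a database this size take days rather than hours.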
Once this backup is loaded and running, we’ll then replay any changes which have been made to our slave database since our issues began. Once the new database has “caught up” with all of our changes, we’ll switch the grid over to using it instead.
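The “catch up, then switch” decision above can be sketched in a few lines. In MySQL, replication lag is reported as `Seconds_Behind_Master` by `SHOW SLAVE STATUS`; the `get_lag_seconds` callable below is a hypothetical stand-in for a wrapper around that query (returning `None` while replication isn’t running, mirroring MySQL’s NULL), not part of any real library:

```python
# Sketch of the "catch up, then switch" logic described above.
# get_lag_seconds is a hypothetical callable that would wrap a
# SHOW SLAVE STATUS query and return Seconds_Behind_Master,
# or None if replication is not running (MySQL reports NULL).
def ready_to_switch(get_lag_seconds, max_lag=0):
    """True once the new database has fully caught up with the master."""
    lag = get_lag_seconds()
    return lag is not None and lag <= max_lag

# Example with stubbed lag readings:
assert not ready_to_switch(lambda: 120)   # still 2 minutes behind
assert not ready_to_switch(lambda: None)  # replication not running
assert ready_to_switch(lambda: 0)         # caught up: safe to switch
```

The point of waiting for zero (or near-zero) lag before switching is that any writes still in flight on the old master would otherwise be lost when the grid starts pointing at the new server.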
In theory, we should be back up and running with slightly better performance than before our issues began: our new server is slightly faster than the old one, has more RAM and more storage, and we’ve added a couple more drives, which should make our database perform a bit quicker.
Once we are up and running on our new server, the database we are currently using will be set back to “slave” duties. Our other “slave” databases will be switched to take their information from our new master database, and we’ll be back to normal.
I also plan to upgrade one of our slaves to better-performing hardware, so if the need should ever arise again to switch over to a slave database, we’ll be able to perform a bit better.
Once we have our new server in operation, our main database will have 4 TB of solid-state storage and run on a 16-core server with 96 GB of RAM. This new server will be connected to the internet with a dedicated 1 Gbps connection, and it will talk to our 4 slave databases over a “local” network at 10 Gbps.
Once completed, we will be spending approx. $1500/mo. just for our core servers/services.
As many know, I often help other grids with tech issues and I also host several other grids. I am always telling everyone the importance of backups and redundancy. Most grids simply don’t want to spend the money to build a backup system like ours; the truth is, they charge far too little for their regions and simply cannot afford the added cost of more machines to use as slaves and a fast-performing core.
I don’t “kiss and tell”, and I am diligent about remaining professional in my dealings with other grids, but the number of times I’ve seen other grids not running even the most basic of backups is very scary. An issue like this would have killed them.
I was once asked to help a grid which had been online for several years. I logged into their servers to begin my work and couldn’t find a slave database. I asked about it and was told they couldn’t afford one. I asked how long it had been since they last backed up their main database, and they answered that they had never made a backup because they didn’t know how.
I told them, just as I tell everyone else: you are running on borrowed time.
Your database will eventually fail. They all do; it’s just a matter of time before it happens. It could be today, tomorrow, next week, next month, or next year, but at some point it will fail, I promise. I also advised them to raise the pricing on their services so they can afford to put in the backup systems they need.
Oftentimes, people start a grid with the best of intentions, but they don’t have all of the knowledge they need to run it themselves, and they don’t have the money to spend on the systems they need.
It seems every new grid thinks they have to have the lowest cost in order to attract users, and this low cost is what keeps them from being able to afford the systems they need.
I point out that their users assume they have these systems in place to protect their content and builds. I ask them what their plan is to recover their users’ content should their single database fail, and they answer that they hope to be able to afford a single slave soon.
The reality is, it costs money to operate a grid. It takes even more money to operate a grid with systems in place to protect users’ investment of time and money. Some people think everything in OpenSim should be free. I would rather pay a little for something I know I can depend on. If I’m going to invest my money and time in a grid, I want to be sure they have systems in place to protect me and my investments.
I’ve been working with OpenSim since December of 2007 and started my first grid in early 2008. Since that first grid, I have always made it a point to have a good backup system in place, and I have always made sure I have a backup server to use in case my main server should fail. I do this because I realize my users are putting their trust in me. They pay me each month for their regions, not just to enjoy building and hanging out with their friends, but to ensure I have systems in place to protect their investment of time and money should something go wrong. I realize this, and I’ll always remain diligent about having such systems in place to protect our users.
To those other grids reading this, here’s some advice: do not start a grid unless you have the means to purchase the equipment needed to protect your users’ investments.
Don’t start a grid unless you know how to manage these systems or can afford to pay someone to manage them for you.
Don’t charge so little that you can’t afford to pay for the basic systems you need to protect your users.
As a grid owner/manager, you owe it to your users to watch out for them. They pay you each month and they assume you have the basic backup systems in place to protect them.
Remember, grid owners: it’s your reputation that is on the line, and all of your users are counting on you to take care of them in exchange for the money they pay you each month. Without your users there’s no reason to have a grid; it’s your users and their experiences on your grid which will define you and your grid forever.
Always treat them with respect and dignity, always be fair, always put them first, always be honest, and always do the right thing. If you do these things, you will likely succeed.
Last but not least, there’s one thing we know about databases: they will all eventually fail. We knew this, and it’s the reason we run so many slave servers and the reason we make and store so many backups. Had we not been prepared with backups and slave databases to take over, our grid would have been down until we were able to rebuild the main database. Had we not had the means to generate the backups we needed, we would not have been able to rebuild our main database at all. So grid owners, if you do not yet have your backup systems and database redundancies in place, seek them out now. Don’t wait until you have a major failure to worry about them; by then, it will be too late. You owe it to your users. That’s why they pay you.
For the crowd who thinks “free” is the only way for them: you often get what you pay for.
– [UPDATED 9/19/2018 @ 1:30pm EST]
The news on our main database is not good: it seems we have corruption caused by a faulty RAID card. Thankfully, we have several backup databases and have been able to use one of them to keep us online while we rebuild the main database. This has resulted in slower-than-normal performance, but at least we’re still up and running.
We expect it to be several days before the work is completed, as our database is very large and will take considerable time to rebuild. We’re sorry for the trouble and will work as fast as we can to resolve the issue.
In the meantime, if you are trying to log in, let the viewer “cook” (load) until either you get logged in or it times out. If it fails, keep trying and you’ll eventually get in. Avatars with smaller inventories will have less trouble than avatars with larger inventories. Once you’re logged in, performance isn’t terrible and is still very usable, although uploads are much slower than we’re used to.
We’ll keep you updated as we know more and thank you for your patience.
– At approx. 10pm EST on 9/19/2018 our main database went down, but we were able to switch over to one of our backup databases. It is far slower than our main database, making logging in painfully slow for some users.
Avatars with smaller inventories will log in much quicker than those with large inventories. I’m afraid we’ll have to “limp” along until we get our main database back online, and we aren’t yet sure of a time frame as we are still working to resolve the issue.
If you can use an “Alt” avatar with a small inventory, you will likely have a better experience until we get the issue resolved.
We’ll do our best to resolve this as quickly as we can and we appreciate your patience.