We ran into an issue last week where our SCOM database started growing rapidly and managed to fill up the drive, this caused the data warehouse to go into Suspect mode. There’s a lot of solutions out there for suspect mode, but the worrying thing i have found is most just blindly give out some standard steps to get a database back and dont hint at the possible carnage that might follow, With that im going to try and give out some details of what these steps are doing, where to be careful and how to do it in a more DBA, business aware method.
Finding a database in Suspect Mode
So the worst way to find this is, to stumble on it in SSMS, as a DBA hopefully your running checks for this sort of thing (EPM is a great and easy solution, check out my daily report from EPM here) all your really doing is checking the status of databases in sys.databases view
select name, state_desc from sys.databases
What puts a Database into Suspect Mode
Improper shutdown of the database server
Corruption of the database files
Unavailable device files
Unavailable database files
Database resource used by operating system
SQL Server incorrectly asserts free data page space when a row is inserted
First Steps When Finding Suspect Mode
So you found a Suspect DB, lets do the standard, Take a backup
BACKUP DATABASE [OperationsManagerDW] TO DISK ='Z:\SQL_BACKUPS\Suspect\OpsMngDW.bak' WITH CHECKSUM, CONTINUE_AFTER_ERROR;
For more details on this check out Paul Randall’s blog post Example 2000/2005 corrupt databases and some more info on backup, restore, page checksums and IO errors
We also know we are going to be running CHECKDB so now is good time to set some expectations to the business, your boss and whoever else you might need to inform. I used my Knowing your CheckDB duration Times (Using Ola Hallengren Maintenance Scripts) to get some expected duration for running checkdb, with that you can start saying if no corruption is found it will take X if we have corruption ill need to run CHECKDB at least once more maybe few so Y amount of time.
Next we are going to go into Emergency Mode, so lets see what that means:
What is Emergency Mode
it’s fully documented and supported. This isn’t some dodgy advice off the internet from Random User, its right there in Books Online.
it’s a one-way operation. Anything it does cannot be rolled back or undone. If this worries you (which would surprise me because if someone ever got into that state then surely they don’t have the healthy concern about their data that they should have in the first place) then make a copy of the damaged database before you run emergency mode repair.
as its a one-way operation, you cannot wrap it in an explicit user-transaction.
it’s the only repair option available in emergency mode – if you try to use REPAIR_REBUILD then it won’t work.
Taken from Paul Randall’s post: Ta da! Emergency mode repair
The code below will then be ran:
ALTER DATABASE [OperationsManagerDW] SET EMERGENCY GO ALTER DATABASE [OperationsManagerDW] SET SINGLE_USER GO DBCC CHECKDB ([OperationsManagerDW]) WITH NO_INFOMSGS, ALL_ERRORMSGS; GO
Now the CHECKDB might fail for you, due to what i spoke about before (See here) where theirs not enough room on the drive to create the snapshot that CHECKDB runs off. If this is the case, read my previous post for solutions.
If you do have corruption, your going to have to deal with (luckily we didn’t!) and hopefully you’ve been playing or at least looking at Steve Stedmans Corruption Challenge, reading Paul Randall’s blog or watching hes videos on PluralSight. Again the main thing to remember is your in Emergency mode, so if you had corruption in only Non-Clustered indexes you cant just run CHECKDBB REPAIR_REBUILD as its not supported.
Once the corruption is resolved you can change the database back to multi user mode,
ALTER DATABASE [OperationsManagerDW] SET MULTI_USER GO