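This incident happened a couple of days ago, and I have only now found time to write it up. At 4 a.m. on the 25th I had to deal with a database outage. After connecting remotely to the company LAN, a psping check showed that port 1433 on the server was unreachable and the database could not be connected to, yet the host still answered ping. Logging on to the server, I found that SQL Server (MSSQLSERVER) and the other SQL Server services had not started. Zabbix also showed the services as stopped. I was baffled. Running systeminfo showed that the server had rebooted at 3:31 a.m., but the SQL Server services had not started automatically afterwards.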
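The "host answers ping but port 1433 is closed" check can be approximated without psping. The following is a minimal sketch in Python, not the tool used in this incident; the function name tcp_port_open and the 3-second timeout are my own assumptions.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling tcp_port_open with the database host and port 1433 returns False when the SQL Server service is down, even though an ICMP ping to the same host succeeds, because ping only proves the operating system is up.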
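Checking the error entries in the Windows event log, I found that automatic startup had failed for SQL Server and its related services, as shown below: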
A timeout was reached (30000 milliseconds) while waiting for the SQL Server (MSSQLSERVER) service to connect.
Several other auto-start services logged the same kind of error. Further down, the log also contains the following error:
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
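Kernel-Power Event 41 means that the system restarted without first shutting down cleanly. It is logged when the system stops responding, crashes, or loses power unexpectedly. For more information, see https://support.microsoft.com/zh-cn/help/2028504/windows-kernel-event-id-41-error-the-system-has-rebooted-without-clean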
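Entries such as Event 41 can also be pulled from the System log on the command line instead of scrolling through Event Viewer. The sketch below builds a wevtutil query from Python. It is only an illustration: the helper names build_event_query and query_system_events are hypothetical, and the default of 10 events is an arbitrary choice.

```python
import subprocess


def build_event_query(event_ids, max_events=10):
    """Build a wevtutil command that lists the newest System events with the given IDs."""
    if not event_ids:
        raise ValueError("at least one event ID is required")
    # XPath filter such as: *[System[(EventID=41 or EventID=7000)]]
    id_filter = " or ".join(f"EventID={int(i)}" for i in event_ids)
    return [
        "wevtutil", "qe", "System",
        f"/q:*[System[({id_filter})]]",
        f"/c:{int(max_events)}",  # limit the number of events returned
        "/rd:true",               # newest events first
        "/f:text",                # plain-text output
    ]


def query_system_events(event_ids, max_events=10):
    """Run the query (Windows only) and return the raw text output."""
    result = subprocess.run(
        build_event_query(event_ids, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the affected server, query_system_events([41]) would list the most recent unclean-shutdown events. Other event IDs can be passed in the same list.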
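So at this point we know the system rebooted abnormally. But why did the auto-start services (SQL Server among them) fail to start after the reboot? The Microsoft article "A service does not start, and events 7000 and 7011 are logged in the Windows event log" covers this briefly, though without much detail: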
The service control manager waits for the time that is specified by the ServicesPipeTimeout entry before logging event 7000 or 7011. Services that depend on the Windows Trace Session Manager service may require more than 60 seconds to start. Therefore, increase the ServicesPipeTimeout value appropriately to give all the dependent services enough time to start.
For more information, click the following article number to view the article in the Microsoft Knowledge Base:
839803 The Windows Trace Session Manager service does not start and Event ID 7000 occurs
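In other words, the service control manager waits for the SQL Server service to start, but because of resource contention or a slow dependency the service did not connect within 30 seconds, so the service control manager logged the timeout. One explanation found online puts it this way: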
It could be that some other dependent components (the disk, network shares, etc) take longer to start up. Could you set the service to start as Automatic (delayed)
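Only later, after talking with the system administrator, did I learn the root cause: one host in the Nutanix cluster running the database server (a virtual machine) had failed, and the VM was automatically failed over to another host, which restarts the VM. Three VM servers were affected at the time (running both SQL Server 2008 and SQL Server 2014).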
Solution:
1: Change the startup type of SQL Server (MSSQLSERVER) and the related services to "Automatic (Delayed Start)".
2: Change the service timeout value.
To change the service timeout value:
1:Click the Start button, then click Run, type regedit, and click OK.
2:In the Registry Editor, click the registry subkey HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control.
3:In the details pane, locate the ServicesPipeTimeout entry, right-click that entry and then select Modify. (Set this value to 60 or 120 seconds, that is, 60000 or 120000 milliseconds.)
Note: If the ServicesPipeTimeout entry does not exist, you must create it by selecting New on the Edit menu, followed by the DWORD Value, then typing ServicesPipeTimeout, and clicking Enter.
4:Click Decimal, enter the new timeout value in milliseconds, and then click OK.
5:Restart the computer.
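Both changes above can also be scripted rather than clicked through. The following is a minimal sketch, assuming an elevated (administrator) Python session on the Windows server. The function names are hypothetical, and like the manual steps it has not been verified against the original failure. It uses sc.exe to switch a service to delayed automatic start and the standard winreg module to write ServicesPipeTimeout.

```python
import subprocess

# Registry subkey named in the steps above (under HKEY_LOCAL_MACHINE).
CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control"


def delayed_start_command(service_name: str) -> list[str]:
    """Build the sc.exe command that sets a service to Automatic (Delayed Start)."""
    # sc.exe expects "start=" and its value as separate arguments.
    return ["sc.exe", "config", service_name, "start=", "delayed-auto"]


def seconds_to_milliseconds(seconds: int) -> int:
    """ServicesPipeTimeout is stored in milliseconds."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return seconds * 1000


def set_services_pipe_timeout(seconds: int) -> None:
    """Create or update the ServicesPipeTimeout DWORD (Windows only)."""
    import winreg  # imported here so the module still loads on other systems

    value = seconds_to_milliseconds(seconds)
    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, CONTROL_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, "ServicesPipeTimeout", 0, winreg.REG_DWORD, value)


def apply_fixes(service_name: str = "MSSQLSERVER", timeout_seconds: int = 60) -> None:
    """Apply both changes; a reboot is still needed for the timeout to take effect."""
    subprocess.run(delayed_start_command(service_name), check=True)
    set_services_pipe_timeout(timeout_seconds)
```

Calling apply_fixes() with the defaults targets the default instance service MSSQLSERVER and a 60-second timeout. Related services (for example SQL Server Agent) would need their own call to delayed_start_command with the appropriate service name.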
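However, I could not reproduce the failure, so I could not test whether the solutions above actually fix it. The analysis above should nonetheless be broadly correct.
References: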
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2008-R2-and-2008/dd349403(v=ws.10)
https://support.microsoft.com/en-in/help/922918/a-service-does-not-start-and-events-7000-and-7011-are-logged-in-window