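Recently I noticed that the port of a Socket service in our company's test environment kept going down for no apparent reason, even though the service itself still showed as running. It looked like the process had hung.
It is only a test environment, but that is no reason to leave it alone, so I wrote a simple monitoring script overnight. The server runs Windows, so the script uses the wmi module. The logic is as follows: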
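1. Use the wmi module to collect the services that are in the stopped state and build a dictionary from them.
2. Check whether the monitored service is in the dictionary. If it is, the service has stopped, so the script tries to start it and sends an alert email.
3. Send a connect to the local Socket service port. If an exception is caught, the script tries to restart the service and sends an alert email.
4. On each run the script repeats the steps above up to three times, 10 seconds apart, to make sure the service is healthy.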
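While running it I ran into one problem: operating on Windows through the wmi module from Python is remarkably slow. I don't know whether there is a better alternative; if anyone knows one, please point it out.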
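One possible alternative, offered as a sketch rather than a measured improvement: ask `sc query` about the single service in question instead of enumerating every service through WMI. The helper names `parse_sc_state` and `query_service_state` are hypothetical, and the regular expression assumes the English `STATE : <code>  <WORD>` line that `sc query` prints; localized systems may differ. The wmi module also accepts keyword filters such as `c.Win32_Service(State="Stopped")`, which may be worth trying first.

```python
import re
import subprocess

# Matches the STATE line of `sc query` output, e.g. "STATE : 4  RUNNING".
_STATE_RE = re.compile(r'^\s*STATE\s*:\s*(\d+)\s+(\w+)', re.MULTILINE)


def parse_sc_state(output):
    """Return the state word (e.g. 'RUNNING', 'STOPPED') found in
    `sc query` output, or None if no STATE line is present."""
    match = _STATE_RE.search(output)
    return match.group(2) if match else None


def query_service_state(service_name):
    """Query one service with `sc query` (Windows only).

    service_name must be the service's key name, not its display name.
    """
    proc = subprocess.Popen(['sc', 'query', service_name],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return parse_sc_state(out.decode('ascii', 'replace'))
```

With something like this, the stopped check could become `query_service_state(name) == 'STOPPED'`; whether it is actually faster than WMI on a given machine would need to be measured.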
The source code is as follows:
#!/usr/bin/env python
import os
import wmi
import time
import socket
import base64
import smtplib
import logging
from email.mime.text import MIMEText
def get_stop_service(designation):
    """Collect the stopped services and check whether
    the 'designation' service is among them.

    :return: 'down' if the service is stopped, otherwise True
    """
    c = wmi.WMI()
    ret = dict()
    for service in c.Win32_Service():
        # Caption is the display name of the service
        state, caption = service.State, service.Caption
        if state == 'Stopped':
            t = ret.get(state, [])
            t.append(caption)
            ret[state] = t
    # If the 'designation' service is in 'Stopped', the status is 'down'.
    # Default to [] so that "no stopped services" does not raise TypeError.
    if designation in ret.get('Stopped', []):
        logging.error('Service [%s] is down, try to restart the service. \r\n' % designation)
        return 'down'
    return True
def monitor(sname):
    """Try a socket connection to port 20000 on the local machine.

    :return: the string 'ex' if the connection fails, otherwise True
    """
    s = socket.socket()
    s.settimeout(3)  # timeout in seconds
    host = ('127.0.0.1', 20000)
    try:  # Try to connect to the host
        s.connect(host)
    except socket.error as e:
        logging.warning('[%s] service connection failed: %s \r\n' % (sname, e))
        return 'ex'
    finally:
        s.close()  # always release the socket
    return True
def restart_service(rstname, conn, run):
    """First check whether the service is stopped;
    if it is, start the service directly.
    Then check whether it is hung (the port does not answer);
    if it is, stop and start the service.

    :return: True if the service was started or is healthy, otherwise False
    """
    flag = False
    try:
        # 'run' is the return value of get_stop_service(),
        # 'conn' is the return value of monitor()
        if run == 'down':
            ret = os.system('sc start "%s"' % rstname)
            if ret != 0:
                raise Exception('[Errno %s]' % ret)
            flag = True
        elif conn == 'ex':
            # Note: 'sc stop' may return before the service has fully stopped
            retStop = os.system('sc stop "%s"' % rstname)
            retStart = os.system('sc start "%s"' % rstname)
            if retStart != 0:
                raise Exception('retStop [Status code %s] '
                                'retStart [Status code %s] ' % (retStop, retStart))
            flag = True
        else:
            logging.info('[%s] service running status is normal' % rstname)
            return True
    except Exception as e:
        logging.warning('[%s] service restart failed: %s \r\n' % (rstname, e))
    return flag
def send_mail(to_list, sub, contents):
    """Send alarm mail.

    :return: True if the mail was sent, otherwise False
    """
    mail_server = 'mail.stmp.com'  # SMTP server
    mail_user = 'YouAccount'  # Mail account
    mail_pass = base64.b64decode('Password')  # Base64-encoded (not encrypted) password
    mail_postfix = 'smtp.com'  # Domain name
    me = 'Monitor alarm<%s@%s>' % (mail_user, mail_postfix)
    message = MIMEText(contents, _subtype='html', _charset='utf-8')
    message['Subject'] = sub
    message['From'] = me
    message['To'] = ', '.join(to_list)  # header addresses are comma-separated
    flag = False  # Whether the mail was sent successfully
    try:
        s = smtplib.SMTP()
        s.connect(mail_server)
        # b64decode() returns bytes on Python 3, so decode before login
        s.login(mail_user, mail_pass.decode('utf-8'))
        s.sendmail(me, to_list, message.as_string())
        s.close()
        flag = True
    except Exception as e:
        logging.warning('Send mail failed, exception: [%s]. \r\n' % e)
    return flag
def main(sname):
    """Take the name of the service to monitor,
    run the functions defined above in turn and return the result.
    Each run tests up to three times,
    10 seconds apart.

    :return: retValue
    """
    retry = 3
    count = 0
    retValue = False  # Holds the state of the socket check
    while count < retry:
        ret = monitor(sname)
        if ret != 'ex':  # If the socket connection is normal, return retValue
            retValue = ret
            return retValue
        isDown = get_stop_service(sname)
        restart_service(rstname=sname, conn=ret, run=isDown)
        host = socket.gethostname()
        address = socket.gethostbyname(host)
        mailto_list = ['[email protected]', ]  # Alarm contacts
        send_mail(mailto_list,
                  'Alarm',
                  ' <h4>Level: <u>ERROR</u><br/> Host name: %s<br/>'
                  ' IP Address: %s<br/>'
                  ' Service name:</h4> <h5>%s</h5>'
                  % (host, address, sname))
        count += 1
        time.sleep(10)
    else:
        # Reached only when all three attempts failed
        logging.error('[%s] service restart attempted three times without success \r\n' % sname)
    return retValue
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s',
                        datefmt='%Y/%m/%d %H:%M:%S',
                        filename='D:\\logs\\Monitor.log',
                        filemode='a')  # text append; 'ab' fails on Python 3
    name = 'Service Name'
    response = main(name)
    if response:
        logging.info('The [%s] service connection is normal \r\n' % name)
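The code above still has room for improvement: the names of several services could be written to a file, and the program could read that file and check each service in turn.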
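A minimal sketch of that improvement, under a few assumptions of my own: the file holds one service name per line, blank lines and lines starting with `#` are ignored, and the helper names `load_service_names` and `check_all` are hypothetical.

```python
import io


def load_service_names(path):
    """Read service names from a text file, one per line.

    Blank lines and lines starting with '#' are skipped (an assumed format).
    """
    names = []
    with io.open(path, encoding='utf-8') as handle:
        for line in handle:
            name = line.strip()
            if name and not name.startswith('#'):
                names.append(name)
    return names


def check_all(path, check):
    """Call check(name) for every service listed in the file and
    return a dict mapping each name to the result."""
    return {name: check(name) for name in load_service_names(path)}
```

Passing the script's `main` function as `check` would run the existing logic once per service. Since `monitor()` hardcodes port 20000, checking several services this way would also need a port per service, for example as a second column in the file.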