Bug 7238

Summary: we should monitor more things such as servers' date
Product: Infrastructure Reporter: Thierry Vignaud <thierry.vignaud>
Component: Others Assignee: Sysadmin Team <sysadmin-bugs>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: Normal CC: sysadmin-bugs, tmb
Version: unspecified   
Target Milestone: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Source RPM: CVE:
Status comment:
Bug Depends on: 7228    
Bug Blocks:    

Description Thierry Vignaud 2012-08-29 10:41:34 CEST
As shown in bug #7228, we should monitor more things such as servers' date.

Also, as yesterday's mail failure shows, we should monitor more services and send mail alerts to the sysadmin ML on errors.
Thierry Vignaud 2012-08-29 10:41:59 CEST

Depends on: (none) => 7228

Comment 1 Thomas Backlund 2012-08-29 11:03:06 CEST
We already do Xymon monitoring, which alerts us on a separate list so as not to flood the sysadm list, and Sympa also notified us that it died when it lost DB access.

But as there was server maintenance in progress in the DC, there was no point in restarting services just to have them fail again.

Status: NEW => RESOLVED
CC: (none) => tmb
Resolution: (none) => FIXED

Comment 2 Thierry Vignaud 2012-08-29 11:09:59 CEST
Do we monitor the servers' date too?
Comment 3 Thomas Backlund 2012-08-29 11:17:53 CEST
Yep, for example, here is the head of yesterday's mail regarding valstar:

yellow Tue Aug 28 14:37:48 CEST 2012 up: 18:07, 1 users, 203 procs, load=0.16
&yellow System clock is -7202 seconds off (max 60)
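As a rough illustration of the kind of check behind that alert, here is a minimal Python sketch that compares a host's clock with a reference time and turns the result into a status color and message. This is not Xymon's implementation. The function name `clock_status` is hypothetical, the 60-second threshold is simply taken from the alert above, and the sign convention (offset = local time minus reference time, so a negative value means the local clock is behind) is an assumption that may not match how Xymon reports it.

```python
from typing import Tuple


def clock_status(local_epoch: float, reference_epoch: float,
                 max_offset: int = 60) -> Tuple[str, str]:
    """Hypothetical clock-skew check returning (color, message).

    Assumption: offset = local - reference, so a negative offset means
    the local clock is behind the reference (sign convention not
    confirmed against Xymon).
    """
    # Whole seconds of skew between the host and the reference clock.
    offset = round(local_epoch - reference_epoch)
    # Flag the host as "yellow" once the skew exceeds the allowed maximum.
    color = "yellow" if abs(offset) > max_offset else "green"
    return color, f"System clock is {offset} seconds off (max {max_offset})"
```

For a host whose clock is roughly two hours behind, `clock_status(1000.0, 8202.0)` yields `("yellow", "System clock is -7202 seconds off (max 60)")`, which is the same shape as the valstar alert.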