Error logs hunt-n-fix sprint

This page helps organize sprints for digging through the error logs of Wikimedia production services and cleaning up any oddities that made it into deployment. The logs can point to deeper problems, or they may just contain simple junk that needs cleaning up.

Finding errors

Phabricator

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Logstash

https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors

Log files

Copied from http://noc.wikimedia.org/~reedy/fatalmonitor

mwlog1001:~ wikidev$

watch "tail -n 1000 /home/wikipedia/syslog/apache.log | grep 'PHP\|Segmentation fault' | sed -r 's/\[notice\] child pid [0-9]+ exit signal //g' | sed 's/, referer.*$//g' | cut -d ' ' -f 7- | sort | uniq -c | sort -rn"
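
The one-liner above packs several stages into one pipeline. As a rough sketch, the same stages can be wrapped in a shell function that reads log lines from standard input, with one comment per stage. The function name summarize_errors is made up for illustration and is not part of fatalmonitor. Like the original, it assumes GNU grep and sed.

```shell
#!/bin/sh
# Hypothetical helper: the same stages as the fatalmonitor one-liner,
# reading log lines from stdin instead of tailing a fixed file.
summarize_errors() {
    # Keep only lines mentioning PHP or a segmentation fault.
    grep 'PHP\|Segmentation fault' |
    # Drop the "[notice] child pid N exit signal " prefix so segfaults group together.
    sed -r 's/\[notice\] child pid [0-9]+ exit signal //g' |
    # Drop the trailing referer so identical errors from different pages match.
    sed 's/, referer.*$//g' |
    # Drop the first six space-separated fields (timestamp, host, and so on).
    cut -d ' ' -f 7- |
    # Count identical messages and list the most frequent first.
    sort | uniq -c | sort -rn
}
```

Assuming the same log path as above, it could be run as: tail -n 1000 /home/wikipedia/syslog/apache.log | summarize_errors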

Origin

<RoanKattouw> He said something like "we could spend a day just fixing the stuff our error logs turn up"
<sumanah> indeed that does sound plausible!
<Reedy> a day?
<Reedy> I'm sure we could make a week out of it...
<^demon> Well, more I'm sure. But there's a *lot* of low-hanging fruit in the logs.
<Reedy> heh, indeed
<^demon> Simple stuff like val/ref mismatch, missing params, fatals for stupid mistakes, etc.
<sumanah> a day would be a good place to start
<RoanKattouw> Like ^demon said we could just hand out stuff to people
<RoanKattouw> I guess I should spend that time trying to fix those stupid permissions issues on the cluster

Events

Events where a sprint on this topic was held (large or small):

  • Berlin Hackathon 2011
  • New Orleans Hackathon 2011: Some attendees at the NOLA Hackathon 2011 attempted to work on this. They happened to pick a moment when very few errors were showing up in the logs, so they did not have much to work with. However, Roan fixed an issue that was cluttering the logs with symbolic link errors. As a result, we expect log sizes to go way down, which saves space and makes more serious errors easier to find.
  • Berlin Hackathon 2012