Error logs hunt-n-fix sprint
This page supports sprints devoted to digging through the error logs of Wikimedia production services and cleaning up oddities that made it into deployment. An entry in the logs may point to a deeper problem, or it may simply be junk that needs cleaning up.
Finding errors
Phabricator
https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Logstash
https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors
Log files
Copied from http://noc.wikimedia.org/~reedy/fatalmonitor
mwlog1001:~ wikidev$
watch "tail -n 1000 /home/wikipedia/syslog/apache.log | grep 'PHP\|Segmentation fault' | sed -r 's/\[notice\] child pid [0-9]+ exit signal //g' | sed 's/, referer.*$//g' | cut -d ' ' -f 7- | sort | uniq -c | sort -rn"
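The one-liner above is dense, so here is a sketch of the same pipeline with one stage per line, wrapped in a function that reads log lines from stdin. The function name summarize_errors is made up for illustration, and grep -E / sed -E are used as equivalents of the \| and -r forms in the original. The position of the message (field 7 onward) is taken from the original cut call; the real layout of apache.log is not documented on this page.

```shell
#!/bin/sh
# Hypothetical helper: same stages as the watch one-liner, without watch/tail.
summarize_errors() {
    # Keep only PHP errors and segfault notices.
    grep -E 'PHP|Segmentation fault' |
    # Drop the per-process prefix so segfaults from different PIDs group together.
    sed -E 's/\[notice\] child pid [0-9]+ exit signal //g' |
    # Drop the referer suffix so the same error from different pages groups together.
    sed 's/, referer.*$//g' |
    # Strip the first six space-separated fields (timestamp/host prefix, per the original).
    cut -d ' ' -f 7- |
    # Count identical messages and list the most frequent first.
    sort | uniq -c | sort -rn
}

# Usage, assuming the log path from the original command:
#   tail -n 1000 /home/wikipedia/syslog/apache.log | summarize_errors
```

Wrapping this in watch, as the original does, simply reruns the summary every two seconds so that new errors show up as they are logged.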
Origin
<RoanKattouw> He said something like "we could spend a day just fixing the stuff our error logs turn up"
<sumanah> indeed that does sound plausible!
<Reedy> a day?
<Reedy> I'm sure we could make a week out of it...
<^demon> Well, more I'm sure. But there's a *lot* of low-hanging fruit in the logs.
<Reedy> heh, indeed
<^demon> Simple stuff like val/ref mismatch, missing params, fatals for stupid mistakes, etc.
<sumanah> a day would be a good place to start
<RoanKattouw> Like ^demon said we could just hand out stuff to people
<RoanKattouw> I guess I should spend that time trying to fix those stupid permissions issues on the cluster
Events
Events where a sprint was held on this matter (either large or small):
- Berlin Hackathon 2011
- New Orleans Hackathon 2011: Some attendees worked on this, but they picked a moment when very few errors were showing up in the logs, so they had little to work with. However, Roan fixed an issue that had been cluttering the logs with symbolic link errors. Log sizes are therefore expected to drop considerably, saving space and making more serious errors easier to find.
- Berlin Hackathon 2012