Analytics/Legacy Logging
Legacy webrequest logging
Work surrounding udp2log, udp-filters and webstatscollector
Rationale
Timeline
Documents
- User requirements:
- Specifications:
- Software design document:
- Test plan:
- Documentation plan:
- User interface design docs:
- Schedule:
- Task management:
- Release management plan:
- Communications plan:
Communications
Proposal for changes to the format and content of the web server logs
Log Format Changes
The Analytics Team wants to make a number of changes to the web server logs to collect more data and to fix some issues with the output format. We propose the following changes:
- Add the X-Carrier header to be able to identify Wikipedia Zero traffic. I have sent a proposal for shortening country names and I have asked Amit to supply abbreviations for mobile carrier names.
- Add the Accept-Language header
- Use the tab character as the field delimiter instead of the space. This is probably the biggest change, and it will affect everyone who uses the server logs. See the Overall Plan for how we want to make this transition as smooth as possible.
These log format changes will need to be made for squid, varnishncsa, and nginx.
Overall Plan
We suggest the following approach to introduce these changes without disrupting the existing workflow:
- Andrew has built an nginx/varnish/squid MediaWiki setup in Labs where we can extensively test the new server log configuration.
- We will generate test data and supply that to Erik Zachte and give him ample time to adjust his scripts.
- Once we receive a thumbs up from Erik Zachte, we will tell all the other log file consumers when we are going to deploy the changes on the servers. In particular, the fundraising team is an important consumer of log data that will be affected by this change.
- Deploy changes. See #Deployment_Plan below.
Progress
Task | Status |
---|---|
Webstatscollector | Finished |
udp-filter | Finished |
AWK scripts | Finished |
C-based filter scripts | Finished |
Wikipedia Zero filters | Finished |
Varnish / squid / nginx config changes | Finished |
Update wikitech documentation | Finished |
Software Changes
We will have to make changes to the following programs:
Webstatscollector
TODO: What needs to be done here?
udp-filter
- Remove exact field count requirement.
- Add ability to filter by HTTP response code.
The above two changes should be deployed before we finally switch to \t.
- Use \t as field delimiter.
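The two preparatory changes can be illustrated with a minimal Python sketch. This is not udp-filter's actual code. The function name `keep_line`, the constant `STATUS_FIELD`, the sample log line, and the assumption that the status appears in the sixth field in the form `TCP_MISS/503` are all hypothetical and used only for illustration. The point is that the filter reads the fields it needs by position and tolerates any number of trailing fields.

```python
# Hypothetical sketch; field positions are assumptions, not the real log spec.
STATUS_FIELD = 5  # zero-based index of a field such as "TCP_MISS/503" (assumed)


def keep_line(line, status_prefix, delimiter=" "):
    """Return True if the line's HTTP status starts with status_prefix.

    Only a minimum field count is required, so extra trailing fields
    (for example newly added headers) do not cause the line to be dropped.
    """
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) <= STATUS_FIELD:
        return False  # too short to contain a status field
    # Take the part after the last "/" ("TCP_MISS/503" -> "503").
    status = fields[STATUS_FIELD].rsplit("/", 1)[-1]
    return status.startswith(status_prefix)


# Made-up sample lines: the second one carries two extra trailing fields.
old_style = "host 1 2012-05-16T00:00:00 0.1 10.0.0.1 TCP_MISS/503 512 GET http://example.org/"
new_style = old_style + " extra-field-1 extra-field-2"
print(keep_line(old_style, "50"), keep_line(new_style, "50"))
```

Because the check is "at least enough fields" rather than "exactly N fields", the same filter keeps working before and after the extra fields are added to the log sources.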
AWK scripts
These do not need to be changed, as they currently split on any whitespace character and will behave the same way with either delimiter. However, to make them more accurate than they are now, we should change them to split on \t rather than on any whitespace.
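The difference between the two splitting modes can be seen in a small Python sketch. Python's `str.split()` with no argument splits on runs of whitespace, much like AWK's default field splitting. The sample line and its three-field layout are made up for illustration and are not the real log format.

```python
# Made-up sample line: three tab-separated fields, the last one containing spaces.
line = "10.0.0.1\thttp://example.org/wiki/Main_Page\tMozilla/5.0 (X11; Linux x86_64)"

by_whitespace = line.split()   # splits on any run of whitespace
by_tab = line.split("\t")      # splits on tab only

print(len(by_whitespace))  # the user agent is broken into several pieces
print(len(by_tab))         # one entry per logical field
print(by_tab[2])
```

Splitting on whitespace shifts every field that follows a value containing spaces, which is why splitting on \t only is the more accurate choice once the logs are tab-delimited.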
C-based filter scripts
Migrate these to use udp-filter.
emery
latlongCountry-writer
This currently prepends CountryCode lat,lon to log lines. Will it be OK if we change the format to what udp-filter does with -g -b everything?
TODO: talk to someone about latlongCountry-writer.
India
-pipe 10 /a/squid/india-filter >> /a/squid/india.log
+pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log
locke
Mobile
-pipe 100 /a/squid/m-filter >> /a/squid/mobile.log
+pipe 100 /usr/bin/udp-filter -d m.wikipedia.org >> /a/squid/mobile.log
India
Do we really need two India filters? One is already on emery.
-pipe 10 /a/squid/india-filter >> /a/squid/india.log
+pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log
Edits
-pipe 1 /a/squid/edits-filter >> /a/squid/edits.log
+pipe 1 /usr/bin/udp-filter -p "action=edit,action=submit" >> /a/squid/edits.log
5xx errors
-pipe 1 /a/squid/5xx-filter | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log
+pipe 1 /usr/bin/udp-filter --http-status='50' | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log
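For readers less familiar with AWK, the second stage of this pipeline, `$9 !~ "upload.wikimedia.org|query.php"`, keeps only lines whose ninth whitespace-separated field does not match the regular expression. A rough Python equivalent is sketched below. The function name `passes_exclusion` and the sample lines are hypothetical; only the pattern and the field number come from the configuration above.

```python
import re

# Same pattern as in the AWK stage; note that "." is a regex wildcard here,
# exactly as it is in the AWK expression.
EXCLUDE = re.compile(r"upload.wikimedia.org|query.php")


def passes_exclusion(line):
    """Mimic awk '$9 !~ PATTERN': keep lines whose 9th field does not match."""
    fields = line.split()
    # AWK treats a missing field as the empty string, which never matches here.
    ninth = fields[8] if len(fields) > 8 else ""
    return EXCLUDE.search(ninth) is None


# Made-up sample lines with the URL in the ninth field.
kept = "h 1 t 0.1 10.0.0.1 TCP_MISS/503 0 GET http://en.wikipedia.org/wiki/Foo"
dropped = "h 1 t 0.1 10.0.0.1 TCP_MISS/503 0 GET http://upload.wikimedia.org/a.png"
print(passes_exclusion(kept), passes_exclusion(dropped))
```

Note that this stage depends on the ninth field being the same after whitespace splitting, so it is sensitive to spaces inside any earlier field.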
Fundraising Landing Pages
-pipe 1 /a/squid/fundraising/lp-filter >> /a/squid/fundraising/logs/landingpages.log
+pipe 1 udp-filter -d wikimediafoundation.org,donate.wikimedia.org >> /a/squid/fundraising/logs/landingpages.log
Fundraising Banner Impressions
-pipe 100 /a/squid/fundraising/bi-filter >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log
+pipe 100 /usr/bin/udp-filter -p 'Special:BannerLoader' >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log
packet-loss filter
- Use \t as field delimiter.
Wikipedia Zero Filters
Replace the IP range filters with a single udp-filter that matches on the X-Carrier header. This can be done after the additional fields are added to the log sources.
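As a rough illustration of the idea, the Python sketch below selects lines by carrier value instead of by IP range. Everything about the layout is an assumption: the X-Carrier value is assumed to be a tab-separated field at a hypothetical index `CARRIER_FIELD`, an absent carrier is assumed to be logged as `-` or an empty string, and the carrier name is invented. It does not reflect udp-filter's real interface.

```python
CARRIER_FIELD = 3  # hypothetical zero-based position of the X-Carrier value


def is_zero_traffic(line, carriers=None):
    """Return True if the line carries a carrier value we are interested in.

    carriers: optional set of carrier names to accept; None accepts any
    non-empty carrier value.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= CARRIER_FIELD:
        return False  # line predates the new field or is malformed
    carrier = fields[CARRIER_FIELD]
    if carrier in ("", "-"):
        return False  # no carrier header was set on this request (assumed encoding)
    return carriers is None or carrier in carriers


# Made-up sample lines.
with_carrier = "10.0.0.1\tGET\thttp://en.m.wikipedia.org/\texample-carrier"
without_carrier = "10.0.0.1\tGET\thttp://en.m.wikipedia.org/\t-"
print(is_zero_traffic(with_carrier), is_zero_traffic(without_carrier))
```

A single header-based check like this would not need to be updated whenever a carrier's IP ranges change, which is the motivation for moving away from per-range filters.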
sqstat.pl
- Use \t as field delimiter.
Need to talk to Asher about this.
varnishncsa, nginx, and squid log formats
- varnishncsa.default
- nginx.conf.erb
- frontend generate squid .php template
Reverse the nginx patch where we escape spaces
This can be done last. Is there a similar patch for varnishncsa?
Update Wikitech documentation
We need to update the Wikitech documentation with the new HTTP headers. I have requested access for us to edit the Wikitech wiki.
Deployment Plan
- Deploy initial changes to udp-filter. udp-filter needs to be able to accept a variable number of fields. As of May 16, 2012, this change has been committed and needs to be deployed.
- Verify that everything works exactly as it did before this change. Wait a few days to catch any potential problems.
- Migrate existing custom C scripts to use udp-filter. This needs to be done after the above udp-filter change has been deployed, to avoid losing log lines because some of the fields contain spaces.
- Verify that all migrated filters still work properly. Wait for at least a few days after all filters have been migrated to ensure that things are ok. It'd be good to get verification from the filter owners before we proceed as well.
- Deploy the log source change that adds the additional fields to the log format. Do not yet deploy the \t change.
- Verify that all filters continue to work as before, even with the addition of extra fields.
- TODO: Work out \t deployment plan