Analytics/Legacy Logging

Rationale

Timeline

Documents

  • User requirements:
  • Specifications:
  • Software design document:
  • Test plan:
  • Documentation plan:
  • User interface design docs:
  • Schedule:
  • Task management:
  • Release management plan:
  • Communications plan:

Communications

Proposal for changes to the format and content of the web server logs

Log Format Changes

The Analytics Team wants to make a number of changes to the web server logs to collect more data and to fix some issues with the output format. We propose the following changes:

  1. Add the X-Carrier header to be able to identify Wikipedia Zero traffic. I have sent a proposal for shortening country names and I have asked Amit to supply abbreviations for mobile carrier names.
  2. Add the Accept-Language header.
  3. Use the tab character as the field delimiter instead of the space. This is probably the biggest change, and it will affect everyone who uses the server logs. See the Overall Plan for how we want to make this transition as smooth as possible.

These log format changes will need to be made for squid, varnishncsa, and nginx.
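
To illustrate why the delimiter change matters, here is a minimal Python sketch. It is not the production log format: the field order, the values, the placement of the two proposed fields at the end of the line, and the use of "-" for an absent header are all assumptions made for the example.

```python
# Hypothetical record: field order and values are illustrative only.
fields = [
    "cp1001",                                  # hostname (made up)
    "TCP_MISS/200",                            # cache result / HTTP status
    "http://en.wikipedia.org/wiki/Main_Page",  # URL
    "Mozilla/5.0 (X11; Linux x86_64)",         # user agent, contains spaces
    "en-US,en;q=0.8",                          # Accept-Language (proposed)
    "-",                                       # X-Carrier (proposed); "-" assumed to mean "absent"
]

space_line = " ".join(fields)
tab_line = "\t".join(fields)

# A space-delimited line cannot be split back into its six fields,
# because the user agent itself contains spaces.
space_fields = space_line.split(" ")

# A tab-delimited line round-trips, as long as no field contains a tab.
tab_fields = tab_line.split("\t")

print(len(fields), len(space_fields), len(tab_fields))
```

With a space delimiter the consumer sees more fields than were written; with a tab delimiter the field count is preserved.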

Overall Plan

We suggest the following approach to introduce these changes without disrupting the existing workflow:

  1. Andrew has built an nginx/varnish/squid MediaWiki setup in Labs where we can extensively test the new configuration of the server logs.
  2. We will generate test data and supply that to Erik Zachte and give him ample time to adjust his scripts.
  3. Once we receive a thumbs up from Erik Zachte, we will tell all the other log file consumers when we are going to deploy the changes on the servers. In particular, the fundraising team is an important consumer of log data that will be affected by this change.
  4. Deploy changes. See the Deployment Plan below.

Progress

Summary of Progress

Task                                    Status
Webstatscollector                       Finished
udp-filter                              Finished
AWK scripts                             Finished
C-based filter scripts                  Finished
Wikipedia Zero filters                  Finished
Varnish / squid / nginx config changes  Finished
Update wikitech documentation           Finished

Software Changes

We will have to make changes to the following programs:

Webstatscollector

TODO: What needs to be done here?

udp-filter

  • Remove exact field count requirement.
  • Add ability to filter by HTTP response code.

The above two changes should be deployed before we finally switch to \t.

  • Use \t as field delimiter.
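
The first two changes can be sketched in Python as follows. This is only an illustration of the intended behaviour, not udp-filter's actual implementation: the function name, the position of the status field (sixth field, in the form "TCP_MISS/503"), and the prefix-matching rule are assumptions.

```python
def status_matches(line, status_prefix, delimiter=" ",
                   status_index=5, min_fields=6):
    """Return True if the line's HTTP status starts with status_prefix.

    Only a lower bound on the number of fields is enforced, so lines
    with extra trailing fields are still accepted.
    """
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) < min_fields:
        return False
    # Assumed field shape: "<cache result>/<HTTP status>", e.g. "TCP_MISS/503".
    status = fields[status_index].rsplit("/", 1)[-1]
    return status.startswith(status_prefix)


old_line = "host 1 ts 0.1 192.0.2.1 TCP_MISS/503 100 GET http://example.org/"
new_line = old_line + " extra-field-1 extra-field-2"

print(status_matches(old_line, "50"))
print(status_matches(new_line, "50"))
```

Because the check is "at least N fields" rather than "exactly N fields", the same filter keeps working once the log sources start emitting additional fields.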

AWK scripts

These do not need to be changed, because they currently split on any whitespace character and will behave as they do now either way. However, to make them more accurate than they currently are, we should change them to split on \t rather than on any whitespace.
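
The difference between the two splitting modes can be shown with a small Python sketch, where `str.split()` with no argument behaves roughly like awk's default field splitting and `split("\t")` like `awk -F'\t'`. The sample line and its field layout are made up for the illustration.

```python
# Hypothetical tab-delimited line; the 14th field (a user agent) contains spaces.
line = "\t".join([
    "host", "1", "ts", "0.1", "192.0.2.1", "TCP_MISS/200", "100", "GET",
    "http://example.org/", "-", "text/html", "-", "-",
    "Mozilla/5.0 (X11; Linux)", "en",
])

by_whitespace = line.split()      # roughly awk's default splitting
by_tab = line.split("\t")         # roughly awk -F'\t'

# Fields before the first space-containing field agree in both modes,
# so a script that only reads the URL (awk's $9) is unaffected.
print(by_whitespace[8] == by_tab[8])

# Fields after it are shifted when splitting on any whitespace.
print(by_tab[14], by_whitespace[14])
```

This is why scripts that only look at early fields behave the same either way, while anything reading fields after the user agent is only reliable when splitting on \t.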

C-based filter scripts

Migrate these to use udp-filter.

emery
latlongCountry-writer

This currently prepends CountryCode lat,lon to log lines. Will it be ok if we change the format to what udp-filter does with -g -b everything?

TODO: talk to someone about latlongCountry-writer.

India
 -pipe 10 /a/squid/india-filter >> /a/squid/india.log
 +pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log
locke
Mobile
 -pipe 100 /a/squid/m-filter >> /a/squid/mobile.log
 +pipe 100 /usr/bin/udp-filter -d m.wikipedia.org >> /a/squid/mobile.log
India

Do we really need two India filters? One is already on emery.

 -pipe 10 /a/squid/india-filter >> /a/squid/india.log
 +pipe 10 /usr/bin/udp-filter -c IN -g -b country -m /usr/share/GeoIP/GeoIP.dat >> /var/log/squid/india.log
Edits
 -pipe 1 /a/squid/edits-filter >> /a/squid/edits.log
 +pipe 1 /usr/bin/udp-filter -p "action=edit,action=submit" >> /a/squid/edits.log
5xx errors
 -pipe 1 /a/squid/5xx-filter | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log
 +pipe 1 /usr/bin/udp-filter --http-status='50' | awk -W interactive '$9 !~ "upload.wikimedia.org|query.php"' >> /a/squid/5xx.log
Fundraising Landing Pages
 -pipe 1 /a/squid/fundraising/lp-filter >> /a/squid/fundraising/logs/landingpages.log
 +pipe 1 udp-filter -d wikimediafoundation.org,donate.wikimedia.org >> /a/squid/fundraising/logs/landingpages.log
Fundraising Banner Impressions
 -pipe 100 /a/squid/fundraising/bi-filter >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log
 +pipe 100 /usr/bin/udp-filter -p 'Special:BannerLoader' >> /a/squid/fundraising/logs/bannerImpressions-sampled100.log

packet-loss filter

  • Use \t as field delimiter.

Wikipedia Zero Filters

Modify the IP range filters to a single udp-filter that matches on X-Carrier header. This can be done after the additional fields are added to log sources.
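
As a rough sketch of what matching on the header could look like once the field exists, the Python function below tests the X-Carrier value instead of an IP range. The function name, the position of X-Carrier as the last field, the "-" placeholder for a missing header, and the carrier name in the example are all assumptions; this does not describe an existing udp-filter option.

```python
def has_carrier(line, carriers=None, delimiter="\t", carrier_index=-1):
    """Return True if the line carries an X-Carrier value.

    Assumptions for this sketch: X-Carrier is the last field and "-"
    marks a request without the header. If `carriers` is given, only
    those carrier names match.
    """
    fields = line.rstrip("\n").split(delimiter)
    carrier = fields[carrier_index]
    if carrier in ("", "-"):
        return False
    return carriers is None or carrier in carriers


zero_line = "host\tTCP_MISS/200\thttp://en.m.wikipedia.org/\tExampleCarrier"
other_line = "host\tTCP_MISS/200\thttp://en.wikipedia.org/\t-"

print(has_carrier(zero_line))
print(has_carrier(other_line))
```

A single header-based check like this would replace the per-carrier IP range lists, which is why it depends on the additional fields being present in the log sources first.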

sqstat.pl

  • Use \t as field delimiter.

Need to talk to Asher about this.

varnishncsa, nginx, and squid log formats

  • varnishncsa.default
  • nginx.conf.erb
  • frontend generate squid .php template

Reverse the nginx patch where we escape spaces

This can be done last. Is there a similar patch for varnishncsa?

Update Wikitech documentation

We need to update the wikitech documentation with the new HTTP headers. I have requested access for us to edit the wikitech wiki.

Deployment Plan

  1. Deploy initial changes to udp-filter. udp-filter needs to be able to accept a variable number of fields. As of May 16 2012, this change has been committed and needs to be deployed.
  2. Verify that everything works exactly as it did before this change. Wait a few days to catch any potential problems.
  3. Migrate existing custom C scripts to udp-filter. This needs to be done after the above udp-filter change has been deployed, to avoid losing log lines that have spaces in some of the fields.
  4. Verify that all migrated filters still work properly. Wait for at least a few days after all filters have been migrated to ensure that things are ok. It'd be good to get verification from the filter owners before we proceed as well.
  5. Deploy log sources change that adds additional fields in log format. Do not yet deploy the \t change.
  6. Verify that all filters continue to work as before, even with the addition of extra fields.
  7. TODO: Work out \t deployment plan
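
The verification in step 6 could be approached with a check along the following lines. This is a hedged sketch only: `domain_filter` is a hypothetical stand-in for a real filter (it is not how udp-filter matches domains), the URL position is assumed to be the ninth field, and the sample lines are invented.

```python
from urllib.parse import urlsplit


def domain_filter(line, domain, url_index=8):
    """Hypothetical stand-in for a filter: match on the URL's host name."""
    fields = line.split(" ")
    if len(fields) <= url_index:
        return False
    host = urlsplit(fields[url_index]).hostname or ""
    return host == domain or host.endswith("." + domain)


def same_decisions(lines, extra_fields, predicate):
    """True if the predicate gives the same answer for every line
    before and after extra fields are appended to it."""
    suffix = " " + " ".join(extra_fields)
    return all(predicate(l) == predicate(l + suffix) for l in lines)


sample = [
    "host 1 ts 0.1 192.0.2.1 TCP_MISS/200 100 GET http://en.m.wikipedia.org/wiki/Foo",
    "host 2 ts 0.1 192.0.2.2 TCP_MISS/200 100 GET http://en.wikipedia.org/wiki/Foo",
]


def is_mobile(line):
    return domain_filter(line, "m.wikipedia.org")


print(same_decisions(sample, ["en-US", "-"], is_mobile))
```

Running a comparison like this against a sample of real log lines, once with and once without the new fields, gives a quick signal before asking the filter owners to confirm.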