Analytics/Archive/Logging infrastructure/status

Last update on: 2014-04-monthly

2012-05-22

We will soon deploy a new version of udp-filter that accepts a variable number of fields. This will allow us to migrate more custom C filters to udp-filter. udp-filter can now also filter by HTTP response status and add geocoding alongside the IP address.
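As a rough illustration of filtering by HTTP response status on lines with a variable number of fields, here is a minimal Python sketch. It is not udp-filter's actual code (udp-filter is a separate tool). The space-delimited layout, the position of the status field (`STATUS_FIELD`), and the `TCP_MISS/200` style value are assumptions made for the example.

```python
# Hypothetical layout: space-delimited fields, with the HTTP status in the
# sixth field, optionally prefixed by a cache result such as "TCP_MISS/".
STATUS_FIELD = 5  # assumption, zero-based index


def http_status(line, status_field=STATUS_FIELD):
    """Return the HTTP status of a log line as a string, or None if the
    line has too few fields to contain one."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) <= status_field:
        return None
    # "TCP_MISS/200" -> "200"; a bare "200" is returned unchanged.
    return fields[status_field].rpartition("/")[2]


def filter_by_status(lines, wanted):
    """Yield only the lines whose HTTP status is in `wanted`.

    Lines may carry any number of trailing fields; only the status
    field is inspected, so extra fields do not break the filter."""
    wanted = set(wanted)
    for line in lines:
        if http_status(line) in wanted:
            yield line


if __name__ == "__main__":
    sample = [
        "cp1 1 2012-05-22T00:00:00 0.1 192.0.2.7 TCP_MISS/200 512 GET http://example.org/a",
        "cp1 2 2012-05-22T00:00:01 0.2 192.0.2.8 TCP_HIT/404 0 GET http://example.org/b extra1 extra2",
        "truncated line",
    ]
    for kept in filter_by_status(sample, {"404"}):
        print(kept)
```

Because the filter only looks at one field index, new fields appended to the end of a line do not require any change to it.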

2012-05-10

We have added a third log collector machine (oxygen) to supplement our current collectors (locke and emery). Andrew is working out a strategy for dealing with errant spaces in nginx logs that throw off our logging scripts. Also in progress is a better way to match Wikipedia Zero traffic, probably by adding a custom response header.

2012-05-monthly

Our plan to improve logging sources (Squid, Varnish, nginx, etc.) includes adding more fields, and also allowing us to add arbitrary fields in the future without breaking features. Changing the field formats of the logging sources requires coordination with the Operations team. The format changes have been committed, but not yet deployed. udp-filter has been modified to be more flexible, and a few features have been added as well: it can now geocode and anonymize inline, in the same field as the IP address, so that later log parsers don't have to detect a new field.

2012-06-03

During the Berlin Hackathon, a patch was submitted that allows udp-filter to do IPv6 address filtering. We hope to incorporate this soon.

2012-06-monthly

A change adding two new headers to the logging fields has been submitted. We are waiting on the go-ahead from consumers to merge and deploy it.


2012-07-monthly

Modified the Lucene lsearchd code to use a log4j appender for udp2log rather than manually editing the codebase. Also built scribe and scribe log4j appenders for sending arbitrary logs to scribe. No movement on log format changes.

2012-09-monthly

  • Augmented udp-filter to take CIDR ranges, and to consistently anonymize IP addresses.
  • Worked with Zero team to make sure incoming log filters are consistent.
  • Puppetized rsync module to allow for easy syncing of data between udp2log machines and stat1.
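The CIDR matching and consistent anonymization mentioned in the first item can be sketched in Python as below. This is only an illustration, not udp-filter's implementation. In particular, the source does not say how udp-filter anonymizes addresses; the keyed hash (HMAC-SHA256), the function names, and the 16-character token length are assumptions chosen for the example.

```python
import hashlib
import hmac
import ipaddress


def in_any_range(ip, cidr_ranges):
    """Return True if `ip` falls inside at least one of the CIDR ranges.

    IPv4 and IPv6 ranges can be mixed; an address never matches a range
    of the other IP version."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(r) for r in cidr_ranges)


def anonymize(ip, key):
    """Map an IP address to a stable token (assumed scheme: HMAC-SHA256).

    The same address and key always give the same token, so anonymized
    records can still be grouped by client without exposing the address."""
    digest = hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:16]


if __name__ == "__main__":
    ranges = ["10.0.0.0/8", "2001:db8::/32"]  # example ranges only
    key = b"example-secret"                   # hypothetical key
    for ip in ["10.1.2.3", "192.0.2.1", "2001:db8::1"]:
        print(ip, in_any_range(ip, ranges), anonymize(ip, key))
```

"Consistent" here means deterministic: a given address maps to the same token on every line and on every machine that shares the key.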

Ongoing admin work with users of stat1 and stat1001.

  • Moved stats.wikimedia.org hosting from spence over to stat1001.
  • Set up easy deployment of data generated on stat1 to stats.wikimedia.org on stat1001.

2012-10-monthly

Contractor Stefan Petrea has worked on bug fixes in wikistats, which we also migrated to git. On the udp2log front, we added features in udp-filters and webstatscollector, and deployed a new banner impression filter.

2013-02-monthly

It was a quiet month for the logging infrastructure; things were running fine. We have been working on a patch to fix bug 45178, which we will try to deploy in March.

2013-09-monthly

  • We have started testing Kafka and its failover behavior in a multi-datacenter setup. So far, the results have been very encouraging: the failure of a broker is detected very quickly, and the producers start sending data to the backup broker with almost no data loss. We have decided to use JSON as the new message format, in combination with Snappy compression, for sending data from the Kafka producers to the Kafka brokers.

2014-01-monthly

We've increased the throughput on Kafka from 6K requests per second (RPS) to 50K RPS to test stability under higher loads.

2014-03-monthly

We continue to investigate network issues between our data centers that are causing occasional delivery issues. As noted above, we are currently deploying Camus, our software for transferring data between Kafka and Hadoop.

2014-04-monthly

Data from the text Varnishes is now being consumed through varnishkafka -> Kafka -> Camus into HDFS. Kafka now processes Bits, Images and Text data.