Analytics/Archive/Pixel Service
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on any information on this page.
This page is archived! Find up-to-date documentation at https://wikitech.wikimedia.org/wiki/Analytics
The Pixel service is the "front door" to the analytics system: a public endpoint with a simple interface for getting data into the datastore.
Components
- Request Endpoint: HTTP server that handles GET requests to the pixel service endpoint, responding with 204 NO CONTENT or an actual, honest-to-god 1x1 transparent GIF. Data is submitted into the cluster via query parameters.
- Messaging System: routes messages (request information + data content) from the request endpoint to the datastore. This component is intended to be implemented by Apache Kafka.
- Datastore Consumer: consumes messages, shunting them into the datastore utilizing HDFS staging and/or append.
- Processing Toolkit: a standard template for a Pig job to process (count, aggregate, etc.) event-data query string params, handling standard indirection for referrer and timestamp as well as Apache Avro de/serialization, and providing tools for conversion-funnel and A/B testing analysis.
- Event Logging Library: a JS library with a simple interface that abstracts sending data to the service. It handles event-data conventions for the proxied timestamp and referrer, plus the normal web-request components.
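To make the request-endpoint component concrete, here is a minimal sketch in Python. The parameter names (`product`, `action`) and the helper are purely illustrative; this is not the actual service implementation, just the shape of it: parse the query string into KV pairs, and answer with either 204 or a tiny transparent GIF.

```python
import urllib.parse

# A minimal 1x1 transparent GIF (43 bytes) -- the kind of body a pixel
# endpoint can return when it doesn't answer 204 NO CONTENT.
TRANSPARENT_GIF = (
    b"GIF89a"                                    # header
    b"\x01\x00\x01\x00\x80\x00\x00"              # 1x1 logical screen
    b"\x00\x00\x00\xff\xff\xff"                  # 2-color palette
    b"\x21\xf9\x04\x01\x00\x00\x00\x00"          # transparency extension
    b"\x2c\x00\x00\x00\x00\x01\x00\x01\x00\x00"  # image descriptor
    b"\x02\x02\x44\x01\x00"                      # pixel data
    b"\x3b"                                      # trailer
)

def parse_event(path):
    """Extract submitted event data from a pixel-request path such as
    /event.gif?product=mobile&action=view (names are made up here)."""
    query = urllib.parse.urlsplit(path).query
    return dict(urllib.parse.parse_qsl(query))
```

Whatever `parse_event` returns is what gets handed to the messaging system as the data content of the message.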
Service prototype
To get up and running right away, we're going to start with an alpha prototype, and work with teams to see where it goes.
- /event.gif on bits multicast stream -> udp2log (1:1) running in Analytics cluster
- Until bits caches are ready, we'll also have a publicly accessible endpoint on analytics1001
- Kafka consumes udp2log, creating topic per product-code -- no intermediate aggregation at cache DC
- Cron to run Kafka-Hadoop consumer, importing all topics into Hadoop to datetime+producer-code paths
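The datetime + producer-code paths from the last step might look like the sketch below. The root prefix and the exact layout are assumptions for illustration, not the actual cluster layout:

```python
from datetime import datetime

def import_path(product_code, ts, root="/wmf/raw/event"):
    """Hadoop path a cron'd Kafka-Hadoop consumer could import one
    topic into: <root>/<producer-code>/<year>/<month>/<day>/<hour>.
    The root prefix is hypothetical."""
    return f"{root}/{product_code}/{ts:%Y/%m/%d/%H}"

# e.g. import_path("mobile", datetime(2013, 1, 12, 0))
#   -> "/wmf/raw/event/mobile/2013/01/12/00"
```

Partitioning by producer code first and time second keeps each product's stream contiguous, which suits per-product batch jobs.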
EventLogging Integration TODOs
- Make sure all event data goes into Kraken (I think it may only be esams at the moment, not sure). [ottomata] (Dec)
- Divvy up some TODOs with Ori:
- Keeping udplog seq id counters for each bits host and emitting some alert if gaps detected
- Until https://rt.wikimedia.org/Ticket/Display.html?id=4094 is resolved, monitor for truncated URIs (detectable by the missing trailing ';') and set up some alerting scheme
- Speaking of that RT ticket: check w/Mark if we can do something useful to move that along (like update the patch so it applies against the versions deployed to prod).
- Figure out a useful arrangement for server-side events (basic idea: call wfDebugLog(..) on hooks that represent "business" events, and have wfDebugLog write to a UDP/TCP socket pointing at Kraken). See the EventLogging extension for some idea of what I mean.
- already done? EventLogging's efLogServerSideEvent() validates events against a versioned schema on meta-wiki and writes them using wfDebugLog (currently to UDP). E3 logs all AccountCreation events on all servers using this. -- S Page (WMF) (talk) 00:39, 12 January 2013 (UTC)
- Things Ori needs and would repay in dev time and/or sexual favors:
  - Puppetization of stuff on Vanadium
  - Help w/MySQL admin
- Other EventLogging TODOs: mw:Extension:EventLogging/Todos
- Figure out how to map event schemas to Avro(?) or some other way to make Hadoop schema-aware so the data is actually useful rather than just blob-like
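The sequence-ID gap check described above could be sketched like this; the function name and the data shape (per-host lists of observed udp2log sequence IDs) are made up for illustration:

```python
def find_gaps(host_seq_ids):
    """Given udp2log sequence IDs observed per bits host, return
    (host, expected, got) tuples wherever the counter skips --
    each gap indicates dropped log lines worth alerting on."""
    gaps = []
    for host, seq_ids in host_seq_ids.items():
        for prev, cur in zip(seq_ids, seq_ids[1:]):
            if cur != prev + 1:
                gaps.append((host, prev + 1, cur))
    return gaps

# e.g. find_gaps({"cp3001": [1, 2, 3, 5]}) -> [("cp3001", 4, 5)]
```

A real monitor would run this over a sliding window per host and feed non-empty results into whatever alerting scheme gets chosen.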
Getting to production
We're pretty settled on Kafka as the messaging transport, but to use the dynamic load-balancing and failover features we need a ZooKeeper-aware producer — unfortunately, only the Java and C# clients have this functionality. (This is a blocker for both the Pixel Service AND general request logging.)
Three options:
- Pipe logging output from Squid & Varnish into the console producer (which implies running the JVM in production);
- Write code (a Varnish plugin plus configuration as described here, as well as a Squid module, both in something C-like) to do ZK-integration and publish to Kafka;
- Continue to use udp2log -> Kafka with the caveat that the stream is unreliable until it gets to Kafka.
Frequently Asked Questions
What HTTP actions will the service support?
GET.
What about POSTs?
No POST. Only GET. Other than content-length, there's no real justification for a POST, and if you're sending strings greater than 2k, you kind of already have a problem.
Can I send JSON?
Sure, but we're probably not going to do anything special with it -- the JSON values will show up as strings that you'll have to parse to aggregate, count, etc. Ex: GET /event.gif?json={"foo":1,"bar":[1,2,3]} (and recall you'll have to encodeURIComponent(json)).
As we want to build tools to cover the normal cases first, this is not really recommended. (Just use www-form-encoded KV pairs as usual.) If anyone has a REEEEALLY good use-case, we can talk about having a key-convention for sending a JSON payload, like, say, calling the key json.
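For a server-side sense of what that encoding looks like, here is a sketch using Python's urllib as a rough stand-in for the browser's encodeURIComponent (the endpoint path and key follow the example above):

```python
import json
import urllib.parse

def pixel_url(params, endpoint="/event.gif"):
    """Build a pixel-request URL; urlencode percent-escapes each
    value, roughly what encodeURIComponent does client-side."""
    return endpoint + "?" + urllib.parse.urlencode(params)

url = pixel_url({"json": json.dumps({"foo": 1, "bar": [1, 2, 3]})})
# The JSON arrives as one opaque, percent-encoded string value that a
# consumer has to decode and parse before it can count or aggregate.
```

This is exactly why plain KV pairs are preferred: each value lands as its own queryable field instead of a blob.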
If I send crazy HTTP headers, will the service record them?
No. We will not parse anything other than the query string.
Custom headers are exactly what we want to avoid -- think of the metadata in an HTTP request as being an interface. You want it to be minimal and well-defined, so little custom parsing needs to occur. KV-pairs in the query string are both flexible and generic enough to meet all reasonable use-cases. If you really need typing, send JSON as the value (as mentioned above).
See also
- Extension:EventLogging from the E3 team uses a similar approach to the Kraken Pixel Service: client-side JavaScript makes GET requests to bits.wikimedia.org/event.gif?param1=value1.... As of December 2012 mw:Onboarding new Wikipedians and Community portal redesign use this extension and its JSON schema-driven logging, and Extension:MobileFrontend makes requests to the event.gif directly. Meta-wiki hosts the schemas defining the events they log.
- Extension:ClickTracking from 2010 implements (among other features) event logging via HTTP requests to a MediaWiki API that writes to a ClickTracking "log" which we route over UDP. As of December 2012 several extensions still depend on ClickTracking, but few actually generate log events.