Requests for comment/Server-side Javascript error logging

Request for comment (RFC)
Server-side Javascript error logging
Component	Frontend
Creation date	22 June 2014
Author(s)	Tgr
Document status	declined See Phabricator.

Tracked in Phabricator
Task T382

Providing a stack trace with any error report has been standard practice in most software development and maintenance for the last few decades, and it greatly reduces the time and effort needed to determine the exact cause of a bug. However, in the world of frontend development such support is still rare, and for MediaWiki development it is missing completely. Let's change that!

The task of making a Javascript error available to the developers, without involving the user in whose browser the error happens, can be split into four parts:

Catching the error

There are two ways to catch a Javascript error: try/catch and window.onerror. Exception handling is superior in multiple ways, but has to be added to the code manually or via some sort of automated code generation.

window.onerror, on the other hand, is meant to be added globally, without modifying application code, but has its shortcomings:

It does not include column numbers on older browsers, which is problematic for minified code. (Although this is a problem for try/catch exception handling as well.)
The exception object or stack trace is also not available on older browsers. The WHATWG HTML5 standard includes these parameters in the window.onerror parameter list, and recent Chrome and Firefox provide access to the stack trace in window.onerror; Safari and IE do not, although there are hacks for the latter.
If the script was loaded from a different domain (which is almost always the case for WMF sites), the browser hides all error details as a security measure; only a non-specific error message such as "Script error." is passed. Most recent browsers (at least Chrome, Firefox and WebKit) allow opting in to show this information via CORS, by setting an Access-Control-Allow-Origin: * HTTP header on the script resource and adding a crossorigin="anonymous" attribute to the <script> tag.
- The CORS standard requires a CORS resource failure to be handled as a network error, which means that whenever the attribute is specified but the HTTP header is somehow missing, script loading will break completely. This can apparently be the case with some firewalls (TODO: measure what percentage of readers is affected).
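
The shortcomings above mean that a global handler has to cope with missing arguments. A minimal sketch of such a handler follows; the function names, the report fields and the way masked cross-origin errors are detected (by the "Script error." message) are assumptions made for illustration, not an existing MediaWiki API.

```javascript
// Hypothetical helper: turns the arguments of window.onerror into one report object.
// Older browsers pass only (message, url, line); newer ones add (column, error).
function normalizeOnErrorArgs(message, url, line, column, error) {
  return {
    message: String(message),
    url: url || null,
    line: typeof line === 'number' ? line : null,
    column: typeof column === 'number' ? column : null,
    stack: error && error.stack ? String(error.stack) : null,
    // Assumption: cross-origin errors without CORS opt-in arrive as "Script error."
    masked: /^Script error\.?$/.test(String(message))
  };
}

// Installs the handler on a window-like object; `report` is a caller-supplied callback.
function installOnErrorHandler(win, report) {
  var previous = win.onerror;
  win.onerror = function (message, url, line, column, error) {
    report(normalizeOnErrorArgs(message, url, line, column, error));
    // Chain to an earlier handler if there was one; returning false keeps
    // the browser's default error logging in place.
    return previous ? previous.apply(this, arguments) : false;
  };
}
```

In a browser this would be called as installOnErrorHandler(window, someReportFunction); passing the window object in explicitly also makes the handler easy to exercise outside a browser.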

Thus, there seem to be two ways to collect error information, both somewhat scary:

make ResourceLoader set the crossorigin attribute on scripts and do some sort of error detection/fallback to re-add the same scripts without the crossorigin attribute if CORS headers are stripped and the initial loading attempt fails (or just serve those users a no-JS version of the site, as happens for IE6/7 users)
use ResourceLoader::makeLoaderImplementScript() to manipulate $, window and other global objects so that every top-scope call (event handlers, setTimeout etc.) can be automatically wrapped in exception handling
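
The second option boils down to replacing entry points such as setTimeout with versions that wrap their callback in try/catch. A rough sketch of that wrapping, independent of ResourceLoader, could look like this; wrapWithReporting and patchScheduler are hypothetical names, and a real implementation would also have to cover event handlers and jQuery.

```javascript
// Hypothetical helper: returns a function that behaves like fn but reports exceptions.
function wrapWithReporting(fn, report) {
  if (typeof fn !== 'function') {
    return fn; // e.g. a code string passed to setTimeout is left untouched
  }
  return function () {
    try {
      return fn.apply(this, arguments);
    } catch (e) {
      report(e);
      throw e; // rethrow so that existing behaviour (console output, onerror) is unchanged
    }
  };
}

// Replaces a method such as setTimeout on `target` so that its first argument
// (the callback) is wrapped before the original method is called.
function patchScheduler(target, name, report) {
  var original = target[name];
  target[name] = function (callback) {
    var args = Array.prototype.slice.call(arguments);
    args[0] = wrapWithReporting(callback, report);
    return original.apply(this, args);
  };
}
```

In a browser, patchScheduler(window, 'setTimeout', report) would be one use. Because the exception object is caught directly, the stack trace is available wherever the browser provides error.stack, regardless of the cross-domain restrictions on window.onerror.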

To make it easy to connect stack traces with error reports, the error catching script should also generate an error id which can be displayed to the user by the application. This would be some sort of hash generated from the error details (message + filename + position, or maybe just filename + position to avoid duplicates based on language / browser version). There is also various extra information that would be useful to include (MediaWiki/extension/gadget version information, maybe even some sort of hook where extensions can add extra info).
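
As an illustration of such an id, the sketch below hashes only filename + position, so the id does not depend on the language or browser-specific wording of the message. The function name and the choice of a djb2-style string hash are assumptions; any stable hash would do.

```javascript
// Hypothetical helper: derives a short, stable id from the error location.
function errorId(fileName, line, column) {
  // Column may be missing on older browsers; treat it as empty in that case.
  var key = fileName + ':' + line + ':' + (column == null ? '' : column);
  var hash = 5381;
  for (var i = 0; i < key.length; i++) {
    // djb2-style step, kept within unsigned 32 bits
    hash = ((hash * 33) ^ key.charCodeAt(i)) >>> 0;
  }
  // Fixed-width, 8-character hex string that can be shown to the user
  return ('0000000' + hash.toString(16)).slice(-8);
}
```

The application could then display the id next to its error message, and a developer could search the stored reports for the same value.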

Sending the error to the server

There should be a way to transfer error data to the server that is simple to set up and suitable for most MediaWiki installs; WMF, with its huge traffic, probably needs something more complex.

The generic solution could simply be an AJAX request to an API endpoint (with some sort of client-side throttling, to prevent a flood of errors from window.setInterval callbacks or repetitive events such as scroll or mousemove), which then uses standard logging with a reserved channel; the site operator can configure where that channel goes in the usual way. (This assumes that the structured logging RFC will be implemented.)
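
The client-side throttling could be as simple as a fixed-window counter. The sketch below is one possible shape, with hypothetical names and limits; it only decides whether a report may be sent and leaves the actual AJAX request to the caller.

```javascript
// Hypothetical factory: allows at most `limit` reports per `windowMs` milliseconds.
// `now` is an optional clock function, which makes the throttle testable.
function createReportThrottle(limit, windowMs, now) {
  now = now || Date.now;
  var windowStart = null;
  var sent = 0;
  var dropped = 0;
  return {
    shouldSend: function () {
      var t = now();
      if (windowStart === null || t - windowStart >= windowMs) {
        // Start a new window and reset the counter
        windowStart = t;
        sent = 0;
      }
      if (sent < limit) {
        sent++;
        return true;
      }
      dropped++;
      return false;
    },
    // Number of reports suppressed so far; could be sent along with the next report
    droppedCount: function () {
      return dropped;
    }
  };
}
```

For example, createReportThrottle(5, 60000) would let through at most five reports per minute, so an error in a mousemove handler costs a handful of requests instead of hundreds.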

The WMF solution will need to be able to handle huge traffic. If things break badly, every single pageview could trigger an error; if things break really badly (e.g. an error in a mousemove handler), every pageview might generate hundreds of them. This could be done by reusing some of the existing infrastructure for EventLogging (possibly with some sort of throttling or sampling):

add an image pointing to an error.gif URL to the DOM, with the error data encoded in the query string (TODO: could GET length limits be problematic for huge stack traces?)
run a second instance of varnishkafka on the bits.wikimedia.org varnish (the first is used by EventLogging), with a VSL option to filter on the error.gif requests, to parse varnish logs in shared memory, extract error reports and push them to Kafka
use logstash-kafka to get error reports from Kafka to logstash, where they are preprocessed and pushed into ElasticSearch. (logstash+ElasticSearch is already used for collecting server-side error logs.)
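
On the client side, the error.gif request carries the report in its query string, which is where the GET length concern comes from. One way to stay under a length limit is to drop stack frames from the end of the trace until the URL fits. The sketch below assumes a report object with id, message and stack fields and made-up parameter names; the actual limit would have to be measured.

```javascript
// Hypothetical helper: builds the error.gif URL and shortens the stack trace
// line by line until the whole URL is at most maxLength characters.
function buildErrorBeaconUrl(baseUrl, report, maxLength) {
  var stack = String(report.stack || '');

  function build(stackPart) {
    return baseUrl +
      '?id=' + encodeURIComponent(report.id) +
      '&message=' + encodeURIComponent(report.message) +
      '&stack=' + encodeURIComponent(stackPart);
  }

  var url = build(stack);
  while (url.length > maxLength && stack.length > 0) {
    // Drop the last line of the trace (the outermost frame), keeping the
    // frames closest to the point where the error was thrown.
    var cut = stack.lastIndexOf('\n');
    stack = cut === -1 ? '' : stack.slice(0, cut);
    url = build(stack);
  }
  // Note: if id and message alone exceed maxLength, the URL is returned as is.
  return url;
}
```

In a browser the resulting URL would then be requested by assigning it to the src of a new Image object.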

Processing

Before storing:

EventLogging/UserAgentSanitization
Most errors will reference minified files - we can use source maps to reconstruct the original location. (Generating source maps seems like a hard problem, which might justify switching minifiers - e.g. UglifyJS supports it out of the box, and feature-wise there is probably not much difference between minifiers.)
The same script or some later post-processing could try to figure out which groups an error is in (e.g. which extension owns the file) and ping graphite so we have nice error frequency stats.

After storing:

We need to deduplicate errors if we want to get any useful overview:
- The same error can occur on multiple pages, multiple sites, normal vs. debug mode, possibly multiple resource URLs due to different batching of files in ResourceLoader. (Source maps help solve this issue, but not all browsers return column numbers.)
- Error messages might vary due to i18n and browser differences. We should probably ignore messages and assume two errors with the exact same location are the same thing. Besides deduplication, we want to show developers error messages in English, if possible (could be done by logging browser language settings).
Rotating and purging (are 30 days enough?)
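
To illustrate the deduplication rule suggested above (same location means same error, whatever the message says), here is a small sketch that groups stored reports by location and counts them. The report fields (file, line, column, message) are assumed to be the already source-mapped values, and the function name is hypothetical.

```javascript
// Hypothetical helper: groups reports by location, ignoring the message text.
function groupErrorsByLocation(reports) {
  var groups = Object.create(null);
  reports.forEach(function (r) {
    // Column is not reported by all browsers; use a placeholder when missing
    var key = r.file + ':' + r.line + ':' + (r.column == null ? '?' : r.column);
    if (!groups[key]) {
      groups[key] = { key: key, count: 0, messages: [] };
    }
    groups[key].count++;
    // Keep each distinct message once, e.g. localized or browser-specific variants
    if (groups[key].messages.indexOf(r.message) === -1) {
      groups[key].messages.push(r.message);
    }
  });
  // Most frequent locations first
  return Object.keys(groups)
    .map(function (k) { return groups[k]; })
    .sort(function (a, b) { return b.count - a.count; });
}
```

Reports that lack a column number end up in a separate group from those that have one, which is one of the open problems mentioned above rather than something this sketch solves.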

Displaying

It should be easy to search and filter errors and see frequencies/counts; different error messages which probably represent the same error should be grouped. As a first step we could send results to ElasticSearch and set up a Kibana frontend to it (this is how backend errors are handled). Alternatively, one of the free JS error log display applications might be helpful.

Due to security and privacy issues, access to this data has to be strongly limited, but knowing about these errors would be very useful for many people who work with JS (gadget maintainers, site admins changing MediaWiki:Common.js etc.); publishing some sort of statistics (e.g. error count per file, most frequent error messages) would be nice.

Resources

Non-MediaWiki examples

Free software:

stacktrace.js - Javascript library for obtaining the stacktrace
TraceKit - Javascript library for obtaining the stacktrace
Sentry - full-stack error logging service (not JS-specific, but supports JS)
jsErrorLog - full-stack JS error logging service
ErrorBoard - simple full-stack JS error logging service

Commercial / SaaS: TrackJS, bugsense, JSLogger, Qbaka, Muscula, errorception, ExceptionHub, Bugsnag, Exceptional, Airbrake, Raygun, RollBar, probably many more

Related bugs

bug 51857: Log JavaScript errors with onerror
bug 45514: add source map support to ResourceLoader