Talk:Requests for comment/Server-side Javascript error logging
EventLogging
editRather than reimplement something EventLogging-ish, it probably makes sense to either require EventLogging or make it pluggable. There can be a core implementation (possibly using a MW API as the RFC suggests, with corresponding client-side JS module) and an EventLogging listener that sends it to the standard EventLogging server setup.
- Event logging is a tier-2 service not designed for bursts of traffic at all. Actually bursts of traffic are likely to affect and break all users of the service. Also being a tier-2 system (see: EventLogging/OperationalSupport) it is understood that EL can be down for a couple days without issues so it is not a good choice for client side logging which should be tier-1.NRuiz (WMF) (talk) 08:53, 6 August 2014 (UTC)
It may be enough to use mw.track; WikimediaEvents could subscribe and transform it into an EventLogging event. mw.track doesn't have any core subscribers, but perhaps a simple one (that could be turned off) that only listened to onerror could be added as part of this). Superm401 - Talk 00:16, 25 July 2014 (UTC)
- My main concern is that EventLogging has been known to have trouble under unexpectedly high load. If we deploy a software update that causes a JS error in a large percentage of users, that could REALLY spike the load on the system with millions of very-similar entries. If EL can handle this, then great! If not, what would it take to either improve the EL infrastructure or build something similar that can handle that? --brion (talk) 00:26, 25 July 2014 (UTC)
What I have heard is that the way EventLogging transmits events could scale to significantly higher loads but the way it stores them (in MySQL) not so much. And we probably want something more searchable anyway. Is it possible to reuse the EL infrastructure but do something custom at the last step instead of writing into a DB? --Tgr (WMF) (talk) 00:29, 25 July 2014 (UTC)
- Eventlogging is designed for discrete events and low load. As I pointed above EL is a tier-2 service (see: EventLogging/OperationalSupport), in any case tier-1 like logging should be. While there are similarities between what EL does and client side logging would do there are more differences, the main one is bursts of traffic. Also EL data is neither public nor searchable. Seems to me that you need a specific client side mechanism with adjustable sampling rates, client side throttling and searchable backend over which alarms can be set. You have no need for schema versioning (one of EL strong points), very long term storage or an event capsule NRuiz (WMF) (talk) 08:53, 6 August 2014 (UTC)
- Nuria, what I meant by reusing is that the error logger would have its own JS code and its own Vagrant endpoint, would use the same mechanism as EventLogging (Kafka?) to send log events from Vagrant, and would have a different consumer process at the and than EL. Do you think this could work, or would bursts still be problematic? --Tgr (WMF) (talk) 16:40, 6 August 2014 (UTC)
- Tgr (WMF) I get your idea, I think. Bursts of traffic still will go through varnish and it seems that they could go directly to kafka. Also, EL is build around schema validation which in this case does not apply. I understand there are similarities but I would use a similar but much, much lighter system whose client besides being subjected to a sampling rate has throttling. Also I would log to flat files or logstash or sentry rather than to a db, or kafka or hadoop. Overall is a much simpler system when it comes to design but much higher volume that EL is. Do reach me in IRC if you want to talk about this further. 87.221.167.158 17:39, 6 August 2014 (UTC)
The other way this could cause a spike is when errors are generated by some event which fires extremely often (say, a broken scroll or mousemove handler). This should probably be throttled on the client side though. --Tgr (WMF) (talk) 00:31, 25 July 2014 (UTC)
- Not only that, this needs "grouping" on the backend so you do not see literally millions of errors being triggered when someone 'scrolls' over a wikipedia page and some bug gets triggered 2601:8:AC00:FF3:4959:BE51:6556:E829 00:08, 11 April 2015 (UTC)
Reusability and data processing
editI understand that your solution needs to scale to WMF needs, but please make it so that it is also usable by other sites.
We do have a solution at translatewiki.net which implements error catching and delivery to the server, but we lack good processing and displaying. Compared to PHP error logs, JS are of much lower quality (lack of info, l10n as mentioned in the RFC) but also lots of user specific issues due to user scripts or browser extensions – something you cannot reproduce easily and there isn't enough information to understand what is causing it. Hence it has not seen much use in practice.
I suggest you do more research for candidate solutions for processing and display part – preferably so that we are not tried to one component which is not easy to install by third parties.
There is no mention of tracking the version of MediaWiki core, extensions and what gadgets are enabled. I think these are important information to know when new errors are appearing or old ones going away.
--Nikerabbit (talk) 10:06, 25 July 2014 (UTC)
For small sites, errors would be logged to a dedicated channel via the normal MediaWiki logging architecture; they are free to configure whatever handling they want for that channel. Any code that we write for processing should be reusable in such a scenario, that's a good point (the easiest way would be to expose them as Monolog filters). For display, small sites can probably use software built specifically with this aim (see Non-MediaWiki examples).
Good point about the version information, added that to the proposal. --Tgr (WMF) (talk) 19:07, 25 July 2014 (UTC)
Target audience
editWere it implemented, would gadgets, global scripts, user scripts be able to use the server-side logging? --Gryllida 07:02, 27 July 2014 (UTC)
Any Javascript error would be logged, regardless of the source. --Tgr (WMF) (talk) 23:20, 27 July 2014 (UTC)
Opting out
editWould global scripts, user scripts, gadgets, or any code in principle be able to opt-out of server-side logging? Would you consider such ability to opt out necessary, and if so, in what circumstances? Gryllida 07:02, 27 July 2014 (UTC)
Probably not. What would be the point of opting out, anyway? I can see users wanting to opt out for privacy reasons, and supporting that should be straightforward, but why would a gadget need to opt out? --Tgr (WMF) (talk) 23:23, 27 July 2014 (UTC)
I had debugging use-case in mind, during which a user breaks things on purpose. Then users opt-out may work. :-)
I'll be happy to see more discussion on this topic, if needed, such as for other use-cases I may have missed. --Gryllida 08:55, 29 July 2014 (UTC)
Errors coming from userscripts / dynamic code can just be filtered out on the server side (and have to be anyway, browser extensions tend to generate all sort of crap), no need for users to bother with that. The errors come with a stack trace, so it's easy to see whether they happen in MediaWiki core, an extension, a gadget or something else. --Tgr (WMF) (talk) 22:41, 29 July 2014 (UTC)
Other languages people
editThe proposal page is only in English. How would you like non-English projects people (who don't visit mw.org and who don't read mailing lists) to know about this request for comment? --Gryllida 08:57, 29 July 2014 (UTC)
RfC meeting
editThis RFC has been scheduled to be discussed in the Architecture RfC meeting today, 2014-09-03. Sorry for the late notice.--Qgil-WMF (talk) 15:49, 3 September 2014 (UTC)
21:48:58 <TimStarling> tgr: I like your proposal, do you need anything to make it happen? 21:49:28 <tgr> first of all feedback that it's worth investing into 21:49:41 <TimStarling> it sounds pretty awesome to me 21:49:49 <tgr> it turned out significantly more complex than I initially thought 21:50:13 <TimStarling> #topic Server-side Javascript error logging | RFC meeting September 3 | Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE).| https://meta.wikimedia.org/wiki/IRC_office_hours | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ 21:50:21 <TimStarling> #link https://www.mediawiki.org/wiki/Requests_for_comment/Server-side_Javascript_error_logging 21:50:55 <tgr> I have three points specifically where I would like to know if the tradeoff is acceptable 21:51:07 <ori> what are they? 21:51:37 <tgr> 1. cors, 2. source maps, 3. data transport between client and server 21:52:32 <tgr> about cors: modern browsers sanitize errors if the script is on a different domain, need to add crossorigin="anonymous" to the script tag to avoid that 21:52:59 <tgr> if you do that and there is no Accept-Origin header, the browser will handle it as a 404 21:53:21 <csteipp> Eek. I'm not a fan of Access-Control-Allow-Origin: * 21:53:24 <ori> so, imo you shouldn't worry too much about those. having the user-agent and the name of the module where the error occurred (easy to do if you wrap RL's execute() with a try/catch) is already very useful and heaps better than flying blind, which is what we do now 21:53:25 <tgr> which means ResourceLoader would have to have some sort of fallback behavior 21:53:52 <ori> the error details aren't terribly important; you can just load the module in the same browser environment and find out that way 21:54:30 <ori> the bigger issue is how to avoid DDOSing ourselves if we happen to deploy code that breaks on a popular browser 21:54:33 <tgr> if the script is not same-domain, there is no error message, no file, no line number, no exception 21:54:47 <tgr> an error report without those is pretty pointless 21:55:01 <robla> "sanitize" == "swallow" then 21:55:09 <tgr> pretty much 21:55:31 <ori> i don't think that's true for RL modules, because they just register themselves, and it's RL itself that actually invokes them, IIRC 21:55:58 <csteipp> But they all come from bits right? 21:56:25 <csteipp> But... bits already allows cors, iirc. 21:56:30 <tgr> I think window.onerror gets the domain of the HTML file, regardless of what file sets it 21:57:02 <tgr> the problem is that there are weird things on the internet which remove CORS headers 21:57:13 <tgr> office firewalls and such 21:57:13 <ori> iirc every RL module is executed by <https://github.com/wikimedia/mediawiki-core/blob/master/resources/src/mediawiki/mediawiki.js#L1092> 21:57:21 <ori> which knows the name of the module it is about to invoke 21:57:33 <TimStarling> #chair ori csteipp 21:57:33 <wm-labs-meetbot`> Current chairs: TimStarling csteipp ori 21:58:02 <tgr> if you use the crossorigin attribute, but there is no header, the script just does not load, period 21:58:21 <TimStarling> can one of you do #endmeeting when you are done talking? I have to go 21:58:41 <csteipp> will do 21:59:27 <tgr> so unless there is some workaround that I don't see, ResourceLoader needs to catch CORS failures and load scripts without a crossorigin attribute in that case, which is a significant increase in complexity 22:00:16 <tgr> ori: here is a simple test case: https://www.mediawiki.org/wiki/User:Tgr_(WMF)/common.js 22:00:29 <tgr> the error source and the onerror handler are in the same file, even 22:00:39 <tgr> you still don't get error details 22:00:43 <csteipp> tgr: That does make it hard. Wouldn't knowing that there is an error be a good, and easy first step. And then adding a test for cors would be even easier, since we would be logging if the test started failing? 22:01:40 <tgr> csteipp: agreed 22:02:05 <ori> take a look at https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.echo.base&skin=vector&version=20140903T215031Z&* 22:02:10 <tgr> also, older browsers don't care about domains, and getting error details for old IE would already be useful 22:02:19 <ori> you can see that the script source is wrapped with: 22:02:28 <ori> if(window.mw){mw.loader.implement("ext.echo.base",function($,jQuery){ 22:02:51 <ori> this registers the function with resourceloader, which then executes it 22:03:18 <tgr> ori: I don't see why that's relevant 22:04:46 <ori> it's relevant because the code of every RL module is enclosed in a closure rather than immediately executed in top scope 22:05:05 <ori> and the closure gets invoked by a script that is always on the same domain, which we control, and which knows the name of the module it is about to call 22:05:56 <ori> this is done in https://github.com/wikimedia/mediawiki-core/blob/master/includes/resourceloader/ResourceLoader.php#L1053 in PHP and in execute() in JS 22:06:24 <tgr> if you are thinking about adding a try-catch to the RL wrapper, that would still not catch exceptions in event handlers, timeout callbacks and such 22:06:55 <ori> yes, but 22:07:13 <ori> take a look at the wrapper code again 22:07:14 <ori> if(window.mw){mw.loader.implement("ext.echo.base",function($,jQuery){ 22:07:19 <ori> we pass each module $ and jQuery 22:07:36 <ori> i added that explicitly so we can pass a patched $ / jQuery to each module 22:08:00 <ori> such that $('.foo').on('something', callback ) can be associated with the module that set it 22:08:02 <tgr> ah, I see 22:08:16 <ori> that doesn't cover window.setTimeout, but it's pretty comprehensive 22:08:17 <tgr> that would work most of the time 22:08:33 <ori> (we could even patch setTimeout, but that's getting ahead of ourselves) 22:08:47 <ori> IIRC Qunit does that 22:08:57 <tgr> sinon.js does, yeah 22:09:15 <ori> Krinkle is absurdly knowledgeable about these things, btw, and is a good resource for that reason 22:09:44 <Krinkle> Patch setTimeout for what? 22:09:57 <Krinkle> scope doesn't change 22:10:09 <ori> so that errors can always be associated with a module 22:10:17 <ori> even if it's asynchronously-executed module code 22:10:26 <ori> that a try/catch in resourceloader's execute() wouldn't catch 22:10:46 <Krinkle> ah, errors, not $/jQuery perf 22:11:02 * ori nods 22:12:36 <tgr> OK, that does sound like a valid workaround 22:13:20 <tgr> the second problem is with source maps, error reports in non-debug mode are pretty useless without that 22:13:54 <ori> you have the module name, that's pretty good already 22:14:02 <ori> module name + user agent is good enough in most cases 22:15:38 <tgr> the main use case of error logging would be 1) error stats, 2) good reports about errors which are not easy to reproduce 22:15:57 <tgr> non-deterministic or race condition or such 22:16:28 <Krinkle> QUnit doesn't, SinonJS does. And it does it synchronously, and in a terribly unstable just-about-works way. 22:16:30 <tgr> if it's easy to reproduce, the developer can usually figure how to do that out from the error report anyway 22:17:04 <Krinkle> what's the topic btw? Is there an office hour? 22:17:18 <Krinkle> debug mode is going away in it's current form as soon as source maps are implemetned. 22:17:19 <tgr> Krinkle: server-side Javascript error logging 22:17:34 <tgr> https://www.mediawiki.org/wiki/Requests_for_comment/Server-side_Javascript_error_logging 22:17:40 <ori> we should probably close the meeting (we can continue chatting informally) 22:17:57 <tgr> you are the chair :) 22:18:07 <ori> #endmeeting