This interface has the same core problems as many other such rating systems: (1) it introduces far too heavy a workload, (2) it sits outside the normal workflow, and (3) it is not clear how it is actually meant to be used.
(1) Writing a manual reason is not efficient. Making an interface like this will only lead to users writing crappy gadgets to work around the limitation. It would be far better (faster) to have prepared reasons, and that would also make it a lot easier to analyse the reasons later on. (The four prepared reasons do not really fit very well. Bad faith? How can you really know if something is done in bad faith? This is plain crystal ball!)
(2) Whether something like this is used depends on it being part of a workflow. This seems to be outside all workflows, and then I start wondering why anyone should use the page at all. The only process I know of that really needs something like this is Featured Articles, but I'm not sure there is enough willingness to adapt current processes.
(3) The core problem: should you describe a change or a revision? You identify and post a comment on a revision, yet what is shown is a diff. What do you comment on in this case? I have no idea. See also phab:T185247; you need a rationale for your classification, but then what do you classify?
A fourth problem with this is that it creates a new point of vandalism. This happens because of the free-form judgement; prepared judgements would be harder to use for targeted vandalism.
One way to make vandalism harder is to restrict non-autoconfirmed users to prepared judgements, while autoconfirmed users are allowed to use free-form judgements. That sort of works. (That is, autoconfirmed status is used as the cost the user must pay to be allowed to use the free-form judgement.)
It is designed to not incur a new workload. Patrollers would add labels to Jade by going about their normal patrolling activity via integrations with the tools that they are using. This is a long-standing ask from patrollers and tool developers working on patrolling support.
The only time there is new work to do is when a disagreement occurs. Say, for example, one editor thinks an edit is damaging and another does not; they might need to talk to each other to work out their differences. I expect this to be rare, as it is currently a rare occurrence.
If you think that we can't determine good faith from bad faith, I might agree in principle, but in practice, people send counter-vandalism warnings to people they find vandalizing Wikipedia (bad faith) and they send kinder revert notices to people they find making mistakes that don't look like vandalism (good faith). We have had over 30 wikis perform labeling work to differentiate good edits from bad-faith vandalism and good-faith damage, and people seem to do a good job of making sense of this distinction. Further, patrolling tools have this distinction built in; e.g. Huggle has separate buttons and user-warning actions for these two different cases.
We differentiate revisions (versions of the article) from changes (diffs/edits) as different "entity" types. If you are labeling the Diff, you are labeling the change. If you are labeling the Revision, you are labeling that version of the article. Jade's entity visualization makes this clear by either loading the version of the page or the diff of the edit above the proposed labels.
Jade does create a new point where vandals will eventually try to cause problems. Your proposal for "prepared judgments" is interesting, but I don't see that it would be necessary before we see a problem. Surely, we can just prevent contributions in the Jade namespace by anons, new users, etc. after it becomes a problem. If it does become a problem, the same old revert mechanisms will still apply and changes to Jade entities will still appear in the RecentChanges feed -- à la en:User:Risker/Risker's_checklist_for_content-creation_extensions and Everything is a wiki page.
(1) This is not a patroller-only feature, as the subject page explains, but a feature that patrollers might use. If it is an additional feature for the patrollers, then it will give rise to an additional workload. Even loading an additional page will incur additional workload.
(2) Either you have a system where one user takes charge, like present editing where the last editor overrides all, or you have a system where some sort of consensus is built. This is not a consensus-based tool; it is "last editor overrides all".
(3) No, you can't determine what is good faith and what is bad faith. You can only observe the effects of the user's editing. (Patrollers can't even agree on what should and should not be reverted, so how can they agree on why some edit was made?) And no, continuing to do something because it is tradition in some other tool is a fallacy.
(4) I would like to see an interface that is able to make that distinction, and convey it to the editors. I really do, but I doubt it is possible.
(5) This leads back to (1), but now you have patrolling of judgements from earlier patrolling. This adds up layer by layer. The approach is wrong, as it often is in this "wikiway" thinking. You must make it easier to patrol and/or harder to vandalize. Adding more elaborate patrolling, for whatever reason, will make the content harder to maintain, which will lead to fewer users maintaining the actual content, which will lead to lower quality.
The question you should ask is: what are the physical features we can exploit to create a force factor that works with the patroller and makes it harder for a vandal?
(1) Oh yes. For sure. This tool captures information about the work that people are already doing. Right now, this information is either lost or hidden from view. Once it is made visible, there will be work involved in ensuring it doesn't fill up with garbage. The level of that workload has yet to be seen, and we can respond to issues as they arise. In the meantime, we've designed this system to look and function like wiki pages so that the same old workflows apply.
(1.5) Maybe we should have separate threads rather than numbering things? This is getting hard to follow.
(2) Right. So all wiki pages have the same problem then. This system mimics a straw-poll-style discussion. In many wikis, there are rules and processes that avoid the "last editor overrides all" problem that you describe. I would imagine those policies and processes would apply to Jade too.
(3) I do think that people make judgments about this all of the time. This tool is intended to track human judgment, not truth. Differentiating what appears to be good-faith mistakes from what appears to be intentional vandalism is important work. I'm not saying it is tradition. I'm saying it happens, it's consistent, and we can train machines to help. I'm honestly not sure what the problem is. Jade is actually designed to support disagreements about what should or should not be considered "good-faith". I think we'll see some interesting debates come out of that for sure, but for the most part, I expect people to agree 99% of the time (or more) based on all of the edit labeling work people have already been doing to train ORES.
(4) The revision entity will be named "Jade:Revision/123456" and will include a visualization of the entire article. The edit entity will be named "Jade:Diff/123456" and will include a visualization of the diff that the edit creates. It'll be hard to mix the two up because you'll be looking at what you're judging while you're judging it.
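To make the distinction concrete, here is a minimal sketch in Python. It is hypothetical only -- the function, field names, and label values are illustrative and not the actual Jade schema -- but it shows how the two entity types point at different targets even though they share a revision id:

<syntaxhighlight lang="python">
# Hypothetical sketch only -- not the actual Jade schema.
# A "Diff" entity labels the change made by an edit, while a "Revision"
# entity labels the resulting version of the article.

def jade_page_title(entity_type: str, rev_id: int) -> str:
    """Build the wiki page title for a (hypothetical) Jade entity."""
    assert entity_type in ("Diff", "Revision")
    return f"Jade:{entity_type}/{rev_id}"

# Two distinct entities can exist for the same revision id: one holding
# judgments about the edit (the diff) and one about the article version.
diff_judgment = {
    "page": jade_page_title("Diff", 123456),      # "Jade:Diff/123456"
    "target": "the change introduced by edit 123456",
    "labels": {"damaging": False, "goodfaith": True},
}
revision_judgment = {
    "page": jade_page_title("Revision", 123456),  # "Jade:Revision/123456"
    "target": "the article as it stands at revision 123456",
    "labels": {"quality": "B"},
}

print(diff_judgment["page"], "->", diff_judgment["target"])
print(revision_judgment["page"], "->", revision_judgment["target"])
</syntaxhighlight>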
(5) This will be patrolling in the same old way. It's certainly not more complicated and we're not adding new layers -- just another namespace to patrol. It'll have structured data, so it'll be harder to vandalize, but vandalism will still be possible. The whole goal is to use this data to train, test, and govern a machine learning model (see mw:ORES) that makes patrolling easier. I see no good reason why we wouldn't also run ORES on Jade pages. In this way, you could say that making patrolling easier is Jade's primary purpose.
(1) It creates additional work, but in some cases it replaces existing work already being done. I believe it will invite more discussion and more work.
(2) This has drifted from my original reasoning and question, but anyhow, as it stands now it is not about consensus.
(3) What you are trying to do by classifying "good vs bad faith" is to add a pseudo-orthogonal dimension that contradicts the "destructive vs non-destructive" dimension; in other words, you add another source of uncertainty.
(3, your 4) Perhaps people will learn to distinguish them, but I believe it would take time. If people start discussing individual diffs, productive work could grind nearly to a halt.
(your 5) If you use ORES to patrol JADE entries, then you should not use JADE entries to train ORES. You will then create a closed feedback loop, and over time you will get increased bias. This is due to some highly active (dominant) users whose judgments ORES will learn; ORES will then facilitate those (dominant) users' patrolling, which is then fed back into ORES. (You can avoid this by tracking "agents", a.k.a. users, and controlling individual feedback. [Agents act like a feature vector.] Given that IP users are pretty near random, it could be infeasible.)
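To make the worry concrete, here is a toy simulation (a sketch with made-up numbers and function names, not real ORES or Jade code) of the general mechanism: if the model's belief about how common damage is gets retrained only on the items the model itself surfaced for review, the estimate drifts away from the true rate, whereas a uniform random sample stays close to it:

<syntaxhighlight lang="python">
# Toy sketch only: hypothetical numbers, not ORES/Jade internals.
import random

random.seed(42)

TRUE_DAMAGE_RATE = 0.05          # assume 5% of edits are actually damaging
ROUNDS, EDITS_PER_ROUND = 10, 2000

def make_edits(n):
    """Each edit is reduced to a single boolean: is it actually damaging?"""
    return [random.random() < TRUE_DAMAGE_RATE for _ in range(n)]

def reviewer_label(is_damaging):
    """Reviewers are good but not perfect (98% accurate in this toy model)."""
    return is_damaging if random.random() < 0.98 else not is_damaging

def run(closed_loop):
    est_rate = TRUE_DAMAGE_RATE  # the model's current belief about the damage rate
    for _ in range(ROUNDS):
        edits = make_edits(EDITS_PER_ROUND)
        if closed_loop:
            # Closed loop: reviewers mostly see the edits the model surfaced,
            # so the reviewed sample is filtered by the model's own belief.
            reviewed = [e for e in edits
                        if random.random() < (0.8 if e else est_rate)]
        else:
            # Open loop: review a uniform random sample instead.
            reviewed = random.sample(edits, k=200)
        labels = [reviewer_label(e) for e in reviewed]
        if labels:
            # Naive retraining step: believe whatever the reviewed sample says.
            est_rate = sum(labels) / len(labels)
    return est_rate

print("true damage rate:        ", TRUE_DAMAGE_RATE)
print("closed-loop estimate:    ", round(run(closed_loop=True), 3))
print("random-sample estimate:  ", round(run(closed_loop=False), 3))
</syntaxhighlight>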
Given what I have seen about auditing behavior so far, I'm estimating that less than 1% of judgments will attract any additional discussion. But I do admit that this additional discussion is more work. Ultimately, this additional work is needed to get a decision right so I see it as a gap that we *aren't* discussing these decisions now. We don't even know how often someone is reverting something and sending a warning message when it was actually a good edit.
When I classify good vs. bad-faith, I'm mimicking the structure that Huggle and Twinkle (and other tools) use to decide what warning message to send to people in the process of reverting their edits. People *are* already making this distinction. It is a real distinction. It certainly doesn't contradict the "damaging" dimension -- but rather it adds nuance. For example, "What type of damage?" is an important question for people like those who host the en:WP:Teahouse. They invite newcomers who are doing good-faith damaging things but would like to avoid newcomers who are intentionally vandalizing Wikipedia.
I don't think that discussing diffs is going to be more interesting and rewarding than contributing new, productive work. Surely editors can already discuss diffs (and they do; see en:WP:BRD), but this hasn't led to a big dip in productivity as far as I can tell. It's regrettable that we don't have structured information about what opinions were discussed and what conclusions were reached, because that would be really useful.
I think you're misunderstanding what would happen when we use Jade to train ORES. In the case that we use ORES to track vandalism in Jade's namespace, we'd be looking at diffs of Jade entity pages. In the case that we use Jade to train/test ORES, we'd be looking at the judgment data itself. This is where the potential feedback loop breaks. We'll need independent judgments of those edits to the Jade namespace to train anything.
Honestly, when it comes back to feedback loops, I'm far more concerned about ORES predictions leading Jade judgments in a particular direction. *That* is a complete loop. Jade captures data about the editor who submits a judgment and about what they were looking at when they submitted it (e.g. a UI that includes an ORES prediction vs. just looking at a bare version of Special:Diff). We need to do more work exploring how to get good data out of Jade for training. Right now, we train our models using meta:Wiki labels -- a tool that shows random samples of diffs to editors for labeling (no prediction, minimal context about the edit) specifically to break out of these loops. I plan to move towards using Jade data cautiously, using science to study which processes produce consistent data -- data where new reviewers are likely to agree with the consensus judgment.
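As a rough sketch of that sampling idea (function names and data shapes are made up for illustration; this is not the actual Wiki labels code), the point is simply that training labels come from a uniform random sample presented without any model prediction:

<syntaxhighlight lang="python">
# Illustrative sketch only -- not the Wiki labels implementation.
import random

def sample_revisions_for_labeling(recent_rev_ids, n=50, seed=None):
    """Pick a uniform random sample of revisions, independent of any model score."""
    rng = random.Random(seed)
    return rng.sample(recent_rev_ids, k=min(n, len(recent_rev_ids)))

def present_for_labeling(rev_id):
    """Show only the bare diff -- deliberately no prediction, no patrol flags."""
    return {"rev_id": rev_id, "show_prediction": False, "context": "bare diff"}

# Labels gathered this way can be compared against labels produced while
# predictions were visible, to measure how much the predictions lead reviewers.
recent = list(range(1000, 1100))  # placeholder revision ids
tasks = [present_for_labeling(r)
         for r in sample_revisions_for_labeling(recent, n=5, seed=1)]
for t in tasks:
    print(t)
</syntaxhighlight>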
Long term, it might be that Wiki labels continues to play a critical role in making sure we train/test ORES with good data. Regardless, Jade will be essential infrastructure for finding out whether ORES is going off the rails and for making sure that we can track and deal with subtle prediction bugs. AIs aren't cheap. In order to reap the efficiency they bring to quality control work, we must track their behavior. Jade is intended to be a powerful tool for keeping ORES in check. Right now, we're tracking ORES behavior sporadically. I'd like to track it consistently, leveraging the judgments that people are already making. That's the ask I have gotten from patrollers and patrolling tool developers. Have you ever participated in an audit of ORES? See
for a great example of an audit that helped us fix some subtle issues with the model.
I'm probably mistaken, but is JADE supposed to replace user feedback?
Whatever program acts as a crystal ball, that will not change the fact that it is a crystal ball.
Discussing diffs is mostly an enwiki phenomenon, and its existence on other projects is mostly limited to edit wars. I am afraid it will spread.
In a feedback loop you get the same result whether A → B → A or B → A → B. You close the loop, and the system will spiral out of control unless you use some form of countermeasure. You may not even see what happens before you lose control, i.e. observability vs. controllability. (Claiming that you use "science" will not save you from the effects of a closed feedback loop.)
This time around I'm going to wait for real numbers before I promote the use of additional ORES-based tools. I want to see real numbers for the workload, both added and removed workload, before I promote anything.
Jade is supposed to supplement user feedback. It would not replace anything. In fact, Jade is an excellent mechanism for making the case that an individual piece of user feedback corresponds to a larger trend.
In the feedback loop example, I'm saying that A → B and C → A is not a complete loop. Ultimately, B and C do not directly correspond.
Jade is the counter-measure to prevent ORES from spiraling out of control. The application of scientific methods will help us detect feedback loops. E.g., we can take observations outside the context of ORES and compare them to how judgments work when ORES is present in order to learn the effect of ORES on producing feedback loops.
We won't have real numbers until Jade is adopted and used. I'm not asking you to promote anything. But thanks for your thoughts.