Growth/Personalized first day/Structured tasks/Copyedit

Group:	Growth
Start:	2021-07-19
Team members:	Martin Gerlach (Researcher), Gergő Tisza (Software Engineer), Benoît Evellin (Community Relations Specialist), Elena Tonkovidova (QA Engineer), Kosta Harlan (Software Engineer), Morten Warncke-Wang (Data Analyst), Rita Ho (Lead UX Designer), Max Binder (Coach), Mew Ophaswongse (Software Engineer)
Project lead:	Kosta Harlan
Management:	Marcella Florence (Engineering), Marshall Miller (Product)
Updates:	Most updates are posted here.

This page describes the work on the "copyedit" structured task, a type of structured task that the Growth team may offer through the newcomer homepage. The page contains main goals, designs, open questions, and decisions. Most incremental progress updates will be posted on the general Growth team updates page, with some major or detailed updates posted here.

Current status

2021-07-19: project page created and background research started.
2022-08-12: add initial research results.
Next: complete manual evaluation.

Summary

Structured tasks are meant to break editing tasks down into step-by-step workflows that suit newcomers and work well on mobile devices. The Growth team believes that introducing these new kinds of editing workflows will allow more new users to begin contributing to Wikipedia, some of whom will learn to make more substantial edits and become active in their communities. After discussing structured tasks with communities, we decided to build the first structured task: "add a link".

Even as we built that first task, we were thinking about what the subsequent structured tasks could be. We want newcomers to have multiple task types to choose from, so that they can find the ones they like doing and move on to more difficult edits as they learn more. The second task we started working on was "add an image". However, in our first discussions with communities about the idea of structured tasks, communities were particularly keen on a copyediting task: anything related to language, spelling, grammar, punctuation, etc. These are our initial notes from reviewing the discussions with community members.

We know there are many open questions about how this would work, and many ways it could go wrong. What kinds of copyediting are we talking about? Only spelling, or something more? Is there an algorithm that works across different languages? We raise these questions in the hope of getting as much feedback as possible from community members and of keeping a discussion going as we decide how to proceed.

Background research

Research plan

Aims

We want to understand which types of copyediting tasks algorithms could assist with.
We want to use an algorithm that can suggest tasks for a type of copyediting in articles across different languages.
We want to know how well the algorithm works (e.g. know which model works best from a set of existing models).

Literature review

What different subtasks are considered copyediting?
Identify different aspects of copyediting across the spectrum: typo/spelling to grammar to style/tone
What are existing approaches to copyediting in Wikipedia?
- Communities such as Guild of Copy Editors or the Typo Team.
- Maintenance-templates such as the copyedit-template.
- Tools such as the moss-tool to identify typos (also JarBot in Arabic Wikipedia)
What existing public, commonly used tools are there for spell-checking, grammar checking, etc., such as hunspell, LanguageTool, or Grammarly?
- We know that our communities prefer transparent algorithms, so that it is easy for everyone to understand where suggestions come from.
- What models are available from research in NLP and ML, for example for the task of Grammatical Error Correction?

Defining the task

Which aspect of copyediting will we model for the structured task?
Type of task: spelling, grammar, tone/style
- For example: What can browser-spellcheckers do?
Granularity -- highlighting task on the level of: article, section, paragraph, sentence, word, sub-word
- Depends on the task
Surface known items (e.g. from templates) or predict new ones?
Only suggest that improvement is needed, or suggest how to improve?
- Suggesting improvement is easier for simpler tasks.
- Simply highlighting that work is needed is easier for more complex tasks (e.g. style or tone)
Language support: how many languages do we aim to support?
- Include Spanish and Portuguese as target languages alongside Arabic, Vietnamese, Bengali, Czech.
- We ideally want to cover all languages, but will realistically need to evaluate solutions based on the depth of their language coverage.

Building a dataset for evaluation

Generate a test-dataset (ideally in multiple languages) for the task, on which we can compare different algorithms. This can be achieved in different ways:
- An existing benchmark dataset, such as CoNLL-2014 Shared Task on Grammatical Error Correction, or approaches for corpora generation (from Wikipedia)
- Generate our own dataset from revision history using templates (copyedit) or edit summaries (typo)
- Manual evaluation of output of models run on a set of sentences from Wikipedia.
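As a rough illustration of the edit-summary option above, the sketch below selects revisions whose summary mentions a typo or spelling fix. The revision records, the `comment` field name, and the English-only keyword pattern are assumptions made for this example; a real dataset would need per-language keywords and the actual revision-history format.

```python
import re

# Assumed English-only keywords; other wikis would need their own patterns.
TYPO_SUMMARY = re.compile(r"\b(typos?|spelling|misspell\w*)\b", re.IGNORECASE)


def select_typo_revisions(revisions):
    """Keep revisions whose edit summary suggests a typo or spelling fix."""
    return [
        rev for rev in revisions
        if TYPO_SUMMARY.search(rev.get("comment") or "")
    ]


# Hypothetical revision records, for illustration only.
revisions = [
    {"rev_id": 1, "comment": "fix typo"},
    {"rev_id": 2, "comment": "added new section on history"},
    {"rev_id": 3, "comment": "Spelling corrections"},
    {"rev_id": 4, "comment": None},  # revisions may lack a summary
]

typo_revisions = select_typo_revisions(revisions)
print([rev["rev_id"] for rev in typo_revisions])
```

Edit summaries are noisy, so a dataset built this way would likely still need manual spot checks.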

Research results

A full summary of the research is available on Meta-Wiki: Research:Copyediting as a structured task.

Literature Review

Background research and literature review can be read in full.

Main findings:

Simple spell- and grammar checkers such as LanguageTool or Enchant are most suitable for supporting copyediting across many languages and are open/free.
Some adaptation to the context of Wikipedia and structured task will be required in order to decrease the sensitivity of the models; common approaches are to ignore everything in quotes or text that is linked.
The challenge will be to develop a ground-truth dataset for backtesting. Likely, some manual evaluation will be needed.
Long-term: Develop a model to highlight sentences that require editing (without necessarily suggesting a correction) based on copyediting templates. This could provide a set of more challenging copyediting tasks compared to spellchecking.

LanguageTool

We have identified LanguageTool as a candidate to surface possible copyedits in articles because:

It is open, actively developed, and supports 30+ languages.
The rule-based approach has the advantage that each error comes with an explanation of why it was highlighted, rather than just a high score from an ML model. In addition, the community can add custom rules: https://community.languagetool.org/
The copyedits from LanguageTool go beyond dictionary-based spellchecking of single words and also capture grammatical and style errors.

We can get a very rough approximation of how well LanguageTool works for detecting copyedits in Wikipedia articles by comparing the number of errors in featured articles with the number in articles containing a copyedit-template. We find that the performance is reasonable in many languages after applying a post-processing step in which we filter out some of the errors from LanguageTool (e.g. those overlapping with links or bold text).
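The post-processing step mentioned above can be pictured as dropping every reported error whose character range overlaps a protected range such as a link or bold text. The following is a minimal sketch under that assumption; the `Span` type, the function names, and the sample offsets are hypothetical and do not reflect the actual filter implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def filter_errors(errors, protected):
    """Drop errors that overlap any protected span (e.g. links, bold text)."""
    return [e for e in errors if not any(overlaps(e, p) for p in protected)]


text = "The Eifel Tower is in Pariss."
errors = [Span(4, 9), Span(22, 28)]  # "Eifel" and "Pariss" flagged by a checker
protected = [Span(4, 15)]            # assume "Eifel Tower" is linked text

kept = filter_errors(errors, protected)
print([text[e.start:e.end] for e in kept])
```

Only the error outside the linked text survives, which trades some recall for fewer false positives on names and titles.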

We also compared the performance of simple spellcheckers which are available for more languages than supported by LanguageTool. They can also surface many meaningful errors for copyediting but suffer from a much higher rate of false positives. This can be partially addressed by post-processing steps to filter the errors. Another disadvantage is that spellcheckers perform considerably worse than LanguageTool in suggesting the correct improvement for the error.

One potentially substantial improvement could be to develop a model which assigns a confidence score to the errors surfaced by LanguageTool or a spellchecker. This would allow us to prioritize, for the copyediting structured task, those errors for which we have high confidence that they are true copyedits. Some initial thoughts are in T299245.

Read here for more details: Research:Copyediting as a structured task/LanguageTool

Evaluation

We have completed an initial evaluation of sample copy edits utilizing LanguageTool and Hunspell. To compare how each tool worked for Wikipedia articles, our research team created a list of sample copy edits for 5 languages: Arabic, Bengali, Czech, Spanish (Growth pilot wikis) and English (as a test-case for debugging).

Methodology

Started with a subset of the first 10,000 articles from the HTML dumps, using the 20220801 snapshot of the respective wiki (arwiki, bnwiki, cswiki, eswiki, and enwiki).
Extracted the plain text from the HTML-version of the article (trying to remove any tables, images, etc).
Ran LanguageTool and the Hunspell-spellchecker on the plain text.
Applied a series of filters to decrease the number of false positives (further details available in this Phabricator task).
Selected the first 100 articles for which there is at least one error left after the filtering. We only consider articles that have not been edited in at least 1 year. For each article, only one error was selected randomly; thus for each language we had 100 errors from 100 different articles.
Growth Ambassadors evaluated the samples in their first language and decided whether each suggested edit was accurate, incorrect, or unclear (the suggestion wasn't clearly right or wrong), or whether they were unsure.
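The sampling step in the methodology above can be sketched as follows. The article records, field names (`days_since_last_edit`, `errors`), and the fixed random seed are assumptions for illustration; the real pipeline worked on the HTML dumps and its own filtered error lists.

```python
import random


def sample_errors(articles, n_articles=100, min_days_unedited=365, seed=0):
    """Pick one random error from each of the first n eligible articles.

    An article is eligible if it has not been edited for at least
    min_days_unedited days and still has at least one error after filtering.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    sample = []
    for article in articles:
        if article["days_since_last_edit"] < min_days_unedited:
            continue
        if not article["errors"]:
            continue
        sample.append((article["title"], rng.choice(article["errors"])))
        if len(sample) == n_articles:
            break
    return sample


# Hypothetical records in dump order, for illustration only.
articles = [
    {"title": "A", "days_since_last_edit": 400, "errors": ["e1", "e2"]},
    {"title": "B", "days_since_last_edit": 30, "errors": ["e3"]},
    {"title": "C", "days_since_last_edit": 500, "errors": []},
    {"title": "D", "days_since_last_edit": 800, "errors": ["e4"]},
]

sample = sample_errors(articles, n_articles=2)
print(sample)
```

Taking a single error per article keeps one heavily flagged article from dominating the sample.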

Results

Hunspell

Hunspell's copy edits were judged less than 40% accurate across all wikis. Suggestions were judged accurate in 39% of cases for English, 11% for Spanish, 32% for Arabic, 16% for Czech, and 0% for Bengali.

Hunspell results for English, Spanish, Arabic, Czech, and Bengali
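For reference, an accuracy figure of this kind can be computed as the share of evaluated suggestions that were judged accurate. The sketch below assumes that "unsure" and "unclear" judgments stay in the denominator, which is our reading rather than a documented rule, and the judgment list is made up for illustration.

```python
from collections import Counter


def precision(judgments):
    """Share of suggestions judged 'accurate' among all evaluated suggestions."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return counts["accurate"] / total if total else 0.0


# Hypothetical judgments, not actual evaluation data.
judgments = ["accurate", "incorrect", "accurate", "unsure", "incorrect"]

print(f"{precision(judgments):.0%}")
```

Excluding the "unsure" and "unclear" cases from the denominator instead would give a higher figure, so the choice should be stated alongside any reported number.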

LanguageTool

LanguageTool first evaluation (V1 sample): LanguageTool currently supports ~30 languages, but only two of the Growth team pilot languages are among them: Spanish and Arabic. LanguageTool's copy edits were judged 50% accurate or higher across all three wikis evaluated. Suggestions were judged accurate in 51% of cases for English, 50% for Spanish, and 57% for Arabic.

LanguageTool V1 results for English, Spanish, and Arabic

LanguageTool second evaluation (V2 sample): We completed a second evaluation of LanguageTool as a way to surface copy edits in Wikipedia articles. We evaluated suggested errors in Arabic, English, and Spanish. In the previous evaluation we determined that certain rules often resulted in incorrect suggestions, so we added functionality to filter out those rules. The resulting suggestions were judged more accurate than those in the V1 sample.

LanguageTool V2 results for English, Spanish, and Arabic

Common Misspellings

For this evaluation we simply used a list of common misspellings curated by Growth pilot Ambassadors, and then checked for those misspellings in Wikipedia articles. Results looked promising, but we ended up with a fairly small sample in some languages. This might be a way to provide coverage for languages that aren't supported by LanguageTool. However, if we pursue this option further, we will test again with a longer list of misspellings to see whether we can get more representative and significant results (and a better sense of what level of coverage this solution would provide).
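A check against a curated misspelling list could look roughly like the sketch below. The word list, function name, and use of regex word boundaries are assumptions for this example; in particular, `\b` is a simplification that may not segment words correctly in every script.

```python
import re


def find_misspellings(text, misspellings):
    """Return (offset, found_word, suggested_correction) for each listed misspelling.

    misspellings maps a lowercase misspelled word to its correction.
    """
    if not misspellings:
        return []
    # Longest words first so a longer entry wins over a shorter prefix of it.
    words = sorted(misspellings, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.group(0), misspellings[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


# Hypothetical list; real lists were curated per language by Ambassadors.
common_misspellings = {"recieve": "receive", "untill": "until"}

hits = find_misspellings("They recieve mail untill noon.", common_misspellings)
print(hits)
```

Because every hit maps to a known correction, this approach can suggest a fix directly, at the cost of only covering errors that are already on the list.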

Next Steps

Consider how to better handle highly inflected and agglutinative languages, which likely won't benefit much from standard spell-checking approaches.

Further improve the LanguageTool filters to decrease the number of false positives and thus improve accuracy.

For languages not supported by an open source copy editing tool, we will consider a rule-based approach, i.e. only looking for very specific errors which could be based on a list of common misspellings. We will set up an additional test to estimate the accuracy and coverage of this type of approach.