User:PerfektesChaos/WikidiffLX/suggestions

Further details on the suggestions: Notes on implementation and methods.

The coding as such is detailed here.

Improve modified consecutive lines edit

Objective: The line based algorithm makes minor corrections in consecutive lines appear as a dramatic change.

Example:

previous line small modification …whaffle… …blah…   …Blah… …Whaffle… change little following line
previous line minor modification …whaffle… …blah… added line …Blah… …Whaffle… change slighty following line

results currently in:

previous line previous line
small modification

…whaffle…

…blah…
 
…Blah…

…Whaffle…

change little
 
  minor modification

…whaffle…

…blah…
  added line
  …Blah…

…Whaffle…

change slighty
following line following line

This shall be made more readable by an improved algorithm:

previous line previous line
small modification

…whaffle…

…blah…
minor modification

…whaffle…

…blah…
  added line
…Blah…

…Whaffle…

change little
…Blah…

…Whaffle…

change slighty
following line following line

Background: Line comparison algorithms have been developed in the 1970s. Due to the size of a punched card no program line was longer than 80 characters. This gives meaningful results when detecting modified and unchanged sections of lines.
However, wikitext lines (I would like to call them ‘paragraphs’ here) may consist of 1000 characters and more, containing several sentences in human language. Any slight change, even a single space, makes the entire paragraph to be ‘changed’. The line comparison algorithm needs to skip over this and looks for the next recovering point with absolute identity.

Suggestion:

  1. Split long lines by insertion of 'virtual' breaks.
  2. Run the diff engine as is (cheat by virtual lines).
  3. Analyze the result and merge adjacent virtual lines.
  4. Display the differences based on the original paragraphs.

The following rules illustrate how splitting is to be performed. Figures like 100 and 250 are just clarifying the envisioned size, they are set by #define and may be chosen as desired.

  • If a line is longer than 250 bytes
    • search a possible position for breaking
      • start at byte 100
      • look for e.g. ". " (period+space, \x2E\x20) or similar
      • if remaining length is longer than 100 bytes
        • insert virtual break
    • if remaining length is longer than 250 bytes
      • start at byte 100 of the remainder as above

The choice of something like period+space for a break point is conducted by the following assumption: A long paragraph is quite likely written in human language. If the author reformulates something, the entire sentence might have been subject to modification. The following sentence is kept unchanged, hopefully. A common separator in human languages is the period and a space; period might have another meaning (e.g. abbreviation) but doesn’t matter. There are many other ways to terminate a sentence in various human languages. For the sake of simplicity and speed of the program we look also for exclamation+space and question+space only; this might still be extended to U+3002 in CJK.

Splitting yields to

previous line small modification …whaffle… …blah… …Blah… …Whaffle… change little following line
previous line minor modification …whaffle… …blah… added line …Blah… …Whaffle… change slighty following line

The result of the diff engine is an amalgamated sequence of parts, containing the original from and to as well as a code op (details here). Note that from and to lines with the same op code are already merged into line arrays by diff engine.


The example above results in the traditional diff engine output:

from previous line¶ small modification …whaffle… …blah…¶ …Blah… …Whaffle… change little null null null following line¶
to previous line¶ null null minor modification …whaffle… …blah…¶ added line …Blah… …Whaffle… change slighty following line¶
op copy del del add add add copy

Following the suggested insertion of virtual line breaks the recovering points were found:

from previous line¶ small modification♣ …whaffle…♣ …blah…¶ null …Blah…♣ …Whaffle…♣ change little following line¶
to previous line¶ minor modification♣ …whaffle…♣ …blah…¶ added line …Blah…♣ …Whaffle…♣ change slighty following line¶
op copy change copy copy add copy copy change copy


This leads to the following rules for merging of virtual line breaks, showing any possible combination of op codes:

Engine result Reconstruction
op[i] from[i] to[i] op[i+1] op[i] op[i+1] Remark
copy
change
hard hard any keep Skip
Regular procedure
del null
add null hard
copy virtual virtual copy keep Remove Merge entire [i] with successor [i+1]
from[i]=from[i]+from[i+1]
to[i]=to[i]+to[i+1]
Then Remove [i+1]
else change
change any
del null del keep
else change
add null virtual add keep
else change
copy
change
hard virtual copy
change
del
del change to[i+1]=to[i]+to[i+1]
to[i]=null
add add
virtual hard copy
change
add
add change from[i+1]=from[i]+from[i+1]
from[i]=null
del del

Example for merging hard and virtual line break:

Before reconstruction
# op from to
i copy Virtually terminated sentence. Virtually terminated sentence.
i+1 del Deleted line of any termination … null
i+2 any And now … … something completely different.
After reconstruction
# op from to
i change Virtually terminated sentence. Deleted line of any termination … Virtually terminated sentence.
i+1 any And now … … something completely different.

However, the suggested implementation does not merge or erase cells. Rather than changing the Diff vector the entire information (op code) is transferred into the Line vectors first, equipping each Line object with op code and a pointer to corresponding Line, if any. Whitespace only changes are detected now and the pair of such lines gets a specific mark. Adjusting virtual lines with paragraph cells is done only by changing the op code of the Line.

After coordinating the segmentation the two Line vectors also contain the final sequence of op codes. Those which are unchanged are marked as copy. Only two (or which number ever defined) visible (virtual) context lines are displayed adjacent to the sequence of modified paragraphs. If an entire virtual line is added or deleted the op code is turned into change but the pointer to the corresponding Line stays NULL.

The sequence of unchanged or in some way modified lines is then presented as HTML.

There are additional efforts required for postprocessing as described. On the other hand some resources are saved: Since any invisible line is hidden from the diff algorithm the number of lines processed by the engine is smaller. This saves storage and objects in Diff, which are eaten by storage for flags and marks in each Line object. Lines already detected to be equal in their visible part need not to be analyzed word by word for presentation if the only difference is trailing whitespace.

Coding edit

Suggestions for code are made

  • by an extension of explodeLines() for detection of virtual breaks.
  • by various modifications in the aftermath when re-adjusting the diff engine result.

Avoid confusion by empty lines edit

Objective: If an “empty line” (maybe with some invisible content) has been inserted or removed by the author and the adjacent paragraphs are modified in some way, the presentation of the result is currently disturbed.

Currently:

previous line previous line
small modification  
change little  
  minor modification
   
  change slighty
following line following line

This could be remedied as:

previous line previous line
small modification minor modification
   
change little change slighty
following line following line

Solution: The method is already in effect for Word objects:
Administrate

  • a suffix: spaces between last visible character and line termination (if virtual break identified by period+space there is a suffix of at least one space)
  • trailing lines: after hard break any further paragraph without any visible content (ASCII spaces and \t for the moment, might be extended later to invisible Unicode)

As currently performed for Word strings compare the visible body of Line only in the DiffEngine.

Postprocess the Line objects for reconstruction of the original paragraphs, if there are any changes.

Coding edit

Suggestions for code are made

  • by an extension of explodeLines() for detection of empty lines.
  • by introduction of Line.cpp for storing trailing invisible lines.
  • by various modifications in the various printLines functions when presenting the diff engine result.

Recovering the hidden invisible lines and invisible trailing whitespace changes needs quite a lot of simple decisions but won’t impair performance. Hiding lines on the other hand makes comparison faster and requires less objects for invisible and empty lines. Outcome will pay off.

Improve visualization of context lines edit

Objectives: Currently any kind of line is displayed when two lines preceding or following shall give an impression of the unchanged context where a changed block is located.

  1. If one or both of these lines are empty they are put into HTML source but invisible and not informative.
  2. If these lines are very long (sometimes each 1000 bytes and more), both paragraphs are displayed anyway, making the output lengthy and hard to survey.

The suggested code has a slightly different behaviour:

  1. The most adjacent non-empty (visible) lines will be shown.
  2. Not two paragraphs but the next two virtual lines (each expected of at least 100 bytes if not full paragraph) will be displayed.
  • Another idea: context \n could be represented as <br />
  • Are line numbers still meaningful? They are common when comparing lines of source code, but the wiki author hasn’t a clue where line 57 may be located, and some lines have 3000 characters, others just 15 (scrollbar position doesn’t help).

Visualize space-only differences edit

Objective: The current standard diff function doesn’t show space-only differences. Readers find identical black text and may guess that the reason is a superfluous space character somewhere, or perhaps a period changed into a comma?

Example for visualized space difference:

old: The old  lady looks  confused.
new: The young girl looks confused.
The old lady looks▯▯confused. The young girl looksconfused.
  • Make space-only differences visible, including heading/trailing space.
  • Enable reader to count number of spacing characters.
  • Don’t confuse reader with space-▯ if there are visible changes: If one of the adjacent words is already red, space difference is negliable.
  • Show not only different number of spacing characters, but also varying types (currently only: ASCII Space U+0020 and HorTab U+0009, but there are many more spaces like U+2004-200A).

Coding edit

The suggested code enables wikidiff2 to make this visible. Performance is not remarkable influenced, since the diff algorithm is not touched. Only if any difference was already found, displaying the result is improved.

Changes:

  • Word is extended by two methods equals_suffix() and get_suffixlength()
  • WikidiffLX is extended by printWordDiffSideBlack()
    • printWordDiffSide() needs to be modified by lastBlack flag in case of copy

Open Issue edit

Open Issue: Leading whitespace is currently not present in worddiff[]. If the improvement above is basically adopted, there are two solutions to integrate this feature:

  1. Modify explodeWords() and don't skip heading break. This requires a Word with bodyStart=bodyEnd=0. That might confuse the operators and String(), Diff algorithm could be disturbed.
  2. Provide printWordDiffSide() with both text1 and text2. If worddiff[0].op==DiffOp::copy visualize heading spaces, if any and different, and continue with loop.

Visualize non-ASCII spaces edit

If visualization of space-only differences is adopted, the methodology might be extended to other types of space.

Objective: Other spaces shall be treated like ASCII space. This goes for both word-splitting and displaying of modification.

Affected unicodes:

2002;EN SPACE
2003;EM SPACE
2004;THREE-PER-EM SPACE
2005;FOUR-PER-EM SPACE
2006;SIX-PER-EM SPACE
2007;FIGURE SPACE
2008;PUNCTUATION SPACE
2009;THIN SPACE
200A;HAIR SPACE

Coding edit

Looking into the existing code, I found no point to place this feature into the procedure. Moreover, I was quite confused and got the impression that there has been a longer history of amendments. The resulting state seemed to be not very efficient. Therefore I decided to rewrite the entire explodeWords() business heading for a more clear and accelerated execution.

The code is faster now, since:

  • The loop over all characters is run just once.
  • Thai sequence is investigated only if there are really Thai characters in text; more than 99 % of edits won't contain them.
  • Even if there is a Thai string only that particular sequence is broken into words. (See below for Thai issues)
  • UTF-8 analysis is started if UTF-8 encoding really starts, not for every plain English ASCII letter.
  • "inline" functions of just one line integrated for the moment, nowhere else used. May be extracted later if conditions can be shared between method units.

I think the procedure is much more clearly arranged now, enabling further changes and reducing future efforts in extending.

Additional benefit:

  • CJK recognition is extended.
  • Non-ASCII spaces are used for word separation.

Visualize zero-width differences edit

Objective: There are modified words where the modification keeps invisible, since a zero-width character has been added or removed. The user encounters two red words without any visible difference.

Affected unicodes:

  • 00AD;SOFT HYPHEN   &shy;
  • 200B;ZERO WIDTH SPACE
  • 200C;ZERO WIDTH NON-JOINER   &zwnj;
  • 200D;ZERO WIDTH JOINER   &zwj;
  • 200E;LEFT-TO-RIGHT MARK   &lrm;
  • 200F;RIGHT-TO-LEFT MARK   &rlm;
  • 202A;LEFT-TO-RIGHT EMBEDDING
  • 202B;RIGHT-TO-LEFT EMBEDDING
  • 202C;POP DIRECTIONAL FORMATTING
  • 202D;LEFT-TO-RIGHT OVERRIDE
  • 202E;RIGHT-TO-LEFT OVERRIDE

Example (invisible characters shown as HTML entities):

old: Meaning&shy;less to &rlm;change&lrm; direction.
new: Meaningless to change direction.
Meaningless to change direction. Meaningless to change direction.

The users have no clue why these words are red, give’em one:

Meaning▯less to ▯change▯ direction. Meaningless to change direction.

Any red word is affected. Deleted and added lines are not considered.

Coding edit

Replace appropriate function calls for printText() by new function printTextRed() which displays any invisible character in text as replacement symbol.

Open Issue edit

The same story goes for invalid Unicodes U+007F-009F. They result mainly from Windows codepage (like CP-1250/1252) and won’t be displayed at all by many browsers, best case by a replacement character. Can be captured easily above, since everything > U+007E is bad UTF and UTF starts at U+00C0.


Thai edit

The wikidiff2 implementation invokes Thai handling always and for any diff result being displayed. This was excluded from explodeWords() when re-writing, calling a separate function only if a Thai character is really detected. More than 99 % of edits won’t contain any. The wikidiff2 had to run twice over all characters, while there is now one single loop over all characters. Therefore the code is faster now.

Coding edit

explodeWordsThai() examines a substring consisting of Thai characters only. Separation is added to the Word segmentation.


CJK extension edit

explodeWords() may be improved for detecting CJK punctuation like U+3000 and U+3002 in the same manner Thai characters are handled.