User:Daniel Kinzler (WMDE)/MCR-PageUpdater/Notes-2018-04

This page is for discussing the architecture of PageUpdater and related classes as implemented in https://gerrit.wikimedia.org/r/c/405015/.

Anomie's Proposal edit

Anomie made the following proposal in response to a comment by Gerö, which I have slightly adapted to allow easier discussion on this page:

There are several pieces of code that need to exist:

  • (CAN) Validate the ability of a user to edit a page.
  • (EXP) Take minimally-processed user input and turn it into some "page update" data structure. IMO this should include expanding section edits and detecting and resolving edit conflicts.
  • (PUP) Directly create a "page update" data structure from code, which might skip some of the features of the previous bullet.
  • (VAL) Validate a "page update" data structure for things like allowed slots, correct content models for each slot, and so on.
  • (RVR) Turn a "page update" data structure into an unsaved RevisionRecord, which at least probably includes PST.
  • (SVE) Save a RevisionRecord. Or not, if it turns out to be a null edit.
  • (POT) Produce a ParserOutput and possibly other data from a RevisionRecord (saved or unsaved) or "page update" (although that could just do (RVR) then proceed), plus a ParserOptions. This includes whatever merging of per-slot ParserOutputs might be required, and optionally checking/updating ParserCache.
  • (DUP) Schedule update jobs from the data produced in (POT).

(POT) might include flags to indicate whether we want HTML, data for (DUP), or both. Current MediaWiki only has the flag for HTML, and ParserCache only stores POs with both.

The use cases include:

  • ApiStashEdit: Does (EXP), (VAL), (RVR), (POT), and then saves data into memcached.
  • EditPage/ApiEditPage: Does (CAN), (EXP), and (VAL). Then gets the data for (RVR) and (POT) either directly or from ApiStashEdit's stash, and does (SVE) and (DUP).
  • Various backend code: Does (PUP), probably (VAL), (RVR), (SVE), (POT), and (DUP). This includes anything making an internal call to ApiEditPage.
  • Dummy edits: Somehow or other saves a RevisionRecord, then does (POT) and (DUP).
  • Purge: Loads the existing RevisionRecord and does (POT) and (DUP).
  • Page view: Loads the existing RevisionRecord and does (POT).
  • Page preview: Does (EXP), probably (VAL), (RVR), and (POT). The ParserOptions it uses for (POT) needs to use setupFakeRevision() with the revision from (RVR).
  • ApiParse: Depending on the options, either loads an existing RevisionRecord or does more or less (EXP), probably (VAL), and (RVR). Then does (POT), possibly with setupFakeRevision().

In the current patch, PageUpdater handles (EXP) partially, (PUP), (RVR) partially, and (SVE). PageMetaDataUpdater handles (VAL) more-or-less, (RVR) partially, (POT) for canonical parser options only, and (DUP). RevisionSlotsUpdate from the parent patch handles another bit of (EXP). Ie772a4a adds RevisionRenderer which handles (POT) more generally. Nothing handles (CAN), and (EXP) is incomplete even after combining all the parts.

If I understand Gergő's proposal, he's having a PageUpdate for (PUP) plus either (POT) or just the data produced by (POT); putting (VAL), (RVR), and (SVE) except for null-edit determination in "PageUpdater"; and (POT) maybe and (DUP) in "PageSecondaryDataUpdater". (CAN), (EXP), and null-edit determination would presumably be in another abstraction layer above this.

After writing all this out, here's another proposal:

  • Have PageUpdate for collecting and holding the data for (EXP) and (PUP). Like Gergő suggested but without the (POT)-related bits.
  • Have some "Rendering" object, but unlike Ie772a4a do not have ParserOutput implement that. Probably it needs only a few methods: getRevision(), getParserOptions(), getParserOutput(), and getParserOutputForSlot( $role ). The "ForSlot" method could be lazy.
  • Have a RevisionRenderer service that does (POT), but unlike Ie772a4a it probably needs only 1 or 2 public methods: renderRevision( $revision, $parserOptions, $generateHtml ) and maybe renderRevisionForPreview( $revision, $parserOptions ).
  • Have a PageUpdater service with several methods
    • PageUpdate resolveUpdate( $pageUpdate ) // the section and edit conflict parts of (EXP)
    • Status checkPermissions( $pageUpdate ) // (CAN), has all the permission and block checks from EditPage
    • Status validateUpdate( $pageUpdate ) // (VAL)
    • RevisionRecord makeRevisionForUpdate( $pageUpdate ) // (RVR)
    • Status saveRevisionIfChanged( $revisionRecord ) // (SVE)
    • Status scheduleSecondaryUpdates( $rendering ) // (DUP)
    • (maybe) Rendering renderRevision( $revisionRecord ) // call-through to RevisionRenderer with canonical $parserOptions
    • Status stashRendering( $pageUpdate, $renering ) // ApiStashEdit
    • Rendering|null unstashRendering( $pageUpdate ) // ApiStashEdit
    • Status makeEdit( $pageUpdate ) // does (VAL), (RVR)+(SVE)+(POT) or unstash+(SVE), and (DUP)

Daniel's Response edit

> EditPage/ApiEditPage: Does (CAN), (EXP), and (VAL). Then gets the data for (RVR) and (POT) either directly or from ApiStashEdit's stash, and does (SVE) and (DUP).

(RVR) and (POT) and (SVE) and (DUP) are all done in WikiPage, and are also needed by other use cases.

  • This seems to be a non sequitur, or completely misunderstanding what I was saying here. The things I listed are what EditPage and ApiEditPage need to do in order to do what they do. Nothing about it has anything to do with what's currently done in WikiPage or what other cases need. Anomie (talk) 22:41, 11 April 2018 (UTC)

> In the current patch, PageUpdater handles (EXP) partially, (PUP), (RVR) partially, and (SVE).

What bit of (EXP) does this handle? And what part of (PUP) and (RVR) does it not handle?

  • EXP: Being some sort of "page update" data structure, but the input has to be heavily processed by the caller.

    I didn't say "partially" on PUP.

    RVR misses the PST. Anomie (talk) 22:41, 11 April 2018 (UTC)

> PageMetaDataUpdater handles (VAL) more-or-less, (RVR) partially, (POT) for canonical parser options only, and (DUP).

> RevisionSlotsUpdate from the parent patch handles another bit of (EXP). Ie772a4a adds RevisionRenderer which handles (POT) more generally.

What bit of (EXP) does this handle?

  • EXP: Only collecting the slots that are changing (being some sort of "page update" data structure). Anomie (talk) 22:41, 11 April 2018 (UTC)

> Nothing handles (CAN), and (EXP) is incomplete even after combining all the parts.

That is by design - in my opinion, that logic belongs in the controller/interactor layer that performs an "edit", as opposed to creating a revision.

  • And I think your intended design has too many layers: you want a controller/interactor to get the user input, and then a layer that resolves edit conflicts and expands sections, and then your RevisionSlotsUpdate that collects the slots, then your PageUpdater that collects some other data about the edit, then your PageMetaDataUpdater that turns some of that data into ParserOutputs, then back to your PageUpdater to finally make the revision. I told you that code from EditPage should be pulled into this patch near the beginning, but you disagreed and here we are now. Anomie (talk) 22:41, 11 April 2018 (UTC)

> If I understand Gergő's proposal, he's having a PageUpdate for (PUP) plus either (POT) or just the data produced by (POT);

> putting (VAL), (RVR), and (SVE) except for null-edit determination in "PageUpdater";

I went down that path a couple of times in code experiments, trying to introduce a PageStore. It turned out to require a much more extensive refactoring than I thought feasible. A mutable, builder-like PageUpdate may make it easier, but it still means that pretty much all the code needs to be re-written, instead of being moved around.

  • Tying yourself to existing poor structure to avoid changing too much means you just wind up duplicating poor structure with new code. Anomie (talk) 22:41, 11 April 2018 (UTC)

> and (POT) maybe and (DUP) in "PageSecondaryDataUpdater".

Or "SecondaryPageDataUpdater"? I kind of like that name.

> (CAN), (EXP), and null-edit determination would presumably be in another abstraction layer above this.

I agree that this makes sense, since it properly separates the idea of "making an edit" from "creating a revision".

  • We already have MutableRevisionRecord for creating a revision and RevisionStore for storing it. We don't need another layer on top to create a revision and store it with a slightly different abstraction. Anomie (talk) 22:41, 11 April 2018 (UTC)

> Have PageUpdate for collecting and holding the data for (EXP) and (PUP). Like Gergő suggested but without the (POT)-related bits.

Do I understand correctly that PageUpdate would not have application logic and would not know about services, but would only have setters which are used only by the PageUpdater? That's possible, but I'm not sure I see the point.

I'm generally in favor of the value/service split, but with a mutable/stateful value, that seems more confusing than useful. Consider:

$update = $service->newUpdate();
$service->doX( $update );
$service->doY( $update );

Vs.

$update = $service->newUpdate();
$update->doX();
$update->doY();

I think the latter is much clearer. I see no advantage in the former.

  • Neither of these matches either of our proposals. Mine is more like
    $update = $service->newUpdate(); // Or maybe just "new PageUpdate", we may not actually need a factory here.
    $update->setX( $value );
    $update->setY( $value );
    $update = $service->resolveValues( $update );
    $service->doEdit( $update );
    </source> while yours is more like <source lang="php">
    $revisionSlotUpdates = $something->newRevisionSlotUpdates();
    $revisionSlotUpdates->addSlot( $someOtherService->resolveValue( $value ) );
    $statefulService = $something->newStatefulService();
    $statefulService->setX( $revisionSlotUpdates );
    $statefulService->setY( $value );
    $statefulService->doEdit();
    
    (I think). Anomie (talk) 22:41, 11 April 2018 (UTC)

> Have some "Rendering" object, but unlike Ie772a4a do not have ParserOutput implement that. Probably it needs only a few methods: getRevision(), getParserOptions(), getParserOutput(), and getParserOutputForSlot( $role ).

That's the SlotRenderingProvider in my patch.

  • No it's not. Your SlotRenderingProvider is just a wrapper around Content::getParserOutput() with some in-process caching and a different typehint for the return value, and returns per-slot ParserOutput objects. My Rendering is all the data needed for the DUP step.

    In terms of code you wrote, my RevisionRenderer is like your PageMetaDataUpdater->prepareUpdate() plus ->getCanonicalParserOutput(), and my Rendering is the resulting state in your PageMetaDataUpdater. Then my PageUpdater->scheduleSecondaryUpdates() is your PageMetaDataUpdater->doUpdates() (and mine probably should probably actually be named "doSecondaryUpdates()", whatever).

    Anomie (talk) 22:41, 11 April 2018 (UTC)

> Have a RevisionRenderer service that does (POT), but unlike Ie772a4a it probably needs only 1 or 2 public methods: renderRevision( $revision, $parserOptions, $generateHtml ) and maybe renderRevisionForPreview( $revision, $parserOptions ).

That was my original intention, but I found no way to make that work without making things more obscure.

  • You almost did it in your patch, but you overcomplicated it with factories for ParserOptions and your too-stateful SlotRenderingProvider abstraction. Anomie (talk) 22:41, 11 April 2018 (UTC)

> Have a PageUpdater service with several methods

  • PageUpdate resolveUpdate( $pageUpdate ) // the section and edit conflict parts of (EXP)
    • It seems to me like the input to this method can't be a PageUpdate. You'd need something to represent a partial PageUpdate for just one section of one slot, etc.
      • Why would you think that? The input $pageUpdate holds the section number, the base revision ID, and the section-level Content for each slot. The returned PageUpdate has no section number anymore, possibly a different "base" revision ID, and full Content for each slot. How exactly section edits might work in an MCR world is a separate issue, IMO at least initially doing a section edit should probably limit you to editing only the one slot that contained the section. Anomie (talk) 22:41, 11 April 2018 (UTC)
  • Status checkPermissions( $pageUpdate ) // (CAN), has all the permission and block checks from EditPage
    • User permission checks don't belong into the storage layer IMHO. The abstraction for "perform an edit" should be separate from the abstraction for "create a revision".
      • This isn't the "storage" layer, RevisionStore is the storage layer. This is the service layer for making an edit. And I already responded to your "create a revision" comment above. Anomie (talk) 22:41, 11 April 2018 (UTC)
  • Status validateUpdate( $pageUpdate ) // (VAL)
    • "validate" could mean anything... anyway, why should this be public?
      • In this case, it means "make sure this PageUpdate has all the data necessary to be saveable". It may not need to be public if we have wrappers like "makeEdit" for the use cases that need the functionality. It's distinct from RVR because of the VAL→unstash→SVE,DUP use path, although it might turn out we can include it in RVR after all. Anomie (talk) 22:41, 11 April 2018 (UTC)
  • RevisionRecord makeRevisionForUpdate( $pageUpdate ) // (RVR)
    • Why should this be public?
      • It may not need to be public, as above. It's a distinct step because we need the RVR's output revision as input to both SVE and POT. Anomie (talk) 22:41, 11 April 2018 (UTC)
  • Status saveRevisionIfChanged( $revisionRecord ) // (SVE)
    • If this does not involve (DUP), it should probably be in RevisionStore. Perhaps PageUpdate should just have a isChange() method.
      • Except RevisionStore doesn't know about the page table, to create the page if necessary and to update page_latest.
        The "IfChanged" part is just
        $currentRevision = $revisionStore->getRevisionByPageId( ... );
        if ( !$revisionRecord->hasSameContent( $currentRevision ) ) {
            $revisionStore->insertRevisionOn( ... );
        }
        
        It'd be somewhat difficult for PageUpdate to know whether isChange() is true, except when it was returned from resolveUpdate(). Anomie (talk) 22:41, 11 April 2018 (UTC)
  • Status scheduleSecondaryUpdates( $rendering ) // (DUP)
    • Since this is needed in several places where no revision is created, this should be in a separate class (or at least, in a separate interface).
      • Could be. But it seems pointless considering how closely related the operations are. Anomie (talk) 22:41, 11 April 2018 (UTC)
  • (maybe) Rendering renderRevision( $revisionRecord ) // call-through to RevisionRenderer with canonical $parserOptions
    • Should exist, but in a separate service.
      • RevisionRenderer is the separate service. This is a potential convenience method. Anomie (talk) 22:41, 11 April 2018 (UTC)
  • Status stashRendering( $pageUpdate, $renering ) // ApiStashEdit
  • Rendering|null unstashRendering( $pageUpdate ) // ApiStashEdit
    • An EditStash (perhaps also a local EditCache?) should be a separate service.
      • Could be. But it seems somewhat pointless considering how closely related the operations are. In particular "unstashRendering()" may not need to be a public at all and we might have the actual public method be "stashEdit( $pageUpdate )" which does RVR+POT+stashRendering. Anomie (talk) 22:41, 11 April 2018 (UTC)
  • Status makeEdit( $pageUpdate ) // does (VAL), (RVR)+(SVE)+(POT) or unstash+(SVE), and (DUP)
    • Ideally, this would be the only public method in the Page
      • What's "the Page"? And it can't be the only public method because of all the other use cases I listed at the top of the post that do different subsets of the operations. Anomie (talk) 22:41, 11 April 2018 (UTC)

I will try to respond here before the session on Wednesday -- Daniel Kinzler (WMDE) (talk) 17:47, 23 April 2018 (UTC)

PageUpdater responsibilities, 2018-04-23 edit

Per a conversation between tgr and me at wmcon18, below is a list of the responsibilities defined above, and how they are intended to be covered by the code as currently proposed in If610c68f49:

  • CAN: EditPage, should be factored out into an EditController.
  • EXP: EditPage, should be factored out into an EditController.
  • PUP: WikiPage, should be factored out into a factory service (could be PageStore, or separate from that).
  • VAL: this presently means calling prepareSave, and in the future should also include validating required/allowed slots. PageUpdater does this in the current proposal, but leaving this entirely to a future EditController be better, see phab:T192777. The idea is that it's ok for imported and undeleted revisions to be in violation of some constraint that is enforced for regular edits.
  • RVR: DerivedPageDataUpdater (PST, etc). PST could perhaps be part of EXP, instead of being done down here. PageUpdater and DerivedPageDataUpdater would then enver see pre-PST content. This means detection of null edits could also be pulled up to the EXP phase. This would be a breaking change to the PageUpdater interface, so needs deciding soon.
  • SVE: RevisionStore and WikiPage. WikiPage::insertOn should be factored out into a PageStore in the future, or moved into PageUpdater.
  • POT: DerivedPageDataUpdater, to be factored out into a RevisionRenderer and/or SlotRenderingProvider.
  • DUP: DerivedPageDataUpdater. Notifying the various subsystems about page creation/update/deletion should use a listener system that could perhaps replace the hook system in the future.

Further considerations:

The life cycle of PageUpdater follows the need to be able to grab a parent revision, apply an edit to that parent, and then save the result, while avoiding race conditions. PageUpdater acts as a kind of handle for this kind of operation, similar to how IDatabase objects act as a transaction context.

The life cycle of DerivedPageDataUpdater is dictated by how callbacks currently use WikiPage::prepareContentForEdit. WikiPage::prepareContentForEdit provides a kind of manual memoization. WikiPage::getDerivedPageDataUpdater follows the same approach, while DerivedPageDataUpdater uses internal caching. That caching could in the future move into a SlotRenderingProvider.

With respect to PageUpdater, we compared two patterns that could be used with a stateful update(er) object:

The current proposal:

 $updater = $repository->newUpdater( $page )
 $updater->setBaseRevisionId( $baseRevId );
 if ( $updater->hasEditConflict() ) { ... }
 $updater->setContent( 'main', $content );
 $updater->createRevision( $user, $comment );

...which requires the updater to contain business logic, or at least delegate to services that have the business logic, versus a version that requires no application logic to be present in (or called from inside) the update object:

 $update = $repository->newUpdater( $page )
 $repository->setBaseRevisionId( $update, $baseRevId );
 if ( $repository->hasEditConflicts( $update ) ) { ... }
 $repository->setContent( $update, 'main', $content );
 $repository->createRevision( $update, $user, $comment );

This seems more awkward to the caller, which has to pass the same "accumulator" type object into every method call. It also requires the $update object to expose a (secondary) interface to the repository service that allows for the relevant state changes to be made (so, the update object still needs setBaseRevisionId() and setContent(), and it's unclear whether it would be ok to call them directly). The repository has to know a lot about the state inside the update object, and the update object is not usable without the repo service.

Overall, the second approach seems to offer no clear advantage. The first approach seems OK as long as the actual business logic gets factored out of the updater class eventually. It is acknowledged that it's generally a bad idea to manage complex internal state and have complex interactions with services and external state in the same class.

A "clean" functional approach with no mutable state was ruled out as impractical in the current context. It seems particularly pointless for the task of updating a database entry, which by definition is a side effect. — Preceding unsigned comment added by Daniel Kinzler (WMDE) (talkcontribs) 17:29, 23 April 2018 (UTC)

So your proposal is to have five different services (EditController, some random factory for "PUP", PageUpdater, DerivedPageDataUpdater, PageStore) to be used for something to make an edit? And you also propose replacing the hooks system with another different hooks system that you refer to as "listeners" instead? And as for your "second approach", that reads as a textbook w:Straw man. Anomie (talk) 13:17, 24 April 2018 (UTC)
I think the current functionality should be split into at least five classes - but calling code only ever need two (ideally just one) of them to do one thing. -- Daniel Kinzler (WMDE) (talk) 16:37, 25 April 2018 (UTC)