Code review management/Aug 2011 training

Wednesday, 3 August at Wikimania in Haifa, Israel.

Audio recording: 1 hour 55 min, 75MB

Basically based on http://mediawiki.org/wiki/Code_review_management/July_2011_training

CodeReview extension in MediaWiki shown off:

Anyone with commit access should have a MW.org account that is linked to the Subversion commit id. In order to help with code review you also need to be in the 'coder' user group. This user group is viral, so just ask anyone from there to add your MW.org account to the 'coder' user group and then go to Special:Code/MediaWiki/author/YOUR_NAME and link your mw.org account to the svn account. /oradmin/last/bdump/backup_last.sh.log Code review comments are emailed to committer and previous commenters. Since there is no watch function, one may need to make a useless comment. There is an open feature request bug for that. Someone write a patch!

Note: In order to get these email updates about comments, status changes and follow-ups you need 1) A MW.org acount linked to the SVN account, 2) E-mail notifiations enabled in your preferences. For more things you need to do as a commiter, check out: Commit access requests#Getting started and set up

You cannot and should not mark your own revisions as 'ok' or 'resolved'. If someone points out an issue with your code and you fix it in a later revision, you should leave a comment about it, and set status back from 'fixme' to 'new'.

Q: If I think I fixed something from an earlier commit, what do I set the old revision's status to? If you think all issues addressed, NEW. or leave it FIXME.

We are not pedantic about that distinction. Helps ensure people check for followups.

Most people in this room do not have commit access. So, to submit a patch, on Bugzilla, you attach a patch to a bug. Add the keywords "patch" and "needs-review", to make it clear that this bug contains a patch that needs review.

We're trying to get more aggressive about reviewing new patches quickly

If you have a recent patch that needs review, shout in IRC or on mailing list. Bump the bug with a comment - that helps if a few people are cc'd on it.

You can find hexmode (Mark Hershberger) on IRC, fulltime bugmeister, whose role it is to find patches, get them reviewed, and get attention to fixes.

Q: Is there an effort to unify code review for Bugzilla & for CodeReview extension?

A: Some, yes. Splinter extension for Bugzilla, for example. Kind of like Github inline comment.

"more like convergence than integration"

Branching

Quick discussion of branching.

Developing code is trunk.

Branches are where we prepare the releases.

We work in a branch (like "REL1_18") directory. The directory is created at some point in time by copying what is in trunk to that new directory. After that time ("the branchpoint"), testing starts for the release. When bugs are discovered, they are fixed in trunk and merged ("re-commit to a different directory") to the branch. New things and features only go in trunk and never to a branch. To track backports-to-do, we have a codereview tag: "1.18". Simiarly, for deployment, we have a branch.... has "wmf" in the name.

"MFT" means "merge from trunk"

for the most part, we do active dev on trunk; if we need to merge fixes for depoyment or add a specific feature such as support for an extension, we backport it

TODO: need to clean up, but good enough for now

explanation of "scaptrap" - may cause deployment issues, so watch out.

Brion opines: Would be helpful if we could mark the patch submitter in a machine-readable way so that updates could be sent to them as well!

Followups:

A revision is either approved or not, and what if you don't feel competent to say YES THIS IS OK? Then you can do a "signoff" -- "I have inspected this revision." Does not change status/state of revision, but there's a mark in a log at the bottom of the codreview saying you inspected it. This can be a help to reviewers, especially where you have areas of the code that require specific expertise.

Testing integration -- automatic testing gets run, though not yet fully integrated into the CR flow. Feel free to run them yourself too!

WIFI HAS RETURRRRRRRNED

After the break

Brion on code review in general like https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Code_review_management/July_2011_training#Brion_and_his_TARDIS

Historically we did not have this kind of infrastructure Tim or Brion would look at a patch, decide to push updates out. as we've grown - a lot more committers, in core & extensions.

CodeReview extension shares this workflow.

What do we look for?

Does this commit do something we want?
- bigger concern for patch review (Bugzilla) - submitters might not be as tied in ... might make a change to something in MediaWiki in a way we don't want. Or we don't want it as is. Great opportunity to tell a submitter that you (who, the submitter?) should do something different.

Does it work?
- well?
- efficiently?
- securely?
- consistently?

MediaWiki is used on Wikipedia - sometimes performance considerations others would not think of

Example: random commit -- "FIXME" tagged. r93635 I think?

Supposed to fix a bug. Brion looks at bug report.

more people have looked at it, so that adds confidence that it is good, does what it says it will
but there is a comment: regression!
so next action: either revert it now, if it is a severe problem, or leave it with a FIXME to see if author fixes it.

We do not have explicit assignment for code review, except for tag queues, but in general a code review status change ... whoever set the resolution is implicitly responsible to eventually say YES or NO. It's polite to work with whoever first started working on the code review.

We are trying to be more consistent & systematic about Bugzilla default assignees, for example.

Maintainability - a big issue for a project like this. When we come back in 5 years, will we know what's going on with this code? Look out for dense, confusing structures. Variables should have reasonable names. Comments?

Coding conventions

We will sometimes comment that someone should change the order of parameters. And instead of boolean flag, maybe use a named constant. Visibiity constants for example. Or opening database - debug connection or not? Instead of six TRUEs in a row.

& testing is one of the ultimate maintainability helps. If you change code & hope it works, you're doing something wrong. If you are working in response to a feature request or a bug, you have an opportunity to ...... there's a defined condition to write a test. And in the future, then, we'll always check that this works. This becomes very important with big refactorings, e.g. tablesorting code.

Brion starts talking about test-driven development, Sumana cuts him off, quotes from https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Code_review_management/July_2011_training#Brion_and_his_TARDIS re PHPDocumenter/Doxygen, visibility modifiers, testing whether you break extensions.

Sometimes we break backwards compatibiity. We'll try to keep old interfaces around until they are very broken. We sometimes mark things as deprecated.

Tim on security

notes already exist at

Why should you care about security?

"What's the worst that can happen?" Go to #mediawiki and type !worstcase

A few kinds of vulnerabilities to watch out for: think about security when you are writing code

read http://owasp.org -- good overall online guide for web app security

a good guide to application security

A few basics exist on mediawiki.org Security for developers We recommend everyone who is writing code for MediaWiki read this page and its subpages

Most common vulnerability for a web app: XSS, or cross-site scripting. Arbitrary HTML being put into your output.

XSS

How to stop it? Escape anything that goes into output, but only when it goes into output.

JMOL extension example: jmol.body.php

When we are reviewing code for security, we look for places where HTML is generated. The string! $someVariable comes from where? search for it. And if it's coming from user input, the user could end the whole element and do something else with the HTML output and take over the browser, the user's account, the world.

So how to make this safe? You want to see some escaping on the line where the output is generated, not far away. Recommendation is to escape close to or at output, not far before.

If HTML is generated raw, then one should either make sure the output can't be arbitrary (ie. rand() or intval() always return something safe), or if it is variable it must be escaped with a function like "htmlspecialchars".

Just adding htmlspecialchars() fixes it. Or, if you want it to happen automatically, use HTML class methods. For example:

// User input
$variable = ' "> <script>alert("hello evil world")</script><div title=" ';

// Unsafe way
echo '<div foo="bar" title="' . $variable . '">some content</div>';
// Outputs:
// <div foo="bar" title=""><script>alert("hello evil world");</script><div title=" ">some content</div>

// Safe way
Html::element( 'div', array(
        'foo' => 'bar',
        'title' => $variable // Html::element automatically sanitizes this part so that it can't do funky stuff.
) );

If you use these functions from the Html class, in html.php in core, all escaping is done for you. If you have JavaScript or URL embedded in element, you need to do extra work to escape things, spaces, etc.

Wil generate valid HTML, if not valid CSS. If you want the latter, wrap it in ... convert to an integer

The kind of thing we're looking for is suspicious construction of HTML. In cases where it is not obvious by looking at code, that code is secure.

CSRF: Cross-site request forgery

Happens when you have a write action form that can be submitted off-site
- Form can be hidden... your cookies will still be used!

MediaWiki is well hardened against this but.... probably 90% of all web apps are vuln to this. For example, Facebook.

You can put user scripts inside MediaWiki. Anything checking that those are okay?

if user is entering in the script, then there is no problem really.
Our CSRF protections on edit page mean that you can't edit a page as another user
if you can force another user to log in as you then you will run the user scripts

Register globals: register_globals -- deprecated for a long time. This is the reason most PHP files in MediaWiki, espec extensions, start with a protection against a script which you don't intend as an entry point run as an entry point.

SQL Injections Pretty straightforward. Almost the same as HTML. Instead of injeecting scripts into a webpage, SQL expressions into a SQL query. Instead of a table returning a list of usernames, it could return passwords instead! We have pretty good protection if you use our facilities in MediaWiki

Performance! see https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Code_review_management/July_2011_training#Tim.27s_security_and_performance_talk MySQL: People often write queries that are not indexed and/or scan a lot of rows. You need to understand what MySQL is doing when it executes my query Know how a database works & how it does what it does. Then you'll understand performance issues. Index, for example.

phonebook example make sure info is in an appropriate index

Where do we get info about this? On basic MySQL performance optimization?

MySQL documentation

Where do we get statistics on indexes & tables? Table status? You want to enter the backlog.... ? queries - find the query you added .... MYSQL DESCRIBE query tells you which indexes affected by a specific query...

How the query is executed depends on size of tables & indices

Tim: get a feeling for this by getting a toolserver account. You'll have access to almost all the tables and data present on Wikimedia sites, and be able to query it in various ways, time the queries to know how long they take to execute. Since it's the Toolserver, it's harder to make the whole site crash!

If you wanna see what indices are available look at raw SQL file that sets up tables in first place. Not so bad if you know SQL. Example: look at bottom of User table, see what it's cheap to look users up by.

if you want to know what users have exacty 50 edits.

Some indices are over multiple fields. Page.... One of the worst things you can do - try to find out how many pages someone started

There is documentation on the database schema. As for how dbs works, binary trees, etc. no titles off the top of Tim's head. As for optimization of Mediawiki specific DB queries, we have some pages on mediawiki.org and in the docs directory of the core a thing called database.txt a few notes by Tim re databases & query optimization, lock contention, lag

<OrenBo> how well is the database documented ? <Nikerabbit> *the* database? <Nikerabbit> there is some docs on the mediawiki database schema <brion> in short: probably not as well as it should be ;) <brion> there are comments in the schema file -- maintenance/tables.sql <brion> and also docs of varying quality on mediawiki.org <brion> <-

Brandon on UI review

https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Code_review_management/July_2011_training#Brandon_on_UI_review

New thing we're gonna work on for Foundation projects, things we may end up deploying

We've had a few issues where people check stuff into core, or extensions, and there are bad changes. May not seem bad at first, but are in long term.

Example: change to search box -- focus-follows-mouse. Deployed to translatewiki.

Brandon is working on a more rigid process for doing design reviews. We wanna talk about stuff we're doing at Foundation ahead of time.

Types of things we want to review & apply to do design review on: major inteface changes, or changes everyone will see rule of thumb: the fewer people will see the change, the less important it is to review.

Pendingchanges vs main flaggedrevs.

Searchbox - everyone sees that.

A gadget only admins see - less important.

If your change will change the background color, Brandon wants to know about that. We have been trying to figure out best tag -- "design" or "jorm" works.

Best way to test may be to make a screenshot

Most important changes, ordered by who they will be visible to:

All readers (e.g a change that goes into an extension we have deployed in all wikis)
Readers of major wikis (en & de)
Editors (anywhere)
And so on.

often all Brandon does is look at the screenshot provided

Fewer preferences we have, the better off we are! The more "configurable" a site is, the worse it is for users.

Q: Is there a group of UI experts to ask for advice?

A: Brandon and Trevor.

Brandon is writing a MediaWiki styleguide, in order to pour the UI team's expertise into one place.

"here's why you don't want to use red unless it's an error"

Also, Timo & Kaldari are working on taking this [style guide] & making it part of core. So is Andrew Garrett.

HTMLForm class will then do it for you.

On-stage live code review

Florian H. has Tim look at the recent check-in of the CollabWatchlist extension. Commits: Special:Code/MediaWiki/?path=trunk/extensions/CollabWatchlist Description: A collaborative watchlist. A group of people reviewing changes to a set of pages would like to have a way to flag revisions - just for their group.

Is this something we want?

It's an extension. so users installing this extension want it. Easy answer

Data storage

Look at tables, SQL queries. -- how does it store data?
Common problem: No indexes. That's not very good.
Code structure:
- All core tables in 1 sql file, and later modifications in patches for upgraders. Not every table in a separate file.
  - Did you have a reason to split it up? No. So, easier to have it in one place (both for review and for the installer)

Output

this is a copy of the watchlist code? with features relevant to that extension
might be worthwhile to try to refactor some of this to avoid code duplication
difficult to review code that's duplicate of another code. Like "spot the difference" game.

Idea: extend the watchlist class in core...

User interface for editing collaborative watchlist: Editor page

checking... a lot of functions in common... good if those could be inherited and not duplicated

This checks the right boxes for security, uses XML functions, XML::label & input? which escape stuff automatically.
- May want to use HTMLForm class instead of seperate input/labels together.
  - Tim: that's more of a general approach thing, too minor to do afterwards as a "fix".
    - Tim: If you have 1000 lines of code, want to make it work, often easier to add 3 lines than to refactor to use HTMLForm.
Looking to find CSRF protection -- none added.
- Add a hidden input that contains a random token. Straightforward.

Gnome/unspoken comments

Minor: should be
Grouping "global" statements on top of the function.
(crowd review!)

Waldir:

Some editors on Wikipedia frown upon minor changes. In MW development, if a commit is made that e.g. just cleans up some code, what's the position of MW developers on that? This kind of changes could be good for people who want to start with small things.

Brion:

well, sometimes cleanup breaks other things. Obvious idea: visibility markers on class methods. Old way: no way to declare certain things. Everything public by default. Marked in code comments that generated code documentation. PHP5 switch: started explicitly using markers.... sometimes old ones are misleadingly labelled.

Tim: Fix a bug or write an extension to start, then you will know more about how things ought to be done, then you can clean things up more expertly.

Andrew doing a 20-min workshop on how to write a MediaWiki extension... tommmmorrrrow at 2 pm.

Brandon attests to the value of Andrew's mentorship; Andrew taught Brandon how to write an extension.

Waldir follows up: what about clarifying comments as a first patch? Roan says: good idea! can't break anything.