Talk:Requests for comment/Clean up URLs

Wary; some notes

I'm naturally wary of such changes. :) A few notes:

  • There's only "no conflict" between the "robots.txt" file and the "Robots.txt" article due to the historical accident that we by default force the first letter of a page name to be capitalized. Please do not rely on this being true in the future; we may well "fix" that one day, and we'd possibly want to rename those files.
  • If we do some sort of massive URL rearrangement, it could break third-party users of our HTML output (including parsed-by-the-API HTML output). For instance I know this would break handling of article-to-article links in the current Wikipedia mobile apps (they would no longer recognize the URLs as article pages, and would probably load them in an external browser instead). This would at the least require some careful planning and coordination.
  • If we're making a rearrangement of URLs, we'll probably have a fun ..... shift... in search engine rankings etc. It might be disruptive.
  • Regarding the index.php .... the primary problem with simply changing everything over to /Article_URL?with=a&query=string is that our robots.txt would no longer be able to exclude those links from spidering. Using a separate prefix means we can very easily chop off all our standard-generated links-with-querystrings that need to be dynamically generated, and make sure that spiders don't squash our servers into dust.
    • Using action paths (e.g. /edit/Article_name) would provide a nice readable URL without damaging that. However, this existing support doesn't cover cases like old revisions ('?oldid=123'), which default to $wgScript; see the configuration sketch below.

-- brion (talk) 18:24, 16 September 2013 (UTC)Reply
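For readers unfamiliar with the settings brion mentions, here is a minimal LocalSettings.php sketch of the pieces involved; the values are illustrative only, and action paths additionally need matching web-server rewrite rules. $wgArticlePath and $wgScriptPath give the current /wiki/ vs. /w/ split, and $wgActionPaths is the existing, optional action-path support.

<?php
// Illustrative LocalSettings.php fragment; values are examples, not a proposal.

$wgScriptPath  = '/w';          // entry points: /w/index.php, /w/api.php, ...
$wgArticlePath = '/wiki/$1';    // pretty page views: /wiki/Main_Page

// Optional action paths: readable URLs for actions such as edit or history.
// Old revisions (?oldid=123) still fall back to $wgScript, which is the gap
// brion points out above.
$wgActionPaths['edit']    = '/edit/$1';
$wgActionPaths['history'] = '/history/$1';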

I personally think that forcing the two articles about robots.txt and favicon.ico to be capitalized is an acceptable trade-off. This does not prevent us from using lower-case titles in general (which we already support).
I agree that we'd have to coordinate with third-party users. Some of the users of the PHP parser's HTML are also preparing to use Parsoid output, which uses relative URLs everywhere. One of these users is Google. Since we are in contact with the Google folks and can contact other search engines too we can probably avoid issues with ranking changes.
Re robots.txt: At least Google, MSN, Slurp (Yahoo) and Yandex support globbing. I have been using this with success for many years, and sites like Quora do the same. -- Gabriel Wicke (GWicke) (talk) 19:08, 16 September 2013 (UTC)Reply
Keep in mind that there may be other conflicts in the future besides just robots.txt, favicon.ico, and internal stuff. For example, RFC 5785 defines a namespace for new "Well-Known" URIs and it has already been picked up by some standards. We never know what kind of new standard we might want to implement in the future. And if we ever implement one which simultaneously becomes notable enough for Wikipedia to write an article about (and add a redirect for), then there would be a conflict. And unlike robots.txt that would be a real conflict, because the well-known prefix happens to be /.well-known/??? and "." is the same in upper- and lower-case strings. Not to mention that even if we didn't implement any of those standards, there would still be an undesirable conflict where implementations of the standard try to parse Wikipedia articles, because they happen to sit at the exact URL the implementations expect and happen to return a 200 OK code. Daniel Friesen (Dantman) (talk) 08:00, 17 September 2013 (UTC)Reply
If RFC 5785 gets widely adopted, then that would basically avoid future name clashes. Only /.well-known/, including the trailing slash, would conflict, and an article about the standard can well be called /.well-known. Let's hope it gets adopted for future sitemaps etc. -- Gabriel Wicke (GWicke) (talk) 17:15, 17 September 2013 (UTC)Reply

While I favor simplifying our URLs and think this RfC would generally be a win, it would nonetheless backfire with our current, fragile pipeline for processing access logs. For example, webstatscollector considers each and every /wiki/ request and ignores the action :-(. Thereby, not only requests for an article's content, but also requests for the article's history (which might be ok), requests for feeds of the article's history (which might be less intentional), or fetches of raw JS (e.g. MediaWiki:RefToolbarBase.js, which is hardly a pageview) would get counted towards a project's page views. That would, for example, directly impact and further falsify the numbers in page view tables such as Page Views for Wikipedia, All Platforms, Normalized, or the reportcard. If we could postpone until the analysis at Analytics' card 1387 is done (so we can make a better-informed comment), that would help mitigate the effects of switching URLs. --QChris (talk) 13:02, 22 January 2014 (UTC)Reply
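To make the concern concrete, here is a rough, purely illustrative PHP sketch of the extra check a log-processing pipeline would need once action URLs also start with /wiki/ (webstatscollector is a separate codebase; the function name and parameter handling here are invented for this example):

<?php
// Hypothetical pageview filter; not the actual webstatscollector logic.
function isPageview(string $path, array $query): bool {
    if (strpos($path, '/wiki/') !== 0) {
        return false;                        // not a content URL at all
    }
    // With cleaned-up URLs, /wiki/Foo?action=history or ?feed=atom would
    // otherwise be counted, so the query string must be inspected too.
    $action = $query['action'] ?? 'view';
    return $action === 'view' && !isset($query['feed']);
}

var_dump(isPageview('/wiki/Main_Page', []));                      // bool(true)
var_dump(isPageview('/wiki/Main_Page', ['action' => 'history'])); // bool(false)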

As some months have passed and nothing happened on the Analytics side, it seems Analytics does not care. So consider my above comment (asking to postpone a bit) moot. Let's get cleaned-up URLs :-D QChris (talk) 17:40, 14 May 2014 (UTC)Reply

YES

Cleaning up our URLs this way would make me, and so many others who don't understand our uniform resource locators, happy. The two current changes outlined within the scope of this RFC fix a classic example of an "implementation mental model" problem: the interface presented to users follows the way things are implemented technically, rather than the way users naturally expect. Steven Walling (WMF) • talk 00:14, 17 September 2013 (UTC)Reply

I think this is a great idea, as long as it doesn't cause any problems with search engines. πr2 (tc) 19:05, 15 December 2013 (UTC)Reply

Yes, yes, yes! I manually edit /w/index.php?title=ThePartThatMatters&param=value... URLs in e-mails to the far better /wiki/ThePartThatMatters?param=value...; this change would save me time and improve URLs for everyone who doesn't bother. title= is just noise of no benefit, and swapping between /wiki/ for a page and /w/ for actions on a page is, as Steven Walling says, presentation of irrelevant technical implementation. -- S Page (WMF) (talk) 22:56, 14 January 2014 (UTC)Reply
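As a concrete illustration of the rewrite S Page does by hand, here is a hypothetical PHP helper (the function name is invented, and proper percent-encoding of titles is glossed over) that turns an old-style index.php URL into the proposed /wiki/ form:

<?php
// Hypothetical converter from /w/index.php?title=Foo&x=y to /wiki/Foo?x=y.
function cleanUrl(string $oldUrl): string {
    $parts = parse_url($oldUrl);
    parse_str($parts['query'] ?? '', $params);
    if (!isset($params['title'])) {
        return $oldUrl;                      // nothing to rewrite
    }
    $title = str_replace(' ', '_', $params['title']);
    unset($params['title']);
    $url = '/wiki/' . $title;
    return $params ? $url . '?' . http_build_query($params) : $url;
}

echo cleanUrl('/w/index.php?title=Foo&action=history'), "\n";
// prints: /wiki/Foo?action=history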

+1. I already use /wiki/Foo?action=history URLs in emails for more beautiful URLs. If it is feasible without too many problems with spider bots, it would be great to generalise it. ~ Seb35 [^_^] 22:46, 19 January 2014 (UTC)Reply

Separating page URLs from resources/actions

Another approach to the /w/ problem might be to put page names at the root, for example changing en.wikipedia.org/wiki/Main_Page to en.wikipedia.org/Main_Page, and to put all other URLs (.php entry points, images/, actions) on another domain, for example changing en.wikipedia.org/w/api.php to, say, en.wp-resources.org/api.php or wmf-resources.org/enwp/api.php... - Wonder (talk) 03:02, 17 September 2013 (UTC)Reply

Putting pretty URLs and other resources on separate domains is practically impossible and doesn't fix the issues:
  • Resources such as robots.txt and favicon.ico are universal. Even if you add another domain, root URLs still conflict with these resources.
  • You cannot simply change live URLs like these. Even if you make the new API location wmf-resources.org/enwp/api.php, the URL en.wikipedia.org/w/api.php must still point to the API, because there are piles of things still pointing at this URL. So the original issues haven't gone away.
  • You cannot place the API and the site on different domains. We use the API within the live site for things like watchlist updates. Moving the API to a different domain will break site features that use the API because of cross-origin restrictions (see the sketch after this comment). Even if we implement CORS, 1.83% of WP's traffic comes from browsers that don't implement CORS at all, and 9.65% from browsers that don't implement CORS in a way we can use.
  • Using two domains will break sessions. Login sessions will be tied to the en.wikipedia.org domain. As a result, any resource served to the user from the other domain won't have the session. While this won't be an issue for things like images – unless of course the wiki is private and using img_auth – it is critical for things like the API. Not only will there be CORS issues with the API, but the API won't even have the user session, so even in a CORS-supporting browser things like watchlist toggling will break. And logins done with the API's action=login will have cookies tied to the wrong domain if they are intended for any sort of AJAX login or to present desktop views to the user (I wonder if the mobile site would fall under this category).
Daniel Friesen (Dantman) (talk) 04:04, 17 September 2013 (UTC)Reply
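For readers not familiar with the cross-origin restrictions Daniel describes: if the API lived on a different domain, every browser-side API call from the wiki would need the API to opt in via CORS response headers, roughly as in the generic PHP sketch below. The origin value is an example and this is not MediaWiki's actual API code; even with these headers, only browsers that support CORS with credentials would work, and the session-cookie problem above remains.

<?php
// Generic sketch of the CORS opt-in a cross-domain API would need.
// The allowed origin is an example value; this is not MediaWiki code.
$allowedOrigin = 'https://en.wikipedia.org';

if (($_SERVER['HTTP_ORIGIN'] ?? '') === $allowedOrigin) {
    header('Access-Control-Allow-Origin: ' . $allowedOrigin);
    // Needed so the session cookie is sent with the request; as noted
    // above, that cookie would be tied to the wrong domain anyway.
    header('Access-Control-Allow-Credentials: true');
}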

Statement of problem

Hi. I still think this request for comments is missing a clear statement of the problem it is trying to address. As I understand it, the vast majority of requests currently are to example.com/wiki/Foo (using the implicit "view" action). What is the purpose of rewriting the URLs of other actions? They get far lower traffic, and we generally don't want to (for example) cache history pages (action=history) or edit screens (action=edit). Why would we rewrite URLs to have a uniform prefix? What problems are we seeing right now that would be addressed by this change? --MZMcBride (talk) 00:13, 22 September 2013 (UTC)Reply

There are two applicable bugs linked in the RFC, and probably more not mentioned. I support this, as long as we commit to making all the old URLs still work through rewrites. We do need to think through the possible issues.
As far as caching goes, I see that as secondary. Some pages with query strings could probably be cached (e.g. maybe printable=yes), but that should be a separate conversation. Also, I wonder if Varnish or Squid can be made to ignore parameter order (see the "multiple query parameters" point in the lede). I don't know enough to answer that question. Superm401 - Talk 11:44, 27 September 2013 (UTC)Reply
Hi Superm401.
The two bugs you cite as applicable (bugzilla:16659 and bugzilla:17981) are tangential at best. For permalinks, we already have /wiki/Special:Permalink/XXXXX. How much cleaner can that get? For history pages, I guess you're making the argument that /w/index.php?title=Foo&action=history is worse than /wiki/Foo?action=history? The marginal benefit is very small. And the bug is really asking for action paths (i.e., /history/Foo), which are completely outside the current scope of this RFC, as I understand it.
If these are the only "problems" that this RFC seeks to address, the cost of changing the URL structure does not seem proportionate to the benefit. --MZMcBride (talk) 20:35, 27 September 2013 (UTC)Reply

I agree. This proposal doesn't seem to be solving any problem (or at least no stated problem). The most compelling argument given seems to be «equivalent queries, which both would need to be purged»: can Varnish folks confirm (BBlack?)? How much of an issue is the multiple caching/purging? To solve that, AFAICS you'd just have to normalise the URL, e.g. rewriting them to have the parameters in alphabetical order. --Nemo 06:37, 8 April 2015 (UTC)Reply
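The normalisation Nemo suggests can be sketched independently of the cache layer; Varnish or Squid would do this in their own configuration, and the PHP below only illustrates the idea of sorting query parameters so that equivalent URLs collapse to a single cache key:

<?php
// Illustration only: canonicalise a URL by sorting its query parameters,
// so ?uselang=de&action=history and ?action=history&uselang=de cache as one.
function normalizeUrl(string $url): string {
    $parts = parse_url($url);
    parse_str($parts['query'] ?? '', $params);
    ksort($params);
    $query = http_build_query($params);
    return $parts['path'] . ($query !== '' ? '?' . $query : '');
}

echo normalizeUrl('/wiki/Foo?uselang=de&action=history'), "\n";
// prints: /wiki/Foo?action=history&uselang=de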


I tend to agree with MZMcBride. While we could certainly do this, I'm unsure what the benefit would be. The proposed URL scheme looks just as ugly as the current one (IMHO; tastes vary. However, if you proposed action-path-style URLs, I might be more convinced by the aesthetic argument). The caching argument does not seem that compelling to me. Most resources other than view aren't cached by URL, and even if they are, this hardly seems like a big deal. Even in the best-case scenario it will probably cause mild confusion when users suddenly see the scheme they are used to shift, and I don't see what benefit would balance this (albeit extremely mild) drawback. Bawolff (talk) 12:02, 8 April 2015 (UTC)Reply

robots.txt

robots.txt seems like a part of this that needs careful research. We need to verify that the well-behaved bots (can't do anything about ones that just ignore robots.txt) responsible for most of our traffic obey the glob Disallow mentioned. Also we need to make sure that pages like https://en.wikipedia.org/wiki/Who%3F_%28novel%29 are interpreted correctly. In other words, that robots.txt engines don't decide that's equivalent to https://en.wikipedia.org/wiki/Who?_(novel) and thus blocked. Superm401 - Talk 11:48, 27 September 2013 (UTC)Reply
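To illustrate why the percent-encoding question matters, here is a small PHP sketch of one naive way a crawler might turn a glob rule into a pattern; real crawlers each implement their own matching, which is exactly what would need verifying. Note how the raw, encoded path contains no literal "?" and so does not match, while the decoded form does:

<?php
// Illustrative only: converts a robots.txt glob rule to a regex and tests it
// against raw vs. percent-decoded paths. Not how any particular bot works.
function globToRegex(string $pattern): string {
    // Escape regex metacharacters, then turn the robots.txt "*" wildcard
    // back into ".*". A literal "?" in the rule stays a literal "?".
    return '#^' . str_replace('\*', '.*', preg_quote($pattern, '#')) . '#';
}

$regex = globToRegex('/wiki/*?');            // the proposed Disallow rule

$raw     = '/wiki/Who%3F_%28novel%29';       // path as sent on the wire
$decoded = rawurldecode($raw);               // "/wiki/Who?_(novel)"

var_dump(preg_match($regex, $raw));          // int(0): not treated as blocked
var_dump(preg_match($regex, $decoded));      // int(1): treated as blocked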

I also have a separate issue with the idea of blindly blacklisting /wiki/*?, which I sent an email about:

Ok. Though even assuming the * and Allow: non-standard features are supported by all bots we want to target I actually don't like the idea of blacklisting /wiki/*? in this way.

I don't think that every url with a query in it qualifies as something we want to blacklist from search engines. There are plenty but sometimes there is content that's served with a query which could otherwise be a good idea to index.

For example, the non-first pages of long categories and Special:Allpages' pagination. The latter has robots=noindex – though I think we may want to reconsider that – but the former is not noindexed and, with the introduction of rel="next", etc., would be pretty reasonable to index, but is currently blacklisted by robots.txt. Additionally, while we normally want to noindex edit pages, this isn't true of redlinks in every case. Take redlinked category links for example. These link to an action=edit&redlink=1 URL which, for a search engine, would then redirect back to the pretty URL for the category. But because of robots.txt this link is masked, because the intermediate redirect cannot be read by the search engine.

The idea I had to fix that naturally was to make MediaWiki aware of this and, whether by a new routing system or simply by filters for specific simple queries, make it output /wiki/title?query URLs for those cases where it's a query we would want indexed, and leave robots-blacklisted stuff under /w/ (though I did also consider a separate short URL path like /w/page/$1 to make internal/robots-blacklisted URLs pretty). However, adding Disallow: /wiki/*? to robots.txt will preclude the ability to do that.
— http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/72620/focus=72654

Daniel Friesen (Dantman) (talk) 12:43, 27 September 2013 (UTC)Reply
I agree. Excluding default "dynamic" pages like the history from crawling makes sense, but reducing the availability of content more than what we do now is unwarranted and a huge cost which should be considered in the drawbacks to be compared to the benefits (if any; see problem statement). Moreover, expanding robots.txt exclusions is unfair, because parameters to index.php can also be sent in subpage format (with many special pages), so what's the point? --Nemo 06:42, 8 April 2015 (UTC)Reply
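A hypothetical sketch of the routing idea in the quoted mail above (the function name and the parameter whitelist are invented for illustration, and proper title encoding is omitted): emit the pretty /wiki/Title?query form only for queries that should be indexable, and keep everything else behind the robots-excluded /w/ prefix.

<?php
// Hypothetical link builder; names and the whitelist are made up.
function buildLink(string $title, array $query): string {
    $indexableParams = ['from', 'page'];     // e.g. category pagination (assumption)
    $isIndexable = empty(array_diff(array_keys($query), $indexableParams));

    if ($isIndexable) {
        $url = '/wiki/' . str_replace(' ', '_', $title);
        return $query ? $url . '?' . http_build_query($query) : $url;
    }
    // Everything else stays on the /w/ path that robots.txt already blocks.
    return '/w/index.php?' . http_build_query(['title' => $title] + $query);
}

echo buildLink('Category:Physics', ['from' => 'Quantum']), "\n";
// prints: /wiki/Category:Physics?from=Quantum
echo buildLink('Main Page', ['action' => 'edit']), "\n";
// prints: /w/index.php?title=Main+Page&action=edit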

But you can already do this?

The feature proposed here has worked for as long as I can remember; it just isn't used by any links on the site.

http://www.mediawiki.org/wiki/Requests_for_comment/Clean_up_URLs?action=history

101.160.15.107 09:29, 28 December 2013 (UTC)Reply

As I understand the RFC, the proposed feature is to use it by default: links in pages would directly use this type of URL. E.g. the History tab would be /wiki/Foo?action=history (vs. currently /w/index.php?title=Foo&action=history). Isn't that right? ~ Seb35 [^_^] 22:31, 19 January 2014 (UTC)Reply

Bug 12619

I just came upon bug 12619, which seems to have requested the same thing, and which is not mentioned in the RFC. Matma Rex (talk) 14:26, 28 December 2013 (UTC)Reply

Thanks, I added T14619 to the {{RFC }} on this page. SPage (WMF) (talk) 20:28, 3 March 2015 (UTC)Reply

action=raw behavior difference

Given any page whose title ends with .json, I noticed that asking for ?action=raw with the lovely, so-much-better, please-do-this URL gets a 403 Forbidden. E.g.

https://www.mediawiki.org/wiki/User:SPage_%28WMF%29/test.json?action=raw

fails: includes/WebRequest.php->checkUrlExtension() throws a 403 Forbidden error and displays "Invalid file extension found in the path info or query string."

But the old-style index.php URL

https://www.mediawiki.org/w/index.php?title=User:SPage_(WMF)/test.json&action=raw

doesn't trigger this.

The comment is "Check if Internet Explorer will detect an incorrect cache extension..." The inconsistency is weird, if it's insecure or causes cache problems a bad actor can stick either link on a page and make it look the same.

-- SPage (WMF) (talk) 20:42, 3 March 2015 (UTC)Reply
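This is not MediaWiki's actual checkUrlExtension() implementation, but a quick way to see the structural difference the check reacts to: in the pretty URL the ".json" extension sits at the end of the path, while in the old-style URL the path ends in "index.php" and ".json" only appears inside the query string.

<?php
// Illustration only: where the ".json" extension appears in each URL form.
$pretty = parse_url('https://www.mediawiki.org/wiki/User:SPage_%28WMF%29/test.json?action=raw');
$old    = parse_url('https://www.mediawiki.org/w/index.php?title=User:SPage_(WMF)/test.json&action=raw');

var_dump(pathinfo($pretty['path'], PATHINFO_EXTENSION)); // string(4) "json": the form that gets the 403
var_dump(pathinfo($old['path'], PATHINFO_EXTENSION));    // string(3) "php": the form that passes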

About /wiki/ prefix

wikidata.org also uses http://wikidata.org/entity/ as the URL prefix for entities, and other ontology item URLs assume the /wiki/ space does not intersect with ontology URLs like /entity/. Removing /wiki/ may lead to trouble there too. Just something to keep in mind. --Smalyshev (WMF) (talk) 00:26, 8 April 2015 (UTC)Reply

Gabriel dropped that part of the RFC proposal; what remains (and was approved 2015-04-08) is solely /wiki/PageName?action=foo URLs. In IRC, people discussed using action paths like /edit/PageName, but that wasn't pursued. -- SPage (WMF) (talk) 21:12, 9 April 2015 (UTC)Reply
Return to "Requests for comment/Clean up URLs" page.