User:Bináris/Pywikibot cookbook

Published

Pywikibot is the ninth wonder of the world, the eighth being MediaWiki itself.

Pywikibot is a very flexible and powerful tool for editing Wikipedia or any other MediaWiki instance. However, there comes the moment when you feel that something is missing from it, and the Universe calls you to write your own scripts. Don't be afraid, this is not a disease, this is the natural way of personal evolution. Pywikibot is waiting for you: you will find the scripts/userscripts directory, which is ready to host your scripts.

This book is for you, if you

For general help see the bottom right template. In this book we go into coding examples with some deeper explanation.

(A personal confession from the creator of this page: I just wanted to use Pywikipedia, as we called it in the old times, then I wanted to slightly modify some of the scripts to better fit my needs, then I went to the book store and bought my first Python book. So it goes.)

Introduction

Published

Creating a script

Encoding and environment
It is vital that all Python 3 source files MUST[1] be UTF-8 without a BOM. Therefore it is a good idea to forget the bare Notepad of Windows forever, because it has the habit of soiling files with a BOM. The minimal suggested editor is Notepad++, which is developed for programming purposes and is cross-platform. It has an Encoding menu where you can see what I am talking about, and you may set UTF-8 without BOM as the default encoding. Any real programming IDE will do the job properly, e.g. Visual Studio Code is quite popular nowadays. Python has an integrated editor called IDLE, which uses proper encoding, but for some mysterious reason does not show line numbers, so you will suffer a lot from error messages when you keep trying to find the 148th line of your code.
Where to put
The scripts/userscripts directory is designed to host your scripts. This is a great idea, because this directory will be untouched when you update Pywikibot, and you can easily back up your own work by archiving just this directory.
You may also create your own directory structure. If you would like to use a directory other than the default, search for user_script_paths in user-config.py, and you will see the solution.
See also https://doc.wikimedia.org/pywikibot/master/utilities/scripts_ref.html#module-pywikibot.scripts.wrapper.
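A minimal sketch of such a setting in user-config.py (the directory name myscripts is hypothetical, and the exact path format is described in the wrapper documentation linked above):

# In user-config.py: additional place(s) where your scripts are searched for.
# Depending on your setup a dotted package path may be required; see the linked docs.
user_script_paths = ['myscripts']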

Running a script


You basically have two ways. The recommended one is to call your script through pwb.py. Your prompt should be in the Pywikibot root directory (where pwb.py is), and use:

python pwb.py <global options> <name_of_script> <options>
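For example (a sketch: -lang: and -family: are global options handled by pwb.py, listpages is a standard script, and -cat: is a page generator option it understands; the category name is hypothetical):

python pwb.py -lang:hu -family:wikipedia listpages -cat:"Budapest parkjai"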

However, if you don't need these features, especially if you don't use global options and don't want pwb.py to handle command line arguments, you are free to run the script directly from the userscripts directory.

Coding style


Of course, we have PEP 8, Manual:Coding conventions/Python and Manual:Pywikibot/Development/Guidelines. But sometimes we feel like just hacking together a small piece of code for ourselves and not bothering with style.

Many times a small piece of temporary code begins to grow beyond our initial expectations, and we have to clean it up.

If you'll take my advice, do what you want, but my experience is that it is always worth coding for myself as if I were coding for the world.

On the other hand, when you use Pywikibot interactively (see below), it is normal to be lazy and use abbreviations and aliases. For example:

>>> import pywikibot as p
>>> import pywikibot.pagegenerators as pg

Note that the p alias cannot be used in the second import. It will be useful later, e.g. for p.Site().

However, in this cookbook we won't use these abbreviations for better readability.

Beginning and ending


In most cases you see something like this in the very first line of Pywikibot scripts:

#!/usr/bin/python or #!/usr/bin/env python3

This is a shebang. If you use a Unix-like system, you know what it is for. If you run your scripts on Windows, you may just omit this line; it does not do anything. But it can be a good idea to use it anyway, in case someday others want to run your script on a Unix-like system.

The very last two lines of the scripts also follow a pattern. They usually look like this:

if __name__ == '__main__':
    main()

This is a good practice in Python. When you run the script directly from the command line (that's what we call directory mode), the condition will be true, and the main() function will be called. That's where you handle arguments and start the process. On the other hand, if you import the script (that is library mode), the condition evaluates to false, and nothing happens (only the lines on the main level of your script will be executed). Thus you may directly call the function or method you need.
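Putting the two patterns together, a minimal skeleton might look like this (a sketch, not an official template; pywikibot.handle_args() is the helper that consumes the global options):

#!/usr/bin/env python3
"""Minimal Pywikibot script skeleton."""
import pywikibot


def main(*args):
    # handle_args() processes global options (-lang:, -family:, -user:, ...)
    # and returns the remaining, script-specific arguments.
    local_args = pywikibot.handle_args(args)
    site = pywikibot.Site()
    pywikibot.output(f'Working on {site}; remaining arguments: {local_args}')


if __name__ == '__main__':
    main()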

Scripting vs interactive use


For proper work we use scripts. But there is an interesting way of creating a sandbox. Just go to your Pywikibot root directory (where pwb.py is), type

python

and at the Python prompt type

>>> import pywikibot
>>> site = pywikibot.Site()

Now you are in the world of Pywikibot (if user-config.py is properly set). This is great for trying things out, experimenting, even for small and rapid tasks. For example, to change several occurrences of Pywikipedia to Pywikibot on an outdated community page, just type:

>>> page = pywikibot.Page(site, 'titlecomeshere')
>>> page.text = page.text.replace('Pywikipedia', 'Pywikibot')
>>> page.save('Pywikibot forever!')

Throughout this document the >>> prompt indicates that we are in the interactive shell. You are encouraged to play with this toy. Where this prompt is not present, the code lines have to be saved into a Python source file. Of course, when you use save(), it goes live on your wiki, so be careful. You may also set testwiki as your site to avoid problems.

A big advantage of the shell is that you may omit the print() function. In most cases

page.title()

is equivalent to

print(page.title())

The #Working with namespaces section shows a rare exception where these are not equivalent, and we can take advantage of the difference to understand what happens.

Documentation and help


We have three levels of documentation. As you go forward into understanding Pywikibot, you will become more and more familiar with these levels.

  1. Manual:Pywikibot – written by humans for humans. This is recommended for beginners. It also has a "Get help" box.
  2. https://doc.wikimedia.org/pywikibot – mostly autogenerated technical documentation with all the fine details you are looking for. Click on stable if you use the latest deployed stable version of Pywikibot (this is recommended unless you want to develop the framework itself), and on master if you use the current version that is still under development. Differences are usually small.
  3. The code itself. It is useful if you don't find something in the documentation or you want to find working solutions and good practices. You may reach it from the above docs (most classes and methods have a source link) or from your computer.

Notes

  1. The full-capitalized MUST has a special meaning in programming style guides, see RFC 2119.

Basic concepts

Published

Throughout the manual and the documentation we speak about MediaWiki rather than Wikipedia and Wikibase rather than Wikidata, because these are the underlying software. You may use Pywikibot on other projects of the Wikimedia Foundation, on any non-WMF wiki or repository on the Internet, even on a MediaWiki or Wikibase instance on your home computer. See the right-hand menu for help.

Site
A site is a wiki that you will contact. Usually you work with one site at a time. If you have set your family and language in user-config.py, getting your site (e.g. one language version of Wikipedia) is as simple as site = pywikibot.Site().
You will find more options at https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.html#pywikibot.Site
Site is a bit tricky because you don't find its methods directly in the documentation. This is because Site is not a class but a factory function that returns class instances. Digging into it you will find that its methods are under https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._basesite.BaseSite as well as https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#module-pywikibot.site._apisite.
Even stranger is that some of these methods manipulate pages. This is a technical decision of the developers, but you don't have to deal with them, as they are not for direct use. It is still interesting to browse the site methods, as you may get ideas from them about how to use the framework.
>>> import pywikibot
>>> site = pywikibot.Site()
>>> site
APISite("hu", "wikipedia")
This shows that your site is an APISite, but it also inherits methods from BaseSite.
Repository
A data repository is a Wikibase instance; when you work with WMF wikis, it is Wikidata itself. While you can work with Wikidata directly, in which case it will be a site, often you want to connect Wikipedia and Wikidata. So your Wikipedia will be the site and Wikidata the connected data repository. The site knows about its repository, therefore you don't have to write it in your user-config.py; rather get it as site.data_repository(). (See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._apisite.APISite.data_repository.) However, articles have direct methods to reach their Wikidata item, so in most cases you may not have to use the repository at all.
Image repository basically means Commons, at least for WMF wikis.
Continuing the previous code will show that the repository is also a special subclass of BaseSite:
>>> site.data_repository()
DataSite("wikidata", "wikidata")
>>> site.image_repository()
DataSite("commons", "commons")
Commons and Wikidata don't have language versions like Wikipedia, so their language and family is the same.
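As a sketch of the shortcut mentioned above, an article can reach its Wikidata item directly with data_item(), without touching the repository explicitly (the item id shown is the one I expect for Budapest):
>>> page = pywikibot.Page(site, 'Budapest')
>>> item = page.data_item()  # the corresponding ItemPage on the data repository
>>> item
ItemPage('Q1781')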
Page
Page is the most sophisticated entity in Pywikibot, as we usually work with pages, and they come in several kinds and have many properties and operations. See:
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.html#pywikibot.Page
Page is a subclass of BasePage, a general concept to represent all kinds of wikipages such as Page, WikibasePage and FlowPage. See:
https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.BasePage
Most of the time you are likely to work with Page instances (unless your primary target is Wikidata itself). You won't instantiate BasePage directly, but if you wonder what you can do with a page (such as an article, a talk page or a community noticeboard), you should browse both documentations above.
We have two main approaches to pages: ready-to-use page objects may be obtained from page generators, or single pages may be created one by one from their titles. Below we look into both.
Category
Category is a subclass of Page. That means it represents a category with its special methods as well as the category page itself.
User
User is a subclass of Page. That means it represents a user with its special methods as well as the user page itself.

All the above concepts are classes or class-like factories; in the scripts we instantiate them. E.g. site = pywikibot.Site() will create a site instance.

Directory mode and library mode of Pywikibot
See in section #Beginning and ending.

Testing the framework

Published

Let's try this at Python prompt:

>>> import pywikibot
>>> site = pywikibot.Site()
>>> print(site.username())
BinBot

Of course, you will have the name of your own bot there if you have set up user-config.py properly. Now, what does it mean? It does not mean this is a valid username, let alone that it is logged in. It does not mean you have reached Wikipedia, nor that you have an Internet connection. It means that Python is working, Pywikibot is working, and you have set your home wiki and username in user-config.py. Any string could appear there at this point.

If you save the above code to a file called test.py:

import pywikibot
site = pywikibot.Site()
print(site.username())

and run it with python pwb.py -user:Brghhwsf test, you will get Brghhwsf.

Now try

print(site.user())

This already really contacts your wiki; the result is the name of your bot if you are logged in, otherwise None. For advanced use it is important that although user() sounds similar to a User object, here the result is just a string. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.site.html#pywikibot.site._basesite.BaseSite.user.

Getting a single page

Published

Creating a Page object from title


In the rest of this cookbook, unless otherwise stated, we always assume that you have already used these two basic statements:

import pywikibot
site = pywikibot.Site()

You want to get the article about Budapest in your wiki. As it is in the article namespace, this is as simple as

page = pywikibot.Page(site, 'Budapest')

Note that Python is case sensitive, and in its world Site and Page mean classes,[1] Site() and Page() class instances, while lowercase site and page should be variables.

For such simple experiments the interactive Python shell is convenient, as you can easily see the results without using print() or saving and running your code.

>>> page = pywikibot.Page(site, 'Budapest')
>>> page
Page('Budapest')
>>> type(page)
<class 'pywikibot.page._page.Page'>

Getting the type of an object is often useful when you want to discover the capabilities of Pywikibot. It may look strange, but the main thing is that you got a Page. Now let's see the user page of your bot. Either you prefix the title with the namespace ('User' and other English names work everywhere, while the localized names work only in your very wiki) or you give the namespace number as the third argument. So

>>> title = site.username()
>>> page = pywikibot.Page(site, 'User:' + title)
>>> page
Page('Szerkesztő:BinBot')

and

>>> title = site.username()
>>> page = pywikibot.Page(site, title, 2)
>>> page
Page('Szerkesztő:BinBot')

will give the same result. 'Szerkesztő' is the localized version of 'User' in Hungarian; Pywikibot does not preserve the English namespace name I used in my command, the result is always localized.

Getting the title of the page


On the other hand, if you already have a Page object and you need its title as a string, the title() method will do the job:

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.title()
'Budapest'
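
title() also accepts keyword arguments that vary the format; a quick sketch (as_link and with_ns are real parameters, and the output shown is what I expect on huwiki):

>>> page = pywikibot.Page(site, 'User:BinBot')
>>> page.title(as_link=True)
'[[Szerkesztő:BinBot]]'
>>> page.title(with_ns=False)
'BinBot'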

Possible errors


While getting pages may cause far fewer errors than saving them, a few types are worth mentioning; some of them are technical, while others are contradictions between our expectations and reality. Let's speak about them before actually getting the page.

  1. The page does not exist.
  2. The page is a redirect.
  3. You may have been misled regarding the content in some namespaces. If your page is in the Category namespace, the content is the descriptor page. If it is in the User namespace, the content is the user page. The trickiest is the File namespace: the content is the file descriptor page, not the file itself; moreover, if the file comes from Commons, the page may not exist in your wiki at all, while you still see the picture.
  4. The expected piece of text is not in the page content because it is transcluded from a template. You see the text on the page, but cannot replace it directly by bot.
  5. Sometimes badly formed wikitext may work well. For example [[Category:Foo  bar]] with two spaces will behave as [[Category:Foo bar]]. While the page is in the category and you will get it from a page generator (see below), you won't find the expected string in its text.
  6. And, unfortunately, Wikipedia servers sometimes face errors. If you get a 500 error, go and read a book until the server comes back.
  7. InvalidTitleError is raised in very rare cases. A possible reason is that you wanted to get a page title that contains illegal characters.

Getting the content of the page


Important: at this point we don't have any knowledge about the existence of the page. We have not contacted the live wiki yet. We have just created an object. It is just like a street address: you may write it on a document whether or not there is a house there.

There are two main approaches to getting the content. It is important to understand the difference.

Page.text


You may notice that text does not have parentheses. Looking into the code we discover that it is not a method, rather a property. This means text is ready to use without calling it, may be assigned a value, and is present upon saving the page.

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.text

will write the whole text on your screen. Of course, this is for experiment.

You may write
text = page.text
if you need a copy of the text, but usually this is unnecessary. Page.text is not a method, so referring to it several times does not slow down your bot. Just manipulate page.text or assign it a new value, then save.

If you want to know details on how a property works, search for "Python decorators". For using it in your scripts it is enough to know the behaviour. Click on the above link and go through the right-hand menu. You will find some other properties without parentheses.

Page.text will never raise an error. If the page is a redirect, you will get the redirect link instead of the content of the target page. If the page does not exist, you will get an empty string, which is just what happens if the page does exist but is empty (this is common for talk pages). Try this:

>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
...     print('Got it!')
... else:
...     print(f'Page {page.title()} does not exist or has no content.')
...
Page Arghhhxqrwl!!! does not exist or has no content.

Page.text is comfortable if you don't have to deal with the existence of the page; otherwise it is your responsibility to tell the difference. An easy way is Page.exists().

>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> if page.text:
...     print(len(page.text))
... else:
...     print(page.exists())
...
False

While page creation does not contact the live wiki, referring to text for the first time and Page.exists() usually do. For several pages this will take a while. If it is too slow for you, go to the #Working with dumps section. page.has_content() shows whether the content is already available: if it returns True, the bot will not retrieve the page again. Therefore it returns True for non-existing pages, as it is senseless to reload them. Although this is a public method, you are unlikely to have to use it directly.

Page.get()


The traditional way is page.get() which forces you to handle the errors. In this case we store the value in a variable.

>>> page = pywikibot.Page(site, 'Budapest')
>>> text = page.get()
>>> len(text)
165375

A non-existing page causes a NoPageError:

>>> page = pywikibot.Page(site, 'Arghhhxqrwl!!!')
>>> text = page.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
    self._getInternals()
  File "c:\Pywikibot\pywikibot\page\_page.py", line 436, in _getInternals
    self.site.loadrevisions(self, content=True)
  File "c:\Pywikibot\pywikibot\site\_generators.py", line 772, in loadrevisions
    raise NoPageError(page)
pywikibot.exceptions.NoPageError: Page [[hu:Arghhhxqrwl!!!]] doesn't exist.

A redirect page causes an IsRedirectPageError:

>>> page = pywikibot.Page(site, 'Time to Shine')
>>> text = page.get()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Pywikibot\pywikibot\page\_page.py", line 397, in get
    self._getInternals()
  File "c:\Pywikibot\pywikibot\page\_page.py", line 444, in _getInternals
    raise self._getexception
pywikibot.exceptions.IsRedirectPageError: Page [[hu:Time to Shine]] is a redirect page.

If you don't want to handle redirects, just distinguish existing from non-existing pages, the get_redirect argument will make its behaviour more similar to that of text:

>>> page = pywikibot.Page(site, 'Time to Shine')
>>> page.get(get_redirect=True)
'#ÁTIRÁNYÍTÁS [[Time to Shine (egyértelműsítő lap)]]'

Here is a piece of code to handle the cases. It is already too long for prompt, so I saved it.

for title in ['Budapest', 'Arghhhxqrwl!!!', 'Time to Shine']:
    page = pywikibot.Page(site, title)
    try:
        text = page.get()
        print(f'Length of {page.title()} is {len(text)} bytes.')
    except pywikibot.exceptions.NoPageError:
        print(f'{page.title()} does not exist.')
    except pywikibot.exceptions.IsRedirectPageError:
        print(f'{page.title()} redirects to {page.getRedirectTarget()}.')
        print(type(page.getRedirectTarget()))

Which results in:

Length of Budapest is 165375 bytes.
Arghhhxqrwl!!! does not exist.
Time to Shine redirects to [[hu:Time to Shine (egyértelműsítő lap)]].
<class 'pywikibot.page._page.Page'>

While Page.text is simple, it gives only the text of the redirect page. With getRedirectTarget() we got another Page instance without parsing the text. Of course, the target page may itself not exist or be another redirect. Scripts/redirect.py gives a deeper insight.

For a practical application see #Content pages and talk pages.

Reloading


If your bot runs slowly and you are in doubt whether the page text is still current, use get(force=True). Experiment shows that it does not update page.text, which is good on the one hand, as you don't lose your data, but on the other hand requires attention to stay conscious of the difference.

>>> import pywikibot as p
>>> site = p.Site()
>>> page = p.Page(site, 'Kisbolygók listája (1–1000)')
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak
listája]]'
>>> page.text = 'Luke, I am your father!'
>>> page.text
'Luke, I am your father!'
>>> page.get(force=True)
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak
listája]]'
>>> page.text
'Luke, I am your father!'
>>>
>>> page.text = page.get()
>>> page.text
'[[#1|1–500.]] • [[#501|501–1000.]]\n\n{{:Kisbolygók listája (1–500)}}\n{{:Kisbolygók listája (501–1000)}}\n\n[[Kategória:A Naprendszer kisbolygóinak
listája]]'

Page.exists() currently does not reflect a forced reload, see phab:T330980.

Notes

  1. This is not quite true; as we saw earlier, Site is a factory that creates objects. The difference is hidden on purpose because it acts like a class, and Site() will really be an instance.

Saving a single page

Published

Here we also have two solutions, but the difference is much smaller than for getting.

The new and recommended way is save(). As we discussed above, page.text is a property, which means it is always present together with the page object. So when you save the page, the new content will be page.text. save() works without any argument, but a proper edit summary is expected on most wikis, especially from bots, so please always use one. Some wikis also expect that it begins with a Bot: prefix. Please always follow your community's rules.
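A minimal sketch of the recommended pattern (the page title and the summary are made up for this example):

page = pywikibot.Page(site, 'Wikipédia:Homokozó')  # the huwiki sandbox; substitute a page you may edit
page.text += '\n* The cookbook bot was here.'
page.save('Bot: cookbook example edit')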

save() is hard to follow because it calls _save(), which calls APISite.editpage(). It is a technical decision that operations writing to the wiki are implemented in the APISite class rather than in Page. This way API calls, token handling and other disgusting low-level stuff are hidden somewhere in the corner.

On the other hand, put() is the old and traditional way that you will find in several existing scripts on Wikipedia. It takes the text as first argument. It is based on the concept that you have the text in a separate variable:

page = pywikibot.Page(site, 'Special:MyPage')
text = page.get()
# Do something with text
page.put(text, 'Hello, I modified it!')

Putting text is equivalent to

page.text = text
page.save('Hello, I modified it!')

If you look into the code, this is just what put() does: it simply calls save(). So you may create your text in whatever variable and put it, but the recommended way is to place your content into page.text and then save().

The only capability of the old put() which has not been transferred to save() is the show_diff parameter. If you set it to True, the diff between the old and the new text will be shown on your screen.

put() can be useful if you begin to create a text before the page object is available. You may also prefer it when you put the same text to several pages. For example, in the Hungarian Wikipedia we decided to empty old anonymous talk pages with outdated warnings. Of course,

page.put('This page was emptied.', 'Empty talk pages')

in a loop is simpler than assigning page.text a value each time.

Possible errors


 

Page generators – working with plenty of pages

Published

Overview


A page generator is an object that is iterable (see PEP 255) and that yields page objects on which other scripts can then work.

Most of these functions just wrap a Site or Page method that returns a generator. For testing purposes listpages.py can be used to print page titles to standard output.

Documentation

Page generators form one of the most powerful tools of Pywikibot. A page generator iterates the desired pages.

Why use page generators?

  • You may separate finding the pages to work on from the actual processing, so the code becomes cleaner and more readable.
  • They provide reusable code for typical tasks.
  • The Pywikibot team writes the page generators and follows the changes of the MediaWiki API, so you only have to write your code on a higher level.

A possible reason to write your own page generator is mentioned in Follow your bot section.

Most page generators are available via command line arguments for end users. See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.pagegenerators.html for details. If you write your own script, you may use these arguments, but if they are permanent for the task, you may want to directly invoke the appropriate generator instead of handling command line arguments.

Life is too short to list them all here, but the most used generators are listed under the above link. You may also discover them in the pywikibot/pagegenerators directory of your Pywikibot installation. They may be divided into three main groups:

  1. High-level generators for direct use, mostly (but not exclusively) based on the MediaWiki API. Usually they have long and hard-to-remember names, but the names may always be looked up in the docs or the code. They are connected to command line arguments.
  2. Filters. They wrap around another generator (taking the original generator as argument) and filter the results, for example by namespace; see the sketch after this list. This means they won't run too fast...
  3. Low-level API-based generators may be obtained as methods of Page, Category, User, FilePage, WikibasePage or Site objects. Most of them are wrapped into a high-level generator function, which is the preferred use (we may say, the public interface of Pywikibot); however, nothing forbids direct use. Sometimes they yield structures rather than page objects, but they may be turned into a real page generator, as we will see in an example.
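
To illustrate groups 1 and 2 together, here is a sketch that wraps a high-level generator into a filter (both generators exist in the pagegenerators package; the parameters are arbitrary):

from pywikibot.pagegenerators import (
    NamespaceFilterPageGenerator,
    RecentChangesPageGenerator,
)

# Take the 50 most recent changes and keep only main namespace pages.
gen = NamespaceFilterPageGenerator(
    RecentChangesPageGenerator(site=site, total=50), [0], site=site)
for page in gen:
    print(page)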

Pagegenerators package (group 1 and 2)


Looking into the pywikibot/pagegenerators directory you discover scripts whose names begin with an underscore. This means they are not for direct import; however, they can be useful for discovering the features. The incorporated generators may be used as

import pywikibot.pagegenerators
for page in pywikibot.pagegenerators.AllpagesPageGenerator(total=10):
    print(page)

which is almost equivalent to:

from pywikibot.pagegenerators import AllpagesPageGenerator
for page in AllpagesPageGenerator(total=10):
    print(page)

To interpret this directory, which appears in code as the pywikibot.pagegenerators package:

  • __init__.py primarily holds the documentation, but there are also some wrapper generators in it.
  • _generators.py holds the primary generators.
  • _filters.py holds the wrapping filter generators.
  • _factory.py is responsible for interpreting the command line arguments and choosing the appropriate generator function.

API generators (group 3)


MediaWiki offers a lot of low-level page generators, which are implemented in the GeneratorsMixin class. APISite is a child of GeneratorsMixin, so we may use these methods on our site instance. While the above mentioned pagelike objects have their own methods that may easily be found in the documentation of the class, they usually use an underlying method which is implemented in APISite, and this sometimes offers more features.
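For instance, Site.allpages() is a low-level counterpart of AllpagesPageGenerator(); a sketch (the namespace and total parameters exist, see the GeneratorsMixin documentation for the rest):

# Iterate the first five templates directly from the site object.
for page in site.allpages(namespace=10, total=5):
    print(page)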

Usage


Generators may be used in for loops as shown above, but also may be transformed to lists:

print(list(AllpagesPageGenerator(total=10)))

But be careful: while a loop processes pages continuously as they arrive, the list conversion may take a while because it has to read all the items from the generator first. This statement is very fast for total=10, takes noticeable time for total=1000, and is definitely slow for total=100000. It will also consume a lot of memory for big numbers, so usually it is better to use generators in a loop.

A few interesting generators


A non-exhaustive list of useful generators. All these may be imported from pywikibot.pagegenerators.

Autonomous generators (_generators.py)


Most of them correspond to a special page on wiki.

  • AllpagesPageGenerator(): Yields all the pages in one long queue, in alphabetical order. You may specify the start, the namespace, a limit to avoid endless queues, and whether redirects should be included, excluded or exclusively yielded. See an example below.
  • PrefixingPageGenerator(): Pages whose title begins with a given string. See an example below.
  • LogeventsPageGenerator(): Events from logs
  • CategorizedPageGenerator(): Pages from a given category (see the sketch after this list).
  • LinkedPageGenerator(): Pages that are linked from another page. See an example in chapter Creating and reading lists.
  • TextIOPageGenerator(): Reads from file or URL. See an example in chapter Creating and reading lists.
  • PagesFromTitlesGenerator(): Generates pages from their titles.
  • UserContributionsGenerator(): Generates pages that a given user worked on.
  • XMLDumpPageGenerator(): Reads from a downloaded dump on your device. In the dump pages are usually sorted by pageid (creation time). See in Working with dumps chapter.

... and much more...
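
As a quick sketch, CategorizedPageGenerator() takes a Category object rather than a title (the category name is only an example; the 'Kategória:' prefix is added automatically if missing):

from pywikibot.pagegenerators import CategorizedPageGenerator

cat = pywikibot.Category(site, 'Évezredek')
for page in CategorizedPageGenerator(cat, total=5):
    print(page)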

An odd one out

  • XmlDumpReplacePageGenerator() looks for pages in dump that are subject to a text replacement. It is defined within replace.py, and may be imported from there.

Filtering generators (_filters.py)

  • NamespaceFilterPageGenerator(): Only lets out pages from given namespace(s).
  • PageTitleFilterPageGenerator(): Lets you define an ignore list; those pages won't be yielded.
  • RedirectFilterPageGenerator(): Yields either only redirect pages or only non-redirects.
  • SubpageFilterGenerator(): Generator which filters out subpages based on depth.
  • RegexFilter: This is not a function but a class. It makes it possible to filter titles with a regular expression.
  • CategoryFilterPageGenerator(): Lets out only pages which are in all of the given categories.

Other wrappers (__init__.py)

  • PageClassGenerator(): You have Page objects from another generator. This wrapper examines them, and whichever represents a user, a category or a file description page is turned into the appropriate subclass so that you may use more methods; the others remain untouched.
  • PageWithTalkPageGenerator(): Takes a page generator, and yields the content pages and the corresponding talk pages each after the other (or just the talk pages).
  • RepeatingGenerator(): This one is exciting: it makes it possible to follow the events on your wiki live, taking pages from recent changes or some log.

Examples


List Pywikibot user scripts with AllpagesPageGenerator


You want to collect users' home-made Pywikibot scripts from all over Wikipedia. Supposing that they are in the user namespace and have a title ending with .py, a possible solution is to create your own page generator using AllpagesPageGenerator. Slow rather than quick. :-) This will not search in other projects.

from pywikibot.pagegenerators import AllpagesPageGenerator
def gen():
    langs = site.languages()
    for lang in langs:
        for page in AllpagesPageGenerator(
                namespace=2,
                site=pywikibot.Site(lang, 'wikipedia')):
            if page.title().endswith('.py'):
                yield page

And test with:

for page in gen():
    print(page)

If you want a reduced version for testing, you may use

    langs = site.languages()[1:4]

This will limit the number of Wikipedias to 3, and excludes the biggest, enwiki. You may also use a total=n argument in AllpagesPageGenerator.

Sample result:

[[de:Benutzer:Andre Riemann/wmlinksubber.py]]
[[de:Benutzer:Chricho/draw-cantorset.py]]
[[de:Benutzer:Christoph Burgmer/topiclist.py]]
[[de:Benutzer:Cmuelle8/gpx2svg.py]]
etc.

They are not guaranteed to be Pywikibot scripts, as other Python programs are also published there. You may retrieve the pages and check them for an import pywikibot line.

Titles beginning with a given word – PrefixingPageGenerator


While writing this section, we had a question on the village pump about the spelling of the name of József Degenfeld. A search showed that we have several existing articles about the Degenfeld family. To quickly compose a list of them, the technique from the Creating and reading lists chapter was copied:

from pywikibot.pagegenerators import PrefixingPageGenerator
print('\n'.join(['* ' + page.title(as_link=True) for page in PrefixingPageGenerator('Degenfeld')]))

* [[Degenfeld-Schonburg-kastély]]
* [[Degenfeld-kastély]]
* [[Degenfeld-kastély (Baktalórántháza)]]
* [[Degenfeld-kastély (Téglás)]]
* [[Degenfeld-kastély (egyértelműsítő lap)]]
* [[Degenfeld család]]

For such rapid tasks the shell is very suitable.

Pages created by a user with a site iterator


You want to list the pages created by a given user, for example yourself. How do we know that an edit created a new page? The answer is the parentid value, which is the oldid of the previous edit. If it is zero, there is no previous version, which means it was either a creation or a recreation after a deletion. Where do we get a parentid from? Either from a contribution (see the Working with users chapter) or a revision (see Working with page histories).

Of course, we begin with high-level page generators, just because this is the title of the chapter. We have one that is promising: UserContributionsGenerator(). Its description is: Yield unique pages edited by user:username. This is good for a start, but we get only pages, so we would have to fetch the first revision of each and check whether its username equals the desired user, which will not be the case for the vast majority, and it would be very slow.

So we look into the source and notice that this function calls User.contributions(), which is a method of a User object and has the description Yield tuples describing this user edits. This is promising again, but looking into it we see that the tuple does not contain parentid. We find an underlying method again, Site.usercontribs(). This looks good and has a link to API:Usercontribs, which is the fourth step of our investigation. Finally, this tells us what we want to hear: yes, it has parentid.

Technically, Site.usercontribs() is not a page generator, but we will turn it into one. It takes the username as a string and iterates contributions, and we may create the pages from their titles. The simplest version, just to show the essence:

for contrib in site.usercontribs(username):
    if not contrib['parentid']:
        page = pywikibot.Page(site, contrib['title'])
        # Do something

The introduction was long, the solution short. :-)

But it was not only short but also fast, because we did not create unnecessary objects, and, more importantly, did not get unnecessary data. We got dictionaries, read the parentid and the title from them, and created only the desired Page objects – but did not retrieve the pages, which is the slow part of the work.

Based on this simple solution we create an advanced application that

  • gets only pages from selected namespaces (this is not post-generation filtering like Pywikibot's NamespaceFilterPageGenerator(); the MediaWiki API will do the filtering on the fly)
  • separates pages by namespaces
  • separates disambiguation pages from articles by title and creates a fictive namespace for them
  • filters out redirect pages from articles and templates (in other namespaces and among disambiguation pages the ratio of these is very low, so we don't bother; this is a decision)
  • saves the list to separate subpages of the user by namespace
  • writes the creation date next to the titles.

Of these tasks only recognizing the redirects makes it necessary to retrieve the pages, which is slow and loads the server and the bandwidth. While the algorithm would be simpler if we did the filtering within the loop, it is more efficient to do this filtering afterwards, only on the selected pages.

import pywikibot

site = pywikibot.Site()
username = 'Bináris'
summary = 'A Bináris által létrehozott lapok listázása'
# Namespaces that I am interested in
namespaces = [
    0,  # main
    4,  # Wikipedia
    8,  # MediaWiki
    10,  # template
    14,  # category
]
# Subpage titles (where to save)
titles = {
    0: 'Szócikkek',
    4: 'Wikipédia',
    8: 'MediaWiki',
    10: 'Sablonok',
    14: 'Kategóriák',
    5000: 'Egyértelműsítő lapok',  # Fictive ns for disambpages
}
# To store the results
created = dict(zip(namespaces + [5000], [[] for i in range(len(titles))]))

for contrib in site.usercontribs(username, namespaces=namespaces):
    if contrib['parentid']:
        continue
    ns = contrib['ns']
    if ns == 0 and contrib['title'].endswith('(egyértelműsítő lap)'):  # disamb pages
        ns = 5000
    title = (':' if ns == 14 else '') + contrib['title']
    created[ns].append((title, contrib['timestamp']))

# Remove redirects from articles and templates
for item in created[0][:]:
    if pywikibot.Page(site, item[0]).isRedirectPage():
        created[0].remove(item) 
for item in created[10][:]:
    if pywikibot.Page(site, item[0]).isRedirectPage():
        created[10].remove(item)

for ns in created.keys():
    if not created[ns]:
        continue
    print(ns)
    page = pywikibot.Page(site, 'user:Bináris/Létrehozott lapok/' + titles[ns])
    print(page)
    page.text = 'Bottal létrehozott lista. Utoljára frissítve: ~~~~~\n\n'
    page.text += '\n'.join(
            [f'# [[{item[0]}]] {item[1][:10].replace("-", ". ")}.' 
                for item in created[ns]]
        ) + '\n'
    print(page.text)
    page.save(summary)
Line 4
Username as const (not a User object). Of course, you may get it from command line or a web interface.
Line 23
A dictionary for the results. Keys are namespace numbers, values are empty lists.
Line 26
This is the core of the script. We get the contributions with 0 parentid, check them for being a disambiguation page, prefix the category names with a colon, and store the titles together with the timestamps as tuples. We don't retrieve any page content by this point.
Line 35
Removal of redirects. Now we have to retrieve selected pages. Note the slicing in the loop head; this is necessary when you loop over a list and meanwhile remove items from it. [:] creates a copy to loop over it, preventing a conflict.
Line 43
We save the lists to subpages, unless they are empty.

A sample result is at hu:Szerkesztő:Bináris/Létrehozott lapok.

Summary


High-level page generators are varied and flexible, and they are often useful when we do some common task, especially if we want the pwb wrapper to handle our command-line arguments. But for some specific tasks we have to go deeper. On the next level there are the generator methods of pagelike objects, such as Page, User, Category etc., while on the lowest level there are the page generators and other iterators of the Site object, which are directly based on the MediaWiki API. Going deeper is possible through the documentation and the code itself.

On the other hand, iterating pages through an API iterator that takes the namespace as an argument may be faster than using a high-level generator from the pagegenerators package and then filtering it with a wrapping NamespaceFilterPageGenerator(). At least we may suppose so (no benchmark has been made).

In some rare cases this is still not enough, when some features offered by the API are not implemented in Pywikibot. You may either implement them and contribute to the common code base, or make a copy of the relevant code and enhance it with the missing parameter according to the API documentation.

Working with page histories

Almost published, except the file pages

Revisions


Processing page histories may be frightening due to the amount of data, but it is easy because we have plenty of methods. Some of them extract a particular piece of information such as a user name, while others return an object called a revision. A revision represents one line of a page history with all its data, which is more than you see in the browser and more than you usually need. Before we look into these methods, let's have a look at revisions. We have to keep in mind that

  • some revisions may be deleted
  • otherwise the text, the comment or the contributor's name may be hidden by admins so that non-admins won't see them (this may cause errors to be handled)
  • oversighters may hide revisions so deeply that even admins won't see them

Furthermore, bots are not directly marked in page histories. You see in recent changes if an edit was made by a bot, because this property is stored in the recent changes table of the database and is available there for a few weeks. If you want to know whether an older edit was made by a bot, you may

  • guess it from the bot name and the comment (not yet implemented, but we will try below)
  • follow through a lot of database tables which contributor had a bot flag at the time of the edit, and consider that registered bots can switch off their flag temporarily and admins can revert changes in bot mode (good luck!)
  • retrieve the current bot flag owners from the wiki and suppose that the same users were bots at the time of the edit (that's what Pywikibot does)

API:Revisions gives a deeper insight, while Manual:Slot says something about slots and roles (for most of us this is not too interesting).

Methods returning a single revision also return the content of the page, so it is a good idea to choose a short page for experiments (see Special:ShortPages on your home wiki). revisions() by default does not include the text unless you force it.

For now I choose a page which is short but has some page history: hu:8. évezred (8th millennium). Well, we really have little to say about it, and we suffer from a lack of reliable sources. Let's see the last (current) revision!

page = pywikibot.Page(site, '8. évezred')
rev = page.latest_revision  # It's a property, don't use ()!
print(rev)

{'revid': 24120313, 'parentid': 15452110, 'minor': True, 'user': 'Misibacsi', 'userid': 110, 'timestamp': Timestamp(2021, 8, 10, 7, 51), 'size': 341,
'sha1': 'ca17cba3f173188954785bdbda1a9915fb384c82', 'roles': ['main'], 'slots': {'main': {'contentmodel': 'wikitext', 'contentformat': 'text/x-wiki',
'*': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.
\n\n[[Kategória:Évezredek|08]]"}}, 'comment': '/* Csillagászati előrejelzések */ link', 'parsedcomment': '<span dir="auto"><span class="autocomment"><a href="/wiki/8._%C3%A9vezred#Csillagászati_előrejelzések" title="8. évezred">→\u200eCsillagászati előrejelzések</a>: </span> link</span>', 'tags': []
, 'anon': False, 'userhidden': False, 'commenthidden': False, 'text': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\n\n[[Kategória:Évezredek|08]]", 'contentmodel': 'wikitext'}

Looking into the code, we don't get too much information about how to print it more readably, but we notice that Revision is a subclass of Mapping, which is described here. So we can try items():

for item in rev.items():
    print(item)

('revid', 24120313)
('parentid', 15452110)
('minor', True)
('user', 'Misibacsi')
('userid', 110)
('timestamp', Timestamp(2021, 8, 10, 7, 51))
('size', 341)
('sha1', 'ca17cba3f173188954785bdbda1a9915fb384c82')
('roles', ['main'])
('slots', {'main': {'contentmodel': 'wikitext', 'contentformat': 'text/x-wiki', '*': "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\n\n[[Kategória:Évezredek|08]]"}})
('comment', '/* Csillagászati előrejelzések */ link')
('parsedcomment', '<span dir="auto"><span class="autocomment"><a href="/wiki/8._%C3%A9vezred#Csillagászati_előrejelzések" title="8. évezred">→\u200eCsillagászati előrejelzések</a>: </span> link</span>')
('tags', [])
('anon', False)
('userhidden', False)
('commenthidden', False)
('text', "{{évezredek}}\n\nA '''8. évezred''' ciklusa 7001. [[január 1.|január 1-jén]] kezdődik, és 8000. [[december 31.|december 31-éig]] tart.\n\n== Csillagászati előrejelzések ==\n\n* A [[90377 Sedna|Szedna]] [[kisbolygó]] keringése során pályájának legtávolabbi pontjába ([[aphélium]]) ér 7800 körül.\\n[[Kategória:Évezredek|08]]")
('contentmodel', 'wikitext')

While a revision may look like a dictionary on the screen, it is not a dictionary; a print(type(item)) in the above loop would show that all these tuple-like pairs are real tuples.

Non-existing pages raise NoPageError if you get their revisions.

parentid keeps the oldid of the previous edit and will be zero for a newly created page. For any revision rev:

>>> 'Modification' if rev.parentid else 'Creation'
'Creation'

Extract data from revision


We don't have to transform a Revision object into a dictionary to use it. The above experiment was just for an overview. Now we know what to search for, and we can directly get the required data. Better still, this structure is more comfortable to use than a common dictionary, because you have two ways to get a value:

print(rev['comment'])
print(rev.comment)

/* Csillagászati előrejelzések */ link
/* Csillagászati előrejelzések */ link

As you see, they are identical. But keep in mind that both solutions may cause problems if some parts of the revision were hidden by an admin. Let's see what happens upon hiding:

For each key, the list below gives the value when not hidden / the value when hidden (non-admin bot) / the value when hidden but read by an admin bot.

Page content
  • text: text of the given revision as str (may be '' if empty) / None / None
User
  • user: user name (str) (not a User object!) / '' (empty string) / same as if it wasn't hidden
  • userid: user id (int) (0 for anons) / AttributeError / same as if it wasn't hidden
  • anon: True or False / False / same as if it wasn't hidden
  • userhidden: False / True / True
Edit summary
  • comment: human-readable comment (str) / '' (empty string) / same as if it wasn't hidden
  • parsedcomment: comment suitable for page histories and recent changes (clickable if /* section */ present) (str) / AttributeError / same as if it wasn't hidden
  • commenthidden: False / True / True

You may say this is not quite consistent, but this is the experimental result. You have to handle hidden properties, but for general code you should know whether the bot runs as an admin. A possible solution:

admin = 'sysop' in pywikibot.User(site, site.user()).groups()

If you are not an admin but need admin rights for testing, you may get one on https://test.wikipedia.org.

For example

print(rev.commenthidden or rev.parsedcomment)

will never raise an AttributeError, but is not very useful in most cases. On the other hand,

if rev.text:
    # do something

will do something if the content is not hidden from you and not empty. A falsy value here means either an empty page or hidden content. If the difference matters to you,

if rev.text is not None:
    # do something

will tell the difference.

An example was found where an oversighter suppressed the content of the revision. text and sha1 were None for both the admin and the non-admin bot, and an additional suppressed key appeared with the value '' (empty string).

Is it a bot edit?


Have a look at this page history. It has a lot of bots, some of which are no longer registered or never were. Pywikibot has a site.isBot() method which takes a user name (not an object) and checks whether it has a bot flag. This won't detect all these bots. We may improve on it by also looking at the user name and the comment. This method is far from certain and may have false positives as well as false negatives, but – as shown in the third column – it gives a better result than site.isBot(), which is in the second column.

def maybebot(rev):
    if site.isBot(rev.user):
        return True
    user = rev.user.lower()
    comment = rev.comment.lower()
    return user.endswith('bot') or user.endswith('script') \
        or comment.startswith('bot:') or comment.startswith('robot:')

page = pywikibot.Page(site, 'Ordovicesek')
for rev in page.revisions():
    print(f'{rev.user:15}\t{site.isBot(rev.user)}\t{maybebot(rev)}\t{rev.comment}')

Addbot          False   True    Bot: 15 interwiki link migrálva a [[d:|Wikidata]] [[d:q768052]] adatába
Hkbot           True    True    Bottal végzett egyértelműsítés: Tacitus > [[Publius Cornelius Tacitus]]
ArthurBot       False   True    r2.6.3) (Bot: következő hozzáadása: [[hr:Ordovici]]
Luckas-bot      False   True    r2.7.1) (Bot: következő hozzáadása: [[eu:Ordoviko]]
Beroesz         False   False
Luckas-bot      False   True    Bot: következő hozzáadása: [[sh:Ordoviki]]
MondalorBot     False   True    Bot: következő hozzáadása: [[id:Ordovices]]
Xqbot           True    True    Bot: következő módosítása: [[cs:Ordovikové]]
ArthurBot       False   True    Bot: következő hozzáadása: [[cs:Ordovicové]]
RibotBOT        False   True    Bot: következő módosítása: [[no:Ordovikere]]
Xqbot           True    True    Bot:  következő hozzáadása: [[br:Ordovices]]; kozmetikai változtatások
Pasztillabot    True    True    Kategóriacsere: [[Kategória:Ókori népek]] -> [[Kategória:Ókori kelta népek]]
SamatBot        True    True    Robot:  következő hozzáadása: [[de:Ordovicer]], [[es:Ordovicos]]
Istvánka        False   False
Istvánka        False   False   +iwiki
Adapa           False   False
Data Destroyer  False   False   Új oldal, tartalma: '''Ordovicesek''', [[ókor]]i népcsoport. [[Britannia]] nyugati partján éltek, szemben [[Anglesea]] szigetével. [[Tacitus]] tesz említést róluk.  ==Források==  {{p...

hu:Kategória:Bottal létrehozott olasz vasútállomás cikkek contains articles created by bots. Here is a page generator that yields pages which were possibly never edited by any human:

cat = pywikibot.Category(site, 'Bottal létrehozott olasz vasútállomás cikkek')
def gen():
    for page in cat.articles():
        if all([maybebot(rev) for rev in page.revisions()]):
            yield page

Test with:

for page in gen():
    print(page)

Timestamp


Revision.timestamp is a pywikibot.time.Timestamp object which is well documented here. It is a subclass of datetime.datetime. Most importantly, MediaWiki always stores times in UTC, regardless of your time zone and daylight saving time.

The documentation suggests to use Site.server_time() for the current time; it is also a pywikibot.time.Timestamp in UTC.

Elapsed time since last edit:

>>> page = pywikibot.Page(site, 'Ordovicesek')
>>> print(site.server_time() - page.latest_revision.timestamp)
3647 days, 21:09:56

Pretty much, isn't it? :-) The result is a datetime.timedelta object.

In the shell timestamps are human-readable. But when you print them from a script, they get a machine-readable format. If you want to restore the easily readable format, use the repr() function:

>>> page = pywikibot.Page(site, 'Budapest')
>>> rev = page.latest_revision
>>> time = rev.timestamp
>>> time
Timestamp(2023, 2, 26, 9, 4, 14)
>>> print(time)
2023-02-26T09:04:14Z
>>> print(repr(time))
Timestamp(2023, 2, 26, 9, 4, 14)

For the above subtraction print() is nicer, because repr() gives days and seconds, without converting them to hours and minutes.

Useful methods


The methods discussed in this section belong to the BasePage class, with one exception, so they may be used for practically any page.

Page history in general

  • BasePage.getVersionHistoryTable() will create a wikitable from the page history. The order may be reversed and the number of rows may be limited. Useful e.g. when the page history becomes unavailable and you want to save it to a talk page.
  • BasePage.contributors() returns a small statistic: contributors with the number of their edits in the form of a dictionary, sorted by decreasing number. Hidden names appear as an empty string for both admin and non-admin bots.
  • BasePage.revisions() will iterate through the revisions of a page, beginning from the latest. As detailed above, this differs from the single-revision methods in that by default it does not retrieve the content of the revisions. To get a certain revision, turn the iterator into a list. Use (see the sketch after this list):
  • reverse=True to begin from the oldest version
  • content=True to retrieve the page contents
  • total=5 to limit the iteration to 5 entries
  • starttime= and endtime= with a pywikibot.Timestamp() to limit the iteration in time
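
A sketch showing these keyword arguments in use (any short page will do):

for rev in page.revisions(reverse=True, content=True, total=5):
    # With content=True the text of each revision is available as rev.text.
    print(rev.revid, rev.timestamp, len(rev.text))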

For example, to get a difflink between the first and the second version of a page without knowing its oldid (this works for every language version):

print(f'[[Special:Diff/{list(page.revisions(reverse=True))[1].revid}]]')

And this one is a working piece of code from hu:User:DhanakBot/szubcsonk.py. This bot administers substubs. Before we mark a substub for deletion, we wonder if it has been vandalized. Maybe it was a longer article, but someone truncated it, and a good-faith user marked it as a substub without checking the page history. So the bot places a warning if the page was 1000 bytes longer or twice as long at any point of its history as it is now.

    def shortenedMark(self, page):
        """Mark if it may have been vandalized."""
        versions = list(page.revisions())
        curLength = versions[0]['size']
        sizes = [r['size'] for r in versions]
        maxLength = max(sizes)
        if maxLength >= 2 * curLength or maxLength > curLength + 1000:
            return '[[File:Ambox warning orange.svg|16px|The article became shorter!]]'
        else:
            return ''
  • Site.loadrevisions() may also be interesting; this is the underlying method that is called by revisions(), but it has some extra features. You may specify the user whose contributions you want or don't want to have.

Last version of the page


The latest version got special attention from the developers and is very comfortable to use.

  • BasePage.latest_revision (property): returns the current revision for this page. It's a Revision object as detailed above.

For example, to get a difflink between the newest and the second newest version of a page without knowing its oldid (this works for every language version):

print(f'[[Special:Diff/{page.latest_revision.revid}]]')
But some properties are available directly (they are equivalent to reading values from latest_revision), for example:
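A sketch of one such shortcut (latest_revision_id is a real BasePage attribute):

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.latest_revision_id == page.latest_revision.revid
True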

Oldest version of a page

  • BasePage.oldest_revision (property) is very similar to BasePage.latest_revision, but returns the first version rather than last.

Determine how many times longer the current version is than the first version (beware of division by zero, which is unlikely but possible):

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.latest_revision.size / page.oldest_revision.size
115.17727272727272

>>> pywikibot.Page(site, 'Test').put('', 'Test')
Page [[Test]] saved
>>> page = pywikibot.Page(site, 'Test')
>>> page.latest_revision.size / page.oldest_revision.size
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero

Determine how many articles each user has created in a given category, not including its subcategories:

>>> import collections
>>> cat = pywikibot.Category(site, 'Budapest parkjai')  # Parks of Budapest
>>> collections.Counter(page.oldest_revision.user for page in cat.articles())

Counter({'OsvátA': 4, 'Antissimo': 3, 'Perfectmiss': 2, 'Szilas': 1, 'Pásztörperc': 1, 'Fgg': 1, 'Solymári': 1, 'Zaza~huwiki': 1, 'Timur lenk': 1, 'Pasztilla (régi)': 1, 'Millisits': 1, 'László Varga': 1, 'Czimmy': 1, 'Barazoli40x40': 1})

(Use cat.articles(recurse=True) if you are interested in subcategories, too, but that will be slightly slower.)

Knowing that Main Page is a valid alias for the main page in every language and family, sort several Wikipedias by creation date:

>>> from pprint import pprint
>>> langs = ('hu', 'en', 'de', 'bg', 'fi', 'fr', 'el')
>>> pprint(sorted([(lang, pywikibot.Page(pywikibot.Site(lang, 'wikipedia'), 'Main Page').oldest_revision.timestamp) for lang in langs], key=lambda tup: tup[1]))
[('en', Timestamp(2002, 1, 26, 15, 28, 12)),
 ('fr', Timestamp(2002, 10, 31, 10, 33, 35)), 
 ('hu', Timestamp(2003, 7, 8, 10, 26, 5)),
 ('fi', Timestamp(2004, 3, 22, 16, 12, 24)),
 ('bg', Timestamp(2004, 10, 25, 18, 49, 23)),
 ('el', Timestamp(2005, 9, 7, 13, 55, 9)),
 ('de', Timestamp(2017, 1, 27, 20, 11, 5))]

For some reason it gives a false date for dewiki where Main Page is redirected to Wikipedia: namespace, but looks nice anyway. :-)

Suppose you want to know when the original creator of a page last edited your wiki; in some cases it is a question whether it's worth contacting them. The result is a Timestamp as described above, so you can subtract it from the current date to get the elapsed time. See also the Working with users section.

>>> page = pywikibot.Page(site, 'Budapest')
>>> user = pywikibot.User(site,  page.oldest_revision.user)
>>> user.last_edit[2]
Timestamp(2006, 5, 21, 11, 11, 22)

Other

edit
  • BasePage.permalink() will return a permalink to the given version. The oldid argument may be obtained as the revid value of that revision. If you omit it, the latest id (latest_revision_id) will automatically be used. To get permalinks for all versions of the page:
for rev in page.revisions():
    print(f'{repr(rev.timestamp)}\t{rev.revid}\t{page.permalink(rev.revid, with_protocol=True)}')

Timestamp(2013, 3, 9, 2, 3, 14)                13179698 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=13179698
Timestamp(2011, 7, 10, 5, 50, 56)               9997266 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9997266
Timestamp(2011, 3, 13, 17, 41, 19)              9384635 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9384635
Timestamp(2011, 1, 15, 23, 56, 3)               9112326 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=9112326
Timestamp(2010, 11, 18, 15, 25, 44)             8816647 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=8816647
Timestamp(2010, 9, 27, 13, 16, 24)              8539294 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=8539294
Timestamp(2010, 3, 28, 10, 59, 28)              7422239 https://hu.wikipedia.org/w/index.php?title=Ordovicesek&oldid=7422239
etc.

However, besides this URL format, over the years MediaWiki has introduced a nicer form for internal use. You may use it in any language:

print(f'[[Special:Permalink/{rev.revid}]]')

This will result in such permalinks that you can use on your wiki: [[Special:Permalink/6833510]].

  • BasePage.getOldVersion() takes an oldid and returns the text of that version (not a Revision object!). It may be useful if you know the version id from somewhere.
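
A minimal sketch, reusing one of the revision ids from the listing above:

old_text = page.getOldVersion(13179698)  # text of the page as of that revision
print(old_text[:200])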

Deleted revisions

edit

When a page is deleted and recreated, it gets a new id. Thus the only way of mining the deleted revisions is to identify the page by its title. On the other hand, when a page is moved (renamed), it takes the old id to the new title, and a new redirect page is created with a new id and the old title. Taking everything into account, the investigation may be complicated: deleted versions may sit under the old title, and deleted versions under the same title may belong to another page. If only one page is concerned, it may be easier to do this without a bot. Now we take a simple case where the page was never renamed.

  • BasePage.has_deleted_revisions() does not need admin rights, and simply answers yes or no to the question whether the page has any deleted revisions. Don't ask me for a use case.
>>> page = pywikibot.Page(site, '2023')
>>> page.has_deleted_revisions()
True

The following methods need admin rights, otherwise they will raise pywikibot.exceptions.UserRightsError.

  • BasePage.loadDeletedRevisions() iterates through the timestamps of deleted revisions and yields them. Meanwhile it caches other data in a private variable for later use. Iterators may be processed with a for loop or transformed into lists. For example to see the number of deleted revisions:
print(len(list(page.loadDeletedRevisions())))

The main use case is to get timestamps for getDeletedRevision().

  • BasePage.getDeletedRevision() takes a timestamp, which is most easily obtained from the above loadDeletedRevisions(), and returns a dictionary. Don't be misled by the name; this is not a Revision object. Its keys are:

dict_keys(['revid', 'user', 'timestamp', 'slots', 'comment'])

Theoretically a content=True argument should return the text of the revision (otherwise the text is returned only if it had previously been retrieved). Currently the documentation does not exactly describe what happens, see phab:T331422. Instead, the revision text may be obtained (with an example timestamp) as

text = page.getDeletedRevision('2023-01-30T19:11:36Z', content=True)['slots']['main']['*']

The underlying method for both of the above is Site.deletedrevs(), which makes it possible to get the deleted revisions of several pages together, and to restrict the result to, or exclude, the revisions by a given user.

File pages

edit

FilePage is a subclass of Page, so you can use all the above methods, but it has some special methods of its own. Keep in mind that a FilePage represents a file description page in the File: namespace. Files themselves (images, sounds) are in the Media: pseudo namespace.
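
A minimal sketch of what a FilePage offers beyond a regular Page (the file name is just a placeholder):

file_page = pywikibot.FilePage(site, 'File:Example.jpg')  # placeholder title
print(file_page.get_file_url())    # direct URL of the file itself
print(file_page.file_is_shared())  # True if the file is stored on a shared repository (Commons)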

 

Working with namespaces

edit
Published

Walking the namespaces

edit

Does your wiki have an article about Wikidata? A Category:Wikidata? A template named Wikidata or a Lua module? This piece of code answers the question:

for ns in site.namespaces:
    page = pywikibot.Page(site, 'Wikidata', ns)
    print(page.title(), page.exists())

In the Page.text section we got to know the properties that work without parentheses; site.namespaces is of the same kind. Creating a page object with the namespace number is familiar from the Creating a Page object from title section. This loop goes over the namespaces available in your wiki, beginning from the negative indices marking pseudo namespaces such as Media and Special.

Our documentation says that site.namespaces is a kind of dictionary, but this is hard to discover. The above loop went over the numerical indices of the namespaces. We may use them to discover the values by means of a little trick. Only the Python prompt shows that namespaces have a different string representation when used with print() and without it – print() may not be omitted in a script. This loop:

for ns in site.namespaces:
    print(ns, site.namespaces[ns])

will write the namespace indices and the canonical names. The latter are usually English names, but for example the #100 namespace in the Hungarian Wikipedia has a Hungarian canonical name because the English Wikipedia no longer has a counterpart. Namespaces may vary from wiki to wiki; for Wikipedia the WMF sysadmins set them in the config files, but for your private wiki you may set them yourself following the documentation. If you run the above loop, you may notice that File and Category appear as :File: and :Category:, respectively, so this code gives you a name that is ready to link, and will appear as a normal link rather than displaying an image or inserting the page into a category.

Now we have to dig into the code to see that namespace objects have entirely different __str__() and __repr__() methods. While print() writes __str__(), the object name at the prompt without print() uses the __repr__() method. We have to call it explicitly:

for ns in site.namespaces:
    print(repr(site.namespaces[ns]))

The first few lines of the result from Hungarian Wikipedia are:

Namespace(id=-2, custom_name='Média', canonical_name='Media', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
Namespace(id=-1, custom_name='Speciális', canonical_name='Special', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=False)
Namespace(id=0, custom_name='', canonical_name='', aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False)
Namespace(id=1, custom_name='Vita', canonical_name='Talk', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=2, custom_name='Szerkesztő', canonical_name='User', aliases=[], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True)
Namespace(id=4, custom_name='Wikipédia', canonical_name='Project', aliases=['WP', 'Wikipedia'], case='first-letter', content=False, nonincludable=False, subpages=True)

So custom names mean the localized names in your language, while aliases are usually abbreviations such as WP for Wikipedia, or old names kept for backward compatibility. Now we know what we are looking for. But how do we get it properly? At top level the documentation suggests using ns_normalize() to get the local names. But if we try to give the canonical name to the function, it raises an error for the main namespace. After some experimenting the following form works:

for ns in site.namespaces:
    print(ns, 
          site.namespaces[ns], 
          site.ns_normalize(str(site.namespaces[ns])) if ns else '')

-2 Media: Média
-1 Special: Speciális
0 :
1 Talk: Vita
2 User: Szerkesztő
3 User talk: Szerkesztővita
4 Project: Wikipédia
5 Project talk: Wikipédia-vita
6 :File: Fájl
etc.

This will write the indices, the canonical (English) names and the localized names side by side. There is another way that gives a nicer result, but we have to guess it from the code of the namespace objects. This one keeps the colons:

for ns in site.namespaces:
    print(ns,
          site.namespaces[ns],
          site.namespaces[ns].custom_prefix())

-2 Media: Média:
-1 Special: Speciális:
0 : :
1 Talk: Vita:
2 User: Szerkesztő:
3 User talk: Szerkesztővita:
4 Project: Wikipédia:
5 Project talk: Wikipédia-vita:
6 :File: :Fájl:
etc.

Determine the namespace of a page

edit

Building on the above results we can determine the namespace of a page in any form. We investigate an article and a user talk page. Although the documentation says the namespace object we get is a kind of dictionary, it is quite unique, and its behaviour and even its apparent type depend on what we ask. It can be equal to an integer and to several strings at the same time. The reason for this strange personality is that the default methods are overridden. If you want to deeply understand what happens here, open pywikibot/site/_namespace.py (the underscore marks that it is not intended for public use) and look for the __str__(), __repr__() and __eq__() methods.

>>> page = pywikibot.Page(site, 'Budapest')
>>> page.namespace()
Namespace(id=0, custom_name='', canonical_name='', aliases=[], case='first-letter', content=True, nonincludable=False, subpages=False)
>>> print(page.namespace())
:
>>> page.namespace() == 0
True
>>> page.namespace().id
0

>>> page = pywikibot.Page(site, 'user talk:BinBot')
>>> page.namespace()
Namespace(id=3, custom_name='Szerkesztővita', canonical_name='User talk', aliases=['User vita'], case='first-letter', content=False, nonincludable=False, subpages=True)
>>> print(page.namespace())
User talk:
>>> page.namespace() == 0
False
>>> page.namespace() == 3
True
>>> page.namespace().custom_prefix()
'Szerkesztővita:'
>>> page.namespace() == 'User talk:'
True
>>> page.namespace() == 'Szerkesztővita:'  # *
True
>>> page.namespace().id
3
>>> page.namespace().custom_name
'Szerkesztővita'
>>> page.namespace().aliases
['User vita']

The starred command will give False on any non-Hungarian site, but the value will be True again if you write user talk in your own language.

Any of the values may be got with the dotted syntax as shown in the last three lines.

It is common to get unknown pages from a page generator or a similar iterator, and it may be important to know what kind of page we got. In this example we walk through all the pages that link to an article. page.namespace() will show the canonical (mostly English) names, page.namespace().custom_prefix() the local names and page.namespace().id (without parentheses!) the numerical index of the namespace. To improve the example, we decide that main, file, template, category and portal are in the direct scope of readers, while the others are only important for Wikipedians, and we mark this difference.

def for_readers(ns: int) -> bool:
    return ns in (0, 6, 10, 14, 100)
    # (main, file, template, category and portal)

basepage = pywikibot.Page(site, 'Budapest')
for page in basepage.getReferences(total=110):
    print(page.title(),
          page.namespace(),
          page.namespace().custom_prefix(),
          page.namespace().id,
          for_readers(page.namespace().id)
         )

Content pages and talk pages

edit

Let's have a rest with a much easier exercise! Another frequent task is to switch from a content page to its talk page or vice versa. We have a method to toggle and another to decide if it is in a talk namespace:

>>> page = pywikibot.Page(site, 'Budapest')
>>> page
Page('Budapest')
>>> page.isTalkPage()
False
>>> talk = page.toggleTalkPage()
>>> talk
Page('Vita:Budapest')
>>> talk.isTalkPage()
True
>>> talk.toggleTalkPage()
Page('Budapest')

Note that pseudo namespaces (such as Special and Media, with negative indices) cannot be toggled. toggleTalkPage() will always return a Page object, except when the original page is in one of these namespaces; in that case it returns None. So if there is any chance your page may be in a pseudo namespace, be prepared to handle this.
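
A defensive sketch for that case:

talk = page.toggleTalkPage()
if talk is not None:
    print(talk.title())
else:
    print(f'{page.title()} cannot be toggled (pseudo namespace).')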

The next example shows how to work with content and talk pages together. Many wikis place a template on the talk pages of living persons' biographies. This template collects the talk pages into a category. We wonder whether there are talk pages whose content page lacks a "Category:Living persons". These pages need attention from users. The first experience is that listing blue pages (articles), green pages[1] (redirects) and red (missing) pages separately is useful, as they need different approaches.

We walk the category (see #Working with categories), get the articles and search them for the living persons' categories by means of a regex (it is specific for Hungarian Wikipedia, not important here). As the purpose is to separate pages by colour, we decide to use the old approach of getting the content (see #Page.get()).

import re
import pywikibot

site = pywikibot.Site()
cat = pywikibot.Category(site, 'Kategória:Élő személyek életrajzai')
regex = re.compile(
    r'(?i)\[\[kategória:(feltehetően )?élő személyek(\||\]\])')
blues = []
greens = []
reds = []

for talk in cat.members():
    page = talk.toggleTalkPage()
    try:
        if not regex.search(page.get()):
            blues.append(page)
    except pywikibot.exceptions.NoPageError:
        reds.append(page)
    except pywikibot.exceptions.IsRedirectPageError:
        greens.append(page)

Note that running into errors on purpose and preferring exception handling to ifs ("easier to ask forgiveness than permission") is part of the philosophy of Python.

Notes

edit
  1. Appropriate if there is a possibility in your wiki to mark redirects with green.

Working with users

edit

See https://doc.wikimedia.org/pywikibot/master/api_ref/pywikibot.page.html#page.User

User is a subclass of Page. Therefore user.exists() means that the user page exists. To determine whether the user exists, use user.isRegistered(). These are independent; either may be true without the other.
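
A quick sketch (the username is just a placeholder):

user = pywikibot.User(site, 'SomeUserName')  # placeholder name
print(user.isRegistered())  # does the account exist?
print(user.exists())        # does the user page exist?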


Last edit of an anon – see also the revisions section.

 

Working with categories

edit

 


Task: hu:Kategória:A Naprendszer kisbolygói (minor planets of the Solar System) has several subcategories (one level is enough) with redundantly categorized articles. Let's remove the parent category from the articles in the subcategories, EXCEPT stubs! Chances are it could be solved with category.py after reading the documentation carefully, but this time it was faster for me to hack:

summary = 'Redundáns kategória ki, ld. [[Wikipédia-vita:Csillagászati műhely#Redundáns kategorizálás]]'
cat = pywikibot.Category(site, 'Kategória:A Naprendszer kisbolygói')
for subcat in cat.subcategories():
    if subcat.title(with_ns=False) == 'Csonkok (kisbolygó)':  # Stubs category
        continue
    for page in subcat.articles():
        page.change_category(cat, None, summary=summary)

Creating and reading lists

edit
Published

Creating a list of pages is a frequent task. For example

  1. You collect titles to work on because collecting is slow and can be done while you are sleeping.
  2. You want to review the list and make further discussions before you begin the task with your bot.
  3. You want to know the extent of a problem before you begin to write a bot for it.
  4. Listing is the purpose itself. It may be a maintenance list that requires attention from human users. It may be a community task list etc.
  5. Someone asked you to create a list on which he or she wants to work.

A list may be saved to a file or to a wiki page. listpages.py does something like this, but its input is restricted to built-in page generators and its output has a lot of options. If you write your own script, you may want a simple solution in place. Suppose that you have an iterable (list, tuple or generator) called pages that contains your collection.

Something like this:

'\n'.join(['* ' + page.title(as_link=True) for page in pages])

will give an appropriate list that is suitable both for wikipage and file. It looks like this:

* [[Article1]]
* [[Article2]]
* [[Article3]]
* [[Article4]]

On Windows you sometimes get a UnicodeEncodeError when you try to save page names containing non-ASCII characters. In this case codecs will help:

import codecs
with codecs.open('myfile.txt', 'w', 'utf-8') as file:
    file.write(text)
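
Alternatively, on Python 3 the built-in open() takes an encoding argument, which avoids the extra import:

with open('myfile.txt', 'w', encoding='utf-8') as file:
    file.write(text)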

Of course, imports should be at the top of your script; this is just a sample. While a file does not require the linked form, it is useful to keep both in the same form, so that a list can be copied from a file to a wiki page at any time.

To retrieve your list from page [[Special:MyPage/Mylist]] use:

from pywikibot.pagegenerators import LinkedPageGenerator
listpage = pywikibot.Page(site, 'Special:MyPage/Mylist')
pages = list(LinkedPageGenerator(listpage))

If you want to read the pages from the file to a list, do:

# Use with codecs.open('myfile.txt', 'r', 'utf-8') as file: if you saved with codecs.
with open('myfile.txt') as file:
    text = file.read()
import re
pattern = re.compile(r'\[\[(.*?)\]\]')
pages = [pywikibot.Page(site, m) for m in pattern.findall(text)]

If you are not familiar with regular expressions, just copy, it will work. :-)

Where to save the files?

edit

While introducing the userscripts directory is a great idea for separating your own scripts, when you use pwb.py your prompt is in the Pywikibot root directory. Once this structure is created so nicely, you may not want to mix your files into the Pywikibot system files. Saving them to userscripts requires giving the path every time, and is an unwanted mix again, because that directory holds scripts rather than data.

A possible solution is to create a directory directly under the Pywikibot root, such as t, which is short for "texts", is one letter long and is very unlikely to ever clash with a Pywikibot system directory:

2023.03.01.  20:35    <DIR>          .
2023.03.01.  20:35    <DIR>          ..
2022.11.21.  01:09    <DIR>          .git
2022.10.12.  07:16    <DIR>          .github
2023.01.30.  13:38    <DIR>          .svn
2023.03.04.  20:07    <DIR>          __pycache__
2022.10.12.  07:35    <DIR>          _distutils_hack
2023.03.03.  14:32    <DIR>          apicache-py3
2023.01.30.  14:53    <DIR>          docs
2023.02.19.  21:28    <DIR>          logs
2022.10.12.  07:35    <DIR>          mwparserfromhell
2022.10.12.  07:35    <DIR>          pkg_resources
2023.01.30.  13:35    <DIR>          pywikibot
2023.01.30.  22:44    <DIR>          scripts
2022.10.12.  07:35    <DIR>          setuptools
2023.02.24.  11:08    <DIR>          t
2023.02.14.  12:15    <DIR>          tests

Now instead of 'myfile.txt' you may use 't/myfile.txt' (use / both on Linux and on Windows!) when you save and open files. This is not a big pain, and your saved data will be in a separate directory.

Working with your watchlist

edit
Published

We have a watchlist.py among the scripts, which deals with the watchlist of the bot. This does not sound too exciting. But if you have several thousand pages on your watchlist, handling it by bot may sound appealing. To do this you have to run the bot with your own username rather than that of the bot. Either overwrite it in user-config.py, or save the commands in a script and run

python pwb.py -user:MyUserAccount myscript

Loading your watchlist

edit

The first tool we need is site.watched_pages(). This is a page generator, so you may process the pages with a for loop or turn it into a list. When you edit your watchlist on Wikipedia, talk pages do not appear. This is not the case for site.watched_pages()! It will double the number by listing talk pages as well as content pages. You may want to filter them out.

Print the number of watched pages:

print(len(list(site.watched_pages())))

This may take a while as it goes over all the pages. For me it is 18235. Hmm, that's odd, in both senses. :-) How can a doubled number be odd? This reveals a technical error: decades ago I watched [[WP:AZ]], which was a redirect to a project page but technically a page in the article namespace back then, having the talk page [[Vita:WP:AZ]]. Meanwhile WP: was turned into an alias for the Wikipedia namespace, getting a new talk page, and the old one remained there, stuck and abandoned, causing this oddity.

If you don't have such a problem, then

watchlist = [page for page in site.watched_pages() if not page.isTalkPage()]
print(len(watchlist))

will do both tasks: turn the generator into a real list and throw away talk pages, as dealing with them separately is pointless. Note that site.watched_pages() takes a total=n argument if you want only the first n, but we want to process the whole watchlist now. You should get half of the previous number if everything goes well. For me it raises an error because of the above stuck talk page (see phab:T331005). I show it because it is rare and interesting:

pywikibot.exceptions.InvalidTitleError: The (non-)talk page of 'Vita:WP:AZ' is a valid title in another namespace.

So I have to write a loop for the same purpose:

watchlist = []
for page in site.watched_pages():
    try:
        if not page.isTalkPage():
            watchlist.append(page)
    except pywikibot.exceptions.InvalidTitleError:
        pass
print(len(watchlist))

Well, the previous one looked nicer. For the number I get 9109, which is not exactly (n-1)/2 of the previous one, but I won't bother with it for now.

Basic statistics

edit

One way or another, we finally have a list with our watched pages. The first task is to create some statistics: I wonder how many pages are on my list by namespace. I wish I had the data in SQLite, but I don't. So a possible solution:

# Create a sorted list of unique namespace numbers in your watchlist
ns_numbers = sorted({page.namespace().id for page in watchlist})
# Create a dictionary of them with a default 0 value
stat = dict.fromkeys(ns_numbers, 0)
for page in watchlist:
    stat[page.namespace().id] += 1
print(stat)

{0: 1871, 2: 4298, 4: 1803, 6: 98, 8: 96, 10: 391, 12: 3, 14: 519, 90: 15, 100: 10, 828: 5}

There is another way if we steal the technique from BasePage.contributors() (discussed in the Useful methods section). We just generate the namespace numbers and create a collections.Counter() object from them:

from collections import Counter
stat = Counter(p.namespace().id for p in watchlist)
print(stat)

Counter({2: 4298, 0: 1871, 4: 1803, 14: 519, 10: 391, 6: 98, 8: 96, 90: 15, 100: 10, 828: 5, 12: 3})

This is a subclass of dict, so it may be used as a dictionary. The difference compared to the previous solution is that a Counter displays its items in decreasing order of count automatically.
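
If you only need the most frequent namespaces, Counter also offers most_common():

print(stat.most_common(3))  # the three namespaces with the most watched pages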

Selecting anon pages and unwatching according to a pattern

edit

The above statistics show that almost half of my watchlist consists of user pages, because I patrol recent changes, welcome people and warn them if necessary. And it is necessary often. Now I focus on anons:

from pywikibot.tools import is_ip_address
anons = [page for page in watchlist
         if page.namespace().id == 2
         and is_ip_address(page.title(with_ns=False))]

I could use the User instance's own method to determine whether they are anons, without importing anything, but for that I would have to convert the pages to Users:

anons = [page for page in watchlist
         if page.namespace().id == 2
         and pywikibot.User(page).isAnonymous()]

Anyway, len(anons) shows that they are over 2000.

IPv4 addresses starting with 195.199 belong to schools in Hungary. Earlier most of them were static, but nowadays they are dynamic and may belong to a different school each time, so there is no point in keeping them. For unwatching I will use Page.watch():

for page in anons:
    if page.title(with_ns=False).startswith('195.199'):
        print(page.watch(unwatch=True))

With print() in the last line I will also see a True for each successful unwatching; without it, it only unwatches. This loop will be slower than the previous one. On a repeated run it will print these Trues again, because the watchlist is cached. To avoid this and refresh the watchlist, use site.watched_pages(force=True), which will always reload it.
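
For example, to rebuild the filtered list from a fresh watchlist after unwatching:

watchlist = [page for page in site.watched_pages(force=True)
             if not page.isTalkPage()]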

Watching and unwatching a list of pages

edit

So far we have dealt with pages one by one, using page.watch(), which is a method of the Page object. But if we look into the code, we may discover that this method uses a method of APISite: site.watch().

Even more exciting, this method can handle complete lists at once, and even better, the list items may be strings – this means you don't have to create Page objects from them, just provide the titles. Furthermore it supports other sequence types and generator functions, so page generators may be used directly.

To watch a lot of pages if you have the titles, just do this:

titles = []  # Write titles here or create a list by any method
site.watch(titles)

To unwatch a lot of pages if you already have Page objects:

pages = []  # Somehow create a list of pages
site.watch(pages, unwatch=True)

For use of page generators see the second example under #Red pages.

Further ideas for existing pages

edit

With Pywikibot you may watch or unwatch any quantity of pages easily if you can create a list or generator for them. Let your brain storm! Some patterns:

  • Pages in a category
  • Subpages of a page
  • Pages following a title pattern
  • Pages got from logs
  • Pages created by a user
  • Pages from a list page
  • Pages often visited by a returning vandal whose known socks are in a category
  • Pages based on Wikidata queries

API:Watch shows that MediaWiki API may have further parameters such as expiry and builtin page generators. At the time of writing this article Pywikibot does not support them yet. Please hold on.

Red pages

edit

Non-existing pages differ from existing ones in that we have to know their exact titles in advance to watch them.

Watch the yearly death articles in the English Wikipedia for the next decade, so that you see when they are created:

for i in range(2023, 2033):
    pywikibot.Page(site, f'Deaths in {i}').watch()

hu:Wikipédia:Érdekességek has "Did you know?" subpages in groups of two hundred. It has other subpages as well, and you want to watch all these tables up to 10000, half of which are blue and half red. So follow the name pattern:

prefix = 'Wikipédia:Érdekességek/'
for i in range(1, 10000, 200):
    # Subpage titles are assumed to look like 1–200, 201–400, ... (en dash between the bounds).
    pywikibot.Page(site, f'{prefix}{i}–{i + 199}').watch()

While the English Wikipedia tends to list only existing articles, in other Wikipedias list articles are meant to show all the relevant titles, either blue or red. So the example is from the Hungarian Wikipedia. Let's suppose you are interested in the history of the Umayyad rulers. hu:Omajjád uralkodók listája lists them, but the articles of the Córdoba branch are not yet written. You want to watch all of them and know when a new article is created. You notice that portals are also linked from the page, but you want to watch only the articles, so you use a wrapper generator to filter the links.

from pywikibot.pagegenerators import \
    LinkedPageGenerator, NamespaceFilterPageGenerator
basepage = pywikibot.Page(site, 'Omajjád uralkodók listája')
site.watch(NamespaceFilterPageGenerator(LinkedPageGenerator(basepage), 0))

The list of ancient Greek rulers differs from the previous one: many year numbers are linked which should not be watched. You exclude them by a title pattern.

basepage = pywikibot.Page(site, 'Ókori görög uralkodók listája')
pages = [page for page in 
         NamespaceFilterPageGenerator(LinkedPageGenerator(basepage), 0)
         if not page.title().startswith('Kr. e')]
site.watch(pages)

Or just to watch the red pages in the list:

basepage = pywikibot.Page(site, 'Ókori görög uralkodók listája')
pages = [page for page in LinkedPageGenerator(basepage) if not page.exists()]
site.watch(pages)

In the first two examples we used standalone pages in a loop, then a page generator, then lists. They all work.

Summary

edit
  • For walking your watchlist use the site.watched_pages() generator function. Don't forget to use the -user global parameter if user-config.py contains your bot's name.
  • For watching and unwatching a single page use page.watch() and page.watch(unwatch=True).
  • For watching and unwatching several pages at once, giving them as a list of titles, a list of Page objects or a page generator, use site.watch(<iterable>).

Working with dumps

edit

 

Working with logs

edit

Working with Wikidata

edit

 

Using predefined bots as parent classes

edit

See https://doc.wikimedia.org/pywikibot/master/library_usage.html.

 

Working with textlib

edit

Example: https://www.mediawiki.org/wiki/Manual:Pywikibot/Cookbook/Creating_pages_based_on_a_pattern

Creating pages based on a pattern

edit
Published

Pywikibot is your friend when you want to create a lot of pages that follow some pattern. In the first task we create more than 250 pages in a loop. Then we go on to categories: we prepare a lot of them, but in one run we create only as many as we want to fill with articles, in order to avoid a lot of empty categories.

Rules of orthography

edit

The rules of Hungarian orthography have 300 points, several of which have a lot of subpoints marked with letters. There is no letter a without b, and the last letter is l. We have templates pointing to these in an external source. Templates cannot be used in an edit summary, but internal links can, so we create a lot of pages with short internal links that hold these templates. Of course, the bigger part is bot work, but first we have to list the letters. Each letter from b to l gets a list with the numbers of the points for which this is the last letter (lines 5–12). For example, 120 is in the list of e, so we create pages for the 120, 120 a) ... 120 e) points. The idea is to build a title generator (from line 14). (It could also be a page generator, but titles were more comfortable.)

The result is at hu:Wikipédia:AKH. page marks the actual subpage that gets the text, while mainpage with maintext is the main page. As we get the titles from the iterator (line 41–), we create the text with the appropriate template and a standard part, we create the page, and we add its link to the text of the main page. At the end we save the main page (line 63–).

import pywikibot as p
site = p.Site()
mainpage = p.Page(site, 'WP:AKH')

b = [4, 7, 25, 27, 103, 104, 108, 176, 177, 200, 216, 230, 232, 261, 277, 285, 286, 288, 289, 291]
c = [88, 101, 102, 141, 152, 160, 174, 175, 189, 202, 250, 257, 264, 267, 279, 297]
d = [2, 155, 188, 217, 241, 244, 259, 265,]
e = [14, 120, 249,]
f = [82, 195, 248,]
g = [226,]
i = [263,]
l = [240,]

def gen():
    for j in range(1, 301):
        yield str(j)
        if j in b + c + d + e + f + g + i + l:
            yield str(j) + ' a'
            yield str(j) + ' b'
        if j in c + d + e + f + g + i + l:
            yield str(j) + ' c'
        if j in d + e + f + g + i + l:
            yield str(j) + ' d'
        if j in e + f + g + i + l:
            yield str(j) + ' e'
        if j in f + g + i + l:
            yield str(j) + ' f'
        if j in g + i + l:
            yield str(j) + ' g'
        if j in i + l:
            yield str(j) + ' h'
            yield str(j) + ' i'
        if j in l:
            yield str(j) + ' j'
            yield str(j) + ' k'
            yield str(j) + ' l'

maintext = ''
summary = 'A szerkesztési összefoglalókban használható hivatkozások létrehozása a helyesírási szabályzat pontjaira'

for s in gen():
    print(s)
    title = 'WP:AKH' + s.replace(' ', '')
    li = s.split(' ')
    try:
        s1 = li[0] + '|' + li[1]
        s2 = li[0] + '. ' + li[1] + ')'
    except IndexError:
        s1 = li[0]
        s2 = li[0] + '.'
    templ = '{{akh|' + s1 + '}}\n\n'
    print(title, s1, s2, templ)
    maintext += f'[[{title}]] '
    page = p.Page(site, title)
    print(page)
    text = templ
    text += f'Ez az oldal hivatkozást tartalmaz [[A magyar helyesírás szabályai]] 12. kiadásának {s2} pontjára. A szerkesztési összefoglalókban '
    text += f'<nowiki>[[{title}]]</nowiki> címmel hivatkozhatsz rá, így a laptörténetekből is el lehet jutni a szabályponthoz.\n\n'
    text += 'Az összes hivatkozás listája a [[WP:AKH]] lapon látható.\n\n[[Kategória:Hivatkozások a helyesírási szabályzat pontjaira]]\n'
    print(text)
    page.put(text, summary)

maintext += '\n\n[[Kategória:Hivatkozások a helyesírási szabályzat pontjaira| ]]'
print(maintext)
mainpage.put(maintext, summary)

Categories of notable pupils and teachers

edit

We want to create categories for famous pupils and teachers of Budapest schools, based on a pattern. Of course, this is not relevant for every school; first we want to see which articles have a "famous pupils" or "famous teachers" section, which may occur in several forms, so the best thing is to review it by eye. We also check whether the section contains enough notable people to justify a category.

In this task we don't bother creating Wikidata items; these categories are huwiki-specific, and creating items in Wikidata by bot needs an approval.

Step 1 – list the section titles of the schools onto a personal sandbox page
We use the extract_sections() function from textlib.py to get the titles. This returns a NamedTuple in which .sections holds the sections as (title, content) tuples, from which element [0] is the desired title with its = signs.
Note that extract_sections() is not a method of a class, just a function, thus it is not aware of the site and must get it explicitly.
The result is here.
>>> import pywikibot
>>> from pywikibot.textlib import extract_sections
>>> site = pywikibot.Site()
>>> cat = pywikibot.Category(site, 'Budapest középiskolái')
>>> text = ''
>>> for page in cat.articles():
...   text += '\n;' + page.title(as_link=True) + '\n'
...   sections = [sec[0] for sec in extract_sections(page.text, site).sections]
...   for sect in sections:
...     text += ':' + sect.replace('=', '').strip() + '\n'
...
>>> pywikibot.Page(site, 'user:BinBot/try').put(text, 'Listing schools of Budapest')
Step 2 – manual work
We go through the schools, remove the unwanted ones and the subtitles, and mark the title with :pt if we want to create categories both for pupils and teachers, and with :p if only for pupils. There could also be a :t, but there isn't. This is the result.
We don't want to create a few dozen empty categories at once because the community may not like it. Rather, we mark the schools we want to work on soon with the beginning of the desired category name and the sortkey, as shown here, and the bot will create the categories if the name is present and the category does not exist yet.
If you don't like the syntax used here, never mind, it's up to you. This is just an example; you can create and parse any syntax, with any delimiters.
Step 3 – creating the categories
We read the patterns from the page with a regex, parse them, and create the name and content of the category page (including sortkey within the parent category).
The script creates a common category, then one for the pupils, and then another for the teachers only if necessary.
Next time we can add names for the schools we want to work on that day; the existing categories will not be changed or recreated.
import re
import pywikibot

site = pywikibot.Site()
base = pywikibot.Page(site, 'user:BinBot/try')
regex = re.compile(r';\[\[(.*?)\]\]:(pt?):(.*?):(.*?)\n')
main = '[[Kategória:Budapesti iskolák tanárai és diákjai iskola szerint|{k}]]\n'
comment = 'Budapesti iskolák diákjainak, tanárainak kategóriái'
cattext = \
 'Ez a kategória a{prefix} [[{school}]] és jogelődjeinek {member} tartalmazza.'

for m in regex.findall(base.text):
    cat = pywikibot.Category(site, m[2] + ' tanárai és diákjai')
    if not cat.exists():
        cat.put(main.format(k=m[3]), comment, minor=False, botflag=False)
    prefix = 'z' if m[0][0] in 'EÓÚ' else ''  # Some Hungarian grammar stuff
    # Pupils
    catp = pywikibot.Category(site, m[2] + ' diákjai')
    if not catp.exists():
        text = cattext.format(prefix=prefix, school=m[0], member='diákjait')
        text += f'\n[[{cat.title()}|D]]\n'
        catp.put(text, comment, minor=False, botflag=False)
    if 't' not in m[1]:
        continue
    # Teachers
    catt = pywikibot.Category(site, m[2] + ' tanárai')
    if not catt.exists():
        text = cattext.format(prefix=prefix, school=m[0], member='tanárait')
        text += f'\n[[{cat.title()}|T]]\n'
        catt.put(text, comment, minor=False, botflag=False)

Ideas

edit