User:QEDK/GSoC 2020/Making regex in Python "just" work

Quite honestly, my summer working with Wikimedia has taught me more about regex than all the time I had spent using it before, which was mostly when someone hollered at me to write something. Even then, it involved going to regexr and actually verifying that my regex was doing what I thought it was doing. Badly written regular expressions will probably be a bottleneck (or downfall, who knows!) in a lot of textual applications. Despite my general viewpoint of “can we not use regex!”, it’s actually more widely used and more helpful than it gets credit for (enough that I actually put effort into learning it). Every now and then, I still mix up my ?s and +s, causing some amount of snafu and me messaging “why is {1,} working here but not ?” into our Zulip stream, followed by a decent bit of embarrassment when I realize my mistake a minute later.

Facts.

I think the real impressiveness is when you try to parse markup with regex, and that is probably a level of hell no one should have to face (Dante wrote about this, it’s true). More often than not, when you want a solution to “just” work in one specific case, regex is actually a really good solution. Not so much if you’re writing a general-case parser, so please don’t do that: chances are someone wrote one already and yes, someone probably wrote one with regex as well.

Learning regex

If you haven’t thought of learning regex, you probably don’t need to learn it yet, but it’s a good skill to pick up regardless. It lets you do the really weird greps and text matches that you would otherwise need multiple, complicated steps for. Unfortunately, there’s no catch-all way to learn regexes, in my opinion at least. I would say the easiest way is to pick it up gradually: open up regexr or regex101 and try writing your own solution for your use-cases. Both of these sites have cheatsheets and references built in, and both explain the pattern you write as well as the text that gets matched. I recommend you keep the cheatsheet open and try to write the pattern yourself.

And in the case of Python, the documentation is your best friend. In fact, a lot of standard libraries use regex for a lot of things (configparser, for example). A complex regex can often replace code that would otherwise span multiple functions. There is a loss of readability, but writing properly documented regex goes some way to alleviate that issue.
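As a rough sketch of what “properly documented regex” can look like, Python’s re.VERBOSE flag lets you spread a pattern over several lines and comment each piece. The date pattern and the names DATE_RE and parse_date are made up purely for illustration, they aren’t from any particular project.

```python
import re

# Hypothetical example pattern: an ISO-style date such as 2020-08-31.
# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment.
DATE_RE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return (year, month, day) as ints, or None if no date is found."""
    found = DATE_RE.search(text)
    if found is None:
        return None
    return tuple(int(found.group(name)) for name in ("year", "month", "day"))


print(parse_date("GSoC ended on 2020-08-31."))  # (2020, 8, 31)
```

The pattern is no shorter this way, but whoever reads it next (probably you) doesn’t have to decode it character by character.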

You never know when regex will save the day!

“match” and “search”

When I started off with Python’s re, this was probably my biggest mistake: the two functions behaved differently and I hadn’t figured out why. The documentation makes it “amply” clear, but that didn’t stop me from using match where there was no reason to use it. In fact, the documentation literally has an example: https://docs.python.org/3/library/re.html#search-vs-match but I missed it all the same.

The crux is that match only matches at the beginning of the string, while search scans the whole string for the first match. The name match implies it just looks for a match, but that’s a bit off. In any case, I was misled because the PCRE flavour of regex basically implements the behaviour of search by default, so I had no reason to believe that match would be unfit for the job (an act of pure folly). Soon enough, a friendly soul pointed me to the documentation after I complained about the regex function not working and told me what was going wrong (and thus, the regex show goes on).

>>> import re
>>> re.match("def", "abcdef")    # No match, returns None
>>> re.search("def", "abcdef")   # Match
<re.Match object; span=(3, 6), match='def'>
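If it helps, here is a slightly longer sketch of the same difference, with re.fullmatch thrown in for comparison. The sample string is the one from above; everything else is just illustration.

```python
import re

text = "abcdef"

# match() only tries the pattern at position 0.
print(re.match("def", text))               # None
print(re.match("abc", text).span())        # (0, 3)

# search() scans forward until the pattern fits somewhere.
print(re.search("def", text).span())       # (3, 6)

# Without re.M, anchoring a search with ^ behaves like match().
print(re.search("^def", text))             # None

# fullmatch() requires the pattern to cover the whole string.
print(re.fullmatch("abc", text))           # None
print(re.fullmatch("abc.*", text).span())  # (0, 6)
```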

Named groups

Regex by default allows groups, and newer flavours typically allow named groups as well. Using index-based references, you can use a capturing group like (abc) and then use a back-reference like \1 to refer to whatever the first group matched. A non-capturing group, which groups without saving the match, looks like (?:abc). Handy, right?
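A small sketch of both ideas, with made-up patterns and sample strings (the names doubled and versioned are mine):

```python
import re

# (\w+) captures a word; \1 must then repeat exactly what group 1 matched.
doubled = re.compile(r"\b(\w+) \1\b")
found = doubled.search("this is is a typo")
print(found.group(0))  # is is
print(found.group(1))  # is

# (?:...) groups the alternatives without capturing, so the year stays group 1.
versioned = re.compile(r"(?:GSoC|GSoD) (\d{4})")
found = versioned.search("My GSoC 2020 project")
print(found.groups())  # ('2020',)
```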

It’s even easier and more Pythonic to use named groups. Let’s say you’re trying to get someone’s username from their email address (not a good idea but hey, good enough for an example); you would do something like:

matchobj = re.match("^(?P<id>[^@]+(?=@.+))", email)

That’s a fairly complicated regex (for all the wrong reasons), so let’s just clear it up:

^ matches the beginning of a string and if in MULTILINE mode (signified by the re.M flag in Python), it matches the beginning of every line.
?P<id> names the capture group (everything inside the parentheses) so we can refer to it later.
[^@] is a character class for all characters except @ (the “except” part of it being the ^). It’s not necessarily accurate for a validator, but it works fine for our purpose.
+ signifies the above token matches one or more times.
?= makes a positive lookahead, to ensure our match actually contains an at sign and some text after that (signified by @.+) but doesn’t actually match that part itself.

Now, we can easily extract the ID from the match object like:

user_id = matchobj.group("id")  # avoids shadowing the built-in id()

Note that while this is fine when there is a match, re.match returns None when there isn’t one, and calling .group() on None raises an AttributeError, so it’s important to check the match object first.
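One way to guard against that, as a minimal sketch: wrap the lookup in a function that checks for None before touching the match object. The pattern is the email one from above; the names USERNAME_PATTERN and extract_username are hypothetical.

```python
import re

# Same pattern as in the example above.
USERNAME_PATTERN = r"^(?P<id>[^@]+(?=@.+))"


def extract_username(email):
    """Return the part before the @, or None if nothing matches."""
    matchobj = re.match(USERNAME_PATTERN, email)
    if matchobj is None:
        # No match: bail out instead of calling .group() on None.
        return None
    return matchobj.group("id")


print(extract_username("qedk@example.org"))  # qedk
print(extract_username("not-an-email"))      # None
```

Returning None is only one option; raising a ValueError or returning a default would work just as well, depending on what the caller expects.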

While we didn’t strictly need a named group for this purpose, it’s more forthcoming about what your regex is trying to get at. The more transparent your code, the better it will be supported in the long run.

Compilation

If you’re using the same regex many times in a single program, you should ideally compile it. There isn’t a hard limit, but if you are using more than a “few” patterns, compiling them means your performance doesn’t take a hit. The loss itself should be negligible, but it adds up in the long run, especially since compiling is so simple.

compiled = re.compile(r"\w*compilethis\w*", flags=re.M)  # raw string, so \w is not treated as a string escape

And then use the compiled regex to get matches like:

matchobj = compiled.search(string)

The documentation says that the compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry. In most use-cases, you probably won’t need it but if you’re using complex regular expressions multiple times, you should definitely take advantage of compilation.
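Put together, the usual shape is to compile once outside the loop and reuse the pattern object inside it. This is only a sketch: the pattern is the one from above (without the re.M flag, which makes no difference when there is no ^ or $), and the sample lines are made up.

```python
import re

# Compile once, outside the loop.
compiled = re.compile(r"\w*compilethis\w*")

lines = [
    "nothing to see here",
    "please precompilethisnow thanks",
    "compilethis",
]

hits = []
for line in lines:
    # The compiled pattern has the same methods as the module: search, match, sub...
    matchobj = compiled.search(line)
    if matchobj is not None:
        hits.append(matchobj.group(0))

print(hits)  # ['precompilethisnow', 'compilethis']
```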

That’s about it from me, and if you read the docs, you are now officially better than me at regex (admittedly not a high bar to meet). Keep in mind that the re module is a treasure trove of helpful functions that will do your job for you, very simple but important things like substitution and escaping. So, go on and spread the regex movement and remember to tell people to not write regex parsers for XHTML.

This is what happened to the last person who wrote regex to do that.

Want to be a new developer? See New Developers
Want to interact with people of the Wikimedia Outreach community? Come visit us at Zulipchat.
Want to begin learning Rust? Read The Book.

Do let me know in the comments if you have any suggestions! Next time, a progress report. 🎉