User:TJones (WMF)/Notes/April 2018 Conference Trip Report

April 2018 — See TJones_(WMF)/Notes for other projects. The week of April 9, 2018, I spent three days at two conferences in Charlottesville, VA.

OpenSource Connections Haystack Search Relevance Conference

One of the giant conundrums of the Haystack conference is still unresolved: are we interested in relevance or relevancy? I don't think we will ever know.

The conference schedule is online, and there's a Google Doc where they are collecting links to slides from the talks.

This was a good conference overall, and it's nice to have a conference really focused on normal "human" search and not just log files and the like. (That's important stuff, too, just not what I worry about.) Not every presentation—even the really good ones—leads to obviously actionable ideas, but I do feel like I got a better understanding of a few areas, and some more general ideas to think about applying to our search.

One talk in particular felt like more of an advertisement than a presentation, which was more out of place than usual because this conference really was about sharing ideas to make search better. The keynote address was about encouraging everyone in the search community to share and work together to "commoditize" search functionality in open source, rather than repeatedly re-inventing the wheel. That works well with our mission, since our work is generally out in the open and open source.

Of course, the most interesting and most valuable talk of the entire conference was Erik's talk, From Clicks to Models—The Wikimedia LTR Pipeline (conf page, slides). I did not attend that talk, however, because I saw a preview of it before the conference, and I've had a decent inside view of the whole process for months. I didn't see all the images he added, though, and those are hilarious. His talk was a good overview of everything that's gone into turning our click logs into usable training data. It had more technical details than some of the other talks, but that's a good thing.

Another talk, on clickstream analytics (conf page, slides), recapitulated part of our journey from Discernatron to the upcoming search surveys. She was a bit less optimistic about surveys, largely because the sparsity of her available data makes survey results unreliable. I've always known, but now it's really sunk in, that while we have a relatively small search team, we have a huge and valuable resource in the number of users we have.

I really liked a presentation on Word2Vec (conf page, slides). The talk was more exploratory than goal-directed, but informative. It also featured a nice explanation of Word2Vec as "harvesting" the hidden layer of a neural network that predicts the most likely adjacent words. And his goal was to increase recall, rather than precision. It's unclear how best to use Word2Vec vectors for retrieval within Elasticsearch or a similar search engine. He had a hack to wedge vectors into the payload of an article for a demo, but it's not scalable. One interesting idea is to put documents into Word2Vec similarity neighborhoods, then retrieve all docs in the relevant neighborhood and score them somehow.
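
To make the "harvesting" idea concrete, here's a minimal sketch using gensim on a silly toy corpus (my own example, not the presenter's setup). The learned hidden-layer vectors live in model.wv, and "similar" words are just nearest neighbors in that vector space:

  from gensim.models import Word2Vec

  # toy corpus; a real model would train on lots of real text
  sentences = [["film", "review", "movie"],
               ["movie", "review", "critic"],
               ["film", "director", "critic"]] * 100
  model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=0)

  # the "harvested" hidden layer: one vector per word
  vector = model.wv["film"]

  # similar words are nearest neighbors in the vector space
  print(model.wv.most_similar("film", topn=3))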

There was a good presentation on automatic extraction of keywords and concepts (conf page), which reminds me that I want to work on that at some point.

Dealing with named entities came up in a couple of talks. Named Entity Recognition (NER) (conf page) seems promising and interesting as always, though we still need to figure out how best to integrate it into the search process. Indexing entities is one possibility. For Wikipedia and some other wikis, pushing articles on named entities found in the query up in the result list is another. Or we could just use entity matches, or query-entity–title matches, as LTR signals.
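
As a concrete (and hypothetical; this isn't from either talk) illustration of the first step, spaCy's off-the-shelf models will pull entities out of a query string. The open question is how to feed the matches into indexing, boosting, or LTR:

  import spacy

  # assumes the small English model is installed:
  #   python -m spacy download en_core_web_sm
  nlp = spacy.load("en_core_web_sm")

  query = "movies directed by Sofia Coppola in Tokyo"
  for ent in nlp(query).ents:
      # e.g., "Sofia Coppola" PERSON, "Tokyo" GPE
      print(ent.text, ent.label_)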

Another talk, on entity resolution (slides currently require Slack registration) across more structured data sources, was interesting. Having done name-matching in the past, I was more interested in all the ethnolinguisticky things it didn't yet do—like matching names and nicknames, or finding likely name manglings or other variants—because it could infer a huge number of such things by linking records in different data stores via other fields.

There was a fun session on search war stories—there are no links or notes, but the best story involved a panicked call from a client whose old search server had started to physically fall apart when moved to a new data center. When chunks start falling off, you probably need to upgrade something.

Another good session was on The gentle art of incorporating "business rules" into results. It's much less important for our search, since no one is trying to promote some particular product to the top of the list, though the techniques involved could be useful for intelligently boosting items on other criteria (like a recent uptick in popularity).
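
As an illustration of the general mechanism (this is stock Elasticsearch function_score, not necessarily what the session covered, and the field name is made up), a popularity boost might look like this:

  # boost matching docs by a (hypothetical) popularity_score field
  query = {
      "query": {
          "function_score": {
              "query": {"match": {"text": "film festival"}},
              "field_value_factor": {
                  "field": "popularity_score",
                  "modifier": "log1p",   # dampen big popularity swings
                  "missing": 0,
              },
              "boost_mode": "sum",
          }
      }
  }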

One session I didn't attend was on synonyms (conf page, slides). I will be checking out the slide deck later, though. If nothing else, we should try out adding movie as a synonym for film for English Wikipedia!
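
For reference, the stock Elasticsearch way to do this is a synonym token filter in the analysis chain (the analyzer and filter names below are my own; I haven't yet checked what the talk recommended):

  # index settings with a synonym filter treating movie and film as equivalent
  settings = {
      "analysis": {
          "filter": {
              "film_synonyms": {
                  "type": "synonym",
                  "synonyms": ["movie, film"],
              }
          },
          "analyzer": {
              "text_with_synonyms": {
                  "tokenizer": "standard",
                  "filter": ["lowercase", "film_synonyms"],
              }
          },
      }
  }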

Some other general ideas that came from the conference:

  • Facets are definitely something I'd like to think more about. They can be useful in increasing diversity of search results, for example.
  • It might be interesting to run some set of queries against production every month to see how results change.
  • Another simple automated metric for the impact (as opposed to quality) of any changes is up/down/add/drop. How many results moved up or down how many places, how many results were added, how many were dropped—all measured at some cut-off, like 3, 5, 10, 20, etc. RelForge metrics already cover some of this, but these are some potential additional details (see the sketch after this list).
  • I'd like to explore entity recognition and possibly extraction of synonyms and vocabularies, and figure out ways to use them for retrieval or ranking.
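
Here's a quick sketch of the up/down/add/drop idea (my own hypothetical implementation, comparing two ranked lists of doc IDs at a cut-off):

  def updownadddrop(before, after, cutoff=10):
      """Count rank movements between two result lists, up to cutoff."""
      b, a = before[:cutoff], after[:cutoff]
      moves = {"up": 0, "down": 0, "added": 0, "dropped": 0}
      for rank, doc in enumerate(a):
          if doc not in b:
              moves["added"] += 1
          elif rank < b.index(doc):
              moves["up"] += 1
          elif rank > b.index(doc):
              moves["down"] += 1
      moves["dropped"] = sum(1 for doc in b if doc not in a)
      return moves

  # updownadddrop(["A", "B", "C", "D"], ["B", "A", "C", "E"], cutoff=4)
  # -> {'up': 1, 'down': 1, 'added': 1, 'dropped': 1}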


Tom Tom Founders Festival Machine Learning Conference

This was an okay conference, and worth the marginal cost of one more hotel night and $50 for a ticket, given that I was in town to attend Haystack already. There was talk of putting presentations online, but I haven't seen that yet. I've linked to the relevant presentations in hopes that slides will appear in the near future. A couple of the presentations were more ads for a service or company than anything else, which was disappointing. I'll skip over those, and most of the less relevant ones.

That said, there were two interesting presentations that weren't really relevant to search, but they were cool anyway because they were about computer graphics. The first was the keynote on deep learning and computer graphics. Lots of whiz-bang fun stuff, but not relevant to search. Admittedly, it was an ad for the company's GPUs, but they were a sponsor of the conference and it was a heck of a lot of fun to watch, so I'll allow it. ;)

Another irrelevant-to-search but fun presentation was on training object recognition for drone images by using Grand Theft Auto. Apparently GTA is so realistic now that it can be used to train machine learning models to recognize cars and people from drone images. The useful part is that GTA provides information, including text descriptions, of every object in the scene, which means you can generate unlimited labeled training data! Interesting side note: GTA images are too clear, so they had to fuzz them up a bit to make the models robust.

A recurring theme of the conference was speakers taking up a sizable chunk of their talk to explain the basics of neural networks. Explaining the basics of neural networks is a good thing, but maybe there should have been a "homework" video or wiki page to watch or read beforehand, or an overview during the conference intro, to get everyone up to speed on this widely used technique.

A promising-looking talk on word embeddings fell victim to this in the biggest way—he ran out of time covering neural network basics, though I already had an overview of word embeddings from Haystack the day before. He did go into lots of details—if he'd had twice the time, this would've been my favorite talk—and shared something I've seen elsewhere: in practice, many NNs don't use non-linear activation functions in their hidden layers because it makes the math of training/backpropagation a lot easier, and very much amenable to cramming through a GPU (which is super fast at the matrix math when it doesn't have to push everything through a sigmoid activation function at each layer). Also, if that last sentence didn't mean anything, go back and do your homework. ;)
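
To see why that's so GPU-friendly, here's a tiny numpy demonstration (my own, not from the talk): with no activation functions, a stack of layers is just a chain of matrix multiplications, which collapses into plain, highly parallelizable matrix math:

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(size=(1, 300))      # one input vector
  W1 = rng.normal(size=(300, 128))   # "hidden layer" weights
  W2 = rng.normal(size=(128, 50))    # output weights

  # two linear layers with no activation...
  layer_by_layer = (x @ W1) @ W2
  # ...are equivalent to a single matrix multiplication
  collapsed = x @ (W1 @ W2)

  assert np.allclose(layer_by_layer, collapsed)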

Another favorite—though he didn't have time to go into tons of details—broke up building complex systems based on neural networks into four components: embed, encode, attend, predict. From the talk description:

Because the input symbols are discrete (letters, words, etc), the first step is "embed": map the discrete symbols into continuous vector representations. Because the input is a sequence, the second step is "encode": update the vector representation for each symbol given the surrounding context. You can't understand a sentence by looking up each word in the dictionary --- context matters. Because the input is hierarchical, sentences mean more than the sum of their parts. This motivates step three, attend: learn a further mapping from a variable-length matrix to a fixed-width vector, which we can then use to predict some specific information about the meaning of the text.

The "attend" or "attention" step was the most nebulous, and he admitted that it is also the least well-researched. But overall it seems to come down to a methodical way to iteratively refine and represent the information—including content and its structural relations—in natural language (or other domains). Definitely worth looking into a bit more.

Another talk on generating annotations through binary decisions also echoed our history from Discernatron through the upcoming search surveys. Annotation is hard; make it easy by breaking it down to approximately binary decisions. She noted that in her interface, "annotation rates of 10-30 decisions per minute are common"—i.e., 2 to 6 seconds each. That's the kind of minimal intrusion I hope the surveys will represent, modulo the surprise factor, since people aren't signing up to annotate things. She also mentioned that Excel spreadsheets are the most common form of annotation, and I had flashbacks to projects long, long ago, as well as my own annotations on TextCat (though I just used a text editor).

I attended the NLP track, but it wasn't all day. Some of the other talks I attended were about cybersecurity, finding the next breakout star on social media, and "thwarting machines" that want to find and track you.

The second keynote was on non-analytics anti-patterns in analytics projects, and highlighted other places things can go wrong: not making sure your expected ROI justifies a project, not communicating well with users and testing to make sure you are doing the right thing, and a general lack of coordination on analytics projects. Her suggestion, and the apparent trend among big firms, is a Chief Analytics Officer or Chief Data Officer. She used the two interchangeably, but Wikipedia has different articles; in practice the role is probably an amalgamation of the two.

There was an interesting talk on making black-box AI methods explainable. They presented some neat techniques for assessing how much particular features contribute to a specific decision made by an uninterpretable model, like a neural network. It doesn't directly address interactions of features—I think an XOR-type situation would confound it—but it's a big step from complete mystery to some level of understanding along any single dimension.
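
For flavor, here's one simple perturbation-style version of the idea (my own sketch, not their technique; it assumes a scikit-learn-style model with predict_proba): wiggle one feature at a time and see how much the model's output moves for this particular input:

  import numpy as np

  def feature_contributions(model, x, n_samples=200, noise=0.1, seed=0):
      """Per-feature sensitivity of a black-box model for one input x."""
      rng = np.random.default_rng(seed)
      base = model.predict_proba(x.reshape(1, -1))[0, 1]
      contribs = np.zeros(len(x))
      for i in range(len(x)):
          perturbed = np.tile(x, (n_samples, 1))
          perturbed[:, i] += rng.normal(scale=noise, size=n_samples)
          diffs = model.predict_proba(perturbed)[:, 1] - base
          contribs[i] = np.abs(diffs).mean()
      return contribs  # bigger = feature matters more for this decision

Note that, as mentioned above, this one-feature-at-a-time view is exactly what misses interactions between features.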