Do you think we need explicit confidence measures, or are implicit ones good enough? For language detection, TextCat can be configured to give its best guess no matter what, which is how I started looking at it (to maximize recall), but it can also be configured to be more conservative and only give an answer that's likely to be right. The work I've been doing recently focuses on that kind of configuration (the current prod config is in between). So there's an implicit confidence built into the process, but it's hard to convert into a numerical score (though I'm thinking about ways to do that).
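To make that concrete, here's a minimal sketch (Python, and entirely hypothetical; this isn't how TextCat actually exposes its results) of one way an implicit best guess could be turned into a numerical score: in a TextCat-style setup each candidate language gets a distance (lower is better), and the margin between the best and second-best candidates can serve as a rough 0-to-1 confidence.

```python
from typing import Dict, Optional, Tuple

def best_guess_with_score(distances: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Hypothetical: turn per-language distances (lower = better) into a
    best guess plus a rough 0-1 score based on the margin to the runner-up."""
    if not distances:
        return None
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    best_lang, best_dist = ranked[0]
    if len(ranked) == 1:
        return best_lang, 1.0
    second_dist = ranked[1][1]
    if second_dist == 0:
        # Both candidates at distance zero: a dead heat, no confidence.
        return best_lang, 0.0
    # 0.0 means a tie with the runner-up; values near 1.0 mean a clear winner.
    return best_lang, (second_dist - best_dist) / second_dist
```

With a cutoff of 0 this behaves like the max-recall configuration (always return the best guess); raising the cutoff gives the more conservative behavior.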
Even with a numerical score, would we want to do something different based on that score, or would it just be a yes/no decision? If it's a yes/no decision, that can probably be pushed back into the module (as with language detection, which can just return no answer if the confidence isn't there). If it's a more complex process based on the score, I worry about maintaining it, and needing different values for different languages or different wikis.
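If we did go the score route, the maintenance worry looks something like this sketch (hypothetical names and numbers): the yes/no decision can be pushed into the module, which applies the cutoff itself and never exposes a raw score, but the per-wiki cutoff table is exactly the part somebody would have to keep tuned.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-wiki cutoffs; exactly the kind of table that would need
# ongoing tuning per language or per wiki.
CUTOFFS: Dict[str, float] = {"enwiki": 0.30, "frwiki": 0.25}
DEFAULT_CUTOFF = 0.30

def decide(guess: Optional[Tuple[str, float]], wiki: str) -> Optional[str]:
    """Yes/no style: apply the cutoff here and return either a language code
    or nothing at all, so callers never have to interpret a raw score."""
    if guess is None:
        return None
    lang, score = guess
    return lang if score >= CUTOFFS.get(wiki, DEFAULT_CUTOFF) else None
```

For example, `decide(best_guess_with_score(distances), "enwiki")` would give either a language code or no answer, which is all the caller needs for the yes/no case.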
We'd also have to come up with a useful confidence measure in each case. For quote stripping, for example, it's also language-dependent. Do we want to run tokenizing on the string to see how many tokens are inside the quotes? For languages with spaces between words, that's easy; for Chinese, it's hard (unless we can get the tokenization for the original query back from Elasticsearch).
If we're looking at a simple set of criteria (setting aside Chinese tokenization for a moment), they could be folded into the initial criteria: "presence of 2+ tokens inside paired double quotes". (Though it's worth noting that even a single quoted word stops spelling correction, at least, so quoted single words might also do better without the quotes.)
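A sketch of that criterion check (again hypothetical, using plain whitespace splitting, which is exactly the part that breaks down for Chinese without a real tokenizer):

```python
import re

QUOTED = re.compile(r'"([^"]+)"')  # text inside paired double quotes

def has_multi_token_quoted_phrase(query: str) -> bool:
    """The simple criterion: 2+ whitespace-separated tokens inside a pair of
    double quotes. For Chinese, the count would have to come from a real
    tokenizer (ideally the one Elasticsearch already ran on the query)."""
    return any(len(phrase.split()) >= 2 for phrase in QUOTED.findall(query))

# has_multi_token_quoted_phrase('"exact phrase" lookup')  -> True
# has_multi_token_quoted_phrase('a "single" quoted word') -> False
```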
In my experience, it's a lot harder to bolt on confidence after the fact if it hasn't been part of the system from the original design, so I'm also worried about the amount of effort. For quotes, I wonder if it's even worth it. Quoted queries are fairly uncommon, so we wouldn't be wasting a ton of processing time if we just stripped the quotes and re-ran those queries every time.