User:OrenBochman/Search/NLP Tools


NLP (Natural Language Programming) Tools are increasinly being used at for enhancing user expereience in content based sites. Twitter [1] [2] [3] [4] and other websites are using NLP tools geared at broadening the scope and increasing the power of search by modeling the languages of thier content. They require more sophisticated algorithms greater processing power. Their capabilities can be disruptive and could revolutionize how people use/edit a wiki. For example it could be possible to use such a tool to:

  • search names phonetically or using transliteration
  • spell check an article (using a lexical spell checker augmented by a search engine spell checker)
  • search for entities (people,places,institutions,etc)
  • search and visualise memes in a wiki or an article.
  • search and terms across projects and languages (cross-language search)
  • translate an article from another language.
  • understand to what extent an article overlaps with it's forign language versions.

The tasks required from the NLP tools can be categorised as follows:

  • Pre-processing (sentence chunking)
  • phonology (IPA anotation, tranliteration)
  • morphological (lexical to lemma normalization; pos tagging)
  • syntactic (sentence parsing)
  • semantic (word-sense disambiguation, cross language processing)
  • higher level (named entity detection, relation detection, ontological tagging, reasoning, sentiment)

The Corpus ComponentEdit

Since producing a corpus is the first step for NLP data exploration some kind of tool. The area of corpus linguistics raises the following concerns:

  • corpus size required for the tasks at hand.
  • quality of corpus (noise to signal ratio).
  • format.

Since the NLP workflow will produce great insights into the documents, it will be possible to create superior Nth-generation corpus. How significant this iteration is to be seen - it would be a blessing to harness the benefits while minimizing the reprocessing this entails.

Extracting wiki content would require plenty cleaning up. Corpus should be fully automated.

An alternative to corpus production would be to create a filter chain which would work directly on the compressed dumps. By going in this direction would make integration of other tools like w:R_(programming_language) w:sirlm w:apertium impossible to integrate. This direction

  • For lexical processing a list of sentences would be enough.
  • For Semantic processing paragraphs and document would be useful too.
  • For Cross language processing one would like sample and combine text from different languages.

Simple Corpus format:

#<doc lang="" id="" etc >

Corpus Size & QualityEdit

It would be possible to generate a much larger corpus by scanning the dump with old edits. However in such a case one would be more interested in expanding the corpus with sentences which add contribute lexical information.

Ideally the files should aggregate document of similar quality and size so that it would be possible to process the corpus in parallel and to process the top files or the bottom files.

Corpus Based NLP Work FlowEdit

  1. Dump to Plain Text
    1. Sentence Splitter (requires a Language dependent Sentence Chunker)
    2. Word Normalization (requires a Language dependent Word Chunker - zwnj)
    3. Tag Stop Words
    4. Tag Named entities using n-gram analysis of titles references disambiguation and inter-wiki-links.
  2. Morphology
    1. Induct a morphology
    2. Tag morphological features
      • lemma-ID
      • suffix
      • lexical category Part of speech or full morphological state
    3. second pass for named entities
    4. induct co-locations & compounds
    5. second pass tagging of morphological features
  3. Induct a probabilistic grammar or use an existing one framework
    1. Use Grammar to parse sentences.
    2. Resolve cataphorical references.
    3. Tag logical aspect of sentences
  4. High Level Analysis:
    1. Bad grammar, spelling mistakes, bad writing style.
    2. Tag N.E. (third pass)
    3. Tag N.E. relations.
    4. Merology, Ontology tagging
    5. Sentiment Analysis.
    6. Reason and Extract high level knowledge into database.

For cross language processing Multi-documents from different wikis would have to be united using their inter-wiki information

Corpus Analysis PackagesEdit

Since R now provides tools for doing corpus linguistics. The main advantage of R is that it has the stronger statistical capabilities built in. A suitable tools for converting Wikipedia dump to a corpus should be found or developed. From this three immediate deliverables could be produced

  • Frequency lists - Words, Shingels, Phrases.
  • Co-locations - phrases that are more than the sum of their words.
  • Concordances - KWIC Databases.

what else is of interest ?

Wiktionary AnalyzerEdit

The Lexical workflow described above should produce data which would be loaded into a thread-safe Lucene analyzer.

  • Specialized MediaWiki projects — (named entity)
  • The Wiktionary projects — (cross lingual Wordnet) (One wordnet to rule them all)
  • The Wikipedia projects — (inter-wiki cross language)
  • Hunspell Project — (spelling,thesaurus,hyphenation,grammar)[5] [6]
  • w:apertium (monolingual dictionaries,bilingual dictionaries,morphologies)[7]
  • finite state morphology

Provide a better, faster, smarter search across the Wikimedia project. (Suggestions, spellings, corrections, etc)

The analysis would be deeper. To be done in sensible time development would require

  • Wikimedia project dumps (Openzim or xml) source and HTML
  • Hadoop cluster running advanced Mahut algorithms(SVM,LDA and others)
  • Iterative production of more powerful Lucene analyzer. bootstrapped via simple scripts followed by unsupervised learning cycles (to complete the picture)
  • Since these jobs could easily become intractable (though bugs, bad algorithms)
    • running dev job on wiki subsets
    • job length and current progress/cost estimation are design goals.

Lexical DataEdit

  1. Lemma Extraction
    1. Scan English Wiktionary at the POS section level.
    2. Extract
      1. Lang,
      2. POS,
      3. Lemma information,
      4. Glosses
  2. co-location
  3. proper names
  4. silver bullets
    1. Entropy reducing heuristic -> to induct missing each lemma from free text.
    2. Inducting word-sense based on Topic Map 'Contexts'.
    3. Introducing a disambiguation procedure to Semantic/Lexical/Entities.

Semantic DataEdit

  1. Word Sense enumeration - Via Grep
    1. Word Sense context collection Via Mahut SVD
    2. Word Sense context description(req an algorithm)
  2. Word nets - synonyms,antonyms,
    1. Entity type
    2. Categories

Entity DataEdit

The idea is to generate a database of entities referenced in Wikipedia. Entities are

Bootstrapping via article headers and categories. Once the most obvious entities are listed one proceeds to train classifiers to find the remaining entities via NLP.

  1. category Based Classification of:
    1. People, organizations, companies, bands, nations, imagined, dead etc
    2. Animals, Species, etc
    3. Places, counties, cities,
    4. Dates, Time Lines, Duration
    5. Events, Wars, Treaties, films, awards,
    6. Chemicals, medicine, drugs
    7. commodities, Stocks etc
    8. Publications, Journals, periodicals, Citations etc
    9. External Web locations.
  2. Unsupervised acquisition of More entities
    1. Train a SVM classifiers Mahut via Hadup using tagging/parsing low ambiguity snippets referencing wide choice of terms.
    2. acquire more entities.
  3. Crosswikify Top entities.
    1. Cross wiki links.
    2. Run Mahut LDA via Haddop on Articles/Sections with correlated entities.


Etymologies are not directly important for indexing. They can provide some extra information relating lexemes across languages. However this information might be useful as a secondary source of information for confirming a lexical hypothesis

It should be possible to order lexeme by a time based hierarchy.

  • Loan Words
  • Assimilation
  • Grim's Laws (Vowel and consonant shifts)

could be useful to compress the cross language word tree and offer insights into how irregularity are introduced into languages overtime.

For IR these type of analysis is of little practical use due to

  • rapid phonological
  • semantic divergence. (e.g. English Knight mounted cavalry is a German loan word where it means infantry)

If we find etymological relation in one language it should imply a pattern for other languages

  1. would require a model (loan, analogy, emphasis, assimilation, grimm's law).
    1. requires/implies a phonological distance, semantic distance, word sense.
    2. requires/implies a graph of time, language, location.
    3. historical linguistics rules could be used to refine such a model.

MT translation DataEdit

This is not high priority deliverable since its utility is doubtful.

  • Offline bilingual dictionaries may be of interest. c.f. wikt:en:User:Matthias Buchmeier/trans-en-es.awk
  • As wiktionaries improve they could become a significant contribution where statistical methods fall short.
  • Wikipedias clearly contain large volume of text for generating statistical language models.

Filing in the gapsEdit

During analysis it may be able to do some extra tasks.

  1. Multilingual context sensitive spell-checking the wikis. Both offline and online.
  2. foreign Language Learning spelling dictionary
  3. Identify "missing templates" requires lemma to template mapping data struction and a generalization algorithm
  4. Identify "missing pages/section" in the wiktionary.

Language InstructionEdit

Language Instruction would benefit from a database of about language pairs or groups: [i.e. it could help chart an optimal curriculum for teaching a nth language to a speaker of n-1 languages by producing a graph of list least resistance.

  • Top Frequency Lexical Charts
  • Topical Word Lists
  • Lexical problem areas
  • Word Order (requires lemma-n-gram frequencies)
  • Verbal Phrase/ Verbal Complement Misalignment.


  1. frequency information together with lexical data can be used to make a text compressor optimized for a specific wiki.
  2. this type of compressed text would be faster to search.


the lexical data + frequency + n-gram frequency could be used to make a parametric cross-lingual parser.

Machine TranslationEdit

see also Wikipedia_Machine_Translation_Project

  1. export data into apertium format. cf
  1. a morphological dictionary for language xx called, apertium-sh-en.en.dix etc. which contains the rules of how words in language xx are inflected.
  2. Bilingual dictionaries which contain correspondences between words and symbols in the two languages. called:
  3. language xx to language yy transfer rules: this file has rules for how language xx will be changed into language yy. In our example this will be:
  4. language yy to xx language transfer rules: this file has rules for how language yy will be changed into language xx. In our example this will be: apertium-sh-en.en-sh.t1x
  • apertium likes [w:FSM]s so it could be possible to adapt its morphological data into an efficient spell-cheking dictionary.
  • it format may able to support collocation.
  • it does not seem to have a word-sense notion.

Crowdsourcing WiktionaryEdit

Using existing categories, templates, the semantic wiki extension, and some scripts one could

  • automate generation of morphological information.
  • again it is possible to automate generation of bi-lingual information from wiktionary pages

Topological Translation AlgorithmEdit

  • Statistical translation works by matching parts of sentences. (This has many problems)
  • Requires a large parallel corpus of translated texts. (not available)
  • Assumes that words in a N-gram of words operate. Some languages have free word order.
  • In reality statistical lexical data is sparse.

My idea is to develop a:

  • topological algorithm for wikis
  • based on related documents and their revisions.
  • can use non-paralel categorised sets of translation sets of documents.
  • generates seme lattices I.E. cross lingual semantic algorithm
  • (semes N-nets)
  • with (morphological-state)
  • morpho-syntactic Lagrangian for MT.
  • the lattice should converge due to product theorem

  1. Translation matrix
    1. maps source word-sense to a target word-sense.
    2. translate cross language.
    3. simplify a single language text.
    4. make text clearer via a disambiguate operation
  2. Numbered list item

Unsupervised Morphology InductionEdit

Parametric version unsupervised induction of morphology based on [GoldSmith]


comma separates equivalents, dot separates entries

  1. ALPHA_FULL: [a,A.b,B. ... σ,sz,SZ. ... z,Z]
  2. ALPHA_VOWL: [a.e.i.o.u]
  3. Derive:
  6. BOOTSTRP_SIGN_SEED = (2,2) two suffix of length two
    1. assigns stems n-grams signature
    2. use the signature to detect compounds


  1. Pre-process recode corpus. Cann be done automatically via min-entropy binary codes like in arithmetic coding, but these rules can override for debugability. e.g.
    • [sz<>σ] hungarian encode bi-gram as foreign uni-gram
    • [ssz<>σσ] hungarian long consonant as two foreign uni-grams
  2. Store words in a pat-trie
  3. bootstrap morphology using heuristics.
    1. heuristic 1: split at Li where successorFrequency(i)>= 25 and successorFrequency(i+1) == successorFrequency(i-1) == 1
    2. heuristic 2: seed signatures with stems having 2 suffixes of length 2
  4. repeat;
    1. evaluate signature entropy, and morphology entropy recode signatures for min entropy and max robustness (same thing).
    2. generalize for unknown words.
      1. generalize for unknown stems.
      2. generalize for unknown suffixes.


  1. trie word list (or AWG)
  2. atoms sorted by frequency
    1. stems
    2. suffixes
    3. prefixes
  3. signatures listed by robustness


  1. solve aliphony using Wiktionary
  2. solve aliphony using PLSA or TOPIC map variant
  3. collect lemmas database from templates categories. (Template extraction)
    1. Using hand-built table minor, SED (template) structured text (wiktionary) and free-text miner (Wikipedia) collect
  4. map templates to morphological-state via a "model".
  5. maprduce via w:mahut

Lemma MiningEdit

A couple of themes in searching for lemmas. 1. geometry. (Hamming for equal length) 2. N-gram.

  1. A bootstrap known knowledge.
      1. Lemma base.
      2. unknown base.
    1. parallel multi-pattern matches [1]
      1. affix mode

other iteration modes:

  1. Levinstien clustering using spheres round/between existing lemma and lemma members. lets say R=d(L1,M1) is the max distance between any two known lemma representatives. Words equidistant from L or M within R/2 are lemma candidates.
  2. generalizing the Levinstien distance to complex distance (we could map strings to sort be predictive.

Lemmatiser + Generalized Morphological AnalyzerEdit

Based on the above morphology data structures annotate a given token with the following token_type:

  • PROPER or NAMED_ENTITY - IPA transliteration, entity_type
  • LEMMA - lemmaId or stem or Ngram signature, part of speech, morphological state, Word Sense disambiguation, cross language seme id.

The lemma analysis should go into payloads to be used in downstream analysis tasks.

Cross-Lingual Morphological AnnotationEdit

Can come in different flavours.

  • Numeric - little effort to organize the lemma members, only enumerate them in a consistent manner. They may be arbiterily clustered to provide a minimum description length.
  • Linguistic - bootstrapping the numeric mode with annotation enumerating morphological features. Also an order can be used to arrange the states into clusters.
  • Cross-Lingual same as before but the annotations is a state generalised across many languages. Each language will use a subset of the Cross-Lingual depending on its morphological parameters.

Note: these annotations format (should) be structured to allow simple reg-ex extraction.

e.g. the hungarian has:

Cross-Lingual Annotation for the Hungarian LanguageEdit

Feature Anotation Glossary
Lexical Category {CT:VB; CT:VB_INF;CT:VB_PART; CT:VM_NN;CT:NN;CT:PNN; CT:ADJ; CT:ADV; CT:ART } respectively: verb, infinitive, participle, verbal noun, noun, pronoun, adjective, adverb, article
Definiteness {DF:0; DF:1 } respectively: indefinite, definite
Mood {MD:I;MD:S;MD:C} respectively: indicative, subjunctive, conditional
Number {NM:1;NM:2} respectively: singular, plural
Person {PR:1;PR:2;PR3} respectively: 1st person, 2nd person, 3rd person
Time {TM:R; TM:S; TM:F} respectively: present, past, future
Accusative {ACC:1} respectively: true
Polite {PL:1} respectively: true
Number {NM:1;NM:2} respectively: singular,plural
Possessor {PR:1;PR:2;PR3} respectively: 1st person,2nd person,3rd person
Possessor-Number {PNM:1;PNM:2} respectively: singular,plural
Case {CS:NOM; CS:ACC; CS:DAT; CS:INS; CS:SOC; CS:FAC; CS:CAU; CS:ILL; CS:SUB; CS:ALL; CS:INE; CS:SUP; CS:ADE; CS:ELA; CS:DEL; CS:ABL; CS:TER; CS:FOR; CS:TEM} respectively: Nominative ,Accusative, Dative-genitive,Instrumental, Essive-modal, Translative ,Causal-final, Illative, Sublative, Allative, Inessive, Superessive, Adessive ,Elative ,Delative, Ablative ,Terminative ,Formal,Temporal

Cross-Lingual Annotation Subset for Hungarian VerbsEdit

Lexical Category {CT:VB; CT:VB_INF; CT:VB_PART CT:VM_NN } respectively: verb, infinitive, participle,verbal noun
Definiteness {DF:0; DF:1 } respectively: indefinite, definite
Mood {MD:I,MD:S;MD:C} respectively: indicative,subjunctive,conditional
Number {NM:1,NM:2} respectively: singular,plural
Person {PR:1,PR:2,PR3} respectively: 1st person,2nd person,3rd person
Time {TM:R; TM:S; TM:F} respectively: present, past, future
Accusative {ACC:1} respectively: true
Polite {PL:1} respectively: true

Cross-Lingual Annotation Subset for Hungarian NounsEdit

Lexical Category {CT:NN } respectively: noun
Number {NM:1,NM:2} respectively: singular,plural
Possessor {PR:1,PR:2,PR3} respectively: 1st person,2nd person,3rd person
Possessor-Number {PNM:1,PNM:2} respectively: singular,plural
Case {CS:NOM, CS:ACC, CS:DAT, CS:INS, CS:SOC, CS:FAC, CS:CAU, CS:ILL, CS:SUB, CS:ALL, CS:INE, CS:SUP, CS:ADE, CS:ELA, CS:DEL, CS:ABL, CS:TER, CS:FOR, CS:TEM} respectively: Nominative, Accusative, Dative-genitive, Instrumental, Essive-modal, Translative, Causal-final, Illative, Sublative, Allative, Inessive, Suppressive, Adessive, Elative, Delative, Ablative, Terminative, Formal, Temporal


Once such a description is available it can be used to define states and signatures.

  • A state is the full morphological description of an inflected word.
  • A signature as explained above is a relation associating pos with permitted sates
    • Transitive Verb <CT-VB*,DF:1;MD:*,N:*,P:*,A:*,PL:*>


  1. IR.
    By supplying a lemma-id and a morphological-state one provide superior search capabilities in certain languages.
  2. A cross-language morphological description facilitates analysis from a more generalised forrm (Generic parametrized algorithms)
  3. A cross-language annotations would be useful for machine translation. In this sense morphology is viewed as a semi group generated as a Cartesian product of its feature subsets.
  4. Compression.
    Since in reality feature availability varies across languages matrix is not only sparse but also within a language feature availability is co-dependent. (a verb has time but a noun does not) Therefore the minimal sparse matrix can be extracted and used to compress morphological state. But to create such a compression scheme it is necessary to collect statistics showing lemma frequency, and feature dependency.

Phonological IndexingEdit

Phonological Indexing is the notion of also indexing the sounds of words. Naively speaking if all the text is indexed phonologically, as is the query, it should permit to search using voice. However in reality there are many complexities and one would prefer to restrict sound to where it can make the greatest impact. This boils down to words whose sound representation is most important - generally names or in English, proper nouns.

using Wiktionary data, information in Illustrations of the IPA[8] it could be possible to provide phonological transcription for many languages. These could be used in two ways.

  • (cross language) sound based search of proper nouns.
  • search via voice query for mobile devices
  • generation of audio version of articles in various language using synthetic voices.


Writing systems are categorized into two main type - Logo-graphic and Phonetic. Logo-graphic systems generally provide a glyph per word. In contrast Phonetic systems use a much smaller set of symbol which generally reflects the phonology of the language at a certain time. In some cases language reform keeps spelling inline with development in phonology. In other cases the alphabet contains historical artifacts from by gone erase (English)

Using a phonological description of a language it is possible to provide a set of context sensitive production rules which can be used to provide a transcription of the language. Once a Phonological description of the corpus/lexicon is available it can be used to improve search. This phonological rules, lexicon, and morphological productions can be used to make a Finite State Model of the language. These are highly compact and efficient description of the morphological lexicon which can both analyze a word or or generate a given state from the citation form. For such languages generating an IPA should be possible. While developing FST morphologies is an expert task Apertium has such models and if integrated they could be used in search.

To provide an IPA Transcription

  1. . create a text to IPA transcription grammar
  2. . create an interpreter
  3. . allow override IPA from database
  4. create test sets based on illustrations of the IPA

IPA has broad and narrow transcription styles. The narrow allows a higher degree of accuracy in the description.

  1. . allow deletion of certain sound patterns based on user's preferences.
    • Sound Search Preferences will be set using a sequence of questions based on minimal set
      • do you pronounce cot and caught in the same way?
      • do you pronounce r in new york
  2. plug these values into a similarity for phonology

Phonological Similarity AlgorithmEdit

The algorithmic challenge would be to design an IPA based similarity and ranking. w:Soundex, w:Double Metaphone

  • that is robust to small changes in input (spelling)
  • respects the sound difference of input and index languages.
  • respect input transliteration from different languages.
  • gracefully degrades precision. (e.g. prefer a narrow match, a broad match, a clustered phoneme match, edit distance for typos.

MDL methodologyEdit

Collect and encode information based on its ability to reduce description length.

  • (sparse) w:phonotactic matricies. Two models -
    • boolean binary - generate a table with + indicating pair availability
    • probablistic model
      • lexicon norming
      • corpus norming via power law
      • foreign influence via cumulative statistics based on 10 partitions of the most frequent 50,0000 words.
      • (Subject to scaling by corpus size.

  • General phonotactics matrix.
    • lexeme initial, medial ,final position <consonant,consonant> and <vowels,vowel> pairs.
  • Cluster phonotactics matrix.
    • morpheme initial, medial ,final position <consonant,consonant> and <vowels,vowel> pairs.
  • Sound Assimilation rules (also useful for FSM creation).

See AlsoEdit

Dump RelatedEdit

Some Options from apertium

Some DirectoriesEdit


Corpus ToolsEdit

Concordancing SoftwareEdit