Natural Language Processing Dictionary

Posted by Abhishek Sinha

Bag of Words
The bag-of-words model is a simplifying representation technique wherein a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
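As a minimal sketch, a bag of words can be built with Python's `collections.Counter`. The helper name `bag_of_words` and the lowercase, whitespace-based splitting are assumptions made for illustration; real pipelines usually tokenize more carefully.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (Counter) of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

bag = bag_of_words("the cat saw the dog")
print(bag["the"])  # multiplicity is kept: "the" occurs twice

# Word order is disregarded: both sentences give the same bag.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```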

Bottom-up parser
A bottom-up parser starts from the words of the input and groups them into phrases, and phrases into higher-level phrases, until a complete sentence has been found.

Chomsky hierarchy
The Chomsky hierarchy classifies formal grammars, and the languages they generate, into 4 types, listed here from most restrictive (Type 3) to least restrictive (Type 0):
Regular – Type 3
Context Free – Type 2
Context Sensitive – Type 1
Recursively Enumerable – Type 0

Corpus
A corpus in NLP is a collection of text or other digital data, possibly spanning several languages. Example – the text of Wikipedia, an online encyclopedia, is a corpus.

Hidden Markov Model
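A Hidden Markov Model (HMM) is a probabilistic graphical model in which a sequence of observed variables is generated by a sequence of hidden (unobserved) states. It is used to infer the most likely sequence of hidden states from the observations.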
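One common way to recover the hidden sequence is the Viterbi algorithm. The sketch below is illustrative only: the weather states, activities, and all probabilities are made-up toy values, and the function name `viterbi` is an assumption rather than any library's API.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # Each column maps a state to (best path probability, previous state).
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy model (hypothetical numbers): hidden weather, observed activities.
states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

hidden = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
print(hidden)  # ['Sunny', 'Rainy', 'Rainy']
```

In NLP the same idea is often applied with part-of-speech tags as hidden states and words as observations.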

Information Retrieval
Information retrieval is the process of accessing and retrieving the most appropriate information from a text collection based on a particular query, using context-based indexing or metadata.

Knowledge Base
A Knowledge Base is the collection of information known to an NLP system.

Lemmatization
Lemmatization is the process of normalizing a word by removing its inflection to arrive at its base form (lemma). Example - the base form of “gone” or “went” is “go”.

Lexicon
A lexicon is the vocabulary of a language or subject, along with its usage. Example – medical terminology.

NER (Named Entity Recognition)
Named entity recognition (NER) is a popular information extraction technique that identifies and segments named entities in text and classifies them under predefined classes. Named entities are specific spans in a text document that represent real-world objects such as people, places, and organizations, which are often denoted by proper names.

N-gram
An N-gram is a contiguous sequence of N tokens from a given text. N-gram based models predict the most probable word(s) to follow an entered text sequence. Such models are useful in many NLP applications, including speech recognition, machine translation, and autocomplete.
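Extracting N-grams from a token list takes only a sliding window. This is a minimal sketch; the helper name `ngrams` is chosen here for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```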

Normalization
Normalization is the process of transforming text into a canonical form. It can involve a series of steps such as expanding abbreviations and contractions, converting to the same case (upper/lower), and removing stop words. Normalized text can be processed more uniformly and predictably than raw text.
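A small normalization pipeline might look like the sketch below. The contraction table and stop word set are tiny, hypothetical samples chosen for the example; a real system would use much larger resources.

```python
import re

# Illustrative samples only, not complete resources.
CONTRACTIONS = {"don't": "do not", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "is"}

def normalize(text):
    """Lowercase, expand contractions, drop punctuation and stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOP_WORDS]

normalized = normalize("It's a test, don't panic!")
print(normalized)  # ['it', 'test', 'do', 'not', 'panic']
```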

Ordinal
An ordinal is a number that denotes rank or position rather than quantity. Example – first, second.

Parse Tree
A parse tree is a tree-structured representation of the syntactic structure of a text according to the rules of a grammar.

POS (Part Of Speech) Tagging
Part-of-speech tagging is the task of labeling each word in a sentence with its part of speech, such as noun, verb, adverb, or adjective.

Referential Ambiguity
Referential ambiguity arises when a word or phrase, typically a pronoun, could refer to more than one entity in the given context. Example - in “Alice told Mary that she had won”, “she” could refer to either Alice or Mary.

Regular Expression
A regular expression, or regex, is a sequence of characters that defines a search pattern, which is used to find sub-text in a given text. It is a powerful text searching tool with many uses in NLP, such as feature extraction from text and string replacement.
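Both uses can be sketched with Python's standard `re` module. The sample text and the patterns are illustrative assumptions; the email pattern in particular is deliberately simple and not a full address validator.

```python
import re

text = "Contact alice@example.com or bob@example.org by 2024-01-15."

# Feature extraction: pull out email-like and date-like substrings.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# String replacement: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", "too    many   spaces")

print(emails)
print(dates)
print(cleaned)
```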

Semantic Analysis
Semantic Analysis is a technique to determine the actual meaning of the selected text. It goes beyond the syntax of the text by deriving its true meaning.

Sentiment Analysis
Sentiment analysis examines a body of text to understand the opinion expressed in it. Typically, the sentiment is quantified as a positive, negative, or neutral score called polarity. Sentiment analysis works better on text with subjective content than on text with only objective content.

Similarity Measures
Similarity measures quantify how similar two texts are. Examples include Jaccard similarity, Smith-Waterman alignment, and Levenshtein distance.
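Two of these measures are short enough to sketch directly. The function names are chosen for illustration; Jaccard is computed here over sets of tokens and Levenshtein over characters, which is one common setup among several.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(sa & sb) / len(sa | sb)

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(jaccard("the cat".split(), "the dog".split()))
print(levenshtein("kitten", "sitting"))  # 3
```

Note that Jaccard is a similarity (higher means more alike), while Levenshtein is a distance (lower means more alike).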

Statistical Language Modeling
Statistical language modeling (SLM) is the technique of building probabilistic models that predict the next word in a sequence based on the words that precede it. Such a model assigns probability scores to possible text sequences.
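As a rough sketch, a bigram model estimates the probability of each word given only the previous word, using relative frequencies from a corpus. The three-sentence corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions made for the example; no smoothing is applied, so unseen word pairs get probability 0.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def word_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

def sentence_prob(counts, sentence):
    """Probability of a whole sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= word_prob(counts, prev, word)
    return prob

model = train_bigram(["i like tea", "i like coffee", "i drink tea"])
print(word_prob(model, "i", "like"))
print(sentence_prob(model, "i like tea"))
```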

Stemming
Stemming is the process of removing a word's affixes to derive its stem. Example – the stem of “going” is “go”.
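The idea can be illustrated with a deliberately crude suffix stripper. This is not an established algorithm such as the Porter stemmer; the suffix list and the `naive_stem` name are assumptions for the example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

for w in ["going", "walked", "cats", "running"]:
    print(w, "->", naive_stem(w))
```

The last word comes out as "runn", which shows why a stem is not always a dictionary word and why practical stemmers apply more elaborate rules.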

Stop Words
Stop words are words that contribute little to the overall meaning of a text and are often filtered out before processing. Example - “the”, “a”

Syntactic Analysis
Syntactic analysis, or parsing, is a technique to check whether a text conforms to the rules of a formal grammar and to determine its grammatical structure.

Tokenization
Tokenization is the technique of splitting text into smaller units called tokens. Larger chunks of text can be tokenized into sentences, and sentences can be tokenized into words.
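Both levels of tokenization can be approximated with regular expressions. This is a simplified sketch with hypothetical helper names; the sentence splitter, for instance, would wrongly split after abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Return words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

sents = split_sentences("NLP is fun. Is it hard? No!")
print(sents)                  # ['NLP is fun.', 'Is it hard?', 'No!']
print(split_words(sents[0]))  # ['NLP', 'is', 'fun', '.']
```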

Top-down parser
A top-down parser starts from the root of the parse tree (the start symbol of the grammar) and expands it downward until it arrives at the words of the input.
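A recursive-descent parser is one simple kind of top-down parser. The sketch below assumes a toy grammar (S → NP VP, NP → Det N, VP → V NP) and a six-word lexicon, both invented for the example, and returns the parse tree as nested tuples.

```python
# Toy lexicon (hypothetical): maps each word to its part of speech.
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "cat": "N",
           "saw": "V", "chased": "V"}

def parse_word(tokens, pos, tag):
    """Match one word with the given tag; return (subtree, next position)."""
    if pos < len(tokens) and LEXICON.get(tokens[pos]) == tag:
        return (tag, tokens[pos]), pos + 1
    return None

def parse_np(tokens, pos):
    """NP -> Det N"""
    det = parse_word(tokens, pos, "Det")
    if det is None:
        return None
    noun = parse_word(tokens, det[1], "N")
    if noun is None:
        return None
    return ("NP", det[0], noun[0]), noun[1]

def parse_vp(tokens, pos):
    """VP -> V NP"""
    verb = parse_word(tokens, pos, "V")
    if verb is None:
        return None
    obj = parse_np(tokens, verb[1])
    if obj is None:
        return None
    return ("VP", verb[0], obj[0]), obj[1]

def parse_sentence(tokens):
    """S -> NP VP; succeed only if every token is consumed."""
    subj = parse_np(tokens, 0)
    if subj is None:
        return None
    pred = parse_vp(tokens, subj[1])
    if pred is None or pred[1] != len(tokens):
        return None
    return ("S", subj[0], pred[0])

tree = parse_sentence("the dog saw a cat".split())
print(tree)
```

Each function corresponds to one grammar rule, so the call order mirrors the expansion from the start symbol S down to the input words.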

Universal Quantifiers
Universal quantifiers are generalizing words such as never, always, and all, which make a statement apply to every case.

WH-question
A WH-question is a question whose answer is descriptive and not limited to yes or no.

Word
A word is a unit of a language used to build phrases and sentences.

Word Segmentation
See Tokenization

YN-question
A YN-question is a question whose answer can be yes or no.

Let us know if you think we should add any other terms!

