Basics of Applied Natural Language Processing (NLP)

Introduction

This article serves as a gentle introduction to natural language processing and how it makes life better.


Natural Language Processing (NLP) is a field that, like AI and machine learning, is becoming popular. It is related to both but not quite the same as either. NLP is mostly concerned with taking natural language as input and feeding it to machines, or using machines to produce natural language output.

For a general introduction to Artificial Intelligence I suggest you start with: Artificial Intelligence & Everyday Life! Part 1 — What is AI and an exploration of Machine Learning

As kids, we acquired the skills necessary to master our mother tongue, and some of us went on to learn many foreign languages. In much the same way, natural language processing involves making computers work with a new language the way we did as kids. Today we are faced with a need to make computers understand how humans operate and think, and to deal with the ambiguity in our language.

In summary, NLP is the bridge between what a human understands, such as English, and what a computer can understand and execute on, such as data and processing algorithms, typically written in a programming language like Python.

See this previous blog post for an introduction to Python as a programming language.

 

 


In the rest of this blog, let’s talk about the two main stages of an NLP pipeline: the techniques computers use for (1) natural language understanding and (2) natural language generation.

Natural Language Understanding (NLU)

Natural Language Understanding (NLU) is the sub-area where computers try to understand human language input. Human communication is full of pronouns, verbs, adjectives and social niceties, so the computer program pares the input down to its bare essentials, since a machine has no use for emotion or formality.

To understand a sentence spoken by a human, at a high level the computer program needs to perform the following steps:

  • Lexical analysis: Take the text input and break it into paragraphs, sentences, and words.
  • Syntactic analysis: Structurally parse the sentence out and derive relationships.
  • Semantic analysis: Derive meaning from the text.
  • Pragmatic analysis: Apply some real-world knowledge for a more holistic meaning.
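To make the first of these steps concrete, here is a minimal sketch of lexical analysis in Python. It assumes paragraphs are separated by blank lines and that sentences end in ".", "!" or "?". Real text breaks both assumptions often (think of abbreviations like "Dr."), and the function name lexical_analysis is made up for this illustration.

```python
import re

def lexical_analysis(text):
    """Split raw text into paragraphs, then sentences, then words."""
    # Paragraphs: assumed to be separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Sentences: naive split after ".", "!" or "?" followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        # Words: runs of letters, digits and apostrophes, lowercased.
        result.append([re.findall(r"[a-z0-9']+", s.lower()) for s in sentences])
    return result

text = (
    "Go fetch the dog some food. Could you get me some coffee?\n\n"
    "Sure, I will get you some coffee."
)
structure = lexical_analysis(text)
for paragraph in structure:
    print(paragraph)
```

The result is a nested list: paragraphs contain sentences, and sentences contain words. The later stages (syntactic, semantic, pragmatic) would build on this structure.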

The logical steps above have implementation sub-steps of their own, and some of the initial ones are given below. The full, detailed steps and their complexities are intentionally left out in favor of conceptual understanding.

  • Tokenization: break a sentence into individual words for further processing.
  • Stemming: a basic way of cutting off letters to derive the root word. For example, converting the word “alerted” to “alert” by removing the commonly known suffix “ed”.
  • Lemmatization: compared to stemming, a more advanced way of getting to the dictionary form of a word. For example, mapping “better” to “good”, which no suffix rule can do.
  • POS-tagging: labeling each word with its part of speech, such as noun, pronoun, adjective, article, adverb, etc.
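The four sub-steps above can be sketched in a few lines of plain Python. This is a toy under loud assumptions: the suffix list, the lemma dictionary and the part-of-speech table are all hypothetical and only cover the example sentence. Real libraries use much larger rule sets, dictionaries and statistical models.

```python
import re

def tokenize(sentence):
    """Tokenization: break a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

# Hypothetical suffix list; production stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Stemming: cut off the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary of irregular forms that suffix rules cannot handle.
LEMMAS = {"was": "be", "better": "good", "mice": "mouse"}

def lemmatize(word):
    """Lemmatization: dictionary lookup first, naive stemming as a fallback."""
    return LEMMAS.get(word, stem(word))

# Hypothetical lookup table; real taggers infer the tag from context.
POS_TABLE = {
    "the": "article",
    "dog": "noun",
    "was": "verb",
    "alerted": "verb",
    "quickly": "adverb",
}

def pos_tag(tokens):
    """POS-tagging: label each token with a part of speech, if known."""
    return [(token, POS_TABLE.get(token, "unknown")) for token in tokens]

tokens = tokenize("The dog was alerted quickly.")
stems = [stem(token) for token in tokens]
lemmas = [lemmatize(token) for token in tokens]
tags = pos_tag(tokens)
print(tokens)
print(stems)
print(lemmas)
print(tags)
```

Note how the stemmer leaves “was” untouched while the lemmatizer maps it to “be”. That gap is the difference between cutting letters and looking up the dictionary form.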

Does the jargon scare you?

Don’t let it! These are terms that linguists and literature majors would recognize and relate to better than a software engineer would.

In this article we are not dealing with the technical details of each of these. Suffice it to say that NLP is not yet a mature domain, and a lot of work is being done to attack the problem from various angles and see how to make computers understand things the way humans do.

There is a sort of conversion between man and machine going on behind the scenes to enable and facilitate NLP. Let me explain. Humans are very comfortable with a great deal of ambiguity, and we don’t use precision as much as machines do. Machines, however, get into a tizzy when faced with ambiguity, because they always deal with decision points and flow of control in a more binary sense.

That being so, the NLP world needs to make two things talk to each other: ambiguous text on one side, and precise data or mathematical values on the other. The former is the kind of English text humans use daily, like “Go fetch the dog some food” or “Could you get me some coffee”, which the computer might store as “food,dog” or “coffee”. We will cover techniques for such intent and entity detection in a future blog on how the computer can understand and execute the desired task.
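As a rough illustration of paring a sentence down to its essentials, the sketch below drops filler words and separates action words from everything else. It is not the intent and entity detection technique promised for the future blog. The word lists and the function name extract_keywords are hypothetical and only large enough for the two example sentences.

```python
import re

# Hypothetical word lists, just large enough for the two example sentences.
STOP_WORDS = {"go", "the", "some", "could", "you", "me", "a", "an", "please"}
ACTION_WORDS = {"fetch", "get", "bring"}

def extract_keywords(sentence):
    """Reduce a sentence to its action words and candidate entities."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    actions = [t for t in tokens if t in ACTION_WORDS]
    # Whatever is neither filler nor an action is kept as a candidate entity.
    entities = [t for t in tokens if t not in STOP_WORDS and t not in ACTION_WORDS]
    return {"actions": actions, "entities": entities}

dog_request = extract_keywords("Go fetch the dog some food")
coffee_request = extract_keywords("Could you get me some coffee")
print(dog_request)
print(coffee_request)
```

Entities come out in the order they appear in the sentence. A fixed word list like this breaks down quickly on real input, which is exactly why the ambiguity of human language makes the problem hard.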

Next, let’s talk about natural language generation, i.e. how the computer will generate a reply to your request, such as “Sure, I will get you some coffee.”

Natural Language Generation (NLG)

Natural language generation (NLG) is the practice of making computers produce text output that even grandma can understand. Once the English sentence is produced, it can be presented to the user in a variety of ways, such as a chat message, an email, or even a voice message with a bot reading out the generated text.
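One of the simplest ways to generate such a sentence is to fill slots in a fixed template. The sketch below assumes that approach purely for illustration. The template names and the generate_reply function are hypothetical, and modern NLG systems are far more flexible than this.

```python
# Hypothetical reply templates with named slots.
REPLY_TEMPLATES = {
    "confirm": "Sure, I will {action} {recipient} some {item}.",
    "decline": "Sorry, I cannot {action} {recipient} any {item} right now.",
}

def generate_reply(kind, action, item, recipient="you"):
    """Fill the chosen template with the structured data for the request."""
    template = REPLY_TEMPLATES[kind]
    return template.format(action=action, recipient=recipient, item=item)

coffee_reply = generate_reply("confirm", "get", "coffee")
dog_reply = generate_reply("confirm", "fetch", "food", recipient="the dog")
print(coffee_reply)
print(dog_reply)
```

The input here is precise data (an action, an item, a recipient) and the output is an English sentence, which is the reverse direction of the understanding step.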

I am deliberately not going into much technical detail here because, as I said at the outset, this is a very gentle introduction. Over the next few months we are going to explore this topic in detail and also see practical ways of using machine learning and natural language algorithms to properly structure unstructured data.

NLP and me

My first exposure to NLP was academic, originating from my CS undergrad background going back two decades. Since then, most of my industry experience has been in machine data analytics. Log parsing and log analytics over the years kept me on a track that was, strictly speaking, not NLP. But in a way it was not too far off from it either, because there are a lot of parallels. After all, log parsing and analytics were all about how to extract fields, values and eventually the meaning and bigger picture from the log sentence.
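To show what extracting fields and values from a log sentence can look like, here is a minimal sketch using regular expressions. The log line, its format and the field names are all made up for this example; real logs come in many formats and need format-specific patterns.

```python
import re

# Hypothetical log format: "<timestamp> <LEVEL> <component>: <message>".
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<component>\w+): (?P<message>.*)"
)
# key=value pairs embedded in the free-text message.
KEY_VALUE = re.compile(r"(\w+)=(\S+)")

def parse_log_line(line):
    """Extract named fields and key=value attributes from one log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # The line does not follow the assumed format.
    fields = match.groupdict()
    fields["attributes"] = dict(KEY_VALUE.findall(fields["message"]))
    return fields

# Invented example line, not taken from any real system.
line = "2020-01-15T10:32:07Z ERROR auth: login failed user=alice src=10.0.0.5"
parsed = parse_log_line(line)
print(parsed)
```

The parallel with NLP is visible here: the line is tokenized into fields much like a sentence is tokenized into words, and meaning is then built on top of those pieces.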

The analogy I like to use is that NLP and log parsing are like rails of the same track that curve and move along together without meeting. With NLP, the computer is trying to understand the human’s language, and with log analytics, the human is trying to make sense of terabytes of system and application logs. It’s safe to say, though, that some NLP academicians, linguists and researchers may disagree with the analogy.

Meanwhile, I founded Query.AI, where we are traversing this track and working to unite the rails. This first blog is a quick primer on NLP itself, and as stated we will get to the log analytics aspects later in this blog series.

For now, I hope you've enjoyed reading and I invite you to follow us on this journey.

Concluding remarks

In today’s world the whole idea is to teach computers the way humans think and operate. Evidently it is a hard thing to do, and NLP is no different. In the articles that follow we shall explore new technical domains and see how to apply techniques to deal with natural language in such a way that machines understand us and vice versa. Bridging the gap between us (humans) and them (machines)!

 

Posted by Dhiraj Sharan

Dhiraj is the founder and CEO of Query.AI. He is an innovator and expert developer with 18 years of problem solving and solutions development in cybersecurity, including over 10 patents. He has led engineering for companies like ArcSight, HPE, Niara and Aruba.