Searching with Elasticsearch

Posted by Niraj Markandey

In this blog, we will cover the Elasticsearch basics and answer questions including:

  • What is Elasticsearch?
  • How does Elasticsearch work?
  • What does it mean to be powered by Lucene?
  • How does Elasticsearch store data?
  • How does Elasticsearch "search"?


 

So what is Elasticsearch?

Elasticsearch is part of the Elastic Stack. It is typically used together with Kibana for visualization and with Logstash and Beats for data ingestion. Elasticsearch is the "middle engine": a real-time search and analytics engine that lets you store, search, analyze, and explore your data.

Key Features:

  • Elasticsearch has a distributed document (data) store.
  • It provides real-time analytics.
  • It is highly scalable.
  • It exposes a JSON-based REST API, so you can access its functionality from any web client of your choice or even from the command line.
  • Because Elasticsearch is a distributed system, it also provides APIs to manage and monitor the cluster.
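
To make the REST API point concrete, here is a minimal Python sketch that builds a full-text search request using only the standard library. The node address (http://localhost:9200), the index name "articles", and the field name "content" are assumptions made up for this example. The request is only constructed, not sent, so the snippet runs without a cluster.

```python
import json
from urllib.request import Request

# Assumptions for illustration: a local node and a hypothetical index and field.
ES_URL = "http://localhost:9200"
INDEX = "articles"
FIELD = "content"


def build_search_request(text: str) -> Request:
    """Build (but do not send) a full-text search request for the given text."""
    body = {"query": {"match": {FIELD: text}}}
    return Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("queryai data power")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run the search against a live cluster, you could send the request with:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(json.load(response))
```

The same JSON body could be sent from any other HTTP client or from the command line.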

 

How does Elasticsearch work?

Elasticsearch is powered by Apache Lucene, an open-source full-text search library written in Java, which it uses to index and search the data it stores.

 

How does Elasticsearch store data?

For full-text search, Elasticsearch uses a data structure called an inverted index. An inverted index maps each unique word found in the stored content to the documents that contain it. In this case, anything stored in Elasticsearch contributes to the index.

Here is how an inverted index works:

Document 1
QueryAI Decentralized Data Access & Analysis

 

Document 2
QueryAI helps you unlock the power of your data.

To create the inverted index, we first split the text of each document into separate words, which are called tokens (or terms).

Once the tokens are determined, we apply filters to increase searchability:

  • Removing stop words (common English words such as "the" and "in")
  • Lowercasing (to make the search case insensitive)
  • Stemming (reducing words to their root form, so "Foxes" is converted to "fox")
  • Synonyms (for example, "jumped" and "leap" are treated as synonyms and indexed as the single term "jump")

After applying the above rules, we get:

Token          Present in Document 1    Present in Document 2
queryai        yes                      yes
decentralize   yes                      no
data           yes                      yes
access         yes                      no
analysis       yes                      no
help           no                       yes
unlock         no                       yes
power          no                       yes
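
As a rough illustration, the steps above can be sketched in a few lines of Python. This is a toy model, not how Elasticsearch or Lucene implement analysis: the stop-word set and the stemming lookup are hand-written assumptions that cover only the two example documents, and the synonym filter is left out because neither document needs it.

```python
import re
from collections import defaultdict

documents = {
    1: "QueryAI Decentralized Data Access & Analysis",
    2: "QueryAI helps you unlock the power of your data.",
}

# Toy stand-ins for real analysis components (assumptions for this example only).
STOP_WORDS = {"the", "of", "you", "your"}
STEMS = {"decentralized": "decentralize", "helps": "help"}


def analyze(text):
    """Lowercase, tokenize, drop stop words, then map tokens to root forms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [STEMS.get(token, token) for token in tokens if token not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return dict(index)


inverted_index = build_inverted_index(documents)
for term, doc_ids in inverted_index.items():
    print(f"{term:<13} {sorted(doc_ids)}")
```

Each printed row corresponds to one row of the table: the term, followed by the documents in which it appears.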

When a user searches, the same filters are applied to the search string. For example, a search for QueryAI is lowercased to queryai before it is matched against the index.

 

For example:
Search: queryai
Result: The term is present in both documents, so the search returns both.

 

Search: queryai data power
Result: Both documents are returned, because each contains "queryai" and "data". Document 2 also contains "power", so it matches all three search terms and Elasticsearch ranks it higher than document 1.
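
The two searches can be reproduced with a small Python sketch. It scores each document by counting how many search terms it contains, which is a deliberate simplification chosen for this example. Elasticsearch's actual relevance scoring is more sophisticated, but the ordering for these two documents comes out the same. The index is written out by hand from the table so that the snippet is self-contained.

```python
from collections import Counter

# Inverted index from the table above: term -> IDs of documents containing it.
INVERTED_INDEX = {
    "queryai": {1, 2},
    "decentralize": {1},
    "data": {1, 2},
    "access": {1},
    "analysis": {1},
    "help": {2},
    "unlock": {2},
    "power": {2},
}


def search(query):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    # Search-time analysis is reduced to lowercasing and whitespace splitting.
    terms = set(query.lower().split())
    scores = Counter()
    for term in terms:
        for doc_id in INVERTED_INDEX.get(term, ()):
            scores[doc_id] += 1
    # Highest count first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(search("queryai"))             # [(1, 1), (2, 1)]
print(search("QueryAI data power"))  # [(2, 3), (1, 2)]
```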

 

In summary

Elasticsearch is a real-time search and analytics engine that lets you store, search, analyze, and explore your data. In this blog, we covered how Elasticsearch uses Lucene, how it stores data in an inverted index, and how it searches through that data. In the next blog, we will discuss other components of the Elastic Stack: Kibana and Logstash.

 



 


Common Elasticsearch terminology (used in the article)

Document:
Elasticsearch stores the entire JSON object that you index. Each stored object is called a document.

Index: 
For a user, the index is a place to store related documents. 

Inverted Index:
An index that maps each unique word used in the stored documents to the documents that contain it.

Full-text search:
In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria.

 



Written by Niraj Markandey

Niraj is a Senior Software Engineer presently working at Query.AI with a demonstrated history of working in the computer software industry. He likes to build scalable and resilient solutions for cloud-based products.