Alexander Holbreich

Internal data structures of Elasticsearch

If you start working intensively with Elasticsearch, you cannot avoid understanding its internal data structures. Here I'll try to make this very simple for you.

Inverted Index

Key Characteristics of Inverted Index

An inverted index is a basic data structure for full-text search. It consists of a list of all the unique words (terms) that appear in any document and, for each term, a list of the documents in which it appears. Consider the following structure.

Term    | Doc_1 | Doc_2 | ... | Doc_X
--------------------------------------
hello   |   X   |   X   |     |
world   |   X   |   X   |     |
java    |       |   X   |     |
perl    |   X   |       |     |
golang  |       |       | ... |   X
...
--------------------------------------

Here, every term has a list of the documents containing that term. Now, if we want to search for “world perl”, we just need to find the documents in which each term appears:

Term    | Doc_1 | Doc_2
------------------------
world   |   X   |   X
perl    |   X   |
------------------------
Total   |   2   |   1

Both documents match, but the first document matches more of the query terms than the second, so a naive ranking that just counts matching terms would put it first. Keep in mind that at index time the values go through tokenization and normalization, a process called analysis.
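To make the lookup above concrete, here is a minimal Python sketch of an inverted index. It is only an illustration, not how Elasticsearch or Lucene implement it: the function names `analyze`, `build_inverted_index` and `search` are made up for this example, and the "analysis" step is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def analyze(text):
    """Toy analysis step: lowercase and split on whitespace (real analyzers do much more)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Count, per document, how many of the query terms it contains."""
    hits = defaultdict(int)
    for term in analyze(query):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    # Documents that match more query terms come first
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical documents matching the table above
docs = {
    "Doc_1": "Hello World Perl",
    "Doc_2": "hello world java",
}
index = build_inverted_index(docs)
results = search(index, "world perl")
print(results)  # [('Doc_1', 2), ('Doc_2', 1)]
```

Because the query goes through the same `analyze` function as the documents, "World" in Doc_1 and "world" in the query end up as the same term.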

Doc Values

Key Characteristics of Doc Values

While indexing, Elasticsearch adds the tokens to the inverted index for search. It also extracts the terms and adds them to a columnar storage structure called doc values.

Doc   | Terms
-----------------------------------------
Doc_1 | hello, world, perl
Doc_2 | hello, world, java
Doc_3 | We, need, more, golang, tutorials
-----------------------------------------
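The document-to-terms layout above can be sketched in Python as a plain dictionary. This is a simplified illustration, assuming a single field and ignoring the on-disk encoding Elasticsearch actually uses. The helper `terms_aggregation` is a hypothetical name for this example, not an Elasticsearch API.

```python
from collections import Counter

# Hypothetical column store for one field: document ID -> terms of that document
doc_values = {
    "Doc_1": ["hello", "world", "perl"],
    "Doc_2": ["hello", "world", "java"],
    "Doc_3": ["We", "need", "more", "golang", "tutorials"],
}

def terms_aggregation(doc_values, matching_docs):
    """Count terms across the matching documents, similar in spirit to a terms aggregation."""
    counts = Counter()
    for doc_id in matching_docs:
        # The lookup goes from document to terms, the opposite of the inverted index
        counts.update(doc_values.get(doc_id, []))
    return counts

counts = terms_aggregation(doc_values, ["Doc_1", "Doc_2"])
print(counts["hello"], counts["perl"])  # 2 1
```

The point of the sketch is the direction of the lookup: given a set of matching documents, the terms of each document can be read directly without scanning every term in the index.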

Doc values are used for several use cases in Elasticsearch, such as sorting, aggregations, and access to field values in scripts.

Because doc values live on disk, access speed depends on the operating system's file system cache. When the “working set” is smaller than the available memory on a node, the OS will naturally keep all the doc values hot in memory, leading to very fast access. When the “working set” is much larger than the available memory, the OS will start to page doc values in and out of memory as needed.

Fielddata

Key Characteristics of Fielddata

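Sorting, aggregations, and scripts need a different access pattern than search: instead of finding the documents that contain a term, they look up the terms of a given document. Most fields can use index-time, on-disk doc_values for this data access pattern, but text fields do not support doc_values.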

Instead, text fields use a query-time, in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.
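The "inverting" step described above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: one segment, one field, and plain dictionaries instead of the real Lucene structures. `build_fielddata` is a hypothetical name used only here.

```python
from collections import defaultdict

def build_fielddata(inverted_index):
    """Invert term -> documents into document -> terms, held entirely in memory."""
    fielddata = defaultdict(list)
    # Every term of the segment has to be visited, which is why this is expensive
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            fielddata[doc_id].append(term)
    return dict(fielddata)

# Hypothetical inverted index of a single segment
inverted_index = {
    "hello": {"Doc_1", "Doc_2"},
    "world": {"Doc_1", "Doc_2"},
    "java": {"Doc_2"},
    "perl": {"Doc_1"},
}

fielddata = build_fielddata(inverted_index)
print(fielddata["Doc_1"])  # ['hello', 'perl', 'world']
```

Even in this toy version the cost is visible: the whole index is read once, and the result is a second copy of the data that has to stay in memory.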

Warning: Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so, since fielddata is expensive in both memory and computation.
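As a rough illustration of the choice behind this warning, the two mapping bodies below are written as Python dictionaries that serialize to the JSON Elasticsearch expects. The field name `title` is a made-up example. The first body turns on `fielddata` for a text field. The second keeps the text field for search and adds a `keyword` sub-field, which is backed by doc values, for aggregations and sorting. Treat this as a sketch and check the mapping documentation for your Elasticsearch version.

```python
import json

# Option A: enable fielddata on a text field (builds the in-memory structure on first use)
fielddata_mapping = {
    "properties": {
        "title": {"type": "text", "fielddata": True}
    }
}

# Option B: keep the text field for search and add a keyword sub-field,
# which uses doc values, for aggregations and sorting
multi_field_mapping = {
    "properties": {
        "title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}},
        }
    }
}

# Serialize to the JSON request body
fielddata_body = json.dumps(fielddata_mapping, indent=2)
multi_field_body = json.dumps(multi_field_mapping, indent=2)
print(multi_field_body)
```

With the second mapping, aggregations and sorting would target `title.keyword` while full-text queries keep using `title`.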

P.S. Did I forget something? Your comment is welcome!

