Thoughts on Elasticsearch

Let me say it up front: Elasticsearch is great for searching. I’m currently working on improving some searches over millions of objects, which is why I’m getting acquainted with Elasticsearch. I came across our existing log cluster (ELK stack, 3 x 64 GB RAM, 32 cores), which looks like a good place to find existing data for this and to build new indices (document collections). First, however, I need to get to know Elasticsearch more closely.

What I’ve learned so far

In the first week

Search performance is really impressive. Even when searching unoptimized raw log indices with billions (yes, billions) of documents, you still get results in acceptable time.
Search queries (Search APIs) are expressive, but it takes time to understand them. A lot of time!
You should also care about mappings and types and understand how fields are indexed. This takes time as well.
An inappropriate index structure (too many or too few shards) can also come back to bite you soon.
If Logstash is used, it should be well understood too, and you should develop the Logstash config (filters) test-driven from the start!
Sometimes you want to create a new index from data that already exists in another one. This can be done with the great Reindex API, and it runs in the background while ES stays responsive. For example, I was able to create a new index with 4,198,761 elements out of a source index with more than 120,000,000 elements by executing a single REST call (see below) on the Reindex API. It took 30 minutes.
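
To put the reindex numbers above into perspective, here is a small back-of-the-envelope calculation in Python. It uses only the figures quoted above. The source index size is treated as exactly 120,000,000, which is an assumption (the real number was larger), so the computed share is an upper bound.

```python
# Figures quoted in the text above.
docs_written = 4_198_761
source_docs = 120_000_000   # assumption: the text only says "more than 120,000,000"
duration_s = 30 * 60        # 30 minutes

docs_per_second = docs_written / duration_s
selected_share = docs_written / source_docs

print(f"{docs_per_second:.0f} documents written per second")
print(f"at most {selected_share:.1%} of the source index was copied")
```

That is roughly 2,300 documents written per second, selected from a source index of which only about 3.5% matched the query.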

Examples

Some examples for those who have never seen it: typical REST calls.

Search query

POST /index/_search
{
  "_source": ["entry_id", "contract_id", "name", "description", "score", "country"],
  "from": 10,
  "size": 200,
  "sort": [{ "@timestamp": { "order": "asc" } }],
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "entry_type": "score processing" } },
        { "term": { "contract_id": "1000" } },
        { "range": { "@timestamp": { "gte": "17:08:2017", "lte": "17:08:2017", "format": "dd:MM:yyyy" } } },
        { "match": { "name": "fantastic" } }
      ]
    }
  }
}

This is a bool query with a single must clause that combines several expressions (match_phrase, term, range and match). A document is returned only if it satisfies all of them.
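
For readers who prefer to build such requests in code, the same body can be assembled as a plain Python dictionary and serialized to JSON. This is only a sketch: the helper name build_search_body and its parameters are made up for illustration, and nothing is sent to a cluster.

```python
import json


def build_search_body(contract_id, day, name, offset=10, page_size=200):
    """Return the search request body shown above as a dict.

    All clauses sit in one "must" list, so a document has to satisfy
    every one of them (logical AND).
    """
    return {
        "_source": ["entry_id", "contract_id", "name",
                    "description", "score", "country"],
        "from": offset,
        "size": page_size,
        "sort": [{"@timestamp": {"order": "asc"}}],
        "query": {
            "bool": {
                "must": [
                    {"match_phrase": {"entry_type": "score processing"}},
                    {"term": {"contract_id": contract_id}},
                    {"range": {"@timestamp": {
                        "gte": day, "lte": day, "format": "dd:MM:yyyy"}}},
                    {"match": {"name": name}},
                ]
            }
        },
    }


body = build_search_body("1000", "17:08:2017", "fantastic")
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of POST /index/_search.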

New type mapping

PUT /index_name/_mapping/type_name
{
  "type_name": {
    "properties": {
      "entry_id": { "type": "long" },
      "key": { "type": "text" },
      "name": { "type": "text" },
      "description": { "type": "text" },
      "country": { "type": "text" },
      "@timestamp": { "type": "date", "format": "date_optional_time||yyyy-MM-dd HH:mm:ss" },
      "state": { "type": "byte" },
      "contract_id": { "type": "long" }
    }
  }
}

This creates a new type type_name inside the index index_name.
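
Why the field type matters can be illustrated without a cluster. A text field is analyzed at index time, so the index stores individual lowercased tokens rather than the original string, while a term query looks for its input exactly as given. The Python sketch below imitates this with a deliberately simplified analyzer (lowercase, split on non-alphanumeric characters). It is not Elasticsearch’s real analyzer, and all function names are invented for the illustration.

```python
import re


def simple_analyze(value):
    """Very rough stand-in for an analyzer: lowercase and split into words."""
    return [token for token in re.split(r"[^0-9a-z]+", value.lower()) if token]


def term_matches(indexed_tokens, term):
    """A term query compares its input to the stored tokens without analyzing it."""
    return term in indexed_tokens


def match_matches(indexed_tokens, text):
    """A match query analyzes its input first, then looks for any resulting token."""
    return any(token in indexed_tokens for token in simple_analyze(text))


# Roughly what a "text" field would store for this value.
stored = simple_analyze("Fantastic Score Processing")

print(stored)
print(term_matches(stored, "Fantastic"))   # exact input with a capital F
print(term_matches(stored, "fantastic"))   # lowercased input
print(match_matches(stored, "Fantastic"))  # input is analyzed before matching
```

In this simplified model the term lookup for "Fantastic" fails while the match lookup succeeds. If exact matching on a whole value is needed, the keyword type is usually the better fit for that field.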

Re-indexing

POST /_reindex
{
  "source": {
    "index": "logstash-2017.08.17",
    "_source": ["entry_id", "contract_id", "name", "description", "score", "country"],
    "sort": { "@timestamp": "desc" },
    "query": {
      "bool": {
        "must": [{ "match_phrase": { "entry_type": "score processing" } }]
      }
    }
  },
  "dest": {
    "index": "new_index",
    "type": "new_type"
  }
}

This request creates new_index and fills it with the documents that match the query section.
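
A long-running reindex does not have to block the HTTP call. The Reindex API accepts wait_for_completion=false, in which case it returns a task id that can be polled through the Tasks API. The Python sketch below only prepares the request with the standard library and does not send it. The host http://localhost:9200 and the helper names are assumptions.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured test cluster


def build_reindex_body(source_index, dest_index, phrase):
    """Reindex body that copies only documents whose entry_type matches phrase."""
    return {
        "source": {
            "index": source_index,
            "query": {"bool": {"must": [
                {"match_phrase": {"entry_type": phrase}}
            ]}},
        },
        "dest": {"index": dest_index},
    }


def reindex_request(body):
    """Prepare (but do not send) a POST that starts the reindex as a task."""
    return urllib.request.Request(
        ES_URL + "/_reindex?wait_for_completion=false",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def task_status_url(task_id):
    """URL of the Tasks API entry for the id returned by the reindex call."""
    return ES_URL + "/_tasks/" + task_id


body = build_reindex_body("logstash-2017.08.17", "new_index", "score processing")
request = reindex_request(body)
print(request.get_method(), request.full_url)
print(json.dumps(body, indent=2))
```

Sending the request with urllib.request.urlopen(request) should return a JSON object containing a "task" id, which can then be fetched from task_status_url(...) until the task is reported as completed.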

Challenges

I’m still struggling with search. I don’t yet know how to retrieve all elements but only one child per (parent) id, a kind of “group by”. I also don’t know whether it is even possible to retrieve all elements grouped by the parent id field of the children while specifying an aggregate function for each group.

Going further, what I need is, for example, new synthetic fields computed while grouping:

inDate -> max(child.timestamp)
outDate -> min(child.timestamp).

I have no clue how to achieve that yet.
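
One possible direction for this kind of “group by”, offered as a sketch rather than a tested solution: a terms aggregation creates one bucket per distinct value of a field, and max, min and top_hits sub-aggregations are computed per bucket. The Python snippet below builds such a request body. The field name parent_id and the parameter names are assumptions (the field is not named above), @timestamp stands in for child.timestamp, and parent_id would have to be of a type that supports aggregations, such as long or keyword.

```python
import json


def build_group_by_body(group_field="parent_id", time_field="@timestamp",
                        max_groups=1000):
    """Search body that groups documents by group_field.

    Per group it asks for the newest and the oldest timestamp and for
    one representative document.
    """
    return {
        "size": 0,  # no plain hits, only aggregation results
        "aggs": {
            "by_parent": {
                # One bucket per distinct value; only the max_groups largest
                # buckets are returned, not necessarily every group.
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    "inDate": {"max": {"field": time_field}},
                    "outDate": {"min": {"field": time_field}},
                    "one_child": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{time_field: {"order": "desc"}}],
                        }
                    },
                },
            }
        },
    }


body = build_group_by_body()
print(json.dumps(body, indent=2))
```

Used as the body of POST /index/_search, the response should contain one bucket per parent id under aggregations.by_parent.buckets, each with inDate, outDate and a single hit. If only one document per parent is needed, field collapsing (the collapse option of the search request, in versions that support it) may be the simpler route.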

There are also not that many examples of advanced queries. Questions about search queries on Stack Overflow or on Elastic’s Discuss platform are often answered poorly or not at all, which surprises me a bit.

The same applies to Reindex. I would probably like to use the same “GROUP BY” expression to build the new index and to insert new fields. It looks like this is possible with “Pipelines”, but I haven’t tried it so far, and it is not easy to understand without examples.
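
A note of caution on pipelines: ingest pipelines process each document on its own, so they can add or change fields during a reindex, but they do not aggregate across documents the way a GROUP BY would. A minimal sketch of the per-document part follows, written as Python dictionaries. The pipeline id add_fields and the field reindexed_from are invented for the example. The set processor and the pipeline option in dest are part of the Ingest and Reindex APIs.

```python
import json

PIPELINE_ID = "add_fields"  # hypothetical pipeline name

# Body for: PUT /_ingest/pipeline/add_fields
pipeline_body = {
    "description": "Add a marker field to every reindexed document",
    "processors": [
        # Hypothetical field; "set" writes a fixed value into each document.
        {"set": {"field": "reindexed_from", "value": "logstash-2017.08.17"}}
    ],
}

# Body for: POST /_reindex
reindex_body = {
    "source": {"index": "logstash-2017.08.17"},
    # Every copied document is passed through the pipeline before indexing.
    "dest": {"index": "new_index", "pipeline": PIPELINE_ID},
}

print(json.dumps(pipeline_body, indent=2))
print(json.dumps(reindex_body, indent=2))
```

Fields such as inDate and outDate, which depend on several documents, would still have to be computed elsewhere, for example with an aggregation query, before being written to the new index.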

If you have some tips for beginners or any other feedback, please comment.