Why disabling the _source field in Elasticsearch is a bad idea (generally)

The _source field is a predefined field that ES (Elasticsearch) manages. It contains the entire document, exactly as it was sent for indexing. Here is an example of what you get back when you search on ES:

LM-SJC-00872983:data sravindran$ curl -XGET 'localhost:9200/wiki/english/_search?size=1'
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 855,
    "max_score": 1.0,
    "hits": [{
      "_index": "wiki",
      "_type": "english",
      "_id": "AVjdGkz54qZQsxl5SktF",
      "_score": 1.0,
      "_source": {
        "title": "Let's Go All the Way (song)",
        "abstract": "}}"
      }
    }]
  }
}

As you can see, the default behavior (where _source is enabled) is that ES gives you back the entire document. You can request only parts of the document by listing the fields you need in the _source param, like this:

LM-SJC-00872983:data sravindran$ curl -XGET 'localhost:9200/wiki/english/_search?_source=title&size=1&pretty=1'
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 855,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "wiki",
      "_type" : "english",
      "_id" : "AVjdGkz54qZQsxl5SktF",
      "_score" : 1.0,
      "_source" : {
        "title" : "Let's Go All the Way (song)"
      }
    } ]
  }
}

ES retrieves the document and extracts the fields you requested from the _source field.
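To make the filtering step concrete, here is a minimal Python sketch. The helper names search_url and filter_source are made up for illustration. filter_source is only a toy model of the idea (top-level field names, no wildcards or nested paths), not how ES implements it internally. Nothing is sent over the network; the code just builds the same URL as the curl call above.

```python
from urllib.parse import urlencode


def search_url(base, index, doc_type, fields, size=1):
    """Build a URI search URL that asks for only `fields` from _source."""
    # safe="," keeps the comma-separated field list readable in the URL.
    query = urlencode({"_source": ",".join(fields), "size": size}, safe=",")
    return f"{base}/{index}/{doc_type}/_search?{query}"


def filter_source(source, fields):
    """Toy model of source filtering: keep only the requested top-level fields."""
    return {name: source[name] for name in fields if name in source}


url = search_url("http://localhost:9200", "wiki", "english", ["title"])
doc = {"title": "Let's Go All the Way (song)", "abstract": "}}"}
filtered = filter_source(doc, ["title"])
print(url)
print(filtered)
```

The point of the sketch is that filtering starts from the stored _source: if the original document was never kept, there is nothing to pick the fields out of.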

Now that you know what it is, here are the reasons why you should generally not disable it:

  1. The update API requires the _source field to be enabled:
    Update works by retrieving the document’s _source first and then applying your
    change to it. Disabling _source means updates simply won’t work!
  2. The reindex API requires _source to be enabled:
    ES’s reindex API copies data from one index to another. The copying relies on
    _source being enabled, so disabling it means the reindex API simply won’t work.
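The two points above can be illustrated with a small in-memory toy model. This is not Elasticsearch code: ToyIndex, its attributes and the reindex helper are all hypothetical names, and the model assumes only what the list says, namely that update and reindex both start from the stored original document.

```python
class ToyIndex:
    """In-memory stand-in for an index; a toy model, not Elasticsearch."""

    def __init__(self, source_enabled=True):
        self.source_enabled = source_enabled
        self.stored_source = {}  # doc id -> original document (the "_source")
        self.searchable = {}     # doc id -> names of the indexed fields

    def index(self, doc_id, doc):
        # Fields are always made searchable; the original is kept only if enabled.
        self.searchable[doc_id] = set(doc)
        if self.source_enabled:
            self.stored_source[doc_id] = dict(doc)

    def update(self, doc_id, partial):
        # Update = fetch the original, merge the change, index the result again.
        if not self.source_enabled:
            raise RuntimeError("update needs _source, but it is disabled")
        merged = {**self.stored_source[doc_id], **partial}
        self.index(doc_id, merged)
        return merged


def reindex(src, dest):
    """Copy every document from src to dest by reading the stored originals."""
    if not src.source_enabled:
        raise RuntimeError("reindex needs _source, but it is disabled")
    for doc_id, doc in src.stored_source.items():
        dest.index(doc_id, doc)
    return len(src.stored_source)


wiki = ToyIndex()
wiki.index("1", {"title": "Let's Go All the Way (song)", "abstract": "}}"})
print(wiki.update("1", {"abstract": "A 1986 single"}))
print(reindex(wiki, ToyIndex()))
```

With ToyIndex(source_enabled=False), documents can still be indexed, but both update and reindex raise, because there is no original document left to read.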

It’s still use-case dependent, though. If your use case does not rely on updates or reindexing and the data is pretty much static, you can go ahead and disable it.
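For reference, _source is turned off in the mapping when the index is created. The sketch below only builds the request body in Python and does not send it. The helper name create_index_body is hypothetical, and the layout assumes a typed mapping that matches the english type used in the examples above; in newer ES versions without mapping types, the _source block would presumably sit directly under mappings, so check the documentation for your version.

```python
import json


def create_index_body(doc_type, source_enabled):
    """Return a create-index request body that sets _source for one type."""
    mapping = {"mappings": {doc_type: {"_source": {"enabled": source_enabled}}}}
    return json.dumps(mapping, indent=2)


# Body you would send when creating the index, e.g. with a PUT request.
body = create_index_body("english", False)
print(body)
```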
