I love Elasticsearch. It’s a great system that appears to be gaining more traction as a datastore which means only more good things can be on the horizon. In an effort to share what I learned in using the system, I wanted to publish out notes I had taken when building some of my clusters (and collected by others). This entry is long, but well worth the read if you are new to Elasticsearch. The documentation for ES is great, but lacks a lot of real-world implementation details. To put these notes into perspective, my largest cluster was the following:
- 4 nodes inside of EC2
- RAM size of 32GB (16GB on the heap)
- Magnetic storage (300GB a server)
- 157 indexes (partitioned based on hashed values for easy routing and even distribution)
- 410 shards
- ~300M documents
- Steady loading of ~10K documents every 10-30 seconds through bulk transactions
After crashing the setup more times than I could remember, I finally established a green cluster that’s been up for over 60 days (Amazons rolling restarts killed off the uptime before that) with no issues. The notes below were written as I learned Elasticsearch over a months period of time and were updated post-running a successful cluster.
RTFM and Free Documentation
This is usually the first thing anyone will tell you when dealing with a new system, but it’s especially true for Elasticsearch and their documentation actually helps (go figure). Many problems encountered during the development process could have been avoided had the write documentation been read first. Aside from the manual, read the list of sites below (entirely) before jumping into the systems deployment.
One thing worth mentioning when it comes to the documentation is that some of the best/worst practices may vary for your setup. For example, the documentation harps on paging when using the scroll searches. The documentation explains how paging is basically executing the same search over and over for all the shards, nomalizing and then returning back the subset based on your page size. It’s advised to use a different query pattern, but in practice, this is unlikely to be an issue if you are working with fairly small datasets. Read the documentation, but use it as guidance, not gospel.
Search != Database
Elasticsearch appears to have gained enough traction in some of the database communities to adopt it as a primary datastore. This concept makes a lot of sense especially if you are already throwing your entire document into Elasticsearch anyway. However, it’s worth noting that Elasticsearch was not built to be a database from the start, so certain concepts need to be translated. Unlike the world of databases, the searching community tends to use different words to describe their processes or concepts, so keep an eye out when reading documentation.
Additionally, if you are used to a database then you know all about large sweeping queries and tend to give them little thought when executing. Before making large queries within Elasticsearch, understanding of result counts needs to be considered. Depending on how you sharded your environment, this understanding needs to occur even more. Elasticsearch will always steal all the RAM on the system and if you are not careful, it’s extremely easy to fill the JVM heap which will then send the cluster into chaos. When possible, always filter your queries and use the count API to understand how much data may be returned (aka put on the heap) before actually running the query. These same concepts should also extend into some of the Elasticsearch specifics like aggregations. At the end of the day, you are smarter than Elasticsearch which is merely nothing more than your slave.
You just want to test your data out, right? Good, one node works fine. In fact, it’s best to stage everything in a local setup if at all possible. When actually setting up your production environment, one node is not enough and it’s best to start with three. One node is prone to being overloaded and it’s more of a pain to expand once that node is filled up rather then just starting with a few smaller nodes. Additionally, one node tends to skew the query/data size/complexity issues that arise as you “scale” your cluster. It’s good for experimentation, but best avoided.
Do it and do more than you need. You can NOT add more shards in the future. It’s one of the immutable aspects of an index. Take a look at the video mentioned in the first section for an idea of how to plan your shards. The talk covers a few different design patterns you could follow to help plan your shards. Though it depends on your setup, also consider hashing aspects of your data and creating an equal amount of shards per hexadecimal value. Doing so ensures balance and it should easily scale out as time goes on and fills with data.
This should really be called a schema because that’s essentially what it is and it also can NOT change later. When you start Elasticsearch, you can instantly throw data at the nodes and run some quick searches, but that is not going to cut it for any production system. You must sit down and work out a schema where your data is mapped properly to the different field types. Consider using multi-fields as much as possible as they allow you to experiment with your incoming data while not having to adjust your mapping. Once you have a mapping you think will work, test it and really test it. Have some mock data, run some queries and try to hit everything you are interested in querying later.
Not sure how to test your data? No problem, there are plug-ins to help you out. Install something like inquisitor (https://github.com/polyfractal/elasticsearch-inquisitor) to help you understand how your data is being indexed and what analyzer/tokenizer will work best for you. Even when you think you have a good mapping and have tested it a number of times, test again. If you need to change your mapping later, it requires a full re-index which means your data needs to go from one index, to a new one. This not only takes time and resources, but also introduces more complexity as you need to properly alias and account for changes inside of your applications.
You may never find the need to build your own analyzers, but you must understand how they work as it’s an integral process to Elasticsearch. When your data is sent to your index and marked as “analyzed”, it runs through a process that potentially changes how your data is stored. This is standard and best described in this blog – https://www.found.no/foundation/beginner-troubleshooting/.
Elasticsearch provides a client for almost every modern day programming language. You might fall in love with just using CURL which is fine for indexes, but once you get to larger data inputs, you can quickly run into HTTP throttling issues. The clients provided by Elasticsearch account for the small nuances of the system and ensure your data is going where it should. If you roll your own, that’s fine, but you will likely end up ditching it later on.
When a cluster of nodes come up, a master is elected and serves as the brain of the cluster. Lose your master, lose your brain. If you ever run into an OOM (out of memory) error, just hope it’s not on your master node. If it is, not a huge deal, but the rest of your cluster will lie to you and say everything is fine. If you suspect an issue with your cluster (data indexing is slowed or stopped, queries take forever, timeouts occur), check all your logs to identify which node died. If it happens to be the master, you will need to individually execute the commands you normally would across the cluster, on each machine. This means disabling reallocation and issuing a proper shutdown command for each node. Also, consider using multiple nodes that merely act as routers and contain no data. This helps to provide some stability to the cluster operations and helps to ensure if you did lose your master, that the data is not being sent all over the place while you attempt to fix the issue.
Unlike most systems, Elasticsearch does not like to be shutdown via the daemon commands. When running a cluster, there are a number of actions occurring without your knowledge. One of those actions happens to be the balancing of shards/replicas within the cluster itself. If you are planning to bring down all nodes, make sure to disable reallocation across the entire cluster. Without doing so, your cluster will begin moving shards to compensate for the shutdown nodes. Even worse, you could accidentally shutdown while this reallocation is happening causing your shards to be completely unassigned. Shutting down must be planned.
Local System Changes
If you are setting up a cluster, dedicate the machines to the cause completely. Consider doing the following items on each machine as part of your cluster:
- Disable swap on the local OS – http://askubuntu.com/questions/214805/how-do-i-disable-swap
- Force a memory lock within the ES configuration – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-service.html
- Allocate 50% of your RAM to the Java heap – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-service.html
- Name your nodes something that makes sense
- Add a mounted drive to store all your index data (and update the path settings)
- Throw a couple small scripts on each node to dump healthy and critical details (helpful if things fall apart)
- Adjust max file descriptors in your limits.conf and set variables to ensure they survive reboot – http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
- Change the field cache size within the ES configuration – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html
- Set circuit breakers to avoid OOM issues – http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html
Elasticseach in EC2
Running your cluster within EC2 requires a bit of modification to your elasticsearch configuration and AWS account. Below is a summarized process flow for getting your cluster to function properly within EC2.
- Setup a separate IAM user who has limited access
- Associate the IAM user with a group that has a custom policy allowing for the following – EC2:Describe*
- Take note of the application access and secret keys
- Create a custom security group (take note of the name because you will need it later) that allows for incoming rules of the following:
- All nodes to talk to each other through ports 9200-9400
- All 10.0.0.0/8 network to talk over the same ports (seems weird, but EC2 needs it when routing on public IP)
- Add your instances to the group and power them on
- Install the Elasticsearch AWS Cloud Plugin (https://github.com/elasticsearch/elasticsearch-cloud-aws)
- Edit your elasticsearch configuration to include the following:
- discovery.type: ec2
- discovery.zen.ping.multicast.enabled: false c. discovery.ec2.groups: <group_name>
- discovery.ec2.tag.group: <tag_name>
- discovery.ec2.host_type: public_ip
- plugin.mandatory: cloud-aws
- cloud.aws.access_key: <access_key>
- cloud.aws.secret_key: <secret_key>
- script.disable_dynamic: true
- network.publish_host: _ec2:publicIpv4
As noted above and by the configuration, these settings force Elasticsearch to use the publicly routable addresses for publishing and communication. Despite how this sounds, it appears Amazon doesn’t care to listen to your instructions and will still communicate over the private IP address space. In fact, if you disable the public IP setting and instead use the default of private IP, you will observe the exact opposite when monitoring traffic between nodes. To avoid errors when interacting with the cluster outside of AWS, it’s best to use the public IP address and just allow for the internal class A network to move freely between nodes. As long as your discovery is locked down to a specific group, this should not cause any issues. Also, double check your policies within AWS to avoid exposing your nodes out to the open internet.