Costin Leau on Elasticsearch, BigData and Hadoop – InfoQ

By Costin Leau

Elasticsearch is an open source, distributed, real-time search and analytics engine for the cloud. It is built on the Apache Lucene search engine library and provides full text search capabilities, multi-language support, a query language, support for geolocation, context-aware did-you-mean suggestions, autocomplete and search snippets.

Elasticsearch exposes a RESTful API using JSON over HTTP for all of its operations, whether search, analytics or monitoring. In addition, native clients are available for languages such as Java, PHP, Perl, Python and Ruby. Elasticsearch is available for use under the Apache 2 license. The first milestone of elasticsearch-hadoop, 1.3.M1, was released in early October.

InfoQ spoke with Costin Leau from the Elasticsearch team about the search and analytics engine and how it integrates with Hadoop and other Big Data technologies.

InfoQ: Hi Costin, can you describe what Elasticsearch is and how it helps with Big Data requirements?

Elasticsearch is a scalable, highly-available, open-source search and analytics engine based on Apache Lucene. It makes it easy to “dig” through your data and to “zoom” in and out – all in real-time. At Elasticsearch, we’ve put a lot of work into delivering a good user experience out of the box. We set good defaults that make it easy to get started, but we also give you full access, when you need it, to customize virtually every aspect of the engine.

For example, you can use it to search your data, from the typical queries (‘find all items X that match Y’) to filtering (or “views” in Elasticsearch terms), highlighted search snippets which provide context for each result, geolocation (‘find all items within Z miles’), did-you-mean suggestions and powerful aggregations (Elasticsearch’s “facets”) such as date histograms or statistics.

Elasticsearch can both search and store your data. It offers a semi-structured, schema-free, JSON based model; you can just toss JSON documents at it and Elasticsearch will automatically detect your data types and index your documents, or you can customize the schema mapping to suit your purposes, e.g. boosting individual fields or documents, custom full text analysis, etc.
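As a rough illustration of what "automatically detect your data types" means, the sketch below walks a JSON document and guesses a type for each field. It is a deliberately simplified stand-in, not Elasticsearch's actual detection logic: the function name `infer_type`, the sample document, and the type labels are all assumptions made for this example.

```python
import json
from datetime import datetime


def infer_type(value):
    """Guess a field type from a decoded JSON value (toy rules, not Elasticsearch's)."""
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # ISO-8601 strings are treated as dates
            return "date"
        except ValueError:
            return "string"
    if isinstance(value, dict):  # nested object: infer each child field
        return {name: infer_type(child) for name, child in value.items()}
    if isinstance(value, list):  # arrays take the type of their first element
        return infer_type(value[0]) if value else "unknown"
    return "unknown"  # e.g. JSON null


# Hypothetical document, used only for illustration.
doc = json.loads("""
{
  "title": "Search and Hadoop",
  "views": 42,
  "rating": 4.5,
  "published": "2013-10-01T12:30:00",
  "draft": false,
  "author": {"name": "Costin"}
}
""")

mapping = infer_type(doc)
print(json.dumps(mapping, indent=2))
```

A customized schema mapping would replace these guesses with explicit choices per field, which is where boosts and custom analysis come in.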

You can start with a small instance on your laptop or take it to the cloud with tens or hundreds of instances, all with minimal changes. Elasticsearch will automatically scale horizontally and grow with your app.

It runs on the JVM and uses JSON over a RESTful HTTP interface, so any client/language can interact with it. There are a plethora of clients and framework integrations in various languages that provide native APIs and dedicated DSLs to minimize ‘friction’ and maximize performance.
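To make "JSON over a RESTful HTTP interface" concrete, here is a minimal sketch that builds a search request using only the Python standard library. The request is constructed but not sent, so it runs without a cluster. The node address (`localhost:9200`, the default HTTP port), the index name `articles`, and the field `title` are assumptions for this example.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: a node listening on the default port


def build_search_request(index, field, text):
    """Build (but do not send) a full text search request for one field."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("articles", "title", "hadoop")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Against a running node, the request could be sent with:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       results = json.load(response)
```

Because the whole exchange is plain HTTP and JSON, the same request can be issued from any language or from a command-line HTTP client.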

Elasticsearch is a great fit for “Big Data” because its scalable, distributed nature allows it to search – and store – vast amounts of information in near real-time. Through the Elasticsearch-Hadoop project, we are enabling Hadoop users (including Hive, Pig and Cascading) to enhance their workflow with a full-blown search engine. We give them a rich language to ask better questions in order to get clearer answers, significantly faster.

InfoQ: Elasticsearch is used for real-time full text search. Can you tell us how real-time full text search differs from traditional data search?

In layman’s terms, traditional search is a subset of full text search.

Search as implemented by most data stores is based on metadata or on parts of the original data; for efficiency reasons, a subset of data that is considered relevant is indexed (such as the entry id, name, etc.) and the rest is ignored. This results in a small index when compared to the data size, but one that doesn’t fully cover the data set. Full text search alleviates this problem by indexing and searching the entire corpus at the expense of an increased need for storage.

Traditional search is typically associated with structured data because it is easier for the user to know what is relevant and what is not; however, when you look at today’s requirements, most data is unstructured. Now, you store all data once and then, when necessary, look at it many times across several different formats and structures; a full-text search approach becomes mandatory in such cases, as you can no longer afford to just ignore data.

Elasticsearch supports both structured data search and full text search. It provides a wide variety of query options from keywords, Boolean queries, filters and fuzzy search just to name a few, all exposed via a rich query language.
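The query language mentioned above is expressed as JSON. The sketch below combines a full text `match`, a `fuzzy` clause and a structured `term` exclusion inside a Boolean query. The field names (`body`, `author`, `status`) and the helper `build_query` are hypothetical, and the exact clauses a given Elasticsearch version accepts should be checked against its documentation.

```python
import json


def build_query(text, author_guess, excluded_status):
    """Combine full text, fuzzy and structured clauses in one Boolean query."""
    return {
        "query": {
            "bool": {
                # full text: the document body must match the search text
                "must": [{"match": {"body": text}}],
                # fuzzy: tolerate a misspelled author name, boosting close matches
                "should": [{"fuzzy": {"author": author_guess}}],
                # structured: exclude documents with an exact status value
                "must_not": [{"term": {"status": excluded_status}}],
            }
        }
    }


query = build_query("hadoop integration", "costn", "draft")
print(json.dumps(query, indent=2))
```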

Note that Elasticsearch provides more than simple full text search with features such as:

Geolocation: Find results based on their location.
Aggregation/Facets: Aggregate your data as you query it, e.g. find the countries that visit your site for a certain article or the tags on a given day. As aggregations are computed in real-time, the aggregations change when queries change; in other words, you get immediate feedback on your data set.
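The point that aggregations follow the query can be shown with a small in-memory simulation. This is plain Python rather than an Elasticsearch call: the `VISITS` records and the `country_counts` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical page-visit records.
VISITS = [
    {"article": "es-hadoop", "country": "RO"},
    {"article": "es-hadoop", "country": "US"},
    {"article": "es-hadoop", "country": "US"},
    {"article": "spring-data", "country": "DE"},
    {"article": "spring-data", "country": "US"},
]


def country_counts(visits, article):
    """Count visits per country, restricted to the documents the query matches."""
    matching = (visit for visit in visits if visit["article"] == article)
    return Counter(visit["country"] for visit in matching)


# Changing the query changes the aggregation computed over it.
hadoop_counts = country_counts(VISITS, "es-hadoop")
spring_counts = country_counts(VISITS, "spring-data")
print(hadoop_counts.most_common())
print(spring_counts.most_common())
```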

InfoQ: What are the design considerations when using Elasticsearch?

Data is king so focus on that. In order for Elasticsearch to work with the data the way you want to, it needs to understand your ‘requirements’. While it can make best effort guesses about your data, your domain knowledge is invaluable in configuring your setup to support your requirements. It all boils down to data granularity or how the data is organized. To give you an example, take the logging case which seems to be quite common; it’s better to break down the logs into time periods – so you end up with an index per month, or per week or even per day, etc. – instead of having them all under one big index. This separation makes it easy to handle spikes in growth and the removal or archiving of old data.
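The index-per-period layout for logs can be sketched as a naming function plus a retention check. The `logs` prefix, the suffix formats, and the helper names `index_for` and `expired` are assumptions chosen for this example, not a convention required by Elasticsearch.

```python
from datetime import datetime


def index_for(prefix, timestamp, granularity="day"):
    """Return the name of the time-based index a log entry belongs to."""
    if granularity == "day":
        suffix = timestamp.strftime("%Y.%m.%d")
    elif granularity == "week":
        iso_year, iso_week, _ = timestamp.isocalendar()
        suffix = f"{iso_year}.w{iso_week:02d}"
    elif granularity == "month":
        suffix = timestamp.strftime("%Y.%m")
    else:
        raise ValueError(f"unsupported granularity: {granularity}")
    return f"{prefix}-{suffix}"


def expired(index_names, keep):
    """Return the oldest indices beyond the `keep` most recent ones.

    Relies on zero-padded date suffixes sorting chronologically.
    """
    names = sorted(index_names)
    return names[: max(len(names) - keep, 0)]


entry_time = datetime(2013, 10, 7, 9, 30)
print(index_for("logs", entry_time, "day"))
print(index_for("logs", entry_time, "month"))
print(expired(["logs-2013.10.05", "logs-2013.10.07", "logs-2013.10.06"], keep=2))
```

With this layout, archiving or removing old data means dropping whole indices rather than deleting individual documents from one large index.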

InfoQ: Can you discuss the design and architecture patterns supported by the Elasticsearch engine?

An index consists of multiple shards, each of which is a “mini” search engine in its own right; an index is really a virtual namespace which points at a number of shards. Having multiple shards makes it easy to scale out by just adding more nodes. Having replica shards – copies of each primary shard – provides high availability and increased read throughput.

Querying an index is a distributed operation, meaning Elasticsearch has to query one copy of each shard in the index and collate the results into a single result set. Querying multiple indices is just an extension of the same process. This approach allows for enormous flexibility when provisioning your data store.
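The scatter and gather steps described above can be sketched in a few lines. Each "shard" here is just a list of documents, and scoring is a naive term count rather than Lucene's relevance scoring; both are simplifications made for this example.

```python
import heapq
from itertools import islice


def shard_top_k(shard_docs, term, k):
    """Local phase: each shard scores its own documents and returns its best k."""
    hits = [(doc["text"].lower().split().count(term), doc["id"]) for doc in shard_docs]
    hits = [hit for hit in hits if hit[0] > 0]
    return heapq.nlargest(k, hits)  # sorted by score, highest first


def search(shards, term, k):
    """Coordinating phase: query every shard, then collate into one result set."""
    per_shard = [shard_top_k(shard, term, k) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)  # inputs are already sorted
    return list(islice(merged, k))


# Two hypothetical shards of the same index.
shards = [
    [{"id": "a", "text": "hadoop hadoop search"}, {"id": "b", "text": "search"}],
    [{"id": "c", "text": "hadoop"}, {"id": "d", "text": "hadoop hadoop hadoop"}],
]

results = search(shards, "hadoop", k=2)
print(results)
```

Querying several indices works the same way: the list of shards simply grows to cover every index involved.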

With the domain specific knowledge that a user has about their application, it is easy to optimize queries to only hit relevant shards. This can make the same hardware support even greater load.
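One way to hit only relevant shards is routing: documents that share a routing key land on the same shard, so a query carrying that key needs to visit one shard instead of all of them. The sketch below shows the idea with a hash modulo the number of shards. `zlib.crc32` is a stand-in chosen for this example, not the hash function Elasticsearch uses, and the shard count and key format are assumptions.

```python
import zlib

NUM_SHARDS = 5  # example value


def shard_for(routing_key, num_shards=NUM_SHARDS):
    """Map a routing key to a shard number (stand-in hash, for illustration)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


def shards_to_query(routing_key=None, num_shards=NUM_SHARDS):
    """Without a routing key every shard is queried; with one, a single shard."""
    if routing_key is None:
        return list(range(num_shards))
    return [shard_for(routing_key, num_shards)]


print(shards_to_query())                       # unrouted: all shards
print(shards_to_query(routing_key="user-42"))  # routed: one shard
```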

InfoQ: How does Elasticsearch support data scalability?

Elasticsearch has a distributed nature in order to be highly-available and scalable. From a top-level view, Elasticsearch stores documents (or data records) under indices (or collections). Each index is broken down into multiple pieces called shards; the bigger an index is, the more shards you want to allocate. (Don’t be afraid to overdo it, shards are cheap.) Shards are distributed evenly across an Elasticsearch cluster, depending on your settings and size, for two reasons:

For redundancy reasons: By default, Elasticsearch keeps one replica (copy) of each shard, so if a node goes down, there’s a backup ready to take its place.
For performance reasons: Each query is made on an index and is run in parallel across its shards. This workflow is the key component for improving performance; if things are slow, simply add more machines to the cluster, and Elasticsearch will automatically distribute the shards, and their queries, across the new nodes.
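The effect of adding nodes can be illustrated with a toy allocator that spreads primaries and replicas round-robin and never places two copies of the same shard on one node. Elasticsearch's real allocation logic is more involved; the `allocate` function, node names and shard counts below are assumptions for this sketch.

```python
def allocate(num_shards, num_replicas, nodes):
    """Spread shard copies over nodes, keeping copies of a shard on different nodes."""
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to separate the copies")
    layout = {node: [] for node in nodes}
    for shard in range(num_shards):
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


small = allocate(num_shards=5, num_replicas=1, nodes=["node-1", "node-2"])
large = allocate(
    num_shards=5,
    num_replicas=1,
    nodes=["node-1", "node-2", "node-3", "node-4", "node-5"],
)

# The same ten shard copies are spread thinner as the cluster grows.
print({node: len(copies) for node, copies in small.items()})
print({node: len(copies) for node, copies in large.items()})
```

In this example each node goes from holding five shard copies to two, so each query's per-shard work is shared by more machines.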

This approach gives organizations the freedom to scale both vertically (if a node is slow, upgrade the hardware) and horizontally (if a cluster is slow, add more nodes to increase its size).

InfoQ: What are the limitations or cautions of using this solution?

The main challenge that we see is with users moving from a SQL world to what you could call a contextual search one. For retrieving individual data entries (the typical get), things are still the same – specify the id and get the data back; however, when it comes to data exploration there are different constructs to be used, from the type of analysis performed to what type of search or matching algorithm is used, e.g. fuzzy queries.

InfoQ: Can you talk about the advantages of using Elasticsearch along with Hadoop technology?

Hadoop by design is a distributed, batch-oriented platform for processing large data sets. While it’s a very powerful tool, its batch nature means it takes some time to produce the results. Further, the user needs to code all operations from scratch. Libraries like Hive and Pig help, but don’t solve the problem completely; imagine reimplementing geolocation in Map/Reduce.

With Elasticsearch, you can leave search to the search engine and focus on the other parts, such as data transformation. The Elasticsearch-Hadoop project provides native integration with Hadoop so there is no gap for the user to bridge; we provide dedicated InputFormat and OutputFormat for vanilla Map/Reduce, Taps for reading and writing data in Cascading, and Storages for Pig and Hive so you can access Elasticsearch just as if the data were in HDFS.

Usually, data stores integrated into Hadoop tend to become a bottleneck due to the number of requests generated by the tasks running in the cluster for each job. The distributed nature of the Map/Reduce model fits really well on top of Elasticsearch because we can correlate the number of Map/Reduce tasks with the number of Elasticsearch shards for a particular query. So every time a query is run, the system dynamically generates a number of Hadoop splits proportional to the number of shards available so that the jobs are run in parallel. Your Hadoop cluster can scale alongside Elasticsearch, and vice-versa.

Furthermore, the integration enables cluster co-location by exposing shard information to Hadoop. Job tasks are run on the same machines as the Elasticsearch shards themselves, eliminating network traffic and improving performance through data locality. We actually recommend running Elasticsearch and Hadoop clusters on the same machines for this very reason, especially as they complement each other in terms of resource usage (IO vs. CPU).
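The relationship between shards, splits and data locality can be modeled in a few lines. This is a conceptual sketch only, not the elasticsearch-hadoop implementation: it assumes one split per shard, and the `Split` class, `make_splits`, `assign`, and the host names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """One unit of parallel work, tied to a single shard of the queried index."""
    index: str
    shard: int
    preferred_host: str  # the node currently holding that shard


def make_splits(index, shard_hosts):
    """Create one split per shard, so job parallelism follows the shard count."""
    return [Split(index, shard, host) for shard, host in sorted(shard_hosts.items())]


def assign(splits, worker_hosts):
    """Run each task next to its shard when possible, otherwise on any worker."""
    plan = {}
    for position, split in enumerate(splits):
        if split.preferred_host in worker_hosts:
            plan[split] = split.preferred_host  # data-local: no network hop
        else:
            plan[split] = worker_hosts[position % len(worker_hosts)]
    return plan


# Hypothetical cluster: three shards, Hadoop workers on two of the same machines.
shard_hosts = {0: "host-a", 1: "host-b", 2: "host-c"}
splits = make_splits("logs-2013.10", shard_hosts)
plan = assign(splits, worker_hosts=["host-a", "host-b"])

for split, host in plan.items():
    locality = "local" if host == split.preferred_host else "remote"
    print(f"shard {split.shard} -> task on {host} ({locality})")
```

When every shard's host also runs a Hadoop worker, all tasks in this model are local, which is the case the co-location recommendation aims for.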

Last but not least, Elasticsearch provides near real-time responses (think milliseconds) that significantly improve a Hadoop job’s execution and the cost associated with it, especially when running on ‘rented resources’ such as Amazon EMR.

InfoQ: Is there any integration between Spring Framework and Elasticsearch?

Yes, check out the Spring Data Elasticsearch project on GitHub. The project was started by our community members Biomed Central and we are happy to participate in the development process with them by using and improving it. The project provides the well-known Spring template as a high-level abstraction, as well as Repository support on top of Elasticsearch and extensive configuration through XML, JavaConfig and CDI. We are currently looking into aggregating existing integrations under the same umbrella, most notably David Pilato’s spring-elasticsearch.

About the Interviewee

Costin Leau is an engineer at ElasticSearch, currently working with NoSQL and Big Data technologies. An open-source veteran, Costin led various Spring projects and authored an OSGi specification.
