Big Data is the latest buzzword, and as such everybody is using it as they see fit. This has inevitably created some confusion. In this blog post we will try to clear it up.
Big data is about two things: large sets of typically unstructured data and some relatively new techniques to deal with this kind of data. To get a good perspective we need to start by reviewing relational databases.
In general, the traditional way of handling data is with a relational database. When a database is used for online transaction processing, we tend to see a separate setup for analytics. This is commonly known as a data warehouse, and it provides processing relief for the main database. It is usually paired with an analytical, or so-called business intelligence, tool. Large relational databases tend to be expensive propositions, as the costs of the processing units and the disks are very high.
Relational databases are based on what’s called “early structure binding”: you have to know in advance what questions will be asked of the database so that you can design the schema, tables and relations. Any new question that doesn’t fit this schema requires a schema modification, which usually implies a fair amount of time and good technical skills.
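To make “early structure binding” concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table and column names are made up for illustration: the schema is designed around the questions we expect to ask, and a new question forces a schema change first.

```python
import sqlite3

# A hypothetical schema, designed around the questions we planned to ask.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
conn.execute("INSERT INTO sales (product, amount) VALUES ('widget', 9.99)")

# The schema answers "total sales per product" directly...
total = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"
).fetchone()
print(total)  # ('widget', 9.99)

# ...but a new question, "sales per region", cannot be answered until
# the schema itself is modified to carry the new field.
conn.execute("ALTER TABLE sales ADD COLUMN region TEXT")
```

In a toy example the ALTER TABLE is trivial, but on a large production schema the same kind of change ripples through indexes, ETL jobs and reports, which is where the time and skill cost comes from.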
These restrictions of relational databases can be considered the price to pay for having a fully transactional system, that is, one that fully complies with the ACID properties (atomicity, consistency, isolation and durability).
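A quick sketch of what “fully transactional” buys you, again with sqlite3 and a hypothetical accounts table. Atomicity, the A in ACID, means a group of changes either all happen or none do: here a simulated failure between two updates rolls back the partial debit.

```python
import sqlite3

# Hypothetical data: a transfer of 50 from alice to bob.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 0.0)])
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    raise RuntimeError("simulated failure mid-transaction")
    # the matching credit to bob is never reached
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100.0, 'bob': 0.0}
```

Providing this guarantee at scale is exactly what makes large relational setups complex and expensive, and it is the guarantee big data tools are willing to give up.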
Let’s move on to the actual “big data”. It can be broken down into two parts. The first is what we call our digital footprint: all our emails, the blogs we read and possibly write, tweets, Foursquare check-ins, Facebook entries, and so on. The second part is machine data, such as the log files generated by all the computers supporting that digital footprint. There is also plenty of other machine data, such as the sensor readings behind real-time flight tracking.
Most of this data is unstructured, which can be loosely defined as a variable number of fields of variable size, some of which may or may not be present. Big data also tends to be large, very large. Just think of the access logs of a popular web site, which can grow by a few megabytes per day, or even per hour. Additionally, this data tends not to be mission critical, and in general it does not require the guarantees offered by a fully transactional system. After all, most of the time all we do with it is run some analytics.
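A small sketch of what “variable fields that may or may not be present” looks like in practice. The log lines below are made-up examples in a common-log-like format: the second line has no response size or referrer, so a parser has to treat those fields as optional rather than rely on a fixed schema.

```python
import re

# Made-up web access log lines; note the second has no size or referrer.
lines = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36] "GET /index.html HTTP/1.1" 200 2326 "http://example.com"',
    '198.51.100.7 - - [10/Oct/2023:13:55:40] "POST /api/login HTTP/1.1" 401 -',
]

# The trailing groups are optional: absent fields come back as None.
pattern = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]+)" (\d+)(?: (\S+))?(?: "([^"]*)")?'
)

records = []
for line in lines:
    m = pattern.match(line)
    if m:
        ip, ts, request, status, size, referrer = m.groups()
        records.append((ip, status, referrer))

print(records[1])  # referrer is None when the field is absent
```

Forcing data like this into rigid relational columns is possible but awkward, which is why it is usually kept as raw text and interpreted at read time instead.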
Now that we have a better understanding of big data, let’s flip the relational database paradigm: from centralized, high-performance, fully transactional processing to distributed processing with higher latency that might comply with just one or two of the ACID properties, and sometimes none.
Big data tools such as Hadoop and Splunk are based on this other paradigm: distributed processing of data that is itself distributed. These tools are designed to run on commodity hardware, and are resilient enough to handle the failures expected from cheap hardware. In exchange, they have higher latency when processing the data, and they have dropped support for many (or all) of the ACID properties. Think of it as the price to pay for dealing with very large amounts of unstructured data.
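The core pattern behind tools like Hadoop is map/shuffle/reduce. Here is a single-machine sketch of that pattern, counting words; the “chunks” stand in for data blocks that would normally sit on different nodes, and the text is made up. This illustrates the shape of the computation, not Hadoop’s actual API.

```python
from collections import defaultdict
from itertools import chain

# Stand-ins for data blocks stored on separate commodity machines.
chunks = ["big data is large", "large data is unstructured"]

def map_phase(chunk):
    # each node emits (word, 1) pairs from its local block
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # group pairs by key, as if routed between nodes over the network
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # each reducer sums the counts for the keys it owns
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(chain.from_iterable(map_phase(c) for c in chunks)))
print(counts["data"])  # 2
```

Because each map task only ever touches its own block, the work spreads across cheap machines and a failed task can simply be rerun elsewhere, which is where both the resilience and the extra latency come from.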
This is what big data is all about, a different paradigm for processing data.
One last thought: these big data tools can also handle structured data, which can also be small, so don’t assume they are limited to large unstructured data sets.
In the next blog post we will explain the big data tools and their underlying techniques in more detail.