By Brad Anderson, Liaison Technologies
Big data is an umbrella term for a multitude of new capabilities for storage and computation at scale. These capabilities allow organizations to store massive amounts of data, in disparate formats, and to perform both batch and real-time analyses on it.
The forces driving big data into the mainstream are the ever-decreasing cost of storage and processing, coupled with open-source implementations of distributed-systems techniques. Companies have realized that data storage is on the verge of being limitless and that they no longer need to be as judicious about what kinds of data they keep. This realization has led to the storage of all manner of data beyond the traditional structured data found in relational databases: unstructured or semi-structured data such as emails, social media feeds, clickstreams, sensor readings, videos and more. The questions companies ask of their data to extract value have also become more complex, yet the time window for completing an analysis has stayed the same or shrunk; it is the massively parallel computation big data systems provide that makes this possible.
To organize all this data, structured and unstructured, new tools have emerged. We no longer have just one hammer in our toolkit, the relational database, with which to fashion data. A myriad of systems now exists, built largely by Internet giants (Google, Yahoo!, Amazon, Facebook and LinkedIn, to name a few) that needed to scale the storage and compute tasks confronting them, and so created new data systems for those specific use cases. The most common characteristics of these systems are that they are not row-based relational databases, they scale horizontally and near-linearly (just add more nodes), and they expect components and nodes to fail, so they are fault tolerant.
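A common technique behind "just add more nodes" scalability is consistent hashing, which many of these systems use to decide which node owns which key, so that adding a node relocates only a fraction of the data. Here is a minimal, single-process sketch of the idea; the node names, key names and virtual-node count are hypothetical, chosen purely for illustration:

```python
import hashlib
from bisect import bisect

class HashRing:
    """A toy consistent-hashing ring: keys and nodes are hashed onto the
    same circle, and each key is owned by the next node clockwise."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash_position, node_name)
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node, vnodes=100):
        # Each physical node gets many virtual positions, which evens
        # out the load across the ring.
        for i in range(vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def get_node(self, key):
        if not self.ring:
            raise ValueError("ring is empty")
        h = self._hash(key)
        # First ring position at or after the key's hash (wrap around).
        idx = bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

# Three hypothetical nodes own three hypothetical keys.
ring = HashRing(["node-a", "node-b", "node-c"])
owner_before = {k: ring.get_node(k) for k in ("user:1", "user:2", "user:3")}

# Scale horizontally: adding a fourth node moves only the keys that
# now hash closest to it; the rest keep their original owner.
ring.add_node("node-d")
owner_after = {k: ring.get_node(k) for k in owner_before}
moved = sum(owner_before[k] != owner_after[k] for k in owner_before)
```

With plain modulo hashing (`hash(key) % num_nodes`), adding a node would remap nearly every key; the ring keeps that churn proportional to the capacity added, which is what makes incremental node addition practical.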
Most of these tools have been open-sourced and, years later, are being adopted by other businesses that recognize their value. Examples of these non-relational systems include key-value stores (e.g. Riak), document databases (e.g. CouchDB, MongoDB), columnar databases (e.g. HBase, Cassandra), graph databases (e.g. Titan, Neo4j), distributed queues (e.g. Kafka, Kestrel) and spatial databases. There are also new computing frameworks: Hadoop and Spark allow immense amounts of data to be processed in batch, while Storm, Samza and Spark Streaming analyze data in near real time, something that was previously possible only with supercomputers.
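The batch model Hadoop popularized is MapReduce: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group, with each phase spread across many nodes. The classic word-count example below is a toy, single-process sketch of that flow (the input lines are made up for illustration), not cluster code:

```python
from collections import defaultdict

# Stand-in for input split across many nodes in a real cluster.
lines = ["big data systems scale", "data systems tolerate failure"]

# Map phase: emit a (word, 1) pair for every word seen.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group emitted values by key, as the framework would
# when routing pairs to reducers.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each key's values; here, sum the counts.
word_counts = {word: sum(counts) for word, counts in groups.items()}
```

The point of the model is that the map and reduce steps are independent per key, so a framework can run them in parallel across a cluster and rerun any piece that fails on another node, which is how these systems get both scale and fault tolerance.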
The ability to store unlimited amounts of disparate data in order to perform endless analysis in batch and real time is the allure of big data. But is it for everyone? I’ll address that in my next post as we begin to dive into use cases.