Cartoon : IoT and Big Data

[Cartoon: 2015-10__streamingpoker]

The ‘Internet of Things’ or ‘Connected Things’ (or whatever the buzzword of the day is 🙂) might be the ‘killer app’ for Big Data.
The amount of data that would need to be ingested and analyzed makes this a prime candidate for Big Data technologies.

When we think of Big Data, we usually think of Hadoop. However, Hadoop may not be the best fit for IoT. Let’s remember that Hadoop was developed to analyze large amounts of data in batch mode, and it does that very well.

IoT workloads are very different.  Here are some differences:

  • Most IoT data comes in as ‘streams’ (continuously incoming data). Hadoop is designed to handle ‘data on disk’.
  • IoT events need to be analyzed in ‘real time’. Depending on the application, real time can mean milliseconds or seconds. Either way, this kind of low latency is not achievable with Hadoop and MapReduce.
  • IoT events may be queried in real time (e.g. the 10 latest events from a sensor).
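
To make the last two points concrete, here is a minimal Python sketch of the ‘10 latest events from a sensor’ query over a stream. It is only an illustration and is not how any of the frameworks listed below work internally: the LatestEvents class, the sensor_stream generator and the (sensor_id, timestamp, value) event shape are all hypothetical names made up for this example. Events are handled one at a time as they arrive, and only a small in-memory buffer per sensor is kept, so a query never has to scan data on disk.

```python
from collections import defaultdict, deque


class LatestEvents:
    """Hypothetical in-memory store keeping the newest `capacity` events per sensor."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        # A deque with maxlen silently drops the oldest entry when it is full.
        self._buffers = defaultdict(lambda: deque(maxlen=capacity))

    def ingest(self, sensor_id, timestamp, value):
        """Handle one incoming event as it arrives."""
        self._buffers[sensor_id].append((timestamp, value))

    def latest(self, sensor_id, n=10):
        """Return up to `n` most recent events for a sensor, newest first."""
        if n <= 0 or sensor_id not in self._buffers:
            return []
        events = list(self._buffers[sensor_id])
        return events[-n:][::-1]


def sensor_stream():
    """Stand-in for a real event stream: 25 made-up readings from two sensors."""
    for i in range(25):
        sensor_id = "sensor-1" if i % 2 == 0 else "sensor-2"
        yield sensor_id, i, 20.0 + i


if __name__ == "__main__":
    store = LatestEvents(capacity=10)
    for sensor_id, timestamp, value in sensor_stream():
        store.ingest(sensor_id, timestamp, value)
    # The three newest events seen so far for sensor-1.
    print(store.latest("sensor-1", 3))
```

A real deployment would also need durability, distribution across machines and handling of out-of-order events, which is where the technologies below come in.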

We are seeing new technologies emerge to handle IoT-specific workloads. Here are a few:

Storm
Storm was probably the first ‘stream processing platform’. It came out of Twitter and is now an Apache project.
Storm site

Apache Samza
Framework built on Kafka and YARN
Samza site

Spark Streaming
Streaming framework built on the popular Spark framework
Spark site

Apache Flink
Flink is a recent framework focused on streaming workloads.
Flink site

Apache NiFi
NiFi was developed by the NSA (National Security Agency) and later open sourced.
Onyara, a company started by NiFi committers, was acquired by Hortonworks.
Hortonworks is releasing a Data Flow product based on NiFi.
NiFi website

Kudu
Kudu is developed by Cloudera. It is a storage system suited to real-time updates of data.
Kudu site