Big Data Essentials Bootcamp
Overview
Big Data requires the right tools and skills. This workshop takes you “from zero to hero”: it gives students a working knowledge of Hadoop, Spark, and NoSQL. With these three fundamentals, you will be able to build systems that process massive amounts of data in archival, batch, interactive, and real-time fashion. The workshop also lays the foundation for analytics, enabling you to extract insights from your data.
What You Will Learn:
- Hadoop: HDFS, MapReduce, Pig, Hive
- Spark: Spark core, Spark SQL, Spark Java API, Spark Streaming
- NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling
Audience
Developers
Duration
5 days
Format
Lectures (50%) and hands-on labs (50%).
Prerequisites
- Comfortable with the Java programming language (most programming exercises are in Java)
- Comfortable in a Linux environment (able to navigate the Linux command line, edit files with vi or nano)
Lab environment
Zero install: there is no need to install Hadoop, Spark, or other Big Data software on students’ machines. Working clusters and environments will be provided for students.
Students will need the following:
- An SSH client (Linux and macOS already include one; PuTTY is recommended for Windows)
- A browser to access the cluster
Detailed outline
Hadoop
- Introduction to Hadoop
- Hadoop history and concepts
- Ecosystem
- Distributions
- High-level architecture
- Hadoop myths
- Hadoop challenges
- Hardware / software
HDFS Overview
- Concepts (horizontal scaling, replication, data locality, rack awareness)
- Architecture (NameNode, Secondary NameNode, DataNode)
- Data integrity
- Future of HDFS: NameNode HA, Federation
- Lab exercises (starter sketch below)
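To preview the style of the HDFS labs, here is a minimal, hedged sketch using the Hadoop FileSystem Java API; the file path is a hypothetical example, and the cluster address is assumed to come from the core-site.xml provided on the lab cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster's core-site.xml (assumed).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path in the student's home directory.
        Path file = new Path("/user/student/hello.txt");

        // Write a small file; HDFS replicates its blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read it back; the NameNode supplies the block locations.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```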
MapReduce Overview
- MapReduce concepts
- Phases: driver, mapper, shuffle/sort, reducer
- Thinking in MapReduce
- Future of MapReduce (YARN)
- Lab exercises (word-count sketch below)
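As a taste of the MapReduce labs, a minimal word-count sketch against the classic Hadoop Java API; the class names and I/O paths are illustrative, not the course’s exact lab code. Note how it breaks into the driver, mapper, shuffle/sort, and reducer phases listed above.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reducer: the shuffle/sort phase groups counts by word; sum them here.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires the mapper, reducer, and I/O paths together.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```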
Pig
- Pig vs. Java vs. MapReduce
- The Pig Latin language
- User-defined functions
- Understanding Pig job flow
- Basic data analysis with Pig
- Complex data analysis with Pig
- Working with multiple datasets in Pig
- Advanced concepts
- Lab exercises (UDF sketch below)
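Since the module covers user-defined functions, here is a hedged sketch of a simple Java eval UDF (the class name is an assumption). In Pig Latin it would be registered and invoked along the lines of `REGISTER myudfs.jar;` and `B = FOREACH A GENERATE UpperCase(name);`, with the jar and field names hypothetical.

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Minimal Pig eval UDF: upper-cases the first field of the input tuple.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;  // Pig treats null as missing data
        }
        return input.get(0).toString().toUpperCase();
    }
}
```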
Hive
- Hive concepts
- Architecture
- Data types
- Hive data management
- Hive vs. SQL
- Lab exercises (JDBC sketch below)
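To show how the Hive labs can be driven from Java, a minimal sketch using the HiveServer2 JDBC driver; the connection URL, credentials, and the `movies` table are assumptions about the lab cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Ensure the Hive JDBC driver is registered (jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint on the lab cluster.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "student", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT title, release_year FROM movies LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("title") + "\t" + rs.getInt("release_year"));
            }
        }
    }
}
```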
Spark
- Spark Basics
- Background and history
- Spark and Hadoop
- Spark concepts and architecture
- Spark ecosystem (core, Spark SQL, MLlib, streaming)
- First look at Spark
- Spark in local mode
- Spark web UI
- Spark shell
- Analyzing a dataset – part 1
- Inspecting RDDs
- RDDs In Depth
- Partitions
- RDD Operations / transformations
- RDD types
- MapReduce on RDD
- Caching and persistence
- Sharing cached RDDs
- Spark API programming
- Introduction to Spark API / RDD API
- Submitting the first program to Spark (example below)
- Debugging/logging
- Configuration properties
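Pulling the Spark API topics together, here is a minimal first program using the Spark 2.x-style Java API, run in local mode as in the early labs; the input path is an assumption.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode with all cores; in the labs this would target the cluster.
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("data/input.txt");  // assumed path

            // Transformations only build the lineage; nothing runs until an action.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.cache();  // keep in memory if the RDD is reused

            // 'collect' is the action that triggers the computation.
            for (Tuple2<String, Integer> t : counts.collect()) {
                System.out.println(t._1() + ": " + t._2());
            }
        }
    }
}
```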
Spark Streaming
- Streaming overview
- Streaming operations
- Sliding window operations
- Writing Spark Streaming applications (sketch below)
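A hedged sketch of a Spark Streaming application with a sliding window (Spark 2.x-style Java API); the socket source, fed for example by `nc -lk 9999`, and the window/slide durations are illustrative choices.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        // At least two threads: one for the receiver, one for processing.
        SparkConf conf = new SparkConf().setAppName("streaming-wc").setMaster("local[2]");

        // Micro-batches every 5 seconds.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical source: a plain text socket on the lab host.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Word counts over a 30-second window, sliding every 10 seconds.
        JavaPairDStream<String, Integer> windowedCounts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKeyAndWindow((a, b) -> a + b,
                        Durations.seconds(30),   // window length
                        Durations.seconds(10));  // slide interval

        windowedCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```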
NoSQL
- Introduction to Big Data / NoSQL
- NoSQL overview
- CAP theorem
- When is NoSQL appropriate?
- NoSQL ecosystem
Cassandra Basics
- Cassandra nodes, clusters, datacenters
- Keyspaces, tables, rows, and columns
- Partitioning, replication, tokens
- Quorum and consistency levels
- Labs (quorum-read sketch below)
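To make consistency levels concrete, a minimal sketch of a QUORUM read using the DataStax Java driver (3.x-style API); the contact point, keyspace, and `users` table are assumptions.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class QuorumRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // QUORUM: a majority of replicas (e.g. 2 of 3 with RF=3) must answer,
            // trading a little latency for stronger consistency.
            Statement stmt = new SimpleStatement("SELECT name FROM users WHERE user_id = 42")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);

            ResultSet rs = session.execute(stmt);
            System.out.println(rs.one().getString("name"));
        }
    }
}
```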
Cassandra drivers
- Introduction to Java driver
- CRUD (create / read / update / delete) operations using the Java client
- Asynchronous queries
- Labs (CRUD sketch below)
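A minimal CRUD sketch using the DataStax Java driver (3.x-style API), assuming a `demo` keyspace with a `users(user_id int PRIMARY KEY, name text)` table.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraCrud {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // Create
            session.execute("INSERT INTO users (user_id, name) VALUES (?, ?)", 42, "Alice");

            // Read
            Row row = session.execute(
                    "SELECT user_id, name FROM users WHERE user_id = ?", 42).one();
            System.out.println(row.getInt("user_id") + " -> " + row.getString("name"));

            // Update (in CQL, an upsert on the same primary key)
            session.execute("UPDATE users SET name = ? WHERE user_id = ?", "Alicia", 42);

            // Delete
            session.execute("DELETE FROM users WHERE user_id = ?", 42);
        }
    }
}
```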
Data Modeling – part 1
- Introduction to CQL
- CQL data types
- Creating keyspaces and tables
- Choosing columns and types
- Choosing primary keys
- Data layout for rows and columns
- Time to live (TTL): create, insert, update
- Querying with CQL
- CQL updates
- Labs (schema sketch below)
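A hedged sketch of creating a keyspace and table from CQL and inserting a row with a TTL, executed through the Java driver; the replication settings and schema are illustrative.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaSetup {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Keyspace with simple replication; factor 3 is an assumed lab setting.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

            // user_id is the partition key, session_id the clustering key.
            session.execute(
                "CREATE TABLE IF NOT EXISTS demo.sessions ("
              + "  user_id int, session_id uuid, started_at timestamp,"
              + "  PRIMARY KEY (user_id, session_id))");

            // TTL: the row expires automatically after one hour.
            session.execute(
                "INSERT INTO demo.sessions (user_id, session_id, started_at) "
              + "VALUES (42, uuid(), toTimestamp(now())) USING TTL 3600");
        }
    }
}
```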
Data Modeling – part 2
- Creating and using secondary indexes
- Denormalization and join avoidance
- Composite keys (partition keys and clustering keys)
- Time series data
- Best practices for time series data
- Counters
- Lightweight transactions (LWT) – example below
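A minimal sketch of a lightweight transaction issued through the Java driver, reusing the assumed `demo.users` table from the earlier examples.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class LwtExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // IF NOT EXISTS turns the insert into a Paxos-backed compare-and-set,
            // so it is noticeably more expensive than a plain write.
            ResultSet rs = session.execute(
                "INSERT INTO users (user_id, name) VALUES (?, ?) IF NOT EXISTS",
                42, "Alice");

            // wasApplied() reports whether the condition held.
            System.out.println(rs.wasApplied() ? "created" : "already exists");
        }
    }
}
```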
Data Modeling Labs: Group design sessions
- Multiple use cases from various domains are presented
- Students work in groups to come up with designs and models
- Groups discuss the various designs and analyze the decisions behind them
- Lab: implement ‘Netflix’ data models, generate data