Big Data Essentials Bootcamp
Overview
Big Data requires the right tools and skills. This workshop takes you “from zero to hero”: it gives students a working knowledge of Hadoop, Spark, and NoSQL. With these three fundamentals, you will be able to build systems that process massive amounts of data in archival, batch, interactive, and real-time fashion. The workshop also lays the foundation for analytics, enabling you to extract insights from your data.
What You Will Learn:
- Hadoop: HDFS, MapReduce, Pig, Hive
- Spark: Spark core, Spark SQL, Spark Java API, Spark Streaming
- NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling
Audience
Developers
Duration
5 days
Format
Lectures (50%) and hands-on labs (50%).
Prerequisites
- Comfortable with the Java programming language (most programming exercises are in Java)
- Comfortable in a Linux environment (able to navigate the Linux command line, edit files with vi or nano)
Lab environment
Zero install: there is no need to install Hadoop, Spark, or other Big Data software on students’ machines. Working clusters and environments will be provided for students.
Students will need the following:
- An SSH client (Linux and macOS already include one; PuTTY is recommended for Windows)
- A browser to access the cluster
Detailed outline
Hadoop
- Introduction to Hadoop
- Hadoop history and concepts
- Ecosystem
- Distributions
- High-level architecture
- Hadoop myths
- Hadoop challenges
- Hardware / software
HDFS Overview
- Concepts (horizontal scaling, replication, data locality, rack awareness)
- Architecture (NameNode, Secondary NameNode, DataNode)
- Data integrity
- Future of HDFS: NameNode HA, Federation
- Lab exercises (starter sketch below)
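To preview the style of the HDFS labs, here is a minimal, hedged sketch using the Hadoop FileSystem Java API; the file path is a hypothetical example, and the cluster address is assumed to come from the core-site.xml provided on the lab cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster's core-site.xml (assumed).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path in the student's home directory.
        Path file = new Path("/user/student/hello.txt");

        // Write a small file; HDFS replicates its blocks across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");
        }

        // Read it back; the NameNode supplies the block locations.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
```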
MapReduce Overview
- MapReduce concepts
- Phases: driver, mapper, shuffle/sort, reducer
- Thinking in MapReduce
- Future of MapReduce (YARN)
- Lab exercises (word-count sketch below)
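As a taste of the MapReduce labs, a minimal word-count sketch against the classic Hadoop Java API; the class names and I/O paths are illustrative, not the course’s exact lab code. Note how it breaks into the driver, mapper, shuffle/sort, and reducer phases listed above.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
    }

    // Reducer: the shuffle/sort phase groups counts by word; sum them here.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    // Driver: wires the mapper, reducer, and I/O paths together.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```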
Pig
- Pig vs. Java vs. MapReduce
- The Pig Latin language
- User-defined functions
- Understanding Pig job flow
- Basic data analysis with Pig
- Complex data analysis with Pig
- Working with multiple datasets in Pig
- Advanced concepts
- Lab exercises (UDF sketch below)
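Since the module covers user-defined functions, here is a hedged sketch of a simple Java eval UDF (the class name is an assumption). In Pig Latin it would be registered and invoked along the lines of `REGISTER myudfs.jar;` and `B = FOREACH A GENERATE UpperCase(name);`, with the jar and field names hypothetical.

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Minimal Pig eval UDF: upper-cases the first field of the input tuple.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;  // Pig treats null as missing data
        }
        return input.get(0).toString().toUpperCase();
    }
}
```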
Hive
- Hive concepts
- Architecture
- Data types
- Hive data management
- Hive vs. SQL
- Lab exercises (JDBC sketch below)
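To show how the Hive labs can be driven from Java, a minimal sketch using the HiveServer2 JDBC driver; the connection URL, credentials, and the `movies` table are assumptions about the lab cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Ensure the Hive JDBC driver is registered (jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint on the lab cluster.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "student", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT title, release_year FROM movies LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("title") + "\t" + rs.getInt("release_year"));
            }
        }
    }
}
```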
Spark
- Spark Basics
- Background and history
- Spark and Hadoop
- Spark concepts and architecture
- Spark ecosystem (core, Spark SQL, MLlib, streaming)
- First look at Spark
- Spark in local mode
- Spark web UI
- Spark shell
- Analyzing a dataset – part 1
- Inspecting RDDs
- RDDs In Depth
- Partitions
- RDD Operations / transformations
- RDD types
- MapReduce on RDD
- Caching and persistence
- Sharing cached RDDs
- Spark API programming
- Introduction to Spark API / RDD API
- Submitting the first program to Spark (example below)
- Debugging/logging
- Configuration properties
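Pulling the Spark API topics together, here is a minimal first program using the Spark 2.x-style Java API, run in local mode as in the early labs; the input path is an assumption.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode with all cores; in the labs this would target the cluster.
        SparkConf conf = new SparkConf().setAppName("word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("data/input.txt");  // assumed path

            // Transformations only build the lineage; nothing runs until an action.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            counts.cache();  // keep in memory if the RDD is reused

            // 'collect' is the action that triggers the computation.
            for (Tuple2<String, Integer> t : counts.collect()) {
                System.out.println(t._1() + ": " + t._2());
            }
        }
    }
}
```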
Spark Streaming
- Streaming overview
- Streaming operations
- Sliding window operations
- Writing Spark Streaming applications (sketch below)
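A hedged sketch of a Spark Streaming application with a sliding window (Spark 2.x-style Java API); the socket source, fed for example by `nc -lk 9999`, and the window/slide durations are illustrative choices.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        // At least two threads: one for the receiver, one for processing.
        SparkConf conf = new SparkConf().setAppName("streaming-wc").setMaster("local[2]");

        // Micro-batches every 5 seconds.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Hypothetical source: a plain text socket on the lab host.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Word counts over a 30-second window, sliding every 10 seconds.
        JavaPairDStream<String, Integer> windowedCounts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKeyAndWindow((a, b) -> a + b,
                        Durations.seconds(30),   // window length
                        Durations.seconds(10));  // slide interval

        windowedCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```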
NoSQL
- Introduction to Big Data / NoSQL
- NoSQL overview
- CAP theorem
- When is NoSQL appropriate?
- NoSQL ecosystem
Cassandra Basics
- Cassandra nodes, clusters, datacenters
- Keyspaces, tables, rows, and columns
- Partitioning, replication, tokens
- Quorum and consistency levels
- Labs (quorum-read sketch below)
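To make consistency levels concrete, a minimal sketch of a QUORUM read using the DataStax Java driver (3.x-style API); the contact point, keyspace, and `users` table are assumptions.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class QuorumRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // QUORUM: a majority of replicas (e.g. 2 of 3 with RF=3) must answer,
            // trading a little latency for stronger consistency.
            Statement stmt = new SimpleStatement("SELECT name FROM users WHERE user_id = 42")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);

            ResultSet rs = session.execute(stmt);
            System.out.println(rs.one().getString("name"));
        }
    }
}
```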
Cassandra drivers
- Introduction to Java driver
- CRUD (create / read / update / delete) operations using the Java client
- Asynchronous queries
- Labs (CRUD sketch below)
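A minimal CRUD sketch using the DataStax Java driver (3.x-style API), assuming a `demo` keyspace with a `users(user_id int PRIMARY KEY, name text)` table.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraCrud {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // Create
            session.execute("INSERT INTO users (user_id, name) VALUES (?, ?)", 42, "Alice");

            // Read
            Row row = session.execute(
                    "SELECT user_id, name FROM users WHERE user_id = ?", 42).one();
            System.out.println(row.getInt("user_id") + " -> " + row.getString("name"));

            // Update (in CQL, an upsert on the same primary key)
            session.execute("UPDATE users SET name = ? WHERE user_id = ?", "Alicia", 42);

            // Delete
            session.execute("DELETE FROM users WHERE user_id = ?", 42);
        }
    }
}
```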
Data Modeling – part 1
- Introduction to CQL
- CQL data types
- Creating keyspaces and tables
- Choosing columns and types
- Choosing primary keys
- Data layout for rows and columns
- Time to live (TTL): create, insert, update
- Querying with CQL
- CQL updates
- Labs (schema sketch below)
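A hedged sketch of creating a keyspace and table from CQL and inserting a row with a TTL, executed through the Java driver; the replication settings and schema are illustrative.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SchemaSetup {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // Keyspace with simple replication; factor 3 is an assumed lab setting.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

            // user_id is the partition key, session_id the clustering key.
            session.execute(
                "CREATE TABLE IF NOT EXISTS demo.sessions ("
              + "  user_id int, session_id uuid, started_at timestamp,"
              + "  PRIMARY KEY (user_id, session_id))");

            // TTL: the row expires automatically after one hour.
            session.execute(
                "INSERT INTO demo.sessions (user_id, session_id, started_at) "
              + "VALUES (42, uuid(), toTimestamp(now())) USING TTL 3600");
        }
    }
}
```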
Data Modeling – part 2
- Creating and using secondary indexes
- Denormalization and join avoidance
- Composite keys (partition keys and clustering keys)
- Time series data
- Best practices for time series data
- Counters
- Lightweight transactions (LWT) – example below
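A minimal sketch of a lightweight transaction issued through the Java driver, reusing the assumed `demo.users` table from the earlier examples.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class LwtExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            // IF NOT EXISTS turns the insert into a Paxos-backed compare-and-set,
            // so it is noticeably more expensive than a plain write.
            ResultSet rs = session.execute(
                "INSERT INTO users (user_id, name) VALUES (?, ?) IF NOT EXISTS",
                42, "Alice");

            // wasApplied() reports whether the condition held.
            System.out.println(rs.wasApplied() ? "created" : "already exists");
        }
    }
}
```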
Data Modeling Labs: Group design sessions
- Multiple use cases from various domains are presented
- Students work in groups to come up with designs and models
- Groups discuss the various designs and analyze the decisions behind them
- Lab: implement ‘Netflix’ data models, generate data