Big Data Essentials Bootcamp

Overview

Big Data requires the right tools and skills, and this workshop takes you from zero to hero: it equips you with working knowledge of Hadoop, Spark, and NoSQL. With these three fundamentals, you will be able to build systems that process massive amounts of data in archival, batch, interactive, and real-time fashion. The workshop also lays the foundation for proper analytics, letting you extract insights from your data.

Objective

Build systems processing massive amounts of data (archival, batch, interactive, real‑time) and lay foundations for analytics to extract insights from data.

What You Will Learn

  • Hadoop: HDFS, MapReduce, Pig, Hive
  • Spark: Spark Core, Spark SQL, Spark Java API, Spark Streaming
  • NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling

Course Details

Audience: Developers

Duration: 5 days

Format: Lectures (50%) and hands‑on labs (50%)

Prerequisites:

  • Comfortable with the Java programming language (most programming exercises are in Java)
  • Comfortable in a Linux environment (able to navigate the Linux command line and edit files with vi/nano)

Setup:

  • Zero-install cloud lab
  • Working Hadoop/Spark/NoSQL clusters provided
  • SSH client (built into Linux and macOS; PuTTY on Windows)
  • Modern browser for cluster access

Detailed Outline

  • Introduction to Hadoop
      • Hadoop history, concepts, ecosystem, distributions
      • High-level architecture & myths
      • Hadoop challenges (hardware/software)
  • HDFS
      • Horizontal scaling, replication, data locality, rack awareness
      • Architecture (NameNode, Secondary NameNode, DataNode)
      • Data integrity
      • Future of HDFS: NameNode HA, Federation
      • Lab exercises (see the HDFS read sketch below)
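
For a first taste of the HDFS labs, the sketch below reads a file straight out of HDFS through Hadoop's Java FileSystem API. It is only a minimal illustration, not the course's lab code; the NameNode URI and file path are placeholders for your cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // Connect to the NameNode; the URI is a placeholder for your cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // Open a file stored in HDFS and print it line by line.
            Path path = new Path("/data/sample.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
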
  • MapReduce
      • MapReduce concepts
      • Phases: driver, mapper, shuffle/sort, reducer
      • Thinking in MapReduce
      • Future of MapReduce (YARN)
      • Lab exercises (see the word-count job below)
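
The MapReduce module revolves around programs of exactly this shape. Below is the canonical word-count job in the Hadoop Java API (mapper, reducer, driver); the actual lab exercises may differ, and the input/output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts for each word after the shuffle/sort phase.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: configure and submit the job.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
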
  • Pig
      • Pig vs. Java vs. MapReduce
      • Pig Latin language & UDFs
      • Understanding Pig job flow
      • Data analysis with Pig (basic & complex, multi-dataset, advanced)
      • Lab exercises (see the embedded-Pig sketch below)
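
Pig Latin can also be driven from Java, which fits a Java-centric class. The sketch below uses Pig's embedded PigServer API to run a word count in local mode; the file names and the pipeline itself are illustrative, not the course's actual exercises.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // Run Pig Latin from Java: local mode here, MapReduce mode on a cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Each registerQuery call adds one Pig Latin statement to the plan.
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery(
                "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

            // STORE triggers execution of the whole pipeline.
            pig.store("counts", "wordcounts");
        }
    }
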
  • Hive
      • Hive concepts & architecture
      • Data types & Hive data management
      • Hive vs. SQL
      • Lab exercises (see the JDBC sketch below)
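
From Java, Hive is typically queried over JDBC against HiveServer2. A minimal sketch; the endpoint URL, credentials, and the words table are placeholders, not part of the course materials.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL; host, port, and database are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "student", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL reads like SQL but executes as distributed jobs.
                ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM words " +
                    "GROUP BY word ORDER BY cnt DESC LIMIT 10");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
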
  • Spark
      • Spark basics, background & history
      • Spark and Hadoop
      • Spark concepts & architecture
      • Spark ecosystem (core, Spark SQL, MLlib, Streaming)
      • First look: local mode, web UI, shell
      • Analyzing a dataset – part 1
      • Inspecting RDDs, RDD partitions & operations
      • MapReduce on RDDs, caching & persistence
      • Sharing cached RDDs
      • Spark API programming (RDD API), submitting programs, debugging/logging, configuration properties (see the RDD sketch below)
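
For flavor, here is the same word count on the RDD API, assuming the Spark 2.x Java API; the input path and local[*] master are placeholders (on the course cluster you would package the class and run it with spark-submit).

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // Local mode for experimentation; use spark-submit on a cluster.
            SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Transformations are lazy; nothing runs until an action is called.
            JavaRDD<String> lines = sc.textFile("input.txt");
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

            // cache() marks the RDD for in-memory reuse by later actions.
            counts.cache();
            counts.saveAsTextFile("counts");
            sc.stop();
        }
    }
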
  • Spark Streaming
      • Streaming overview & operations
      • Sliding-window operations
      • Writing Spark Streaming applications (see the windowed sketch below)
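
A minimal sketch of a windowed Spark Streaming job in Java: it counts words over a 60-second window that slides every 10 seconds. The socket source, host/port, and interval sizes are illustrative only.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws Exception {
            // Micro-batches every 5 seconds; at least two local threads needed.
            SparkConf conf = new SparkConf().setAppName("streaming").setMaster("local[2]");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Text from a socket; feed it with e.g. `nc -lk 9999` while testing.
            JavaReceiverInputDStream<String> lines =
                ssc.socketTextStream("localhost", 9999);

            // Count words over a sliding window (length 60 s, slide 10 s).
            JavaPairDStream<String, Integer> windowedCounts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKeyAndWindow(Integer::sum,
                                      Durations.seconds(60),   // window length
                                      Durations.seconds(10));  // slide interval

            windowedCounts.print();
            ssc.start();
            ssc.awaitTermination();
        }
    }
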
  • Introduction to Big Data / NoSQL
      • NoSQL overview, CAP theorem
      • When is NoSQL appropriate?
      • NoSQL ecosystem
  • Cassandra architecture
      • Nodes, clusters, datacenters
      • Keyspaces, tables, rows & columns
      • Partitioning, replication, tokens
      • Quorum & consistency levels (see the QUORUM read sketch below)
      • Labs
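
Consistency is chosen per statement in the DataStax Java driver. The sketch below issues a read at QUORUM, meaning a majority of replicas must respond; it assumes driver 3.x and a hypothetical demo keyspace with a users table.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class QuorumRead {
        public static void main(String[] args) {
            // One contact point suffices; the driver discovers the rest of the ring.
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // QUORUM: a majority of replicas must acknowledge the read.
                SimpleStatement stmt = new SimpleStatement(
                    "SELECT user_id, name FROM users WHERE user_id = ?", "alice");
                stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);

                ResultSet rs = session.execute(stmt);
                for (Row row : rs) {
                    System.out.println(row.getString("user_id") + " " +
                                       row.getString("name"));
                }
            }
        }
    }
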
  • Cassandra Java driver
      • Java driver introduction
      • CRUD operations via the Java client
      • Asynchronous queries
      • Labs (see the CRUD sketch below)
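
A minimal CRUD-plus-async sketch with the DataStax Java driver (3.x API assumed); the demo keyspace and users table are hypothetical, not part of the course materials.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CassandraCrud {
        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // Create: prepared statements are parsed once and reused.
                session.execute(
                    session.prepare("INSERT INTO users (user_id, name) VALUES (?, ?)")
                           .bind("alice", "Alice"));

                // Read.
                Row row = session.execute(
                    "SELECT name FROM users WHERE user_id = 'alice'").one();
                System.out.println(row.getString("name"));

                // Update and delete are plain CQL statements too.
                session.execute("UPDATE users SET name = 'Alice A.' WHERE user_id = 'alice'");
                session.execute("DELETE FROM users WHERE user_id = 'alice'");

                // Asynchronous query: returns a future instead of blocking.
                ResultSetFuture future = session.executeAsync("SELECT COUNT(*) FROM users");
                ResultSet rs = future.get();
                System.out.println(rs.one().getLong(0));
            }
        }
    }
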
  • CQL
      • Introduction to CQL & datatypes
      • Creating keyspaces & tables
      • Choosing columns, types & primary keys
      • Row/column layout, TTL, CRUD, querying & updates
      • Labs (see the schema sketch below)
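
A sketch of the CQL schema topics (keyspaces, tables, TTL) run through the driver; the keyspace name, replication settings, and table layout are illustrative only.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CqlSchema {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {

                // Keyspace: replication strategy and factor are fixed at creation.
                session.execute(
                    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
                    "{'class': 'SimpleStrategy', 'replication_factor': 3}");

                // Table: the primary key controls partitioning and sort order.
                session.execute(
                    "CREATE TABLE IF NOT EXISTS demo.users (" +
                    "  user_id text PRIMARY KEY," +
                    "  name text," +
                    "  email text)");

                // TTL: the inserted values expire automatically after one hour.
                session.execute(
                    "INSERT INTO demo.users (user_id, name) " +
                    "VALUES ('bob', 'Bob') USING TTL 3600");
            }
        }
    }
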
  • Cassandra data modeling
      • Secondary indexes
      • Denormalisation & join avoidance
      • Composite keys (partition & clustering keys)
      • Time-series data & best practices (see the time-series sketch below)
      • Counters & lightweight transactions (LWT)
      • Group design sessions with multiple use-cases
      • Implement 'Netflix' data models, generate data, analyse decisions
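
To close the data-modeling thread, here is a sketch of a typical time-series model with a composite key: the partition key decides which node holds the data, and the clustering key sorts rows within the partition. The table and column names are hypothetical.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class TimeSeriesModel {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // Composite key: sensor_id = partition key (placement),
                // event_time = clustering key (sort order inside the partition).
                session.execute(
                    "CREATE TABLE IF NOT EXISTS readings (" +
                    "  sensor_id text," +
                    "  event_time timestamp," +
                    "  value double," +
                    "  PRIMARY KEY (sensor_id, event_time)" +
                    ") WITH CLUSTERING ORDER BY (event_time DESC)");

                // Newest readings for one sensor come back first, no index needed.
                for (Row row : session.execute(
                        "SELECT event_time, value FROM readings " +
                        "WHERE sensor_id = 'sensor-1' LIMIT 10")) {
                    System.out.println(row.getTimestamp("event_time") + " " +
                                       row.getDouble("value"));
                }
            }
        }
    }
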

Ready to Get Started?

Contact us to learn more about this course and schedule your training.