Big Data Essentials Bootcamp

Overview

Big Data requires the right tools and skills, and this workshop takes you from zero to hero: it equips you with working knowledge of Hadoop, Spark, and NoSQL. With these three fundamentals, you will be able to build systems that process massive amounts of data in archival, batch, interactive, and real-time fashion. The workshop also lays the foundation for proper analytics, letting you extract insights from your data.

Objective

Build systems processing massive amounts of data (archival, batch, interactive, real‑time) and lay foundations for analytics to extract insights from data.

What You Will Learn

  • Hadoop: HDFS, MapReduce, Pig, Hive
  • Spark: Spark Core, Spark SQL, Spark Java API, Spark Streaming
  • NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling

Course Details

Audience: Developers

Duration: 5 days

Format: Lectures (50%) and hands‑on labs (50%)

Prerequisites:

  • Comfortable with the Java programming language (most programming exercises are in Java)
  • Comfortable in a Linux environment (able to navigate the Linux command line and edit files with vi/nano)

Setup:

  • Zero-install cloud lab
  • Working Hadoop/Spark/NoSQL clusters provided
  • SSH client (built into Linux and macOS; PuTTY on Windows)
  • Modern browser for cluster access

Detailed Outline

  • Introduction to Hadoop
      • Hadoop history, concepts, ecosystem, distributions
      • High-level architecture & myths
      • Hadoop challenges (hardware/software)
  • HDFS
      • Horizontal scaling, replication, data locality, rack awareness
      • Architecture (NameNode, Secondary NameNode, DataNode)
      • Data integrity
      • Future of HDFS: NameNode HA, Federation
      • Lab exercises (see the HDFS read sketch below)
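
For a first taste of the HDFS labs, the sketch below reads a file straight out of HDFS through Hadoop's Java FileSystem API. It is only a minimal illustration, not the course's lab code; the NameNode URI and file path are placeholders for your cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // Connect to the NameNode; the URI is a placeholder for your cluster.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // Open a file stored in HDFS and print it line by line.
            Path path = new Path("/data/sample.txt");
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
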
  • MapReduce
      • MapReduce concepts
      • Phases: driver, mapper, shuffle/sort, reducer
      • Thinking in MapReduce
      • Future of MapReduce (YARN)
      • Lab exercises (see the word-count job below)
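
The MapReduce module revolves around programs of exactly this shape. Below is the canonical word-count job in the Hadoop Java API (mapper, reducer, driver); the actual lab exercises may differ, and the input/output paths come from the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts for each word after the shuffle/sort phase.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: configure and submit the job.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
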
  • Pig
      • Pig vs. Java vs. MapReduce
      • Pig Latin language & UDFs
      • Understanding Pig job flow
      • Data analysis with Pig (basic & complex, multi-dataset, advanced)
      • Lab exercises (see the embedded-Pig sketch below)
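
Pig Latin can also be driven from Java, which fits a Java-centric class. The sketch below uses Pig's embedded PigServer API to run a word count in local mode; the file names and the pipeline itself are illustrative, not the course's actual exercises.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            // Run Pig Latin from Java: local mode here, MapReduce mode on a cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Each registerQuery call adds one Pig Latin statement to the plan.
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery(
                "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

            // STORE triggers execution of the whole pipeline.
            pig.store("counts", "wordcounts");
        }
    }
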
  • Hive
      • Hive concepts & architecture
      • Data types & Hive data management
      • Hive vs. SQL
      • Lab exercises (see the JDBC sketch below)
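
From Java, Hive is typically queried over JDBC against HiveServer2. A minimal sketch; the endpoint URL, credentials, and the words table are placeholders, not part of the course materials.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL; host, port, and database are placeholders.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "student", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL reads like SQL but executes as distributed jobs.
                ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM words " +
                    "GROUP BY word ORDER BY cnt DESC LIMIT 10");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
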
  • Spark
      • Spark basics, background & history
      • Spark and Hadoop
      • Spark concepts & architecture
      • Spark ecosystem (core, Spark SQL, MLlib, Streaming)
      • First look: local mode, web UI, shell
      • Analyzing a dataset – part 1
      • Inspecting RDDs, RDD partitions & operations
      • MapReduce on RDDs, caching & persistence
      • Sharing cached RDDs
      • Spark API programming (RDD API), submitting programs, debugging/logging, configuration properties (see the RDD sketch below)
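
For flavor, here is the same word count on the RDD API, assuming the Spark 2.x Java API; the input path and local[*] master are placeholders (on the course cluster you would package the class and run it with spark-submit).

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            // Local mode for experimentation; use spark-submit on a cluster.
            SparkConf conf = new SparkConf().setAppName("word count").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Transformations are lazy; nothing runs until an action is called.
            JavaRDD<String> lines = sc.textFile("input.txt");
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

            // cache() marks the RDD for in-memory reuse by later actions.
            counts.cache();
            counts.saveAsTextFile("counts");
            sc.stop();
        }
    }
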
  • Spark Streaming
      • Streaming overview & operations
      • Sliding-window operations
      • Writing Spark Streaming applications (see the windowed sketch below)
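
A minimal sketch of a windowed Spark Streaming job in Java: it counts words over a 60-second window that slides every 10 seconds. The socket source, host/port, and interval sizes are illustrative only.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws Exception {
            // Micro-batches every 5 seconds; at least two local threads needed.
            SparkConf conf = new SparkConf().setAppName("streaming").setMaster("local[2]");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Text from a socket; feed it with e.g. `nc -lk 9999` while testing.
            JavaReceiverInputDStream<String> lines =
                ssc.socketTextStream("localhost", 9999);

            // Count words over a sliding window (length 60 s, slide 10 s).
            JavaPairDStream<String, Integer> windowedCounts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKeyAndWindow(Integer::sum,
                                      Durations.seconds(60),   // window length
                                      Durations.seconds(10));  // slide interval

            windowedCounts.print();
            ssc.start();
            ssc.awaitTermination();
        }
    }
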
  • Introduction to Big Data / NoSQL
      • NoSQL overview, CAP theorem
      • When is NoSQL appropriate?
      • NoSQL ecosystem
  • Cassandra architecture
      • Nodes, clusters, datacenters
      • Keyspaces, tables, rows & columns
      • Partitioning, replication, tokens
      • Quorum & consistency levels (see the QUORUM read sketch below)
      • Labs
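
Consistency is chosen per statement in the DataStax Java driver. The sketch below issues a read at QUORUM, meaning a majority of replicas must respond; it assumes driver 3.x and a hypothetical demo keyspace with a users table.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class QuorumRead {
        public static void main(String[] args) {
            // One contact point suffices; the driver discovers the rest of the ring.
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // QUORUM: a majority of replicas must acknowledge the read.
                SimpleStatement stmt = new SimpleStatement(
                    "SELECT user_id, name FROM users WHERE user_id = ?", "alice");
                stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);

                ResultSet rs = session.execute(stmt);
                for (Row row : rs) {
                    System.out.println(row.getString("user_id") + " " +
                                       row.getString("name"));
                }
            }
        }
    }
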
  • Cassandra Java driver
      • Java driver introduction
      • CRUD operations via the Java client
      • Asynchronous queries
      • Labs (see the CRUD sketch below)
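
A minimal CRUD-plus-async sketch with the DataStax Java driver (3.x API assumed); the demo keyspace and users table are hypothetical, not part of the course materials.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CassandraCrud {
        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // Create: prepared statements are parsed once and reused.
                session.execute(
                    session.prepare("INSERT INTO users (user_id, name) VALUES (?, ?)")
                           .bind("alice", "Alice"));

                // Read.
                Row row = session.execute(
                    "SELECT name FROM users WHERE user_id = 'alice'").one();
                System.out.println(row.getString("name"));

                // Update and delete are plain CQL statements too.
                session.execute("UPDATE users SET name = 'Alice A.' WHERE user_id = 'alice'");
                session.execute("DELETE FROM users WHERE user_id = 'alice'");

                // Asynchronous query: returns a future instead of blocking.
                ResultSetFuture future = session.executeAsync("SELECT COUNT(*) FROM users");
                ResultSet rs = future.get();
                System.out.println(rs.one().getLong(0));
            }
        }
    }
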
  • CQL
      • Introduction to CQL & datatypes
      • Creating keyspaces & tables
      • Choosing columns, types & primary keys
      • Row/column layout, TTL, CRUD, querying & updates
      • Labs (see the schema sketch below)
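
A sketch of the CQL schema topics (keyspaces, tables, TTL) run through the driver; the keyspace name, replication settings, and table layout are illustrative only.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CqlSchema {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {

                // Keyspace: replication strategy and factor are fixed at creation.
                session.execute(
                    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
                    "{'class': 'SimpleStrategy', 'replication_factor': 3}");

                // Table: the primary key controls partitioning and sort order.
                session.execute(
                    "CREATE TABLE IF NOT EXISTS demo.users (" +
                    "  user_id text PRIMARY KEY," +
                    "  name text," +
                    "  email text)");

                // TTL: the inserted values expire automatically after one hour.
                session.execute(
                    "INSERT INTO demo.users (user_id, name) " +
                    "VALUES ('bob', 'Bob') USING TTL 3600");
            }
        }
    }
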
  • Cassandra data modeling
      • Secondary indexes
      • Denormalisation & join avoidance
      • Composite keys (partition & clustering keys)
      • Time-series data & best practices (see the time-series sketch below)
      • Counters & lightweight transactions (LWT)
      • Group design sessions with multiple use-cases
      • Implement 'Netflix' data models, generate data, analyse decisions
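
To close the data-modeling thread, here is a sketch of a typical time-series model with a composite key: the partition key decides which node holds the data, and the clustering key sorts rows within the partition. The table and column names are hypothetical.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class TimeSeriesModel {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {

                // Composite key: sensor_id = partition key (placement),
                // event_time = clustering key (sort order inside the partition).
                session.execute(
                    "CREATE TABLE IF NOT EXISTS readings (" +
                    "  sensor_id text," +
                    "  event_time timestamp," +
                    "  value double," +
                    "  PRIMARY KEY (sensor_id, event_time)" +
                    ") WITH CLUSTERING ORDER BY (event_time DESC)");

                // Newest readings for one sensor come back first, no index needed.
                for (Row row : session.execute(
                        "SELECT event_time, value FROM readings " +
                        "WHERE sensor_id = 'sensor-1' LIMIT 10")) {
                    System.out.println(row.getTimestamp("event_time") + " " +
                                       row.getDouble("value"));
                }
            }
        }
    }
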

Ready to Get Started?

Contact us to learn more about this course and schedule your training.