Big Data Essentials Bootcamp


Overview

Working with Big Data requires the right tools and skills, and this workshop takes students "from zero to hero" by providing the necessary knowledge of Hadoop, Spark, and NoSQL. With these three fundamentals, you will be able to build systems that process massive amounts of data in archival, batch, interactive, and real-time fashion. The workshop also lays the foundations for proper analytics, allowing you to extract insights from data.

What You Will Learn:

  • Hadoop: HDFS, MapReduce, Pig, Hive
  • Spark: Spark core, SparkSQL, Spark Java API, Spark Streaming
  • NoSQL: Cassandra/HBase architecture, Java API, drivers, data modeling

Audience: developers

Duration: 5 days

Format: lectures (50%) and hands-on labs (50%).

Prerequisites

  • comfortable with the Java programming language (most programming exercises are in Java)
  • comfortable in a Linux environment (able to navigate the Linux command line, edit files using vi / nano)

Lab environment

Zero Install: there is no need to install Hadoop, Spark, or other software on students’ machines. Working clusters and environments will be provided for students.

Students will need the following:

  • an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster

Detailed outline

Hadoop

  • Introduction to Hadoop
    Hadoop history, concepts
    ecosystem
    distributions
    High-level architecture
    Hadoop myths
    Hadoop challenges
    hardware / software

  • HDFS Overview
    concepts (horizontal scaling, replication, data locality, rack awareness)
    architecture (NameNode, Secondary NameNode, DataNode)
    data integrity
    future of HDFS: NameNode HA, Federation
    lab exercises

  • MapReduce Overview
    MapReduce concepts
    phases: driver, mapper, shuffle/sort, reducer
    thinking in MapReduce
    future of MapReduce (YARN)
    lab exercises

  • Pig
    Pig vs. Java vs. MapReduce
    Pig Latin language
    user-defined functions
    understanding Pig job flow
    basic data analysis with Pig
    complex data analysis with Pig
    multiple datasets with Pig
    advanced concepts
    lab exercises

  • Hive
    Hive concepts
    architecture
    data types
    Hive data management
    Hive vs. SQL
    lab exercises

Spark

  • Spark Basics
    background and history
    Spark and Hadoop
    Spark concepts and architecture
    Spark ecosystem (core, Spark SQL, MLlib, streaming)
    first look at Spark
    Spark in local mode
    Spark web UI
    Spark shell
    analyzing a dataset – part 1
    inspecting RDDs

  • RDDs in Depth
    partitions
    RDD operations / transformations
    RDD types
    MapReduce on RDDs
    caching and persistence
    sharing cached RDDs

  • Spark API Programming
    introduction to the Spark API / RDD API
    submitting the first program to Spark
    debugging / logging
    configuration properties
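The MapReduce and RDD modules share one core idea: a map phase that emits key/value pairs, a shuffle/sort that groups by key, and a reduce phase that aggregates each group. A minimal sketch of that flow in plain Java (standard library only, no Hadoop or Spark dependencies; class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Map phase: split each line into words (the "keys").
    // Shuffle/sort + reduce phase: group identical words and count each group.
    static Map<String, Long> wordCount(String[] lines) {
        return Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        String[] lines = { "to be or not to be", "to do is to be" };
        Map<String, Long> counts = wordCount(lines);
        System.out.println(counts.get("to")); // 4
        System.out.println(counts.get("be")); // 3
    }
}
```

On a real cluster the same word count would be written as a Hadoop MapReduce job or a Spark RDD pipeline (flatMap, then reduceByKey), with the framework handling the shuffle across machines.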

Spark Streaming

Streaming overview
Streaming operations
Sliding window operations
Writing Spark Streaming applications
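A sliding window re-aggregates the last N micro-batches each time the window slides forward. A minimal sketch of that window arithmetic in plain Java (no Spark Streaming dependency; the batch counts and window size are invented for illustration):

```java
import java.util.Arrays;

public class SlidingWindowSketch {

    // Total events per window of `windowSize` consecutive batches,
    // sliding forward one batch at a time.
    static int[] windowCounts(int[] batchCounts, int windowSize) {
        int n = batchCounts.length - windowSize + 1;
        int[] out = new int[Math.max(n, 0)];
        for (int i = 0; i < out.length; i++) {
            int sum = 0;
            for (int j = i; j < i + windowSize; j++) sum += batchCounts[j];
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Events arriving per 1-second micro-batch; window = 3 s, slide = 1 s.
        int[] perBatch = { 2, 5, 1, 0, 4 };
        System.out.println(Arrays.toString(windowCounts(perBatch, 3))); // [8, 6, 5]
    }
}
```

In Spark Streaming the equivalent is a windowed operation over a DStream, where the framework maintains the window state for you.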

NoSQL

Introduction to Big Data / NoSQL
NoSQL overview
CAP theorem
When is NoSQL appropriate?
NoSQL ecosystem
Cassandra Basics
Cassandra nodes, clusters, datacenters
Keyspaces, tables, rows and columns
Partitioning, replication, tokens
Quorum and consistency levels
Labs
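The quorum and consistency-level material reduces to two small formulas: a QUORUM is a majority of replicas, floor(RF / 2) + 1, and a read is guaranteed to overlap the most recent acknowledged write when read replicas plus write replicas exceed the replication factor (R + W > RF). A minimal sketch of that arithmetic:

```java
public class QuorumSketch {

    // QUORUM = a majority of replicas: floor(RF / 2) + 1
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads are guaranteed to see the latest acknowledged write
    // when the read and write replica sets must overlap: R + W > RF.
    static boolean stronglyConsistent(int r, int w, int rf) {
        return r + w > rf;
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));                   // 2
        System.out.println(stronglyConsistent(2, 2, 3)); // true  (QUORUM reads + QUORUM writes)
        System.out.println(stronglyConsistent(1, 1, 3)); // false (ONE reads + ONE writes)
    }
}
```

So with RF = 3, QUORUM on both reads and writes (2 + 2 > 3) gives strong consistency, while ONE/ONE (1 + 1 = 2) trades consistency for latency.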

Cassandra drivers
Introduction to Java driver
CRUD (Create / Read / Update / Delete) operations using the Java client
Asynchronous queries
Labs

Data Modeling – part 1
Introduction to CQL
CQL data types
Creating keyspaces & tables
Choosing columns and types
Choosing primary keys
Data layout for rows and columns
Time to live (TTL), create, insert, update
Querying with CQL
CQL updates
Labs
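As a hypothetical illustration of the CQL topics in this module (the keyspace, table, column names, and TTL value are invented for the example, not taken from the course materials):

```cql
-- Keyspace with replication factor 3
CREATE KEYSPACE IF NOT EXISTS training
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Simple table: username is both primary key and partition key
CREATE TABLE IF NOT EXISTS training.users (
  username text PRIMARY KEY,
  email    text,
  age      int
);

-- Insert with a time-to-live: the row expires after 86400 seconds (1 day)
INSERT INTO training.users (username, email, age)
  VALUES ('alice', 'alice@example.com', 30) USING TTL 86400;

UPDATE training.users SET email = 'a@example.com' WHERE username = 'alice';
SELECT * FROM training.users WHERE username = 'alice';
```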

Data Modeling – part 2
Creating and using secondary indexes
Denormalization and join avoidance
Composite keys (partition keys and clustering keys)
Time series data
Best practices for time series data
Counters
Lightweight transactions (LWT)
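A hypothetical time-series table tying the part-2 topics together: a composite primary key whose partition key bounds partition size (one partition per sensor per day) and whose clustering key orders readings within the partition (all names and values are illustrative, not from the course materials):

```cql
CREATE TABLE IF NOT EXISTS training.readings (
  sensor_id    text,
  day          date,
  reading_time timestamp,
  value        double,
  PRIMARY KEY ((sensor_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Newest-first clustering makes "latest N readings" queries cheap
SELECT * FROM training.readings
  WHERE sensor_id = 's-1' AND day = '2024-01-15' LIMIT 10;
```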

Data Modeling Labs: group design sessions
Multiple use cases from various domains are presented
Students work in groups to come up with designs and models
Discuss various designs, analyze decisions
Lab: implement the ‘Netflix’ data models, generate data