Hadoop+Spark

© Elephant Scale

June 09, 2021

Hadoop is a mature Big Data environment, and Hive is the de-facto standard SQL interface to it. Today, computations in Hadoop are usually done with Spark, which offers an optimized compute engine covering batch processing, real-time streaming, and machine learning.

This course covers Hadoop 3, Hive 3, and Spark 3.

Duration:

  • 5 days

Audience:

  • Business analysts, software developers, managers

Prerequisites:

  • Basics of SQL
  • Exposure to software design
  • Basics of Python

Lab environment:

  • Working environment provided in the browser
  • Zero Install: There is no need to install software on students’ machines.

Course Outline

Why Hadoop?

* The motivation for Hadoop
* Use cases and case studies about Hadoop

The Hadoop platform

* MapReduce, HDFS, YARN
* New in Hadoop 3
    * Erasure Coding vs 3x replication

Hive Basics

* Defining Hive Tables
* SQL Queries over Structured Data
* Filtering / Search
* Aggregations / Ordering
* Partitions
* Joins
* Text Analytics (Semi-Structured Data)
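
As a taste of the lab material, here is a minimal sketch of the kind of HiveQL covered in this module, run here through a Hive-enabled PySpark session; the table and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

# A Hive-enabled Spark session; assumes the lab cluster exposes the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-basics-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Define a partitioned Hive table (names are illustrative only).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        item   STRING,
        amount DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Filtering, aggregation and ordering over structured data.
spark.sql("""
    SELECT item, SUM(amount) AS total
    FROM sales
    WHERE sale_date >= '2021-01-01'
    GROUP BY item
    ORDER BY total DESC
""").show()
```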

New in Hive 3

* ACID tables
* Hive Query Language (HQL)
    * How to run a good query?
    * How to troubleshoot queries? 

HBase

* Basics
* HBase tables – design and use
* Phoenix driver for HBase tables
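
For a rough feel of programmatic HBase access, the sketch below uses the happybase Python client; it assumes the lab cluster runs the HBase Thrift gateway, and the table and column names are hypothetical. Phoenix adds a SQL layer over the same tables via JDBC.

```python
import happybase

# Connect to the HBase Thrift server (the host is an assumption about the lab setup).
connection = happybase.Connection('localhost')

table = connection.table('web_events')          # hypothetical table

# Write one cell: row key -> column family 'metrics', qualifier 'clicks'.
table.put(b'user-0001', {b'metrics:clicks': b'42'})

# Read it back.
row = table.row(b'user-0001')
print(row[b'metrics:clicks'])
```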

Sqoop

* The Sqoop tool
* Sqoop architecture
* Importing and exporting data with Sqoop

The big picture

* How Hadoop fits into your architecture
* Hive vs HBase with Phoenix vs Excel

Spark Introduction

  • Big Data, Hadoop, Spark
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing a dataset – part 1
  • Labs: Spark shell exploration
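
The snippet below sketches the kind of exploration done in the shell lab; it assumes an interactive pyspark session (where `spark` and `sc` already exist) and an illustrative sample file path.

```python
# Inside the pyspark shell, `spark` (SparkSession) and `sc` (SparkContext) already exist.
lines = sc.textFile("data/sample.txt")   # illustrative path

print(lines.count())                     # action: how many lines?
print(lines.first())                     # action: peek at the first line

# The Spark web UI (http://localhost:4040 by default) shows the jobs these actions trigger.
```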

Spark Data structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs
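
A minimal word-count sketch illustrating partitions, lazy transformations, and actions; the input path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy; nothing runs until an action is called.
words = (sc.textFile("data/sample.txt", minPartitions=4)   # illustrative path
           .flatMap(lambda line: line.split())
           .map(lambda w: (w.lower(), 1))
           .reduceByKey(lambda a, b: a + b))

print(words.getNumPartitions())   # how the data is split across the cluster
print(words.take(5))              # action: triggers the distributed execution
```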

Caching

  • Caching overview
  • Various caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance
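
A small sketch of the caching mechanisms compared in the benchmark lab; the dataset path is illustrative.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

df = spark.read.csv("data/events.csv", header=True, inferSchema=True)  # illustrative path

# Default caching for DataFrames is MEMORY_AND_DISK.
df.cache()
df.count()          # the first action materializes the cache

# Other storage levels trade memory for recomputation or spill to disk.
df.unpersist()
df.persist(StorageLevel.MEMORY_ONLY)
df.count()
```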

DataFrames and Datasets

  • DataFrames Intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Using schema
  • Specifying schema for DataFrames
  • Labs: DataFrames, Datasets, Schema
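
A minimal sketch of loading JSON with an inferred schema and CSV with an explicit schema; file paths and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframes-sketch").getOrCreate()

# Let Spark infer the schema from a JSON file (illustrative path).
people_json = spark.read.json("data/people.json")
people_json.printSchema()

# Or specify the schema explicitly for a CSV file (column names are made up).
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age",  IntegerType(), True),
])
people_csv = spark.read.csv("data/people.csv", header=True, schema=schema)
people_csv.show()
```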

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: querying structured data using SQL; evaluating data formats
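
A short sketch of registering a DataFrame as a SQL view, querying it, and writing the result in columnar formats; paths and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

df = spark.read.json("data/people.json")          # illustrative dataset
df.createOrReplaceTempView("people")              # expose the DataFrame to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
adults.show()

# Write the same data back in columnar formats for comparison.
adults.write.mode("overwrite").parquet("out/people_parquet")
adults.write.mode("overwrite").orc("out/people_orc")
```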

Spark and Hadoop

  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark & Hive
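
A sketch of Spark reading directly from HDFS and querying a Hive table; the paths and table name are illustrative, and it assumes the session was launched against YARN with Hive support enabled.

```python
from pyspark.sql import SparkSession

# On a YARN cluster the session is typically created by spark-submit with --master yarn;
# enableHiveSupport() lets Spark read tables defined in the Hive metastore.
spark = (SparkSession.builder
         .appName("spark-on-hadoop-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read files directly from HDFS (the path is illustrative).
logs = spark.read.text("hdfs:///data/logs/2021/06/09/")
print(logs.count())

# Query a Hive table from Spark (the table name is illustrative).
spark.sql("SELECT COUNT(*) FROM sales").show()
```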

Spark API

  • Overview of Spark APIs in Scala / Python
  • The life cycle of a Spark application
  • Spark APIs
  • Deploying Spark applications on YARN
  • Labs: Developing and deploying a Spark application
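
A minimal application skeleton of the kind developed in this lab, showing the life cycle from session creation to spark.stop(); the input path and column name are illustrative, and the spark-submit line in the comment is one typical way to deploy it on YARN.

```python
from pyspark.sql import SparkSession

def main():
    # spark-submit supplies the master (e.g. --master yarn) and deploy mode.
    spark = SparkSession.builder.appName("my-spark-app").getOrCreate()

    df = spark.read.csv("hdfs:///data/input.csv", header=True, inferSchema=True)
    df.groupBy("category").count().write.mode("overwrite").parquet("hdfs:///data/output")

    spark.stop()   # end of the application's life cycle

if __name__ == "__main__":
    # Typically launched with something like:
    #   spark-submit --master yarn --deploy-mode cluster my_app.py
    main()
```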

Spark ML Overview

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (the newer, DataFrame-based API)
  • Algorithms overview: Clustering, Classifications, Recommendations
  • Labs: Writing ML applications in Spark
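
A small sketch of a spark.ml clustering job of the kind written in the lab; the data values are made up so the example is self-contained.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

# A tiny in-memory dataset (values are made up for illustration).
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)],
    ["x", "y"])

# spark.ml algorithms expect a single vector column of features.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()
```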

GraphX

  • GraphX library overview
  • GraphX APIs
  • Creating a graph and navigating it
  • Shortest distance
  • Pregel API
  • Labs: Processing graph data using Spark
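
GraphX itself is a Scala API; as a Python-flavored analogue, the sketch below uses the separate GraphFrames package (assumed to be available in the lab environment) to build a graph and compute shortest distances.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame   # assumes the graphframes package is installed

spark = SparkSession.builder.appName("graph-sketch").getOrCreate()

# Vertices and edges as DataFrames (ids and relationships are made up).
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                                 ["id", "name"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# Shortest distance from every vertex to the landmark "a".
g.shortestPaths(landmarks=["a"]).show()
```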

Spark Streaming

  • Streaming concepts
  • Evaluating Streaming platforms
  • Spark streaming library overview
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Continuous processing mode
  • Spark & Kafka streaming
  • Labs: Writing Spark streaming applications
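
A minimal Structured Streaming sketch with a sliding-window word count; the socket source and host/port are illustrative (a Kafka source would swap in readStream.format("kafka")).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a text stream from a socket (host/port are illustrative; `nc -lk 9999` works for testing).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Count words over a 1-minute sliding window that advances every 30 seconds.
counts = (lines
          .selectExpr("explode(split(value, ' ')) AS word", "current_timestamp() AS ts")
          .groupBy(window(col("ts"), "1 minute", "30 seconds"), col("word"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```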

Workshops (Time permitting)

  • These are group workshops
  • Attendees will work on solving real-world data analysis problems using Spark