Hadoop+Spark

© Elephant Scale

June 09, 2021

Hadoop is a mature Big Data environment, and Hive is the de-facto standard SQL interface to it. Today, computations in Hadoop are usually done with Spark, which offers an optimized compute engine covering batch processing, real-time streaming, and machine learning.

This course covers Hadoop 3, Hive 3, and Spark 3.

Duration:

  • 5 days

Audience:

  • Business analysts, software developers, managers

Prerequisites:

  • Basics of SQL
  • Exposure to software design
  • Basics of Python

Lab environment:

  • Working environment provided in the browser
  • Zero Install: There is no need to install software on students’ machines.

Course Outline

Why Hadoop?

* The motivation for Hadoop
* Use cases and case studies about Hadoop

The Hadoop platform

* MapReduce, HDFS, YARN
* New in Hadoop 3
    * Erasure Coding vs 3x replication

Hive Basics

* Defining Hive Tables
* SQL Queries over Structured Data
* Filtering / Search
* Aggregations / Ordering
* Partitions
* Joins
* Text Analytics (Semi-Structured Data)
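
As a taste of the lab material, here is a minimal sketch of the kind of HiveQL covered in this module, run here through a Hive-enabled PySpark session; the table and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

# A Hive-enabled Spark session; assumes the lab cluster exposes the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-basics-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Define a partitioned Hive table (names are illustrative only).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        item   STRING,
        amount DOUBLE
    )
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")

# Filtering, aggregation and ordering over structured data.
spark.sql("""
    SELECT item, SUM(amount) AS total
    FROM sales
    WHERE sale_date >= '2021-01-01'
    GROUP BY item
    ORDER BY total DESC
""").show()
```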

New in Hive 3

* ACID tables
* Hive Query Language (HQL)
    * How to run a good query?
    * How to troubleshoot queries? 

HBase

* Basics
* HBase tables – design and use
* Phoenix driver for HBase tables
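
For a rough feel of programmatic HBase access, the sketch below uses the happybase Python client; it assumes the lab cluster runs the HBase Thrift gateway, and the table and column names are hypothetical. Phoenix adds a SQL layer over the same tables via JDBC.

```python
import happybase

# Connect to the HBase Thrift server (the host is an assumption about the lab setup).
connection = happybase.Connection('localhost')

table = connection.table('web_events')          # hypothetical table

# Write one cell: row key -> column family 'metrics', qualifier 'clicks'.
table.put(b'user-0001', {b'metrics:clicks': b'42'})

# Read it back.
row = table.row(b'user-0001')
print(row[b'metrics:clicks'])
```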

Sqoop

* The Sqoop tool
* Sqoop architecture
* Importing and exporting data with Sqoop

The big picture

* How Hadoop fits into your architecture
* Hive vs HBase with Phoenix vs Excel

Spark Introduction

  • Big Data, Hadoop, Spark
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing a dataset – part 1
  • Labs: Spark shell exploration
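
The snippet below sketches the kind of exploration done in the shell lab; it assumes an interactive pyspark session (where `spark` and `sc` already exist) and an illustrative sample file path.

```python
# Inside the pyspark shell, `spark` (SparkSession) and `sc` (SparkContext) already exist.
lines = sc.textFile("data/sample.txt")   # illustrative path

print(lines.count())                     # action: how many lines?
print(lines.first())                     # action: peek at the first line

# The Spark web UI (http://localhost:4040 by default) shows the jobs these actions trigger.
```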

Spark Data structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs
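
A minimal word-count sketch illustrating partitions, lazy transformations, and actions; the input path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy; nothing runs until an action is called.
words = (sc.textFile("data/sample.txt", minPartitions=4)   # illustrative path
           .flatMap(lambda line: line.split())
           .map(lambda w: (w.lower(), 1))
           .reduceByKey(lambda a, b: a + b))

print(words.getNumPartitions())   # how the data is split across the cluster
print(words.take(5))              # action: triggers the distributed execution
```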

Caching

  • Caching overview
  • Various caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance
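
A small sketch of the caching mechanisms compared in the benchmark lab; the dataset path is illustrative.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

df = spark.read.csv("data/events.csv", header=True, inferSchema=True)  # illustrative path

# Default caching for DataFrames is MEMORY_AND_DISK.
df.cache()
df.count()          # the first action materializes the cache

# Other storage levels trade memory for recomputation or spill to disk.
df.unpersist()
df.persist(StorageLevel.MEMORY_ONLY)
df.count()
```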

DataFrames and Datasets

  • DataFrames Intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Using schema
  • Specifying schema for DataFrames
  • Labs: DataFrames, Datasets, Schema
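
A minimal sketch of loading JSON with an inferred schema and CSV with an explicit schema; file paths and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframes-sketch").getOrCreate()

# Let Spark infer the schema from a JSON file (illustrative path).
people_json = spark.read.json("data/people.json")
people_json.printSchema()

# Or specify the schema explicitly for a CSV file (column names are made up).
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age",  IntegerType(), True),
])
people_csv = spark.read.csv("data/people.csv", header=True, schema=schema)
people_csv.show()
```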

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: querying structured data using SQL; evaluating data formats
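
A short sketch of registering a DataFrame as a SQL view, querying it, and writing the result in columnar formats; paths and column names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

df = spark.read.json("data/people.json")          # illustrative dataset
df.createOrReplaceTempView("people")              # expose the DataFrame to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")
adults.show()

# Write the same data back in columnar formats for comparison.
adults.write.mode("overwrite").parquet("out/people_parquet")
adults.write.mode("overwrite").orc("out/people_orc")
```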

Spark and Hadoop

  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark & Hive
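
A sketch of Spark reading directly from HDFS and querying a Hive table; the paths and table name are illustrative, and it assumes the session was launched against YARN with Hive support enabled.

```python
from pyspark.sql import SparkSession

# On a YARN cluster the session is typically created by spark-submit with --master yarn;
# enableHiveSupport() lets Spark read tables defined in the Hive metastore.
spark = (SparkSession.builder
         .appName("spark-on-hadoop-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read files directly from HDFS (the path is illustrative).
logs = spark.read.text("hdfs:///data/logs/2021/06/09/")
print(logs.count())

# Query a Hive table from Spark (the table name is illustrative).
spark.sql("SELECT COUNT(*) FROM sales").show()
```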

Spark API

  • Overview of Spark APIs in Scala / Python
  • The life cycle of a Spark application
  • Spark APIs
  • Deploying Spark applications on YARN
  • Labs: Developing and deploying a Spark application
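
A minimal application skeleton of the kind developed in this lab, showing the life cycle from session creation to spark.stop(); the input path and column name are illustrative, and the spark-submit line in the comment is one typical way to deploy it on YARN.

```python
from pyspark.sql import SparkSession

def main():
    # spark-submit supplies the master (e.g. --master yarn) and deploy mode.
    spark = SparkSession.builder.appName("my-spark-app").getOrCreate()

    df = spark.read.csv("hdfs:///data/input.csv", header=True, inferSchema=True)
    df.groupBy("category").count().write.mode("overwrite").parquet("hdfs:///data/output")

    spark.stop()   # end of the application's life cycle

if __name__ == "__main__":
    # Typically launched with something like:
    #   spark-submit --master yarn --deploy-mode cluster my_app.py
    main()
```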

Spark ML Overview

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (the newer, DataFrame-based API)
  • Algorithms overview: Clustering, Classifications, Recommendations
  • Labs: Writing ML applications in Spark
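
A small sketch of a spark.ml clustering job of the kind written in the lab; the data values are made up so the example is self-contained.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

# A tiny in-memory dataset (values are made up for illustration).
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)],
    ["x", "y"])

# spark.ml algorithms expect a single vector column of features.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()
```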

GraphX

  • GraphX library overview
  • GraphX APIs
  • Creating a graph and navigating it
  • Shortest distance
  • Pregel API
  • Labs: Processing graph data using Spark
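
GraphX itself is a Scala API; as a Python-flavored analogue, the sketch below uses the separate GraphFrames package (assumed to be available in the lab environment) to build a graph and compute shortest distances.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame   # assumes the graphframes package is installed

spark = SparkSession.builder.appName("graph-sketch").getOrCreate()

# Vertices and edges as DataFrames (ids and relationships are made up).
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                                 ["id", "name"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# Shortest distance from every vertex to the landmark "a".
g.shortestPaths(landmarks=["a"]).show()
```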

Spark Streaming

  • Streaming concepts
  • Evaluating Streaming platforms
  • Spark streaming library overview
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Continuous processing mode
  • Spark & Kafka streaming
  • Labs: Writing Spark streaming applications
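
A minimal Structured Streaming sketch with a sliding-window word count; the socket source and host/port are illustrative (a Kafka source would swap in readStream.format("kafka")).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a text stream from a socket (host/port are illustrative; `nc -lk 9999` works for testing).
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Count words over a 1-minute sliding window that advances every 30 seconds.
counts = (lines
          .selectExpr("explode(split(value, ' ')) AS word", "current_timestamp() AS ts")
          .groupBy(window(col("ts"), "1 minute", "30 seconds"), col("word"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```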

Workshops (Time permitting)

  • These are group workshops
  • Attendees will work on solving real-world data analysis problems using Spark