Big Data Analytics with Apache Spark 3

Overview

We are living in an era of big data, and Apache Spark is one of the most popular platforms for analyzing it. This course introduces students to Apache Spark. It is taught in Python using the Jupyter notebook environment, and it covers the latest features in Spark version 3.

What You Will Learn

  • Spark ecosystem
  • New features in Spark 3
  • Spark Shell
  • Spark data structures (RDD / DataFrame / Dataset)
  • Spark SQL
  • Modern data formats and Spark
  • Spark API
  • Spark & Hadoop & Hive
  • Spark ML overview
  • Graph processing in Spark
  • Spark Streaming
  • Bonus: Spark performance tuning
  • Bonus: Delta Lake

Audience

Developers, data analysts, and business analysts

Duration

3 days

Pre-requisites

Basic knowledge of the Python language and Jupyter notebooks is preferred but not mandatory. Even if you haven’t done any Python programming, Python is an easy language to pick up quickly, and we will provide Python learning resources.

Lab Environment

A cloud-based lab environment will be provided to students; there is no need to install anything on your laptop.

Students will need the following:

  • A reasonably modern laptop with an unrestricted connection to the Internet. Laptops behind overly restrictive VPNs or firewalls may not work properly
  • Chrome browser

Detailed Outline

Spark Introduction

  • Big Data stacks, Hadoop, Spark
  • Spark 3 new features
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing a dataset – part 1
  • Labs: Spark shell exploration

Spark Data Structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics
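To preview the transformation/action distinction covered in this module, here is a minimal sketch in plain Python (not actual Spark code), using lazy generators as a stand-in for Spark's lazy transformations and `list()` as the eager action:

```python
# Conceptual sketch only: plain-Python generators stand in for Spark's lazy
# transformations; real Spark code would operate on an RDD or DataFrame.

data = [1, 2, 3, 4, 5, 6]

# "Transformations" build a lazy pipeline; nothing is computed yet.
doubled = (x * 2 for x in data)          # like rdd.map(lambda x: x * 2)
big_ones = (x for x in doubled if x > 4) # like .filter(lambda x: x > 4)

# An "action" forces evaluation of the whole pipeline, like rdd.collect().
result = list(big_ones)
print(result)  # [6, 8, 10, 12]
```

Spark applies the same idea at cluster scale: transformations only record lineage, and work happens when an action runs.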

Caching

  • Caching overview
  • Various caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmarking caching performance
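The payoff of caching is avoiding recomputation of a dataset that is reused. A minimal sketch of that idea, using stdlib memoization as a stand-in (in Spark you would call `df.cache()` or `rdd.persist()` instead):

```python
# Conceptual sketch: memoization as a stand-in for Spark's dataset caching.
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def expensive_transform(x):
    calls["count"] += 1      # track how often we actually recompute
    return x * x

# The first pass computes every value; the second pass hits the cache.
first = [expensive_transform(x) for x in range(5)]
second = [expensive_transform(x) for x in range(5)]
print(calls["count"])  # 5 -- the second pass did no recomputation
```

Spark's caching works at the dataset level rather than per function call, but the motivation is the same: pay the computation cost once, reuse many times.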

DataFrames / Datasets

  • Introduction to DataFrames
  • Loading structured data (JSON, CSV) using DataFrames
  • Specifying a schema for DataFrames
  • Labs: DataFrames, Datasets, and schemas
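To give a feel for what "specifying a schema" means, here is a conceptual sketch in plain Python: a declared column-to-type mapping applied to raw CSV text, as a stand-in for Spark's `StructType` schemas and `spark.read.schema(...).csv(...)`:

```python
# Conceptual sketch: applying a declared schema to raw CSV data.
# Spark would use StructType/StructField objects instead of a plain dict.
import csv
import io

raw = "name,age,score\nalice,34,91.5\nbob,29,87.0\n"

# The "schema": column name -> type to cast each value to.
schema = {"name": str, "age": int, "score": float}

rows = []
for record in csv.DictReader(io.StringIO(raw)):
    rows.append({col: cast(record[col]) for col, cast in schema.items()})

print(rows[0])  # {'name': 'alice', 'age': 34, 'score': 91.5}
```

Declaring the schema up front, rather than letting the reader infer it, is also a common Spark performance practice, since inference requires an extra pass over the data.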

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Adaptive Query Execution (AQE) (a Spark 3 feature)
  • Labs: Querying structured data using SQL; evaluating data formats
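The Spark SQL workflow is essentially "register a table, then query it with SQL." Here is that workflow sketched with the stdlib `sqlite3` module as a stand-in for Spark's temp views and `spark.sql(...)`:

```python
# Conceptual sketch: define a table, then query it with SQL.
# In Spark: df.createOrReplaceTempView("sales"); spark.sql("SELECT ...").
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

The SQL itself carries over almost unchanged; what Spark adds is distributed execution of that query over large datasets and many file formats.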

Spark and Hadoop

  • Hadoop Primer
  • Hadoop + Spark architecture
  • Running Spark on Hadoop
  • Processing HDFS files using Spark
  • Spark & Hive
  • Labs: Spark and Hive

Spark API

  • Overview of Spark APIs in Scala / Python
  • The life cycle of a Spark application
  • Spark APIs
  • Deploying Spark applications on YARN
  • Labs: Developing and deploying a Spark application

Spark ML Overview

  • Spark ML overview
  • Algorithms overview: Clustering, Classifications, Recommendations
  • Labs: Writing ML applications in Spark
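As a taste of the clustering algorithms in this module, here is the core step of k-means — assigning each point to its nearest centroid — sketched in plain Python (Spark ML would use `pyspark.ml.clustering.KMeans` over a DataFrame):

```python
# Conceptual sketch: the nearest-centroid assignment step of k-means,
# on 1-D points for simplicity.
points = [1.0, 1.5, 8.0, 9.2, 0.7]
centroids = [1.0, 9.0]

# Assign each point the index of the closest centroid.
assignments = [
    min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
    for p in points
]
print(assignments)  # [0, 0, 1, 1, 0]
```

Full k-means alternates this assignment step with recomputing each centroid as the mean of its assigned points; Spark ML distributes both steps across the cluster.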

Graph Processing

  • Graph processing libraries: GraphX and GraphFrames
  • Creating and analyzing graphs
  • Labs: Processing graph data using Spark
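A simple graph analysis covered in this module is computing vertex degrees from an edge list. A plain-Python sketch of the idea, as a stand-in for GraphFrames' `degrees` (which Spark computes in a distributed fashion):

```python
# Conceptual sketch: vertex degrees from an (src, dst) edge list,
# the same result GraphFrames exposes as g.degrees.
from collections import Counter

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

degrees = Counter()
for src, dst in edges:
    degrees[src] += 1   # each edge contributes to both endpoints
    degrees[dst] += 1

print(sorted(degrees.items()))  # [('a', 2), ('b', 2), ('c', 3), ('d', 1)]
```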

Spark Streaming

  • Spark streaming library overview
  • Streaming operations
  • Structured Streaming
  • Continuous processing mode
  • Spark & Kafka streaming
  • Labs: Writing Spark streaming applications
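A core streaming operation in this module is aggregating events over time windows. Here is a tumbling-window count sketched in plain Python, as a conceptual stand-in for Structured Streaming's `window()`/`groupBy` aggregation:

```python
# Conceptual sketch: tumbling-window counts over timestamped events.
# In Structured Streaming: df.groupBy(window("ts", "10 seconds")).count().
from collections import Counter

WINDOW = 10  # window length in seconds

# (event_time_seconds, value) pairs, as if arriving from a stream.
events = [(1, "a"), (4, "b"), (9, "a"), (12, "c"), (18, "a"), (25, "b")]

counts = Counter()
for t, _value in events:
    window_start = (t // WINDOW) * WINDOW   # assign event to its window
    counts[window_start] += 1

print(sorted(counts.items()))  # [(0, 3), (10, 2), (20, 1)]
```

Real Structured Streaming adds what this sketch lacks: incremental updates as new micro-batches arrive, and watermarks for handling late data.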

Bonus: Spark Performance Tuning

  • Best practices for Spark programming
  • Common pitfalls to watch out for
  • Latest optimizers in Spark 3
  • Lab: Tuning Spark queries

Bonus: Delta Lake (Spark 3)

  • Introduction to Delta Lake
  • Delta Lake architecture
  • Lab: Exploring Delta Lake

Workshops (Time permitting)

  • Attendees will work on solving real-world data analysis problems using Spark