Apache Spark 3 Essentials

Overview

We are living in an era of ‘big data’, and Apache Spark is a popular platform for analyzing it. This course introduces Spark to students. It is taught in Python using the Jupyter environment, and it covers the new features in Spark version 3.

What You Will Learn

Spark ecosystem
New features in Spark 3
Spark Shell
Spark Data structures (RDD / Dataframe / Dataset)
Spark SQL
Modern data formats and Spark
Spark API

Audience

Developers, data analysts, and business analysts

Duration

2 days

Skill level

Introductory to Intermediate

Pre-requisites

Basic knowledge of the Python language and Jupyter notebooks is preferred but not mandatory. Even if you haven’t done any Python programming, Python is an easy language to pick up quickly. We will provide Python resources.

Lab Environment

A cloud-based lab environment will be provided to students; there is no need to install anything on the laptop.

Students will need the following:

A reasonably modern laptop with an unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly.
Chrome browser

Detailed Outline

Spark Introduction

Big Data stacks, Hadoop, Spark
New features in Spark 3
Spark concepts and architecture
Spark components overview
Labs: Installing and running Spark

First Look at Spark

Spark shell
Spark web UIs
Analyzing a dataset – part 1
Labs: Spark shell exploration

Spark Data structures

Partitions
Distributed execution
Operations: transformations and actions
Labs: Unstructured data analytics

Caching

Caching overview
Various caching mechanisms available in Spark
In-memory file systems
Caching use cases and best practices
Labs: Benchmark of caching performance

Dataframes / Datasets

Dataframes Intro
Loading structured data (JSON, CSV) using Dataframes
Specifying schema for Dataframes
Labs: Dataframes, Datasets, Schema

Spark SQL

Spark SQL concepts and overview
Defining tables and importing datasets
Querying data using SQL
Handling various storage formats: JSON / Parquet / ORC
Adaptive Query Execution (AQE) (Spark 3 feature)
Labs: Querying structured data using SQL; evaluating data formats

Spark and Hadoop

Hadoop Primer
Hadoop + Spark architecture
Running Spark on Hadoop
Processing HDFS files using Spark
Spark & Hive
Labs: Spark and Hive

Spark API

Overview of Spark APIs in Scala / Python
Life cycle of a Spark application
Spark APIs
Deploying Spark applications on YARN
Labs: Developing and deploying a Spark application