Spark and PySpark on Azure

Overview

Apache Spark is a powerful, open-source engine for big data processing. It is optimized for speed, ease of use, and sophisticated analytics.

This hands-on course in Apache Spark is geared toward technical business professionals who wish to solve real-world data problems using Apache Spark.

This class is taught in Python in a Jupyter environment. It covers the latest features of Spark 3.

What You Will Learn

Spark ecosystem
PySpark and its place in the Spark ecosystem
New features in Spark 3
How to use Databricks Spark on Azure
Spark Shell
Spark data structures (RDD / DataFrame / Dataset)
Spark SQL
Modern data formats and Spark
Spark API
Spark & Hadoop & Hive
Spark ML overview
Graph processing in Spark
Spark Streaming
Bonus: Spark performance tuning
Bonus: Delta Lake

Duration

Three Days

Audience

Developers, data analysts, and business analysts

Skill level

Introductory to Intermediate

Prerequisites

Basic knowledge of Python and Jupyter notebooks is preferred but not mandatory. Even if you haven’t done any Python programming, Python is an easy language to pick up quickly. We will provide Python resources.

Lab Environment

We will be using students’ Azure accounts
Alternatively, we will provide training accounts on Azure

Students will need the following

A reasonably modern laptop with an unrestricted Internet connection. Laptops with overly restrictive VPNs or firewalls may not work properly.
Chrome browser

Detailed Course Outline

Spark Introduction

Big Data stacks, Hadoop, Spark
Spark 3 new features
Spark concepts and architecture
Spark components overview
Labs: Starting a Databricks cluster on Azure

First Look at Spark

Spark shell
Spark web UIs
Analyzing a dataset – part 1
Labs: Spark shell exploration

Spark Data structures

Partitions
Distributed execution
Operations: transformations and actions
Labs: Unstructured data analytics

Caching

Caching Overview
Various caching mechanisms available in Spark
In-memory file systems
Caching use cases and best practices
Labs: Benchmark of caching performance

DataFrames / Datasets

DataFrames intro
Loading structured data (JSON, CSV) using DataFrames
Specifying a schema for DataFrames
Labs: DataFrames, Datasets, schema

Spark SQL

Spark SQL concepts and overview
Defining tables and importing datasets
Querying data using SQL
Handling various storage formats: JSON / Parquet / ORC
Adaptive Query Execution (AQE) (Spark 3 feature)
Labs: querying structured data using SQL; evaluating data formats

Native integrations with Azure services

Azure Data Factory
Azure Data Lake Storage
Azure Machine Learning
Power BI

Spark API

Overview of Spark APIs in Scala / Python
The life cycle of a Spark application
Spark APIs
Deploying Spark applications on YARN
Labs: Developing and deploying a Spark application

Spark ML Overview

Spark ML overview
Algorithms overview: clustering, classification, recommendations
Labs: Writing ML applications in Spark

Graph Processing

Graph processing libraries: GraphX and GraphFrames
Creating and analyzing graphs
Labs: Processing graph data using Spark

Spark Streaming

Spark streaming library overview
Streaming operations
Structured Streaming
Continuous processing
Spark & Kafka streaming
Labs: Writing Spark streaming applications

Bonus: Spark Performance Tuning

Best practices for Spark programming
Common pitfalls to watch out for
Latest optimizers in Spark 3
Labs: Tuning Spark queries

Bonus: Delta Lake

Introduction to Delta Lake
Delta Lake architecture
Labs: Exploring Delta Lake

Workshops (Time permitting)

Attendees will work on solving real-world data analysis problems using Spark