Spark and PySpark on Azure

Overview

Apache Spark is a powerful open-source engine for big data processing. It is optimized for speed, ease of use, and sophisticated analytics.

This hands-on course in Apache Spark is geared toward technical business professionals who wish to solve real-world data problems using Apache Spark.

This class is taught in Python using the Jupyter environment. It covers the latest features of Spark 3.

What You Will Learn

  • Spark ecosystem
  • PySpark and its place in the Spark ecosystem
  • New features in Spark 3
  • How to use Databricks Spark on Azure
  • Spark Shell
  • Spark data structures (RDD / DataFrame / Dataset)
  • Spark SQL
  • Modern data formats and Spark
  • Spark API
  • Spark & Hadoop & Hive
  • Spark ML overview
  • Graph processing in Spark
  • Spark Streaming
  • Bonus: Spark performance tuning
  • Bonus: Delta Lake

Duration

Three Days

Audience

Developers, data analysts, and business analysts

Skill Level

Introductory to Intermediate

Prerequisites

  • Basic knowledge of Python and Jupyter notebooks is preferred but not mandatory. Python is quick to learn, even if you haven’t done any Python programming before. We will provide Python resources.

Lab Environment

  • We will be using students’ Azure accounts
  • Alternatively, we will provide training accounts on Azure

Students will need the following

  • A reasonably modern laptop with an unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly.
  • Chrome browser

Detailed Course Outline

Spark Introduction

  • Big Data stacks, Hadoop, Spark
  • Spark 3 new features
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Starting a Databricks cluster on Azure

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing a dataset – part 1
  • Labs: Spark shell exploration

Spark Data Structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics

Caching

  • Caching overview
  • Various caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance

DataFrames / Datasets

  • DataFrames intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Specifying schema for DataFrames
  • Labs: DataFrames, Datasets, schema

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Adaptive Query Execution (AQE) (Spark 3 feature)
  • Labs: Querying structured data using SQL; evaluating data formats

Native Integrations with Azure Services

  • Azure Data Factory
  • Azure Data Lake Storage
  • Azure Machine Learning
  • Power BI

Spark API

  • Overview of Spark APIs in Scala / Python
  • The life cycle of a Spark application
  • Spark APIs
  • Deploying Spark applications on YARN
  • Labs: Developing and deploying a Spark application

Spark ML Overview

  • Spark ML overview
  • Algorithms overview: clustering, classification, recommendation
  • Labs: Writing ML applications in Spark

Graph Processing

  • Graph processing libraries: GraphX and GraphFrames
  • Creating and analyzing graphs
  • Labs: Processing graph data using Spark

Spark Streaming

  • Spark Streaming library overview
  • Streaming operations
  • Structured Streaming
  • Continuous streaming
  • Spark & Kafka streaming
  • Labs: Writing Spark Streaming applications

Bonus: Spark Performance Tuning

  • Best practices for Spark programming
  • Common pitfalls to watch out for
  • Latest optimizers in Spark 3
  • Lab: Tuning Spark queries

Bonus: Delta Lake

  • Introduction to Delta Lake
  • Delta Lake architecture
  • Lab: Exploring Delta Lake

Workshops (Time Permitting)

  • Attendees will work on solving real-world data analysis problems using Spark