Big Data Analytics with Apache Spark 3
Overview
We are living in an era of big data, and Apache Spark has become one of the most popular platforms for analyzing it. This course introduces students to Apache Spark. The class is taught in Python using the Jupyter notebook environment and covers the latest features in Spark 3.
What You Will Learn
- Spark ecosystem
- New features in Spark 3
- Spark Shell
- Spark data structures (RDD / DataFrame / Dataset)
- Spark SQL
- Modern data formats and Spark
- Spark API
- Spark & Hadoop & Hive
- Spark ML overview
- Graph processing in Spark
- Spark Streaming
- Bonus: Spark performance tuning
- Bonus: Delta Lake
Audience
Developers, data analysts, and business analysts
Duration
3 days
Pre-requisites
Basic knowledge of the Python language and Jupyter notebooks is preferred but not mandatory. Even if you have not done any Python programming before, Python is an easy language to pick up quickly, and we will provide Python learning resources.
Lab Environment
A cloud-based lab environment will be provided to students; there is no need to install anything on their laptops.
Students will need the following:
- A reasonably modern laptop with unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly
- Chrome browser
Detailed Outline
Spark Introduction
- Big Data stacks, Hadoop, Spark
- Spark 3 new features
- Spark concepts and architecture
- Spark components overview
- Labs: Installing and running Spark
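To give a flavor of the first lab, here is a minimal sketch of starting Spark from Python (assuming PySpark has been installed, e.g. with `pip install pyspark`; the lab environment may already have this set up):

```python
from pyspark.sql import SparkSession

# Create a local SparkSession -- the entry point for a Spark 3 application
spark = (SparkSession.builder
         .appName("hello-spark")
         .master("local[*]")       # use all local cores
         .getOrCreate())

print(spark.version)               # should print a 3.x version
spark.stop()
```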
First Look at Spark
- Spark shell
- Spark web UIs
- Analyzing a dataset – part 1
- Labs: Spark shell exploration
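A small sketch of the kind of exploration done in this lab, inside the PySpark shell (started with `pyspark`), where a SparkSession is pre-created for you:

```python
# Inside the PySpark shell, `spark` (SparkSession) and `sc` (SparkContext) already exist
df = spark.range(1, 1001)          # a one-column DataFrame with ids 1..1000
df.count()                         # action: returns 1000
df.selectExpr("sum(id)").show()    # action: prints 500500
# Each action shows up as a job in the Spark web UI (http://localhost:4040 by default)
```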
Spark Data structures
- Partitions
- Distributed execution
- Operations: transformations and actions
- Labs: Unstructured data analytics
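As an illustration of transformations versus actions, a word-count sketch over an unstructured text file (the file path is a placeholder, and `sc` is the SparkContext provided by the shell):

```python
lines = sc.textFile("data/sample.txt")              # transformation: lazily read lines
counts = (lines.flatMap(lambda line: line.split())  # transformation: lines -> words
               .map(lambda word: (word, 1))         # transformation: word -> (word, 1)
               .reduceByKey(lambda a, b: a + b))     # transformation: sum counts per word
counts.take(10)                                     # action: triggers the whole computation
```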
Caching
- Caching overview
- Various caching mechanisms available in Spark
- In-memory file systems
- Caching use cases and best practices
- Labs: Benchmark of caching performance
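A small sketch of the caching APIs exercised in this module (the dataset and the `type` column are placeholders):

```python
from pyspark import StorageLevel

df = spark.read.json("data/events.json")

df.cache()                           # default for DataFrames: MEMORY_AND_DISK
df.count()                           # first action materializes the cache
df.groupBy("type").count().show()    # reuses the cached data instead of re-reading the JSON
df.unpersist()                       # release the cache when done

# An explicit storage level can be chosen instead of the default:
df.persist(StorageLevel.MEMORY_ONLY)
```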
Dataframes / Datasets
- Introduction to DataFrames
- Loading structured data (JSON, CSV) using DataFrames
- Specifying a schema for DataFrames
- Labs: Dataframes, Datasets, Schema
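A sketch of inferred versus explicit schemas (file and column names are placeholders):

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Infer the schema from the CSV header (convenient, but costs an extra pass over the data)
df1 = spark.read.option("header", True).option("inferSchema", True).csv("data/people.csv")

# Specify the schema explicitly (faster and safer for production jobs)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age",  IntegerType(), True),
])
df2 = spark.read.option("header", True).schema(schema).csv("data/people.csv")
df2.printSchema()
```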
Spark SQL
- Spark SQL concepts and overview
- Defining tables and importing datasets
- Querying data using SQL
- Handling various storage formats: JSON / Parquet / ORC
- Adaptive Query Execution (AQE) (Spark 3 feature)
- Labs: Querying structured data using SQL; evaluating data formats
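A sketch of the Spark SQL workflow covered here (dataset, columns, and output path are placeholders):

```python
df = spark.read.json("data/people.json")
df.createOrReplaceTempView("people")            # register the DataFrame as a SQL view

teens = spark.sql("""
    SELECT name, age
    FROM people
    WHERE age BETWEEN 13 AND 19
""")
teens.show()

# Write the result in a columnar format such as Parquet
teens.write.mode("overwrite").parquet("out/teens.parquet")

# AQE is enabled by default in recent Spark 3 releases; it can also be toggled explicitly
spark.conf.set("spark.sql.adaptive.enabled", "true")
```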
Spark and Hadoop
- Hadoop Primer
- Hadoop + Spark architecture
- Running Spark on Hadoop
- Processing HDFS files using Spark
- Spark & Hive
- Labs: Spark and Hive
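A sketch of reading from HDFS and querying Hive from Spark (paths, database, and table names are placeholders; Hive access requires a session built with Hive support):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-hive-demo")
         .enableHiveSupport()          # lets Spark SQL see the Hive metastore
         .getOrCreate())

# Read files directly from HDFS
logs = spark.read.text("hdfs:///data/logs/*.log")

# Query a Hive table with Spark SQL
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM sales.orders LIMIT 10").show()
```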
Spark API
- Overview of Spark APIs in Scala / Python
- The life cycle of a Spark application
- Spark APIs
- Deploying Spark applications on YARN
- Labs: Developing and deploying a Spark application
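A sketch of a self-contained application and how it might be submitted to YARN (file name, paths, and executor settings are placeholders):

```python
# my_app.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("my-app").getOrCreate()
    df = spark.read.parquet("hdfs:///data/events.parquet")
    (df.groupBy("country").count()
       .write.mode("overwrite").parquet("hdfs:///out/by_country"))
    spark.stop()

# Submitted to a YARN cluster with something like:
#   spark-submit --master yarn --deploy-mode cluster --num-executors 4 my_app.py
```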
Spark ML Overview
- Spark ML overview
- Algorithms overview: clustering, classification, recommendations
- Labs: Writing ML applications in Spark
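A sketch of a simple Spark ML clustering job (the tiny in-line dataset and column names are placeholders):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

data = spark.createDataFrame(
    [(1.0, 1.1), (1.2, 0.9), (8.0, 8.1), (7.9, 8.3)], ["x", "y"])

# Spark ML algorithms expect a single vector column of features
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(data)

model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("x", "y", "prediction").show()
```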
Graph Processing
- Graph processing libraries: GraphX and GraphFrames
- Creating and analyzing graphs
- Labs: Processing graph data using Spark
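A GraphFrames sketch (GraphFrames is a separate package, e.g. started with `pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12`; the exact coordinates depend on your Spark version, and the vertices/edges below are placeholders):

```python
from graphframes import GraphFrame

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()
```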
Spark Streaming
- Spark streaming library overview
- Streaming operations
- Structured Streaming
- Continuous processing mode
- Spark & Kafka streaming
- Labs: Writing Spark streaming applications
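A minimal Structured Streaming sketch using the built-in `rate` source, which generates timestamped rows and is handy for demos; a Kafka source would instead use `.format("kafka")` with broker and topic options:

```python
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = stream.groupBy().count()          # a simple streaming aggregation

query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination(30)                 # let it run for ~30 seconds, then return
```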
Bonus: Spark Performance Tuning
- Best practices for Spark programming
- Common pitfalls to watch out for
- Latest optimizations in Spark 3
- Lab: Tuning Spark queries
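A sketch of two common tuning levers (the configuration keys are standard Spark 3 settings; data paths and the join key are placeholders):

```python
from pyspark.sql.functions import broadcast

# Adaptive Query Execution and partition coalescing (defaults vary by Spark 3 version)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# A broadcast join hint avoids shuffling a large table against a small one
big = spark.read.parquet("data/facts.parquet")
small = spark.read.parquet("data/dim.parquet")
big.join(broadcast(small), "key").explain()   # inspect the chosen physical plan
```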
Bonus: Delta Lake (Spark 3)
- Introduction to Delta Lake
- Delta Lake architecture
- Lab: Exploring Delta Lake
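A sketch of basic Delta Lake operations (Delta Lake is an add-on package, e.g. `pip install delta-spark`, and the SparkSession must be configured with the Delta extensions; paths are placeholders):

```python
df = spark.read.json("data/events.json")

# Write a Delta table, then read it back
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")
events = spark.read.format("delta").load("/tmp/delta/events")

# Time travel: read an earlier version of the same table
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
```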
Workshops (Time permitting)
- Attendees will work on solving real-world data analysis problems using Spark