Spark for Developers


This course introduces Apache Spark. Students will learn how Spark fits into the Big Data ecosystem and how to use Spark for data analysis.

This class is taught in either Python or Scala.

A language primer can be offered if needed.

What You Will Learn

  • Spark ecosystem
  • Spark Shell
  • Spark Data structures (RDD, DataFrame, Dataset)
  • Spark SQL
  • Modern data formats and Spark
  • Spark API
  • Spark & Hadoop & Hive
  • Spark ML overview
  • GraphX
  • Spark Streaming


Audience

Developers, Architects


Duration

3 days


Prerequisites

  • Developer background

Lab Environment

We provide a complete lab environment in the cloud, so there is no need to install Spark on your laptop.

See below for what to bring.

What to Bring:

A reasonably modern laptop that can connect to cloud services. Laptops with overly restrictive firewalls are not recommended.

Detailed Outline:

  1. Spark Introduction
    • Big data, Hadoop, Spark
    • Spark concepts and architecture
    • Spark components overview
    • Labs: installing and running Spark
  2. The first look at Spark
    • Spark shell
    • Spark web UIs
    • Analyzing a dataset – part 1
    • Labs: Spark shell exploration
  3. Spark Data structures
    • Partitions
    • Distributed execution
    • Operations: transformations and actions
    • Labs: Unstructured data analytics using RDDs
  4. Caching
    • Caching overview
    • Various caching mechanisms available in Spark
    • In-memory file systems
    • Caching use cases and best practices
    • Labs: Benchmark of caching performance
  5. DataFrames and Datasets
    • DataFrames Intro
    • Loading structured data (JSON, CSV) using DataFrames
    • Using schemas
    • Specifying a schema for DataFrames
    • Labs: DataFrames, Datasets, Schema
  6. Spark SQL
    • Spark SQL concepts and overview
    • Defining tables and importing datasets
    • Querying data using SQL
    • Handling various storage formats: JSON, Parquet, ORC
    • Labs: querying structured data using SQL; evaluating data formats
  7. Spark and Hadoop
    • Hadoop Primer: HDFS, YARN
    • Hadoop + Spark architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark
    • Spark & Hive
  8. Spark API
    • Overview of Spark APIs in Scala / Python
    • The lifecycle of a Spark application
    • Spark APIs
    • Deploying Spark applications on YARN
    • Labs: Developing and deploying a Spark application
  9. Spark ML Overview
    • Machine Learning primer
    • Machine Learning in Spark: MLlib / ML
    • Spark ML overview (newer Spark 2 version)
    • Algorithms overview: Clustering, Classification, Recommendation
    • Labs: Writing ML applications in Spark
  10. GraphX
    • GraphX library overview
    • GraphX APIs
    • Creating a graph and navigating it
    • Shortest distance
    • Pregel API
    • Labs: Processing graph data using Spark
  11. Spark Streaming
    • Streaming concepts
    • Evaluating Streaming platforms
    • Spark Streaming library overview
    • Streaming operations
    • Sliding window operations
    • Structured Streaming
    • Continuous streaming
    • Spark & Kafka streaming
    • Labs: Writing Spark Streaming applications
  12. Workshops
    • These are team workshops
    • Attendees will work on solving real-world data analysis problems using Spark