Spark for Developers


This course introduces Apache Spark. Students learn how to use Spark for data analysis and how to write Spark applications.

Completely updated for the latest Spark 2.x release!
Spark 2 changes a great deal compared to Spark 1, and this course covers the latest Spark 2 features.


Learn the Spark ecosystem

What You Will Learn

  • Spark Shell
  • Spark internals
  • Spark Data structures: RDDs, DataFrames, Datasets
  • Spark APIs
  • Spark SQL
  • Spark and Hadoop
  • Spark MLlib
  • Spark GraphX
  • Spark Streaming


Audience

Developers / Data Analysts


Duration

3 days


Prerequisites

  • Familiarity with Java, Scala, or Python (the labs are in Scala and Python; we provide a quick Scala introduction)
  • Basic understanding of the Linux development environment (command-line navigation and running commands)

Lab Environment

We provide the complete lab environment in the cloud.  No need to install Spark on your laptop.
See below for what to bring.

What to Bring:

  • A reasonably modern laptop that can connect to cloud services (laptops with overly restrictive firewalls are not recommended)
  • An SSH client (on Windows, use PuTTY or SecureCRT; Mac and Linux come with SSH clients)
  • Chrome browser with Markdown Preview Plus plugin

Detailed Outline:

  1. Scala primer
    • A quick introduction to Scala
    • Labs: Getting to know Scala
  2. Spark Basics
    • Big Data, Hadoop, Spark
    • What’s new in Spark v2
    • Spark concepts and architecture
    • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
    • Labs: Installing and running Spark
  3. Spark Shell
    • Spark shell
    • Spark web UIs
    • Analyzing dataset – part 1
    • Labs: Spark shell exploration
  4. RDDs (Condensed coverage)
    • RDD concepts
    • RDD Operations / transformations
    • Labs: Unstructured data analytics using RDDs
  5. Data model concepts
    • Partitions
    • Distributed processing
    • Failure handling
    • Caching and persistence
  6. Spark DataFrames & Datasets
    • Intro to DataFrames / Datasets
    • Programming in the DataFrame / Dataset API
    • Loading structured data using DataFrames
    • Labs: DataFrames, Datasets, Caching
  7. Spark SQL
    • Spark SQL concepts and overview
    • Defining tables and importing datasets
    • Querying data using SQL
    • Handling various storage formats: JSON / Parquet / ORC
    • Labs: querying structured data using SQL; evaluating data formats
  8. Spark API programming
    • Introduction to the Spark API
    • Submitting the first program to Spark
    • Debugging/logging
    • Configuration properties
    • Labs: Programming in Spark API, Submitting jobs
  9. Spark and Hadoop
    • Hadoop Primer: HDFS / YARN
    • Hadoop + Spark architecture
    • Running Spark on YARN
    • Processing HDFS files using Spark
    • Spark & Hive
  10. Machine Learning (ML and MLlib)
    • Machine Learning primer
    • Machine Learning in Spark: MLlib / ML
    • Spark ML overview (the newer API in Spark 2)
    • Algorithms: Clustering, Classifications, Recommendations
    • Labs: Writing ML applications in Spark
  11. GraphX
    • GraphX library overview
    • GraphX APIs
    • Labs: Processing graph data using Spark
  12. Spark Streaming
    • Streaming concepts
    • Evaluating Streaming platforms
    • Spark Streaming library overview
    • Streaming operations
    • Sliding window operations
    • Structured Streaming
    • Continuous streaming
    • Spark & Kafka streaming
    • Labs: Writing Spark Streaming applications
  13. Spark in the real world
    • Highlights of real-world Spark use cases