Spark V2 for Developers

Overview:

This course introduces Apache Spark. Students will learn how to use Spark for data analysis and how to write Spark applications.

Completely updated for the latest Spark version 2.x!
Spark 2 introduces many changes compared to Spark 1. This course covers the latest Spark 2 features.

Objective

Learn the Spark ecosystem

What You Will Learn

  • Spark Shell
  • Spark internals
  • Spark data structures : RDDs, DataFrames, Datasets
  • Spark APIs
  • Spark SQL
  • Spark and Hadoop
  • Spark MLlib
  • Spark GraphX
  • Spark Streaming

Audience :

Developers / Data Analysts

Duration :

3 days

Pre-requisites

  • Familiarity with Java, Scala, or Python (our labs are in Scala and Python; we provide a quick Scala introduction)
  • Basic understanding of the Linux development environment (command-line navigation / running commands)

Lab Environment

We provide the complete lab environment in the cloud.  No need to install Spark on your laptop.
See below for what to bring.

What to Bring:

  • A reasonably modern laptop that can connect to cloud services (laptops with overly restrictive firewalls are not recommended)
  • SSH client (on Windows, use PuTTY or SecureCRT; Mac and Linux come with SSH clients)
  • Chrome browser with Markdown Preview Plus plugin

Detailed Outline:

  1. Scala primer
    • A quick introduction to Scala
    • Labs : Getting to know Scala
  2. Spark Basics
    • Big Data, Hadoop, Spark
    • What’s new in Spark v2
    • Spark concepts and architecture
    • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
    • Labs : Installing and running Spark
  3. Spark Shell
    • Spark shell
    • Spark web UIs
    • Analyzing dataset – part 1
    • Labs : Spark shell exploration
  4. RDDs (Condensed coverage)
    • RDD concepts
    • RDD Operations / transformations
    • Labs : Unstructured data analytics using RDDs
  5. Data model concepts
    • Partitions
    • Distributed processing
    • Failure handling
    • Caching and persistence
  6. Spark DataFrames & Datasets
    • Intro to DataFrame / Dataset
    • Programming with the DataFrame / Dataset API
    • Loading structured data using DataFrames
    • Labs : DataFrames, Datasets, Caching
  7. Spark SQL
    • Spark SQL concepts and overview
    • Defining tables and importing datasets
    • Querying data using SQL
    • Handling various storage formats : JSON / Parquet / ORC
    • Labs : querying structured data using SQL; evaluating data formats
  8. Spark API programming (Scala / Python)
    • Introduction to the Spark API
    • Submitting the first program to Spark
    • Debugging / logging
    • Configuration properties
    • Labs : Programming in Spark API, Submitting jobs
  9. Spark and Hadoop
    • Hadoop Primer : HDFS / YARN
    • Hadoop + Spark architecture
    • Running Spark on YARN
    • Processing HDFS files using Spark
    • Spark & Hive
  10. Machine Learning (ML / MLlib)
    • Machine Learning primer
    • Machine Learning in Spark : MLlib / ML
    • Spark ML overview (the newer Spark 2 API)
    • Algorithms : Clustering, Classification, Recommendation
    • Labs : Writing ML applications in Spark
  11. GraphX
    • GraphX library overview
    • GraphX APIs
    • Labs : Processing graph data using Spark
  12. Spark Streaming
    • Streaming concepts
    • Evaluating Streaming platforms
    • Spark Streaming library overview
    • Streaming operations
    • Sliding window operations
    • Structured Streaming
    • Continuous streaming
    • Spark & Kafka streaming
    • Labs : Writing Spark Streaming applications
  13. Spark in the real world
    • Highlights of real-world Spark use cases