Spark V2 for Data Analysts

Overview:

This course introduces Apache Spark. Students will learn how Spark fits into the Big Data ecosystem and how to use Spark for data analysis.

Completely updated for Spark 2.x!
Spark 2 introduces many changes compared to v1. This course covers the latest Spark v2 features.

What You Will Learn

  • Scala primer
  • Spark Shell
  • Spark data structures (RDD / DataFrame / Dataset)
  • Spark SQL
  • Spark & Hadoop
  • Spark MLlib (3rd day)
  • Spark GraphX (3rd day)

Audience:

Data Analysts, Business Analysts

Duration:

2-3 days (depending on coverage required)

Pre-requisites

  • Analyst background (familiarity with SQL, scripting, etc.)
  • Basic understanding of the Linux development environment (basic command-line navigation / editing files / running programs)

Lab Environment

We provide the complete lab environment in the cloud, so there is no need to install Spark on your laptop.
See below for what to bring.

What to Bring:

 

Detailed Outline:

  1. Scala primer
    • A quick introduction to Scala
    • Labs: Getting to know Scala
  2. Spark Basics
    • Big Data, Hadoop, Spark
    • Spark concepts and architecture
    • Spark ecosystem (Core, Spark SQL, MLlib, Streaming)
    • Labs: Installing and running Spark
  3. First Look at Spark
    • Spark shell
    • Spark web UIs
    • Analyzing a dataset – part 1
    • Labs: Spark shell exploration
  4. RDDs (condensed coverage)
    • RDD concepts
    • Partitions
    • RDD Operations / transformations
    • Labs: Unstructured data analytics using RDDs
  5. DataFrames / Datasets
    • Understanding the newer Dataset API
    • DataFrames
    • Loading structured data using DataFrames
    • Caching and persistence
    • Labs: DataFrames, Datasets, Caching
  6. Spark SQL
    • Spark SQL concepts and overview
    • Defining tables and importing datasets
    • Querying data using SQL
    • Handling various storage formats: JSON / Parquet / ORC
    • Labs: Querying structured data using SQL; evaluating data formats
  7. Spark and Hadoop
    • Hadoop primer: HDFS / YARN
    • Hadoop + Spark architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark
    • Spark & Hive
  8. Machine Learning (ML) (day 3)
    • Machine Learning primer
    • Machine Learning in Spark: MLlib / ML
    • Spark ML overview (newer Spark 2 version)
    • Algorithms: Clustering, Classification, Recommendations
    • Labs: Writing ML applications
  9. GraphX (day 3)
    • GraphX library overview
    • GraphX APIs
    • Labs: Processing graph data using Spark