Data Analytics With Hadoop and Spark
Overview:
This course will introduce Apache Spark. The students will learn how Spark fits into the Big Data ecosystem, and how to use Spark for data analysis.
This class is taught in Python, using the Jupyter environment.
What You Will Learn
- Spark ecosystem
- Spark Shell
- Spark Data structures (RDD / Dataframe / Dataset)
- Spark SQL
- Modern data formats and Spark
- Spark & Hadoop & Hive
Audience:
Data Analysts, Business Analysts
Duration:
2-3 days (depending on coverage required)
Pre-requisites
- Analyst background (familiarity with SQL, scripting, etc.)
Lab Environment
We provide the complete lab environment in the cloud. No need to install Spark on your laptop.
See below for what to bring.
What to Bring:
A reasonably modern laptop. It needs to be able to connect to cloud services. (Laptops with overly restrictive firewalls are not recommended.)
Detailed Outline:
Spark Introduction
- Big Data, Hadoop, Spark
- Spark concepts and architecture
- Spark components overview
- Labs: Installing and running Spark
First Look at Spark
- Spark shell
- Spark web UIs
- Analyzing dataset – part 1
- Labs: Spark shell exploration
Spark Data structures
- Partitions
- Distributed execution
- Operations: transformations and actions
- Labs: Unstructured data analytics using RDDs
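The difference between transformations and actions can be previewed without a cluster. The sketch below is a plain-Python analogy using generators, not the Spark API: the sample lines, the `split_words` helper, and the `log` list are hypothetical names chosen for illustration. It assumes only the general Spark behavior that transformations are lazy and that an action triggers the actual work on each partition.

```python
from itertools import chain

# Hypothetical data, already split into two "partitions".
partitions = [
    ["spark makes big data simple", "hadoop stores big data"],
    ["spark runs on yarn"],
]

log = []  # records each line at the moment it is actually processed


def split_words(partition, log):
    """Transformation stand-in: describes the work, runs only when consumed."""
    for line in partition:
        log.append(line)
        yield from line.split()


# Building the pipeline is lazy: generators are created, nothing is processed.
pipelines = [split_words(p, log) for p in partitions]
print(len(log))  # 0

# Action stand-in: counting the words forces every partition to be processed.
word_count = sum(1 for _ in chain.from_iterable(pipelines))
print(word_count)  # 13
print(len(log))  # 3
```

In the labs, the same idea is exercised with real RDDs, where the work is spread across executors instead of running in one Python process.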
Caching
- Caching overview
- Various caching mechanisms available in Spark
- In memory file systems
- Caching use cases and best practices
- Labs: Benchmark of caching performance
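Why caching matters can be illustrated with a small plain-Python analogy, again not the Spark API. The `Source` class and `squares` function are hypothetical stand-ins: the sketch assumes that, without caching, each action recomputes the full chain of transformations from the source, while a cached result is computed once and reused.

```python
class Source:
    """Stand-in for an expensive data source; counts how often it is scanned."""

    def __init__(self, values):
        self.values = values
        self.reads = 0

    def scan(self):
        self.reads += 1
        return list(self.values)


def squares(source):
    """Stand-in for a transformation chain; rescans the source on every call."""
    return [n * n for n in source.scan()]


# Without caching: each "action" re-runs the whole chain.
uncached = Source([1, 2, 3, 4, 5])
total = sum(squares(uncached))
largest = max(squares(uncached))
print(uncached.reads)  # 2

# With caching: compute once, keep the result in memory, reuse it.
cached_source = Source([1, 2, 3, 4, 5])
cached = squares(cached_source)
total_cached = sum(cached)
largest_cached = max(cached)
print(cached_source.reads)  # 1
```

Both variants produce the same answers; only the number of scans differs, which is the effect the caching benchmark lab measures on real data.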
Dataframes / Datasets
- Dataframes Intro
- Loading structured data (JSON, CSV) using Dataframes
- Using schemas
- Specifying a schema for Dataframes
- Labs: Dataframes, Datasets, Schema
Spark SQL
- Spark SQL concepts and overview
- Defining tables and importing datasets
- Querying data using SQL
- Handling various storage formats: JSON / Parquet / ORC
- Labs: querying structured data using SQL; evaluating data formats
Spark and Hadoop
- Hadoop Primer: HDFS / YARN
- Hadoop + Spark architecture
- Running Spark on Hadoop YARN
- Processing HDFS files using Spark
- Spark & Hive
Workshops
- These are group workshops
- Attendees will work on solving real-world data analysis problems using Spark