Data Analytics With Hadoop And Spark

Learn Spark fundamentals and its place in the Hadoop ecosystem.

Audience: Data Analysts, Business Analysts

Duration: 2-3 days (depending on coverage required)

Overview

This course introduces Apache Spark. Students learn how Spark fits into the Big Data ecosystem and how to use Spark for data analysis. The class is taught in Python using the Jupyter environment.

Objective

Learn Spark fundamentals and its place in the Hadoop ecosystem.

What You Will Learn

  • Spark ecosystem
  • Spark Shell
  • Spark data structures (RDD / DataFrame / Dataset)
  • Spark SQL
  • Modern data formats and Spark
  • Spark & Hadoop & Hive

Course Details

Prerequisites:

Analyst background (familiarity with SQL, scripting, etc.)

Setup:

  • Cloud-based lab environment
  • Modern laptop able to reach cloud services

Detailed Outline

  • Big Data, Hadoop, Spark
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark
  • Spark shell
  • Spark web UIs
  • Analyzing dataset – part 1
  • Labs: Spark shell exploration
  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs
  • Caching overview
  • Various caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance
  • DataFrames intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Using schemas
  • Specifying a schema for DataFrames
  • Labs: DataFrames, Datasets, Schema
  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: querying structured data using SQL; evaluating data formats
  • Hadoop Primer: HDFS / YARN
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark & Hive
  • Group workshops solving real-world data analysis problems using Spark
