Big Data Analytics With Hadoop

Overview

Apache Hadoop is a popular framework for processing Big Data. Hadoop provides rich and deep analytics capabilities, and it is making inroads into the traditional BI analytics world. This course introduces analysts to the core components of the Hadoop ecosystem and its analytics.

What You Will Learn

  • Understanding the Hadoop ecosystem
  • Data storage using HDFS
  • Data warehousing and querying using Hive

Audience

Business Analysts, Developers

Duration

2 days

Format

Lectures and hands-on labs.

Prerequisites

  • a programming background with databases / SQL
  • basic knowledge of Linux

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following:

  • an SSH client (Linux and Mac already have SSH clients; for Windows, PuTTY is recommended)
  • a browser to access the cluster.

Detailed outline

 Hadoop ecosystem

    • Hadoop overview
      • distributions
      • high level architecture
      • hardware / software
      • Labs : first look at Hadoop
    • HDFS Overview
      • concepts (horizontal scaling, replication, data locality)
      • architecture (NameNode, DataNode)
      • Demo : Interacting with HDFS
    • YARN Overview
      • YARN operating system
      • Demo : Running applications on YARN

 Hive

    • Hive concepts & architecture
    • SQL support in Hive
    • Data warehousing in Hive
    • data types
    • table creation and queries
    • partitions
    • joins
    • modern data formats
    • text analytics
    • Hive performance
    • labs (multiple)