Big Data Analytics With Hadoop

Overview

Apache Hadoop is a popular framework for processing Big Data. Hadoop provides rich and deep analytics capabilities, and it is making inroads into the traditional BI analytics world. This course introduces analysts to the core components of the Hadoop ecosystem and its analytics.

What You Will Learn

Understanding the Hadoop ecosystem
Data storage using HDFS
Data warehousing and querying using Hive

Audience

Business Analysts, Developers

Duration

2 days

Format

Lectures and hands-on labs.

Prerequisites

Programming background with databases / SQL
Basic knowledge of Linux

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following:

An SSH client (Linux and Mac already include one; on Windows, PuTTY is recommended)
A browser to access the cluster

Detailed outline

Hadoop ecosystem

- Hadoop overview
  - distributions
  - high level architecture
  - hardware / software
  - Labs : first look at Hadoop
- HDFS Overview
  - concepts (horizontal scaling, replication, data locality)
  - architecture (NameNode, DataNode)
  - Demo : Interacting with HDFS
- YARN Overview
  - YARN as the cluster operating system (resource management and scheduling)
  - Demo : Running applications on YARN
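
As a rough illustration of the HDFS concepts listed above (blocks and replication), the following Python sketch estimates how many blocks a file is split into and how much raw cluster storage it consumes. The 128 MB block size and replication factor of 3 are assumptions based on common Hadoop defaults; the course cluster may be configured differently, and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: common default for dfs.blocksize
REPLICATION = 3      # assumption: common default for dfs.replication


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Each block is stored `replication` times on different DataNodes.
    replicas = blocks * replication
    # A partial last block only occupies its actual size on disk.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(f"{blocks} blocks, {replicas} replicas, {raw_mb} MB raw storage")
```

Under these assumptions, a 1000 MB file occupies 8 blocks and roughly three times its size in raw storage, which is the cost of tolerating DataNode failures.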

Hive

- Hive concepts & architecture
- SQL support in Hive
- Data warehousing in Hive
- data types
- table creation and queries
- partitions
- joins
- modern data formats
- text analytics
- Hive performance
- labs (multiple)
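
To give a flavor of the partitions topic above, here is a minimal Python sketch of how a partitioned Hive table is typically laid out as nested `key=value` directories, and why a query that filters on a partition column can skip whole directories. The table location, partition columns, and helper names (`partition_path`, `prune`) are hypothetical and chosen only for illustration; this models the idea and does not call Hive.

```python
def partition_path(table_dir, partition):
    """Build the directory for one partition: one key=value level per column."""
    levels = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([table_dir.rstrip("/")] + levels)


def prune(partitions, **wanted):
    """Keep only partitions that match every requested key/value pair."""
    return [p for p in partitions if all(p.get(k) == v for k, v in wanted.items())]


# Hypothetical table location and partition values.
TABLE_DIR = "/user/hive/warehouse/sales"
partitions = [
    {"year": "2023", "month": "12"},
    {"year": "2024", "month": "01"},
    {"year": "2024", "month": "02"},
]

# A filter such as year = '2024' only needs to read the matching directories.
for p in prune(partitions, year="2024"):
    print(partition_path(TABLE_DIR, p))
```

In this sketch, filtering on `year` leaves two of the three directories to scan; on real tables with many partitions, this kind of pruning is one of the main levers for Hive performance.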