Hadoop for Developers

Overview

Apache Hadoop is a widely used open-source framework for storing and processing Big Data on clusters of servers. This course introduces developers to the Hadoop ecosystem.

Objective

Learn the Hadoop ecosystem and its tools for Big Data analytics.

What You Will Learn

Hadoop & Big Data
HDFS
YARN
Hive
HBase

Audience

Developers

Duration

Three days

Format

Lectures and hands-on labs (50% each)

Prerequisites

Developer background
Comfortable with SQL and the Java programming language (the HBase labs are in Java)
Comfortable in a Linux environment

Lab environment

Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.

Students will need the following:

An SSH client (Linux and macOS already include one; on Windows, PuTTY is recommended)
A web browser to access the cluster (Chrome is recommended)

Detailed outline

Introduction to Hadoop

- Hadoop Ecosystem
- Hadoop Distributions
- High-level architecture
- Hardware and software
- Lab: first look at Hadoop

HDFS

- Design and architecture
- Concepts (horizontal scaling, replication, data locality, rack awareness)
- Daemons: NameNode, Secondary NameNode, DataNode
- Communications and heartbeats
- Data integrity
- Read and write path
- NameNode High Availability (HA), Federation
- Labs: Interacting with HDFS
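
To give a flavor of the replication concept listed above, here is a minimal sketch of the arithmetic involved. It is not taken from the course labs and does not use the Hadoop API. The class and method names are hypothetical, and the 128 MB block size and replication factor of 3 are assumptions based on common defaults (both are configurable per cluster).

```java
// Illustrative arithmetic only; this is not Hadoop API code.
final class HdfsBlockMath {
    // Assumed values: common defaults, configurable per cluster.
    static final long BLOCK_SIZE_BYTES = 128L * 1024 * 1024;
    static final int REPLICATION_FACTOR = 3;

    // Number of blocks a file is split into (the last block may be partial).
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        return (fileSizeBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    // Raw bytes stored across the cluster: each byte is kept once per replica.
    static long rawStorageBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION_FACTOR;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));      // 8
        System.out.println(rawStorageBytes(oneGiB)); // 3221225472
    }
}
```

Under these assumptions, a 1 GiB file is split into 8 blocks and occupies about 3 GiB of raw disk space across the DataNodes.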

YARN

- YARN Concepts and architecture
- ResourceManager, NodeManager
- Writing YARN applications
- Labs: Running a sample YARN program

Hive

- Architecture and design
- Hive Data types
- HiveQL (HQL)
- Creating Hive tables and querying
- Partitions
- Joins
- Text processing
- Labs: various labs on processing data with Hive

HBase

- Concepts and architecture
- HBase vs RDBMS vs Cassandra
- HBase Java API
- Time series data on HBase
- Schema design
- Labs: Interacting with HBase using the shell; programming with the HBase Java API; schema design exercise
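
As a small taste of the time series and schema design topics above, here is a minimal sketch of one commonly used row key pattern. It is not taken from the course labs. HBase keeps rows sorted by the bytes of the row key, so appending a reversed timestamp to a sensor ID makes the newest reading for that sensor sort first. The sketch uses only the Java standard library (Java 9 or later) rather than the HBase client API; the class name, method name, and sensor IDs are hypothetical, and it assumes fixed-length IDs and non-negative timestamps.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative row key construction; this is not HBase client API code.
final class TimeSeriesRowKey {
    // Row key layout: [sensor ID bytes][8-byte big-endian reversed timestamp].
    static byte[] rowKey(String sensorId, long timestampMillis) {
        byte[] id = sensorId.getBytes(StandardCharsets.UTF_8);
        // Newer timestamps give smaller reversed values, so they sort first.
        long reversed = Long.MAX_VALUE - timestampMillis;
        return ByteBuffer.allocate(id.length + Long.BYTES)
                .put(id)
                .putLong(reversed)
                .array();
    }

    public static void main(String[] args) {
        byte[] older = rowKey("s1", 1000L);
        byte[] newer = rowKey("s1", 2000L);
        // Unsigned lexicographic comparison mirrors byte-ordered row keys.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0); // true
    }
}
```

With this layout, a scan that starts at the sensor ID prefix returns the most recent readings first.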