Hadoop for Developers
Overview
Apache Hadoop is a widely used open-source framework for storing and processing Big Data on clusters of servers. This course introduces developers to the Hadoop ecosystem.
Objective
Learn the Hadoop ecosystem and its tools for Big Data analytics.
What You Will Learn
- Hadoop & Big Data
- HDFS
- YARN
- Hive
- HBase
Audience:
Developers
Duration:
Three days
Format:
Lectures and hands-on labs (50% lectures, 50% labs)
Prerequisites
- Developer background
- Comfortable with SQL and the Java programming language (HBase labs are in Java)
- Comfortable in a Linux environment
Lab environment
Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.
Students will need the following:
- An SSH client (Linux and macOS already include one; PuTTY is recommended for Windows)
- A browser to access the cluster (Chrome is recommended)
Detailed outline
Introduction to Hadoop
- Hadoop Ecosystem
- Hadoop Distributions
- High-level architecture
- Hardware and software
- Lab: first look at Hadoop
HDFS
- Design and architecture
- Concepts (horizontal scaling, replication, data locality, rack awareness)
- Daemons: NameNode, Secondary NameNode, DataNode
- Communications and heartbeats
- Data integrity
- Read and write path
- NameNode High Availability (HA) and Federation
- Labs: Interacting with HDFS
YARN
- YARN Concepts and architecture
- ResourceManager, NodeManager
- Writing YARN applications
- Labs: Running a sample YARN program
Hive
- Architecture and design
- Hive Data types
- HiveQL (HQL)
- Creating Hive tables and querying
- Partitions
- Joins
- Text processing
- Labs: various labs on processing data with Hive
HBase
- Concepts and architecture
- HBase vs RDBMS vs Cassandra
- HBase Java API
- Time series data on HBase
- Schema design
- Labs: Interacting with HBase using the shell; programming with the HBase Java API; schema design exercise