Hadoop for Developers


Apache Hadoop is one of the most widely used frameworks for processing Big Data on clusters of servers. This course introduces developers to the Hadoop ecosystem.


Learn the Hadoop ecosystem and its tools for Big Data analytics

What You Will Learn

  • Hadoop & Big Data
  • HDFS
  • YARN
  • Hive
  • HBase




Duration

Three days

Format:

Lectures and hands-on labs (50% each)


Prerequisites

  • Developer background
  • Comfortable with SQL and the Java programming language (HBase labs are in Java)
  • Comfortable in a Linux environment

Lab environment

Zero install: students do not need to install Hadoop software on their own machines. A working Hadoop cluster will be provided.

Students will need the following:

  • An SSH client (Linux and macOS already include one; on Windows, PuTTY is recommended)
  • A browser to access the cluster (Chrome is recommended)

Detailed outline

Introduction to Hadoop

    • Hadoop Ecosystem
    • Hadoop Distributions
    • High-level architecture
    • Hardware and software
    • Lab: first look at Hadoop


HDFS

    • Design and architecture
    • Concepts (horizontal scaling, replication, data locality, rack awareness)
    • Daemons: NameNode, Secondary NameNode, DataNode
    • Communications and heartbeats
    • Data integrity
    • Read and write path
    • NameNode High Availability (HA), Federation
    • Labs: Interacting with HDFS


YARN

    • YARN concepts and architecture
    • ResourceManager, NodeManager
    • Writing YARN applications
    • Labs: Running a sample YARN program


Hive

    • Architecture and design
    • Hive data types
    • HiveQL (HQL)
    • Creating Hive tables and querying
    • Partitions
    • Joins
    • Text processing
    • Labs: processing data with Hive


HBase

    • Concepts and architecture
    • HBase vs RDBMS vs Cassandra
    • HBase Java API
    • Time series data on HBase
    • Schema design
    • Labs: interacting with HBase using the shell; programming with the HBase Java API; schema design exercise