Big Data Analytics With Hadoop
Overview
Apache Hadoop is a popular framework for processing Big Data. Hadoop provides rich and deep analytics capability, and it is making inroads into the traditional BI analytics world. This course introduces analysts to the core components of the Hadoop ecosystem and its analytics capabilities.
What You Will Learn
- Understanding the Hadoop ecosystem
- Data storage using HDFS
- Data warehousing and querying using Hive
Audience
Business Analysts, Developers
Duration
2 days
Format
Lectures and hands-on labs.
Prerequisites
- programming background with databases / SQL
- basic knowledge of Linux
Lab environment
Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.
Students will need the following:
- an SSH client (Linux and Mac already have SSH clients; on Windows, PuTTY is recommended)
- a browser to access the cluster.
Detailed outline
Hadoop ecosystem
- Hadoop overview
- distributions
- high-level architecture
- hardware / software
- Lab: first look at Hadoop
- HDFS Overview
- concepts (horizontal scaling, replication, data locality)
- architecture (NameNode, DataNode)
- Demo: interacting with HDFS
- YARN Overview
- YARN as the cluster operating system (resource management and scheduling)
- Demo: running applications on YARN
Hive
- Hive concepts & architecture
- SQL support in Hive
- Data warehousing in Hive
- data types
- table creation and queries
- partitions
- joins
- modern data formats
- text analytics
- Hive performance
- labs (multiple)