Hadoop for Administrators
Overview
Apache Hadoop is the most popular framework for processing Big Data on clusters of servers. In this course, attendees will learn about the business benefits and use cases for Hadoop and its ecosystem, how to plan cluster deployment and growth, how to install, maintain, monitor, troubleshoot, and optimize Hadoop.
What You Will Learn
- Hadoop & Big Data
- Installing Hadoop
- Managing and Monitoring Hadoop
- Loading data in HDFS
- Managing ecosystem
- Securing Hadoop
Audience
Hadoop administrators
Duration
three days/four days
Format
Lectures and hands-on labs, approximate balance 60% lectures, 40% labs.
Prerequisites
- comfortable with basic Linux system administration
- basic scripting skills
Knowledge of Hadoop and Distributed Computing is not required but will be introduced and explained in the course.
Lab environment
Zero Install: There is no need to install Hadoop software on students’ machines! A working Hadoop cluster will be provided for students.
Students will need the following
- an SSH client (Linux and Mac already have ssh clients, for Windows Putty is recommended)
- a browser to access the cluster. We recommend Firefox browser with FoxyProxy extension installed
Detailed outline
Introduction
-
- Hadoop history, concepts
- Ecosystem
- Distributions
- High-level architecture
- Hadoop myths
- Hadoop challenges (hardware/software)
- Labs: discuss your Big Data projects and problems
Planning and installation
-
- Selecting software, Hadoop distributions
- Sizing the cluster, planning for growth
- Selecting hardware and network
- Rack topology
- Installation
- Multi-tenancy
- The directory structure, logs
- Benchmarking
- Labs: cluster install, run performance benchmarks
HDFS operations
-
- Concepts (horizontal scaling, replication, data locality, rack awareness)
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring
- Command-line and browser-based administration
- Adding storage, replacing defective drives
- Labs: getting familiar with HDFS command lines
Data ingestion
-
- Flume for logs and other data ingestion into HDFS
- Sqoop for importing from SQL databases to HDFS, as well as exporting back to SQL
- Hadoop data warehousing with Hive
- Copying data between clusters (distcp)
- Using S3 as complementary to HDFS
- Data ingestion best practices and architectures
- Labs: setting up and using Flume, the same for Sqoop
MapReduce operations and administration
-
- Parallel computing before MapReduce: compare HPC vs Hadoop administration
- MapReduce cluster loads
- Nodes and Daemons (JobTracker, TaskTracker)
- MapReduce UI walkthrough
- Mapreduce configuration
- Job config
- Optimizing MapReduce
- Fool-proofing MR: what to tell your programmers
- Labs: running MapReduce examples
YARN: new architecture and new capabilities
-
- YARN design goals and implementation architecture
- New actors: ResourceManager, NodeManager, Application Master
- Installing YARN
- Job scheduling under YARN
- Labs: investigate job scheduling
Advanced topics
-
- Hardware monitoring
- Cluster monitoring
- Adding and removing servers, upgrading Hadoop
- Backup, recovery and business continuity planning
- Oozie job workflows
- Hadoop high availability (HA)
- Hadoop Federation
- Securing your cluster with Kerberos
- Labs: set up monitoring
Optional tracks
-
- Cloudera Manager for cluster administration, monitoring, and routine tasks; installation, use. In this track, all exercises and labs are performed within the Cloudera distribution environment (CDH5)
- Ambari for cluster administration, monitoring, and routine tasks; installation, use. In this track, all exercises and labs are performed within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0)