Apache Hive

Overview

Hive is the de facto standard for data warehousing on Hadoop. This course starts with standard Hive setup and operations, continues into advanced Hive use, covers performance and execution engines, and ends with a practical workshop.

This course is intended for data scientists and software engineers. It gives them practical, hands-on experience through a combination of about 50% lecture and 50% lab work.

Audience

Data Scientists, Developers, Administrators

Duration

2 days

Prerequisites

Familiarity with SQL
Browser – the course is taught in Cloudera Hue or Hortonworks Views
Command-line experience is helpful but not required

Lab environment

A working environment will be provided. Students need only a browser.
Zero Install: there is no need to install software on students’ machines.

Course Outline

Hive Basics
- Defining Hive Tables
- SQL Queries over Structured Data
- Filtering / Search
- Aggregations / Ordering
- Partitions
- Joins
- Text Analytics (Semi-Structured Data)
Hive Advanced
- Transformation, Aggregation
- Working with Dates, Timestamps, and Arrays
- Converting Strings to Date, Time, and Numbers
- Creating New Attributes, Mathematical Calculations, Windowing Functions
- Using Character and String Functions
- Binning and Smoothing
- Processing JSON Data
- Execution Engines (Tez, MapReduce, Spark)
Impala (for Cloudera track)
- Architecture
- Impala joins and other SQL specifics
Tez (for Hortonworks track)
- Architecture
- Performance and use
Bonus Workshop
- Students work in teams on this end-to-end workshop
- Set up a data warehouse with Hive
- Query and analyze data with Hive and Spark