Apache Hive

Overview

Hive is the de facto standard for data warehousing on Hadoop. This course starts with standard Hive setup and operations, continues into advanced Hive use, discusses performance and execution engines, and ends with a practical workshop.

This course is intended for data scientists and software engineers. It provides practical, hands-on experience through a mix of roughly 50% lecture and 50% lab work.

Audience

Data Scientists, Developers, Administrators

Duration

2 days

Prerequisites

  • Familiarity with SQL
  • Browser – the course is taught in Cloudera Hue or Hortonworks Views
  • Command-line experience is helpful but not required

Lab environment

  • A working environment will be provided. Students will only need a browser.
  • Zero Install: there is no need to install software on students’ machines.

Course Outline

  • Hive Basics
    • Defining Hive Tables
    • SQL Queries over Structured Data
    • Filtering / Search
    • Aggregations / Ordering
    • Partitions
    • Joins
    • Text Analytics (Semi-Structured Data)
  • Hive Advanced
    • Transformation, Aggregation
    • Working with Dates, Timestamps, and Arrays
    • Converting Strings to Date, Time, and Numbers
    • Creating New Attributes, Mathematical Calculations, Windowing Functions
    • Using Character and String Functions
    • Binning and Smoothing
    • Processing JSON Data
    • Execution Engines (Tez, MapReduce, Spark)
  • Impala (for Cloudera track)
    • Architecture
    • Impala joins and other SQL specifics
  • Tez (for Hortonworks track)
    • Architecture
    • Performance and use
  • Bonus Workshop
    • Students will work in teams to complete this end-to-end workshop
    • Set up a data warehouse with Hive
    • Query and analyze data with Hive and Spark