Apache Hive
Overview
Hive is the de facto standard for data warehousing on Hadoop. This course starts with standard Hive setup and operations, continues into advanced Hive use, discusses performance and execution engines, and ends with a practical workshop.
This course is intended for data scientists and software engineers. It gives them practical experience through a combination of about 50% lecture and 50% lab work.
Audience
Data Scientists, Developers, Administrators
Duration
2 days
Prerequisites
- Familiarity with SQL
- Browser – the course is taught in Cloudera Hue or Hortonworks Views
- Command-line experience is helpful but not required
Lab environment
- A working environment will be provided. Students will need only a browser.
- Zero Install: there is no need to install software on students’ machines.
Course Outline
- Hive Basics
  - Defining Hive Tables
  - SQL Queries over Structured Data
  - Filtering / Search
  - Aggregations / Ordering
  - Partitions
  - Joins
  - Text Analytics (Semi-Structured Data)
- Hive Advanced
  - Transformation, Aggregation
  - Working with Dates, Timestamps, and Arrays
  - Converting Strings to Dates, Times, and Numbers
  - Creating New Attributes, Mathematical Calculations, Windowing Functions
  - Using Character and String Functions
  - Binning and Smoothing
  - Processing JSON Data
  - Execution Engines (Tez, MR, Spark)
- Impala (for Cloudera track)
  - Architecture
  - Impala joins and other SQL specifics
- Tez (for Hortonworks track)
  - Architecture
  - Performance and use
- Bonus Workshop
  - Students work in teams to complete this end-to-end workshop
  - Set up a data warehouse with Hive
  - Query and analyze data with Hive and Spark