Machine Learning With Apache Spark

Overview

Machine Learning (ML) is changing the world. To use ML effectively, one needs to understand the algorithms and how to apply them. This course provides an introduction to the most popular machine learning algorithms.

We will also use Apache Spark as our ML platform. Apache Spark provides a scalable ML platform that makes it possible to analyze large amounts of data.

This course teaches Machine Learning from a practical perspective. In-depth coverage of Math / Stats is beyond the scope of this course.

What you will learn:

Spark ecosystem
Spark ML Library
ML Concepts
Regression
- Linear Regression
- Logistic Regression
Classification
- Naive Bayes
- SVM
- Decision Trees
- Random Forest
Clustering algorithms (K-Means)
Principal Component Analysis (PCA)
Recommendations

Audience

Data Analysts, Software Engineers, Data Scientists

Duration

Four Days

Skill Level

Beginner to Intermediate

Industry Use Cases Covered

We will study and solve some of the most common industry use cases, listed below:

Finance
- Predicting house prices
- Predicting loan defaults at Prosper
- Predicting income from census data
Health care
- Predicting diabetes outcome
Customer service
- Predicting customer turnover
Text analytics
- Spam classification
Travel
- Predicting Uber demand
Politics
- Predicting election contributions
Recommendations
- Predicting movie ratings
- Recommending songs
Other
- Predicting wine quality
- Predicting college admissions

Prerequisites

Good programming background
Familiarity with Python is a plus, but not required
No machine learning knowledge is assumed
No Spark knowledge is assumed

Lab environment

A cloud-based lab environment will be provided to students; there is no need to install anything on the laptop.

Students will need the following:

A reasonably modern laptop with an unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly.
Chrome browser

Detailed Course Outline

Day 1
- Spark and Spark ML
Days 2, 3, 4
- Machine Learning

Spark

Spark ecosystem
Spark data models
Spark ML

Machine Learning (ML) Overview

Machine Learning landscape
Understanding Deep Learning use cases
Understanding AI / Machine Learning / Deep Learning
Data and AI
AI vocabulary
Hardware and software ecosystem
Understanding types of Machine Learning (Supervised / Unsupervised / Reinforcement)

ML in Python and Spark

Spark ML Overview
Introduction to Jupyter notebooks
Lab: Working with Jupyter + Python + Spark
Lab: Spark ML utilities

Feature Engineering and Exploratory Data Analysis (EDA)

Preparing data for ML
Statistics Primer
Data cleanup
Extracting features, enhancing data
Visualizing Data
Labs:
- Data cleanup
- Exploring data
- Visualizing data

Machine Learning Concepts

Training and Testing
Gradient Descent
Overfitting / Underfitting
Cross validation, bootstrapping
Confusion Matrix
ROC curve, Area Under Curve (AUC)
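
As a rough illustration of the confusion matrix topic above, here is a minimal pure-Python sketch (not Spark ML). The function name `confusion_matrix` and the sample labels are made up for illustration; binary labels are assumed, with 1 as the positive class.

```python
def confusion_matrix(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1   # predicted positive, actually positive
        elif p == 1 and a == 0:
            counts["fp"] += 1   # predicted positive, actually negative
        elif p == 0 and a == 1:
            counts["fn"] += 1   # predicted negative, actually positive
        else:
            counts["tn"] += 1   # predicted negative, actually negative
    return counts


# Hypothetical labels, invented for this example.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
precision = cm["tp"] / (cm["tp"] + cm["fp"])  # share of predicted positives that are correct
recall = cm["tp"] / (cm["tp"] + cm["fn"])     # share of actual positives that were found
print(cm, precision, recall)
```

Precision and recall are two of the metrics commonly derived from these four counts; ROC and AUC build on the same counts evaluated at varying decision thresholds.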

Linear regression

Linear Regression
Errors, Residuals
Multiple Linear Regression
Evaluating model performance
Labs:
- Use case: House price estimates
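
To make the idea of fitting a line and inspecting residuals concrete, here is a minimal pure-Python sketch of simple (one-feature) least-squares regression. It is not the Spark ML API used in the labs; `fit_line` and the tiny size/price dataset are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size vs. price, generated from price = 2 * size + 1.
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(sizes, prices)

# Residual = observed value minus the value predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(sizes, prices)]
print(slope, intercept, residuals)
```

Because the sample data lies exactly on a line, the residuals here are all zero; with real data they would be scattered around zero.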

Logistic Regression

Understanding Logistic Regression
Calculating Logistic Regression
Evaluating model performance
Labs:
- Credit card application
- College admissions

Classification: SVM (Support Vector Machines)

SVM concepts and theory
SVM with kernel
Labs:
- Customer churn data

Classification: Decision Trees & Random Forests

Classification and Regression Trees (CART) introduction
Decision Tree concepts
Pruning trees
Gini index
Bias-Variance Tradeoff
Random Forest concepts
Random Forests features and examples
Labs:
- Predicting loan defaults
- Estimating election contributions
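
As one possible illustration of the Gini index listed above, the sketch below computes Gini impurity for a set of class labels and the weighted impurity of a two-way split, which is the quantity a CART-style tree tries to minimize when choosing a split. This is plain Python, not Spark ML; the function names and the loan labels are invented for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: each side's impurity weighted by its share of rows."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical loan outcomes, invented for this example.
perfect_split = weighted_gini(["default"] * 3, ["paid"] * 3)
mixed_split = weighted_gini(
    ["default", "default", "paid"],
    ["paid", "paid", "default"],
)
print(perfect_split, mixed_split)
```

A split that separates the classes completely scores 0.0, so a tree builder comparing these two candidates would prefer the first.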

Classification: Naive Bayes

Naive Bayes theory
Running Naive Bayes algorithm
Evaluating model performance
Lab:
- Spam filtering

Unsupervised Algorithms

Overview of unsupervised algorithms
Supervised vs. unsupervised
Understanding unsupervised algorithms

Unsupervised: Clustering: K-Means

Theory behind K-Means
Running K-Means algorithm
Estimating the performance
Labs:
- Predicting Uber demand
- Clustering shopping trips

Unsupervised: Principal Component Analysis (PCA)

Understanding dimensions
‘Curse of dimensionality’
Reducing dimensions
Overview of Principal Component Analysis (PCA)
Eigenvectors and eigenvalues
Implementing PCA algorithm
Labs:
- Predicting wine quality
- Predicting income from census data

Recommendations

Recommendation use cases
Recommender systems
Collaborative Filtering (CF)
Implementing CF algorithm
Labs:
- Movie ratings recommendation
- Song ratings recommendation

Final workshop (time permitting)

This is a group workshop
Each group will analyze a couple of real-world datasets and run ML algorithms
Each group will present their findings to the class