Machine Learning With Apache Spark


This course teaches Machine Learning at scale using the popular Apache Spark framework.

This course is intended for data scientists and software engineers. We assume no previous knowledge of Machine Learning; we teach popular Machine Learning algorithms from scratch.

For each machine learning concept, we first discuss its foundations, applicability, and limitations. Then we explain its implementation and use, along with specific use cases. The course is roughly 50% lecture and 50% lab work.

This course is taught using Spark & Python.

Highlights of this course

  • No previous knowledge of machine learning is required
  • Learn not just the APIs, but also the theory behind them
  • Work with real-world datasets from Uber, Netflix, Walmart, Prosper, etc.


  • Learn popular machine learning algorithms, their applicability, and limitations
  • Practice the application of these methods in the Spark machine learning environment
  • Learn practical use cases and limitations of algorithms

What you will learn

  • ML Concepts
  • Regressions
  • Classifications
  • Clustering
  • Principal Component Analysis (PCA)
  • Recommendations

Duration: 3 days
Audience: Data Scientists and Software Engineers


Prerequisites:

  • If students are new to Apache Spark, we can offer one day of ‘Introduction to Spark’ training
  • A programming background
  • Familiarity with Python is a plus, but not required
  • No machine learning knowledge is assumed

Lab environment:

A working Spark environment will be provided for students. Students need only an SSH client and a browser.

Zero Install: There is no need to install software on students’ machines.

Detailed Course Outline:

Section 1: Machine Learning (ML) Overview

  • Machine Learning landscape
  • Machine Learning applications
  • Understanding ML algorithms & models

Section 2: ML in Python and Spark

  • Spark ML Overview
  • Introduction to Jupyter notebooks
  • Lab: Working with Jupyter + Python + Spark
  • Lab: Spark ML utilities

Section 3: Machine Learning Concepts

  • Statistics Primer
  • Covariance, Correlation, Covariance Matrix
  • Errors, Residuals
  • Overfitting / Underfitting
  • Cross-validation, bootstrapping
  • Confusion Matrix
  • ROC curve, Area Under Curve (AUC)
  • Lab: Basic stats

Section 4: Feature Engineering (FE)

  • Preparing data for ML
  • Extracting features, enhancing data
  • Data cleanup
  • Visualizing Data
  • Lab: data cleanup
  • Lab: visualizing data

Section 5: Linear regression

  • Simple Linear Regression
  • Multiple Linear Regression
  • Running LR
  • Evaluating LR model performance
  • Lab
  • Use case: House price estimates

Section 6: Logistic Regression

  • Understanding Logistic Regression
  • Calculating Logistic Regression
  • Evaluating model performance
  • Lab
  • Use case: credit card application, college admissions

Section 7: Classification: SVM (Support Vector Machines)

  • SVM concepts and theory
  • SVM with kernel
  • Lab
  • Use case: Customer churn data

Section 8: Classification: Decision Trees & Random Forests

  • Theory behind trees
  • Classification and Regression Trees (CART)
  • Random Forest concepts
  • Labs
  • Use case: predicting loan defaults, estimating election contributions

Section 9: Classification: Naive Bayes

  • Theory
  • Lab
  • Use case: spam filtering

Section 10: Clustering (K-Means)

  • Theory behind K-Means
  • Running K-Means algorithm
  • Estimating the performance
  • Lab
  • Use case: grouping cars data, grouping shopping data

Section 11: Principal Component Analysis (PCA)

  • Understanding PCA concepts
  • PCA applications
  • Running a PCA algorithm
  • Evaluating results
  • Lab
  • Use case: analyzing retail shopping data

Section 12: Recommendations (Collaborative filtering)

  • Recommender systems overview
  • Collaborative Filtering concepts
  • Lab
  • Use case: movie recommendations, music recommendations

Section 13: Performance

  • Best practices for scaling and optimizing Apache Spark
  • Memory caching
  • Testing and validation

Section 14: Final workshop (time permitting)

Students will work in groups to analyze a couple of datasets and run ML algorithms on them.
Each group will present its findings to the class.