Machine Learning With Apache Spark


Machine Learning (ML) is changing the world. To use ML effectively, one needs to understand the algorithms and how to utilize them. This course provides an introduction to the most popular machine learning algorithms.

We will use Apache Spark as our ML platform. Spark is scalable, which makes it possible to analyze large amounts of data.

This course teaches Machine Learning from a practical perspective. In-depth coverage of Math / Stats is beyond the scope of this course.

What you will learn:

  • Spark ecosystem
  • Spark ML Library
  • ML Concepts
  • Regressions
    • Linear Regression
    • Logistic Regression
  • Classifications
    • Naive Bayes
    • SVM
    • Decision Trees
    • Random Forest
  • Clustering algorithms (K-Means)
  • Principal Component Analysis (PCA)
  • Recommendations


Audience

Data Analysts, Software Engineers, Data Scientists


Duration

Four Days

Skill Level

Beginner to Intermediate

Industry Use Cases Covered

We will study and solve some of the most common industry use cases, listed below:

  • Finance
    • Predicting house prices
    • Predicting loan defaults at Prosper
    • Predicting income from census data
  • Health care
    • Predicting diabetes outcome
  • Customer service
    • Predicting customer turnover
  • Text analytics
    • Spam classification
  • Travel
    • Predicting Uber demand
  • Politics
    • Predicting election contributions
  • Recommendations
    • Predicting movie ratings
    • Recommending songs
  • Other
    • Predicting wine quality
    • Predicting college admissions


Prerequisites

  • Good programming background
  • Familiarity with Python would be a plus, but is not required
  • No machine learning knowledge is assumed
  • No Spark knowledge is assumed

Lab environment

A cloud-based lab environment will be provided to students; there is no need to install anything on the laptop.

Students will need the following:

  • A reasonably modern laptop with unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly
  • Chrome browser

Detailed Course Outline

  • Day 1
    • Spark and Spark ML
  • Days 2, 3, 4
    • Machine Learning


Spark and Spark ML

  • Spark ecosystem
  • Spark data models
  • Spark ML

Machine Learning (ML) Overview

  • Machine Learning landscape
  • Understanding Deep Learning use cases
  • Understanding AI / Machine Learning / Deep Learning
  • Data and AI
  • AI vocabulary
  • Hardware and software ecosystem
  • Understanding types of Machine Learning (Supervised / Unsupervised / Reinforcement)

ML in Python and Spark

  • Spark ML Overview
  • Introduction to Jupyter notebooks
  • Lab: Working with Jupyter + Python + Spark
  • Lab: Spark ML utilities

Feature Engineering and Exploratory Data Analysis (EDA)

  • Preparing data for ML
  • Statistics Primer
  • Data cleanup
  • Extracting features, enhancing data
  • Visualizing Data
  • Labs:
    • Data cleanup
    • Exploring data
    • Visualizing data

Machine Learning Concepts

  • Training and Testing
  • Gradient Descent
  • Overfitting / Underfitting
  • Cross validation, bootstrapping
  • Confusion Matrix
  • ROC curve, Area Under Curve (AUC)

Linear Regression

  • Linear Regression
  • Errors, Residuals
  • Multiple Linear Regression
  • Evaluating model performance
  • Labs:
    • Use case: House price estimates

Logistic Regression

  • Understanding Logistic Regression
  • Calculating Logistic Regression
  • Evaluating model performance
  • Labs:
    • Credit card application
    • College admissions

Classification: SVM (Support Vector Machines)

  • SVM concepts and theory
  • SVM with kernel
  • Labs:
    • Customer churn data

Classification: Decision Trees & Random Forests

  • Classification and Regression Trees (CART) introduction
  • Decision Tree concepts
  • Pruning trees
  • Gini index
  • Bias Variance Tradeoff
  • Random Forest concepts
  • Random Forests features and examples
  • Labs:
    • Predicting loan defaults
    • Estimating election contributions

Classification: Naive Bayes

  • Naive Bayes theory
  • Running Naive Bayes algorithm
  • Evaluating model performance
  • Lab:
    • Spam filtering

Unsupervised Algorithms

  • Overview of unsupervised algorithms
  • Supervised vs. unsupervised
  • Understanding unsupervised algorithms

Unsupervised: Clustering: K-Means

  • Theory behind K-Means
  • Running K-Means algorithm
  • Estimating the performance
  • Labs:
    • Predicting Uber demand
    • Clustering shopping trips

Unsupervised: Principal Component Analysis (PCA)

  • Understanding dimensions
  • ‘Curse of dimensionality’
  • Reducing dimensions
  • Overview of Principal Component Analysis (PCA)
  • Eigenvectors and eigenvalues
  • Implementing PCA algorithm
  • Labs:
    • Predicting wine quality
    • Predicting income from census data


Recommendations

  • Recommendation use cases
  • Recommender systems
  • Collaborative Filtering (CF)
  • Implementing CF algorithm
  • Labs:
    • Movie ratings recommendation
    • Song ratings recommendation

Final workshop (time permitting)

  • This is a group workshop
  • Each group will analyze a couple of real-world datasets and run ML algorithms
  • Each group will present their findings to the class