Machine Learning With Apache Spark
Overview
Machine Learning (ML) is changing the world. To use ML effectively, one needs to understand the algorithms and how to apply them. This course provides an introduction to the most popular machine learning algorithms.
We will also use Apache Spark as our ML platform. Apache Spark provides a scalable ML platform that makes it possible to analyze large amounts of data.
This course teaches Machine Learning from a practical perspective. In-depth coverage of Math / Stats is beyond the scope of this course.
What you will learn:
- Spark ecosystem
- Spark ML Library
- ML Concepts
- Regressions
- Linear Regression
- Logistic Regression
- Classifications
- Naive Bayes
- SVM
- Decision Trees
- Random Forest
- Clustering algorithms (K-Means)
- Principal Component Analysis (PCA)
- Recommendations
Audience
Data Analysts, Software Engineers, Data Scientists
Duration
Four Days
Skill Level
Beginner to Intermediate
Industry Use Cases Covered
We will study and solve some of the most common industry use cases, listed below:
- Finance
- Predicting house prices
- Predicting loan defaults at Prosper
- Predicting income from census data
- Health care
- Predicting diabetes outcome
- Customer service
- Predicting customer turnover
- Text analytics
- Spam classification
- Travel
- Predicting Uber demand
- Politics
- Predicting election contributions
- Recommendations
- Predicting movie ratings
- Recommending songs
- Other
- Predicting wine quality
- Predicting college admissions
Prerequisites
- Good programming background
- Familiarity with Python is a plus, but not required
- No machine learning knowledge is assumed
- No Spark knowledge is assumed
Lab environment
A cloud-based lab environment will be provided to students; there is no need to install anything on the laptop.
Students will need the following:
- A reasonably modern laptop with an unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly
- Chrome browser
Detailed Course Outline
- Day 1
- Spark and Spark ML
- Days 2, 3, 4
- Machine Learning
Spark
- Spark ecosystem
- Spark data models
- Spark ML
Machine Learning (ML) Overview
- Machine Learning landscape
- Understanding Deep Learning use cases
- Understanding AI / Machine Learning / Deep Learning
- Data and AI
- AI vocabulary
- Hardware and software ecosystem
- Understanding types of Machine Learning (Supervised / Unsupervised / Reinforcement)
ML in Python and Spark
- Spark ML Overview
- Introduction to Jupyter notebooks
- Lab: Working with Jupyter + Python + Spark
- Lab: Spark ML utilities
Feature Engineering and Exploratory Data Analysis (EDA)
- Preparing data for ML
- Statistics Primer
- Data cleanup
- Extracting features, enhancing data
- Visualizing Data
- Labs:
- Data cleanup
- Exploring data
- Visualizing data
Machine Learning Concepts
- Training and Testing
- Gradient Descent
- Overfitting / Underfitting
- Cross validation, bootstrapping
- Confusion Matrix
- ROC curve, Area Under Curve (AUC)
Linear Regression
- Linear Regression
- Errors, Residuals
- Multiple Linear Regression
- Evaluating model performance
- Labs:
- Use case: House price estimates
Logistic Regression
- Understanding Logistic Regression
- Calculating Logistic Regression
- Evaluating model performance
- Labs:
- Credit card application
- College admissions
Classification: SVM (Support Vector Machines)
- SVM concepts and theory
- SVM with kernel
- Labs:
- Customer churn data
Classification: Decision Trees & Random Forests
- Classification and Regression Trees (CART) introduction
- Decision Tree concepts
- Pruning trees
- Gini index
- Bias Variance Tradeoff
- Random Forest concepts
- Random Forests features and examples
- Labs:
- Predicting loan defaults
- Estimating election contributions
Classification: Naive Bayes
- Naive Bayes theory
- Running Naive Bayes algorithm
- Evaluating model performance
- Lab:
- Spam filtering
Unsupervised Algorithms
- Overview of unsupervised algorithms
- Supervised vs. unsupervised
- Understanding unsupervised algorithms
Unsupervised: Clustering: K-Means
- Theory behind K-Means
- Running K-Means algorithm
- Estimating the performance
- Labs:
- Predicting Uber demand
- Clustering shopping trips
Unsupervised: Principal Component Analysis (PCA)
- Understanding dimensions
- ‘Curse of dimensionality’
- Reducing dimensions
- Overview of Principal Component Analysis (PCA)
- Eigenvectors and eigenvalues
- Implementing PCA algorithm
- Labs:
- Predicting wine quality
- Predicting income from census data
Recommendations
- Recommendation use cases
- Recommender systems
- Collaborative Filtering (CF)
- Implementing CF algorithm
- Labs:
- Movie ratings recommendation
- Song ratings recommendation
Final workshop (time permitting)
- This is a group workshop
- Each group will analyze a couple of real world datasets and run ML algorithms
- Each group will present their findings to the class