# Machine Learning With Apache Spark

This course teaches machine learning at scale with the popular Apache Spark framework.

This course is intended for data scientists and software engineers. **We assume no previous knowledge of machine learning; we teach popular machine learning algorithms from scratch.**

For each machine learning concept, we first discuss its foundations, applicability, and limitations. We then explain its implementation and use, along with specific use cases. The course is roughly 50% lecture and 50% lab work.

This course is taught using **Spark & Python**.

## Highlights of this course

- No previous knowledge of machine learning is required
- Learn not just the APIs but also the theory behind them
- Work with real-world datasets from Uber, Netflix, Walmart, Prosper, etc.

**Objectives:**

- Learn popular machine learning algorithms, their applicability, and limitations
- Practice the application of these methods in the Spark machine learning environment
- Learn practical use cases and limitations of algorithms

## What you will learn

- ML Concepts
- Regressions
- Classifications
- Clustering
- Principal Component Analysis (PCA)
- Recommendations

**Duration**

3 days

**Audience**

Data Scientists and Software Engineers

**Prerequisites**

- If students are new to Apache Spark, we can offer one day of ‘Introduction to Spark’ training
- Programming background
- Familiarity with Python is a plus, but not required
- No machine learning knowledge is assumed

**Lab environment:**

A working Spark environment will be provided for students. Students need only an SSH client and a browser.

Zero Install: There is no need to install software on students’ machines.

**Detailed Course Outline:**

**Section 1: Machine Learning (ML) Overview**

- Machine Learning landscape
- Machine Learning applications
- Understanding ML algorithms & models

**Section 2: ML in Python and Spark**

- Spark ML Overview
- Introduction to Jupyter notebooks
- Lab: Working with Jupyter + Python + Spark
- Lab: Spark ML utilities

**Section 3: Machine Learning Concepts**

- Statistics Primer
- Covariance, Correlation, Covariance Matrix
- Errors, Residuals
- Overfitting / Underfitting
- Cross-validation, bootstrapping
- Confusion Matrix
- ROC curve, Area Under Curve (AUC)
- Lab: Basic stats
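
As an illustration of the confusion matrix topic listed above, here is a minimal sketch in plain Python. It is not taken from the course labs, which use Spark; the function names and the sample labels are made up for this example.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels, for illustration only
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
print(dict(counts), precision, recall)
```

The four counts are the cells of the 2x2 confusion matrix; precision and recall are derived from them.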

**Section 4: Feature Engineering (FE)**

- Preparing data for ML
- Extracting features, enhancing data
- Data cleanup
- Visualizing Data
- Lab: data cleanup
- Lab: visualizing data

**Section 5: Linear Regression**

- Simple Linear Regression
- Multiple Linear Regression
- Running LR
- Evaluating LR model performance
- Lab
- Use case: House price estimates
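
To give a feel for simple linear regression and its evaluation, the sketch below fits a least-squares line in plain Python and computes R². It is an illustration only, not the course's Spark lab code; the function names and the house size and price figures are invented for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented data: house size (sq ft) and price (thousands of dollars)
sizes = [1000, 1500, 2000, 2500]
prices = [210, 290, 410, 490]

slope, intercept = fit_simple_linear_regression(sizes, prices)
r2 = r_squared(sizes, prices, slope, intercept)
print(f"price ~ {slope:.3f} * size + {intercept:.1f}, R^2 = {r2:.4f}")
```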

**Section 6: Logistic Regression**

- Understanding Logistic Regression
- Calculating Logistic Regression
- Evaluating model performance
- Lab
- Use case: credit card application, college admissions

**Section 7: Classification: SVM (Support Vector Machines)**

- SVM concepts and theory
- SVM with kernel
- Lab
- Use case: Customer churn data

**Section 8: Classification: Decision Trees & Random Forests**

- Theory behind trees
- Classification and Regression Trees (CART)
- Random Forest concepts
- Labs
- Use case: predicting loan defaults, estimating election contributions

**Section 9: Classification: Naive Bayes**

- Theory
- Lab
- Use case: spam filtering

**Section 10: Clustering (K-Means)**

- Theory behind K-Means
- Running K-Means algorithm
- Estimating the performance
- Lab
- Use case: grouping cars data, grouping shopping data

**Section 11: Principal Component Analysis (PCA)**

- Understanding PCA concepts
- PCA applications
- Running a PCA algorithm
- Evaluating results
- Lab
- Use case: analyzing retail shopping data

**Section 12: Recommendations (Collaborative filtering)**

- Recommender systems overview
- Collaborative Filtering concepts
- Lab
- Use case: movie recommendations, music recommendations

**Section 13: Performance**

- Best practices for scaling and optimizing Apache Spark
- Memory caching
- Testing and validation

**Section 14: Final workshop (time permitting)**

Students will analyze a couple of datasets and run ML algorithms.

This is done as a group exercise. Each group will present their findings to the class.