Machine Learning with Spark

Spark is a new and very popular Big Data processing engine. Spark MLlib is a de facto standard for machine learning on Big Data.

This course is intended for data scientists and software engineers. It balances theory and practice: for each machine learning concept, we first discuss its foundations, applicability, and limitations, then explain its implementation and use and work through specific use cases. The course is about 50% lecture and 50% lab work.

Duration : 4 days

Audience : Data Scientists and Software Engineers

Prerequisites :

  •     familiarity with programming in at least one language
  •     ability to navigate the Linux command line
  •     basic knowledge of Linux command-line editors (vi / nano)

Objectives :

  •     attain a thorough understanding of popular machine learning algorithms, their applicability and limitations
  •     practice applying these methods in the Spark machine learning environment
  •     see how machine learning is used in the real world through a practical use case for each method

Lab environment:

A working Spark environment will be provided for students. Students need only an SSH client and a browser.

Zero Install : There is no need to install software on students’ machines.

Course Outline:

Section 1: Introductions and overviews

  • Machine learning: goals, results, supervised/unsupervised
  • Spark as a tool for Big Data
  • Scala as the language of Spark (together with Python, Java and R)
  • MLlib as a collection of machine learning algorithms

If students are not already familiar with Spark and Scala, this section also provides a thorough introduction to both.

Section 2: SVM (Support Vector Machines)

  • Theory
  • Lab
  • Use case: anomaly detection

Section 3: Logistic Regression

  • Theory
  • Lab
  • Use case: healthcare prediction

Section 4: Linear Regression

  • Theory
  • Lab
  • Use case: financial modelling

Section 5: Naive Bayes

  • Theory
  • Lab
  • Use case: spam filtering

Section 6: Decision Trees

  • Theory
  • Lab
  • Use case: vessel shipment planning

Section 7: Clustering (K-Means)

  • Theory
  • Lab
  • Use case: topic grouping

Section 8: LDA (Latent Dirichlet Allocation)

  • Theory
  • Lab
  • Use case: unsupervised topic discovery

Section 9: Principal Component Analysis (PCA)

  • Theory
  • Lab
  • Use case: stock analysis

Section 10: Recommendation (Collaborative filtering)

  • Theory
  • Lab
  • Use case: dating

Section 11: Graphs – graph operations

  • Theory
  • Lab
  • Use case: finding followers

Section 12: Graphs – optimizations with Pregel

  • Theory
  • Lab
  • Use case: shortest routes, PageRank