Google Cloud for Data Scientists

2020 Feb 07

Overview

Data Science is in high demand today, and Google is one of its major promoters. Google Cloud Platform (GCP) is one of the leading platforms for Data Science work.

In this course, students will learn to do Data Science with Python and to use the Google Cloud capabilities specific to Data Science.

What you will learn

  • Understand Google Cloud’s features for Data Science
  • The process of doing Data Science
  • Using Google Compute Engine
  • Using Google Cloud Storage
  • Visualizing data using Google Data Studio
  • Running SQL queries using BigQuery
  • Data analytics with Python
  • Running Python code on Google Cloud
  • Large scale data analytics with Apache Spark
  • Running Spark using Google Dataproc
  • Machine Learning fundamentals
  • Spark ML library
  • Doing ML with Spark ML on Google Cloud

Audience

Data Analysts, Data Scientists

Duration

Three to four days depending on the agenda

Format

Lectures and hands-on labs (50% each).

Prerequisites

  • An interest in Data Science (a Data Science overview is included as needed)
  • Some basic knowledge of Python is highly recommended but not mandatory.
    Our labs use Python, which is an easy language to learn. Even if you have no previous exposure to Python, you will be able to complete the labs.
  • Some programming experience is highly recommended

Lab environment

Zero Install: There is no need to install any software on students’ machines!

Students will need the following:

  • A reasonably modern laptop
  • An unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly.
  • A Google Cloud account (highly recommended)

Detailed outline

Google Cloud Overview

  • Benefits of Cloud computing
  • Google Cloud ecosystem overview (Data Studio, Compute Engine, Colossus, BigQuery, Dataproc)
  • Lab: Getting up and running in Google Cloud

Google Compute Engine

  • Compute Engine Intro
  • Understanding different types of computing resources
  • Using compute resources effectively
  • Customizing a cloud VM
  • Lab: Using Compute Engine

Cloud Storage

  • Bringing data into the cloud
  • Data storage options in Google Cloud
  • Ingesting Data
  • Scheduling data ingestion
  • Lab: Ingesting Data into Google Cloud

Google Data Studio

  • Overview of Data Studio
  • Visualizing data using Data Studio
  • Labs

Google BigQuery

  • Introduction to BigQuery
  • Running queries on BigQuery
  • Labs

Data Analytics With Python

  • Exploring and understanding datasets
  • Cleaning data
  • Feature selection
  • Visualizing data
  • Labs

Python Development in Google Cloud

  • Introduction to Google Colab
  • Introduction to Cloud Datalab
  • Running Jupyter notebooks
  • Installing packages in Cloud Datalab
  • Labs

Large Scale Data Processing with Apache Spark

  • Spark Intro
  • Spark Shell
  • Loading and analyzing data in Spark
  • Spark DataFrames
  • Spark SQL
  • Labs

Google Dataproc

  • Introduction to Google Dataproc
  • Running Hadoop and Spark clusters using Dataproc
  • Labs

Machine Learning (4th day)

  • Overview of Machine Learning
  • Machine Learning algorithms
  • Feature Engineering
  • Regression
  • Classification
  • Clustering

Machine Learning in Spark (4th day)

  • Introduction to Spark ML library
  • Feature Engineering with Spark ML
  • Regression with Spark ML
  • Classification with Spark ML
  • Clustering with Spark ML
  • Labs
  • Lab: CPU vs. GPU benchmarking

Final Workshop (Time Permitting)

  • Students will work in teams
  • Each team will tackle a real-world Data Science problem using Google Cloud
  • Teams will present their work to the class