Google Cloud for Data Scientists
2020 Feb 07
Overview
Data Science is in high demand today, and Google is one of its major promoters. Google Cloud Platform (GCP) is one of the leading platforms for Data Science.
In this course, students will learn to do Data Science with Python and explore the capabilities of Google Cloud that are specific to Data Science.
What you will learn
- Understand Google Cloud’s features for Data Science
- Process of doing Data Science
- Using Google Compute Engine
- Using Google Cloud Storage
- Visualizing data using Google Data Studio
- Running SQL queries using BigQuery
- Data analytics with Python
- Running Python code on Google Cloud
- Large scale data analytics with Apache Spark
- Running Spark using Google Dataproc
- Machine Learning fundamentals
- Spark ML library
- Doing ML with Spark ML on Google Cloud
Audience
Data Analysts, Data Scientists
Duration
Three to four days depending on the agenda
Format
Lectures and hands-on labs (50% / 50%)
Prerequisites
- Interested in Data Science (data science overview is included as needed)
- Some basic knowledge of Python is highly recommended but not mandatory. Our labs use Python, but Python is an easy language to learn, so even if you don’t have previous exposure to it, you will be able to complete the labs.
- Some programming experience is highly recommended
Lab environment
Zero Install: There is no need to install any software on students’ machines!
Students will need the following
- A reasonably modern laptop
- Unrestricted connection to the Internet. Laptops with overly restrictive VPNs or firewalls may not work properly
- A Google Cloud account is highly recommended.
Detailed outline
Google Cloud Overview
- Benefits of Cloud computing
- Google Cloud ecosystem overview (Data Studio, Compute Engine, Colossus, BigQuery, Dataproc)
- Lab: Getting up and running in Google Cloud
Google Compute Engine
- Compute Engine Intro
- Understanding different types of computing resources
- Using compute resources effectively
- Customizing a cloud VM
- Lab: Using Compute Engine
Cloud Storage
- Bringing data into the cloud
- Data storage options in Google Cloud
- Ingesting Data
- Scheduling data ingestion
- Lab: Ingesting Data into Google Cloud
Google Data Studio
- Overview of Data Studio
- Visualizing data using Data Studio
- Labs
Google BigQuery
- Introduction to BigQuery
- Running queries on BigQuery
- Labs
Data Analytics With Python
- Exploring and understanding datasets
- Cleaning data
- Feature selection
- Visualizing data
- Labs
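As a rough illustration of the steps listed in this module (exploring a dataset, cleaning missing values, and selecting features), here is a minimal sketch that uses only the Python standard library. The sample data, column names, and helper functions are hypothetical, and the actual labs may use different datasets and libraries.

```python
import csv
import io
import statistics

# Hypothetical sample data; the real lab datasets are not specified in this outline.
RAW_CSV = """age,income,city
34,52000,Austin
,61000,Boston
45,,Austin
29,48000,Denver
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_numeric(rows, column):
    """Convert a column to float, filling missing values with the column median."""
    present = [float(r[column]) for r in rows if r[column] != ""]
    median = statistics.median(present)
    for r in rows:
        r[column] = float(r[column]) if r[column] != "" else median
    return rows

def select_features(rows, columns):
    """Keep only the chosen columns in each row."""
    return [{c: r[c] for c in columns} for r in rows]

rows = load_rows(RAW_CSV)
for col in ("age", "income"):
    clean_numeric(rows, col)

# Keep only the numeric columns as features for later modeling.
features = select_features(rows, ["age", "income"])
print(features)
```

Filling gaps with the median is only one possible cleaning choice; dropping incomplete rows is another, and which one is appropriate depends on the dataset.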
Python Development in Google Cloud
- Introduction to Google Colab
- Introduction to Datalab
- Running Jupyter notebooks
- Installing packages in Cloud Datalab
- Labs
Large Scale Data Processing with Apache Spark
- Spark Intro
- Spark Shell
- Loading and analyzing data in Spark
- Spark Dataframes
- Spark SQL
- Labs
Google Dataproc
- Introduction to Google Dataproc
- Running Hadoop and Spark clusters using Dataproc
- Labs
Machine Learning (4th day)
- Overview of Machine Learning
- Machine Learning algorithms
- Feature Engineering
- Regressions
- Classifications
- Clustering
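To make the regression topic above concrete, here is a minimal sketch of fitting a straight line with ordinary least squares in plain Python. The function name and the toy data are assumptions chosen for illustration; the course labs may use a library implementation instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and the cross-deviation sum of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noise-free data the fit recovers the generating line; with real data the same formulas give the line that minimizes the squared prediction error.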
Machine Learning in Spark (4th day)
- Introduction to Spark ML library
- Feature Engineering with Spark ML
- Regressions with Spark ML
- Classification with Spark ML
- Clustering with Spark ML
- Labs
- Lab: CPU vs. GPU benchmarking
Final Workshop (Time Permitting)
- Students will work in teams
- Each team will solve a real-world Data Science problem using Google Cloud
- Teams will present their work to the class