How to prepare for the Cloudera Data Scientist Certification Exam

At our Houston Hadoop Meetup, Austin Sun showed how to prepare for the Cloudera Data Scientist Certification exam. Austin had been preparing this presentation for quite a while, has passed the certification himself, and shared his experience with others. The certification is definitely recommended by Sujee Maniyam in his “Launching Your Career in Big Data” […]

IBM Strategy for Spark

Last month, Garrett Young of IBM presented at our Houston Hadoop & Spark Meetup. The topic was an interesting one: how IBM plans to make money on an open source project, in this case, Spark. First, Garrett briefly introduced Spark and spelled out the reasons for IBM’s interest in Spark: it is performant, productive, […]

Processing unstructured text data with Spark 2 APIs – Dataset & DataFrame

This is part of our migrating/updating to Spark 2 series. See all our posts on Spark and Spark2. This post explains how to process unstructured text data using the newer Spark 2 APIs. Code repository: Learning Spark @ Github. And here is the code on Github. Screencast. Sample data: Nursery rhyme: twinkle twinkle little star. […]

Migrating / Upgrading to Spark version 2

Motivation: Spark is an amazing computing framework. Spark version 2 has lots of exciting stuff. And Hadoop vendors Cloudera and Hortonworks are now supporting Spark 2 on their platforms. So we anticipate that a lot of people will be upgrading or migrating to Spark 2. However, lots of Spark tutorials and code samples on the web are […]

MOOC or in-class training?

In his article, “How the pioneers of MOOC got it all wrong,” Robert Ubell quotes surprising statistics. To date, 58 million people have signed up for a MOOC (which stands for Massive Open Online Course), but the completion rate is between 7% and 12%. Many people quit watching after a few minutes, and many are just […]