Tag Archives: spark

Processing unstructured text data with Spark 2 APIs – Dataset & Dataframe

This is part of our migrating/updating to Spark 2 series. See all our posts on Spark and Spark2. This post explains how to process unstructured text data using the newer Spark 2 APIs. Code repository: Learning Spark @ Github. And here is the code on Github. Screencast. Sample data: nursery rhyme: twinkle twinkle little star. […]

Migrating / Upgrading to Spark version 2

Motivation: Spark is an amazing computing framework. Spark version 2 has lots of exciting stuff. And Hadoop vendors Cloudera and Hortonworks are now supporting Spark 2 on their platforms. So we anticipate a lot of people will be upgrading or migrating to Spark 2. However, lots of Spark tutorials and code samples on the web are […]

From Spark MLLib 1.0 to Spark ML 2.1

This is part of our migrating/updating to Spark 2 series. See all our posts on Spark and Spark2. Code repository: Learning Spark @ Github. Screencast. Spark’s Machine Learning (ML) components have changed significantly. Just like the rest of Spark, the older RDD-based API persists alongside the newer DataFrame-based API. Yet, I find that the […]

From Scala to Python in Spark

Scala vs. Python: Spark’s native language is Scala, a fine language, but in many ways Spark seems more popular than Scala. I’m often asked why Spark’s creators chose Scala. Because the Spark framework runs on the JVM, the choice of language was really limited to venerable Java or new-kid-on-the-block Scala. As Spark’s […]

Spark Summit 2015 highlights and recap

(Disclaimer: This is not an official post from Databricks.) Spark Summit 2015 in San Francisco was well attended. Kudos to the Databricks team for organizing this fantastic conference. All of the conference talks are available online (both slide decks & video). The talks were excellent! Here are some highlights: Spark: Cool Features and What’s New. Spark is very actively […]