This post is part of our series on migrating and updating to Spark 2.
See all our posts on Spark and Spark 2.
Code repository
Learning Spark @ GitHub
Spark's machine learning (ML) components have changed significantly. As with the rest of Spark, the older RDD-based API (spark.mllib) coexists with the newer DataFrame-based API (spark.ml).
Yet I find that the newer API is still relatively unknown, despite being more powerful and significantly easier to use. Not only is it designed to work on DataFrames, but its pipeline-based approach is more elegant and will feel familiar to users of scikit-learn, the popular Python machine learning library.
Let's look at an example from the new API: k-means clustering. For those who are unfamiliar, k-means is one of the simplest ways to organize data into clusters. We specify the number of clusters we'd like to see (we call this k), and the algorithm seeds k points at random, then iteratively moves them until they converge to the centers of the clusters.
To do this, let's use a very simple and small dataset: mtcars, which lists a group of car models along with some numbers for each, such as miles per gallon (MPG) and number of cylinders. In fact, those are the only two columns we'll use, for the sake of simplicity.
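Before assembling features, we need the data in a Spark DataFrame. Here is a minimal sketch, assuming mtcars is available locally as a CSV file with mpg and cyl columns (the file name here is illustrative; the notebook linked below loads the data its own way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mtcars-kmeans").getOrCreate()
# inferSchema makes mpg and cyl numeric columns rather than strings
dataset = spark.read.csv("mtcars.csv", header=True, inferSchema=True)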
I've illustrated how this works in the Jupyter notebook http://www.github.com/elephantscale/learning-spark/from-mllib-to-ml.ipynb in our Learning Spark repository on GitHub.
VectorAssembler
The new Spark API has a class for assembling feature vectors: VectorAssembler. It does this within the context of DataFrames. Here's an example of how it works:
from pyspark.ml.feature import VectorAssembler
# Combine the mpg and cyl columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=["mpg", "cyl"], outputCol="features")
featureVector = assembler.transform(dataset)
Note that the output DataFrame will now include a new column called "features."
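To see the new column, we can peek at the transformed DataFrame:

featureVector.select("mpg", "cyl", "features").show(3)

Each row's features value is a vector combining that row's mpg and cyl values.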
Moving on, we can now fit a k-means model to the assembled DataFrame.
from pyspark.ml.clustering import KMeans
k = 2  # number of clusters; 2 is just an illustrative choice
kmeans = KMeans().setK(k).setSeed(1)  # a fixed seed makes the run reproducible
model = kmeans.fit(featureVector)
wssse = model.computeCost(featureVector)
This means we have a fitted model that we can apply to our data. WSSSE, the within-set sum of squared errors, is a cost function for measuring the effectiveness of the run: lower values mean tighter clusters.
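We can also inspect what the model learned. clusterCenters() returns the learned centers, one vector per cluster, in the (mpg, cyl) feature space:

# Print each learned cluster center
for center in model.clusterCenters():
    print(center)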
Finally, let's apply the model to our DataFrame and look at the results:
model.transform(featureVector).show()
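The transform adds a prediction column (the default output column name for KMeans) holding each row's cluster assignment. As a quick sketch, we can count how many cars land in each cluster:

clustered = model.transform(featureVector)
# Group by cluster assignment to see cluster sizes
clustered.groupBy("prediction").count().show()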
Try the Jupyter notebook and see the results for yourself.