From Scala to Python in Spark

Scala Vs. Python

Spark’s native language is Scala, a fine language, but in many ways Spark seems more popular than Scala. I’m often asked why Spark’s creators chose Scala. Given that the Spark framework runs on the JVM, that really limited the choices of language to venerable Java or new-kid-on-the-block Scala. As Spark’s API is highly based on functional programming, Scala’s functional features make the code far more elegantly expressed than Java.

That said, Scala isn’t all that popular yet, and another Spark language is no doubt at least as popular: Python. Python has been around a long time, not as long as Java, but it’s attracted a huge following. And no wonder many Python frameworks and packages are among the best-in-class in any language. These days, it’s often taught as a first language to students, so there’s a whole new group of students speaking Python as their “native” programming language, not like the C++ and Java of my generation.

So, the fact that Spark speaks Python is huge for its popularity. And, Spark solves one of the main problems with doing analytics in Python: the fact that most python code isn’t particularly scalable.

In early versions of Spark, programming in Python meant a huge performance hit compared to Scala. Since we use CPython (rather than the little-used Jython), all objects must be marshalled between the JVM and the running CPython code. However, these days that’s not as critical, because most modern Spark code uses the Catalyst optimizer anyway to generate its low-level steps, so the originating language is not longer all that critical for performance reasons.

Jupyter Notebooks

One of the cool things about Python is that it’s the native language of Jupyter notebooks. That means we can start a pyspark session that runs in jupyter, by setting some environment variables:

PYSPARK_DRIVER_PYTHON=”jupyter” PYSPARK_DRIVER_PYTHON_OPTS=”notebook”

Then, we when we start pyspark, it will bring up the jupyter notebook.

~/spark/bin/pyspark

Datasets versus RDDs/Dataframes

In Scala, we’re moving into a post-RDD world, with the emergence of DataSets. But that isn’t the case in Python. In python, we’re back to RDDs and Dataframes. Things like spark.textFile don’t exist in pyspark, because we’re back to sc.textFile().

Then why use Python?

Simple! Python is a great Data Science environment. It’s like having most of R’s package goodness but with a modern software development language. Plus, most of us in data science are already using Python pretty heavily in our workflow.

From Scala to Python in Spark

Tim Fox

Leave a Reply Cancel reply