Processing unstructured text data with Spark 2 APIs - Dataset & DataFrame

This post is part of our series on migrating to Spark 2.

See all our posts on Spark and Spark2.

This post explains how to process unstructured text data using the newer Spark 2 APIs.

Code repository
Learning Spark @ GitHub

Sample data

Nursery rhyme: Twinkle Twinkle Little Star.

twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star

Query

Find the lines that contain the word 'twinkle'.

Get the code


$  git clone https://github.com/elephantscale/learning-spark
$  cd learning-spark

Spark 1 Approach

Start spark-shell

$ spark-shell

We will use sc.textFile, which gives us an RDD.


val a = sc.textFile("data/twinkle/sample.txt")

// a: org.apache.spark.rdd.RDD[String]


a.count
// 5
val a_twinkle = a.filter(_.contains("twinkle"))

// a_twinkle: org.apache.spark.rdd.RDD[String]



a_twinkle.count
// 2
a_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star

Spark 2 Approach 1

Instead of SparkContext (sc), we will use SparkSession (spark.read.textFile). This gives us a Dataset.

The code is almost the same as in the Spark 1 approach.


val b = spark.read.textFile("data/twinkle/sample.txt")

// b: org.apache.spark.sql.Dataset[String] = [value: string]

b.count
// 5
val b_twinkle = b.filter(_.contains("twinkle"))

// b_twinkle: org.apache.spark.sql.Dataset[String]



b_twinkle.count
// 2
b_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star
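
To try the same typed filter without the sample file, the rhyme can be built in memory. This is a minimal sketch, not code from the repository: the names `rhyme`, `rhymeTwinkle`, and `twinkleCount`, as well as the local-mode session settings, are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session the shell already started.
// master("local[*]") and the app name are assumptions for running elsewhere.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("twinkle-dataset")
  .getOrCreate()
import spark.implicits._ // provides toDS() and the String encoder

// Hypothetical in-memory copy of the sample data
val rhyme = Seq(
  "twinkle twinkle little star",
  "how I wonder what you are",
  "up above the world so high",
  "like a diamond in the sky",
  "twinkle twinkle little star"
).toDS() // Dataset[String]

// Typed filter: the lambda receives each line as a plain String
val rhymeTwinkle = rhyme.filter(_.contains("twinkle"))
val twinkleCount = rhymeTwinkle.count // Long
```

Because the Dataset is typed as String, the lambda can call any String method directly, just as it did on the RDD.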

Spark 2 Approach 2

We will use the spark.read.text API, which yields a DataFrame.


val c = spark.read.text("data/twinkle/sample.txt")

// c: org.apache.spark.sql.DataFrame = [value: string]



c.printSchema
// root
//  |-- value: string (nullable = true)

c.show(false)
// +---------------------------+
// |value                      |
// +---------------------------+
// |twinkle twinkle little star|
// |how I wonder what you are  |
// |up above the world so high |
// |like a diamond in the sky  |
// |twinkle twinkle little star|
// +---------------------------+

c.count
// 5

val c_twinkle = c.filter("value like '%twinkle%'")

// c_twinkle: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: string]

c_twinkle.show(false)
// +---------------------------+
// |value                      |
// +---------------------------+
// |twinkle twinkle little star|
// |twinkle twinkle little star|
// +---------------------------+
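
The filter above passes the condition as a SQL string. The same condition can also be expressed with the Column API. The following is a sketch under stated assumptions: the DataFrame is built in memory instead of being read from the sample file, and the names `lines` and `linesTwinkle` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell, getOrCreate() reuses the existing session.
// master("local[*]") and the app name are assumptions for running elsewhere.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("twinkle-dataframe")
  .getOrCreate()
import spark.implicits._ // provides toDF() on local collections

// Hypothetical in-memory DataFrame with a single string column named "value"
val lines = Seq(
  "twinkle twinkle little star",
  "how I wonder what you are",
  "up above the world so high",
  "like a diamond in the sky",
  "twinkle twinkle little star"
).toDF("value")

// Column.contains performs a substring match, comparable to like '%twinkle%'
val linesTwinkle = lines.filter(col("value").contains("twinkle"))
linesTwinkle.show(false)
```

Whether to use the SQL string or the Column form is mostly a matter of taste; the Column form avoids quoting a pattern inside a string.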

Summary

There are multiple ways to process unstructured text files in Spark 2.

Version	Function	Return type
Spark 1	sc.textFile("file.txt")	RDD[String]
Spark 2	sc.textFile("file.txt")	RDD[String]
Spark 2	spark.read.textFile("file.txt")	Dataset[String]
Spark 2	spark.read.text("file.txt")	DataFrame = [value: string]
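
The three return types can be converted into one another, so the choice of read function does not lock you in. A brief sketch, assuming a small in-memory Dataset named `ds` (hypothetical) in place of the sample file:

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() reuses the existing session.
// master("local[*]") and the app name are assumptions for running elsewhere.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("twinkle-conversions")
  .getOrCreate()
import spark.implicits._ // encoders needed by toDS() and as[String]

// Hypothetical two-line Dataset[String]
val ds = Seq("twinkle twinkle little star", "how I wonder what you are").toDS()

val df = ds.toDF()        // DataFrame; the single column is named "value"
val typed = df.as[String] // back to Dataset[String]
val rdd = ds.rdd          // RDD[String]
```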
