This is part of our series on migrating/updating to Spark 2.
See all our posts on Spark and Spark 2.
This post explains how to process unstructured text data using the newer Spark 2 APIs.
Code repository
The code is on Github: Learning Spark @ Github
Screencast
Sample data
Nursery rhyme: "Twinkle Twinkle Little Star".
twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star
Query
Find the lines that contain the word 'twinkle'.
Get the code
$ git clone https://github.com/elephantscale/learning-spark
$ cd learning-spark
Spark 1 Approach
Start spark-shell
$ spark-shell
We will use sc.textFile, which gives us an RDD.
val a = sc.textFile("data/twinkle/sample.txt")
// a: org.apache.spark.rdd.RDD[String]
a.count
// 5
val a_twinkle = a.filter(_.contains("twinkle"))
// a_twinkle: org.apache.spark.rdd.RDD[String]
a_twinkle.count
// 2
a_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star
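As a small variation on the above (not part of the original example), the match can be made case-insensitive by lowercasing each line before testing it. A sketch, assuming the same spark-shell session:

```scala
// Case-insensitive filter on the RDD: lowercase each line first,
// so "Twinkle" and "TWINKLE" would also match.
val a = sc.textFile("data/twinkle/sample.txt")
val a_ci = a.filter(_.toLowerCase.contains("twinkle"))
a_ci.count
// 2  (same as before here, since the sample file is all lower case)
```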
Spark 2 Approach 1
Instead of the SparkContext (sc), we will use the SparkSession (spark.read.textFile). This gives us a Dataset.
The rest is almost the same as the previous approach.
val b = spark.read.textFile("data/twinkle/sample.txt")
// b: org.apache.spark.sql.Dataset[String] = [value: string]
b.count
// 5
val b_twinkle = b.filter(_.contains("twinkle"))
// b_twinkle: org.apache.spark.sql.Dataset[String]
b_twinkle.count
// 2
b_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star
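Because b is a typed Dataset[String], ordinary Scala functions apply directly via map/flatMap (spark-shell auto-imports the required spark.implicits._ encoders). As an illustrative sketch beyond the original query, counting word occurrences rather than matching lines:

```scala
// Dataset[String] supports typed transformations such as flatMap.
val b = spark.read.textFile("data/twinkle/sample.txt")
val words = b.flatMap(_.split("\\s+"))   // Dataset[String] of individual words
words.filter(_ == "twinkle").count
// 4  -- "twinkle" appears twice in each of the two matching lines
```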
Spark 2 Approach 2
We will use the spark.read.text API, which yields a DataFrame.
val c = spark.read.text("data/twinkle/sample.txt")
// c: org.apache.spark.sql.DataFrame = [value: string]
c.printSchema
// root
// |-- value: string (nullable = true)
c.show(false)
// +---------------------------+
// |value |
// +---------------------------+
// |twinkle twinkle little star|
// |how I wonder what you are |
// |up above the world so high |
// |like a diamond in the sky |
// |twinkle twinkle little star|
// +---------------------------+
c.count
// 5
val c_twinkle = c.filter("value like '%twinkle%'")
// c_twinkle: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: string]
c_twinkle.show(false)
// +---------------------------+
// |value |
// +---------------------------+
// |twinkle twinkle little star|
// |twinkle twinkle little star|
// +---------------------------+
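The same filter can also be written with the Column API instead of a SQL expression string; which form to use is largely a matter of taste. A sketch, assuming the same spark-shell session:

```scala
import org.apache.spark.sql.functions.col

// Equivalent filter using the Column API rather than a SQL string.
val c = spark.read.text("data/twinkle/sample.txt")
val c_twinkle2 = c.filter(col("value").contains("twinkle"))
c_twinkle2.count
// 2  -- same result as the SQL-expression version
```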
Summary
There are multiple ways of processing unstructured text files in Spark 2.
| version | function                        | return type                 |
|---------|---------------------------------|-----------------------------|
| Spark 1 | sc.textFile("file.txt")         | RDD[String]                 |
| Spark 2 | sc.textFile("file.txt")         | RDD[String]                 |
| Spark 2 | spark.read.textFile("file.txt") | Dataset[String]             |
| Spark 2 | spark.read.text("file.txt")     | DataFrame = [value: string] |
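The three return types also interconvert, so you are not locked into whichever API you start with. A brief sketch in spark-shell (which auto-imports the spark.implicits._ encoders needed by .as[String]):

```scala
// Moving between the three representations:
val df  = spark.read.text("data/twinkle/sample.txt")  // DataFrame = [value: string]
val ds  = df.as[String]                               // Dataset[String]
val rdd = ds.rdd                                      // RDD[String]
```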