22 Jun 2017
Processing unstructured text data with Spark 2 APIs – Dataset & Dataframe
Sujee Maniyam. Posted in: Dev Stuff, Screencasts, Tutorials

This is part of our migrating/updating to Spark 2 series.

See all our posts on Spark and Spark2.

This post explains how to process unstructured text data using the newer Spark 2 APIs.

Code repository
Learning Spark @ Github


Screencast

Sample data

Nursery rhyme: Twinkle Twinkle Little Star.

twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star

Query

Find the lines that contain the word 'twinkle'.

Get the code


$  git clone https://github.com/elephantscale/learning-spark
$  cd learning-spark

 

Spark 1 Approach

Start spark-shell

$ spark-shell

We will use sc.textFile; this gives us an RDD of lines.


val a = sc.textFile("data/twinkle/sample.txt")
// a: org.apache.spark.rdd.RDD[String]
a.count
// 5
val a_twinkle = a.filter(_.contains("twinkle"))
// a_twinkle: org.apache.spark.rdd.RDD[String]

a_twinkle.count
// 2
a_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star
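The predicate passed to filter is an ordinary Scala function of type String => Boolean; Spark simply applies it to each line. A minimal sketch of the same logic on a plain Scala collection (no Spark needed; the list literal just reuses the sample lines above):

```scala
// The same per-line predicate Spark applies, run on a plain List.
val lines = List(
  "twinkle twinkle little star",
  "how I wonder what you are",
  "up above the world so high",
  "like a diamond in the sky",
  "twinkle twinkle little star")

val matched = lines.filter(_.contains("twinkle"))
println(matched.size)    // 2
matched.foreach(println)
```

The RDD version distributes exactly this computation across partitions; the predicate itself is unchanged.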

 

Spark 2 Approach 1

Instead of the SparkContext (sc), we will use the SparkSession via spark.read.textFile. This gives us a Dataset[String].

The code is almost identical to the previous approach.


val b = spark.read.textFile("data/twinkle/sample.txt")
// b: org.apache.spark.sql.Dataset[String] = [value: string]
b.count
// 5
val b_twinkle = b.filter(_.contains("twinkle"))
// b_twinkle: org.apache.spark.sql.Dataset[String]

b_twinkle.count
// 2
b_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star

 

Spark 2 Approach 2

We will use the spark.read.text API; this yields a DataFrame.


val c = spark.read.text("data/twinkle/sample.txt")
// c: org.apache.spark.sql.DataFrame = [value: string]

c.printSchema
// root
//  |-- value: string (nullable = true)

c.show(false)
// +---------------------------+
// |value                      |
// +---------------------------+
// |twinkle twinkle little star|
// |how I wonder what you are  |
// |up above the world so high |
// |like a diamond in the sky  |
// |twinkle twinkle little star|
// +---------------------------+

c.count
// 5

val c_twinkle = c.filter("value like '%twinkle%'")
// c_twinkle: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: string]

c_twinkle.show(false)
// +---------------------------+
// |value                      |
// +---------------------------+
// |twinkle twinkle little star|
// |twinkle twinkle little star|
// +---------------------------+
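The string predicate value like '%twinkle%' uses SQL LIKE syntax, where % matches any sequence of characters and _ matches exactly one. To illustrate those semantics in plain Scala (likeToRegex is a hypothetical helper written for this post, not a Spark API), a LIKE pattern can be translated into a regex:

```scala
// Hypothetical helper: translate a SQL LIKE pattern into a Scala Regex.
// '%' matches any sequence of characters, '_' matches exactly one;
// regex metacharacters in the pattern are escaped.
def likeToRegex(pattern: String): scala.util.matching.Regex =
  pattern.flatMap {
    case '%'                                => ".*"
    case '_'                                => "."
    case c if "\\.[]{}()*+?^$|".contains(c) => "\\" + c
    case c                                  => c.toString
  }.r

val like = likeToRegex("%twinkle%")
println(like.pattern.matcher("twinkle twinkle little star").matches())  // true
println(like.pattern.matcher("how I wonder what you are").matches())    // false
```

So '%twinkle%' selects exactly the rows whose value contains the substring, matching what the _.contains("twinkle") lambda did in the earlier approaches.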

 

Summary

There are multiple ways of processing unstructured text files in Spark 2.

version   function                           return type
Spark 1   sc.textFile("file.txt")            RDD[String]
Spark 2   sc.textFile("file.txt")            RDD[String]
Spark 2   spark.read.textFile("file.txt")    Dataset[String]
Spark 2   spark.read.text("file.txt")        DataFrame ([value: string])
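These representations also convert into one another, which is handy when migrating code incrementally. A minimal spark-shell sketch (assumes a Spark 2.x session where spark is the SparkSession, and the same sample file as above):

```scala
import spark.implicits._                              // needed for .as[String] and .toDF

val df  = spark.read.text("data/twinkle/sample.txt")  // DataFrame [value: string]
val ds  = df.as[String]                               // Dataset[String]
val rdd = ds.rdd                                      // RDD[String], for legacy Spark 1 code
val back = rdd.toDF("value")                          // back to a DataFrame
```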

 
