Processing unstructured text data with Spark 2 APIs – Dataset & Dataframe

This is part of our migrating/updating to Spark 2 series.

See all our posts on Spark and Spark2.

This post explains how to process unstructured, text data using newer Spark 2 APIs

Code repository
Learning Spark @ Github

And here is the code on Github

Screencast

Sample data

Nursery rhyme : twinkle twinkle little star.

twinkle twinkle little star
how I wonder what you are
up above the world so high
like a diamond in the sky
twinkle twinkle little star

Query

Find the lines that has the word ‘twinkle

Get the code


$  git clone https://github.com/elephantscale/learning-spark
$  cd learning-spark

 

Spark 1 Approach

Start spark-shell

$ spark-shell

We will use sc.textFile  this will give us an RDD


val a = sc.textFile("data/twinkle/sample.txt")

// a: org.apache.spark.rdd.RDD[String]


a.count
// 5
val a_twinkle = a.filter(_.contains("twinkle"))

// a_twinkle: org.apache.spark.rdd.RDD[String]



a_twinkle.count
// 2
a_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star

 

Spark 2 Approach 1

Instead of SparkContext (sc) we will use Spark Session (spark.read.textFile).  This will give us an Dataset

This is almost the same as previous


val b = spark.read.textFile("data/twinkle/sample.txt")
b: 

org.apache.spark.sql.Dataset[String]

 = [value: string]
b.count
// 5
val b_twinkle = b.filter(_.contains("twinkle"))

// b_twinkle: org.apache.spark.sql.Dataset[String]



b_twinkle.count
// 2
b_twinkle.collect.foreach(println)
// twinkle twinkle little star
// twinkle twinkle little star

 

Spark 2 Approach 2

We will use spark.read.text API, this will yield DataFrame


val c = spark.read.text("data/twinkle/sample.txt")

// c: org.apache.spark.sql.DataFrame = [value: string]



c.printSchema
// root
//  |-- value: string (nullable = true)

c.show(false)
// +---------------------------+
// |value                      |
// +---------------------------+
// |twinkle twinkle little star|
// |how I wonder what you are  |
// |up above the world so high |
// |like a diamond in the sky  |
// |twinkle twinkle little star|
// +---------------------------+

c.count
// 5

val c_twinkle = c.filter("value like '%twinkle%'")
// 

c_twinkle: org.apache.spark.sql.Dataset

[org.apache.spark.sql.Row] = [value: string]

c_twinkle.show(false)
// +---------------------------+
// |value                      |
// +---------------------------+
// |twinkle twinkle little star|
// |twinkle twinkle little star|
// +---------------------------+

 

Summary

There are multiple ways of processing unstructured text file in Spark 2.

version function return type
spark 1 sc.textFile(“file.txt”) RDD[String]
     
spark2 sc.textFile(“file.txt”) RDD[String]
  spark.read.textFile(“file.txt”) Dataset[String]
  spark.read.text(“file.txt”) DataFrame : [value: string]

 

Sujee Maniyam
Written by:

Sujee Maniyam

Sujee Maniyam is the co-founder of Elephantscale. He teaches and works on Big Data, AI and Cloud technologies. Sujee is a published author and a frequent speaker

Leave a Reply

Your email address will not be published. Required fields are marked *