Review of “Learning Spark” by Karau, Konwinski, Wendell & Zaharia

 

“Learning Spark” was the first published book on the subject. Six months later, there appeared a plethora of other books. This review will concern itself mostly with the “Learning Spark” book, but for the benefit of the reader, it will also compare it to a couple of other books. The intent here is to help the reader choose which book to read first, and to explain the merit of this and other Spark books.

Written by major Apache Spark committers, the book has the potential to be a definitive resource on Spark, and indeed it comes close. The areas that “Learning Spark” covers are: software install and basic operations, programming with RDDs and with key/value pairs, using advanced Spark facilities, SparkSQL and Machine Learning. Let’s look at each area.

Install and basic operations are a necessary part. You need to install and you need to use the Spark shell, that’s a given. What the book adds beyond the manual is an overview of Spark architecture, and discussion of the versions and builds, and where they fit. So, by the time you come to do the actual install, you are quite certain that the instructions will be correct, and indeed they are. This is more than some books can boast.

Another great thing about “Learning Spark” is the description of the architecture and implementation. Clear and authoritative, it gives you the knowledge to answer nagging questions asked by your alter ego, as well as by other developers.

RDD programming, and its specific application to key/value programming, is explained clearly and concisely. Here one might wish for more coverage of general information, for example, how key/value RDDs differ from generic RDDs, and what they give the programmer. Otherwise a reader may get lost in the tables listing the API functionality (which, by the way, can be found online). But overall, the examples are good and they work.
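To make that distinction concrete, here is a minimal sketch of my own (not taken from the book). It is a plain-Python analogy that runs without Spark: the lists stand in for RDDs, and `reduce_by_key` is a hypothetical local stand-in for Spark's `reduceByKey`. The point is only that once every element is a (key, value) tuple, per-key operations become possible, whereas a generic collection supports only whole-collection operations.

```python
from functools import reduce

# A "generic" collection: elements are opaque values, so only
# whole-collection operations (map, filter, reduce) make sense.
words = ["spark", "hadoop", "spark", "flink", "spark"]
total_chars = reduce(lambda a, b: a + b, map(len, words))

# A "pair" collection: every element is a (key, value) tuple,
# which is what makes per-key operations possible.
pairs = [(w, 1) for w in words]

def reduce_by_key(func, kv_pairs):
    """Hypothetical local stand-in for Spark's reduceByKey (assumption:
    this only mimics the semantics, with no partitioning or laziness)."""
    acc = {}
    for key, value in kv_pairs:
        # Combine with the running value for this key, or start a new one.
        acc[key] = func(acc[key], value) if key in acc else value
    return sorted(acc.items())

counts = reduce_by_key(lambda a, b: a + b, pairs)
print(total_chars)
print(counts)
```

In real Spark code the same shape would use `map` and `reduce` on the generic RDD and `reduceByKey` on the pair RDD; the sketch above leaves out everything distributed.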

SparkSQL is described well, but it is missing the later addition of DataFrames. Equally missing is a chapter about GraphX. But Spark Streaming is there.

What are the book’s competitors? One is “Fast Data Processing with Spark, 2nd Edition” by Sankar and Karau (notice, it is the same Holden Karau; she is quite prolific). This is a very practical book, with lots of scripts, screenshots and instructions. No theory though, which some reviews consider a drawback, but hey, the first edition (by Holden Karau) was the same way. And while the first edition got low ratings, the second one is rated quite highly. So I would recommend this as a companion book, a sort of lab manual to “Learning Spark.”

Another book that must be mentioned is “Advanced Analytics with Spark” by Ryza, Laserson, Owen & Wills. That is an excellent book for what it does (practical machine learning and analytics), and my review of it is coming, but it is in no way a substitute for the “Learning Spark” book. To know Spark, you need “Learning Spark.” To know Spark’s applications in the data science world, you will need the other one.

All in all, the book is excellent and indispensable for anyone who wants to feel that his or her knowledge of Apache Spark is complete, and it provides a solid foundation for this knowledge. A second edition is in order, with the upgrades mentioned above (DataFrames and GraphX), but even now the book is a very useful companion to the Big Data software developer.
