Last month, Garrett Young of IBM presented at our Houston Hadoop & Spark Meetup. The topic was an interesting one: how does IBM plan to make money on an open source project, in this case Spark?
First, Garrett briefly introduced Spark and spelled out the reasons for IBM’s interest in Spark: it is performant, productive, leverages existing investment (in Hadoop) and grows better with age (growing community). He mentioned that IBM Watson uses Spark Machine Learning. It also uses Scikit-Learn and can pull data from different sources.
He explained how much effort IBM is putting into Spark: they are the second-largest committer after Databricks. I am showing the slide with the numbers here.
The answer emerged as follows: there are many good open-source projects in the world. However, it takes a large company to bring them together and deliver them for use by developers, and that is what Bluemix does.
The meetup attendees seconded this: many large companies, particularly the oil and gas companies in Houston, need end-to-end organization of Big Data processing, with multiple tools integrated and secured. That is the major emphasis of Big Data players now, and it is what Cloudera calls “data pipelines.”
So IBM, too, will continue to build on Spark, integrating multiple products and services with it and with its other offerings. Plain and simple, but it was not clear to me before. So thank you, Garrett!
The complete presentation can be found on SlideShare here.