Big Intelligence, from Big Apps on Big Compute on Big Data
Technology revolutions play out in familiar patterns, almost always from the bottom up.
In Web 2.0, we shifted the focus to top-down user experience.
The Big Data story follows the same arc. Big Data 1.0 has been about storage (e.g., HDFS) and computation (e.g., Apache Spark). We are now at the threshold of Big Data 2.0. It’s time to change the conversation and focus on end-user applications. These are Big-Data-native applications, which business users and data scientists can use to interact directly with their Big Data.
We’ll call them “Big Apps”.
Big Apps on Big Compute on Big Data
Big-Data applications (Big Apps) can’t happen before the arrival of Big Compute, sitting on top of Big Data. In that sense, Apache Spark and Tachyon are key pieces of this larger puzzle. They play the role of the Big Compute engine that fills the gaps between Big Data and Big Apps. As I have written elsewhere, in Spark and Tachyon we have the perfect architectural timing. These engines correctly anticipate the cross-over between rising business value and dropping hardware (memory) costs.
But there’s another significant property of Spark that separates it from all other in-memory Big Compute engines. And this is something most of us do not fully appreciate.
In the Spark framework, data is stored in RDDs (Resilient Distributed Datasets), which are first-class citizens. RDDs have life cycles that transcend compute cycles.
This is very different from, say, Hadoop MapReduce, which holds data in memory only temporarily as an internal part of each MapReduce stage. When each stage completes, the data, transformed and summarized, is written out to disk. All persistence happens at the disk level.
What do RDDs buy us, besides memory-speed iterations between compute stages? They buy us something very significant: the ability to have long-lived applications that can access high-level data structures without having to go back to disk, and without having to recompute them. If you look at the architectures of Spark’s rivals, most lack one property or another in this very important dimension.
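To make the contrast concrete, here is a toy sketch in plain Python. It is not Spark's API: the ToyDataset class, its persist and collect methods, and the read_from_disk loader are all hypothetical names invented for illustration. The only point is to count how often a job has to go back to "disk" when the data does, or does not, outlive the job that first loaded it.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for a dataset that can outlive individual jobs.

    Hypothetical illustration only; this is not Spark's RDD API.
    """

    def __init__(self, loader: Callable[[], List[int]]) -> None:
        self._loader = loader                      # the expensive step, e.g. a disk read
        self._cached: Optional[List[int]] = None   # in-memory copy, if any
        self.load_count = 0                        # times we went back to "disk"

    def _load(self) -> List[int]:
        self.load_count += 1
        return self._loader()

    def persist(self) -> "ToyDataset":
        """Materialize the data once and keep it in memory."""
        if self._cached is None:
            self._cached = self._load()
        return self

    def collect(self) -> List[int]:
        """Return the data, reloading only if nothing is held in memory."""
        if self._cached is not None:
            return self._cached
        return self._load()


def read_from_disk() -> List[int]:
    # Stand-in for an expensive read of a large file.
    return list(range(1, 11))


# Stage-at-a-time style: every job starts from disk again.
uncached = ToyDataset(read_from_disk)
total_uncached = sum(uncached.collect())
maximum_uncached = max(uncached.collect())

# Long-lived application: load once, then serve many jobs from memory.
cached = ToyDataset(read_from_disk).persist()
total_cached = sum(cached.collect())
maximum_cached = max(cached.collect())

print(uncached.load_count, cached.load_count)  # prints "2 1"
```

Both variants compute the same answers; the difference is that the persisted dataset is read once and then reused by every later job, which is the property a long-lived application relies on.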
Likewise, Tachyon provides us with the facility of persistent in-memory data structures. And once we have such structures, we can not only access them, we can also share them. Thus, Spark and Tachyon make it possible to create collaborative Big Apps.
Collaboration-Led Productivity: The Most Important Feature/Benefit for Users
When I was working on Google Apps, we would often hear people ask, “Why launch Google Spreadsheets? It’s 20 years behind Microsoft Excel and 200 features short!” They didn’t realize that a driving mantra for Google Apps was “It’s the collaboration, People!” I have seen metrics, and still experience daily, how Google Apps’ real-time collaboration features boost team task productivity by a factor of 10 or more. It is collaboration among team members with diverse skill sets and points of view that yields these large gains in organizational smarts.
When something can increase productivity by such a huge amount, arguments about data engines that run 20% faster just pale in comparison.
Therefore, to us at Arimo, the fact that Spark and Tachyon enable deep collaboration over Big Data is very significant. We have built a full suite of user-facing applications that exploits these collaboration capabilities. For example, Arimo Narratives is an interactive document that allows business analysts and data scientists to collaborate on creating data narratives, complete with text, interactive charts, and maps, in real time, on the same huge datasets. They can use different access languages, one using plain English, the other using R. They can even collaborate across different client applications, one using a web browser, the other using RStudio.
The brain power, insights, and productivity supported by these capabilities are phenomenal.
What’s Coming Next?
What else can we do to help people and businesses become smarter and more productive? From the very first days when companies such as Google and Facebook started accumulating data at scale, it was about applying algorithms to learn from that data, to build better systems and to drive decisions. So if you think about it, the driving rationale for Big Data is really to marry it with Machine Learning to produce wisdom and insights. Thanks to Big Data and Big Compute, recent major advances in Deep Learning and related areas indicate that we are at the threshold of a significant acceleration in machine intelligence.
At Arimo, we are working to ensure that we can all enjoy the benefits of this acceleration. The power of predictive analytics can equip business systems to learn from examples. It will help us discover the unknown by interpolating from the knowns. It will help us forecast future outcomes by extrapolating from the past. Every one of Arimo’s Big Apps has such predictive capabilities built in as native features, from natural human interfaces to advanced machine-learning algorithms in the engine. Machine Learning is strong in our team’s DNA, and we envision a future in which machine intelligence is increasingly leveraged to aid and boost our human intelligence.
I am optimistic about this future and excited about delivering Arimo Big Apps, towards a future with Data Intelligence for All.