When data lakes first started springing up about eight years ago, there were more than a few ways to define them. It all depended on who you talked to. Water-related puns abounded, along with aquatically themed warnings about swamps, sinkholes and puddles. Today, things are not that different. In fact, data lakes, and their puns, are back and flooding the headlines. That alone is a bit surprising for industry watchers who predicted that use of the fluid term would eventually evaporate.
But it hasn’t, and arguments about exactly what a data lake is have been renewed and updated, with some similar themes. Some say it’s just about storing data in its native form, while others describe more analytically powered environments. For many, the term data lake seems to be simply shorthand for a shared data environment, powered by Hadoop and Spark.
However we define the data lake, it’s clear that there is interest in the concept, as well as in its success rates and ROI. Based on the latest analyst stats and our own anecdotal evidence, data lake levels are continuing to rise.
Analysts Weigh In on Data Lake Adoption
Just two years ago, IDC’s Data Lake Survey, September 2016, found 18% of companies had a data lake. A further 42% planned to deploy one within the next six to 12 months and 27% planned to deploy one within the next few years, leaving only 13% of companies surveyed not actively exploring the use of a data lake. Now that we’re in 2018, another way to look at these numbers is that most of the companies surveyed in 2016 theoretically have a data lake or an imminent plan to deploy one.
A 2017 TDWI report, “Data Lakes: Purposes, Practices, Patterns, and Platforms,” shows higher data lake adoption rates: 23% of respondents already had a data lake in production. Among the rest, 24% were planning to have a data lake in production within 12 months, 15% within 24 months, 10% within 36 months, and 21% in 3+ years. Only 7% weren’t planning to put a data lake into production.
And the deployment platform of choice? For 53% of businesses, it’s Hadoop, with another 24% reporting data lake deployments on both Hadoop and a relational database management system (RDBMS), according to the 2017 report from TDWI.
While data lakes are frequently associated with Hadoop, Gartner agrees that relational technology shouldn’t be counted out. “By 2020, Gartner predicts that 30% of data lakes will be built on standard RDBMS technology at equal or lower cost than Hadoop,” write Gartner analysts Ted Friedman, Roxane Edjlali, Nick Heudecker, Donald Feinberg, Mark A. Beyer, Adam M. Ronthal and Andrew White in the report “Predicts 2018: Data Management Strategies Continue to Shift Toward Distributed,” 31 October 2017.
So there you have it. Data lakes are being built and businesses are diving in. Of course, implementation is driven by a company’s specific business and data needs, but all signs indicate that data lakes will continue to make a splash as part of a larger data strategy.
You can learn more about data lakes at this online resource, Data Lake Concepts, which offers definitions, news and articles about data lakes.
Article by Hannah Smalltree. Hannah is on the leadership team at Cazena, which offers enterprise Big Data as a Service. She’s worked for several data software companies and spent over a decade as a technology journalist, interviewing companies about their data and analytics programs.