Don’t Get Washed Out by the Overflowing Data Lake: 5 Key Considerations

By Steve Sarsfield

Dec 13, 2018

In many TV shows and movies, there’s a character back in the security operations center at headquarters, or working on a laptop in a van outside the action: the technical member of the crime-solving team who sifts through petabytes of video, logs, and other data to uncover the crucial clue that solves the crime.

However, as database professionals, we know the truth. The data isn’t just sitting there waiting to be analyzed. It’s dirty, unformatted, and not necessarily ready for analysis. Collecting and harnessing the power of massive amounts of data is a complicated exercise.

In many organizations, data management pros are still challenged by a collection of legacy enterprise data warehouse architectures, Hadoop, and cloud storage (to name just a few).

The ever-growing volumes of data that we generate lead us to look for places to put the data, such as the data lake. It grows and grows with data that may or may not have value, but the prevailing thought is that it may be needed someday.

Regardless of how good the platforms are at storing data, most data lakes are not full analytics platforms. This leaves many organizations with a massive amount of data and a huge usability gap when trying to perform analytics on it. It takes a solid strategy for governing data to make sure data analytics can be leveraged for continued business insights that can translate to company success.

Here are five key considerations for organizations that are trying to make more out of their data lake.

1-Consider Query Scaling, Not Just Data Scaling

When data is coming at you with volume, variety, and varying velocity, the scalability of analytics gets less consideration than simply keeping up with the data. Data management pros may find themselves working on the data conveyor belt, needing to take data from the belt and put it somewhere. Often this means that they put the data conveniently in Hadoop Distributed File System (HDFS) volumes or cloud storage such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. These storage locations offer a fast and convenient way to store data at a very low cost.

However, keep in mind that cloud storage and HDFS are not databases. A database management system (DBMS) is a place where you can load and store data in the optimal way for queries. ACID (atomicity, consistency, isolation, durability) compliance, workload management, and concurrency are the foundations of a database. When your data is in cloud storage, you’re more likely to use a query engine to perform analytics. A query engine is less concerned with optimization and more suited to data exploration. Use it to explore data that is not subject to service-level agreements.
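To make the "A" in ACID concrete, here is a minimal sketch of an atomic batch load. It uses Python's built-in sqlite3 module only as a convenient stand-in for a DBMS; the `events` table, its columns, and the `load_batch` helper are hypothetical names chosen for illustration, not part of any particular product.

```python
import sqlite3

def load_batch(conn, rows):
    """Insert every row in the batch or none of them.

    Returns True if the batch was committed, False if it was rolled back.
    """
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.executemany(
                "INSERT INTO events (id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
# Hypothetical table: every event must carry a non-null amount.
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

good = load_batch(conn, [(1, 9.5), (2, 3.0)])
# The second row violates NOT NULL, so the first row of this batch
# is rolled back along with it.
bad = load_batch(conn, [(3, 1.0), (4, None)])

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Files dropped into HDFS or object storage carry no such guarantee on their own: a half-written batch is simply there until something cleans it up.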

You need a DBMS when your data needs a home that delivers standards-compliant SQL, ACID compliance, and built-in backup and restoration. A DBMS provides advanced methods for optimization and for faster analytics. Most importantly, you store data in a database with a DBMS when you’re expecting it to meet service-level agreements on analytics. In other words, if you have to run X reports in Y minutes, use a DBMS. If you have hundreds or even thousands of end users analyzing data, a query engine looking at unknown data volumes generally won’t cut it for timely analytics.
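The rule of thumb about running a set number of reports within a set number of minutes can be turned into a rough back-of-the-envelope check. The sketch below assumes every report takes about the same time and that the system runs a fixed number of reports concurrently with no contention; both are simplifying assumptions, and the function names and figures are illustrative only.

```python
import math

def batch_minutes(num_reports, avg_minutes_per_report, concurrent_slots):
    """Estimate wall-clock minutes to finish all reports.

    Reports run in "waves" of `concurrent_slots` at a time, and each
    wave is assumed to take `avg_minutes_per_report`.
    """
    waves = math.ceil(num_reports / concurrent_slots)
    return waves * avg_minutes_per_report

def meets_sla(num_reports, avg_minutes_per_report, concurrent_slots, sla_minutes):
    """True if the estimated batch time fits inside the SLA window."""
    estimate = batch_minutes(num_reports, avg_minutes_per_report, concurrent_slots)
    return estimate <= sla_minutes

# 200 reports at 1.5 minutes each against a 45-minute window.
with_ten_slots = meets_sla(200, 1.5, 10, 45)   # 20 waves * 1.5 = 30 minutes
with_four_slots = meets_sla(200, 1.5, 4, 45)   # 50 waves * 1.5 = 75 minutes
```

The estimate only works when per-report runtime and concurrency are predictable, which is what workload management in a DBMS is meant to provide. With a query engine scanning unknown data volumes, the average runtime is much harder to pin down.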

2-Consider the Analytics You Need, Not Just the Storage You Need

For data management systems, the two most important factors are safely storing the data and effectively analyzing the data. Yet the analysis part is often under-scrutinized. Analytical systems vary greatly in the depth of analysis they offer. Some don’t support the full range of SQL. If you need to do a JOIN with a WHERE clause, for example, some setups can’t handle it. If you want to do geospatial analytics, such as finding the distance between addresses or latitude/longitude points, some systems require add-ons that make the work clunky and onerous.
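As an illustration of the two capabilities mentioned here, the following sketch runs a JOIN with a WHERE clause that filters on great-circle distance between latitude/longitude points. It uses Python's built-in sqlite3 module, which has no native geospatial functions, so the haversine formula is registered as a user-defined function. The `stores` and `customers` tables, their rows, and the 50 km threshold are made up for the example.

```python
import math
import sqlite3

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the same name.
conn.create_function("haversine_km", 4, haversine_km)
conn.executescript("""
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, city TEXT, lat REAL, lon REAL);
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY, name TEXT,
    store_id INTEGER, lat REAL, lon REAL
);
INSERT INTO stores VALUES (1, 'Boston', 42.3601, -71.0589);
INSERT INTO customers VALUES
    (1, 'near', 1, 42.3736, -71.1097),
    (2, 'far',  1, 40.7128, -74.0060);
""")

# JOIN each customer to their assigned store, then keep only the
# customers who live within 50 km of it.
nearby = conn.execute("""
    SELECT c.name, s.city
    FROM customers AS c
    JOIN stores AS s ON s.store_id = c.store_id
    WHERE haversine_km(c.lat, c.lon, s.lat, s.lon) < 50
    ORDER BY c.customer_id
""").fetchall()
```

Having to hand-register a distance function is a small taste of the add-on burden described above. A platform with built-in geospatial support would expose an equivalent function directly in SQL.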
