8 Rules of the Road for Fast Data Management and Analytics


These days, end users—be they employees or consumers visiting a site—expect information delivered in seconds, if not milliseconds. Applications tied into networks of connected devices and sensors are powering operations and making adjustments in real time.

This calls for fast, intelligent data—often referred to as “streaming data”—which enterprises need to compete in an intensely competitive global economy. From a consumer’s point of view, no user experience is complete without some measure of analysis or intelligence, typically tied to recommendation engines that provide additional insights or suggest courses of action. The challenge for data and development teams is not only to build these types of intelligent services but also to find the best ways to deliver data interactively and in real time, or near real time.

An emerging generation of technologies and methodologies—from in-memory databases to machine learning to alternative database architectures—promises to deliver on this potential. Fast data is at the forefront of the real-time revolution. It means moving large volumes of data through systems and across decision makers’ desks, enabling real-time views of events as they happen, whether a customer problem, an inventory shortage, or a systems glitch. It’s a matter of identifying the moments of truth, in real time, when an end user—be it an employee or a customer—is ready to make the next decision.

This is a sea change from typical data environments, which, until recently, were tasked with delivering static reports, most likely on historical data. Now there is a drive toward everything from real-time analytics to streaming analytics to operational intelligence, in which the information decision makers see is refreshed continuously. The next phase in this evolution is predictive analytics, built on a constant feedback loop of real-time data from sensors and systems feeding into operations.

Enterprises also recognize that real-time capabilities deliver greater value to their organizations and customers than traditional batch-mode processing. Real-time processing keeps applications, and the information they provide, constantly refreshed. Batch mode, on the other hand, means periodic updates of large sets of data, typically on a 24-hour cycle, with no real-time interaction.

Many enterprises recognize the role that fast data is playing in current and future growth plans. A recent survey of 4,000 data professionals by OpsClarity, Inc. found that 92% of companies plan to leverage stream-processing applications, while 79% intend to reduce or eliminate their investments in batch-only processing. It’s going to take some time, however. While 65% of respondents claim to have real-time or near-real-time pipelines currently in production, they are still leveraging a wide mix of data processing technologies—batch, micro-batch, or streaming.

There are many solutions on the horizon that promise to converge real-time capabilities with data environments, but integrating these multiple approaches can be daunting. Streaming analytics, for example, can be supported through open source solutions such as Apache Spark, a fast cluster computing system, and Apache Kafka, a distributed streaming platform, yet stitching these environments together takes real work. Plus, these newer solutions often get built on top of existing data environments, such as data warehouses.
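As a rough illustration of how those two pieces fit together, the sketch below uses Spark Structured Streaming to read events from a Kafka topic. The broker address, topic name, and application name are placeholders, and the job assumes the Spark-Kafka connector package is available on the cluster.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("clickstream-demo")  # illustrative application name
    .getOrCreate()
)

# Subscribe to a Kafka topic; broker address and topic are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka delivers raw bytes, so cast the payload to a string for downstream parsing.
messages = events.selectExpr("CAST(value AS STRING) AS message")

# Stream running results to the console, micro-batch by micro-batch.
query = (
    messages.writeStream
    .outputMode("append")
    .format("console")
    .start()
)
query.awaitTermination()
```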




The good news is that a fast data infrastructure doesn’t have to be a drag on performance. Data managers need to take proactive measures to build, maintain, and support today’s generation of highly interactive and intelligent applications. Maintaining performance is an important piece of the puzzle, and that’s where database technology converges with the drive to real time. Here are the key elements to consider in moving to a fast or streaming data environment:

Mind your storage.

Fast data depends on one essential technology component: abundant, responsive storage. Data managers and their business counterparts need to understand which data pulsing through their organizations needs only to be read once and discarded, and which should be stored for historical purposes. Many forms of data—such as constant streams of normal readings from sensors—simply aren’t important enough to justify archival storage.
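One common pattern, sketched below with purely hypothetical field names and thresholds, is to filter the stream at ingest so that only readings outside the normal range are kept for archival:

```python
# Hypothetical normal operating range for a temperature sensor.
NORMAL_RANGE = (10.0, 80.0)

def worth_storing(reading: dict) -> bool:
    """Keep only readings that fall outside the normal range."""
    value = reading["temperature_c"]
    return not (NORMAL_RANGE[0] <= value <= NORMAL_RANGE[1])

# A tiny simulated stream; in practice this would come from a message queue.
stream = [
    {"device_id": "pump-17", "temperature_c": 71.4},   # normal, discard
    {"device_id": "pump-18", "temperature_c": 95.2},   # anomaly, archive
]

to_archive = [reading for reading in stream if worth_storing(reading)]
print(to_archive)
```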

Consider alternative databases.

Much of the data being sought across enterprises these days is of the unstructured, non-relational variety—video, graphics, log data, and so forth. Relational database systems tend to be too slow for tasks that rely on unstructured data streams. NoSQL databases, by contrast, have lighter-weight footprints and can ingest these streams at faster rates than established relational database environments.
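As a minimal sketch, assuming a local MongoDB server and the pymongo client, a schemaless document store can ingest heterogeneous events without a predefined schema; the database, collection, and field names here are illustrative.

```python
from datetime import datetime, timezone

from pymongo import MongoClient  # assumes `pip install pymongo` and a local MongoDB server

# Connection details are illustrative; substitute your own host and database.
client = MongoClient("mongodb://localhost:27017")
events = client["fastdata_demo"]["events"]

# Documents in the same collection can carry different fields,
# so sensor readings and clickstream events can land side by side.
events.insert_one({
    "type": "sensor_reading",
    "device_id": "pump-17",
    "temperature_c": 71.4,
    "ts": datetime.now(timezone.utc),
})
events.insert_one({
    "type": "page_view",
    "user_id": "u-2931",
    "url": "/checkout",
    "ts": datetime.now(timezone.utc),
})

# A simple lookup of the most recent events of one type.
for doc in events.find({"type": "sensor_reading"}).sort("ts", -1).limit(5):
    print(doc)
```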

Employ analytics close to the data.

It may also be helpful to use data analytics that are embedded within database solutions for many basic queries. This enables faster response times, versus routing data and queries through networks and centralized algorithms that can drag on performance and increase wait times.
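The principle can be sketched with an in-process SQLite database (the table, columns, and values are illustrative): the aggregation runs inside the database engine rather than in application code after shipping every row across the network.

```python
import sqlite3

# Illustrative table of sensor readings held in an in-process database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_id TEXT, temperature_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("pump-17", 70.2), ("pump-17", 71.9), ("pump-18", 65.4)],
)

# Push the aggregation down to the database engine instead of pulling
# every row into the application and averaging there.
cursor = conn.execute(
    "SELECT device_id, AVG(temperature_c) FROM readings GROUP BY device_id"
)
for device_id, avg_temp in cursor:
    print(device_id, round(avg_temp, 1))
```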

Examine in-memory options.

The delivery of highly intelligent, interactive experiences requires that back-end systems and applications operate at peak performance. That means moving and delivering data at blazing speeds, recognizing that every millisecond counts in a user interaction. In-memory technologies—which can hold entire datasets in memory—can deliver this speed.
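As a minimal sketch, assuming a Redis server running locally and the redis-py client installed, hot data such as session state can be kept entirely in RAM for very fast reads; the key names and values are illustrative.

```python
import redis  # assumes `pip install redis` and a Redis server on localhost

# Connection parameters, key names, and values are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Keep the hottest data, such as a user's session state, entirely in RAM.
r.hset("session:u-2931", mapping={"cart_items": 3, "last_page": "/checkout"})
r.set("recommendation:u-2931", "sku-8841")

print(r.hgetall("session:u-2931"))
print(r.get("recommendation:u-2931"))
```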

Employ machine learning and other real-time approaches.

Behind just about every analytics-driven interaction is an algorithm that gathers data and performs some type of pattern matching to gauge preferences or predict future outcomes. Machine learning approaches enable these systems to adapt to incoming data streams without time-consuming manual intervention.
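One way this plays out in practice, sketched here with scikit-learn on synthetic data, is incremental learning: the model is updated on each mini-batch as it arrives rather than being retrained by hand.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])

# Simulate a stream arriving in small batches; features and labels are synthetic.
for _ in range(20):
    X_batch = rng.normal(size=(50, 4))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    # partial_fit updates the model on each mini-batch, so it keeps
    # adjusting to the stream without a full retraining cycle.
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.normal(size=(3, 4))))
```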

Look to the cloud.

Today’s cloud services support many of the components required for fast or streaming data—from machine-learning algorithms to in-memory technologies. Most respondents in the OpsClarity survey (68%) cite the use of either public cloud or hybrid deployments as the preferred mechanism for hosting their streaming data pipelines.

Pump up your skills base.

The next-generation approaches required for delivering fast or streaming data and analytics also call for new skills. Data professionals need greater familiarity with new tools and frameworks, such as Apache Spark and Apache Kafka. Organizations must increase training for their current data management staff, as well as seek out these skills in the market.

Look at data lifecycle management.

It’s important to be able to separate the data required for eventual long-term storage from the data that is only valuable in the moment. Otherwise, the amount of data that would need to be stored would be overwhelming—and mostly unnecessary. One way to address potential storage overload is data lifecycle management, in which certain types of data are either eliminated or moved to low-cost storage vehicles, such as tape, after a predetermined amount of time.
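A lifecycle policy of this kind can be sketched as a simple routing rule; the record types and retention windows below are hypothetical and would depend on business and regulatory requirements.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows.
ARCHIVE_AFTER = timedelta(days=30)
DELETE_AFTER = timedelta(days=365)

def route_record(record_type: str, created_at: datetime, now: datetime) -> str:
    """Decide whether a record stays hot, moves to low-cost storage, or is dropped."""
    age = now - created_at
    if record_type == "normal_sensor_reading" and age > ARCHIVE_AFTER:
        return "delete"      # routine readings aren't worth archiving
    if age > DELETE_AFTER:
        return "delete"
    if age > ARCHIVE_AFTER:
        return "archive"     # e.g., object storage or tape
    return "keep_hot"

now = datetime.now(timezone.utc)
print(route_record("order_event", now - timedelta(days=90), now))            # archive
print(route_record("normal_sensor_reading", now - timedelta(days=45), now))  # delete
```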


