The Growing Thirst For Data Lakes

Sonia Johnson · Dec 28 · 7 min read

As data continues to grow and diversify, many organisations are finding that traditional methods of managing information no longer keep up, and data management, storage and analytics have become a higher priority than ever. Alongside the growth in data volume, storage itself has become more complex and diverse. Businesses are challenged by disconnected silos spread across multiple storage sources, which lead to poor visibility, low performance and limited management capabilities. Data lakes can provide significant value here, making it possible for enterprises to adopt new types of analytics such as big data processing and machine learning, and to better manage this data complexity.

What Is A Data Lake?

Aberdeen research has found that the amount of data coming into organisations has increased by 25% every year for the last five years. Given how much data is transforming and challenging businesses, the capabilities a data lake brings are more important than ever. A data lake is essentially a centralised repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and run different types of analytics over it – from dashboards and visualisations to big data processing, real-time analytics and machine learning – to guide better business decisions.

Data Lakes Compared To Data Warehouses

All data repositories share a core function: housing data for business reporting and analysis. But their purpose, the types of data they store, where that data comes from and who has access to it differ. In general, data flows into these repositories from the systems that generate it – CRM, ERP, HR, financial applications and other sources. Business rules are applied to the records those systems create before they are sent on to a data warehouse, data lake or other data storage area. Once all the data from the disparate business applications has been collated onto one platform, it can be used in business analytics tools to identify trends or deliver insights that help drive business decisions.

Smaller organisations may only need a simple SQL data mart or data store to manage their data, while mid-sized to large organisations may, depending on their requirements, need both a data warehouse and a data lake, since the two serve different needs and use cases. The data repository cheat sheet that TechTarget put together is quite useful here.

A data warehouse is a database optimised for analysing relational data coming from transactional systems and line-of-business applications. The data structure and schema are defined in advance to optimise for fast queries, and the results are typically used for operational reporting and analysis. Data is cleaned, enriched and transformed so it can act as the “single source of truth” that users can trust.
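To make the contrast concrete, here is a minimal schema-on-write sketch in Python, using SQLite purely as a stand-in for a warehouse table: the schema is fixed before any data is loaded, and incoming records are cleaned and conformed to it (or rejected) on the way in. The table and column names are illustrative only.

```python
import sqlite3
from datetime import date

# Schema-on-write: the table structure is defined before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  TEXT    NOT NULL,
        amount      REAL    NOT NULL
    )
""")

# Incoming records are cleaned and conformed to that schema before loading;
# anything that does not fit is rejected rather than stored as-is.
raw_records = [
    {"order_id": 1, "customer_id": 42, "order_date": "2023-01-15", "amount": "199.95"},
    {"order_id": 2, "customer_id": 7,  "order_date": "not-a-date", "amount": "49.00"},
]

for rec in raw_records:
    try:
        row = (
            int(rec["order_id"]),
            int(rec["customer_id"]),
            date.fromisoformat(rec["order_date"]).isoformat(),  # enforce ISO dates
            float(rec["amount"]),
        )
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", row)
    except (ValueError, KeyError) as err:
        print(f"Rejected record {rec.get('order_id')}: {err}")

conn.commit()
```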

A data lake is different: it stores relational data from line-of-business applications as well as non-relational data from mobile apps, IoT devices and social media. The structure of the data, or schema, is not defined when the data is captured, which means you can store all of your data without careful up-front design or needing to know which questions you might want answered in the future. Different types of analytics – SQL queries, big data analytics, full-text search, real-time analytics and machine learning – can then be run over the data to uncover insights.
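By contrast, here is a minimal schema-on-read sketch, assuming PySpark and a hypothetical folder of raw JSON event files that have already landed in the lake: the files were stored as-is, and a schema is only inferred at the moment they are queried. The path and field names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: the raw JSON files were stored without any predefined schema;
# Spark infers one only at the moment we read them. The path is illustrative.
events = spark.read.json("/lake/raw/mobile-app/events/2023/*.json")

# Ask a question that was not anticipated when the data was captured.
events.createOrReplaceTempView("events")
daily_active = spark.sql("""
    SELECT event_date, COUNT(DISTINCT user_id) AS active_users
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
daily_active.show()
```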

As businesses with data warehouses see the benefits of data lakes, they are evolving their warehouses and data stores to include data lakes, enabling diverse query capabilities, data science use cases and advanced capabilities for discovering new information models. Gartner calls this evolution the “Data Management Solution for Analytics” (DMSA) and cites Microsoft Azure as one of the Leaders in this space for its cloud data management. In fact, according to Gartner, Microsoft is growing in this space at twice the rate of the overall market.

Data Lakes – What Should Be Considered?

As organisations build data lakes as part of their analytics platforms, here are some important capabilities to consider:

  • Unlimited storage, scalability & fast performance for analytics data
  • Ease of integration & data ingestion with your current architecture
  • Enterprise-grade security, including encryption, network-level security & access control
  • Affordability – cloud-based data lakes are a great solution here
  • Support for your data types & for what you want to do with the data
  • A strategy and process around data management & data governance, as this can be more difficult where data is unstructured
  • The tools and skills that exist within your business – building & maintaining a data lake is not the same as working with databases and requires big data architecture expertise
  • Planning ahead in data lake design – although not strictly necessary, structuring data schemas up front can lead to better data quality

What’s The Benefit Of Cloud-Based Data Lakes?

Cloud-based data lake services are powerful yet simple tools that make collecting, storing, managing and analysing high volumes of big data easier and more efficient. Azure Data Lake (ADL) is Microsoft’s on-demand data repository for big data analytics workloads. As a public cloud service, it provides organisations with data storage and data analytics solutions at instant scale, similar to other Azure tools. The service encompasses two resources – Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA) – which merge affordable storage capacity and powerful analytics into one useful offering. Here are some of the benefits Azure Data Lake can provide businesses:

1. An easy way to store big data for future analysis: Data lakes used to be time-consuming projects, but Azure Data Lake and its broad set of fully managed supporting analytics and storage services (Azure Data Factory, Azure Machine Learning, etc.) remove that barrier. It gives businesses a place to store high volumes of big data, sourced from both on-premises and cloud-based systems, without having to define or transform it first (see the upload sketch after this list). There is no time limit on how long data can be stored in Azure Data Lake, so you can come back to it for exploration and analysis and produce the insights you need at your own pace.

2. Consolidate your big data into one place: ADLS brings together all of your big data from disparate sources across cloud and on-premises environments into one central place. You can monitor and manage all stored data more easily, without having to move back and forth between multiple silos. If you’re looking to reduce the number of places you store your data, or the number of tools you use for data analytics, it’s an ideal solution for data consolidation.

3. A cost-effective data lake: Big data workloads run on Azure Data Lake Analytics are charged on a per-job basis, so you only pay when your data is actually processed, or you can use an on-demand cluster instead. With no hardware and no traditional licensing or support agreements, you essentially pay only for what you need and what you use.

4. High-level security & compliance: Azure Data Lake is backed by enterprise-grade security and makes data safer to manage overall – staff don’t have to manually store or migrate your big data, which reduces risk. Because it’s in the cloud, compliance, governance and logging are also much easier. Finally, it integrates with Azure Active Directory (AAD), so you can provide seamless authentication around all your stored data (see the authentication sketch after this list).

5. Remote access: Cloud-based options like Azure Data Lake naturally make big data more easily available remotely, enabling collaborative analysis and improving overall information accessibility.
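On point 1, here is a minimal sketch of landing a raw file in a data lake, assuming the current Azure Data Lake Storage (Gen2) Python SDK (azure-storage-file-datalake); the connection string, file system and folder names are placeholders, and no schema or transformation is applied before storage.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Connect to the storage account (placeholder connection string).
service = DataLakeServiceClient.from_connection_string("<your-storage-connection-string>")

# A file system in ADLS Gen2 is the container that holds your lake folders.
file_system = service.get_file_system_client(file_system="raw")

# Land the file exactly as it arrived – no schema, cleansing or transformation yet.
file_client = file_system.get_file_client("iot/2023/12/28/sensor-batch-001.json")
with open("sensor-batch-001.json", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```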
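And on point 4, a sketch of the Azure Active Directory integration, assuming the azure-identity and azure-storage-file-datalake packages: authentication flows through an AAD identity rather than storage keys embedded in code, and access is then governed by the roles and ACLs granted to that identity. The account URL is a placeholder.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# DefaultAzureCredential resolves an AAD identity from the environment
# (managed identity, environment variables or a developer's az login).
credential = DefaultAzureCredential()

service = DataLakeServiceClient(
    account_url="https://<your-account>.dfs.core.windows.net",
    credential=credential,
)

# What this identity can see and do is controlled by AAD roles and ACLs,
# not by secrets stored alongside the application.
for fs in service.list_file_systems():
    print(fs.name)
```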

Whether you require a simple SQL data mart or assistance in building and managing a data warehouse or a data lake, BOOMDATA can help as Microsoft Data Platform & Analytics specialists.