Top 7 challenges of building a data lake

From a technical perspective, deployment, management and provisioning tools make it quick to set up a Hadoop cluster. Introducing it to the organization is the tough part.

Here are the biggest challenges of building a data lake:

1. Understand the purpose and limitations of the technology

The Hadoop environment provides a vendor-independent toolchain for scalable data storage, management and analytics. It scales out to very large data volumes and, with add-ons, makes real-time and data science use cases available to the enterprise.
Many organizations do not have Big Data. They just load their RDBMS content and are surprised that a 20-node enterprise Hadoop setup performs worse than a single PostgreSQL instance. That is to be expected, because the purpose and capabilities are different: we should compare apples with apples, after all.
In 95% of the cases stakeholders think that this new technology will replace their painful proprietary environment of RDBMSs. It is a highly distributed environment, so it is never capable of replacing proprietary data systems. But it can live well beneath them, providing extreme value. The purpose of the data lake is different: ingest data, all of your data, and deal with portions of it without costly ETLs to move it from here to there. Refine your data system, be it processing, reporting or prediction models, by slowly adding more datasets.

2. Security

In today’s world of information security and huge data breaches, setting up a security framework is the most important step when introducing a data lake.
When building a data lake, our goal is to have all data in one place and make it available to various teams. But for years our only goal was to make the data generally unavailable (available only to those with permissions).
Keep in mind that we are dealing with a framework that was built for scientific usage without the enterprise in mind. Planning and deploying the security framework should therefore be a well-defined sub-project of ours.
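
The tension between "all data in one place" and "only available to those with permissions" can be illustrated with a deny-by-default read policy. This is only a minimal sketch: the dataset names, team names and the READ_POLICY table are hypothetical, and in a real cluster this decision would be enforced by the platform's own authorization tooling rather than by application code.

```python
# Hypothetical policy table: dataset name -> teams that are allowed to read it.
READ_POLICY = {
    "web_clickstream": {"marketing", "data-science"},
    "customer_pii": {"compliance"},
}


def can_read(team: str, dataset: str) -> bool:
    """Deny by default: unknown datasets and unlisted teams get no access."""
    return team in READ_POLICY.get(dataset, set())


def readable_datasets(team: str) -> list:
    """List the datasets a team may read, sorted for stable output."""
    return sorted(name for name, teams in READ_POLICY.items() if team in teams)


if __name__ == "__main__":
    print(can_read("compliance", "customer_pii"))
    print(can_read("marketing", "customer_pii"))
    print(readable_datasets("data-science"))
```

The point of the sketch is the default: a dataset that nobody has classified yet is readable by no one, which keeps the lake open to teams only where someone has explicitly decided it should be.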

3. Data governance (confidentiality, integrity and availability)

An RDBMS enforces a schema and many other rules, e.g. you cannot add a string to an integer field. A data lake does not really have such enforcement mechanisms by default: you can put anything, anywhere. What will this lead to? Everybody puts everything, everywhere. You should set up a governance structure.
To avoid chaos, this means policies, procedures, processes, responsible personnel (RACI framework), a technology enforcing those and rich documentation. However, constructing such a framework is a tough task.
So what is the data, who put it there, why is it there, how and when does it change, and who owns it? Construct a data dictionary, know your data and its sources and sinks, and enforce keeping the dictionary up to date.
For further information on data governance challenges, take a look at this whitepaper.
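
The data dictionary and the missing schema enforcement described above can be sketched together in a few lines. This is an illustrative sketch only, not a governance tool: the DatasetEntry fields, the "orders" dataset and all values in it are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetEntry:
    """One data dictionary record: what the data is, who owns it, where it comes from."""
    name: str
    owner: str
    source: str
    description: str
    schema: dict          # column name -> expected Python type
    last_reviewed: date


def validate_record(entry: DatasetEntry, record: dict) -> list:
    """Return a list of schema violations; an empty list means the record is valid."""
    problems = []
    for column, expected in entry.schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            problems.append(
                f"{column}: expected {expected.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    for column in record:
        if column not in entry.schema:
            problems.append(f"unexpected column: {column}")
    return problems


# Hypothetical dictionary entry for an "orders" dataset.
orders = DatasetEntry(
    name="orders",
    owner="sales-data-team",
    source="crm_export",
    description="One row per customer order",
    schema={"order_id": int, "customer": str, "amount": float},
    last_reviewed=date(2024, 1, 15),
)

good = {"order_id": 1, "customer": "ACME", "amount": 99.5}
bad = {"order_id": "one", "customer": "ACME"}

print(validate_record(orders, good))  # []
print(validate_record(orders, bad))   # two violations: wrong type, missing column
```

Running a check like this at ingestion time gives the lake a small part of what an RDBMS does by default, and the entry itself answers the "what, who, where from" questions in one place.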

4. Cost structure of software and support for open-source tools

Commercial enterprise data software had huge license costs and relatively low implementation costs. Everyone understood that when paying for a product, the costs are paid upfront, just as when buying anything in a store.
Now no or minimal license costs are involved, and it is simply hard to admit that something almost free can deliver great value.
However, because the technology is new, supporting the stack and keeping up with its fast evolution require continuous and non-negligible operational investments.
As the tools are vendor-independent, there is a shift towards independent contractors providing great service in this space.
When an organization can avoid vendor lock-in to a big IT provider, it simply should, and it should set up a project structure in which members can be quickly swapped to enable agile integration.
To put it simply, there is a need to continuously pay for vendor-independent personnel and expertise, and not for software licenses or the ever-increasing service costs of specific solution providers.

5. Availability of data experts

It is easy to find RDBMS analysts with established and well-known certificates that assure professional support of the proprietary technology stack.
This is not the case with Big Data. It is not only hard to identify vendors for the various building blocks of the technology, it is also hard to understand what expertise a project needs: is it DevOps, Linux ops, deployment automation, data science, data analysis, programming in MapReduce or in Scala with Spark, etc.? There is no longer a single SQL analyst running our ETLs and reporting, but a team of professionals with expertise in Big Data stack operation, Big Data stack development, data analysis and data science, who should also be aware of our problems and subject matter.
Users and stakeholders need education too, since those responsible know little of this world. Current architects will be challenged to integrate the tools into the enterprise, BI analysts will have a hard time working with Scala notebooks, and current RDBMS operators will have a hard time identifying data assets around the lake. It is a new paradigm and people have to learn. Support from professionals of the domain is essential, especially when integration budgets rise into the millions of dollars.

6. Rigid, on-premise operation environment

Many organizations don’t trust the cloud and start an onsite Big Data cluster. Dealing with network setup, security, sizing, procurement, problems around the data center and the support structure often causes years of delays. The integration may be finished only when the hardware is already at the end of its life cycle…
Cloud enables flexibility. The time requirements and performance concerns of onsite deployments complicate data projects so much that the cloud is the only option to start with and to showcase the value of the technology, winning further organizational support for spreading it. POCs should take as little time as possible, so infrastructure-related questions should be the least of the concerns.

7. Project management

Classic waterfall or ad-hoc project management methods won’t really work for complex data projects, because of the time they need and their inability to manage the unknowns of the data. To avoid killing a quick win, agile proof-of-concepts have to be done first to identify the value and the challenges of the data. Once accepted, they should be extended to the production phases.
When the production phase is delayed due to deployment issues (see point 6), project members get demotivated. We have not yet dealt with actual data, we are just building the environment, and nobody has the desire anymore to learn the new programming paradigms and data management rules.
Motivation is key: engineers from various backgrounds join such projects. The knowledge area to learn is huge and, while Big Data is a hot topic, 90% of the development effort goes into the not so joyous ETL scripting.