How to create a successful data platform 2.0

Sindhu Murugavel
7 min read · Aug 3, 2020

Read this if your data platform 1.0 was not as successful as you thought it would be.


Over the 7 years of my career, I have worked closely on different aspects of building a data platform. It has been called a Data Lake, a Data Swamp, a Data Supply, a Data Factory and so on. As cool as the terms were, whether it actually delivered cool results is the million dollar question. Data is produced abundantly, and there is no one right way to house it and derive value out of it. If there were, every data-driven company would have taken it as their bible and never turned back.

Building your data platform right is an iterative process. A good starting point is to take the lessons learnt from the first attempt so you can do it right this time. That involves technology changes and process changes; more than that, it involves mindset changes. A company that can do all three successfully will be well on its way to building a state-of-the-art data platform.

Here are the 10 things I have learnt so far about building the 2.0 version of a data platform right. If you don't have a data platform yet, this is your lucky day: keep the points below in mind as you articulate the data platform of your dreams.

Missed focus on Data Quality

There is so much excitement about getting value out of the data that the first thing that happens is to blindly ingest it. Ingestion is the right step, but a lot of time is spent building the framework for ingesting and transforming the data rather than focusing on its quality. Without quality, ingested data cannot be used for anything. To be a data-driven organization, you need quality data.

Solution: There are plenty of tools in the market that handle ingestion off the shelf. The value of this approach becomes visible when you test a COTS product against one of the complex processes in your ecosystem and see how much build time it saves. That way, quality time is spent looking at the data, not at the processes required to load it.
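Once the tooling moves the bytes, the effort goes into quality rules instead. Here is a minimal sketch of an ingestion-time quality gate; the rule (reject a batch when a required field is null too often) and the threshold are illustrative assumptions, not a specific product's behavior.

```python
# Hypothetical quality gate applied to a batch before it lands.
def check_quality(rows, required_fields, null_threshold=0.05):
    """Reject a batch if a required field is null/empty too often."""
    if not rows:
        return False, "empty batch"
    for field in required_fields:
        missing = sum(1 for r in rows if r.get(field) in (None, ""))
        if missing / len(rows) > null_threshold:
            return False, f"{field}: {missing}/{len(rows)} null values"
    return True, "ok"

batch = [
    {"id": 1, "amount": 10.5},
    {"id": 2, "amount": None},   # bad record
    {"id": 3, "amount": 7.0},
]
ok, reason = check_quality(batch, required_fields=["id", "amount"])
```

In practice the same idea scales up through data-quality frameworks, but the point stands: the rules are where your team's time should go.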

Too many access constraints

Okay, so the data is now there. Problem solved, right? Today, users have a hard time accessing the data due to environment access issues, the lack of an available catalog, and so on. A hybrid data environment can make this even harder. But do users really care where their data comes from? No! This can be used to our advantage.

Solution: Provide a black-box environment where end users get one central point of access to all the data across the hybrid environment. They need data virtualization that abstracts the underlying platforms and projects the data as if it lived in a single environment. This can be established by plugging in a common query layer that talks to all environments.
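The shape of that common query layer can be sketched in a few lines. This is a toy, in-process stand-in for a real virtualization engine; the dataset names and backend callables are assumptions made up for illustration.

```python
# Toy "black box" query layer: callers name a dataset, the layer
# decides which backend environment actually serves it.
class QueryLayer:
    def __init__(self):
        self._catalog = {}  # dataset name -> backend callable

    def register(self, dataset, backend):
        self._catalog[dataset] = backend

    def query(self, dataset, **filters):
        if dataset not in self._catalog:
            raise KeyError(f"unknown dataset: {dataset}")
        return self._catalog[dataset](**filters)

# Two "environments" hidden behind the same interface:
on_prem = lambda **f: [{"src": "on_prem", **f}]
cloud = lambda **f: [{"src": "cloud", **f}]

layer = QueryLayer()
layer.register("orders", on_prem)
layer.register("clicks", cloud)

rows = layer.query("clicks", day="2020-08-03")
```

The user asked for "clicks" and never had to know it came from the cloud environment; that is the whole trick.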

No clue about what data is present

Now that quality data is loaded, how do its consumers know that it's there? Only when they can get a real feel for the data can they figure out how to use it well. Today, it's really difficult to know what data is available in the platform for use.

Solution: The idea of a data catalog will always come up when building a data platform. But trust me, as a standalone project it will never happen. I suggest marrying the cataloging step to the data-landing step so that both are done at the same time. That way there is assurance that the catalog is ready for use when the data is.
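Coupling the two steps can be as simple as making the landing function write the catalog entry itself, so the catalog can never lag behind the data. The metadata fields below are illustrative assumptions.

```python
# Sketch: the landing step and the catalog entry happen together,
# so neither can exist without the other.
catalog = {}

def land_dataset(name, rows, owner, description):
    """Write the data AND register it in the catalog, atomically in spirit."""
    # ... actual write of `rows` to storage would happen here ...
    catalog[name] = {
        "owner": owner,
        "description": description,
        "row_count": len(rows),
    }
    return rows

land_dataset(
    "customers", [{"id": 1}, {"id": 2}],
    owner="crm-team", description="Customer master records",
)
```

A consumer browsing `catalog` now sees every dataset the moment it lands, which is exactly the assurance described above.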

No Self-Servicing

Once quality data has landed and a data catalog is set up, there will be increasing demand to get more data in. Planning can absorb some of this, but I am pretty positive it will not be enough to deliver data on time. Relying on manual labor to provision data to a downstream application takes a lot of time.

Solution: Building an automated workflow that provisions data in no time is the right way to deliver data quickly. Rather than fixing things tactically by pulling in the most urgently needed data, take a strategic approach that lessens the burden in the long run. This makes data available to the business faster and helps deliver value at market speed.
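What "automated workflow" means here is simply that every request runs the same repeatable steps with no human in the loop. The step names below are hypothetical; a real pipeline would run an orchestrator, but the shape is the same.

```python
# Sketch of a self-service provisioning workflow: every request
# walks the same steps, so fulfillment is code, not manual labor.
def provision(request):
    """Run the standard provisioning steps for one request."""
    steps = ["validate_access", "create_schema", "grant_permissions", "notify"]
    # Each entry records which step ran for which dataset.
    return [f"{step}:{request['dataset']}" for step in steps]

log = provision({"dataset": "orders", "requested_by": "analytics"})
```

Because the steps are codified, adding a new downstream consumer costs minutes, not a ticket queue.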

Not being really Agile

Agile has been a buzzword in the industry for the past couple of years. Don't get me wrong: I really like the idea of iterative development. But do companies adopting Agile methodologies really follow an iterative process for software development? Most of them get it wrong; they follow the Agile ceremonies like Iteration Planning and Retrospectives and forget the core idea of actually developing iteratively.

Solution: For a web or mobile app, it's easy to be feature-oriented and deliver those features in iterations. In the data warehousing world, that can be harder. There is still plenty of opportunity to find iterative ways of working and to deliver small rather than big bang. It just needs a good eye and a good mind. Think!

Hadoop cannot deliver unrealistic numbers performance-wise

Hadoop has been widely adopted by the industry for its low cost, large storage and parallel processing. The reason it evolved has been lost in the mix at this point. Its original purpose was to crunch large quantities of data in a distributed fashion, typically for batch processing. The term batch processing is, again, open to interpretation. Ideally, it describes processing where you do NOT expect predictable performance: all those long-running reporting jobs or nightly jobs that swallow a lot of data purely for analytics.

Solution: The right way is to complement Hadoop with other tools like RDBMS, NoSQL and in-memory databases, each delivering predictable results for the type of data it suits. Building a hybrid environment that matches the right platform to each need is the key. If you want predictable query performance, choose an RDBMS; if you want schema flexibility, choose a NoSQL store; and so on.
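That rule of thumb can be written down as an explicit routing decision, so it becomes a team convention instead of folklore. The categories and store names below just restate the guidance above; they are not a product recommendation.

```python
# Encode the workload-to-store rule of thumb as a single decision function.
def pick_store(needs_predictable_latency, needs_flexible_schema, batch_analytics):
    """Route a workload to the platform that fits it."""
    if needs_predictable_latency:
        return "rdbms"        # predictable query performance
    if needs_flexible_schema:
        return "nosql"        # schema flexibility
    if batch_analytics:
        return "hadoop"       # large-scale, latency-tolerant crunching
    return "in_memory"        # low-latency serving as a fallback

choice = pick_store(
    needs_predictable_latency=False,
    needs_flexible_schema=False,
    batch_analytics=True,
)
```

The point is not the specific engines but that the choice is made deliberately, per workload, rather than defaulting everything to Hadoop.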

Non-resolved Technical Debt

Technology grows by the day. There is no need to catch up with it every day, but there needs to be a plan (and execution) to catch up with it sometime soon. Being monolithic in building and using data processes and tools, and leaving technical debt unaccounted for, can be a huge setback. The data lake will eventually grow too large to take on anything new.

Solution: Evaluate and USE new technology. Always be on the lookout for industry standards, and incorporate them without incurring too much technical debt by allocating ample time and planning for tech catch-up. Integrate innovative ideas that worked for other internal teams into your end product.

Adios, Batch

The initial drift toward migrating the good old batch processes has not faded yet. That's fine, as long as there is constant focus on real-time components as well. Most migrations keep the existing process as-is, only on newer tech: for example, migrating an overnight mainframe job to an overnight Hadoop job. Nobody takes the time to ask, "Does this job really need to run overnight? Can we deliver this data sooner? Would that add more value to the business?"

Solution: Real-time data can be used for batch, but batched data cannot be used in real time. Hence, the focus should be on the real-time side of the data factory, because it can serve both real-time and batch consumers. A publish-subscribe pattern needs to be established. This enables data to be provided as a service rather than a nightly job.
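The asymmetry is easy to see with a minimal in-process publish-subscribe sketch: one subscriber reacts per event (real-time), while another simply accumulates the same stream for a later "batch" run. The broker here is a toy stand-in for a real messaging system, and the topic and handlers are assumptions.

```python
# Minimal pub/sub: one stream of events serves both a real-time
# consumer and a batch-style accumulator.
class Broker:
    def __init__(self):
        self._subs = {}  # topic -> list of handlers

    def subscribe(self, topic, handler):
        self._subs.setdefault(topic, []).append(handler)

    def publish(self, topic, event):
        for handler in self._subs.get(topic, []):
            handler(event)

broker = Broker()
realtime_alerts, nightly_batch = [], []

# Real-time consumer: reacts immediately to large payments.
broker.subscribe(
    "payments",
    lambda e: realtime_alerts.append(e) if e["amount"] > 100 else None,
)
# Batch consumer: just collects everything for the nightly job.
broker.subscribe("payments", nightly_batch.append)

for amount in (50, 250, 80):
    broker.publish("payments", {"amount": amount})
```

The batch job still gets its full feed, but it is now a byproduct of the real-time stream rather than a separate pipeline.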

Not serving high priority customers

Creating data for data scientists who might use it in a year or so is cool, but what about the customers who need it right now? Provisioning data for users who might or might not use it is a stupid idea. There are plenty of customers who need data from the platform right away for their immediate business needs.

Solution: Plan on a first-come, first-served basis for users who need the data right now, and slot future users in based on their expected timelines. This serves the customers who need data today without forgetting the users of the future.

Change — There, I said it!

Change is difficult. Nobody wants to change anything because they are scared it may cause them to fail. This is THE mindset change I talked about earlier. What they do not understand is that they were never succeeding in the first place, so this might be a welcome change that actually works.

Solution: Create an environment where "fail" is not a frowned-upon word. It is better to fail fast and learn a million wrong ways not to do something than to try to build something perfect the first time. I don't want to state the obvious "Edison and the light bulb" example, but you get the idea.

By following these 10 points, you will be well on your way to building a much more successful data platform for the future.