Many IT departments choose to build their own customised data platform to support their digital transformation, usually under pressure from the business lines (retail banking, payment…). Most of these companies are tempted to build everything themselves, including the data lake, schemas, formats, historisation, scalability and governance processes. They try to rethink the way data should be acquired, stored, historised, distributed, analysed, and governed within their companies. While this is definitely possible, most companies underestimate the complexity of the technology underlying a modern data platform because they downplay the following implementation challenges:

Underestimating these challenges leads to the following results:

Some numbers illustrate the costs of these results. We worked for a banking group that, under pressure from its business lines (retail banking, payment), was rethinking the way data should be acquired, stored, historised, distributed, analysed, and governed. The group launched a data hub program in which it defined (1) a target data architecture and (2) two initial use cases: clickstream analytics for dynamically targeted web banners, and a customer 360° view for employee applications in the branches.

Like many large banks, they thought they could cover the whole path from design to implementation themselves. They set up a team of 20 developers for a one-year project.

After one year of work, three open-source modules had been developed, but unfortunately they were not good enough to be put into production. The parent company spent almost €4.5 million on this project:

Because of the lack of results and the high costs, the data hub development program was abandoned. Each subsidiary had to develop its own components with a minimalist approach. The main reason for this failure is the complexity of the underlying big data technology (Hadoop ecosystem, HBase, Hive, Spark, Cassandra, Flink, etc.).

What can we learn from it?

To avoid implementation failures, consider the following advice:
