Our product manager, Luc Berrewaerts, unpacks what data engineering is and how complex it can be, and shares his experience with new technologies for navigating the minefield of data engineering.
The data science paradigm, also known as the fourth paradigm or even the fourth industrial revolution, has transformed the fabric of our lives and every aspect of how businesses run. It is hard to think of an industry that has not been revolutionised by the superpowers of data science.
People turn to Alexa, Amazon's voice AI, to dim the lights in their living room, ask Siri for immediate answers to their questions and depend on their online banks for their everyday financial needs.
You can think of data science as a luxurious bathroom with a tropical rainfall shower. Yet the showering experience is meaningless if there is no running water. The same applies here: while we enjoy the benefits of data science, other actors are working behind the scenes. These actors are in charge of bringing the water to the bathroom or, in other words, of building the data infrastructure that brings data to the projects that need it. Their work enables data scientists to develop and optimise the artificial intelligence and machine learning powered technologies that unquestionably make our lives easier.
Who are these actors? Data engineers.
What is data engineering and how does it add value?
The key to understanding what data engineering means lies in the word “engineering”.
Engineers design and craft. Data engineers design and craft pipelines that allow data to flow out of the source systems while transforming it into a highly available and usable format by the time it reaches the data scientists or any other end consumers.
In addition, data engineering ensures that these data pipelines have consistent inputs and outputs, which guards against the famous GIGO problem in computer science, "Garbage In, Garbage Out": the idea that nonsense input data inevitably produces nonsense output. This shows how critical and essential data engineering is to the data process.
All in all, data engineering adds value through its ability to automate and optimise complex processes, transforming raw data into usable and accessible assets.
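As an illustration, the pipeline idea can be reduced to an extract-transform-load sketch. Everything here (the source export, the field names) is a hypothetical stand-in for real source systems, and the "warehouse" is just a list:

```python
import csv
import io

# Hypothetical raw export from a source system (all names are invented).
RAW_EXPORT = "customer_id,amount,currency\n42,19.99,eur\n43,5.00,EUR \n"

def extract(raw_text):
    """Pull raw rows out of the source system's CSV export."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Normalise the raw rows into a consistent, usable format."""
    return [{
        "customer_id": int(row["customer_id"]),
        "amount": float(row["amount"]),
        "currency": row["currency"].strip().upper(),
    } for row in rows]

def load(rows, target):
    """Deliver the cleaned rows to the end consumer (here, a plain list)."""
    target.extend(rows)
    return target

warehouse = load(transform(extract(RAW_EXPORT)), [])
```

The transform step is where the GIGO concern lives: inconsistent source values (mixed-case currencies, stray whitespace) are normalised before anything downstream sees them.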
How complex is data engineering?
In one phrase, data engineering is complex. It focuses on the creation of processes and interfaces that ensure seamless access to and flow of data. This spans several data practices, from data warehousing, data modelling and data mining to data crunching. All of these require mastering, operating and monitoring a highly complex data engineering toolkit composed of workflow management platforms, data warehouses, data processing engines and even streaming tools.
Another factor contributing to the complexity of data engineering is data silos. Also known as an information silo, a data silo is a repository of data that is not accessible to the entire organisation, but only to part of it. These silos reveal a situation where one or more information systems or subsystems that are conceptually connected are incapable of operating with one another.
By the same token, data engineering across silos becomes more and more complex, as it involves bringing together systems that use different technologies, with different data structures (relational databases, document databases, CSV and XML files, streams, etc.), and that are stored in different locations (several clouds, on-premises systems, systems in different subsidiaries and different countries).
A cascade of additional challenges comes with real-time data engineering. A good example is real-time payments: 24/7 electronic payments that are cleared by banks within seconds. With such payments, the payer is immediately notified while the payee's bank account is credited. In data terms, the data should be accessible to the end user as quickly as it is gathered, meaning there is almost no delay between the moment it is created at the source and the moment it is accessed. Access to the data should be instantaneous.
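To make the constraint concrete, here is a toy streaming consumer. An in-memory queue stands in for a real message broker, and a plain list stands in for the serving layer; the point is that each event becomes visible the moment it arrives, with no batch step in between:

```python
import queue
import threading

events = queue.Queue()   # stands in for a real message broker
visible_to_user = []     # stands in for the serving layer

def consume():
    """Process each payment event the moment it arrives."""
    while True:
        event = events.get()
        if event is None:        # sentinel: the stream is closed
            break
        # No batching, no nightly job: the event is queryable
        # immediately after it is produced at the source.
        visible_to_user.append({"payee": event["payee"],
                                "credited": event["amount"]})

worker = threading.Thread(target=consume)
worker.start()
events.put({"payee": "acct-001", "amount": 120.0})
events.put(None)
worker.join()
```

A production system replaces the queue with distributed infrastructure (brokers, stream processors, state stores), which is exactly where the complexity discussed here comes from.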
Implementing a true real-time architecture assumes mastery of all the tasks listed above, along with cutting-edge distributed technologies that are not only difficult to grasp but also evolving incredibly fast.
Ultimately, scaling data engineering ranks amongst the most complex tasks. Data architecture, code maintenance, data consistency, performance and governance are enormous challenges that most organisations, as data-driven as they might be, are failing to address.
One solution to these pressing challenges could be data virtualisation.
In essence, data virtualisation is an approach to data management that aims to produce quick and timely insights from various sources without embarking on a big data project. Unlike the classical process, the data remains in place and real-time access is given to the source systems.
Acting as a logical layer, data virtualisation integrates enterprise-wide siloed data, unifies and governs access to it, and distributes it to end users in real time.
One of data virtualisation’s greatest business benefits is its ability to deliver insights quickly. The virtualisation technology maps the data from disparate sources and builds a virtualised layer which can be seamlessly integrated with consumer applications. This approach is more time-efficient as the data does not need to be moved from the source systems.
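The mapping idea can be sketched in a few lines. Two silos (a document-style store and some relational-style rows, both invented for this example) stay where they are, and a logical layer joins them at query time:

```python
# Two siloed sources, left in place (shapes and names are hypothetical).
crm = {"42": {"name": "Alice"}}          # document-style store
billing = [("42", 1999), ("42", 500)]    # relational-style rows (cents)

def customer_view(customer_id):
    """Logical layer: joins both silos at query time, moving no data."""
    name = crm[customer_id]["name"]
    total_cents = sum(amount for cid, amount in billing
                      if cid == customer_id)
    return {"customer": name, "total_billed_cents": total_cents}
```

Nothing is copied or staged: each call reaches into the source structures directly, which is what makes the approach fast to set up, and also what makes it sensitive to source-system load at scale.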
Lower costs for infrastructure, faster time-to-insight, enhanced real-time capabilities, these are a few benefits of using data virtualisation that can be game-changing for many businesses.
In a way, data virtualisation seems to solve the challenges posed by data engineering, but unfortunately, experience shows otherwise. While it provides incredible results for small-scope projects, the approach quickly shows its technical limitations when you want to deploy it at full scale.
As helpful as virtualisation can be, it does have various flaws, and scalability is one of them. In fact, scaling on virtualisation is a tedious and time-consuming task, as it involves ensuring the availability of all the requisite software, storage and resources, generally provided by third-party vendors. Not to mention the incremental costs of the increased resources.
Another issue that arises is performance. In virtualisation, resources are shared: the resources that were virtually available to a single user are now shared among a greater number of users. As tasks grow more complex, so does the need for performance. Yet the sharing of resources slows the process down, resulting in substantially more time needed to complete a given task. Hence, performance standards are not met.
Now, what if I told you that there is a way to overcome the scalability and performance issues posed by virtualisation while still benefiting from its advantages?
How? With “streaming virtualisation”.
Virtualisation is by essence real-time, and if you want to achieve the same functionality as data virtualisation, you need to do real-time data engineering which, as we discussed, is extremely complex technically.
So is there a way forward? Could we do things by the book on a real-time foundation without exposing technical complexity to the user?
The answer might lie in yet another old concept: no-code/low-code engineering.
Why no-code/low-code engineering?
Simply put, no-code/low-code engineering enables people with little to no coding knowledge to create applications by automating every aspect of the application lifecycle using simple, non-technical tools. This not only streamlines solution delivery for non-developers but also relieves developers of the burden of writing code line by line. A great example is no-code/low-code web development, which enables programmers and non-programmers alike to design and build custom websites without learning to write code.
No-code/low-code does not mean that there is no code. It actually means that end users don't need to code, because something else (your no-code/low-code solution) takes care of the code, and that end users only need to think about the business logic.
In the context of data engineering, the no-code/low-code approach enables users to perform all sorts of data manipulations, such as data ingestion, transformation or even machine learning, with little to no coding. How? By automating what can be automated in data manipulation processes. Data engineers and data scientists can stop engaging in repetitive tasks and instead focus on high value-added activities.
Now the real question is: what are the data engineering activities that normally require heavy coding and that could benefit from a no-code/low-code approach?
- Implement and integrate components accessing data from many types of data sources
- Write all code needed to run real-time data pipelines
- Organise and govern data pipelines
- Set up the operation and monitoring of the solution
With a no-code/low-code solution, all this code is maintained by the solution and not by you.
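As a toy illustration of this split, the user supplies only a declarative description of the pipeline (the business logic), while a generic engine, the code the solution maintains, interprets it. Every name here is hypothetical:

```python
# Hypothetical declarative pipeline: this is all the user writes.
PIPELINE = {
    "source": [{"name": "alice", "spend": "120"},
               {"name": "bob", "spend": "80"}],
    "steps": [
        {"op": "cast", "field": "spend", "to": int},
        {"op": "filter", "field": "spend", "min": 100},
    ],
}

def run(pipeline):
    """Generic engine: the code the no-code solution maintains for you."""
    rows = list(pipeline["source"])
    for step in pipeline["steps"]:
        if step["op"] == "cast":
            for row in rows:
                row[step["field"]] = step["to"](row[step["field"]])
        elif step["op"] == "filter":
            rows = [r for r in rows if r[step["field"]] >= step["min"]]
    return rows

result = run(PIPELINE)   # keeps only the "alice" row
```

When the engine gains a new connector or operator, every pipeline description benefits without any user code changing, which is the essence of the maintenance argument above.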
As a consequence, no-code/low-code data engineering allows you to:
- Empower end users (non-technical, citizen data engineers)
- Benefit from the latest technologies without investing in skills development
- Lower total cost of ownership (TCO) and time-to-market
It is undeniable that smart data engineering amplifies business value and gives a competitive edge. Yet we should not overlook the fact that, as the data industry evolves with the boom in new technologies, so do data engineering challenges. In this new data landscape, data engineers must work and act on more data than ever before. This information overload puts enormous pressure on even the most advanced machines, which struggle to process the wealth of data running through them.
In addition, new data science applications, real-time processing and a number of other IT-driven innovations are straining the capabilities of data engineers even further.
Fortunately, low-code/no-code data engineering is taking some of the burden off data engineers' shoulders by automating data tasks and pipelines, making room for value-added activities. When performed smartly, this new generation of data engineering can help break down your data silos and solve your data engineering problems.