Modern Data Architecture: What Is It?
Every modern organization that depends on data for its decision-making is rethinking its data strategy. Compared to a few years ago, organizations now have an abundance of data and access to new technologies and tools that promise to transform how they serve their customers and how they compete.
Rather than reacting to events, organizations driven by a modern data architecture anticipate business requirements and optimize for better business outcomes. Companies that fail to act proactively and adopt a modern data architecture lose customers and market share, and are gradually driven out of business by the competition.
In this blog, we look at ETL-based architecture, which has been prevalent in data products for a couple of decades, the challenges involved with ETL, and why modern data architecture is evolving now.
What is data architecture?
Data architecture outlines the process of managing data, from collection through transformation, distribution, and consumption. It is the foundation of data processing and AI, and it determines how data flows through storage systems.
Data architects, data scientists, and engineers create designs based on business requirements to generate a data model and its underlying data structures. These designs help support initiatives like reporting or data science.
Which modern data architecture should you use?
The right architecture depends on your business needs.
- Need a quick-and-dirty MVP? ETL is your choice.
- Building a data operation that's going to last, with mostly standard questions? Go with ELT.
- Data mesh works best for teams who are working together but operate at varying speeds and with various needs.
Avoid lock-in that makes it difficult to switch your data architecture later. Choose an architecture that won't restrict you if your business needs change.
ETL-Based Data Architecture
From the above picture, we can see how data typically flows in an organization (top row). ETL in itself is just a means to an end: we do ETL to build dashboards and reports and to use data in different ways across the organization.
Let's have a look at the typical journey of data in an organization. Considering the above image, data may be generated from multiple sources such as transactional systems, files, or even unstructured formats. Next, the data goes through the ETL process, is loaded into data warehouses or data marts, and finally reaches a reporting layer that gives a visual representation of the data. The ultimate goal is to derive insights from the data.
Now coming to the bottom row of the above image, we can see the different teams that work on the different layers of the data architecture. For example, transactional data stored in MySQL or another transactional database is owned by application engineers. The ETLs are typically developed by data engineers, data analysts, or pipeline developers; their job consists of extracting data from the sources, defining transformations, cleaning the data, and creating fact tables, among other tasks. The analytics team maintains the warehouses or data marts, although in some organizations infra or DevOps engineers maintain them instead. On the reporting side, analytics teams work on the data loaded into the warehouse to build reports, generally using a BI engine. These reports are then consumed by business teams, such as business operations, product management, or sales. They are the data consumers in the organization.
Therefore, data in an organization flows from left to right, while requirements flow from right to left, and the data passes through multiple systems and multiple teams along the way. In short, requirements and data travel in opposite directions.
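To make the top row of that picture concrete, here is a minimal sketch of a single traditional ETL step in Python. The connection strings, table names, and the daily-revenue summary are hypothetical; it assumes a transactional MySQL source and a warehouse reachable through SQLAlchemy.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection strings -- replace with your own source and warehouse.
source = create_engine("mysql+pymysql://app_user:***@orders-db/prod")
warehouse = create_engine("postgresql://etl_user:***@warehouse-host/analytics")

# Extract: pull yesterday's orders from the transactional database.
orders = pd.read_sql(
    "SELECT order_id, customer_id, amount, created_at "
    "FROM orders WHERE created_at >= CURDATE() - INTERVAL 1 DAY",
    source,
    parse_dates=["created_at"],
)

# Transform: clean the data and summarise it to the fact-table grain.
orders["amount"] = orders["amount"].fillna(0)
daily_revenue = (
    orders.assign(order_date=orders["created_at"].dt.date)
          .groupby("order_date", as_index=False)["amount"]
          .sum()
          .rename(columns={"amount": "revenue"})
)

# Load: append the summary into the warehouse / data mart for the BI layer.
daily_revenue.to_sql("fact_daily_revenue", warehouse, if_exists="append", index=False)
```

Every new question from the business that this summary table cannot answer means another pipeline like this one, which is exactly the coordination cost discussed later in this post.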
Evolution of Data Architecture
Now let's see how modern data architectures have evolved. Data warehousing started in the 1980s, and some large companies still follow this model. We have operational databases, with ETL tools like Informatica in the middle providing pure ETL infrastructure that pulls data from multiple sources, applies transformations, and pushes the data into a master data management warehouse or data marts. Earlier, organizations built individual data marts for individual teams: for example, separate data marts for marketing, human resources, sales, and so on. BI tools would then talk to these data marts to build reports. This has been the most traditional ETL architecture.
From around 2010, as big data and the consumer internet became more mainstream, organizations started generating a lot of semi-structured and unstructured data. Previously, data volumes were modest and the data was mostly structured. With the emergence of big data, around 2010 and 2011, the concept of a data lake came into existence. Data lakes helped organizations store large volumes of data at comparatively low cost. For example, in the cloud world, we can think of AWS S3 as a data lake, where we started dumping all kinds of data (structured, semi-structured, or unstructured) at low cost.
Along with this data lake, we started using technologies like Hadoop and Spark, among others, to process and ETL this huge volume of data and ultimately push it into the data marts.
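As a rough illustration of this stage, the sketch below uses PySpark to read raw event files dumped into the lake and write back a pruned daily summary. The S3 paths and column names are assumptions made for the example, not a prescribed layout.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("event-etl").getOrCreate()

# Read raw, semi-structured events dumped into the lake (hypothetical S3 path).
events = spark.read.json("s3://company-data-lake/raw/events/2024/")

# Transform: keep valid events and roll them up to a daily summary.
daily_summary = (
    events.filter(F.col("event_type").isNotNull())
          .withColumn("event_date", F.to_date("event_timestamp"))
          .groupBy("event_date", "event_type")
          .agg(F.count("*").alias("event_count"))
)

# Load: write the pruned summary where the data mart / BI layer can pick it up.
daily_summary.write.mode("overwrite").parquet("s3://company-data-lake/marts/daily_events/")
```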
Comparing these two models, not much changed in the infrastructure: the volume of data increased and we introduced a data lake in the middle to store it at lower cost. Alongside this infrastructure, the ETL process pruned the data and moved it into the data marts. Time-series databases were also introduced around this time, capable of storing huge volumes of data at lower cost.
Over the last 5 to 10 years we have seen a lot of progress in data science and machine learning. With large volumes of data landing in these data lakes and the emergence of data-prep pipelines, we started using the prepared data for machine learning through Python, PySpark, and Spark.
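As a small, hedged example of that last step: assuming the prep pipeline has already written a customer-churn table to the lake as Parquet (a hypothetical path and schema), a data scientist could pick it up and train a model directly.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical prepared dataset written by the data-prep pipeline
# (reading directly from S3 requires the s3fs package).
df = pd.read_parquet("s3://company-data-lake/prepared/customer_churn.parquet")

X = df.drop(columns=["churned"])   # feature columns produced by the prep pipeline
y = df["churned"]                  # assumed binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("hold-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```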
In short, ETL has not gone away completely, but the architecture has acquired a lot of moving parts.
Challenges of ETL Data Architecture
Lack of Communication between Teams and Systems
As noted earlier, data flows from left to right whereas requirements flow from right to left. This requires a lot of coordination between teams and systems. For example, the end goal of the data is to provide insights to business teams, but they don't have visibility into all of the data; they only see the layer exposed to them, that is, the BI layer. By the time the data has moved from left to right, a lot of decisions have already been made along the way.
For example, if a business user wants a particular piece of information, they have to reach out to the analytics team, who in turn reach out to the pipeline developers, who reprocess the raw data and build a summary table, which is then passed back to the business team through the analytics team. This involves a lot of communication and coordination among teams and systems. There are also operational complexities arising from the data moving through multiple systems, each with different capabilities and limitations. This leads to a lot of inefficiency, and any new insight demanded by the business team may take months to deliver.
Therefore, there is a scale mismatch across the data journey: on the left-hand side we have the raw data, which gets pruned at every stage until we arrive at the summarised version delivered by the BI dashboard.
Another problem is knowledge fragmented across teams, which leads to long turnaround times. For example, pipeline developers may not know why certain KPIs are required or how to optimize the core systems for them.
All Parties Try to Achieve Local Optima
Since multiple teams work with different tools, each tries to reach its own local optimum. They don't have a holistic picture and tend to optimize for their own pipeline only. For example, the analytics team works to optimize only the data warehouse; by adding or eliminating certain data they may optimize at their level, but at a holistic level this might not help. So the optimization done at each level does not necessarily translate into the overall impact data can create for the organization. Aligning these efforts requires a waterfall approach and takes months to implement.
The challenge with this model is that businesses today change much faster than they did 15 or 20 years ago. It was designed for large companies operating in a predefined, stable environment over long periods. For example, a company making cars can define its KPIs over 6 to 12 months and implement the complete ETL architecture over the following months; those KPIs will not change in the next month or two. The waterfall approach may work for them because their business model does not change rapidly.
Consumer internet companies, on the other hand, cannot be sure what KPIs they will need 2-3 months down the line. Because the business changes so fast, the waterfall approach may not work for them.
Modern Data Architecture
Under modern data architecture, with the advancement of cloud technology, we are collapsing certain siloed layers. For example, the ETL, data mart, and reporting layers can be collapsed into a single cloud data lake layer.
This modern data architecture is relatively new: it was only around 2019 that people started writing about it and organizations started moving toward it. In this architecture, structured, semi-structured, and unstructured data all land in the cloud data lake. These data lakes now provide scalable data storage, in contrast to our earlier assumption that the data lake was just storage where we dumped all the data and then processed it elsewhere. Data lakes have become so powerful, scalable, and cost-effective that we can process the data where it is stored. For example, when we do ETL we prune the raw data sets into summary tables; the raw data does not live in the data mart, it lies somewhere else, such as in S3. If someone needs to do ad hoc analysis on the raw data, that is not possible with ETL, because the raw data is not exposed in the data warehouse: it sits in S3 or some other storage and requires a different set of tools to access and analyze.
With the emergence of this modern architecture, raw data and transformed data live in one place. Instead of data flowing through different tools, we can think of it as different stages of the same data: all the data is in one place and all the processing happens in one place. For instance, the integrated data, the first-level summary, and the second-level summary all sit together, and analysis can be done at any level.
Say you are operating at a large scale: you have transactions on your mobile application, you are also tracking user activity in the app, and you are generating 10 million events per day. The data you have covers the past 3 years, which adds up to billions of events. At the first layer, you can roll those events up into KPIs on a daily, monthly, and yearly basis, and 60-70% of use cases are handled by the daily rollups. Now a new product launch requires a metric that is not available in the daily rollup. In the old ETL setup, you would need a lot of data movement to make that work. With this latest architecture, you can build the required metric without much effort, because all the data lies in one place and you can analyze it with a single tool, querying it at a more granular level. All of this happens in a much more cost-effective way, at interactive query speed. Earlier, we did not have interactive querying with ETL platforms like Hadoop; with the emergence of the cloud data lake, we do. For example, on Google Cloud we have BigQuery, and on AWS we have Redshift and Athena, which can be coupled with S3.
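A hedged sketch of what that looks like with BigQuery as the engine: the dataset, table, and column names below are made up for illustration, but the point is that the routine daily rollup and the ad hoc granular query for the new metric run against the same copy of the data, with the same tool.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The usual case: 60-70% of questions are answered from the daily rollup.
daily_rollup = """
    SELECT event_date, event_type, SUM(event_count) AS events
    FROM analytics.daily_event_rollup            -- hypothetical summary table
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    GROUP BY event_date, event_type
"""

# A new product launch needs a metric the rollup does not carry,
# so we query the raw events directly -- no new pipeline required.
launch_metric = """
    SELECT user_id, COUNT(*) AS sessions_with_new_feature
    FROM analytics.raw_events                    -- hypothetical raw event table
    WHERE event_name = 'new_feature_opened'
      AND event_timestamp >= TIMESTAMP('2024-01-01')
    GROUP BY user_id
"""

rollup_df = client.query(daily_rollup).to_dataframe()
metric_df = client.query(launch_metric).to_dataframe()
```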
Under this architecture, instead of fragmenting the data, we keep it in one place with unified data access, organized at different levels of aggregation and summarisation, and finally consume it through different tools.
Characteristics of Modern Data Architecture
Self-Serve
The main advantage of this architecture is self-serve. Teams across the organization work on the same data, sharing a single copy of the truth, and there is high agility in asking new questions. All the data is accessible to all departments, unlike the ETL world, where teams are isolated and don't know what data other teams hold.
It's much easier to implement data governance across the organization. We have unified definitions and models across all data sets.
Business users can analyze the data on their own via a point-and-click visual interface.
Analysts can focus on advanced analytics and multi-dimensional data modeling, rather than spending time on tactical reports requested by business teams.
Speed and Scalability
Modern data architectures are easily scalable because they are hosted on cloud platforms and designed for large volumes of data. The best part is that these platforms are just as efficient when data volumes are small, so organizations can start small and grow into the architecture, from a few hundred thousand records to billions of records and petabytes of data.
Sprinkle enables ingestion into and enrichment within cloud data lakes. It has connectors to hundreds of data sources, from which data can be pulled into the cloud data lake and enriched in one place, without fragmenting it across many systems, giving end users full visibility of the data.
The turnaround time for building a new KPI or report under this architecture is much lower: a KPI that took 3 weeks to develop in the ETL architecture can be built in a couple of hours.
Cost Effective
Most of these platforms have pay-as-you-scale models. They have lower data management overhead because they do not require ETL and data movement, and the total cost of ownership of data assets is lower compared to the ETL architecture.
This architecture ensures high data quality as no data duplication happens across multiple systems.
Moving to Modern Data Architecture with Sprinkle
- Sprinkle automates the complete data flow. We have connectors to multiple data sources like MySQL, Kafka, and Kinesis, among others.
- We have connectors to most of the cloud data lakes in the market, including Snowflake, Redshift, Athena, and BigQuery, among others.
- The Sprinkle platform includes integrated dashboards and analytics, along with integrated Jupyter Notebooks for building machine learning models.
- Sprinkle enables data democratization across the organization with a quick turnaround time.
Frequently Asked Questions (FAQs) - Modern Data Architecture
What is the modern data architecture?
Modern data architecture refers to the design and structure of data systems that are capable of handling large volumes of data, processing it in real-time, and integrating various data sources. It involves the use of technologies such as cloud computing, big data platforms, and advanced analytics tools to manage and analyze data efficiently.
What is modern data streaming architecture?
Modern data streaming architecture is a framework that enables real-time processing and analysis of data streams. It allows organizations to capture and analyze data as it is generated, providing immediate insights and enabling faster decision-making. Technologies like Apache Kafka, Apache Flink, and Spark Streaming are commonly used in modern data streaming architectures.
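As a minimal illustration (not a full streaming architecture), the sketch below consumes events from a hypothetical Kafka topic in Python and reacts to each one as it arrives; the topic name, broker address, and event fields are assumptions.

```python
import json
from kafka import KafkaConsumer  # kafka-python package

# Hypothetical topic and broker address -- adjust for your cluster.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Process each event as it is generated, instead of waiting for a nightly batch.
for message in consumer:
    event = message.value
    if event.get("event_type") == "purchase":
        print(f"purchase of {event.get('amount')} at {event.get('timestamp')}")
```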
What are the advantages of modern data architecture?
The advantages of modern data architecture include improved scalability, flexibility, and agility in managing and analyzing large volumes of data. It allows organizations to gain valuable insights from their data faster, make informed decisions quickly, and adapt to changing business needs more effectively. Modern data architecture also enables better integration of disparate data sources and seamless collaboration across teams.
What are modern data technologies?
Modern data technologies encompass a wide range of tools and platforms used for storing, processing, analyzing, and visualizing data. Some examples include cloud-based storage solutions like Amazon S3 or Google Cloud Storage, big data platforms like Hadoop or Spark, advanced analytics tools like Tableau or Power BI, and machine learning frameworks like TensorFlow or PyTorch.
What do you mean by data architecture?
Data architecture refers to the design principles, standards, models, and policies that govern how an organization collects, stores, processes, and manages its data assets. It encompasses the overall structure of databases, data warehouses, and other storage systems that support the organization's information needs.
What is a modern data analytic tool?
A modern data analytic tool is a software application or platform that enables users to extract insights from large datasets through visualization, reporting, querying, or predictive analytics. Examples include Tableau for interactive visualizations, Google Analytics for web traffic analysis, or Apache Zeppelin for collaborative notebook-style analytics.
What is an example of a modern data stack?
An example of a modern data stack could include technologies like Apache Kafka for real-time streaming ingestion, Apache Spark for distributed processing, Elasticsearch for search functionality, and Tableau for reporting and visualization purposes.
How many types of data technology are there?
Broadly, data technologies are often grouped into two types: traditional, relational (SQL) technologies and modern, non-relational (NoSQL) technologies.
What are the future technologies for data?
Future technologies for data are likely to focus on advancements in artificial intelligence and machine learning algorithms for predictive analytics; further developments in edge computing to process IoT-generated data closer to the source; increased adoption of blockchain technology for secure transactions; improvements in natural language processing techniques; enhanced cybersecurity measures to protect sensitive information; and innovations in quantum computing for complex calculations.
What are modern data pipelines?
A modern data pipeline is a sequence of connected components that automates the flow and delivery of data between the stages of a data-processing workflow. It typically includes tasks like collection, ingestion, cleaning/preprocessing, analysis, modeling, and visualization.