Data Catalog - Build vs Buy

Data catalogs are becoming increasingly popular as a way to answer questions about data discovery and data trustworthiness. But before a business takes its first steps down this path, it needs to decide whether to build or buy a data catalog solution.

A few companies, including Uber, Airbnb, and LinkedIn, have successfully built data catalog solutions tailored to their business models. But only a few companies possess the skills and clarity of thought to build their own catalog. Others spend weeks or months, and hundreds of thousands of dollars, trying to build one without ever achieving a successful implementation or adoption across the business.

Building vs Buying at a Glance

[Figure: building vs buying a robust data catalog]

Data Catalogs: Understanding the Building Procedure

Building a data catalog takes considerable time and resources. Let's take a quick look at what the process involves.

People

Building a data catalog, and the technical expertise to support it, takes a lot of labor and time, because a new product requires a dedicated team with the right skills. The work involves researching best practices, developing the product, and finally implementing it.

Research suggests that a minimum of five data engineers are needed to look after a data catalog product. The number is even higher while the catalog is being developed and implemented.

Even with open-source data catalog solutions such as Amundsen and CKAN available, businesses find it very difficult to launch their own data catalogs. These open-source catalogs are free only on paper; the effort businesses spend to deploy and run them is costly.

A data and analytics leader at a financial institution in the US describes building a data catalog solution as a frustrating journey. He explains that it took more than eight months to deploy a new data catalog solution.

When considering building a data catalog solution, we always need to keep the competitive landscape in mind. Can the business afford to go another year without a proper data catalog solution in place?

Planning and Designing 

With each passing day, modern businesses are investing more resources in data management solutions. However, allocating resources to every data stack project doesn't necessarily give the best results. Most homegrown data catalogs are built with today's problem statement in mind, and then take about a year to design and deploy. This means that on day one the data catalog is already using last year's technology, which is not acceptable for a modern business.

Homegrown catalogs can also become obsolete, for two reasons:

  1. They most probably won't be compatible with some of the tools we adopt a few years down the line.
  2. Keeping up with data catalog standards is difficult because they change so rapidly.

These problems could be avoided if we had a dedicated internal data catalog team alongside our IT team.

Processes Involved After Building

Building a business data catalog is just the beginning. We need to constantly look after it. It is therefore important to have a development cycle, provide adequate support for queries, and keep the catalog up to date.

Maintenance

The team's work doesn't end once the data catalog solution is developed. On the contrary, that is where it starts. Research suggests that a minimum of six to seven data engineers are required to maintain a data catalog, which is far costlier than buying a data catalog solution built for the cloud.

There are many hidden costs in the process. Once support and maintenance are included, we generally end up paying 2x to 3x the cost of buying a data catalog.
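To make the comparison concrete for your own organization, the ratio can be worked out with a few lines of Python. This is only a rough sketch: the function name `build_to_buy_ratio` is hypothetical, and the salary and licence figures are placeholders chosen for illustration, not numbers from the research mentioned above. Only the head count of six engineers comes from this article.

```python
def build_to_buy_ratio(engineers: int, cost_per_engineer: float, buy_cost: float) -> float:
    """Return yearly in-house staffing cost divided by the yearly cost of a bought catalog."""
    if buy_cost <= 0:
        raise ValueError("buy_cost must be positive")
    return (engineers * cost_per_engineer) / buy_cost


# Placeholder inputs for illustration only; replace them with your own figures.
ratio = build_to_buy_ratio(engineers=6, cost_per_engineer=100_000, buy_cost=250_000)
print(f"In-house maintenance costs {ratio:.1f}x the bought solution")
```

The sketch counts staffing only. Infrastructure, training, and the cost of delayed adoption would push the ratio higher.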

Staying Competitive

Some of the best-known data portals are maintained by governments, for example Data.gov and data.gov.uk, which are built on CKAN. These portals function properly and meet government requirements because they have dedicated resources behind them. But modern businesses find it increasingly difficult to allocate dedicated resources. Also, the way data is used today has changed considerably since these open-source platforms were launched. The gap shows up in several ways:

Difficulty in Finding Data:

A search can return a deprecated data set as a result.

Difficulty in Using Data:

Most of the datasets are in Excel or CSV format. Before using them, we need to clean and normalize the data and then ingest it into another tool for analysis.

Difficulty in Understanding the Data:

The documentation is very limited. Generally, there is only the name of the data set and a single contact email.

Piling Up of Cost:

Maintaining a data catalog over the long haul is like spending on life support rather than on innovation. Companies like Airbnb, Uber, and a few others have built their own because they have a clear point of view.

A mature data-driven culture is a must for the successful design and implementation of a data catalog solution. For businesses that are still defining their data strategy, investing in a homegrown data catalog solution involves a lot of risk, as the direction of the data strategy can change multiple times.

Is Buying the Answer to All Problems?

User Experience

Modern data catalog vendors understand the way modern businesses operate. They have dedicated UX teams who design the product to be comfortable for everyone in the organization. This results in better data governance and a better experience for the whole organization.

Sprinkle has a dedicated team that works on feedback from existing clients and continuously improves the UX for user convenience.

Service and Software Expertise

Most data catalog vendors offer a service component along with their solutions, which adds a human touch to an important product. The vendor's experts reinforce best practices and enable proper usage of the data catalog. We need to work with a vendor who is willing to go the extra mile with training and sharing expertise when it comes to deploying the data catalog. For example, Sprinkle's dedicated solutions engineering team works as an extended arm of its clients and regularly holds training sessions for customers.

Finally, it's very important to consider whether a vendor is a leader in the industry. The vendor must be one step ahead of the curve. For example, the Sprinkle team comes with more than two decades of experience in the data analytics industry and always strives to improve the product and customer experience.

Conclusion 

Young data-driven organizations that are still deciding on their data journey and have limited resources should choose an existing data catalog solution from the market. It doesn't make sense for them to spend a lot of time and money building one in-house, only to have a different strategy in place by the time it's developed. Mature organizations can consider building a data catalog, provided they are willing to keep a dedicated team looking after the solution even after it's built.

Frequently Asked Questions (FAQs) - Data Catalog Build vs Buy

What is a data catalog, and why is it important?

A data catalog is a centralized repository that stores metadata about an organization's data assets, such as databases, tables, columns, and data dictionaries. It is important because it enables organizations to quickly and easily discover, understand, and manage their data assets, which can help improve data governance, data quality, and decision-making.

What are the benefits of building a data catalog?

Building a data catalog allows organizations to tailor the solution to their needs and requirements. It also provides more control over the design, functionality, and integration with other systems. Additionally, building a data catalog can help develop in-house expertise and promote a culture of data management.

What are the benefits of buying a data catalog?

Buying a data catalog can be faster and more cost-effective than building one in-house, especially if the vendor offers a cloud-based solution that requires minimal setup and maintenance. Vendors may also offer advanced features and functionality unavailable in a homegrown solution, such as machine learning-based data discovery and automated data lineage.

What factors should I consider when building or buying a data catalog?

Some factors to consider include the organization's budget, resources, and expertise; the scope and complexity of the data catalog; the level of customization and control required; and the availability and suitability of vendor solutions.

How can I evaluate different data catalog solutions?

Some criteria to consider when evaluating data catalog solutions include their functionality, ease of use, scalability, security, integration with other systems, support, maintenance, and total cost of ownership. Reading reviews and comparing vendors' customer satisfaction ratings and market presence can also be helpful.
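One simple way to apply these criteria is a weighted scorecard. The sketch below is only an illustration: the criteria names come from the answer above, but the weights, the 1 to 5 rating scale, the `weighted_score` helper, and the two candidates with their ratings are all assumptions made up for the example.

```python
# Criteria taken from the answer above; the weights are illustrative assumptions
# and should be adjusted to reflect what matters to your organization.
WEIGHTS = {
    "functionality": 0.20,
    "ease_of_use": 0.15,
    "scalability": 0.15,
    "security": 0.15,
    "integration": 0.15,
    "support_and_maintenance": 0.10,
    "total_cost_of_ownership": 0.10,
}


def weighted_score(ratings: dict) -> float:
    """Combine per-criterion ratings (1 = poor, 5 = excellent) into one score."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


# Hypothetical candidates with made-up ratings.
candidates = {
    "Option A": {name: 4 for name in WEIGHTS},
    "Option B": {**{name: 3 for name in WEIGHTS}, "security": 5},
}

# Rank candidates from highest to lowest weighted score.
ranked = sorted(candidates, key=lambda c: weighted_score(candidates[c]), reverse=True)
for name in ranked:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

A scorecard like this does not replace reviews or a trial deployment, but it forces the team to agree on priorities before talking to vendors.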

What is the difference between a data catalog and metadata?

Metadata is the information that describes a piece of data, while a data catalog is a collection of metadata about many datasets. In other words, a data catalog is like a library that holds metadata as its contents.

What is the difference between a dataset and a data catalog? 

A dataset is a collection of related data points or records, while a data catalog is a tool used to organize, manage, and provide access to multiple datasets. Essentially, a dataset is the actual data being stored or analyzed, while a data catalog serves as the repository for organizing and managing these datasets. 

List some data catalog tools 

Some popular data catalog tools include Collibra Catalog, Alation Data Catalog, IBM Watson Knowledge Catalog, Informatica Enterprise Data Catalog, and Apache Atlas. 

What are the types of metadata in the data catalog? 

Types of metadata in a data catalog can include

  • technical metadata (e.g., file format, storage location),
  • descriptive metadata (e.g., title, author),
  • structural metadata (e.g., relationships between different datasets), and
  • administrative metadata (e.g., access controls). 
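The four metadata types listed above can be pictured as fields on a single catalog entry. The following is a minimal sketch, not how any particular product stores metadata: the `CatalogEntry` class, the `search` helper, the `orders` dataset, and every value in it (including the bucket path) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset in a catalog, grouped by the four metadata types above."""
    name: str
    technical: dict = field(default_factory=dict)       # file format, storage location
    descriptive: dict = field(default_factory=dict)     # title, author
    structural: dict = field(default_factory=dict)      # relationships to other datasets
    administrative: dict = field(default_factory=dict)  # access controls


# Hypothetical entry; all values are made up for illustration.
entry = CatalogEntry(
    name="orders",
    technical={"file_format": "parquet", "storage_location": "s3://example-bucket/orders/"},
    descriptive={"title": "Customer orders", "author": "data-platform-team"},
    structural={"related_datasets": ["customers", "products"]},
    administrative={"read_access": ["analytics", "finance"]},
)


def search(entries, keyword):
    """Return names of entries whose name or descriptive metadata mentions the keyword."""
    keyword = keyword.lower()
    return [
        e.name
        for e in entries
        if keyword in e.name.lower()
        or any(keyword in str(value).lower() for value in e.descriptive.values())
    ]


print(search([entry], "customer"))
```

Searching descriptive metadata rather than the data itself is what lets a catalog answer discovery questions without scanning the underlying datasets.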

What is an example of a data catalog?

One example is the AWS Glue Data Catalog, which provides centralized metadata management across AWS services such as Amazon S3 and Amazon Redshift.

Written by
Soham Dutta
