From Data Cataloguing to Data Discovery: An Open Source Approach

Jacky Jiang
6 min read · Oct 9, 2021

Intro

We live in a time where data is all around us. The growth in the number of connected devices, new regulatory compliance requirements, and emerging IoT & 5G technology all lead to an increasing amount of structured & unstructured data being generated, identified, acquired, and stored every day. In the new digital age, businesses and government agencies unavoidably have to make better use of their data to realise its full potential. As part of digital transformation, organisations need data to help make decisions, solve problems, understand performance and customers, and improve processes and service delivery. However, silos of unconnected data won’t allow any of this. Data assets can only be utilised when they are discoverable within the organisation, so that people are aware of the data’s existence and can evaluate whether it is fit for purpose.

Data assets can end up in the silos of different business units in many ways, for organisational, structural or technical reasons. When an organisation starts growing, it’s common for different functional teams to develop their own ways of working with data that suit their needs, using different technologies or systems to fit their purpose. Legislation, privacy concerns or organisational policies can also contribute to the data silo problem, because different handling procedures and systems might be required to process and access certain data assets. Will a centralised data catalog be the ultimate solution to the problem of data silos?

From data cataloging to data discovery

A centralised data catalog serves as an inventory of the metadata of an organisation’s entire data assets and gives users the information necessary to evaluate data accessibility, suitability, and location. Nevertheless, it won’t be a final solution on its own. Not surprisingly, one of the first imperatives for most data champions is to build a data catalog. But soon, they find themselves in the mire of catalog curation. Manual catalog curation might work for small organisations, but it could never be an option for organisations with a large volume of fast-changing datasets. Some organisations might opt to initiate a plethora of data consolidation/integration projects to create a centralised data repository/data lake in preparation for a centralised data catalog. But the complexity of those projects implies years of lead time before the inception of the data catalog. To solve the issue, we need more than just automation of catalog curation. The catalog system is also required to federate & aggregate metadata from different data repositories & I.T. systems.

Furthermore, the accuracy and quality of the metadata are crucial to the overall discoverability of the organisation’s data assets. Metadata curation often involves manual work, and human errors are inevitable. On the other hand, fully automated metadata curation often results in low-quality metadata. Hence, to address the data discovery issue, the catalog system ought to be capable of correcting and enhancing the metadata as it flows in.

Last but not least, a data catalog system that can help with data discovery needs to truly “understand” the metadata rather than only support keyword-based search. Take geospatial data as an example: can the data catalog system understand the geospatial coverage of the datasets and allow users to find data assets within a specific geospatial area? Or could it understand the temporal coverage (the date range that the dataset relates to) rather than simply the creation date & update date? To achieve those, we need a catalog system with a more intelligent search engine that truly understands the data assets and can therefore surface data assets more suitable for users’ research.

Magda: breaking data silos on day one

That’s why we started the Magda project. Magda is an open-source data catalog system that was initially developed in 2016 as a new version of “data.gov.au”. The first version of “data.gov.au” was built on top of a CKAN system, which had worked fine as a data repository. However, as the central portal of Australian open government data, “data.gov.au” is required to federate & aggregate metadata from many different government agencies & research organisations, enhance the metadata quality, and offer better & more meaningful search. Those requirements motivated us to create a new data catalog system that doesn’t simply serve as a metadata inventory but truly focuses on solving the data discovery problem.

Magda is a cloud-native application built on top of a microservices architecture in a modular way. The core of Magda is a metadata store (we call it the registry) built on an event-sourcing model: any CRUD (create, read, update and delete) operation on the metadata (via registry APIs) triggers events that form an event stream, which allows you to navigate back to the state of the system at any point in time. A single event records not only the nature of the operation but also the data changes, in JSON Patch format.

e.g.:

[
  { "op": "replace", "path": "/baz", "value": "boo" },
  { "op": "add", "path": "/hello", "value": ["world"] },
  { "op": "remove", "path": "/foo" }
]
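To see how such a patch transforms a record, here is a minimal, illustrative applier in Python that handles just the three operations shown above (a real system would use a full RFC 6902 implementation, such as the `jsonpatch` library, which also supports nested paths and array indices):

```python
# Minimal illustrative applier for the three JSON Patch ops shown above.
# Only handles single-level paths; not a full RFC 6902 implementation.

def apply_patch(doc, patch):
    doc = dict(doc)  # shallow copy so the original record is untouched
    for op in patch:
        key = op["path"].lstrip("/")
        if op["op"] in ("add", "replace"):
            doc[key] = op["value"]
        elif op["op"] == "remove":
            del doc[key]
    return doc

patch = [
    {"op": "replace", "path": "/baz", "value": "boo"},
    {"op": "add", "path": "/hello", "value": ["world"]},
    {"op": "remove", "path": "/foo"},
]

print(apply_patch({"baz": "qux", "foo": "bar"}, patch))
# {'baz': 'boo', 'hello': ['world']}
```

Because every event carries such a patch, replaying the event stream from the beginning reconstructs the metadata store at any historical point.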

Any other services or external plugins can choose to subscribe to certain types of data in the event stream to enhance the metadata or perform certain actions. This allows us to plug in different data sources and introduce different metadata enhancement plugins easily. Furthermore, event listeners are required to send a confirmation for every event back to the registry via its API. The registry will not dispatch the next event until it receives confirmation of the previous one. This guarantees that no events can be missed, which in turn allows for more reliable data federation and metadata enhancement processes.
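The confirmation mechanism can be sketched as a simple consume-then-acknowledge loop. The function and field names below are illustrative, not Magda’s actual registry API; the point is only that the consumer’s position advances one event at a time, after each event is handled:

```python
# Sketch of an event listener with per-event acknowledgment.
# Names and payload shapes are illustrative, not Magda's actual API.

def fetch_next_event(last_acked_id, events):
    """Return the first event after last_acked_id, or None if caught up."""
    for event in events:
        if event["id"] > last_acked_id:
            return event
    return None

def consume(events, handle):
    last_acked_id = 0
    while True:
        event = fetch_next_event(last_acked_id, events)
        if event is None:
            break
        handle(event)                # enhance metadata, index, etc.
        last_acked_id = event["id"]  # confirmation: the registry only sends
                                     # the next event after this one is acked
    return last_acked_id

events = [{"id": 1, "patch": []}, {"id": 2, "patch": []}]
processed = []
consume(events, lambda e: processed.append(e["id"]))
print(processed)  # [1, 2]
```

If a listener crashes before acknowledging, the same event is delivered again on restart, which gives at-least-once processing rather than silently dropped events.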

Figure 1: Metadata Processing Overview

Figure 1 shows an overview of the metadata processing workflow in Magda. Firstly, Magda offers a plugin system (named “Connectors”) that allows the system to federate & aggregate metadata from external systems. The automated ingestion of data creates an event stream that triggers the actions of pluggable metadata enhancement components (named “Minions”). Each “Minion” enhances the metadata in one aspect. Eventually, the enhanced metadata is indexed into the search engine to make it “smarter” (e.g. understand geospatial coverage) and is ready to be used by the frontend web interface to produce more meaningful visualisations.
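To make the workflow concrete, a minion can be thought of as a function from a record’s metadata to an extra enhancement “aspect” attached to that record. The sketch below derives a geospatial-coverage aspect from a bounding box; all field names here are hypothetical, not Magda’s actual aspect schema:

```python
# Illustrative "minion": derives a geospatial-coverage aspect from a
# record's bounding-box metadata. Field names are hypothetical.

def spatial_coverage_minion(record):
    """Return an enhancement aspect for the record, or None if not applicable."""
    bbox = record.get("metadata", {}).get("bbox")
    if not bbox:
        return None  # this minion has nothing to add for this record
    west, south, east, north = bbox
    return {
        "spatial-coverage": {
            "boundingBox": {"west": west, "south": south,
                            "east": east, "north": north}
        }
    }

# Roughly the bounding box of mainland Australia.
record = {"id": "ds-001", "metadata": {"bbox": [112.9, -43.7, 153.6, -10.7]}}
print(spatial_coverage_minion(record))
```

Because each minion handles one aspect independently, new enhancements can be added or removed without touching the registry or the other minions.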

The automation of data ingestion offered by the “Connectors” makes it possible to break down data silos on day one. The “Minions”-based enhancement system results in more accurate, higher-quality metadata. Together with a more dataset-oriented search engine that can better use the metadata to offer features such as geospatial coverage or temporal coverage search, Magda offers a more complete data discovery solution.

Looking Forward

Magda is an open-source software project built for the data governance community. You are welcome to use it to solve your data problems or build your own system on top of it. If you want to find out more about Magda, please visit our website, magda.io, or our GitHub repository. In the meantime, we look forward to contributions from the community. Whether reporting a bug or building awareness of the project in the marketplace, we are always grateful. We would also like to keep the Magda project a welcoming place for contributors. If you have any suggestions or good ideas, please visit our GitHub Discussions board and let us know.

This blog was first published on DataQG.
