Between data and metadata
How research institutions have come together to create a data repository network
In December 2019 the São Paulo State Research Foundation (FAPESP) inaugurated a network of open data repositories, bringing together six public universities—the University of São Paulo (USP), the University of Campinas (UNICAMP), São Paulo State University (UNESP), the Federal University of São Paulo (UNIFESP), the Federal University of ABC (UFABC) and the Federal University of São Carlos (UFSCar)—in addition to the Aeronautics Institute of Technology (ITA) and the Brazilian Agricultural Research Corporation (EMBRAPA).
The network was constituted as a federation of independent repositories linked by a central node based at USP. Each participant is autonomous, with its own data management, governance, and personnel policies. The network was developed over three years and involved more than 100 people, including IT professionals, librarians, database researchers, researchers from other fields, and university administrators.
Designed with expandability and independence in mind, the network implementation faced challenges in data engineering, data communication protocols, and sustainability (primarily within each participant institution). Each organization worked to overcome internal obstacles so that its metadata could be collated into a single repository. We separated each institution's internal concerns from the global ones, such as data communication and networking.
From an external point of view, the central node is the network's public interface, accessible on the website Metabuscador de Dados de Pesquisa (Metasearch Engine for Research Data). This portal collates information by scanning each repository daily for the metadata that describes its research data files. Communication between each institutional repository and the central node uses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), with each repository exporting certain sections of its metadata on a daily basis. The Metasearch Engine was implemented to connect to systems running one of three research data management platforms: DSpace, Dataverse, or CKAN.
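To make the daily harvesting step concrete, the following is a minimal Python sketch of how a central node might request and parse records over OAI-PMH. It is an illustration under stated assumptions, not the Metasearch Engine's actual code: the endpoint URL is hypothetical, the function names are invented for this example, and the sketch assumes the `oai_dc` (Dublin Core) metadata format, which the OAI-PMH specification requires every repository to support. The article does not say which metadata format or fields the network actually exchanges.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, from_date=None, token=None):
    """Build a ListRecords request URL (function name is hypothetical)."""
    if token:
        # A resumptionToken is an exclusive argument: send it alone with the verb.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if from_date:
            # Selective harvesting: only records changed since this date.
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        if header is None or header.get("status") == "deleted":
            continue  # skip records the repository has withdrawn
        records.append({
            "identifier": header.findtext("oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text or "" for c in rec.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


def harvest(base_url, from_date=None):
    """Yield every record from a repository, following resumption tokens."""
    token = None
    while True:
        with urlopen(list_records_url(base_url, from_date, token)) as resp:
            records, token = parse_records(resp.read())
        yield from records
        if token is None:
            break


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real repository's OAI-PMH base URL.
    for record in harvest("https://repository.example.org/oai", from_date="2024-01-01"):
        print(record["identifier"], record["title"])
```

Because OAI-PMH is implemented on the repository side, a harvester of this shape does not need to know whether the repository runs DSpace, Dataverse, or CKAN, provided that the installation exposes an OAI-PMH endpoint.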
Each institution internally created its own research data management group, with IT professionals, computer scientists, librarians and researchers working with large volumes of data across several knowledge areas such as life sciences, engineering, humanities, and exact sciences.
This group is normally attached to the institution's research pro-rectorate. In addition, each organization drafted a series of regulations and policies under which its researchers could publish datasets.
As each member of the federation is independent, the repositories use different research data management platforms and organize their data in different ways, though all are obligated to follow a basic standard of eight metadata attributes defined after studies conducted during the network project.
The federated architecture was selected after performance studies, and in light of the significant operating differences between the member institutions. Furthermore, federation enables expansion of the network to new members, which took place in 2020 and 2022.
Thanks to careful work on the project's design and development, any institution based in the state of São Paulo with its own in-house open research data repository can join the network with ease, provided that it follows certain basic operational rules. Membership is mediated by the FAPESP scientific directorate and implemented technically with the support of USP.
The inclusion of the COVID-19 Data Sharing/BR data repository in June 2020 is one example of the open and extensible nature of the project implementation. The repository brings together data from five healthcare organizations (Instituto Fleury, Hospital Sírio-Libanês, Hospital Israelita Albert Einstein, USP’s Hospital das Clínicas and Beneficência Portuguesa de São Paulo) and is a FAPESP initiative in collaboration with USP and the first three of those institutions.
It could be implemented so quickly only because the repository network was already functioning in the federated format described above. In 2022, Redape (the EMBRAPA research data repository) was onboarded to the network in less than one hour.
Assembling the network and rolling it out into operation raised several computing and legal challenges. The task now is to encourage researchers to publish their data openly, and the many barriers include both technical and cultural factors. Compliance with the Brazilian General Data Protection Law (LGPD), approval by local and national ethics committees, data and metadata curation, and the sustainability of hardware, software, and human resources are examples of the issues dealt with by local management groups.
The important takeaway is that, thanks to the establishment of this repository network, adherence to the open science movement has accelerated and expanded across all participating institutions. It should be noted that when the United Nations Educational, Scientific and Cultural Organization (UNESCO) voted on Open Science recommendations in November 2021, the network was already in full operation.
Claudia Bauzer Medeiros holds a Computer Science PhD from the University of Waterloo, Canada. She is a full professor at the University of Campinas (UNICAMP) Computing Institute, and a coordinating member of the FAPESP eScience and Data Science program.
Fátima L. S. Nunes holds a PhD in Sciences from the University of São Paulo (USP). She is a lecturer at the USP School of Arts, Sciences, and Humanities (EACH).
João Eduardo Ferreira holds a PhD in Computational Physics from the University of São Paulo (USP). He is a professor at the Department of Computer Science at the USP Institute of Mathematics and Statistics (IME).
The opinion pieces do not necessarily represent the views of Science Arena or the Hospital Israelita Albert Einstein.