Making science more open
Institutions adopt new strategies to improve raw research data storage and sharing
While many scientists understand the importance of sharing raw research data, few put it into practice. This is the warning given by an editorial published in the journal Nature in June, citing a study that analyzed 3,556 articles from 282 biomedical and health journals. The study, published in the Journal of Clinical Epidemiology in May, identified 1,792 papers by researchers who had stated that they would provide raw data upon request. However, more than 90% of the corresponding authors refused or ignored the requests. Only 120 (6.7%) provided the data in a usable format within the agreed time frame.
Data sharing is increasingly being encouraged worldwide. In some cases, the practice is a mandatory condition for research-funding applications. One such initiative was recently launched by the National Institutes of Health (NIH), one of the biggest biomedical research-funding agencies in the USA. Beginning in January 2023, the NIH will require most of the 300,000 researchers and 2,500 institutions it funds annually to include a data-management plan in grant applications.
The NIH Policy for Data Management and Sharing seeks to accelerate new discoveries by allowing research results to be more quickly validated, providing access to data of high scientific value and promoting the reuse of this data in other studies. The aim, according to the NIH, is for the policy to evolve flexibly over time, to keep pace with scientific and technological opportunities.
“We hope that researchers plan to share scientific data of sufficient quality that it can be validated and replicated by their peers,” Ryan Bayha, Director of Strategic Engagement at the NIH’s Office of Science Policy, told HUB Einstein. He explains that, under the policy, scientific data does not include lab notes, preliminary analyses, reports, draft articles, plans for future research, or communication with colleagues. Physical objects, such as laboratory specimens, are also not included.
In Brazil, scientists have been encouraged to share research data for some time. In 2017, the São Paulo Research Foundation (FAPESP) began requiring data-management plans in all funding applications except undergraduate research grants. Claudia Bauzer Medeiros, head of FAPESP’s eScience and Data Science program, points out that in general, sharing data—raw or otherwise—increases transparency in the production of scientific knowledge. “Another way to make research more transparent is to share the computer code used to generate the raw data,” says Medeiros.
Immunologist Luiz Vicente Rizzo, Director of Research at Hospital Israelita Albert Einstein (HIAE), highlights that data sharing serves several functions. “Expanding access to data helps multiply interpretations of the results, encouraging discussion and consequently improving the scientific process.” Another advantage of the practice is that it tends to reduce research costs. According to Rizzo, so-called open science is beneficial to research groups with fewer financial resources.
“By accessing raw data produced by other scientists, such groups can participate in the generation of knowledge while at the same time contributing to the reproducibility of studies,” says Rizzo, emphasizing the importance of confirming research results in future studies as a way of identifying errors or fraud not detected in the peer-review process.
Challenges and resistance
Some of the obstacles to wider dissemination of raw data are cultural, such as resistance among many scientists to sharing information with other researchers, while others are ethical or legal, explains Medeiros. “Studies show that in practice, between 60% and 70% of scientists worldwide do not want to share their data. Reasons include the technical difficulty of properly preparing data for sharing, concerns that the information will be misused, and a fear that colleagues will make discoveries and ‘get ahead.’”
Rizzo notes that there are other factors that lead to distrust and prevent the growth of a data-sharing culture in the scientific community. “There are often commercial interests linked to the data, which causes concerns in relation to protecting intellectual property and privacy,” says the HIAE director.
He also notes the lack of reliable platforms for sharing data on a global scale that offer protection against hacking and tampering. In addition, he says, there are diplomatic barriers that hinder the exchange of information between scientists from different countries.
“My impression is that the greatest resistance isn’t coming from the scientists themselves, but from people in other spheres, especially political and technical actors,” says Rizzo. “There are important hurdles to overcome in a world where stealing data has become the twenty-first-century piracy.”
The level of resistance to data sharing varies depending on the researcher’s field of activity, according to Bayha, from the NIH. He points out that different research communities have different data-management standards and varying degrees of familiarity with the practice. “Some fields have a stronger tradition of sharing raw data, as is the case with genome research, for example.”
To illustrate, Bayha explains that the NIH provides lists of repositories that show the data shared by various studies. One is the Database of Genotypes and Phenotypes (dbGaP), which currently holds genome data from more than 2,000 studies.
One of the tasks of HIAE’s Office of Scientific Integrity, created in 2019, is to audit the way data is stored by researchers, explains Raymundo Machado de Azevedo Neto, a researcher and member of the office. “Many of our studies use databases and medical records from the hospitals managed by HIAE. Access to this data is highly restricted and protected,” he says.
“However, when used in a study, the data must be properly anonymized and protected. For example, when data is extracted and formatted in a spreadsheet, the content cannot contain any sensitive patient information, such as names, medical ID numbers, and phone numbers.” According to Azevedo Neto, proper data storage must stringently protect the privacy of patients who voluntarily participate in clinical trials and must use backups to avoid data loss.
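The spreadsheet example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HIAE's actual procedure: the column names (name, medical_id, phone) and the function names are hypothetical, and removing direct identifiers is just one step, since other fields in a real dataset may still allow patients to be re-identified.

```python
import csv
import io

# Hypothetical column names; a real study would list its own identifying fields.
SENSITIVE_COLUMNS = {"name", "medical_id", "phone"}


def drop_sensitive_columns(rows, sensitive=SENSITIVE_COLUMNS):
    """Return copies of the rows without any column listed in `sensitive`."""
    return [
        {key: value for key, value in row.items() if key.lower() not in sensitive}
        for row in rows
    ]


def anonymize_csv(text, sensitive=SENSITIVE_COLUMNS):
    """Read CSV text, remove sensitive columns, and return the cleaned CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cleaned = drop_sensitive_columns(rows, sensitive)
    if not cleaned:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(cleaned[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(cleaned)
    return out.getvalue()


# Made-up sample record, used only to show the effect of the function.
sample = (
    "name,medical_id,phone,age,diagnosis\n"
    "Jane Doe,12345,555-0100,54,hypertension\n"
)
print(anonymize_csv(sample))
```

Run on the made-up record, the sketch keeps only the age and diagnosis columns. Dropping whole columns is the simplest choice here; a study that needs to link records over time would more likely replace identifiers with study-specific codes.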
Importance during the pandemic
Beyond reducing the need to re-collect primary data and stimulating new collaborations, sharing raw research data helps accelerate the pace of discoveries, which is especially desirable in emergencies such as the Covid-19 pandemic, declared in March 2020.
One important contribution in this area was the Covid-19 Data Sharing/BR repository, developed by FAPESP in cooperation with the University of São Paulo (USP) to provide data that could contribute to research on the disease. The project also involved private institutions, including HIAE, which provided raw patient data to the initiative. “This is a rare case of collaboration between public and private organizations aimed at solving one of society’s major problems,” says Medeiros.
Studies that used the data in the repository resulted in numerous scientific articles and new products, in addition to contributing to the education of young scientists involved in the fight against Covid-19.
Another promising project in this area is the Brazilian Reproducibility Initiative, funded by the Serrapilheira Institute, which is conducting a systematic survey of the reproducibility of laboratory findings published by Brazilian researchers. “Our initial aim is to diagnose problems, rather than fill in gaps,” says Olavo Amaral, a physician and researcher at the Institute of Medical Biochemistry of the Federal University of Rio de Janeiro (UFRJ) and head of the initiative.
“We hope to better understand the reproducibility landscape in Brazil and some of the factors that may be related to the success or failure of replication studies, thus contributing to the search for solutions,” explains Amaral.
Amaral says the group is working on establishing a Brazilian Reproducibility Network, along the lines of the UK Reproducibility Network. The objective will be to disseminate and encourage good practices for replicating research. “Some of the many roadblocks to reproducible research include selective data reporting, a lack of transparency in describing methods and results, and the misuse of statistics.”