Tsipouras et al. Rare Dis Orphan Drugs J 2023;2:17 https://dx.doi.org/10.20517/rdodj.2023.15 Page 3 of 6
Table 1. The data aggregation challenge. Comparison of risks and benefits between existing and federated databases

| | Existing databases | Federated databases |
|---|---|---|
| Security and compliance | Movement and copying of sensitive information increases the risk of data breach | In a TRE and federation environment, data are not moved or copied, reducing security risk |
| Data size and interoperability | Lack of standardized formats and pipelines limits interoperability, and negatively impacts scalability, cost, and efficiency | Fully standardized data, securely accessible by cloud-based platforms through federation, can be combined with global cohorts and disparate datasets |
| Collaboration | Data cannot leave jurisdictional borders. Data sharing agreements are frequently difficult to negotiate and implement, hindering collaboration | Federated approaches will eliminate a major barrier across individual datasets, vastly improving the statistical power of research |

TRE: trusted research environment.
Federated data analysis platforms, which facilitate secure data access from multiple sources without the
need for data movement (during which data could be vulnerable to interception), have emerged as a promising
part of a solution for safely sharing anonymized genomic data. Here, genomic data remain secure in their
TREs, which can then be linked virtually using a set of Application Programming Interfaces (APIs).
Traditional data access methods involve researchers downloading data to an institutional computing cluster.
With federated analysis, the analysis is brought to where the distributed data lies, thereby eliminating the
risky movement of data and removing many existing barriers to accessibility [13]. Such technology means that
data can be made securely accessible while data controllers (e.g., biobanks and healthcare providers)
retain jurisdictional autonomy over data, a key concern in international data sharing.
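
The idea of bringing the analysis to the data can be illustrated with a minimal sketch. It assumes each site runs a small function locally and returns only aggregate statistics, which a coordinator then combines. The names `SiteSummary`, `local_summary`, and `pooled_mean` are hypothetical, the values are invented toy data, and real federated platforms add authentication, disclosure controls, and API transport that are not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteSummary:
    """Aggregate statistics that leave a site; no row-level data is included."""
    n: int
    total: float


def local_summary(values: List[float]) -> SiteSummary:
    # Runs inside each (hypothetical) TRE, next to the data it describes.
    return SiteSummary(n=len(values), total=sum(values))


def pooled_mean(summaries: List[SiteSummary]) -> float:
    # Runs at the coordinator, which only ever sees the per-site summaries.
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no observations across sites")
    return sum(s.total for s in summaries) / n


# Toy measurements standing in for two sites (invented for illustration).
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

summaries = [local_summary(site_a), local_summary(site_b)]
result = pooled_mean(summaries)
print(result)
```

The pooled mean equals the mean that would be obtained by merging the raw values, yet each site discloses only a count and a sum, so the data controller keeps the records within its own environment.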
International initiatives such as the Global Alliance for Genomics and Health (GA4GH) [14] promote the
international sharing of genomic and health-related data, in part by setting interoperability
standards and providing open-source APIs.
Common Data Models (CDMs) are crucial to ensuring data is interoperable. Several have recently grown in
popularity in the life sciences sector, including the OMOP (Observational Medical Outcomes
Partnership) CDM from OHDSI (Observational Health Data Sciences and Informatics), specifically for
clinical-genomic data. Examples of health organizations utilizing OMOP as their CDM include the UK
Biobank and All of Us from the US National Institutes of Health (NIH) [15,16].
Additionally, extraction, transformation, and loading (ETL) pipelines that automate the processing and
conversion of raw data into analysis-ready data further simplify this process for researchers.
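
As a rough illustration of the transformation step in such a pipeline, the sketch below maps raw records to rows loosely modelled on the OMOP PERSON table. The source field names (`patient_no`, `sex`, `dob`) are hypothetical, extraction and loading are reduced to plain Python lists, and the gender concept identifiers are assumed to be the standard OMOP values and should be verified against the OHDSI vocabulary before any real use.

```python
from typing import Dict, Iterable, List

# Assumed standard OMOP gender concept IDs; verify against the OHDSI vocabulary.
GENDER_CONCEPTS = {"male": 8507, "m": 8507, "female": 8532, "f": 8532}


def transform(raw: Dict[str, str]) -> Dict[str, int]:
    """Convert one raw record (hypothetical source layout) to a PERSON-like row."""
    sex = raw["sex"].strip().lower()
    return {
        "person_id": int(raw["patient_no"]),
        "gender_concept_id": GENDER_CONCEPTS.get(sex, 0),  # 0 = no matching concept
        "year_of_birth": int(raw["dob"][:4]),  # expects ISO dates (YYYY-MM-DD)
    }


def run_etl(extracted: Iterable[Dict[str, str]]) -> List[Dict[str, int]]:
    # Extraction is assumed to happen upstream; loading is represented
    # here by simply returning the transformed rows.
    return [transform(record) for record in extracted]


# Invented example records with inconsistent coding of sex.
raw_rows = [
    {"patient_no": "17", "sex": "F", "dob": "1984-03-02"},
    {"patient_no": "18", "sex": "male ", "dob": "1990-11-30"},
]
person_rows = run_etl(raw_rows)
print(person_rows)
```

Once every site applies the same mapping, differently coded source values such as "F" and "female" resolve to a single shared concept, which is what makes later joint analysis across datasets possible.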
Normalizing all data to internationally recognized standards allows researchers to perform joint analyses
across distributed datasets, which is key to ensuring diversity and representation of as many populations as
possible in studies.
These standardized and interoperable datasets could be combined seamlessly for analysis via federation,
enabling researchers to analyze this data collaboratively in conjunction with other complementary datasets.
Standardization of data formats and analytical approaches within and even between health systems can
bring substantial benefits in terms of comparability of data and contribute to continually improving
processes.
Illustrative examples with potential multiplier effects could include: