DISCLAIMER: AI AUTOGENERATED README. LAST UPDATE 2025-04-29.
OrcaVault is a comprehensive data warehouse system designed to manage and track genomic sequencing metadata across multiple research centers. It provides a structured approach to storing and accessing laboratory information, sequencing data locations, and workflow execution details.
The system implements a Data Vault 2.0 architecture pattern with hub, link, and satellite tables to maintain historical tracking and relationships between different entities such as libraries, samples, subjects, projects, and sequencing runs. It integrates data from multiple sources including Laboratory Information Management Systems (LIMS), workflow managers, and sequence run managers to provide a unified view of genomic research data.
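To make the hub, link, and satellite roles concrete, here is a minimal Python sketch. It is not OrcaVault's model code (the real models are SQL files run by dbt). The MD5 hashing, the `||` delimiter, the field names, and the sample identifiers are all assumptions chosen for illustration.

```python
import hashlib
from datetime import datetime, timezone

DELIMITER = "||"  # assumed separator between business key parts


def hash_key(*business_keys: str) -> str:
    """Hash one or more normalised business keys into a fixed-length key."""
    normalised = DELIMITER.join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def hub_row(business_key: str, source: str, load_ts: datetime) -> dict:
    """A hub row stores only the business key and load metadata."""
    return {
        "hash_key": hash_key(business_key),
        "business_key": business_key,
        "load_ts": load_ts,
        "record_source": source,
    }


def link_row(hub_rows: list, source: str, load_ts: datetime) -> dict:
    """A link row relates two or more hubs through their hash keys."""
    return {
        "hash_key": hash_key(*(h["business_key"] for h in hub_rows)),
        "hub_hash_keys": [h["hash_key"] for h in hub_rows],
        "load_ts": load_ts,
        "record_source": source,
    }


def sat_row(parent: dict, attributes: dict, load_ts: datetime) -> dict:
    """A satellite row holds descriptive attributes plus a hash diff,
    so a changed attribute set can be detected by comparing hashes."""
    payload = DELIMITER.join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return {
        "parent_hash_key": parent["hash_key"],
        "hash_diff": hashlib.md5(payload.encode("utf-8")).hexdigest(),
        "load_ts": load_ts,
        "attributes": dict(attributes),
    }


now = datetime.now(timezone.utc)
library = hub_row("LIB-0001", "lims", now)  # hypothetical library ID
sample = hub_row("SMP-0042", "lims", now)   # hypothetical sample ID
library_sample = link_row([library, sample], "lims", now)
library_attrs = sat_row(library, {"assay": "WGS", "quality": "good"}, now)
```

Because a hub holds only the key, new descriptive values for the same library arrive as additional satellite rows with a different `hash_diff`, which is how history is kept without updating existing rows.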
orcavault/
└── models/                 # Data models directory
    ├── dcl/                # Data Core Layer - Contains hub, link, and satellite tables
    │   ├── hub_*.sql       # Hub tables defining core business entities
    │   ├── link_*.sql      # Link tables defining relationships between entities
    │   ├── sat_*.sql       # Satellite tables containing descriptive attributes
    │   └── effsat_*.sql    # Effective satellite tables for temporal tracking
    ├── mart/               # Data Mart Layer - Contains subject-area specific models
    │   ├── centre/         # Centre-specific views and transformations
    │   ├── curation/       # Data curation related models
    │   ├── dawson/         # Dawson Research Group specific models
    │   ├── grimmond/       # Grimmond Research Group specific models
    │   └── tothill/        # Tothill Research Group specific models
    ├── ods/                # Operational Data Store - Source system configurations
    ├── psa/                # Persistent Staging Area - Initial data landing
    └── tsa/                # Temporary Staging Area - Temporary landing zone
git clone <repository-url>
cd orcavault
pip install dbt-core dbt-postgres
Configure the database connection in a profiles.yml file in your ~/.dbt/ directory:
orcavault:
  target: dev
  outputs:
    dev:
      type: postgres
      host: <your-host>
      port: 5432
      user: <your-user>
      password: <your-password>
      dbname: orcavault
      schema: public
dbt deps
dbt run
dbt test
dbt run --select source:ods
dbt run --select mart.centre
dbt docs generate
dbt docs serve
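The `dbt deps`, `dbt run`, and `dbt test` commands can also be chained from a script so that a failure stops the sequence early. The following is a minimal sketch, assuming `dbt` is on the `PATH` and the script is started from the project root. `run_command` and `run_steps` are hypothetical helpers written for this example, not part of dbt.

```python
import subprocess
from typing import Callable, Sequence

# The dependency, build, and test commands in the order they are normally run.
STEPS = [
    ["dbt", "deps"],
    ["dbt", "run"],
    ["dbt", "test"],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_steps(steps=STEPS, runner: Callable[[Sequence[str]], int] = run_command) -> list:
    """Run each step in order, stopping at the first non-zero exit code.

    Returns the list of steps that completed successfully.
    """
    completed = []
    for cmd in steps:
        if runner(cmd) != 0:
            print("step failed:", " ".join(cmd))
            break
        completed.append(cmd)
    return completed


# Uncomment to execute against a real project:
# run_steps()
```

Passing the runner as a parameter keeps the ordering logic testable without a database: a stub that returns exit codes can stand in for the real `dbt` executable.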
Common Issues:
dbt debug
dbt ls --select model_name+
dbt test --select model_name
OrcaVault implements a multi-layered data architecture that processes genomic metadata from source systems through staging areas into a Data Vault model, finally presenting it in subject-area specific data marts.
Source Systems       Staging        Core                     Marts
[LIMS]          -->  [TSA/PSA]  ----+
[Workflow Mgr]  -->  [ODS]      --->[Data Vault Model]  -->  [Centre Mart]
[Sequence Mgr]  -->  [Staging]  ----+                   -->  [Research Group Marts]
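The flow in the diagram can be read as a dependency graph. As an illustration only, the sketch below encodes the boxes as nodes and uses Python's standard `graphlib` module to derive one valid processing order. The node names are copied from the diagram; the edges are an assumption about how the layers depend on each other, not an export of the project's actual dbt DAG.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its set (edges read off the diagram).
DEPENDS_ON = {
    "TSA/PSA": {"LIMS"},
    "ODS": {"Workflow Mgr"},
    "Staging": {"Sequence Mgr"},
    "Data Vault Model": {"TSA/PSA", "ODS", "Staging"},
    "Centre Mart": {"Data Vault Model"},
    "Research Group Marts": {"Data Vault Model"},
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(build_order)
```

Any order produced this way places the source systems first, the staging layers next, then the Data Vault model, and the marts last. The exact order among nodes of the same layer is not fixed.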
Key Integration Points: