orcahouse-doc

DISCLAIMER: AI AUTOGENERATED README. LAST UPDATE 2025-04-29.

OrcaVault: A Data Warehouse for Genomic Sequencing Metadata Management

OrcaVault is a comprehensive data warehouse system designed to manage and track genomic sequencing metadata across multiple research centers. It provides a structured approach to storing and accessing laboratory information, sequencing data locations, and workflow execution details.

The system implements a Data Vault 2.0 architecture pattern with hub, link, and satellite tables to maintain historical tracking and relationships between different entities such as libraries, samples, subjects, projects, and sequencing runs. It integrates data from multiple sources including Laboratory Information Management Systems (LIMS), workflow managers, and sequence run managers to provide a unified view of genomic research data.
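
To make the pattern concrete, the sketch below shows what a hub model in this style could look like. It is illustrative only, not the repository's actual code: the staging model name (psa_lims) and all column names are assumptions.

    -- Hypothetical hub model sketch (e.g. models/dcl/hub_library.sql).
    -- The staging model 'psa_lims' and the column names are assumed for
    -- illustration and may not match the repository's models.

    with distinct_keys as (

        select distinct library_id
        from {{ ref('psa_lims') }}       -- assumed persistent-staging model
        where library_id is not null

    )

    select
        md5(library_id)     as library_hk,      -- surrogate hash key
        library_id,                             -- business key
        current_timestamp   as load_datetime,   -- load audit column
        'lims'              as record_source    -- Data Vault record source
    from distinct_keys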

Repository Structure

orcavault/
└── models/                  # Data models directory
    ├── dcl/                 # Data Core Layer - Contains hub, link, and satellite tables
    │   ├── hub_*.sql        # Hub tables defining core business entities
    │   ├── link_*.sql       # Link tables defining relationships between entities
    │   ├── sat_*.sql        # Satellite tables containing descriptive attributes
    │   └── effsat_*.sql     # Effectivity satellite tables for temporal tracking
    ├── mart/                # Data Mart Layer - Contains subject-area specific models
    │   ├── centre/          # Centre-specific views and transformations
    │   ├── curation/        # Data curation related models
    │   ├── dawson/          # Dawson Research Group specific models
    │   ├── grimmond/        # Grimmond Research Group specific models
    │   └── tothill/         # Tothill Research Group specific models
    ├── ods/                 # Operational Data Store - Source system configurations
    ├── psa/                 # Persistent Staging Area - Initial data landing
    └── tsa/                 # Temporary Staging Area - Temporary landing zone
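
To make the dcl/ layer concrete, here is a minimal sketch of a link model relating the library and sample hubs. As above, the staging model (psa_lims) and column names are assumptions for illustration.

    -- Hypothetical link model sketch (e.g. models/dcl/link_library_sample.sql)
    -- relating the library and sample hubs.

    select distinct
        md5(library_id || '|' || sample_id)  as library_sample_hk,  -- link hash key
        md5(library_id)                      as library_hk,         -- points to hub_library
        md5(sample_id)                       as sample_hk,          -- points to hub_sample
        current_timestamp                    as load_datetime,
        'lims'                               as record_source
    from {{ ref('psa_lims') }}               -- assumed staging model
    where library_id is not null
      and sample_id is not null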

Usage Instructions

Prerequisites

  • Python 3 with pip (used to install dbt)
  • Access to a PostgreSQL database to host the warehouse

Installation

  1. Clone the repository:
    git clone <repository-url>
    cd orcavault
    
  2. Install dependencies:
    pip install dbt-core dbt-postgres
    
  3. Configure database connection: Create a profiles.yml file in your ~/.dbt/ directory:
    orcavault:
      target: dev
      outputs:
        dev:
          type: postgres
          host: <your-host>
          port: 5432
          user: <your-user>
          password: <your-password>
          dbname: orcavault
          schema: public
    

Quick Start

  1. Install dbt package dependencies:
    dbt deps
    
  2. Run the models:
    dbt run
    
  3. Test the data (see the example test below):
    dbt test
    
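As an illustration of step 3, dbt tests can be written as SQL files that return failing rows. The sketch below is hypothetical: the test file name and the hub_library model and columns are assumptions.

    -- Hypothetical singular test sketch
    -- (e.g. tests/assert_hub_library_keys_unique.sql).
    -- dbt treats any rows returned by this query as test failures.

    select
        library_hk,
        count(*) as occurrences
    from {{ ref('hub_library') }}    -- assumed hub model
    group by library_hk
    having count(*) > 1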

More Detailed Examples

  1. Building the models of a specific layer (for example, the operational data store):
    dbt run --select ods
    
  2. Building specific data marts:
    dbt run --select mart.centre
    
  3. Generating documentation:
    dbt docs generate
    dbt docs serve
    

Troubleshooting

Common Issues:

  1. Database Connection Errors
    • Error: "Could not connect to the database"
    • Solution: Verify credentials in profiles.yml and database accessibility
    • Debug: dbt debug
  2. Model Dependencies
    • Error: "Could not find ref()"
    • Solution: Ensure all referenced models exist and are built in the correct order
    • Debug: dbt ls --select model_name+
  3. Data Quality Issues
    • Error: "Assert test failed"
    • Solution: Check source data quality and transformation logic
    • Debug: dbt test --select model_name

Data Flow

OrcaVault implements a multi-layered data architecture that processes genomic metadata from source systems through staging areas into a Data Vault model, finally presenting it in subject-area specific data marts.

Source Systems       Staging              Core                 Marts
[LIMS]          --> [TSA/PSA] ----+
[Workflow Mgr]  --> [ODS]     --->[Data Vault Model] --> [Centre Mart]
[Sequence Mgr]  --> [Staging] ----+                  --> [Research Group Marts]

Key Integration Points:

  1. Source systems provide raw data through database connections or file exports
  2. Staging areas (TSA/PSA) provide data quality checks and standardization
  3. Data Vault model (DCL) maintains historical tracking and relationships (see the satellite sketch after this list)
  4. Data marts provide subject-area specific views optimized for analysis
  5. All transformations maintain data lineage and audit trails
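
To illustrate point 3, descriptive attributes in a Data Vault satellite are typically versioned with a change-detection hash. The sketch below is hypothetical: the model name, staging source, and attribute columns are assumptions.

    -- Hypothetical satellite model sketch (e.g. models/dcl/sat_library_lims.sql).
    -- The hashdiff lets loads detect attribute changes, so a new satellite row
    -- is only needed when the attributes differ from the previous version.

    select
        md5(library_id)                  as library_hk,   -- points to hub_library
        md5(coalesce(assay, '') || '|' ||
            coalesce(workflow, ''))      as hashdiff,     -- change-detection hash
        assay,                                            -- descriptive attributes
        workflow,
        current_timestamp                as load_datetime,
        'lims'                           as record_source
    from {{ ref('psa_lims') }}           -- assumed staging model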