orcahouse-doc

DISCLAIMER: AI AUTOGENERATED README. LAST UPDATE 2025-04-29.

OrcaVault: A Data Warehouse for Genomic Sequencing Metadata Management

OrcaVault is a comprehensive data warehouse system designed to manage and track genomic sequencing metadata across multiple research centers. It provides a structured approach to storing and accessing laboratory information, sequencing data locations, and workflow execution details.

The system implements a Data Vault 2.0 architecture pattern with hub, link, and satellite tables to maintain historical tracking and relationships between different entities such as libraries, samples, subjects, projects, and sequencing runs. It integrates data from multiple sources including Laboratory Information Management Systems (LIMS), workflow managers, and sequence run managers to provide a unified view of genomic research data.
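
To make the pattern concrete, the sketch below shows what a hub model in this style could look like. It is illustrative only, not the repository's actual code: the staging model name (psa_lims) and all column names are assumptions.

    -- Hypothetical hub model sketch (e.g. models/dcl/hub_library.sql).
    -- The staging model 'psa_lims' and the column names are assumed for
    -- illustration and may not match the repository's models.

    with distinct_keys as (

        select distinct library_id
        from {{ ref('psa_lims') }}       -- assumed persistent-staging model
        where library_id is not null

    )

    select
        md5(library_id)     as library_hk,      -- surrogate hash key
        library_id,                             -- business key
        current_timestamp   as load_datetime,   -- load audit column
        'lims'              as record_source    -- Data Vault record source
    from distinct_keys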

Repository Structure

orcavault/
└── models/                  # Data models directory
    ├── dcl/                 # Data Core Layer - Contains hub, link, and satellite tables
    │   ├── hub_*.sql        # Hub tables defining core business entities
    │   ├── link_*.sql       # Link tables defining relationships between entities
    │   ├── sat_*.sql        # Satellite tables containing descriptive attributes
    │   └── effsat_*.sql     # Effectivity satellite tables for temporal tracking
    ├── mart/                # Data Mart Layer - Contains subject-area specific models
    │   ├── centre/          # Centre-specific views and transformations
    │   ├── curation/        # Data curation related models
    │   ├── dawson/          # Dawson Research Group specific models
    │   ├── grimmond/        # Grimmond Research Group specific models
    │   └── tothill/         # Tothill Research Group specific models
    ├── ods/                 # Operational Data Store - Source system configurations
    ├── psa/                 # Persistent Staging Area - Initial data landing
    └── tsa/                 # Temporary Staging Area - Temporary landing zone
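
To make the dcl/ layer concrete, here is a minimal sketch of a link model relating the library and sample hubs. As above, the staging model (psa_lims) and column names are assumptions for illustration.

    -- Hypothetical link model sketch (e.g. models/dcl/link_library_sample.sql)
    -- relating the library and sample hubs.

    select distinct
        md5(library_id || '|' || sample_id)  as library_sample_hk,  -- link hash key
        md5(library_id)                      as library_hk,         -- points to hub_library
        md5(sample_id)                       as sample_hk,          -- points to hub_sample
        current_timestamp                    as load_datetime,
        'lims'                               as record_source
    from {{ ref('psa_lims') }}               -- assumed staging model
    where library_id is not null
      and sample_id is not null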

Usage Instructions

Prerequisites

  • Python 3 with pip (used to install dbt)
  • Access to a PostgreSQL database to host the warehouse

Installation

  1. Clone the repository:
    git clone <repository-url>
    cd orcavault
    
  2. Install dependencies:
    pip install dbt-core dbt-postgres
    
  3. Configure database connection: Create a profiles.yml file in your ~/.dbt/ directory:
    orcavault:
      target: dev
      outputs:
        dev:
          type: postgres
          host: <your-host>
          port: 5432
          user: <your-user>
          password: <your-password>
          dbname: orcavault
          schema: public
    

Quick Start

  1. Install dbt package dependencies:
    dbt deps
    
  2. Run the models:
    dbt run
    
  3. Test the data (see the example test below):
    dbt test
    
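As an illustration of step 3, dbt tests can be written as SQL files that return failing rows. The sketch below is hypothetical: the test file name and the hub_library model and columns are assumptions.

    -- Hypothetical singular test sketch
    -- (e.g. tests/assert_hub_library_keys_unique.sql).
    -- dbt treats any rows returned by this query as test failures.

    select
        library_hk,
        count(*) as occurrences
    from {{ ref('hub_library') }}    -- assumed hub model
    group by library_hk
    having count(*) > 1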

More Detailed Examples

  1. Building the models of a specific layer (for example, the operational data store):
    dbt run --select ods
    
  2. Building specific data marts:
    dbt run --select mart.centre
    
  3. Generating documentation:
    dbt docs generate
    dbt docs serve
    

Troubleshooting

Common Issues:

  1. Database Connection Errors
    • Error: "Could not connect to the database"
    • Solution: Verify credentials in profiles.yml and database accessibility
    • Debug: dbt debug
  2. Model Dependencies
    • Error: "Could not find ref()"
    • Solution: Ensure all referenced models exist and are built in the correct order
    • Debug: dbt ls --select model_name+
  3. Data Quality Issues
    • Error: "Assert test failed"
    • Solution: Check source data quality and transformation logic
    • Debug: dbt test --select model_name

Data Flow

OrcaVault implements a multi-layered data architecture that processes genomic metadata from source systems through staging areas into a Data Vault model, finally presenting it in subject-area specific data marts.

Source Systems       Staging              Core                 Marts
[LIMS]          --> [TSA/PSA] ----+
[Workflow Mgr]  --> [ODS]     --->[Data Vault Model] --> [Centre Mart]
[Sequence Mgr]  --> [Staging] ----+                  --> [Research Group Marts]

Key Integration Points:

  1. Source systems provide raw data through database connections or file exports
  2. Staging areas (TSA/PSA) provide data quality checks and standardization
  3. Data Vault model (DCL) maintains historical tracking and relationships (see the satellite sketch after this list)
  4. Data marts provide subject-area specific views optimized for analysis
  5. All transformations maintain data lineage and audit trails
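
To illustrate point 3, descriptive attributes in a Data Vault satellite are typically versioned with a change-detection hash. The sketch below is hypothetical: the model name, staging source, and attribute columns are assumptions.

    -- Hypothetical satellite model sketch (e.g. models/dcl/sat_library_lims.sql).
    -- The hashdiff lets loads detect attribute changes, so a new satellite row
    -- is only needed when the attributes differ from the previous version.

    select
        md5(library_id)                  as library_hk,   -- points to hub_library
        md5(coalesce(assay, '') || '|' ||
            coalesce(workflow, ''))      as hashdiff,     -- change-detection hash
        assay,                                            -- descriptive attributes
        workflow,
        current_timestamp                as load_datetime,
        'lims'                           as record_source
    from {{ ref('psa_lims') }}           -- assumed staging model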