Structure

{tidywigits} is built on top of R6, R’s implementation of encapsulated object-oriented programming, which helps with code organisation. It consists of several base classes, such as Config, Tool, and Workflow, which we describe below. Each R6 class can contain public and private members, which are either functions (methods) or non-functions (fields).
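To make the public/private distinction concrete, here is a minimal R6 sketch. The Greeter class and all of its members are hypothetical and are not part of {tidywigits}; the only assumption is that the {R6} package is installed.

```r
library(R6)

# Hypothetical class, for illustration only (not part of {tidywigits}).
Greeter <- R6Class(
  "Greeter",
  public = list(
    # public field
    name = NULL,
    # constructor, called by Greeter$new()
    initialize = function(name) {
      self$name <- name
    },
    # public method that relies on a private helper
    greet = function() {
      paste(private$prefix(), self$name)
    }
  ),
  private = list(
    # private field, only reachable from inside the class
    salutation = "Hello",
    # private method
    prefix = function() {
      paste0(private$salutation, ",")
    }
  )
)

g <- Greeter$new("PURPLE")
g$greet()      # "Hello, PURPLE"
g$name         # public field, accessible with $
g$salutation   # NULL: private members are not visible from outside
```

Public members are reached through self inside the class and through $ outside it, while private members are only reachable through private inside the class.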

`Config`

A Config object contains functionality for interacting with the YAML configuration files that are part of {tidywigits}. These configuration files (under inst/config) specify the schemas, types, patterns and field descriptions for the raw input files and the tidy output tables. See ?Config.

raw

Let’s look at some of the information for the raw PURPLE config, for instance:

tool <- "purple"
toolu <- toupper(tool)
conf <- Config$new(tool)
conf

## #--- Config purple ---#
## 
## |var   |value  |
## |:-----|:------|
## |tool  |purple |
## |nraw  |10     |
## |ntidy |10     |

You can access the individual fields and functions of the object in the classic R list-like manner, using the $ operator.

Patterns are used to fish out the relevant files from a directory listing.

conf$get_raw_patterns() |>
  knitr::kable(caption = glue("{toolu} raw file patterns."))

PURPLE raw file patterns.
name	value
cnvgenetsv	.purple.cnv.gene.tsv$
cnvsomtsv	.purple.cnv.somatic.tsv$
drivercatalog	.purple.driver.catalog.(germline|somatic).tsv$
germdeltsv	.purple.germline.deletion.tsv$
purityrange	.purple.purity.range.tsv$
puritytsv	.purple.purity.tsv$
qc	.purple.qc$
somclonality	.purple.somatic.clonality.tsv$
somhist	.purple.somatic.hist.tsv$
version	^purple.version$
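As a rough illustration of how such patterns can pick files out of a directory listing, the sketch below applies a few of the PURPLE patterns (in their escaped regex form) with base R's grep. The file names in the listing are made up, and this is not necessarily how {tidywigits} performs the matching internally.

```r
# A few PURPLE patterns in escaped regex form.
patterns <- c(
  puritytsv   = "\\.purple\\.purity\\.tsv$",
  purityrange = "\\.purple\\.purity\\.range\\.tsv$",
  qc          = "\\.purple\\.qc$",
  version     = "^purple\\.version$"
)

# Hypothetical directory listing (file names are made up for this sketch).
listing <- c(
  "sample1.purple.purity.tsv",
  "sample1.purple.purity.range.tsv",
  "sample1.purple.qc",
  "purple.version",
  "sample1.linx.breakend.tsv"
)

# For each parser name, keep the file names that match its pattern.
matched <- lapply(patterns, function(p) grep(p, listing, value = TRUE))
matched$puritytsv
matched$version
```

The anchors matter here: the trailing $ keeps the purity pattern from also matching the purity range file, and the leading ^ restricts the version pattern to files without a sample prefix.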

File descriptions are based on the available Hartwig documentation.

conf$get_raw_descriptions() |>
  knitr::kable(caption = glue("{toolu} raw file descriptions."))

PURPLE raw file descriptions.
name	value
cnvgenetsv	Copy number alterations of each gene in the HMF gene panel.
cnvsomtsv	Copy number profile of all (contiguous) segments of the tumor sample.
drivercatalog	Significant amplifications and deletions that occur in the HMF gene panel.
germdeltsv	Germline deletions.
purityrange	Best fit per purity sorted by score.
puritytsv	Purity/ploidy fit summary.
qc	QC metrics.
somclonality	Clonality peak model data.
somhist	Somatic variant histogram data.
version	Version of the tool.

Versions are used to distinguish changes in schema between individual tool versions. For example, after LINX v1.25, several columns were dropped from the breakends table, which is reflected in the available LINX schemas. For now, latest is the default version, based on the most recent schema tests; any schema that differs from it is labelled with the version of the tool that generated the file.

conf$get_raw_versions() |>
  knitr::kable(caption = glue("{toolu} raw file versions."))

PURPLE raw file versions.
name	value
cnvgenetsv	latest
cnvsomtsv	latest
drivercatalog	latest
germdeltsv	latest
purityrange	latest
puritytsv	latest
qc	latest
somclonality	latest
somhist	latest
version	latest
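One way to think about schema versions is as a lookup from a file's header to the version whose field list matches it. The sketch below is purely illustrative: the schemas, column names and the match_schema_version helper are hypothetical, and the package may resolve versions differently.

```r
# Hypothetical schemas for one file type: two versions with different columns.
schemas <- list(
  latest = c("id", "gene", "type"),
  v1     = c("id", "gene", "type", "legacy_col")
)

# Return the name of the first schema whose fields equal the header exactly,
# or NA if no schema matches.
match_schema_version <- function(header, schemas) {
  hit <- vapply(schemas, function(fields) identical(fields, header), logical(1))
  if (any(hit)) names(schemas)[which(hit)[1]] else NA_character_
}

match_schema_version(c("id", "gene", "type"), schemas)
match_schema_version(c("id", "gene", "type", "legacy_col"), schemas)
match_schema_version(c("id", "other"), schemas)
```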

The raw schemas specify the column name and type (e.g. character (c), integer (i), float/double (d)) for each input file. Below we list all the schemas and then expand only the one for puritytsv:

(s <- conf$get_raw_schemas_all())

## # A tibble: 10 × 3
##    name          version schema           
##    <chr>         <chr>   <list>           
##  1 cnvgenetsv    latest  <tibble [18 × 2]>
##  2 cnvsomtsv     latest  <tibble [16 × 2]>
##  3 drivercatalog latest  <tibble [17 × 2]>
##  4 germdeltsv    latest  <tibble [16 × 2]>
##  5 purityrange   latest  <tibble [6 × 2]> 
##  6 puritytsv     latest  <tibble [25 × 2]>
##  7 qc            latest  <tibble [2 × 2]> 
##  8 somclonality  latest  <tibble [6 × 2]> 
##  9 somhist       latest  <tibble [3 × 2]> 
## 10 version       latest  <tibble [2 × 2]>

s |>
  dplyr::filter(name == "puritytsv") |>
  dplyr::select("schema") |>
  tidyr::unnest("schema")

## # A tibble: 25 × 2
##    field                type 
##    <chr>                <chr>
##  1 purity               d    
##  2 normFactor           d    
##  3 score                d    
##  4 diploidProportion    d    
##  5 ploidy               d    
##  6 gender               c    
##  7 status               c    
##  8 polyclonalProportion d    
##  9 minPurity            d    
## 10 maxPurity            d    
## # ℹ 15 more rows
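To show how a field/type schema like this can drive parsing, the sketch below maps the one-letter type codes to column classes and reads a small tab-separated snippet. It uses base R's read.delim as a dependency-free stand-in for readr::read_delim, and the three-column input is a shortened, made-up example rather than a real PURPLE file.

```r
# A shortened schema in the same field/type layout as above.
schema <- data.frame(
  field = c("purity", "ploidy", "gender"),
  type  = c("d", "d", "c"),
  stringsAsFactors = FALSE
)

# Map the one-letter codes to base R column classes.
type_map <- c(c = "character", i = "integer", d = "numeric")
col_classes <- setNames(unname(type_map[schema$type]), schema$field)

# Made-up tab-separated input with a header row.
raw_txt <- "purity\tploidy\tgender\n0.87\t4.05\tMALE\n"

# Parse with the column classes dictated by the schema.
raw <- read.delim(text = raw_txt, colClasses = col_classes)
str(raw)
```

Declaring the types up front avoids type guessing, so a column such as gender stays character even if a file happens to contain only values that look numeric or logical.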

tidy

Now let’s look at some of the information in the tidy PURPLE config. The difference between raw and tidy configs is mostly in the column names (they are standardised to lowercase separated by underscores, i.e. snake_case), and some raw files get split into multiple tidy tables (e.g. for normalisation purposes).
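The renaming from raw to tidy column names can be approximated with a single regular expression. The to_snake helper below is a hypothetical sketch and not the package's own implementation; it only handles simple camelCase names like the ones in the purity schema.

```r
# Hypothetical helper: insert "_" between a lowercase letter or digit and an
# uppercase letter, then lowercase the whole name.
to_snake <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

raw_names <- c("normFactor", "diploidProportion", "msIndelsPerMb", "tmbPerMb")
to_snake(raw_names)
# "norm_factor" "diploid_proportion" "ms_indels_per_mb" "tmb_per_mb"
```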

Tidy descriptions are the same as the raw descriptions for now.

conf$get_tidy_descriptions() |>
  knitr::kable(caption = glue("{toolu} tidy file descriptions."))

PURPLE tidy file descriptions.
name	value
cnvgenetsv	Copy number alterations of each gene in the HMF gene panel.
cnvsomtsv	Copy number profile of all (contiguous) segments of the tumor sample.
drivercatalog	Significant amplifications and deletions that occur in the HMF gene panel.
germdeltsv	Germline deletions.
purityrange	Best fit per purity sorted by score.
puritytsv	Purity/ploidy fit summary.
qc	QC metrics.
somclonality	Clonality peak model data.
somhist	Somatic variant histogram data.
version	Version of the tool.

(s <- conf$get_tidy_schemas_all())

## # A tibble: 10 × 4
##    name          version tbl   schema           
##    <chr>         <chr>   <chr> <list>           
##  1 cnvgenetsv    latest  tbl1  <tibble [18 × 3]>
##  2 cnvsomtsv     latest  tbl1  <tibble [16 × 3]>
##  3 drivercatalog latest  tbl1  <tibble [17 × 3]>
##  4 germdeltsv    latest  tbl1  <tibble [16 × 3]>
##  5 purityrange   latest  tbl1  <tibble [6 × 3]> 
##  6 puritytsv     latest  tbl1  <tibble [25 × 3]>
##  7 qc            latest  tbl1  <tibble [12 × 3]>
##  8 somclonality  latest  tbl1  <tibble [6 × 3]> 
##  9 somhist       latest  tbl1  <tibble [3 × 3]> 
## 10 version       latest  tbl1  <tibble [2 × 3]>

s |>
  dplyr::filter(.data$name == "puritytsv") |>
  dplyr::select("schema") |>
  tidyr::unnest("schema")

## # A tibble: 25 × 3
##    field                 type  description                                                          
##    <chr>                 <chr> <chr>                                                                
##  1 purity                d     purity of tumor in the sample                                        
##  2 norm_factor           d     internal factor to convert tumor ratio to cn.                        
##  3 score                 d     score of fit (lower is better)                                       
##  4 diploid_proportion    d     proportion of cn regions that have 1 (+- 0.2) minor and major allele 
##  5 ploidy                d     average ploidy of the tumor sample after adjusting for purity        
##  6 gender                c     one of male or female                                                
##  7 status                c     either pass or one or more warning or fail status                    
##  8 polyclonal_proportion d     proportion of copy number regions that are more than 0.25 from a who…
##  9 min_purity            d     minimum purity with score within 10% of best                         
## 10 max_purity            d     maximum purity with score within 10% of best                         
## # ℹ 15 more rows

`Tool`

Tool is the main organisation class for all file parsers and tidiers. It contains functions for parsing and tidying typical CSV/TSV files (with column names), and TXT files where the column names are missing. Currently it utilises the very simple readr::read_delim function from the {readr} package that reads all the data into memory. See ?Tool.

These simple parsers cover 80-90% of cases. If needed, the parsing can be optimised in the future with faster packages such as {data.table}, {duckdb-r}/{duckplyr} or {r-polars}.

We can have different Tool child classes that inherit (or override) functions and fields from the Tool parent class. For example, we can initialise a Purple object, the Tool child class for PURPLE, as follows:

ppl_path <- system.file("extdata/oa/purple", package = "tidywigits")
ppl <- Purple$new(path = ppl_path)
# each class comes with a print function
ppl

## #--- Tool purple ---#
## 
## |var    |value                                                        |
## |:------|:------------------------------------------------------------|
## |name   |purple                                                       |
## |path   |/home/runner/work/_temp/Library/tidywigits/extdata/oa/purple |
## |files  |11                                                           |
## |tidied |FALSE                                                        |

Its Config object is also constructed based on the name supplied - this is used internally to find files of interest and infer their schemas:

ppl$config

## #--- Config purple ---#
## 
## |var   |value  |
## |:-----|:------|
## |tool  |purple |
## |nraw  |10     |
## |ntidy |10     |

ppl$config$get_raw_patterns()

## # A tibble: 10 × 2
##    name          value                                                     
##    <chr>         <chr>                                                     
##  1 cnvgenetsv    "\\.purple\\.cnv\\.gene\\.tsv$"                           
##  2 cnvsomtsv     "\\.purple\\.cnv\\.somatic\\.tsv$"                        
##  3 drivercatalog "\\.purple\\.driver\\.catalog\\.(germline|somatic)\\.tsv$"
##  4 germdeltsv    "\\.purple\\.germline\\.deletion\\.tsv$"                  
##  5 purityrange   "\\.purple\\.purity\\.range\\.tsv$"                       
##  6 puritytsv     "\\.purple\\.purity\\.tsv$"                               
##  7 qc            "\\.purple\\.qc$"                                         
##  8 somclonality  "\\.purple\\.somatic\\.clonality\\.tsv$"                  
##  9 somhist       "\\.purple\\.somatic\\.hist\\.tsv$"                       
## 10 version       "^purple\\.version$"

ppl$config$get_raw_schema("puritytsv")

## # A tibble: 25 × 2
##    field                type 
##    <chr>                <chr>
##  1 purity               d    
##  2 normFactor           d    
##  3 score                d    
##  4 diploidProportion    d    
##  5 ploidy               d    
##  6 gender               c    
##  7 status               c    
##  8 polyclonalProportion d    
##  9 minPurity            d    
## 10 maxPurity            d    
## # ℹ 15 more rows

ppl$config$get_tidy_schema("puritytsv")

## # A tibble: 25 × 3
##    field                 type  description                                                          
##    <chr>                 <chr> <chr>                                                                
##  1 purity                d     purity of tumor in the sample                                        
##  2 norm_factor           d     internal factor to convert tumor ratio to cn.                        
##  3 score                 d     score of fit (lower is better)                                       
##  4 diploid_proportion    d     proportion of cn regions that have 1 (+- 0.2) minor and major allele 
##  5 ploidy                d     average ploidy of the tumor sample after adjusting for purity        
##  6 gender                c     one of male or female                                                
##  7 status                c     either pass or one or more warning or fail status                    
##  8 polyclonal_proportion d     proportion of copy number regions that are more than 0.25 from a who…
##  9 min_purity            d     minimum purity with score within 10% of best                         
## 10 max_purity            d     maximum purity with score within 10% of best                         
## # ℹ 15 more rows

We can list files that can be parsed with list_files():

(lf <- ppl$list_files())

## # A tibble: 11 × 10
##    tool_parser          parser bname    size lastmodified        path  pattern prefix schema   group
##    <glue>               <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue> <list>   <glu>
##  1 purple_version       versi… purp…      39 2025-08-19 05:27:46 /hom… "^purp… versi… <tibble>      
##  2 purple_cnvgenetsv    cnvge… samp…   1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  3 purple_cnvsomtsv     cnvso… samp…   1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  4 purple_drivercatalog drive… samp…     819 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  5 purple_drivercatalog drive… samp…     468 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  6 purple_germdeltsv    germd… samp…   1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  7 purple_purityrange   purit… samp…     484 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  8 purple_puritytsv     purit… samp…     462 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  9 purple_qc            qc     samp…     228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
## 10 purple_somclonality  somcl… samp…     451 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
## 11 purple_somhist       somhi… samp…     138 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>

lf |> dplyr::slice(1) |> str()

## tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
##  $ tool_parser : 'glue' chr "purple_version"
##  $ parser      : chr "version"
##  $ bname       : chr "purple.version"
##  $ size        : 'fs_bytes' num 39
##  $ lastmodified: POSIXct[1:1], format: "2025-08-19 05:27:46"
##  $ path        : chr "/home/runner/work/_temp/Library/tidywigits/extdata/oa/purple/purple.version"
##  $ pattern     : chr "^purple\\.version$"
##  $ prefix      : 'glue' chr "version"
##  $ schema      :List of 1
##   ..$ : tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
##   .. ..$ field: chr [1:2] "variable" "value"
##   .. ..$ type : chr [1:2] "c" "c"
##  $ group       : 'glue' chr ""

We can parse and tidy files of interest using the tidy function. Because R6 objects are modified in place, the function is called on the object for its side effect and its result does not need to be assigned:

# this will create a new field tbls containing the tidy data (and optionally
# the 'raw' parsed data)
ppl$tidy(tidy = TRUE, keep_raw = TRUE)
ppl$tbls

## # A tibble: 11 × 11
##    tool_parser parser bname    size lastmodified        path  pattern prefix group raw      tidy    
##    <glue>      <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue> <glu> <list>   <list>  
##  1 purple_ver… versi… purp…      39 2025-08-19 05:27:46 /hom… "^purp… versi…       <tibble> <tibble>
##  2 purple_cnv… cnvge… samp…   1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
##  3 purple_cnv… cnvso… samp…   1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
##  4 purple_dri… drive… samp…     819 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
##  5 purple_dri… drive… samp…     468 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
##  6 purple_ger… germd… samp…   1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
##  7 purple_pur… purit… samp…     484 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
##  8 purple_pur… purit… samp…     462 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
##  9 purple_qc   qc     samp…     228 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
## 10 purple_som… somcl… samp…     451 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
## 11 purple_som… somhi… samp…     138 2025-08-19 05:27:46 /hom… "\\.pu… sampl…       <tibble> <tibble>
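The reason tidy() does not need an assignment is that R6 objects have reference semantics: a method can update the object's own fields. The MiniTool class below is a hypothetical stand-in used only to demonstrate that behaviour; it does not reproduce the real Tool class.

```r
library(R6)

# Hypothetical stand-in for a Tool; not the package's implementation.
MiniTool <- R6Class(
  "MiniTool",
  public = list(
    tbls = NULL,
    tidied = FALSE,
    tidy = function() {
      # update the object's own fields in place ...
      self$tbls <- data.frame(purity = 0.87)
      self$tidied <- TRUE
      # ... and return the object invisibly so that calls can be chained
      invisible(self)
    }
  )
)

mt <- MiniTool$new()
before <- mt$tidied   # FALSE
mt$tidy()             # called for its side effect, nothing is assigned
after <- mt$tidied    # TRUE
```

A consequence worth keeping in mind is that plain assignment (alias <- mt) does not copy the object; both names then point to the same instance.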

ppl$tbls$raw[[8]] |> dplyr::glimpse()

## Rows: 1
## Columns: 25
## $ purity                  <dbl> 0.87
## $ normFactor              <dbl> 0.5283
## $ score                   <dbl> 0.841
## $ diploidProportion       <dbl> 0.0036
## $ ploidy                  <dbl> 4.05
## $ gender                  <chr> "MALE"
## $ status                  <chr> "NORMAL"
## $ polyclonalProportion    <dbl> 0.1501
## $ minPurity               <dbl> 0.79
## $ maxPurity               <dbl> 0.96
## $ minPloidy               <dbl> 3.95
## $ maxPloidy               <dbl> 4.2
## $ minDiploidProportion    <dbl> 9e-04
## $ maxDiploidProportion    <dbl> 0.004
## $ somaticPenalty          <dbl> 0
## $ wholeGenomeDuplication  <chr> "true"
## $ msIndelsPerMb           <dbl> 0.0238
## $ msStatus                <chr> "MSS"
## $ tml                     <dbl> 29
## $ tmlStatus               <chr> "LOW"
## $ tmbPerMb                <dbl> 1.1161
## $ tmbStatus               <chr> "LOW"
## $ svTumorMutationalBurden <dbl> 64
## $ runMode                 <chr> "TUMOR_GERMLINE"
## $ targeted                <chr> "false"

# the tidy tibbles are nested to allow for more than one tidy tibble per file
ppl$tbls$tidy[[8]][["data"]][[1]] |> dplyr::glimpse()

## Rows: 1
## Columns: 25
## $ purity                     <dbl> 0.87
## $ norm_factor                <dbl> 0.5283
## $ score                      <dbl> 0.841
## $ diploid_proportion         <dbl> 0.0036
## $ ploidy                     <dbl> 4.05
## $ gender                     <chr> "MALE"
## $ status                     <chr> "NORMAL"
## $ polyclonal_proportion      <dbl> 0.1501
## $ min_purity                 <dbl> 0.79
## $ max_purity                 <dbl> 0.96
## $ min_ploidy                 <dbl> 3.95
## $ max_ploidy                 <dbl> 4.2
## $ min_diploid_proportion     <dbl> 9e-04
## $ max_diploid_proportion     <dbl> 0.004
## $ somatic_penalty            <dbl> 0
## $ whole_genome_duplication   <chr> "true"
## $ ms_indels_per_mb           <dbl> 0.0238
## $ ms_status                  <chr> "MSS"
## $ tml                        <dbl> 29
## $ tml_status                 <chr> "LOW"
## $ tmb_per_mb                 <dbl> 1.1161
## $ tmb_status                 <chr> "LOW"
## $ sv_tumor_mutational_burden <dbl> 64
## $ run_mode                   <chr> "TUMOR_GERMLINE"
## $ targeted                   <chr> "false"

We can also focus on a subset of files to tidy using the filter_files() function. The include and exclude arguments can specify which parsers to include or exclude in the analysis:

# create new Purple object
ppl2 <- Purple$new(path = ppl_path)
ppl2$files

## # A tibble: 11 × 10
##    tool_parser          parser bname    size lastmodified        path  pattern prefix schema   group
##    <glue>               <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue> <list>   <glu>
##  1 purple_version       versi… purp…      39 2025-08-19 05:27:46 /hom… "^purp… versi… <tibble>      
##  2 purple_cnvgenetsv    cnvge… samp…   1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  3 purple_cnvsomtsv     cnvso… samp…   1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  4 purple_drivercatalog drive… samp…     819 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  5 purple_drivercatalog drive… samp…     468 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  6 purple_germdeltsv    germd… samp…   1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  7 purple_purityrange   purit… samp…     484 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  8 purple_puritytsv     purit… samp…     462 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  9 purple_qc            qc     samp…     228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
## 10 purple_somclonality  somcl… samp…     451 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
## 11 purple_somhist       somhi… samp…     138 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>

ppl2$filter_files(include = c("purple_qc", "purple_cnvsomtsv"))
ppl2$files

## # A tibble: 2 × 10
##   tool_parser      parser    bname      size lastmodified        path  pattern prefix schema   group
##   <glue>           <chr>     <chr>   <fs::b> <dttm>              <chr> <chr>   <glue> <list>   <glu>
## 1 purple_cnvsomtsv cnvsomtsv sample…   1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
## 2 purple_qc        qc        sample…     228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
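The include/exclude logic can be pictured as a simple row filter on the tool_parser column. The filter_parsers helper below is a hypothetical base R sketch of that idea, not the method's actual source.

```r
# Minimal stand-in for the files table: only the column used for filtering.
files <- data.frame(
  tool_parser = c("purple_qc", "purple_cnvsomtsv", "purple_puritytsv", "purple_version"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows listed in `include` (if given) and drop rows
# listed in `exclude` (if given).
filter_parsers <- function(files, include = NULL, exclude = NULL) {
  keep <- rep(TRUE, nrow(files))
  if (!is.null(include)) keep <- keep & files$tool_parser %in% include
  if (!is.null(exclude)) keep <- keep & !(files$tool_parser %in% exclude)
  files[keep, , drop = FALSE]
}

inc <- filter_parsers(files, include = c("purple_qc", "purple_cnvsomtsv"))
exc <- filter_parsers(files, exclude = "purple_version")
inc$tool_parser
exc$tool_parser
```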

After tidying the data of interest, we can write the tidy tibbles to various formats, like Apache Parquet, PostgreSQL, CSV/TSV and R’s RDS. Below we can see that the id specified is added to the written files in an additional nemo_id column. This can be used e.g. to distinguish results from different runs in a data pipeline. When writing to a database like PostgreSQL, another column nemo_pfix is used to distinguish results from the same run from the same tool.

ppl2$tidy() # first need to tidy
outdir1 <- tempdir()
fmt <- "csv"
ppl2$write(odir = outdir1, format = fmt, id = "run123")
wfiles <- fs::dir_info(outdir1) |> dplyr::select(1:5)
wfiles |>
  dplyr::mutate(bname = basename(.data$path)) |>
  dplyr::select("bname", "size", "type")

## # A tibble: 17 × 3
##    bname                                  size type 
##    <chr>                           <fs::bytes> <fct>
##  1 file2858145e191c                      4.71K file 
##  2 file285817c13815                      4.71K file 
##  3 file28582a9bdf63                      4.71K file 
##  4 file28583357cd4e                      4.71K file 
##  5 file28583cb08685                      4.71K file 
##  6 file2858440e0950                      4.71K file 
##  7 file28584955028d                      4.71K file 
##  8 file2858499242e7                      4.71K file 
##  9 file28587390064f                      4.71K file 
## 10 file285874d245ca                      4.71K file 
## 11 file285876d1c889                      4.71K file 
## 12 file28587cb3bc21                      4.71K file 
## 13 file28587d382e33                      4.71K file 
## 14 file28587fcc52a6                      4.71K file 
## 15 rmarkdown-str285844bb6b9.html         1.13K file 
## 16 sample1_purple_cnvsomtsv.csv.gz         631 file 
## 17 sample1_purple_qc.csv.gz                186 file

# readr::read_csv(wfiles$path[1], show_col_types = F) # see bug #137
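To illustrate what the additional nemo_id column looks like in a written file, the sketch below prepends an id column to a small made-up table, writes it as a gzipped CSV with base R, and reads it back. The file name and values are assumptions for the example, and base R's write.csv is used here in place of the package's own writer.

```r
# Made-up tidy table with two columns.
d <- data.frame(purity = 0.87, ploidy = 4.05)

# Prepend the run identifier as the first column.
out <- cbind(nemo_id = "run123", d)

# Write a gzipped CSV into a temporary directory.
path <- file.path(tempdir(), "example_purple_puritytsv.csv.gz")
con <- gzfile(path, "w")
write.csv(out, con, row.names = FALSE)
close(con)

# read.csv decompresses gzipped files transparently.
back <- read.csv(path)
names(back)
```

Because every row carries the identifier, tables written from different runs can later be stacked and still be told apart.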

The nemofy function is a convenient wrapper for the process of filtering, tidying, and writing.

ppl3 <- Purple$new(path = ppl_path)
outdir2 <- file.path(tempdir(), "ppl3") |> fs::dir_create()
ppl3$files

## # A tibble: 11 × 10
##    tool_parser          parser bname    size lastmodified        path  pattern prefix schema   group
##    <glue>               <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue> <list>   <glu>
##  1 purple_version       versi… purp…      39 2025-08-19 05:27:46 /hom… "^purp… versi… <tibble>      
##  2 purple_cnvgenetsv    cnvge… samp…   1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  3 purple_cnvsomtsv     cnvso… samp…   1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  4 purple_drivercatalog drive… samp…     819 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  5 purple_drivercatalog drive… samp…     468 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  6 purple_germdeltsv    germd… samp…   1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  7 purple_purityrange   purit… samp…     484 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  8 purple_puritytsv     purit… samp…     462 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
##  9 purple_qc            qc     samp…     228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
## 10 purple_somclonality  somcl… samp…     451 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>      
## 11 purple_somhist       somhi… samp…     138 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>

ppl3$nemofy(
  odir = outdir2,
  format = "tsv",
  id = "run_ppl3",
  exclude = c("purple_cnvgenetsv", "purple_cnvsomtsv", "purple_drivercatalog", "purple_germdeltsv")
)
wfiles2 <- fs::dir_info(outdir2) |>
  dplyr::mutate(bname = basename(.data$path))
wfiles2 |>
  dplyr::select("bname", "size", "type")

## # A tibble: 6 × 3
##   bname                                     size type 
##   <chr>                              <fs::bytes> <fct>
## 1 sample1_purple_purityrange.tsv.gz          204 file 
## 2 sample1_purple_puritytsv.tsv.gz            303 file 
## 3 sample1_purple_qc.tsv.gz                   186 file 
## 4 sample1_purple_somclonality.tsv.gz         154 file 
## 5 sample1_purple_somhist.tsv.gz              108 file 
## 6 version_purple_version.tsv.gz               77 file

readr::read_tsv(wfiles2$path[2], show_col_types = FALSE)

## # A tibble: 1 × 26
##   nemo_id  purity norm_factor score diploid_proportion ploidy gender status polyclonal_proportion
##   <chr>     <dbl>       <dbl> <dbl>              <dbl>  <dbl> <chr>  <chr>                  <dbl>
## 1 run_ppl3   0.87       0.528 0.841             0.0036   4.05 MALE   NORMAL                 0.150
## # ℹ 17 more variables: min_purity <dbl>, max_purity <dbl>, min_ploidy <dbl>, max_ploidy <dbl>,
## #   min_diploid_proportion <dbl>, max_diploid_proportion <dbl>, somatic_penalty <dbl>,
## #   whole_genome_duplication <lgl>, ms_indels_per_mb <dbl>, ms_status <chr>, tml <dbl>,
## #   tml_status <chr>, tmb_per_mb <dbl>, tmb_status <chr>, sv_tumor_mutational_burden <dbl>,
## #   run_mode <chr>, targeted <lgl>

`Workflow`

A Workflow consists of a list of one or more Tools. Combining different Tools in a Workflow allows parsing and writing tidy tables from a variety of bioinformatic tools in one go. See ?Workflow.
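The "list of Tools" idea can be sketched with two small R6 classes, where the workflow simply delegates each call to every tool it holds. DemoTool and DemoWorkflow are hypothetical names for this illustration and do not mirror the real class definitions.

```r
library(R6)

# Hypothetical stand-ins; names and behaviour are illustrative only.
DemoTool <- R6Class(
  "DemoTool",
  public = list(
    name = NULL,
    tidied = FALSE,
    initialize = function(name) {
      self$name <- name
    },
    tidy = function() {
      self$tidied <- TRUE
      invisible(self)
    }
  )
)

DemoWorkflow <- R6Class(
  "DemoWorkflow",
  public = list(
    tools = NULL,
    initialize = function(tools) {
      self$tools <- tools
    },
    # delegate to every tool in the list
    tidy = function() {
      for (tool in self$tools) tool$tidy()
      invisible(self)
    }
  )
)

wf <- DemoWorkflow$new(list(DemoTool$new("purple"), DemoTool$new("linx")))
wf$tidy()
tidied <- vapply(wf$tools, function(tool) tool$tidied, logical(1))
tidied   # TRUE TRUE
```

Since the tools are reference objects, the loop updates the same instances that are stored in the workflow, so no results need to be collected and reassigned.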

The Oncoanalyser Nextflow pipeline uses several tools from WiGiTS, and we can construct an Oncoanalyser class as a Workflow child based on a suite of Tools under the tidywigits R package. Similarly to Tool, a Workflow object contains functions such as filter_files, list_files, tidy, write and nemofy:

oa <- system.file("extdata/oa/purple", package = "tidywigits") |>
  Oncoanalyser$new()
outdir3 <- file.path(tempdir(), "oa") |> fs::dir_create()
oa$list_files()

## # A tibble: 11 × 10
##    parser        bname      size lastmodified        path  pattern tool_parser prefix schema   group
##    <chr>         <chr>   <fs::b> <dttm>              <chr> <chr>   <glue>      <glue> <list>   <glu>
##  1 version       purple…      39 2025-08-19 05:27:46 /hom… "^purp… purple_ver… versi… <tibble>      
##  2 cnvgenetsv    sample…   1.44K 2025-08-19 05:27:46 /hom… "\\.pu… purple_cnv… sampl… <tibble>      
##  3 cnvsomtsv     sample…   1.32K 2025-08-19 05:27:46 /hom… "\\.pu… purple_cnv… sampl… <tibble>      
##  4 drivercatalog sample…     819 2025-08-19 05:27:46 /hom… "\\.pu… purple_dri… sampl… <tibble>      
##  5 drivercatalog sample…     468 2025-08-19 05:27:46 /hom… "\\.pu… purple_dri… sampl… <tibble>      
##  6 germdeltsv    sample…   1.24K 2025-08-19 05:27:46 /hom… "\\.pu… purple_ger… sampl… <tibble>      
##  7 purityrange   sample…     484 2025-08-19 05:27:46 /hom… "\\.pu… purple_pur… sampl… <tibble>      
##  8 puritytsv     sample…     462 2025-08-19 05:27:46 /hom… "\\.pu… purple_pur… sampl… <tibble>      
##  9 qc            sample…     228 2025-08-19 05:27:46 /hom… "\\.pu… purple_qc   sampl… <tibble>      
## 10 somclonality  sample…     451 2025-08-19 05:27:46 /hom… "\\.pu… purple_som… sampl… <tibble>      
## 11 somhist       sample…     138 2025-08-19 05:27:46 /hom… "\\.pu… purple_som… sampl… <tibble>

x <- oa$nemofy(
  odir = outdir3,
  format = "tsv",
  id = "oa_run1",
  exclude = c("cobalt_ratiotsv", "amber_baftsv", "isofox_altsj", "isofox_transdata")
)
wfiles3 <- fs::dir_info(outdir3) |>
  dplyr::select(1:5) |>
  dplyr::mutate(bname = basename(.data$path))
wfiles3 |>
  dplyr::select("bname", "size", "type")

## # A tibble: 11 × 3
##    bname                                               size type 
##    <chr>                                        <fs::bytes> <fct>
##  1 sample1_germline_purple_drivercatalog.tsv.gz         363 file 
##  2 sample1_purple_cnvgenetsv.tsv.gz                     444 file 
##  3 sample1_purple_cnvsomtsv.tsv.gz                      633 file 
##  4 sample1_purple_drivercatalog.tsv.gz                  293 file 
##  5 sample1_purple_germdeltsv.tsv.gz                     479 file 
##  6 sample1_purple_purityrange.tsv.gz                    202 file 
##  7 sample1_purple_puritytsv.tsv.gz                      303 file 
##  8 sample1_purple_qc.tsv.gz                             185 file 
##  9 sample1_purple_somclonality.tsv.gz                   153 file 
## 10 sample1_purple_somhist.tsv.gz                        107 file 
## 11 version_purple_version.tsv.gz                         76 file

readr::read_tsv(wfiles3$path[5], show_col_types = FALSE)

## # A tibble: 10 × 17
##    nemo_id gene     chromosome chromosome_band region_start region_end depth_window_count exon_start
##    <chr>   <chr>    <chr>      <chr>                  <dbl>      <dbl>              <dbl>      <dbl>
##  1 oa_run1 ZYG11B   chr1       p32.3               52797001   52798000                  1          8
##  2 oa_run1 RPL31P12 chr1       p31.1               72300641   72346158                 45          1
##  3 oa_run1 AKNAD1   chr1       p13.3              108824397  108829338                  5         11
##  4 oa_run1 SLC16A1… chr1       p13.2              113006001  113012000                  6          4
##  5 oa_run1 LRIG2-DT chr1       p13.2              113006001  113012000                  6          3
##  6 oa_run1 GABPB2   chr1       q21.3              151073001  151117000                 44          2
##  7 oa_run1 RPS29P29 chr1       q21.3              151073001  151117000                 44          1
##  8 oa_run1 LCE1E    chr1       q21.3              152787001  152798000                 10          2
##  9 oa_run1 LCE1D    chr1       q21.3              152787001  152798000                 10          1
## 10 oa_run1 PTPRVP   chr1       q32.1              202171001  202173000                  1          6
## # ℹ 9 more variables: exon_end <dbl>, detection_method <chr>, germline_status <chr>,
## #   tumor_status <chr>, germline_cn <dbl>, tumor_cn <dbl>, filter <chr>, cohort_frequency <dbl>,
## #   reported <lgl>