Structure • tidywigits

{tidywigits} is an R package that parses and tidies outputs from the WiGiTS suite of genome and transcriptome analysis tools for cancer research and diagnostics, created by the Hartwig Medical Foundation.

In short, it traverses a directory containing results from one or more runs of WiGiTS tools, parses any files it recognises, tidies them up (data reshaping, normalisation, column name cleanup, etc.), and writes them to the output format of choice, e.g. Apache Parquet, PostgreSQL, or Delta table.

{tidywigits} is built on top of R’s R6 encapsulated object-oriented programming implementation, which helps with code organisation. It consists of several base classes like Config, Tool, and Workflow, which we describe below. Each R6 class can contain public and private members, which are either functions (methods) or non-functions (fields).
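To make the public/private distinction concrete, here is a minimal R6 sketch. The Counter class and all of its members are hypothetical and are not part of {tidywigits}; the example only assumes that the {R6} package is installed.

```r
library(R6)

# Hypothetical class for illustration only; not part of {tidywigits}.
Counter <- R6Class(
  "Counter",
  public = list(
    # public field, reachable as x$name
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    # public method: updates private state in place and returns the object
    add = function(n = 1) {
      private$total <- private$bump(n)
      invisible(self)
    },
    get_total = function() {
      private$total
    }
  ),
  private = list(
    # private field, not reachable as x$total
    total = 0,
    # private method, only callable from inside the class
    bump = function(n) {
      private$total + n
    }
  )
)

x <- Counter$new("demo")
x$add(2)$add(3) # modifies x in place, no assignment needed
x$get_total()
x$total # NULL, since private members are hidden from the outside
```

Because R6 objects have reference semantics, calling a method such as add changes the object itself, which is why methods can be chained and their results need not be assigned.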

`Config`

A Config object contains functionality for interacting with YAML configuration files that are part of {tidywigits}. These configuration files (under inst/config) specify the schemas, types, patterns and field descriptions for the raw input files and tidy output tbls. See ?Config.

raw

Let’s look at some of the information for the raw PURPLE config, for instance:

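library(tidywigits) # provides Config, Tool, Workflow and their child classes
library(glue) # glue() is used for the table captions below
tool <- "purple"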
toolu <- toupper(tool)
conf <- Config$new(tool)
conf
#> #--- Config purple ---#
#> 
#> |var   |value  |
#> |:-----|:------|
#> |tool  |purple |
#> |nraw  |10     |
#> |ntidy |10     |

You can access the individual fields in the classic R list-like manner, using the $ sign.

Patterns are regular expressions used to fish out the relevant files from a directory listing. Note that each \ needs to be doubled in R code, since the backslash is itself the escape character in R strings.

conf$get_raw_patterns() |>
  knitr::kable(caption = glue("{toolu} raw file patterns."))

PURPLE raw file patterns.
name	value
cnvgenetsv	\.purple\.cnv\.gene\.tsv$
cnvsomtsv	\.purple\.cnv\.somatic\.tsv$
drivercatalog	\.purple\.driver\.catalog\.(germline|somatic)\.tsv$
germdeltsv	\.purple\.germline\.deletion\.tsv$
purityrange	\.purple\.purity\.range\.tsv$
puritytsv	\.purple\.purity\.tsv$
qc	\.purple\.qc$
somclonality	\.purple\.somatic\.clonality\.tsv$
somhist	\.purple\.somatic\.hist\.tsv$
version	^purple\.version$
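The effect of escaping the dots can be checked with base R alone. In the sketch below the file names are illustrative (the last one is deliberately made up to show a false match), and grepl stands in for whatever matching function the package uses internally.

```r
# Illustrative file names; the last one is invented to show a false match
files <- c(
  "LXXXXXXX.purple.qc",
  "LXXXXXXX.purple.purity.tsv",
  "LXXXXXXX.purple.purity.range.tsv",
  "LXXXXXXX_purple_qc"
)

# "\\." in R source is the two-character regex \. (a literal dot)
escaped <- "\\.purple\\.qc$"
# an unescaped dot matches any single character
unescaped <- ".purple.qc$"

nchar("\\.") # 2: one backslash and one dot
matched_escaped <- files[grepl(escaped, files)]
matched_unescaped <- files[grepl(unescaped, files)]
matched_escaped
matched_unescaped
```

The escaped pattern picks up only the real QC file, while the unescaped one also accepts the underscore-separated name.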

File descriptions are based on the Hartwig documentation.

conf$get_raw_descriptions() |>
  knitr::kable(caption = glue("{toolu} raw file descriptions."))

PURPLE raw file descriptions.
name	value
cnvgenetsv	Copy number alterations of each gene in the HMF gene panel.
cnvsomtsv	Copy number profile of all (contiguous) segments of the tumor sample.
drivercatalog	Significant amplifications and deletions that occur in the HMF gene panel.
germdeltsv	Germline deletions.
purityrange	Best fit per purity sorted by score.
puritytsv	Purity/ploidy fit summary.
qc	QC metrics.
somclonality	Clonality peak model data.
somhist	Somatic variant histogram data.
version	Version of the tool.

Versions are used to distinguish changes in schema between individual tool versions. For example, after LINX v1.25, several columns were dropped from the breakends table, which is reflected in the available LINX schemas. For now, latest is the default version label and corresponds to the schema covered by the most recent schema tests; when we see a file with a different schema, that schema is labelled with the version of the tool that generated the file.

conf$get_raw_versions() |>
  knitr::kable(caption = glue("{toolu} raw file versions."))

PURPLE raw file versions.
name	value
cnvgenetsv	latest
cnvsomtsv	latest
drivercatalog	latest
germdeltsv	latest
purityrange	latest
puritytsv	latest
qc	latest
somclonality	latest
somhist	latest
version	latest
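One way a version label could be resolved is by comparing a file's header against the known schemas. The sketch below is an assumption about the general idea, not necessarily how {tidywigits} does it; the schemas, column names and the match_version helper are all hypothetical.

```r
# Hypothetical schemas for one file type, keyed by version label
schemas <- list(
  latest = c("id", "gene", "type"),
  v1.25 = c("id", "gene", "type", "legacy_flag")
)

# Return the label of the first schema whose fields equal the file header,
# or NA if no schema matches
match_version <- function(header, schemas) {
  hit <- vapply(schemas, function(fields) identical(fields, header), logical(1))
  if (!any(hit)) {
    return(NA_character_)
  }
  names(schemas)[[which(hit)[[1]]]]
}

match_version(c("id", "gene", "type"), schemas)
match_version(c("id", "gene", "type", "legacy_flag"), schemas)
match_version(c("id"), schemas)
```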

The raw schemas specify the column name and type (e.g. character (c), integer (i), float/double (d)) for each input file. Below we list the schemas for all files and then unnest the one for puritytsv:

(s <- conf$get_raw_schemas_all())
#> # A tibble: 10 × 3
#>    name          version schema           
#>    <chr>         <chr>   <list>           
#>  1 cnvgenetsv    latest  <tibble [18 × 2]>
#>  2 cnvsomtsv     latest  <tibble [16 × 2]>
#>  3 drivercatalog latest  <tibble [17 × 2]>
#>  4 germdeltsv    latest  <tibble [16 × 2]>
#>  5 purityrange   latest  <tibble [6 × 2]> 
#>  6 puritytsv     latest  <tibble [25 × 2]>
#>  7 qc            latest  <tibble [2 × 2]> 
#>  8 somclonality  latest  <tibble [6 × 2]> 
#>  9 somhist       latest  <tibble [3 × 2]> 
#> 10 version       latest  <tibble [2 × 2]>
s |>
  dplyr::filter(name == "puritytsv") |>
  dplyr::select("schema") |>
  tidyr::unnest("schema")
#> # A tibble: 25 × 2
#>    field                type 
#>    <chr>                <chr>
#>  1 purity               d    
#>  2 normFactor           d    
#>  3 score                d    
#>  4 diploidProportion    d    
#>  5 ploidy               d    
#>  6 gender               c    
#>  7 status               c    
#>  8 polyclonalProportion d    
#>  9 minPurity            d    
#> 10 maxPurity            d    
#> # ℹ 15 more rows
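To illustrate how a field/type schema like this can drive parsing, here is a base R sketch. The mapping from type codes to column classes and the two-line TSV are assumptions made for the example; the package itself parses with {readr}.

```r
# Schema in the same shape as above: one row per column, with a type code
schema <- data.frame(
  field = c("purity", "ploidy", "gender", "status"),
  type = c("d", "d", "c", "c")
)

# Assumed mapping from type codes to base R column classes
type_map <- c(c = "character", i = "integer", d = "numeric")
col_classes <- setNames(unname(type_map[schema$type]), schema$field)

# Small made-up TSV standing in for a raw file
tsv <- "purity\tploidy\tgender\tstatus\n0.87\t4.05\tMALE\tNORMAL\n"
raw <- read.delim(text = tsv, colClasses = col_classes)
str(raw)
```

Declaring the types up front avoids relying on type guessing, which matters for columns that happen to look numeric in one file and not in another.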

tidy

Now let’s look at some of the information in the tidy PURPLE config. The difference between raw and tidy configs is mostly in the column names (they are standardised to lowercase separated by underscores, i.e. snake_case), and some raw files get split into multiple tidy tables (e.g. for normalisation purposes).

Tidy descriptions are the same as the raw descriptions for now.

conf$get_tidy_descriptions() |>
  knitr::kable(caption = glue("{toolu} tidy file descriptions."))

PURPLE tidy file descriptions.
name	value
cnvgenetsv	Copy number alterations of each gene in the HMF gene panel.
cnvsomtsv	Copy number profile of all (contiguous) segments of the tumor sample.
drivercatalog	Significant amplifications and deletions that occur in the HMF gene panel.
germdeltsv	Germline deletions.
purityrange	Best fit per purity sorted by score.
puritytsv	Purity/ploidy fit summary.
qc	QC metrics.
somclonality	Clonality peak model data.
somhist	Somatic variant histogram data.
version	Version of the tool.

(s <- conf$get_tidy_schemas_all())
#> # A tibble: 10 × 4
#>    name          version tbl   schema           
#>    <chr>         <chr>   <chr> <list>           
#>  1 cnvgenetsv    latest  tbl1  <tibble [18 × 3]>
#>  2 cnvsomtsv     latest  tbl1  <tibble [16 × 3]>
#>  3 drivercatalog latest  tbl1  <tibble [17 × 3]>
#>  4 germdeltsv    latest  tbl1  <tibble [16 × 3]>
#>  5 purityrange   latest  tbl1  <tibble [6 × 3]> 
#>  6 puritytsv     latest  tbl1  <tibble [25 × 3]>
#>  7 qc            latest  tbl1  <tibble [12 × 3]>
#>  8 somclonality  latest  tbl1  <tibble [6 × 3]> 
#>  9 somhist       latest  tbl1  <tibble [3 × 3]> 
#> 10 version       latest  tbl1  <tibble [2 × 3]>
s |>
  dplyr::filter(.data$name == "puritytsv") |>
  dplyr::select("schema") |>
  tidyr::unnest("schema")
#> # A tibble: 25 × 3
#>    field                 type  description                                      
#>    <chr>                 <chr> <chr>                                            
#>  1 purity                d     purity of tumor in the sample                    
#>  2 norm_factor           d     internal factor to convert tumor ratio to cn.    
#>  3 score                 d     score of fit (lower is better)                   
#>  4 diploid_proportion    d     proportion of cn regions that have 1 (+- 0.2) mi…
#>  5 ploidy                d     average ploidy of the tumor sample after adjusti…
#>  6 gender                c     one of male or female                            
#>  7 status                c     either pass or one or more warning or fail status
#>  8 polyclonal_proportion d     proportion of copy number regions that are more …
#>  9 min_purity            d     minimum purity with score within 10% of best     
#> 10 max_purity            d     maximum purity with score within 10% of best     
#> # ℹ 15 more rows
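The camelCase to snake_case renaming seen in the purity fields can be approximated with a single regular expression. This is only a sketch: the to_snake helper is hypothetical, and the package takes its tidy names from the config files, so some of them differ from what a mechanical rule would give (for example, AmberGender becomes gender_amber in the tidy QC table).

```r
# Hypothetical helper: mechanical camelCase -> snake_case conversion
to_snake <- function(x) {
  # insert "_" between a lowercase letter or digit and the capital that follows
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

raw_fields <- c("purity", "normFactor", "diploidProportion", "polyclonalProportion", "minPurity")
tidy_fields <- to_snake(raw_fields)
tidy_fields
```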

`Tool`

Tool is the main organisation class for all file parsers and tidiers. It contains functions for parsing and tidying typical CSV/TSV files (with column names), and TXT files where the column names are missing. Currently it utilises the very simple readr::read_delim function from the {readr} package that reads all the data into memory. See ?Tool.

These simple parsers cover 80-90% of cases. If needed, parsing could be optimised in the future with faster packages such as {data.table}, {duckdb-r} or {neo-r-polars}.

Child classes of Tool inherit (or override) functions and fields from the Tool parent class. For example, Purple is the Tool child class for PURPLE, and we can initialise a Purple object as follows:

ppl_path <- system.file("extdata/purple", package = "tidywigits")
ppl <- Purple$new(path = ppl_path)
# each class comes with a print function
ppl
#> #--- Tool purple ---#
#> 
#> |var    |value                                                     |
#> |:------|:---------------------------------------------------------|
#> |name   |purple                                                    |
#> |path   |/home/runner/work/_temp/Library/tidywigits/extdata/purple |
#> |files  |11                                                        |
#> |tidied |FALSE                                                     |

Its Config object is also constructed based on the name supplied. It is used internally to find files of interest and infer their schemas:

ppl$config
#> #--- Config purple ---#
#> 
#> |var   |value  |
#> |:-----|:------|
#> |tool  |purple |
#> |nraw  |10     |
#> |ntidy |10     |
ppl$config$get_raw_patterns()
#> # A tibble: 10 × 2
#>    name          value                                                     
#>    <chr>         <chr>                                                     
#>  1 cnvgenetsv    "\\.purple\\.cnv\\.gene\\.tsv$"                           
#>  2 cnvsomtsv     "\\.purple\\.cnv\\.somatic\\.tsv$"                        
#>  3 drivercatalog "\\.purple\\.driver\\.catalog\\.(germline|somatic)\\.tsv$"
#>  4 germdeltsv    "\\.purple\\.germline\\.deletion\\.tsv$"                  
#>  5 purityrange   "\\.purple\\.purity\\.range\\.tsv$"                       
#>  6 puritytsv     "\\.purple\\.purity\\.tsv$"                               
#>  7 qc            "\\.purple\\.qc$"                                         
#>  8 somclonality  "\\.purple\\.somatic\\.clonality\\.tsv$"                  
#>  9 somhist       "\\.purple\\.somatic\\.hist\\.tsv$"                       
#> 10 version       "^purple\\.version$"
ppl$config$get_raw_schema("puritytsv")
#> # A tibble: 25 × 2
#>    field                type 
#>    <chr>                <chr>
#>  1 purity               d    
#>  2 normFactor           d    
#>  3 score                d    
#>  4 diploidProportion    d    
#>  5 ploidy               d    
#>  6 gender               c    
#>  7 status               c    
#>  8 polyclonalProportion d    
#>  9 minPurity            d    
#> 10 maxPurity            d    
#> # ℹ 15 more rows
ppl$config$get_tidy_schema("puritytsv")
#> # A tibble: 25 × 3
#>    field                 type  description                                      
#>    <chr>                 <chr> <chr>                                            
#>  1 purity                d     purity of tumor in the sample                    
#>  2 norm_factor           d     internal factor to convert tumor ratio to cn.    
#>  3 score                 d     score of fit (lower is better)                   
#>  4 diploid_proportion    d     proportion of cn regions that have 1 (+- 0.2) mi…
#>  5 ploidy                d     average ploidy of the tumor sample after adjusti…
#>  6 gender                c     one of male or female                            
#>  7 status                c     either pass or one or more warning or fail status
#>  8 polyclonal_proportion d     proportion of copy number regions that are more …
#>  9 min_purity            d     minimum purity with score within 10% of best     
#> 10 max_purity            d     maximum purity with score within 10% of best     
#> # ℹ 15 more rows

We can list files that can be parsed with list_files:

(lf <- ppl$list_files())
#> # A tibble: 11 × 10
#>    tool_parser     parser bname    size lastmodified        path  pattern prefix
#>    <glue>          <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue>
#>  1 purple_cnvgene… cnvge… LXXX…   1.44K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  2 purple_cnvsomt… cnvso… LXXX…   1.32K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  3 purple_driverc… drive… LXXX…     819 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  4 purple_driverc… drive… LXXX…     468 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  5 purple_germdel… germd… LXXX…   1.24K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  6 purple_purityr… purit… LXXX…     484 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  7 purple_purityt… purit… LXXX…     462 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  8 purple_qc       qc     LXXX…     228 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  9 purple_somclon… somcl… LXXX…     451 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist  somhi… LXXX…     138 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 11 purple_version  versi… purp…      39 2025-08-04 00:55:12 /hom… "^purp… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
lf |> dplyr::slice(1) |> str()
#> tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
#>  $ tool_parser : 'glue' chr "purple_cnvgenetsv"
#>  $ parser      : chr "cnvgenetsv"
#>  $ bname       : chr "LXXXXXXX.purple.cnv.gene.tsv"
#>  $ size        : 'fs_bytes' num 1.44K
#>  $ lastmodified: POSIXct[1:1], format: "2025-08-04 00:55:12"
#>  $ path        : chr "/home/runner/work/_temp/Library/tidywigits/extdata/purple/LXXXXXXX.purple.cnv.gene.tsv"
#>  $ pattern     : chr "\\.purple\\.cnv\\.gene\\.tsv$"
#>  $ prefix      : 'glue' chr "LXXXXXXX"
#>  $ schema      :List of 1
#>   ..$ : tibble [18 × 2] (S3: tbl_df/tbl/data.frame)
#>   .. ..$ field: chr [1:18] "chromosome" "start" "end" "gene" ...
#>   .. ..$ type : chr [1:18] "c" "d" "d" "c" ...
#>  $ group       : 'glue' chr ""

We can parse and tidy files of interest using the tidy function. Note that this function is called for its side effect on the object, so its result does not need to be assigned:

# this will create a new field tbls containing the tidy data (and optionally
# the 'raw' parsed data)
ppl$tidy(tidy = TRUE, keep_raw = TRUE)
ppl$tbls
#> # A tibble: 11 × 11
#>    tool_parser     parser bname    size lastmodified        path  pattern prefix
#>    <glue>          <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue>
#>  1 purple_cnvgene… cnvge… LXXX…   1.44K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  2 purple_cnvsomt… cnvso… LXXX…   1.32K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  3 purple_driverc… drive… LXXX…     819 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  4 purple_driverc… drive… LXXX…     468 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  5 purple_germdel… germd… LXXX…   1.24K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  6 purple_purityr… purit… LXXX…     484 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  7 purple_purityt… purit… LXXX…     462 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  8 purple_qc       qc     LXXX…     228 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  9 purple_somclon… somcl… LXXX…     451 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist  somhi… LXXX…     138 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 11 purple_version  versi… purp…      39 2025-08-04 00:55:12 /hom… "^purp… versi…
#> # ℹ 3 more variables: group <glue>, raw <list>, tidy <list>
ppl$tbls$raw[[8]] |> dplyr::glimpse()
#> Rows: 1
#> Columns: 12
#> $ QCStatus                      <chr> "PASS"
#> $ Method                        <chr> "NORMAL"
#> $ CopyNumberSegments            <chr> "354"
#> $ UnsupportedCopyNumberSegments <chr> "110"
#> $ Purity                        <chr> "0.8700"
#> $ AmberGender                   <chr> "MALE"
#> $ CobaltGender                  <chr> "MALE"
#> $ DeletedGenes                  <chr> "0"
#> $ Contamination                 <chr> "0.0"
#> $ GermlineAberrations           <chr> "NONE"
#> $ AmberMeanDepth                <chr> "73"
#> $ LohPercent                    <chr> "0.0321"
# the tidy tibbles are nested to allow for more than one tidy tibble per file
ppl$tbls$tidy[[8]][["data"]][[1]] |> dplyr::glimpse()
#> Rows: 1
#> Columns: 12
#> $ qc_status               <chr> "PASS"
#> $ method                  <chr> "NORMAL"
#> $ cn_segments             <int> 354
#> $ unsupported_cn_segments <int> 110
#> $ purity                  <dbl> 0.87
#> $ gender_amber            <chr> "MALE"
#> $ gender_cobalt           <chr> "MALE"
#> $ deleted_genes           <int> 0
#> $ contamination           <dbl> 0
#> $ germline_aberrations    <chr> "NONE"
#> $ mean_depth_amber        <dbl> 73
#> $ loh_percent             <dbl> 0.0321
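The nested access path used above can be mimicked with base R list-columns. The structure below is a simplified assumption inferred from that access path (one outer row per file, one inner row per tidy table); the real tbls object carries more columns.

```r
# Inner level: one row per tidy table produced from a single raw file
qc_tidy <- data.frame(name = "tbl1")
qc_tidy$data <- list(data.frame(qc_status = "PASS", purity = 0.87))

# Outer level: one row per parsed file, with the inner table in a list-column
tbls <- data.frame(parser = c("puritytsv", "qc"))
tbls$tidy <- list(data.frame(), qc_tidy)

# Same access path as above: row of the file, then "data", then the first table
res <- tbls$tidy[[2]][["data"]][[1]]
res
```

Keeping the tidy tables in a list-column leaves room for raw files that are split into several tidy tables.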

We can also focus on a subset of files to tidy using the filter_files function. The include and exclude arguments can specify which parsers to include or exclude in the analysis:

# create new Purple object
ppl2 <- Purple$new(path = ppl_path)
ppl2$files
#> # A tibble: 11 × 10
#>    tool_parser     parser bname    size lastmodified        path  pattern prefix
#>    <glue>          <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue>
#>  1 purple_cnvgene… cnvge… LXXX…   1.44K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  2 purple_cnvsomt… cnvso… LXXX…   1.32K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  3 purple_driverc… drive… LXXX…     819 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  4 purple_driverc… drive… LXXX…     468 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  5 purple_germdel… germd… LXXX…   1.24K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  6 purple_purityr… purit… LXXX…     484 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  7 purple_purityt… purit… LXXX…     462 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  8 purple_qc       qc     LXXX…     228 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  9 purple_somclon… somcl… LXXX…     451 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist  somhi… LXXX…     138 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 11 purple_version  versi… purp…      39 2025-08-04 00:55:12 /hom… "^purp… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
ppl2$filter_files(include = c("purple_qc", "purple_cnvsomtsv"))
ppl2$files
#> # A tibble: 2 × 10
#>   tool_parser      parser bname    size lastmodified        path  pattern prefix
#>   <glue>           <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue>
#> 1 purple_cnvsomtsv cnvso… LXXX…   1.32K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 2 purple_qc        qc     LXXX…     228 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> # ℹ 2 more variables: schema <list>, group <glue>
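The include/exclude logic amounts to filtering rows on the tool_parser column. The following base R sketch shows the idea; filter_parsers is a hypothetical helper and the table is a cut-down stand-in for the files tibble, so details may differ from the package's implementation.

```r
# Hypothetical helper: keep rows listed in `include`, drop rows listed in `exclude`
filter_parsers <- function(files, include = NULL, exclude = NULL) {
  keep <- rep(TRUE, nrow(files))
  if (!is.null(include)) keep <- keep & files$tool_parser %in% include
  if (!is.null(exclude)) keep <- keep & !(files$tool_parser %in% exclude)
  files[keep, , drop = FALSE]
}

# Cut-down stand-in for the files tibble
files <- data.frame(
  tool_parser = c("purple_cnvgenetsv", "purple_cnvsomtsv", "purple_qc", "purple_version"),
  bname = c(
    "LXXXXXXX.purple.cnv.gene.tsv", "LXXXXXXX.purple.cnv.somatic.tsv",
    "LXXXXXXX.purple.qc", "purple.version"
  )
)

included <- filter_parsers(files, include = c("purple_qc", "purple_cnvsomtsv"))
excluded <- filter_parsers(files, exclude = "purple_version")
included$tool_parser
excluded$tool_parser
```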

After tidying the data of interest, we can write the tidy tibbles to various formats, like Apache Parquet, PostgreSQL, CSV/TSV and R’s RDS. Below we can see that the id specified is added to the written files in an additional nemo_id column. This can be used e.g. to distinguish results from different runs in a data pipeline. When writing to a database like PostgreSQL, another column nemo_pfix is used to distinguish results from the same run from the same tool.

ppl2$tidy() # first need to tidy
outdir1 <- tempdir()
fmt <- "tsv"
ppl2$write(odir = outdir1, format = fmt, id = "run123")
(wfiles <- fs::dir_info(outdir1) |> dplyr::select(1:5))
#> # A tibble: 17 × 5
#>    path                            type     size permissions modification_time  
#>    <fs::path>                      <fct> <fs::b> <fs::perms> <dttm>             
#>  1 …XXXXXX_purple_cnvsomtsv.tsv.gz file      631 rw-r--r--   2025-08-04 00:56:05
#>  2 …scps/LXXXXXXX_purple_qc.tsv.gz file      186 rw-r--r--   2025-08-04 00:56:05
#>  3 …mp/RtmpyQscps/file22fe1acf1882 file    4.71K rw-r--r--   2025-08-04 00:56:04
#>  4 …mp/RtmpyQscps/file22fe20b4d710 file    4.71K rw-r--r--   2025-08-04 00:56:04
#>  5 …mp/RtmpyQscps/file22fe22b7aec5 file    4.71K rw-r--r--   2025-08-04 00:56:05
#>  6 …mp/RtmpyQscps/file22fe23265ba2 file    4.71K rw-r--r--   2025-08-04 00:56:03
#>  7 …mp/RtmpyQscps/file22fe28195e39 file    4.71K rw-r--r--   2025-08-04 00:56:03
#>  8 …mp/RtmpyQscps/file22fe331e03ce file    4.71K rw-r--r--   2025-08-04 00:56:03
#>  9 …mp/RtmpyQscps/file22fe38a0d4ef file    4.71K rw-r--r--   2025-08-04 00:56:03
#> 10 /tmp/RtmpyQscps/file22fe4391736 file    4.71K rw-r--r--   2025-08-04 00:56:03
#> 11 …mp/RtmpyQscps/file22fe5726ca13 file    4.71K rw-r--r--   2025-08-04 00:56:05
#> 12 …mp/RtmpyQscps/file22fe71611bda file    4.71K rw-r--r--   2025-08-04 00:56:04
#> 13 …mp/RtmpyQscps/file22fe78e5163f file    4.71K rw-r--r--   2025-08-04 00:56:04
#> 14 /tmp/RtmpyQscps/file22fee4953f0 file    4.71K rw-r--r--   2025-08-04 00:56:03
#> 15 /tmp/RtmpyQscps/file22fef50c966 file    4.71K rw-r--r--   2025-08-04 00:56:03
#> 16 /tmp/RtmpyQscps/file22fef8841a7 file    4.71K rw-r--r--   2025-08-04 00:56:03
#> 17 …rmarkdown-str22fe22d326c4.html file    1.13K rw-r--r--   2025-08-04 00:56:02
readr::read_tsv(wfiles$path[2], show_col_types = FALSE)
#> # A tibble: 1 × 13
#>   nemo_id qc_status method cn_segments unsupported_cn_segments purity
#>   <chr>   <chr>     <chr>        <dbl>                   <dbl>  <dbl>
#> 1 run123  PASS      NORMAL         354                     110   0.87
#> # ℹ 7 more variables: gender_amber <chr>, gender_cobalt <chr>,
#> #   deleted_genes <dbl>, contamination <dbl>, germline_aberrations <chr>,
#> #   mean_depth_amber <dbl>, loh_percent <dbl>
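The write step can be pictured as prepending the id column and writing a compressed TSV named after the file prefix and parser. The base R sketch below follows the file naming seen in the listing above; write_with_id and its arguments are hypothetical, and the package's own writer may work differently.

```r
# Hypothetical helper: prepend nemo_id and write <prefix>_<tool_parser>.tsv.gz
write_with_id <- function(d, odir, prefix, tool_parser, id) {
  # the run identifier becomes the first column
  out <- cbind(nemo_id = id, d)
  path <- file.path(odir, paste0(prefix, "_", tool_parser, ".tsv.gz"))
  con <- gzfile(path, open = "w")
  on.exit(close(con))
  write.table(out, con, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(path)
}

# A dedicated subdirectory keeps unrelated temporary files out of the listing
odir <- file.path(tempdir(), "nemo_sketch")
dir.create(odir, showWarnings = FALSE)

qc <- data.frame(qc_status = "PASS", method = "NORMAL", purity = 0.87)
path <- write_with_id(qc, odir, "LXXXXXXX", "purple_qc", id = "run123")

# read.delim decompresses gzip files transparently
back <- read.delim(path)
back
```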

The nemofy function is a convenient wrapper for the process of filtering, tidying, and writing.

ppl3 <- Purple$new(path = ppl_path)
outdir2 <- file.path(tempdir(), "ppl3") |> fs::dir_create()
ppl3$files
#> # A tibble: 11 × 10
#>    tool_parser     parser bname    size lastmodified        path  pattern prefix
#>    <glue>          <chr>  <chr> <fs::b> <dttm>              <chr> <chr>   <glue>
#>  1 purple_cnvgene… cnvge… LXXX…   1.44K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  2 purple_cnvsomt… cnvso… LXXX…   1.32K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  3 purple_driverc… drive… LXXX…     819 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  4 purple_driverc… drive… LXXX…     468 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  5 purple_germdel… germd… LXXX…   1.24K 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  6 purple_purityr… purit… LXXX…     484 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  7 purple_purityt… purit… LXXX…     462 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  8 purple_qc       qc     LXXX…     228 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#>  9 purple_somclon… somcl… LXXX…     451 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 10 purple_somhist  somhi… LXXX…     138 2025-08-04 00:55:12 /hom… "\\.pu… LXXXX…
#> 11 purple_version  versi… purp…      39 2025-08-04 00:55:12 /hom… "^purp… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
ppl3$nemofy(
  odir = outdir2,
  format = "tsv",
  id = "run_ppl3",
  exclude = c("purple_cnvgenetsv", "purple_cnvsomtsv", "purple_drivercatalog", "purple_germdeltsv")
)
(wfiles2 <- fs::dir_info(outdir2) |> dplyr::select(1:5))
#> # A tibble: 6 × 5
#>   path                               type   size permissions modification_time  
#>   <fs::path>                         <fct> <fs:> <fs::perms> <dttm>             
#> 1 …XXXXXXX_purple_purityrange.tsv.gz file    204 rw-r--r--   2025-08-04 00:56:05
#> 2 …/LXXXXXXX_purple_puritytsv.tsv.gz file    303 rw-r--r--   2025-08-04 00:56:05
#> 3 …ps/ppl3/LXXXXXXX_purple_qc.tsv.gz file    186 rw-r--r--   2025-08-04 00:56:05
#> 4 …XXXXXX_purple_somclonality.tsv.gz file    154 rw-r--r--   2025-08-04 00:56:05
#> 5 …l3/LXXXXXXX_purple_somhist.tsv.gz file    108 rw-r--r--   2025-08-04 00:56:05
#> 6 …pl3/version_purple_version.tsv.gz file     77 rw-r--r--   2025-08-04 00:56:05
readr::read_tsv(wfiles2$path[2], show_col_types = FALSE)
#> # A tibble: 1 × 26
#>   nemo_id  purity norm_factor score diploid_proportion ploidy gender status
#>   <chr>     <dbl>       <dbl> <dbl>              <dbl>  <dbl> <chr>  <chr> 
#> 1 run_ppl3   0.87       0.528 0.841             0.0036   4.05 MALE   NORMAL
#> # ℹ 18 more variables: polyclonal_proportion <dbl>, min_purity <dbl>,
#> #   max_purity <dbl>, min_ploidy <dbl>, max_ploidy <dbl>,
#> #   min_diploid_proportion <dbl>, max_diploid_proportion <dbl>,
#> #   somatic_penalty <dbl>, whole_genome_duplication <lgl>,
#> #   ms_indels_per_mb <dbl>, ms_status <chr>, tml <dbl>, tml_status <chr>,
#> #   tmb_per_mb <dbl>, tmb_status <chr>, sv_tumor_mutational_burden <dbl>,
#> #   run_mode <chr>, targeted <lgl>

`Workflow`

A Workflow consists of a list of one or more Tools. Combining different Tools into a Workflow allows parsing and writing tidy tables from a variety of bioinformatic tools. See ?Workflow.

The Oncoanalyser Nextflow pipeline uses several tools from WiGiTS, so we can construct an Oncoanalyser class as a Workflow child based on the suite of Tools in the tidywigits R package. Similarly to Tool, a Workflow object contains functions such as filter_files, list_files, tidy, write and nemofy:

oa <- system.file("extdata/purple", package = "tidywigits") |>
  Oncoanalyser$new()
outdir3 <- file.path(tempdir(), "oa") |> fs::dir_create()
oa$list_files()
#> # A tibble: 11 × 10
#>    parser     bname    size lastmodified        path  pattern tool_parser prefix
#>    <chr>      <chr> <fs::b> <dttm>              <chr> <chr>   <glue>      <glue>
#>  1 cnvgenetsv LXXX…   1.44K 2025-08-04 00:55:12 /hom… "\\.pu… purple_cnv… LXXXX…
#>  2 cnvsomtsv  LXXX…   1.32K 2025-08-04 00:55:12 /hom… "\\.pu… purple_cnv… LXXXX…
#>  3 drivercat… LXXX…     819 2025-08-04 00:55:12 /hom… "\\.pu… purple_dri… LXXXX…
#>  4 drivercat… LXXX…     468 2025-08-04 00:55:12 /hom… "\\.pu… purple_dri… LXXXX…
#>  5 germdeltsv LXXX…   1.24K 2025-08-04 00:55:12 /hom… "\\.pu… purple_ger… LXXXX…
#>  6 purityran… LXXX…     484 2025-08-04 00:55:12 /hom… "\\.pu… purple_pur… LXXXX…
#>  7 puritytsv  LXXX…     462 2025-08-04 00:55:12 /hom… "\\.pu… purple_pur… LXXXX…
#>  8 qc         LXXX…     228 2025-08-04 00:55:12 /hom… "\\.pu… purple_qc   LXXXX…
#>  9 somclonal… LXXX…     451 2025-08-04 00:55:12 /hom… "\\.pu… purple_som… LXXXX…
#> 10 somhist    LXXX…     138 2025-08-04 00:55:12 /hom… "\\.pu… purple_som… LXXXX…
#> 11 version    purp…      39 2025-08-04 00:55:12 /hom… "^purp… purple_ver… versi…
#> # ℹ 2 more variables: schema <list>, group <glue>
x <- oa$nemofy(
  odir = outdir3,
  format = "tsv",
  id = "oa_run1",
  exclude = c("cobalt_ratiotsv", "amber_baftsv", "isofox_altsj", "isofox_transdata")
)
(wfiles3 <- fs::dir_info(outdir3) |> dplyr::select(1:5))
#> # A tibble: 11 × 5
#>    path                              type   size permissions modification_time  
#>    <fs::path>                        <fct> <fs:> <fs::perms> <dttm>             
#>  1 …line_purple_drivercatalog.tsv.gz file    363 rw-r--r--   2025-08-04 00:56:07
#>  2 …XXXXXXX_purple_cnvgenetsv.tsv.gz file    444 rw-r--r--   2025-08-04 00:56:07
#>  3 …LXXXXXXX_purple_cnvsomtsv.tsv.gz file    633 rw-r--r--   2025-08-04 00:56:07
#>  4 …XXXX_purple_drivercatalog.tsv.gz file    293 rw-r--r--   2025-08-04 00:56:07
#>  5 …XXXXXXX_purple_germdeltsv.tsv.gz file    479 rw-r--r--   2025-08-04 00:56:07
#>  6 …XXXXXX_purple_purityrange.tsv.gz file    202 rw-r--r--   2025-08-04 00:56:07
#>  7 …LXXXXXXX_purple_puritytsv.tsv.gz file    303 rw-r--r--   2025-08-04 00:56:07
#>  8 …cps/oa/LXXXXXXX_purple_qc.tsv.gz file    185 rw-r--r--   2025-08-04 00:56:07
#>  9 …XXXXX_purple_somclonality.tsv.gz file    153 rw-r--r--   2025-08-04 00:56:07
#> 10 …a/LXXXXXXX_purple_somhist.tsv.gz file    107 rw-r--r--   2025-08-04 00:56:07
#> 11 …oa/version_purple_version.tsv.gz file     76 rw-r--r--   2025-08-04 00:56:07
readr::read_tsv(wfiles3$path[5], show_col_types = FALSE)
#> # A tibble: 10 × 17
#>    nemo_id gene        chromosome chromosome_band region_start region_end
#>    <chr>   <chr>       <chr>      <chr>                  <dbl>      <dbl>
#>  1 oa_run1 ZYG11B      chr1       p32.3               52797001   52798000
#>  2 oa_run1 RPL31P12    chr1       p31.1               72300641   72346158
#>  3 oa_run1 AKNAD1      chr1       p13.3              108824397  108829338
#>  4 oa_run1 SLC16A1-AS1 chr1       p13.2              113006001  113012000
#>  5 oa_run1 LRIG2-DT    chr1       p13.2              113006001  113012000
#>  6 oa_run1 GABPB2      chr1       q21.3              151073001  151117000
#>  7 oa_run1 RPS29P29    chr1       q21.3              151073001  151117000
#>  8 oa_run1 LCE1E       chr1       q21.3              152787001  152798000
#>  9 oa_run1 LCE1D       chr1       q21.3              152787001  152798000
#> 10 oa_run1 PTPRVP      chr1       q32.1              202171001  202173000
#> # ℹ 11 more variables: depth_window_count <dbl>, exon_start <dbl>,
#> #   exon_end <dbl>, detection_method <chr>, germline_status <chr>,
#> #   tumor_status <chr>, germline_cn <dbl>, tumor_cn <dbl>, filter <chr>,
#> #   cohort_frequency <dbl>, reported <lgl>
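The relationship between a Workflow and its Tools is essentially delegation: the workflow calls the same function on each tool and combines the results. The sketch below shows this with plain lists and closures instead of R6 classes; make_tool and make_workflow are hypothetical stand-ins, not the package's classes.

```r
# Hypothetical stand-in for a Tool: a name plus a function listing its files
make_tool <- function(name, bnames) {
  list(
    name = name,
    list_files = function() {
      data.frame(tool = rep(name, length(bnames)), bname = bnames)
    }
  )
}

# Hypothetical stand-in for a Workflow: delegates to each tool and row-binds
make_workflow <- function(tools) {
  list(
    tools = tools,
    list_files = function() {
      do.call(rbind, lapply(tools, function(tool) tool$list_files()))
    }
  )
}

wf <- make_workflow(list(
  make_tool("purple", c("LXXXXXXX.purple.qc", "LXXXXXXX.purple.purity.tsv")),
  make_tool("linx", character(0)) # a tool with no matching files
))
lf <- wf$list_files()
lf
```

This mirrors the Oncoanalyser example above, where only PURPLE files are present in the directory and the other tools contribute no rows.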