{tidywigits} is built on top of R’s R6 encapsulated object-oriented
programming implementation, which helps with code organisation. It
consists of several base classes like Config, Tool, and Workflow, which
we describe below. Each R6 class can contain public and private
functions (methods) and non-functions (fields).
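To make the public/private distinction concrete, here is a minimal R6 sketch. The FileCounter class and all of its members are hypothetical and exist only for this illustration; they are not part of {tidywigits}. The sketch assumes the {R6} package is installed.

```r
library(R6)

# Hypothetical class for illustration only; not part of {tidywigits}.
FileCounter <- R6Class(
  "FileCounter",
  public = list(
    # public field (non-function member)
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    # public method: modifies the object in place and returns it invisibly
    add = function(n = 1) {
      private$count <- private$count + n
      invisible(self)
    },
    # public method exposing the private field read-only
    get_count = function() {
      private$count
    }
  ),
  private = list(
    # private field, only reachable from inside the class
    count = 0
  )
)

cnt <- FileCounter$new(name = "purple")
cnt$add(2)$add(3)   # reference semantics: no reassignment needed
cnt$get_count()
cnt$name
is.null(cnt$count)  # private members are not reachable through `$`
```

Because R6 objects have reference semantics, calling a method can change the object without assigning the result back, which is the pattern used by the methods described in the rest of this document.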
Config
A Config object contains functionality for interacting with YAML
configuration files that are part of {tidywigits}. These configuration
files (under inst/config) specify the schemas, types, patterns and field
descriptions for the raw input files and tidy output tbls. See ?Config.
raw
Let’s look at some of the information for the raw PURPLE config, for instance:
## #--- Config purple ---#
##
## |var |value |
## |:-----|:------|
## |tool |purple |
## |nraw |10 |
## |ntidy |10 |
You can access the individual fields in the classic R list-like
manner, using the $ operator.
Patterns are used to fish out the relevant files from a directory listing.
conf$get_raw_patterns() |>
  knitr::kable(caption = glue("{toolu} raw file patterns."))
name | value |
---|---|
cnvgenetsv | \.purple\.cnv\.gene\.tsv$ |
cnvsomtsv | \.purple\.cnv\.somatic\.tsv$ |
drivercatalog | \.purple\.driver\.catalog\.(germline|somatic)\.tsv$ |
germdeltsv | \.purple\.germline\.deletion\.tsv$ |
purityrange | \.purple\.purity\.range\.tsv$ |
puritytsv | \.purple\.purity\.tsv$ |
qc | \.purple\.qc$ |
somclonality | \.purple\.somatic\.clonality\.tsv$ |
somhist | \.purple\.somatic\.hist\.tsv$ |
version | ^purple\.version$ |
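As a rough illustration of how such patterns can pick files out of a directory listing, the base R sketch below matches three of the PURPLE patterns against a made-up vector of file names. The file names are hypothetical, and the package's own matching logic may differ.

```r
# Regular expressions as stored in the config (note the escaped dots).
patterns <- c(
  puritytsv = "\\.purple\\.purity\\.tsv$",
  qc        = "\\.purple\\.qc$",
  version   = "^purple\\.version$"
)

# Hypothetical directory listing; in practice this could come from list.files().
files <- c(
  "sample1.purple.purity.tsv",
  "sample1.purple.qc",
  "purple.version",
  "sample1.purple.purity.range.tsv",
  "README.md"
)

# For each pattern, keep the file names that match it.
matches <- lapply(patterns, function(p) grep(p, files, value = TRUE))
matches
```

The trailing `$` anchor is what keeps the puritytsv pattern from also picking up the purity range file.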
File descriptions are based on the available Hartwig documentation.
conf$get_raw_descriptions() |>
  knitr::kable(caption = glue("{toolu} raw file descriptions."))
name | value |
---|---|
cnvgenetsv | Copy number alterations of each gene in the HMF gene panel. |
cnvsomtsv | Copy number profile of all (contiguous) segments of the tumor sample. |
drivercatalog | Significant amplifications and deletions that occur in the HMF gene panel. |
germdeltsv | Germline deletions. |
purityrange | Best fit per purity sorted by score. |
puritytsv | Purity/ploidy fit summary. |
qc | QC metrics. |
somclonality | Clonality peak model data. |
somhist | Somatic variant histogram data. |
version | Version of the tool. |
Versions are used to distinguish changes in schema between individual
tool versions. For example, after LINX v1.25, several columns were
dropped from the breakends table, which is reflected in the available
LINX schemas. For now we are using latest as a default version based on
the most recent schema tests, and any discrepancies we see are labelled
accordingly by the version of the tool that generated a file with a
different schema.
conf$get_raw_versions() |>
  knitr::kable(caption = glue("{toolu} raw file versions."))
name | value |
---|---|
cnvgenetsv | latest |
cnvsomtsv | latest |
drivercatalog | latest |
germdeltsv | latest |
purityrange | latest |
puritytsv | latest |
qc | latest |
somclonality | latest |
somhist | latest |
version | latest |
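One way to picture version-specific schemas is a lookup that compares a file's header with the fields recorded for each version. The sketch below is only an assumption about how such a lookup could work: the pick_version helper, the version labels other than latest, and the column names are all invented for the example and do not come from {tidywigits}.

```r
# Hypothetical schemas for one file type, keyed by version label.
# The column names are invented for this example.
schemas <- list(
  latest = c("id", "gene", "type"),
  "v1.0" = c("id", "gene", "type", "legacy_flag")
)

# Return the label of the first schema whose fields equal the file header.
pick_version <- function(header, schemas) {
  hit <- vapply(schemas, function(fields) identical(fields, header), logical(1))
  if (!any(hit)) {
    stop("No schema matches the file header.")
  }
  names(schemas)[which(hit)[1]]
}

pick_version(c("id", "gene", "type"), schemas)
pick_version(c("id", "gene", "type", "legacy_flag"), schemas)
```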
The raw schemas specify the column name and type (e.g. character (c),
integer (i), float/double (d)) for each input file (just showing a
couple below):
(s <- conf$get_raw_schemas_all())
## # A tibble: 10 × 3
## name version schema
## <chr> <chr> <list>
## 1 cnvgenetsv latest <tibble [18 × 2]>
## 2 cnvsomtsv latest <tibble [16 × 2]>
## 3 drivercatalog latest <tibble [17 × 2]>
## 4 germdeltsv latest <tibble [16 × 2]>
## 5 purityrange latest <tibble [6 × 2]>
## 6 puritytsv latest <tibble [25 × 2]>
## 7 qc latest <tibble [2 × 2]>
## 8 somclonality latest <tibble [6 × 2]>
## 9 somhist latest <tibble [3 × 2]>
## 10 version latest <tibble [2 × 2]>
## # A tibble: 25 × 2
## field type
## <chr> <chr>
## 1 purity d
## 2 normFactor d
## 3 score d
## 4 diploidProportion d
## 5 ploidy d
## 6 gender c
## 7 status c
## 8 polyclonalProportion d
## 9 minPurity d
## 10 maxPurity d
## # ℹ 15 more rows
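The single-letter type codes are the same ones {readr} accepts in its compact column specification. To show how a field/type schema can drive parsing without extra dependencies, the sketch below maps the codes to base R classes and reads a tiny tab-separated string. The mapping vector and the three-column input are assumptions made for the example; the values are borrowed from the example data.

```r
# Schema in the same shape as above: one row per column, with a compact type code.
schema <- data.frame(
  field = c("purity", "gender", "tml"),
  type  = c("d", "c", "d"),
  stringsAsFactors = FALSE
)

# Assumed mapping from the compact codes to base R classes.
code2class <- c(c = "character", i = "integer", d = "numeric")
col_classes <- setNames(unname(code2class[schema$type]), schema$field)

# Tiny tab-separated input with a header line.
txt <- "purity\tgender\ttml\n0.87\tMALE\t29\n"
raw <- read.delim(text = txt, colClasses = col_classes)
str(raw)
```

Declaring the types up front avoids relying on type guessing, which can vary between files of the same kind.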
tidy
Now let’s look at some of the information in the tidy PURPLE config. The difference between raw and tidy configs is mostly in the column names (they are standardised to lowercase separated by underscores, i.e. snake_case), and some raw files get split into multiple tidy tables (e.g. for normalisation purposes).
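The renaming from the raw camelCase names to snake_case can be approximated with a single regular expression. The to_snake helper below is a hypothetical sketch, not the function the package uses; the input names are taken from the raw PURPLE purity schema.

```r
# Hypothetical helper: camelCase -> snake_case.
to_snake <- function(x) {
  # insert an underscore between a lowercase letter or digit and an uppercase letter
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  tolower(x)
}

raw_names <- c("normFactor", "diploidProportion", "msIndelsPerMb", "svTumorMutationalBurden")
to_snake(raw_names)
```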
Tidy descriptions are the same as the raw descriptions for now.
conf$get_tidy_descriptions() |>
  knitr::kable(caption = glue("{toolu} tidy file descriptions."))
name | value |
---|---|
cnvgenetsv | Copy number alterations of each gene in the HMF gene panel. |
cnvsomtsv | Copy number profile of all (contiguous) segments of the tumor sample. |
drivercatalog | Significant amplifications and deletions that occur in the HMF gene panel. |
germdeltsv | Germline deletions. |
purityrange | Best fit per purity sorted by score. |
puritytsv | Purity/ploidy fit summary. |
qc | QC metrics. |
somclonality | Clonality peak model data. |
somhist | Somatic variant histogram data. |
version | Version of the tool. |
(s <- conf$get_tidy_schemas_all())
## # A tibble: 10 × 4
## name version tbl schema
## <chr> <chr> <chr> <list>
## 1 cnvgenetsv latest tbl1 <tibble [18 × 3]>
## 2 cnvsomtsv latest tbl1 <tibble [16 × 3]>
## 3 drivercatalog latest tbl1 <tibble [17 × 3]>
## 4 germdeltsv latest tbl1 <tibble [16 × 3]>
## 5 purityrange latest tbl1 <tibble [6 × 3]>
## 6 puritytsv latest tbl1 <tibble [25 × 3]>
## 7 qc latest tbl1 <tibble [12 × 3]>
## 8 somclonality latest tbl1 <tibble [6 × 3]>
## 9 somhist latest tbl1 <tibble [3 × 3]>
## 10 version latest tbl1 <tibble [2 × 3]>
## # A tibble: 25 × 3
## field type description
## <chr> <chr> <chr>
## 1 purity d purity of tumor in the sample
## 2 norm_factor d internal factor to convert tumor ratio to cn.
## 3 score d score of fit (lower is better)
## 4 diploid_proportion d proportion of cn regions that have 1 (+- 0.2) minor and major allele
## 5 ploidy d average ploidy of the tumor sample after adjusting for purity
## 6 gender c one of male or female
## 7 status c either pass or one or more warning or fail status
## 8 polyclonal_proportion d proportion of copy number regions that are more than 0.25 from a who…
## 9 min_purity d minimum purity with score within 10% of best
## 10 max_purity d maximum purity with score within 10% of best
## # ℹ 15 more rows
Tool
Tool is the main organisation class for all file parsers and tidiers. It
contains functions for parsing and tidying typical CSV/TSV files (with
column names), and TXT files where the column names are missing.
Currently it utilises the very simple readr::read_delim function from
the {readr} package that reads all the data into memory. See ?Tool.
These simple parsers are used in 80-90% of cases. If needed, the parsing can be optimised in the future with faster packages such as {data.table}, {duckdb-r}/{duckplyr} or {r-polars}.
We can have different Tool child classes that inherit (or override)
functions and fields from the Tool parent class. For example, we can
create a Tool object for PURPLE as follows:
- Initialise a Purple object:
ppl_path <- system.file("extdata/oa/purple", package = "tidywigits")
ppl <- Purple$new(path = ppl_path)
# each class comes with a print function
ppl
## #--- Tool purple ---#
##
## |var |value |
## |:------|:------------------------------------------------------------|
## |name |purple |
## |path |/home/runner/work/_temp/Library/tidywigits/extdata/oa/purple |
## |files |11 |
## |tidied |FALSE |
- Its Config object is also constructed based on the name supplied - this is used internally to find files of interest and infer their schemas:
ppl$config
## #--- Config purple ---#
##
## |var |value |
## |:-----|:------|
## |tool |purple |
## |nraw |10 |
## |ntidy |10 |
ppl$config$get_raw_patterns()
## # A tibble: 10 × 2
## name value
## <chr> <chr>
## 1 cnvgenetsv "\\.purple\\.cnv\\.gene\\.tsv$"
## 2 cnvsomtsv "\\.purple\\.cnv\\.somatic\\.tsv$"
## 3 drivercatalog "\\.purple\\.driver\\.catalog\\.(germline|somatic)\\.tsv$"
## 4 germdeltsv "\\.purple\\.germline\\.deletion\\.tsv$"
## 5 purityrange "\\.purple\\.purity\\.range\\.tsv$"
## 6 puritytsv "\\.purple\\.purity\\.tsv$"
## 7 qc "\\.purple\\.qc$"
## 8 somclonality "\\.purple\\.somatic\\.clonality\\.tsv$"
## 9 somhist "\\.purple\\.somatic\\.hist\\.tsv$"
## 10 version "^purple\\.version$"
ppl$config$get_raw_schema("puritytsv")
## # A tibble: 25 × 2
## field type
## <chr> <chr>
## 1 purity d
## 2 normFactor d
## 3 score d
## 4 diploidProportion d
## 5 ploidy d
## 6 gender c
## 7 status c
## 8 polyclonalProportion d
## 9 minPurity d
## 10 maxPurity d
## # ℹ 15 more rows
ppl$config$get_tidy_schema("puritytsv")
## # A tibble: 25 × 3
## field type description
## <chr> <chr> <chr>
## 1 purity d purity of tumor in the sample
## 2 norm_factor d internal factor to convert tumor ratio to cn.
## 3 score d score of fit (lower is better)
## 4 diploid_proportion d proportion of cn regions that have 1 (+- 0.2) minor and major allele
## 5 ploidy d average ploidy of the tumor sample after adjusting for purity
## 6 gender c one of male or female
## 7 status c either pass or one or more warning or fail status
## 8 polyclonal_proportion d proportion of copy number regions that are more than 0.25 from a who…
## 9 min_purity d minimum purity with score within 10% of best
## 10 max_purity d maximum purity with score within 10% of best
## # ℹ 15 more rows
We can list files that can be parsed with list_files():
(lf <- ppl$list_files())
## # A tibble: 11 × 10
## tool_parser parser bname size lastmodified path pattern prefix schema group
## <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue> <list> <glu>
## 1 purple_version versi… purp… 39 2025-08-19 05:27:46 /hom… "^purp… versi… <tibble>
## 2 purple_cnvgenetsv cnvge… samp… 1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 3 purple_cnvsomtsv cnvso… samp… 1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 4 purple_drivercatalog drive… samp… 819 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 5 purple_drivercatalog drive… samp… 468 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 6 purple_germdeltsv germd… samp… 1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 7 purple_purityrange purit… samp… 484 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 8 purple_puritytsv purit… samp… 462 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 9 purple_qc qc samp… 228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 10 purple_somclonality somcl… samp… 451 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 11 purple_somhist somhi… samp… 138 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
## $ tool_parser : 'glue' chr "purple_version"
## $ parser : chr "version"
## $ bname : chr "purple.version"
## $ size : 'fs_bytes' num 39
## $ lastmodified: POSIXct[1:1], format: "2025-08-19 05:27:46"
## $ path : chr "/home/runner/work/_temp/Library/tidywigits/extdata/oa/purple/purple.version"
## $ pattern : chr "^purple\\.version$"
## $ prefix : 'glue' chr "version"
## $ schema :List of 1
## ..$ : tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
## .. ..$ field: chr [1:2] "variable" "value"
## .. ..$ type : chr [1:2] "c" "c"
## $ group : 'glue' chr ""
We can parse and tidy files of interest using the tidy function. Note
that this function is called on the object for its side effect, so its
result is not assigned anywhere:
# this will create a new field tbls containing the tidy data (and optionally
# the 'raw' parsed data)
ppl$tidy(tidy = TRUE, keep_raw = TRUE)
ppl$tbls
## # A tibble: 11 × 11
## tool_parser parser bname size lastmodified path pattern prefix group raw tidy
## <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue> <glu> <list> <list>
## 1 purple_ver… versi… purp… 39 2025-08-19 05:27:46 /hom… "^purp… versi… <tibble> <tibble>
## 2 purple_cnv… cnvge… samp… 1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 3 purple_cnv… cnvso… samp… 1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 4 purple_dri… drive… samp… 819 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 5 purple_dri… drive… samp… 468 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 6 purple_ger… germd… samp… 1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 7 purple_pur… purit… samp… 484 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 8 purple_pur… purit… samp… 462 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 9 purple_qc qc samp… 228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 10 purple_som… somcl… samp… 451 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
## 11 purple_som… somhi… samp… 138 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble> <tibble>
ppl$tbls$raw[[8]] |> dplyr::glimpse()
## Rows: 1
## Columns: 25
## $ purity <dbl> 0.87
## $ normFactor <dbl> 0.5283
## $ score <dbl> 0.841
## $ diploidProportion <dbl> 0.0036
## $ ploidy <dbl> 4.05
## $ gender <chr> "MALE"
## $ status <chr> "NORMAL"
## $ polyclonalProportion <dbl> 0.1501
## $ minPurity <dbl> 0.79
## $ maxPurity <dbl> 0.96
## $ minPloidy <dbl> 3.95
## $ maxPloidy <dbl> 4.2
## $ minDiploidProportion <dbl> 9e-04
## $ maxDiploidProportion <dbl> 0.004
## $ somaticPenalty <dbl> 0
## $ wholeGenomeDuplication <chr> "true"
## $ msIndelsPerMb <dbl> 0.0238
## $ msStatus <chr> "MSS"
## $ tml <dbl> 29
## $ tmlStatus <chr> "LOW"
## $ tmbPerMb <dbl> 1.1161
## $ tmbStatus <chr> "LOW"
## $ svTumorMutationalBurden <dbl> 64
## $ runMode <chr> "TUMOR_GERMLINE"
## $ targeted <chr> "false"
# the tidy tibbles are nested to allow for more than one tidy tibble per file
ppl$tbls$tidy[[8]][["data"]][[1]] |> dplyr::glimpse()
## Rows: 1
## Columns: 25
## $ purity <dbl> 0.87
## $ norm_factor <dbl> 0.5283
## $ score <dbl> 0.841
## $ diploid_proportion <dbl> 0.0036
## $ ploidy <dbl> 4.05
## $ gender <chr> "MALE"
## $ status <chr> "NORMAL"
## $ polyclonal_proportion <dbl> 0.1501
## $ min_purity <dbl> 0.79
## $ max_purity <dbl> 0.96
## $ min_ploidy <dbl> 3.95
## $ max_ploidy <dbl> 4.2
## $ min_diploid_proportion <dbl> 9e-04
## $ max_diploid_proportion <dbl> 0.004
## $ somatic_penalty <dbl> 0
## $ whole_genome_duplication <chr> "true"
## $ ms_indels_per_mb <dbl> 0.0238
## $ ms_status <chr> "MSS"
## $ tml <dbl> 29
## $ tml_status <chr> "LOW"
## $ tmb_per_mb <dbl> 1.1161
## $ tmb_status <chr> "LOW"
## $ sv_tumor_mutational_burden <dbl> 64
## $ run_mode <chr> "TUMOR_GERMLINE"
## $ targeted <chr> "false"
We can also focus on a subset of files to tidy using the filter_files()
function. The include and exclude arguments can specify which parsers to
include or exclude in the analysis:
# create new Purple object
ppl2 <- Purple$new(path = ppl_path)
ppl2$files
## # A tibble: 11 × 10
## tool_parser parser bname size lastmodified path pattern prefix schema group
## <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue> <list> <glu>
## 1 purple_version versi… purp… 39 2025-08-19 05:27:46 /hom… "^purp… versi… <tibble>
## 2 purple_cnvgenetsv cnvge… samp… 1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 3 purple_cnvsomtsv cnvso… samp… 1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 4 purple_drivercatalog drive… samp… 819 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 5 purple_drivercatalog drive… samp… 468 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 6 purple_germdeltsv germd… samp… 1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 7 purple_purityrange purit… samp… 484 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 8 purple_puritytsv purit… samp… 462 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 9 purple_qc qc samp… 228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 10 purple_somclonality somcl… samp… 451 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 11 purple_somhist somhi… samp… 138 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
ppl2$filter_files(include = c("purple_qc", "purple_cnvsomtsv"))
ppl2$files
## # A tibble: 2 × 10
## tool_parser parser bname size lastmodified path pattern prefix schema group
## <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue> <list> <glu>
## 1 purple_cnvsomtsv cnvsomtsv sample… 1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 2 purple_qc qc sample… 228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
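The include/exclude behaviour can be sketched on a plain character vector of tool_parser names. The filter_parsers helper is hypothetical, and the rule that only one of the two arguments is supplied at a time is an assumption of this sketch rather than documented behaviour.

```r
# Hypothetical helper, not the package's implementation.
filter_parsers <- function(x, include = NULL, exclude = NULL) {
  # assumption: at most one of include/exclude is supplied
  stopifnot(is.null(include) || is.null(exclude))
  if (!is.null(include)) {
    return(x[x %in% include])
  }
  if (!is.null(exclude)) {
    return(x[!x %in% exclude])
  }
  x
}

parsers <- c("purple_version", "purple_cnvsomtsv", "purple_puritytsv", "purple_qc")
filter_parsers(parsers, include = c("purple_qc", "purple_cnvsomtsv"))
filter_parsers(parsers, exclude = "purple_version")
```

Filtering with `%in%` keeps the original order of the listing rather than the order of the include vector, which is consistent with the two-row result shown above.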
After tidying the data of interest, we can write the tidy tibbles to
various formats, like Apache Parquet, PostgreSQL, CSV/TSV and R’s RDS.
Below we can see that the id specified is added to the written files in
an additional nemo_id column. This can be used e.g. to distinguish
results from different runs in a data pipeline. When writing to a
database like PostgreSQL, another column nemo_pfix is used to
distinguish results from the same run from the same tool.
ppl2$tidy() # first need to tidy
outdir1 <- tempdir()
fmt <- "csv"
ppl2$write(odir = outdir1, format = fmt, id = "run123")
wfiles <- fs::dir_info(outdir1) |> dplyr::select(1:5)
wfiles |>
  dplyr::mutate(bname = basename(.data$path)) |>
  dplyr::select("bname", "size", "type")
## # A tibble: 17 × 3
## bname size type
## <chr> <fs::bytes> <fct>
## 1 file2858145e191c 4.71K file
## 2 file285817c13815 4.71K file
## 3 file28582a9bdf63 4.71K file
## 4 file28583357cd4e 4.71K file
## 5 file28583cb08685 4.71K file
## 6 file2858440e0950 4.71K file
## 7 file28584955028d 4.71K file
## 8 file2858499242e7 4.71K file
## 9 file28587390064f 4.71K file
## 10 file285874d245ca 4.71K file
## 11 file285876d1c889 4.71K file
## 12 file28587cb3bc21 4.71K file
## 13 file28587d382e33 4.71K file
## 14 file28587fcc52a6 4.71K file
## 15 rmarkdown-str285844bb6b9.html 1.13K file
## 16 sample1_purple_cnvsomtsv.csv.gz 631 file
## 17 sample1_purple_qc.csv.gz 186 file
# readr::read_csv(wfiles$path[1], show_col_types = F) # see bug #137
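The effect of the id argument can be imitated in base R: prepend a nemo_id column, write a gzip-compressed CSV, and read it back. This is a sketch under assumptions (the table contents, the output directory name and the file name are made up), not the package's writer.

```r
# Hypothetical tidy table and run identifier.
tidy_tbl <- data.frame(purity = 0.87, ploidy = 4.05)
id <- "run123"

# Prepend the identifier as the first column.
out <- cbind(nemo_id = id, tidy_tbl)

# Write a gzip-compressed CSV into a dedicated temporary subdirectory.
odir <- file.path(tempdir(), "nemo_sketch")
dir.create(odir, showWarnings = FALSE)
ofile <- file.path(odir, "sample1_purple_puritytsv.csv.gz")
con <- gzfile(ofile, open = "wt")
write.csv(out, con, row.names = FALSE)
close(con)

# Reading it back shows the extra column.
back <- read.csv(ofile)
names(back)
```

Writing into a subdirectory of tempdir() rather than tempdir() itself keeps unrelated session files, like those in the listing above, out of the output directory.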
The nemofy function is a convenient wrapper for the process of
filtering, tidying, and writing.
ppl3 <- Purple$new(path = ppl_path)
outdir2 <- file.path(tempdir(), "ppl3") |> fs::dir_create()
ppl3$files
## # A tibble: 11 × 10
## tool_parser parser bname size lastmodified path pattern prefix schema group
## <glue> <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue> <list> <glu>
## 1 purple_version versi… purp… 39 2025-08-19 05:27:46 /hom… "^purp… versi… <tibble>
## 2 purple_cnvgenetsv cnvge… samp… 1.44K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 3 purple_cnvsomtsv cnvso… samp… 1.32K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 4 purple_drivercatalog drive… samp… 819 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 5 purple_drivercatalog drive… samp… 468 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 6 purple_germdeltsv germd… samp… 1.24K 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 7 purple_purityrange purit… samp… 484 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 8 purple_puritytsv purit… samp… 462 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 9 purple_qc qc samp… 228 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 10 purple_somclonality somcl… samp… 451 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
## 11 purple_somhist somhi… samp… 138 2025-08-19 05:27:46 /hom… "\\.pu… sampl… <tibble>
ppl3$nemofy(
  odir = outdir2,
  format = "tsv",
  id = "run_ppl3",
  exclude = c("purple_cnvgenetsv", "purple_cnvsomtsv", "purple_drivercatalog", "purple_germdeltsv")
)
wfiles2 <- fs::dir_info(outdir2) |>
  dplyr::mutate(bname = basename(.data$path))
wfiles2 |>
  dplyr::select("bname", "size", "type")
## # A tibble: 6 × 3
## bname size type
## <chr> <fs::bytes> <fct>
## 1 sample1_purple_purityrange.tsv.gz 204 file
## 2 sample1_purple_puritytsv.tsv.gz 303 file
## 3 sample1_purple_qc.tsv.gz 186 file
## 4 sample1_purple_somclonality.tsv.gz 154 file
## 5 sample1_purple_somhist.tsv.gz 108 file
## 6 version_purple_version.tsv.gz 77 file
readr::read_tsv(wfiles2$path[2], show_col_types = FALSE)
## # A tibble: 1 × 26
## nemo_id purity norm_factor score diploid_proportion ploidy gender status polyclonal_proportion
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 run_ppl3 0.87 0.528 0.841 0.0036 4.05 MALE NORMAL 0.150
## # ℹ 17 more variables: min_purity <dbl>, max_purity <dbl>, min_ploidy <dbl>, max_ploidy <dbl>,
## # min_diploid_proportion <dbl>, max_diploid_proportion <dbl>, somatic_penalty <dbl>,
## # whole_genome_duplication <lgl>, ms_indels_per_mb <dbl>, ms_status <chr>, tml <dbl>,
## # tml_status <chr>, tmb_per_mb <dbl>, tmb_status <chr>, sv_tumor_mutational_burden <dbl>,
## # run_mode <chr>, targeted <lgl>
Workflow
A Workflow consists of a list of one or more Tools. We can construct a
given Workflow from different Tools, which allows parsing and writing
tidy tables from a variety of bioinformatic tools. See ?Workflow.
The Oncoanalyser Nextflow pipeline uses several tools from WiGiTS, and
we can construct an Oncoanalyser class as a Workflow child based on a
suite of Tools under the tidywigits R package. Similarly to Tool, a
Workflow object contains functions such as filter_files, list_files,
tidy, write and nemofy:
oa <- system.file("extdata/oa/purple", package = "tidywigits") |>
  Oncoanalyser$new()
outdir3 <- file.path(tempdir(), "oa") |> fs::dir_create()
oa$list_files()
## # A tibble: 11 × 10
## parser bname size lastmodified path pattern tool_parser prefix schema group
## <chr> <chr> <fs::b> <dttm> <chr> <chr> <glue> <glue> <list> <glu>
## 1 version purple… 39 2025-08-19 05:27:46 /hom… "^purp… purple_ver… versi… <tibble>
## 2 cnvgenetsv sample… 1.44K 2025-08-19 05:27:46 /hom… "\\.pu… purple_cnv… sampl… <tibble>
## 3 cnvsomtsv sample… 1.32K 2025-08-19 05:27:46 /hom… "\\.pu… purple_cnv… sampl… <tibble>
## 4 drivercatalog sample… 819 2025-08-19 05:27:46 /hom… "\\.pu… purple_dri… sampl… <tibble>
## 5 drivercatalog sample… 468 2025-08-19 05:27:46 /hom… "\\.pu… purple_dri… sampl… <tibble>
## 6 germdeltsv sample… 1.24K 2025-08-19 05:27:46 /hom… "\\.pu… purple_ger… sampl… <tibble>
## 7 purityrange sample… 484 2025-08-19 05:27:46 /hom… "\\.pu… purple_pur… sampl… <tibble>
## 8 puritytsv sample… 462 2025-08-19 05:27:46 /hom… "\\.pu… purple_pur… sampl… <tibble>
## 9 qc sample… 228 2025-08-19 05:27:46 /hom… "\\.pu… purple_qc sampl… <tibble>
## 10 somclonality sample… 451 2025-08-19 05:27:46 /hom… "\\.pu… purple_som… sampl… <tibble>
## 11 somhist sample… 138 2025-08-19 05:27:46 /hom… "\\.pu… purple_som… sampl… <tibble>
x <- oa$nemofy(
  odir = outdir3,
  format = "tsv",
  id = "oa_run1",
  exclude = c("cobalt_ratiotsv", "amber_baftsv", "isofox_altsj", "isofox_transdata")
)
wfiles3 <- fs::dir_info(outdir3) |>
  dplyr::select(1:5) |>
  dplyr::mutate(bname = basename(.data$path))
wfiles3 |>
  dplyr::select("bname", "size", "type")
## # A tibble: 11 × 3
## bname size type
## <chr> <fs::bytes> <fct>
## 1 sample1_germline_purple_drivercatalog.tsv.gz 363 file
## 2 sample1_purple_cnvgenetsv.tsv.gz 444 file
## 3 sample1_purple_cnvsomtsv.tsv.gz 633 file
## 4 sample1_purple_drivercatalog.tsv.gz 293 file
## 5 sample1_purple_germdeltsv.tsv.gz 479 file
## 6 sample1_purple_purityrange.tsv.gz 202 file
## 7 sample1_purple_puritytsv.tsv.gz 303 file
## 8 sample1_purple_qc.tsv.gz 185 file
## 9 sample1_purple_somclonality.tsv.gz 153 file
## 10 sample1_purple_somhist.tsv.gz 107 file
## 11 version_purple_version.tsv.gz 76 file
readr::read_tsv(wfiles3$path[5], show_col_types = FALSE)
## # A tibble: 10 × 17
## nemo_id gene chromosome chromosome_band region_start region_end depth_window_count exon_start
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 oa_run1 ZYG11B chr1 p32.3 52797001 52798000 1 8
## 2 oa_run1 RPL31P12 chr1 p31.1 72300641 72346158 45 1
## 3 oa_run1 AKNAD1 chr1 p13.3 108824397 108829338 5 11
## 4 oa_run1 SLC16A1… chr1 p13.2 113006001 113012000 6 4
## 5 oa_run1 LRIG2-DT chr1 p13.2 113006001 113012000 6 3
## 6 oa_run1 GABPB2 chr1 q21.3 151073001 151117000 44 2
## 7 oa_run1 RPS29P29 chr1 q21.3 151073001 151117000 44 1
## 8 oa_run1 LCE1E chr1 q21.3 152787001 152798000 10 2
## 9 oa_run1 LCE1D chr1 q21.3 152787001 152798000 10 1
## 10 oa_run1 PTPRVP chr1 q32.1 202171001 202173000 1 6
## # ℹ 9 more variables: exon_end <dbl>, detection_method <chr>, germline_status <chr>,
## # tumor_status <chr>, germline_cn <dbl>, tumor_cn <dbl>, filter <chr>, cohort_frequency <dbl>,
## # reported <lgl>
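The idea of a Workflow delegating to a list of Tools can be sketched with plain closures. This deliberately swaps R6 for simple lists of functions to stay dependency-free; make_tool, make_workflow, the toolB name and the file names are all hypothetical and do not reflect the package's implementation.

```r
# Hypothetical tool-like object: a list holding a name and a list_files() closure.
make_tool <- function(name, files) {
  list(
    name = name,
    list_files = function() {
      data.frame(tool = name, bname = files, stringsAsFactors = FALSE)
    }
  )
}

# Hypothetical workflow-like object: holds several tools and row-binds their listings.
make_workflow <- function(tools) {
  list(
    tools = tools,
    list_files = function() {
      do.call(rbind, lapply(tools, function(tool) tool$list_files()))
    }
  )
}

wf <- make_workflow(list(
  make_tool("purple", c("sample1.purple.qc", "purple.version")),
  make_tool("toolB", "sample1.toolB.tsv")
))
wf$list_files()
```

The same delegation pattern would extend to the other shared operations: the workflow loops over its tools and calls the method of the same name on each.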