- Docs: https://umccr.github.io/tidywigits
Overview
{tidywigits} is an R package that parses and tidies outputs from the WiGiTS suite of genome and transcriptome analysis tools for cancer research and diagnostics, created by the Hartwig Medical Foundation.
In short, it traverses a directory containing results from one or more runs of WiGiTS tools, parses any files it recognises, tidies them up (data reshaping, normalisation, column name cleanup, etc.), and writes them to the output format of your choice, e.g. Apache Parquet, PostgreSQL, TSV or RDS.
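As a rough illustration of the traverse, recognise and tidy steps described above, the base R sketch below matches file names against a small pattern table and cleans up column names. The pattern table and the helpers recognise_files() and clean_names() are hypothetical and are not taken from {tidywigits}; they only show the general idea on a throwaway directory with mock files.

```r
# Hypothetical lookup of file name patterns -> file types (illustrative only;
# these are NOT the patterns {tidywigits} uses internally).
patterns <- c(
  purple_qc     = "\\.purple\\.qc$",
  purple_purity = "\\.purple\\.purity\\.tsv$",
  amber_qc      = "\\.amber\\.qc$"
)

# Walk `dir` recursively and return one row per recognised file.
recognise_files <- function(dir, patterns) {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  hits <- lapply(names(patterns), function(nm) {
    f <- files[grepl(patterns[[nm]], basename(files))]
    data.frame(type = rep(nm, length(f)), path = f, stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Normalise column names: split camelCase, lower-case everything,
# and turn runs of non-alphanumeric characters into single underscores.
clean_names <- function(x) {
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  x <- tolower(gsub("[^A-Za-z0-9]+", "_", x))
  gsub("^_+|_+$", "", x)
}

# Demo: one file that matches a pattern and one that does not (mock content).
d <- file.path(tempdir(), "wigits_demo")
dir.create(file.path(d, "purple"), recursive = TRUE, showWarnings = FALSE)
writeLines("QCStatus\tPASS", file.path(d, "purple", "sample1.purple.qc"))
writeLines("ignored", file.path(d, "purple", "notes.txt"))

found <- recognise_files(d, patterns)
print(found$type)
print(clean_names(c("minPurity", "Gene Name")))
```

Keeping the patterns in a named vector means unrecognised files are simply skipped, which mirrors the "parses any files it recognises" behaviour at a conceptual level.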
Quick Start
The starting point of {tidywigits} is a directory with WiGiTS results. Let's look at some sample data (tracked via DVC) under https://github.com/umccr/tidywigits/tree/main/inst/extdata/oa:
system.file("extdata/oa", package = "tidywigits") |>
  fs::dir_tree(invert = TRUE, glob = "*.dvc")
/Users/pdiakumis/Library/R/arm64/4.5/library/tidywigits/extdata/oa
├── alignments
│   └── sample1.duplicate_freq.tsv
├── amber
│   ├── sample1.amber.baf.pcf
│   ├── sample1.amber.contamination.tsv
│   ├── sample1.amber.homozygousregion.tsv
│   └── sample1.amber.qc
├── bamtools
│   └── sample1.wgsmetrics
├── chord
│   ├── sample1.chord.mutation_contexts.tsv
│   └── sample1.chord.prediction.tsv
├── cobalt
│   ├── cobalt.version
│   ├── sample1.cobalt.gc.median.tsv
│   ├── sample1.cobalt.ratio.median.tsv
│   └── sample1.cobalt.ratio.pcf
├── cuppa
│   ├── sample1.cuppa.pred_summ.tsv
│   ├── sample1.cuppa.vis_data.tsv
│   └── sample1.cuppa_data.tsv.gz
├── lilac
│   ├── sample1.lilac.candidates.coverage.tsv
│   ├── sample1.lilac.qc.tsv
│   └── sample1.lilac.tsv
├── linx
│   ├── germline_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.germline.breakend.tsv
│   │   ├── sample1.linx.germline.clusters.tsv
│   │   ├── sample1.linx.germline.disruption.tsv
│   │   ├── sample1.linx.germline.driver.catalog.tsv
│   │   ├── sample1.linx.germline.links.tsv
│   │   └── sample1.linx.germline.svs.tsv
│   └── somatic_annotations
│       ├── linx.version
│       ├── sample1.linx.breakend.tsv
│       ├── sample1.linx.clusters.tsv
│       ├── sample1.linx.driver.catalog.tsv
│       ├── sample1.linx.drivers.tsv
│       ├── sample1.linx.fusion.tsv
│       ├── sample1.linx.links.tsv
│       ├── sample1.linx.svs.tsv
│       ├── sample1.linx.vis_copy_number.tsv
│       ├── sample1.linx.vis_fusion.tsv
│       ├── sample1.linx.vis_gene_exon.tsv
│       ├── sample1.linx.vis_protein_domain.tsv
│       ├── sample1.linx.vis_segments.tsv
│       └── sample1.linx.vis_sv_data.tsv
├── purple
│   ├── purple.version
│   ├── sample1.purple.cnv.gene.tsv
│   ├── sample1.purple.cnv.somatic.tsv
│   ├── sample1.purple.driver.catalog.germline.tsv
│   ├── sample1.purple.driver.catalog.somatic.tsv
│   ├── sample1.purple.germline.deletion.tsv
│   ├── sample1.purple.purity.range.tsv
│   ├── sample1.purple.purity.tsv
│   ├── sample1.purple.qc
│   ├── sample1.purple.somatic.clonality.tsv
│   └── sample1.purple.somatic.hist.tsv
├── sage
│   ├── germline
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample2.sage.bqr.tsv
│   │   ├── sample2.sage.exon.medians.tsv
│   │   └── sample2.sage.gene.coverage.tsv
│   └── somatic
│       ├── sample1.sage.bqr.tsv
│       ├── sample1.sage.exon.medians.tsv
│       ├── sample1.sage.gene.coverage.tsv
│       └── sample2.sage.bqr.tsv
├── sigs
│   ├── sample1.sig.allocation.tsv
│   └── sample1.sig.snv_counts.csv
├── virusbreakend
│   └── sample1.virusbreakend.vcf.summary.tsv
└── virusinterpreter
    └── sample1.virus.annotated.tsv
We can parse, tidy up, and write the WiGiTS results into e.g. Parquet format or a PostgreSQL database as follows:
- Parquet:
library(tidywigits) # provides the Oncoanalyser class
in_dir <- system.file("extdata/oa", package = "tidywigits")
out_dir <- tempdir() |> fs::dir_create("parquet_example")
oa <- Oncoanalyser$new(in_dir)
res <- oa$nemofy(odir = out_dir, format = "parquet", id = "parquet_example")
# list the files written to the output directory
fs::dir_info(out_dir) |>
  dplyr::mutate(bname = basename(.data$path)) |>
  dplyr::select("bname", "size", "type")
# A tibble: 64 × 3
   bname                                              size type
   <chr>                                       <fs::bytes> <fct>
 1 sample1_2_sage_bqrtsv.parquet                      3.1K file
 2 sample1_alignments_dupfreq.parquet                1.95K file
 3 sample1_amber_bafpcf.parquet                      3.27K file
 4 sample1_amber_contaminationtsv.parquet            4.13K file
 5 sample1_amber_homozygousregion.parquet            3.18K file
 6 sample1_amber_qc.parquet                          2.35K file
 7 sample1_bamtools_wgsmetrics_histo.parquet         4.19K file
 8 sample1_bamtools_wgsmetrics_metrics.parquet      10.12K file
 9 sample1_chord_prediction.parquet                  3.43K file
10 sample1_chord_signatures.parquet                  2.17K file
# ℹ 54 more rows
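Once the Parquet files are written, they can be loaded back into R for downstream analysis. The sketch below is one possible way to do that, assuming the {arrow} package is installed. read_parquet_dir() is a hypothetical helper (not part of {tidywigits}), and the demo writes its own small mock file into a temporary directory rather than relying on the output above.

```r
# Hypothetical helper: read every Parquet file in `dir` into a named list
# of data frames, using the file name (minus extension) as the list name.
read_parquet_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.parquet$", full.names = TRUE)
  out <- lapply(files, arrow::read_parquet)
  names(out) <- sub("\\.parquet$", "", basename(files))
  out
}

# Only run the demo if {arrow} is available (assumption: it may not be).
if (requireNamespace("arrow", quietly = TRUE)) {
  # Stand-in for `out_dir`: a temp directory with one mock Parquet file.
  demo_dir <- file.path(tempdir(), "parquet_readback_demo")
  dir.create(demo_dir, showWarnings = FALSE)
  arrow::write_parquet(
    data.frame(sample = "sample1", value = 1), # mock data, not real output
    file.path(demo_dir, "sample1_mock_table.parquet")
  )
  tbls <- read_parquet_dir(demo_dir)
  print(names(tbls))
}
```

In practice you would point read_parquet_dir() at the out_dir used in the nemofy() call instead of the demo directory.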
- PostgreSQL:
library(tidywigits) # provides the Oncoanalyser class
in_dir <- system.file("extdata/oa", package = "tidywigits")
oa <- Oncoanalyser$new(in_dir)
# connect to an existing PostgreSQL database
dbconn <- DBI::dbConnect(
  drv = RPostgres::Postgres(),
  dbname = "nemo",
  user = "orcabus"
)
res <- oa$nemofy(
  format = "db",
  id = "db_example",
  dbconn = dbconn
)
IMPORTANT: support for VCFs is under active development.
Installation
Using {remotes} directly from GitHub:
install.packages("remotes")
remotes::install_github("umccr/tidywigits") # latest main commit
remotes::install_github("umccr/tidywigits@v0.0.3") # released version
Alternatively:
- conda package: https://anaconda.org/umccr/r-tidywigits
- Docker image: https://github.com/umccr/tidywigits/pkgs/container/tidywigits
For more details see: https://umccr.github.io/tidywigits/articles/installation
CLI
A tidywigits.R command line interface is available for convenience.
- If you're using the conda package, the tidywigits.R command will already be available inside the activated conda environment.
- If you're not using the conda package, you need to add the tidywigits/inst/cli/ directory to your PATH in order to use tidywigits.R.
tw_cli=$(Rscript -e 'x = system.file("cli", package = "tidywigits"); cat(x, "\n")' | xargs)
export PATH="${tw_cli}:${PATH}"
$ tidywigits.R --version
tidywigits.R 0.0.3
#-----------------------------------#
$ tidywigits.R --help
usage: tidywigits.R [-h] [-v] {tidy,list} ...
WiGiTS Output Tidying
positional arguments:
  {tidy,list}    sub-command help
    tidy         Tidy WiGiTS Workflow Outputs
    list         List Parsable WiGiTS Workflow Outputs
options:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
#-----------------------------------#
#------- Tidy ----------------------#
$ tidywigits.R tidy --help
usage: tidywigits.R tidy [-h] -d IN_DIR [-o OUT_DIR] [-f FORMAT] -i ID
                         [--dbname DBNAME] [--dbuser DBUSER]
                         [--include INCLUDE] [--exclude EXCLUDE] [-q]
options:
  -h, --help            show this help message and exit
  -d IN_DIR, --in_dir IN_DIR
                        Input directory.
  -o OUT_DIR, --out_dir OUT_DIR
                        Output directory.
  -f FORMAT, --format FORMAT
                        Format of output (def: parquet). Choices: parquet,
                        db, tsv, csv, rds
  -i ID, --id ID        ID to use for this run.
  --dbname DBNAME       Database name.
  --dbuser DBUSER       Database user.
  --include INCLUDE     Include only these files (comma,sep).
  --exclude EXCLUDE     Exclude these files (comma,sep).
  -q, --quiet           Shush all the logs.
#-----------------------------------#
#------- List ----------------------#
$ tidywigits.R list --help
usage: tidywigits.R list [-h] -d IN_DIR [-q]
options:
  -h, --help            show this help message and exit
  -d IN_DIR, --in_dir IN_DIR
                        Input directory.
  -q, --quiet           Shush all the logs.