Structure • nemo

{nemo} is built on top of R’s R6 encapsulated object-oriented programming implementation, which helps with code organisation. It consists of several base classes, such as Config, Tool and Workflow, which we describe below. Each R6 class can contain public and private members: functions (methods) and non-functions (fields).
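To make the public/private distinction concrete, here is a minimal sketch of an R6 class. The Counter class and its members are hypothetical and not part of {nemo}; the example only assumes the {R6} package is installed.

```r
library(R6)

# Hypothetical toy class (not part of {nemo}) showing public/private members.
Counter <- R6::R6Class(
  "Counter",
  public = list(
    # public field
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    # public method: increments the private counter, returns self invisibly
    add = function(n = 1) {
      private$count <- private$count + n
      invisible(self)
    },
    # public method: read-only access to the private field
    total = function() {
      private$count
    }
  ),
  private = list(
    # private field: only reachable from inside the class via private$
    count = 0
  )
)

x <- Counter$new("demo")
x$add(2)$add(3)
x$total()
x$name
is.null(x$count) # private members are not visible via `$`
```

Public members are reached with `$` from outside the object, while private members are only reachable from within methods via `private$`.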

Other R packages like {tidywigits} and {dracarys} can create their own Tool and Workflow child classes that inherit (or override) functions from the {nemo} parent classes. This allows for custom parsers and tidiers for specific bioinformatic tools and workflows.
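The inherit-or-override mechanism can be sketched with two toy R6 classes. BaseTool and MyTool are hypothetical names used for illustration only; they are not the {nemo} Tool class or any of its children.

```r
library(R6)

# Hypothetical parent with a generic parser; names are illustrative only.
BaseTool <- R6::R6Class(
  "BaseTool",
  public = list(
    name = NULL,
    initialize = function(name) {
      self$name <- name
    },
    parse = function(x) {
      # generic behaviour: split a tab-delimited line into fields
      strsplit(x, "\t", fixed = TRUE)[[1]]
    },
    describe = function() {
      paste("Tool", self$name)
    }
  )
)

# Child class: inherits describe(), overrides initialize() and parse()
MyTool <- R6::R6Class(
  "MyTool",
  inherit = BaseTool,
  public = list(
    initialize = function() {
      super$initialize(name = "mytool")
    },
    parse = function(x) {
      # reuse the parent parser, then apply tool-specific tidying
      tolower(super$parse(x))
    }
  )
)

mt <- MyTool$new()
mt$describe()
mt$parse("SampleID\tChromosome")
```

The child only defines what differs from the parent; everything else is inherited, and `super$` gives access to the parent implementation when an override wants to extend rather than replace it.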

Here we use the Tool1 and Workflow1 classes as examples to illustrate the structure of the package.

`Config`

A Config object contains functionality for interacting with YAML configuration files (under inst/config) that specify the schemas, types, patterns and field descriptions for the raw input files and tidy output tbls. See ?Config.

raw

Let’s look at some of the information for the raw Tool1 config, for instance:

tool <- params$tool
workflow <- params$workflow
conf <- Config$new(tool, pkg = "nemo")
conf

## #--- Config tool1 ---#
## 
## |var   |value |
## |:-----|:-----|
## |tool  |tool1 |
## |nraw  |4     |
## |ntidy |4     |

Individual fields can be accessed in the usual R list-like manner, with the $ operator.

Patterns are used to fish out the relevant files from a directory listing.

conf$get_raw_patterns() |>
  knitr::kable(caption = glue("{tool} raw file patterns."))

Tool1 raw file patterns.

|name   |value                 |
|:------|:---------------------|
|table1 |\.tool1\.table1\.tsv$ |
|table2 |\.tool1\.table2\.tsv$ |
|table3 |\.tool1\.table3\.tsv$ |
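The patterns are regular expressions matched against file names. A minimal base R sketch of how such patterns could pick files out of a directory listing follows; the file names are made up for illustration, and this is not necessarily how {nemo} performs the matching internally.

```r
# Patterns for the three Tool1 tables (backslashes escaped for R strings).
patterns <- c(
  table1 = "\\.tool1\\.table1\\.tsv$",
  table2 = "\\.tool1\\.table2\\.tsv$",
  table3 = "\\.tool1\\.table3\\.tsv$"
)

# Hypothetical directory listing; in practice this could come from list.files().
files <- c(
  "sampleA.tool1.table1.tsv",
  "sampleA.tool1.table2.tsv",
  "sampleA.tool1.table3.tsv",
  "sampleA.tool1.table1.tsv.bak",
  "README.md"
)

# For each parser, keep the files whose name matches its pattern.
matched <- lapply(patterns, function(p) grep(p, files, value = TRUE))
matched
```

The trailing `$` anchors the match to the end of the file name, so the `.bak` file is not picked up.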

File descriptions are based on the available open source documentation.

conf$get_raw_descriptions() |>
  knitr::kable(caption = glue("{tool} raw file descriptions."))

Tool1 raw file descriptions.

|name   |value             |
|:------|:-----------------|
|table1 |Table1 for tool1. |
|table2 |Table2 for tool1. |
|table3 |Table3 for tool1. |

Versions are used to distinguish changes in schema between individual tool versions. For example, metrics Y and Z were added to table1 after Tool1 v1.2.3, which is reflected in the available schemas. For now, latest is the default version and corresponds to the most recently tested schema; any file whose schema differs is labelled with the version of the tool that generated it.

conf$get_raw_versions() |>
  knitr::kable(caption = glue("{tool} raw file versions."))

Tool1 raw file versions.

|name   |value  |
|:------|:------|
|table1 |v1.2.3 |
|table1 |latest |
|table2 |latest |
|table3 |latest |

The raw schemas specify the column name and type (e.g. character (c), integer (i), float/double (d)) for each input file:

(s <- conf$get_raw_schemas_all())

## # A tibble: 4 × 3
##   name   version schema          
##   <chr>  <chr>   <list>          
## 1 table1 v1.2.3  <tibble [5 × 2]>
## 2 table1 latest  <tibble [7 × 2]>
## 3 table2 latest  <tibble [3 × 2]>
## 4 table3 latest  <tibble [2 × 2]>

s |>
  dplyr::filter(name == "table1", version == "v1.2.3") |>
  dplyr::select("schema") |>
  tidyr::unnest("schema")

## # A tibble: 5 × 2
##   field      type 
##   <chr>      <chr>
## 1 SampleID   c    
## 2 Chromosome c    
## 3 Start      i    
## 4 End        i    
## 5 metricX    d
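One plausible use of versioned schemas is to work out which version produced a file by comparing its header with each schema. The sketch below is an assumption about how this could be done, not a description of the {nemo} internals: guess_version is a hypothetical helper, and the raw names metricY and metricZ for the two extra fields in latest are assumed.

```r
# Hypothetical schemas per version. The v1.2.3 fields follow the output above;
# the two extra 'latest' field names (metricY, metricZ) are assumed.
schemas <- list(
  "v1.2.3" = c("SampleID", "Chromosome", "Start", "End", "metricX"),
  "latest" = c("SampleID", "Chromosome", "Start", "End", "metricX", "metricY", "metricZ")
)

# Return the name of the first version whose fields exactly match the header,
# or NA if none match.
guess_version <- function(header, schemas) {
  hit <- vapply(schemas, function(s) identical(s, header), logical(1))
  if (any(hit)) names(schemas)[which(hit)[1]] else NA_character_
}

header_old <- strsplit(
  "SampleID\tChromosome\tStart\tEnd\tmetricX", "\t", fixed = TRUE
)[[1]]
guess_version(header_old, schemas)
guess_version(c("foo", "bar"), schemas)
```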

tidy

Now let’s look at some of the information in the tidy Tool1 config. The difference between raw and tidy configs is mostly in the column names (they are standardised to lowercase separated by underscores, i.e. snake_case), and some raw files get split into multiple tidy tables (e.g. for normalisation purposes).

Tidy descriptions are the same as the raw descriptions for now.

conf$get_tidy_descriptions() |>
  knitr::kable(caption = glue("{tool} tidy file descriptions."))

Tool1 tidy file descriptions.

|name   |value             |
|:------|:-----------------|
|table1 |Table1 for tool1. |
|table2 |Table2 for tool1. |
|table3 |Table3 for tool1. |

(s <- conf$get_tidy_schemas_all())

## # A tibble: 4 × 4
##   name   version tbl   schema          
##   <chr>  <chr>   <chr> <list>          
## 1 table1 v1.2.3  tbl1  <tibble [5 × 3]>
## 2 table1 latest  tbl1  <tibble [7 × 3]>
## 3 table2 latest  tbl1  <tibble [3 × 3]>
## 4 table3 latest  tbl1  <tibble [5 × 3]>

s |>
  dplyr::filter(name == "table1", version == "v1.2.3") |>
  dplyr::select("schema") |>
  tidyr::unnest("schema")

## # A tibble: 5 × 3
##   field      type  description   
##   <chr>      <chr> <chr>         
## 1 sample_id  c     sample ID     
## 2 chromosome c     chromosome    
## 3 start      i     start position
## 4 end        i     end position  
## 5 metric_x   d     metric X
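The renaming from raw to tidy column names can be illustrated with a small base R helper. This is a minimal sketch assuming a simple regex-based conversion; to_snake_case is a hypothetical name, and {nemo} may standardise names differently (for example through explicit mappings in its YAML configs).

```r
# Hypothetical helper: convert CamelCase/mixedCase names to snake_case.
to_snake_case <- function(x) {
  # underscore between a lowercase letter or digit and an uppercase letter
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x, perl = TRUE)
  # underscore between a run of capitals and a capitalised word
  x <- gsub("([A-Z]+)([A-Z][a-z])", "\\1_\\2", x, perl = TRUE)
  tolower(x)
}

raw_names <- c("SampleID", "Chromosome", "Start", "End", "metricX")
tidy_names <- to_snake_case(raw_names)
tidy_names
```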

`Tool`

Tool is the main organisation class for all file parsers and tidiers. It contains functions for parsing and tidying typical CSV/TSV files (with column names), and TXT files where the column names are missing. Currently it uses the simple readr::read_delim function from the {readr} package, which reads all the data into memory. See ?Tool.

These simple parsers cover 80-90% of cases. If parsing becomes a bottleneck, it could be optimised in the future with faster packages such as {data.table}, {duckdb-r}/{duckplyr} or {r-polars}.
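As a rough illustration of how the single-letter schema types could drive parsing, the type codes can be collapsed into a compact {readr} column specification. This is a sketch under that assumption and not necessarily how Tool is implemented; the example file is written to a temporary location and its contents are made up.

```r
library(readr)

# Raw schema for table1 v1.2.3, as listed earlier in this document.
schema <- data.frame(
  field = c("SampleID", "Chromosome", "Start", "End", "metricX"),
  type = c("c", "c", "i", "i", "d")
)

# Write a small example TSV to a temporary file.
tsv <- tempfile(fileext = ".tsv")
writeLines(
  c(
    paste(schema$field, collapse = "\t"),
    "sampleA\tchr1\t10\t50\t0.1",
    "sampleA\tchr2\t100\t500\t0.2"
  ),
  tsv
)

# Collapse the type codes into a compact readr column specification ("cciid").
col_spec <- paste(schema$type, collapse = "")
d <- readr::read_delim(tsv, delim = "\t", col_types = col_spec)
str(d)
```

Supplying the column types up front avoids type guessing, so integer columns such as Start and End are read as integers rather than doubles.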

Child classes of Tool inherit (or override) functions and fields from the Tool parent class. For example, we can initialise a Tool1 object as follows:

tool1_path <- system.file("extdata/tool1", package = "nemo")
t1 <- nemo::Tool1$new(path = tool1_path)
# each class comes with a print function
t1

## #--- Tool tool1 ---#
## 
## |var    |value                                                                     |
## |:------|:-------------------------------------------------------------------------|
## |name   |tool1                                                                     |
## |path   |/home/runner/miniconda3/envs/pkgdown_env/lib/R/library/nemo/extdata/tool1 |
## |files  |4                                                                         |
## |tidied |FALSE                                                                     |

Its Config object is also constructed based on the name supplied - this is used internally to find files of interest and infer their schemas:

t1$config

## #--- Config tool1 ---#
## 
## |var   |value |
## |:-----|:-----|
## |tool  |tool1 |
## |nraw  |4     |
## |ntidy |4     |

t1$config$get_raw_patterns()

## # A tibble: 3 × 2
##   name   value                     
##   <chr>  <chr>                     
## 1 table1 "\\.tool1\\.table1\\.tsv$"
## 2 table2 "\\.tool1\\.table2\\.tsv$"
## 3 table3 "\\.tool1\\.table3\\.tsv$"

t1$config$get_raw_schema("table1", v = "v1.2.3")

## # A tibble: 5 × 2
##   field      type 
##   <chr>      <chr>
## 1 SampleID   c    
## 2 Chromosome c    
## 3 Start      i    
## 4 End        i    
## 5 metricX    d

# t1$config$get_raw_schema("table1", v = "latest") # default
t1$config$get_tidy_schema("table1", v = "v1.2.3")

## # A tibble: 5 × 3
##   field      type  description   
##   <chr>      <chr> <chr>         
## 1 sample_id  c     sample ID     
## 2 chromosome c     chromosome    
## 3 start      i     start position
## 4 end        i     end position  
## 5 metric_x   d     metric X

# t1$config$get_tidy_schema("table1", v = "latest") # default

We can list files that can be parsed with list_files():

(lf <- t1$list_files())

## # A tibble: 4 × 10
##   tool_parser  parser bname             size lastmodified        path  pattern prefix schema   group
##   <glue>       <chr>  <chr>            <fs:> <dttm>              <chr> <chr>   <glue> <list>   <glu>
## 1 tool1_table1 table1 sampleA.tool1.t…   113 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 2 tool1_table1 table1 sampleA.tool1.t…   153 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble> _2   
## 3 tool1_table2 table2 sampleA.tool1.t…    70 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 4 tool1_table3 table3 sampleA.tool1.t…    83 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>

lf |> dplyr::slice(1) |> str()

## tibble [1 × 10] (S3: tbl_df/tbl/data.frame)
##  $ tool_parser : 'glue' chr "tool1_table1"
##  $ parser      : chr "table1"
##  $ bname       : chr "sampleA.tool1.table1.tsv"
##  $ size        : 'fs_bytes' num 113
##  $ lastmodified: POSIXct[1:1], format: "2025-09-07 13:03:45"
##  $ path        : chr "/home/runner/miniconda3/envs/pkgdown_env/lib/R/library/nemo/extdata/tool1/v1.2.3/sampleA.tool1.table1.tsv"
##  $ pattern     : chr "\\.tool1\\.table1\\.tsv$"
##  $ prefix      : 'glue' chr "sampleA"
##  $ schema      :List of 1
##   ..$ : tibble [7 × 2] (S3: tbl_df/tbl/data.frame)
##   .. ..$ field: chr [1:7] "SampleID" "Chromosome" "Start" "End" ...
##   .. ..$ type : chr [1:7] "c" "c" "i" "i" ...
##  $ group       : 'glue' chr ""

We can parse and tidy files of interest using the tidy function. Note that this function is called for its side effect: it modifies the object in place, so its result does not need to be assigned:

# this will create a new field tbls containing the tidy data (and optionally
# the 'raw' parsed data)
t1$tidy(tidy = TRUE, keep_raw = TRUE)
t1$tbls

## # A tibble: 4 × 11
##   tool_parser  parser bname    size lastmodified        path  pattern prefix group raw      tidy    
##   <glue>       <chr>  <chr>   <fs:> <dttm>              <chr> <chr>   <glue> <glu> <list>   <list>  
## 1 tool1_table1 table1 sample…   113 2025-09-07 13:03:45 /hom… "\\.to… sampl…       <tibble> <tibble>
## 2 tool1_table1 table1 sample…   153 2025-09-07 13:03:45 /hom… "\\.to… sampl… _2    <tibble> <tibble>
## 3 tool1_table2 table2 sample…    70 2025-09-07 13:03:45 /hom… "\\.to… sampl…       <tibble> <tibble>
## 4 tool1_table3 table3 sample…    83 2025-09-07 13:03:45 /hom… "\\.to… sampl…       <tibble> <tibble>

t1$tbls$raw[[1]] |> dplyr::glimpse()

## Rows: 3
## Columns: 5
## $ SampleID   <chr> "sampleA", "sampleA", "sampleA"
## $ Chromosome <chr> "chr1", "chr2", "chr3"
## $ Start      <int> 10, 100, 1000
## $ End        <int> 50, 500, 5000
## $ metricX    <dbl> 0.1, 0.2, 0.3

# the tidy tibbles are nested to allow for more than one tidy tibble per file
t1$tbls$tidy[[1]][["data"]][[1]] |> dplyr::glimpse()

## Rows: 3
## Columns: 5
## $ sample_id  <chr> "sampleA", "sampleA", "sampleA"
## $ chromosome <chr> "chr1", "chr2", "chr3"
## $ start      <int> 10, 100, 1000
## $ end        <int> 50, 500, 5000
## $ metric_x   <dbl> 0.1, 0.2, 0.3
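The in-place behaviour of tidy() seen above follows from the reference semantics of R6 objects. A minimal sketch with a hypothetical Box class (not part of {nemo}) shows the effect:

```r
library(R6)

# Hypothetical toy class: run() stores its result in a field of the object.
Box <- R6::R6Class(
  "Box",
  public = list(
    tbls = NULL,
    run = function() {
      self$tbls <- data.frame(x = 1:3)
      invisible(self)
    }
  )
)

b <- Box$new()
alias <- b        # same object, not a copy
b$run()           # no assignment needed; the object is modified in place
nrow(alias$tbls)  # the change is visible through every reference
copy <- b$clone() # an independent copy requires an explicit clone()
copy$tbls <- NULL
is.null(b$tbls)   # the original is unaffected by changes to the clone
```

This is why a fresh object is created whenever an independent example is needed, rather than reusing one that has already been filtered or tidied.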

We can also focus on a subset of files to tidy using the filter_files() function. The include and exclude arguments can specify which parsers to include or exclude in the analysis:

# create new Tool1 object
t2 <- nemo::Tool1$new(path = tool1_path)
t2$files

## # A tibble: 4 × 10
##   tool_parser  parser bname             size lastmodified        path  pattern prefix schema   group
##   <glue>       <chr>  <chr>            <fs:> <dttm>              <chr> <chr>   <glue> <list>   <glu>
## 1 tool1_table1 table1 sampleA.tool1.t…   113 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 2 tool1_table1 table1 sampleA.tool1.t…   153 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble> _2   
## 3 tool1_table2 table2 sampleA.tool1.t…    70 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 4 tool1_table3 table3 sampleA.tool1.t…    83 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>

t2$filter_files(include = c("tool1_table2", "tool1_table3"))
t2$files

## # A tibble: 2 × 10
##   tool_parser  parser bname             size lastmodified        path  pattern prefix schema   group
##   <glue>       <chr>  <chr>            <fs:> <dttm>              <chr> <chr>   <glue> <list>   <glu>
## 1 tool1_table2 table2 sampleA.tool1.t…    70 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 2 tool1_table3 table3 sampleA.tool1.t…    83 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>

After tidying the data of interest, we can write the tidy tibbles to various formats, like Apache Parquet, PostgreSQL, CSV/TSV and R’s RDS. Below we can see that the id specified is added to the written files in an additional nemo_id column. This can be used e.g. to distinguish results from different runs in a data pipeline. When writing to a database like PostgreSQL, another column nemo_pfix is used to distinguish results from the same run from the same tool.

t2$tidy() # first need to tidy
outdir1 <- tempdir()
fmt <- "csv"
t2$write(odir = outdir1, format = fmt, id = "run123")
wfiles <- fs::dir_info(outdir1) |> dplyr::select(1:5)
wfiles |>
  dplyr::mutate(bname = basename(.data$path)) |>
  dplyr::select("bname", "size", "type")

## # A tibble: 18 × 3
##    bname                                size type 
##    <chr>                         <fs::bytes> <fct>
##  1 filefff15044b64                     4.71K file 
##  2 filefff199777b                      4.71K file 
##  3 filefff1c267643                     4.71K file 
##  4 filefff212c0b9f                     4.71K file 
##  5 filefff244c73ec                     4.71K file 
##  6 filefff2e5423f0                     4.71K file 
##  7 filefff49e685f7                     4.71K file 
##  8 filefff608d94c3                     4.71K file 
##  9 filefff625b69f5                     4.71K file 
## 10 filefff6766b1f5                     4.71K file 
## 11 filefff69c53be3                     4.71K file 
## 12 filefff72df5f5b                     4.71K file 
## 13 filefff916ac26                      4.71K file 
## 14 filefff9763ad7                      4.71K file 
## 15 filefffce755cd                      4.71K file 
## 16 rmarkdown-strfff35ffc4b1.html       1.12K file 
## 17 sampleA_tool1_table2.csv.gz            80 file 
## 18 sampleA_tool1_table3.csv.gz            92 file

# readr::read_csv(wfiles$path[1], show_col_types = F) # see bug #137

The nemofy function is a convenient wrapper for the process of filtering, tidying, and writing.

t3 <- nemo::Tool1$new(path = tool1_path)
outdir2 <- file.path(tempdir(), "t3") |> fs::dir_create()
t3$files

## # A tibble: 4 × 10
##   tool_parser  parser bname             size lastmodified        path  pattern prefix schema   group
##   <glue>       <chr>  <chr>            <fs:> <dttm>              <chr> <chr>   <glue> <list>   <glu>
## 1 tool1_table1 table1 sampleA.tool1.t…   113 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 2 tool1_table1 table1 sampleA.tool1.t…   153 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble> _2   
## 3 tool1_table2 table2 sampleA.tool1.t…    70 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 4 tool1_table3 table3 sampleA.tool1.t…    83 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>

t3$nemofy(
  odir = outdir2,
  format = "tsv",
  id = "run_t3"
)
wfiles2 <- fs::dir_info(outdir2) |>
  dplyr::mutate(bname = basename(.data$path))
wfiles2 |>
  dplyr::select("bname", "size", "type")

## # A tibble: 4 × 3
##   bname                                size type 
##   <chr>                         <fs::bytes> <fct>
## 1 sampleA_2_tool1_table1.tsv.gz         124 file 
## 2 sampleA_tool1_table1.tsv.gz           105 file 
## 3 sampleA_tool1_table2.tsv.gz            80 file 
## 4 sampleA_tool1_table3.tsv.gz            92 file

readr::read_tsv(wfiles2$path[2], show_col_types = FALSE)

## # A tibble: 3 × 6
##   nemo_id sample_id chromosome start   end metric_x
##   <chr>   <chr>     <chr>      <dbl> <dbl>    <dbl>
## 1 run_t3  sampleA   chr1          10    50      0.1
## 2 run_t3  sampleA   chr2         100   500      0.2
## 3 run_t3  sampleA   chr3        1000  5000      0.3
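To make the role of the nemo_id column concrete, the following base R sketch prepends an id column to a table and writes it as a gzip-compressed TSV. The write_with_id helper is hypothetical and is not the {nemo} writer; the table contents and file name are made up.

```r
# Hypothetical helper (not the {nemo} implementation): prepend an id column
# and write a gzip-compressed TSV.
write_with_id <- function(d, path, id) {
  out <- cbind(nemo_id = id, d)
  con <- gzfile(path, open = "wt")
  on.exit(close(con))
  write.table(out, con, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(path)
}

tidy_tbl <- data.frame(
  sample_id = "sampleA",
  chromosome = c("chr1", "chr2"),
  start = c(10L, 100L)
)
f <- file.path(tempdir(), "sampleA_tool1_demo.tsv.gz")
write_with_id(tidy_tbl, f, id = "run123")
back <- read.delim(f) # read.delim reads gzip-compressed files transparently
back
```

Because every row carries the same id, tables written by different runs can later be stacked and still be told apart.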

`Workflow`

A Workflow consists of a list of one or more Tools. A Workflow can combine different Tools, which allows parsing and writing tidy tables from a variety of bioinformatic tools. See ?Workflow.

For example, {nemo} contains a Workflow1 class as a Workflow child (containing only a single Tool1 for simplicity). Similarly to Tool, a Workflow object contains functions such as filter_files, list_files, tidy, write and nemofy:

w <- system.file("extdata/tool1", package = "nemo") |>
  nemo::Workflow1$new()
outdir3 <- file.path(tempdir(), "oa") |> fs::dir_create()
w$list_files()

## # A tibble: 4 × 10
##   tool_parser  parser bname             size lastmodified        path  pattern prefix schema   group
##   <glue>       <chr>  <chr>            <fs:> <dttm>              <chr> <chr>   <glue> <list>   <glu>
## 1 tool1_table1 table1 sampleA.tool1.t…   113 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 2 tool1_table1 table1 sampleA.tool1.t…   153 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble> _2   
## 3 tool1_table2 table2 sampleA.tool1.t…    70 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>      
## 4 tool1_table3 table3 sampleA.tool1.t…    83 2025-09-07 13:03:45 /hom… "\\.to… sampl… <tibble>

x <- w$nemofy(
  odir = outdir3,
  format = "tsv",
  id = "workflow1_run1"
)
wfiles3 <- fs::dir_info(outdir3) |>
  dplyr::select(1:5) |>
  dplyr::mutate(bname = basename(.data$path))
wfiles3 |>
  dplyr::select("bname", "size", "type")

## # A tibble: 4 × 3
##   bname                                size type 
##   <chr>                         <fs::bytes> <fct>
## 1 sampleA_2_tool1_table1.tsv.gz         133 file 
## 2 sampleA_tool1_table1.tsv.gz           111 file 
## 3 sampleA_tool1_table2.tsv.gz            88 file 
## 4 sampleA_tool1_table3.tsv.gz           100 file

readr::read_tsv(wfiles3$path[1], show_col_types = FALSE)

## # A tibble: 3 × 8
##   nemo_id        sample_id chromosome start   end metric_x metric_y metric_z
##   <chr>          <chr>     <chr>      <dbl> <dbl>    <dbl>    <dbl>    <dbl>
## 1 workflow1_run1 sampleA   chr1          10    50      0.1      0.4      0.7
## 2 workflow1_run1 sampleA   chr2         100   500      0.2      0.5      0.8
## 3 workflow1_run1 sampleA   chr3        1000  5000      0.3      0.6      0.9
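The composition described in this section, a workflow holding a list of tools and delegating to each, can be sketched with two toy R6 classes. MiniTool and MiniWorkflow are hypothetical and far simpler than the {nemo} classes; the file names and the second tool are made up.

```r
library(R6)

# Hypothetical minimal tool: knows its name and the files it can handle.
MiniTool <- R6::R6Class(
  "MiniTool",
  public = list(
    name = NULL,
    files = NULL,
    initialize = function(name, files) {
      self$name <- name
      self$files <- files
    },
    list_files = function() {
      data.frame(tool = self$name, bname = self$files)
    }
  )
)

# Hypothetical minimal workflow: holds a list of tools and delegates to each.
MiniWorkflow <- R6::R6Class(
  "MiniWorkflow",
  public = list(
    tools = NULL,
    initialize = function(tools) {
      self$tools <- tools
    },
    list_files = function() {
      # call list_files() on every tool and stack the results
      do.call(rbind, lapply(self$tools, function(t) t$list_files()))
    }
  )
)

w <- MiniWorkflow$new(list(
  MiniTool$new("tool1", c("a.tool1.table1.tsv", "a.tool1.table2.tsv")),
  MiniTool$new("tool2", "a.tool2.metrics.tsv")
))
res <- w$list_files()
res
```

The workflow exposes the same method name as its tools, so callers can treat a single tool and a whole workflow in the same way.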