The PCGR software comes with a range of options to configure the analyses performed and to customize the output report.
Key settings
Below, we will outline key settings that are important with respect to the sequencing assay that has been conducted.
Whole-genome, whole-exome or targeted sequencing assay
By default, PCGR expects that the input VCF comes from whole-exome
sequencing. This can be adjusted with the assay
option,
which takes three alternative values:
--assay <WGS|WES|TARGETED>
If the input VCF comes from a targeted sequencing assay/panel
(i.e. --assay TARGETED
), the target size of the assay/panel
should be adjusted accordingly, which is critical for tumor mutational
burden estimation. Assay target size, reflecting the protein-coding
portion of the assayed genomic region in megabases, can be
configured through the following option:
--target_size_mb <value>
Tumor-control vs. tumor-only
By default, PCGR expects that the input VCF contains somatic variants identified from a tumor-control sequencing setup. This implies that the VCF contains information with respect to variant allelic depth/support both for the tumor sample and the corresponding control sample.
If the input VCF comes from a tumor-only assay, turn
on the --tumor_only
option. In this mode, PCGR conducts a
set of successive filtering steps on the raw input set of variants,
aiming to exclude the majority of germline variants from the tumor-only
input set. In addition to default filtering applied against variants
found in the gnomAD/1000 Genomes Project databases (population-specific
allele frequency thresholds can be configured, see below), additional
filtering procedures can be explicitly set, i.e.:
-
--exclude_dbsnp_nonsomatic
(recommended) --exclude_likely_het_germline
--exclude_likely_hom_germline
--exclude_nonexonic
If the user has provided a panel-of-normals VCF file
as input(--pon_vcf
), one may also opt to remove variants
found in that variant collection using a dedicated option:
--exclude_pon
IMPORTANT NOTE 1: Note that a number of the analyses available in PCGR is not particularly well-suited for tumor-only input, e.g. mutational signature analysis and MSI status prediction. IMPORTANT NOTE 2: If you run PCGR on tumor-only WGS assays, we strongly recommend that you pre-filter your VCF against the exome, since we have frequently encountered that the large unfiltered variant set from such assays will cause a crash in PCGR. See also this issue.
Sample properties
A few properties of the tumor sample can be fed to the report, currently only used for display/reporting:
-
--tumor_ploidy <value>
- ploidy estimate -
--tumor_purity <value>
- purity estimate -
--cell_line
- cell line sample
Allelic support
Proper designation of INFO tags in the input VCF that denote variant depth/allelic fraction is critical to make the report as comprehensive as possible. See the input section on how to accomplish this.
Several options are used to inform PCGR about tags in the VCF INFO field that encodes this information:
--tumor_dp_tag <value>
--tumor_af_tag <value>
--control_dp_tag <value>
--control_af_tag <value>
--call_conf_tag <value>
If these tags are set correctly, one may set thresholds (sequencing depth and/or allelic fraction) on the variants to be included in the report, i.e. through:
--tumor_dp_min <value>
--tumor_af_min <value>
--control_dp_min <value>
--control_af_max <value>
Tumor site
The primary tissue/site of the tumor sample that is analyzed can and should be be provided, through the following option:
--tumor_site <value>
This option takes a value between 0 and 30, reflecting the main primary sites/tissues for human cancers. Note that the default value is 0 (implies Any tissue/site), which essentially prohibits the presence of tier 1 variants occurring in the report. If you want to check the nature of cancer subtypes attributed to each of the primary tissue/sites, take a look at the table with phenotype terms and associated primary sites.
Large input sets (VCF)
For input VCF files with > 500,000 variants, note that these will be subject to filtering prior to the final step in PCGR (step 4: reporting). For such scenarios, raw input sets will be filtered according to consequence type, in the sense that intergenic/intronic/upstream/downstream gene variants will be excluded prior to reporting. This is a necessary step to avoid memory issues etc. during processing with the PCGR R package.
If you have a large input VCF, and have sufficient memory capacity on
your compute platform, we also recommend to increase the VEP
buffer size (option --vep_buffer_size
), as this will
speed up the VEP processing significantly.
All options
A tumor sample report is generated by running the pcgr command, which takes the following arguments and options:
usage:
pcgr -h [options]
--input_vcf <INPUT_VCF>
--pcgr_dir <PCGR_DIR>
--output_dir <OUTPUT_DIR>
--genome_assembly <GENOME_ASSEMBLY>
--sample_id <SAMPLE_ID>
Personal Cancer Genome Reporter (PCGR) workflow for clinical interpretation of somatic nucleotide variants and copy number aberration segments
Required arguments:
--input_vcf INPUT_VCF - VCF input file with somatic variants in tumor sample, SNVs/InDels
--pcgr_dir PCGR_DIR - PCGR base directory with accompanying data directory, e.g. ~/pcgr-0.9.4
--output_dir OUTPUT_DIR - Output directory
--genome_assembly {grch37,grch38} - Human genome assembly build: grch37 or grch38
--sample_id SAMPLE_ID - Tumor sample/cancer genome identifier - prefix for output files
vcfanno options:
--vcfanno_n_proc VCFANNO_N_PROC - Number of vcfanno processes (option '-p' in vcfanno), default: 4
VEP options:
--vep_n_forks VEP_N_FORKS - Number of forks (option '--fork' in VEP), default: 4
--vep_buffer_size VEP_BUFFER_SIZE - Variant buffer size (variants read into memory simultaneously, option '--buffer_size' in VEP)
set lower to reduce memory usage, default: 100
--vep_pick_order VEP_PICK_ORDER - Comma-separated string of ordered transcript/variant properties for selection of primary variant consequence
(option '--pick_order' in VEP), default: canonical,appris,biotype,ccds,rank,tsl,length,mane
--vep_no_intergenic - Skip intergenic variants during processing (option '--no_intergenic' in VEP), default: False
--vep_regulatory - Add VEP regulatory annotations (option '--regulatory' )or non-coding interpretation, default: False
--vep_gencode_all - Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option '--gencode_basic' in VEP is used by default in PCGR) default: False
Tumor mutational burden (TMB) and MSI options:
--target_size_mb TARGET_SIZE_MB - For mutational burden analysis - approximate protein-coding target size in Mb of sequencing assay (default: 34 (WES/WGS))
--estimate_tmb - Estimate tumor mutational burden from the total number of somatic mutations and target region size, default: False
--estimate_msi_status - Predict microsatellite instability status from patterns of somatic mutations/indels, default: False
--tmb_algorithm {all_coding,nonsyn} - Method for calculation of TMB, all coding variants (Chalmers et al., Genome Medicine, 2017), or non-synonymous variants only, default: all_coding
Mutational signature options:
--estimate_signatures - Estimate relative contributions of reference mutational signatures in query sample and detect potential kataegis events, default: False
--min_mutations_signatures MIN_MUTATIONS_SIGNATURES - Minimum number of SNVs required for reconstruction of mutational signatures (SBS) by MutationalPatterns (default: 200, minimum n = 100)
--all_reference_signatures - Use all reference mutational signatures (SBS, n = 67) in signature reconstruction rather than only those already attributed to the tumor type (default: False)
--include_artefact_signatures - Include sequencing artefacts in the collection of reference signatures (default: False
--prevalence_reference_signatures {1,2,5,10,15,20} - Minimum tumor-type prevalence (in percent) of reference signatures to be included in refitting procedure (default: 5)
Tumor-only options:
--tumor_only - Input VCF comes from tumor-only sequencing, calls will be filtered for variants of germline origin, (default: False)
--cell_line - Input VCF comes from tumor cell line sequencing (requires --tumor_only), calls will be filtered for variants of germline origin, (default: False)
--pon_vcf PON_VCF - VCF file with germline calls from Panel of Normals (PON) - i.e. blacklisted variants, (default: None)
--maf_onekg_eur MAF_ONEKG_EUR - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold (1000 Genomes Project - European pop, default: 0.002)
--maf_onekg_amr MAF_ONEKG_AMR - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold (1000 Genomes Project - Ad Mixed American pop, default: 0.002)
--maf_onekg_afr MAF_ONEKG_AFR - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold (1000 Genomes Project - African pop, default: 0.002)
--maf_onekg_eas MAF_ONEKG_EAS - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold (1000 Genomes Project - East Asian pop, default: 0.002)
--maf_onekg_sas MAF_ONEKG_SAS - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold (1000 Genomes Project - South Asian pop, default: 0.002)
--maf_onekg_global MAF_ONEKG_GLOBAL - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold (1000 Genomes Project - global pop, default: 0.002)
--maf_gnomad_nfe MAF_GNOMAD_NFE - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - European (non-Finnish), default: 0.002)
--maf_gnomad_asj MAF_GNOMAD_ASJ - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - Ashkenazi Jewish, default: 0.002)
--maf_gnomad_fin MAF_GNOMAD_FIN - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - European (Finnish), default: 0.002)
--maf_gnomad_oth MAF_GNOMAD_OTH - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - Other, default: 0.002)
--maf_gnomad_amr MAF_GNOMAD_AMR - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - Latino/Admixed American, default: 0.002)
--maf_gnomad_afr MAF_GNOMAD_AFR - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - African/African-American, default: 0.002)
--maf_gnomad_eas MAF_GNOMAD_EAS - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - East Asian, default: 0.002)
--maf_gnomad_sas MAF_GNOMAD_SAS - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - South Asian, default: 0.002)
--maf_gnomad_global MAF_GNOMAD_GLOBAL - Exclude variants in tumor (SNVs/InDels, tumor-only mode) with MAF above the given percent threshold, (gnomAD - global population, default: 0.002)
--exclude_pon - Exclude variants occurring in PoN (Panel of Normals, if provided as VCF (--pon_vcf), default: False)
--exclude_likely_hom_germline - Exclude likely homozygous germline variants (100 pct allelic fraction for alternate allele in tumor, very unlikely somatic event, default: False)
--exclude_likely_het_germline - Exclude likely heterozygous germline variants (40-60 pct allelic fraction, AND presence in dbSNP + gnomAD, AND not existing as somatic event in COSMIC/TCGA, default: False)
--exclude_dbsnp_nonsomatic - Exclude variants found in dbSNP (only those that are NOT found in ClinVar(somatic origin)/DoCM/TCGA/COSMIC, defult: False)
--exclude_nonexonic - Exclude non-exonic variants, default: False)
Allelic support options:
--tumor_dp_tag TUMOR_DP_TAG - Specify VCF INFO tag for sequencing depth (tumor, must be Type=Integer, default: _NA_
--tumor_af_tag TUMOR_AF_TAG - Specify VCF INFO tag for variant allelic fraction (tumor, must be Type=Float, default: _NA_
--control_dp_tag CONTROL_DP_TAG - Specify VCF INFO tag for sequencing depth (control, must be Type=Integer, default: _NA_
--control_af_tag CONTROL_AF_TAG - Specify VCF INFO tag for variant allelic fraction (control, must be Type=Float, default: _NA_
--call_conf_tag CALL_CONF_TAG - Specify VCF INFO tag for somatic variant call confidence (must be categorical, e.g. Type=String, default: _NA_
--tumor_dp_min TUMOR_DP_MIN - If VCF INFO tag for sequencing depth (tumor) is specified and found, set minimum required depth for inclusion in report (default: 0)
--tumor_af_min TUMOR_AF_MIN - If VCF INFO tag for variant allelic fraction (tumor) is specified and found, set minimum required AF for inclusion in report (default: 0)
--control_dp_min CONTROL_DP_MIN - If VCF INFO tag for sequencing depth (control) is specified and found, set minimum required depth for inclusion in report (default: 0)
--control_af_max CONTROL_AF_MAX - If VCF INFO tag for variant allelic fraction (control) is specified and found, set maximum tolerated AF for inclusion in report (default: 1)
Other options:
--input_cna INPUT_CNA - Somatic copy number alteration segments (tab-separated values)
--logr_gain LOGR_GAIN - Log ratio-threshold (minimum) for segments containing copy number gains/amplifications (default: 0.8)
--logr_homdel LOGR_HOMDEL - Log ratio-threshold (maximum) for segments containing homozygous deletions (default: -0.8)
--cna_overlap_pct CNA_OVERLAP_PCT - Mean percent overlap between copy number segment and gene transcripts for reporting of gains/losses in tumor suppressor genes/oncogenes, (default: 50)
--tumor_site TSITE - Optional integer code to specify primary tumor type/site of query sample, choose any of the following identifiers:
0 = Any, 1 = Adrenal Gland, 2 = Ampulla of Vater, 3 = Biliary Tract, 4 = Bladder/Urinary Tract, 5 = Bone, 6 = Breast, 7 = Cervix, 8 = CNS/Brain, 9 = Colon/Rectum, 10 = Esophagus/Stomach,
11 = Eye, 12 = Head and Neck, 13 = Kidney, 14 = Liver, 15 = Lung, 16 = Lymphoid, 17 = Myeloid, 18 = Ovary/Fallopian Tube, 19 = Pancreas, 20 = Peripheral Nervous System,
21 = Peritoneum, 22 = Pleura, 23 = Prostate, 24 = Skin, 25 = Soft Tissue, 26 = Testis, 27 = Thymus, 28 = Thyroid, 29 = Uterus, 30 = Vulva/Vagina,
(default: 0 - any tumor type)
--tumor_purity TUMOR_PURITY - Estimated tumor purity (between 0 and 1, (default: None)
--tumor_ploidy TUMOR_PLOIDY - Estimated tumor ploidy (default: None)
--cpsr_report CPSR_REPORT - CPSR report file (Gzipped JSON - file ending with 'cpsr.<genome_assembly>.json.gz' - germline report of patient's blood/control sample
--vcf2maf Generate a MAF file for input VCF using https://github.com/mskcc/vcf2maf (default: False)
--show_noncoding List non-coding (i.e. non protein-altering) variants in report, default: False
--assay {WES,WGS,TARGETED} - Type of DNA sequencing assay performed for input data (VCF) default: WES
--include_trials (Beta) Include relevant ongoing or future clinical trials, focusing on studies with molecularly targeted interventions
--preserved_info_tags PRESERVED_INFO_TAGS Comma-separated string of VCF INFO tags from query VCF that should be kept in PCGR output TSV file
--report_theme {default,cerulean,journal,flatly,readable,spacelab,united,cosmo,lumen,paper,sandstone,simplex,yeti} - Visual report theme (rmarkdown)
--report_nonfloating_toc - Do not float the table of contents (TOC) in output report (rmarkdown), default: False
--force_overwrite By default, the script will fail with an error if any output file already exists. You can force the overwrite of existing result files by using this flag, default: False
--version show program's version number and exit
--basic Run functional variant annotation on VCF through VEP/vcfanno, omit other analyses (i.e. Tier assignment/MSI/TMB/Signatures etc. and report generation (STEP 4), default: False
--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator, default: False
--docker_uid DOCKER_USER_ID - Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)
--no_docker Run the PCGR workflow in a non-Docker mode (see install_no_docker/ folder for instructions)
--debug Print full Docker commands to log, default: False
Example run
The examples folder contains input VCF files from two tumor samples sequenced within TCGA (GRCh37 only). A report for a colorectal tumor case can be generated by running the following command in your terminal window:
$ (base) conda activate pcgr
$ (pcgr)
pcgr \
--pcgr_dir /Users/you/dir2/data \
--output_dir /Users/you/dir3/pcgr_outputs \
--sample_id tumor_sample.COAD \
--tumor_dp_tag TDP \
--tumor_af_tag TVAF \
--call_conf_tag TAL \
--genome_assembly grch37 \
--input_vcf /Users/you/pcgr/examples/tumor_sample.COAD.vcf.gz \
--tumor_site 9 \
--input_cna /Users/you/pcgr/examples/tumor_sample.COAD.cna.tsv \
--tumor_purity 0.9 \
--tumor_ploidy 2.0 \
--include_trials \
--assay WES \
--estimate_signatures \
--estimate_msi_status \
--estimate_tmb \
--no_docker \
--force_overwrite
This command will run the Conda-based PCGR workflow and produce the following output files in the examples folder:
- tumor_sample.COAD.pcgr_acmg.grch37.html - An interactive HTML report for clinical interpretation (rmarkdown)
- tumor_sample.COAD.pcgr_acmg.grch37.flexdb.html - An interactive HTML report for clinical interpretation (flexdashboard)
- tumor_sample.COAD.pcgr_acmg.grch37.vcf.gz (.tbi) - Bgzipped VCF file with rich set of annotations for precision oncology
- tumor_sample.COAD.pcgr_acmg.grch37.pass.vcf.gz (.tbi) - Bgzipped VCF file with rich set of annotations for precision oncology (PASS variants only)
- tumor_sample.COAD.pcgr_acmg.grch37.pass.tsv.gz - Compressed vcf2tsv-converted file with rich set of annotations for precision oncology
- tumor_sample.COAD.pcgr_acmg.grch37.snvs_indels.tiers.tsv - Tab-separated values file with variants organized according to tiers of functional relevance
- tumor_sample.COAD.pcgr_acmg.grch37.mutational_signatures.tsv - Tab-separated values file with information on contribution of mutational signatures
- tumor_sample.COAD.pcgr_acmg.grch37.json.gz - Compressed JSON dump of HTML report content
- tumor_sample.COAD.pcgr_acmg.grch37.cna_segments.tsv.gz - Compressed tab-separated values file with annotations of gene transcripts that overlap with somatic copy number aberrations
- tumor_sample.COAD.pcgr_config.rds - PCGR configuration object (RDS format), mostly for debugging purposes