NanotatoR: a tool for enhanced annotation of genomic structural variants - BMC Genomics
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Bhattacharya et al. BMC Genomics (2021) 22:10 https://doi.org/10.1186/s12864-020-07182-w SOFTWARE Open Access nanotatoR: a tool for enhanced annotation of genomic structural variants Surajit Bhattacharya1, Hayk Barseghyan1,2,3, Emmanuèle C. Délot1,2 and Eric Vilain1,2* Abstract Background: Whole genome sequencing is effective at identification of small variants, but because it is based on short reads, assessment of structural variants (SVs) is limited. The advent of Optical Genome Mapping (OGM), which utilizes long fluorescently labeled DNA molecules for de novo genome assembly and SV calling, has allowed for increased sensitivity and specificity in SV detection. However, compared to small variant annotation tools, OGM- based SV annotation software has seen little development, and currently available SV annotation tools do not provide sufficient information for determination of variant pathogenicity. Results: We developed an R-based package, nanotatoR, which provides comprehensive annotation as a tool for SV classification. nanotatoR uses both external (DGV; DECIPHER; Bionano Genomics BNDB) and internal (user-defined) databases to estimate SV frequency. Human genome reference GRCh37/38-based BED files are used to annotate SVs with overlapping, upstream, and downstream genes. Overlap percentages and distances for nearest genes are calculated and can be used for filtration. A primary gene list is extracted from public databases based on the patient’s phenotype and used to filter genes overlapping SVs, providing the analyst with an easy way to prioritize variants. If available, expression of overlapping or nearby genes of interest is extracted (e.g. from an RNA-Seq dataset, allowing the user to assess the effects of SVs on the transcriptome). Most quality-control filtration parameters are customizable by the user. The output is given in an Excel file format, subdivided into multiple sheets based on SV type and inheritance pattern (INDELs, inversions, translocations, de novo, etc.). nanotatoR passed all quality and run time criteria of Bioconductor, where it was accepted in the April 2019 release. We evaluated nanotatoR’s annotation capabilities using publicly available reference datasets: the singleton sample NA12878, mapped with two types of enzyme labeling, and the NA24143 trio. nanotatoR was also able to accurately filter the known pathogenic variants in a cohort of patients with Duchenne Muscular Dystrophy for which we had previously demonstrated the diagnostic ability of OGM. Conclusions: The extensive annotation enables users to rapidly identify potential pathogenic SVs, a critical step toward use of OGM in the clinical setting. Keywords: Optical genome mapping, Annotation, Structural variants * Correspondence: evilain@gwu.edu 1 Center for Genetic Medicine Research, Children’s Research Institute, Children’s National Hospital, Washington, DC 20010, USA 2 Department of Genomics and Precision Medicine, School of Medicine and Health Sciences, George Washington University, Washington, DC 20052, USA Full list of author information is available at the end of the article © The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 2 of 16 Background complex genomic rearrangements and increase SV break With the advent of the high-throughput short-read sequen- point resolution [13, 14]. They have been critical to shed cing (SRS) techniques, identification of molecular underpin- light on the “dark” regions of the genome where short nings of genetic disorders has become faster, more accurate reads had been insufficient for accurate assembly [15]. and cost-effective [1]. SRS platforms used for whole exome However, as LRS still isn’t in wide use, SV detection pipe- (WES) or genome (WGS) DNA sequencing produce billion lines have seen slower development than SRS-based algo- of reads per run, typically limited in length to 100–150 base rithms, and both quantity and quality of identified SVs pairs (bp) [2]. When WES started being used in clinical vary significantly between tools [16]. diagnostic practice, it was reported to be effective in identi- In parallel, a method not based on sequencing, optical fying pathogenic genetic variants in approximately 30% of genome mapping (OGM, Bionano Genomics), provides cases [3–5]. Even with technological evolution and more much higher sensitivity and specificity for identification of widespread practice, reports of diagnostic yields between 8 large SVs, including balanced events, compared to karyo- and 70%, depending on the disease [6] suggest that a large type, CMA, LRS and SRS [17–20]. For example, a compari- fraction of cases remain undiagnosed. WGS was shown to son of OGM, PacBio LRS and Illumina-based SRS on the be more effective than WES in identifying single nucleotide same genome showed that about a third of deletions and variants (SNVs; a change or variation of a single bp in the three quarters of insertions above 10 kb were detected only genome) or small insertions and deletions (INDELs; inser- by OGM [17]. For OGM, purified high-molecular-weight tion or deletion of 1 to 50 bps) than WES [7, 8]. However, DNA is fluorescently labeled at specific sequence motifs both WES and WGS are ineffective in identification of throughout the genome (reviewed in [21]). The labeled structural variants (SVs, insertion, deletion, duplication, in- DNA is imaged through nanochannel arrays for de novo version, or translocation greater than 50 bps in size) or copy genome assembly. Assembly and variant calling are per- number variants (CNVs; duplication or deletion SVs that formed using algorithms provided by Bionano Genomics affect larger regions of the chromosome) because short and/or tools developed by the community, such as OMblast reads cannot span repetitive elements or provide contextual [22] or OMtools [23]. OGM has been effective in identify- information. Many algorithms have been designed for de- ing pathogenic variants in patients with cancer [24–26], tection of SVs in short-read-based sequences (69 of them Duchenne muscular dystrophy [27], and facioscapulohum- were compared in [9]). However, performance analyses eral muscular dystrophy [28, 29]. Importantly OGM has highlight their limitations such as low concordance, poor allowed refinement of intractable, low-complexity regions precision, and high rate of false positive calls [9, 10]. An- of the genome and discovery of genomic content missing in other benchmarking study comparing 10 different SV cal- the reference genome assembly [19]. lers against robust truth sets showed that the total number Although, OGM is effective in identifying clinically rele- of calls made by the different algorithms varied by greater vant SVs, the currently available SV annotation tools do than two orders of magnitude [11]. Region of genome ana- not provide sufficient variant information for determination lyzed (repeats vs. high-complexity regions), noise of data of variant pathogenicity. Here, we report the development (platform-specific sequencing or assembly errors), complex- of an annotation tool in R language, nanotatoR, that pro- ity of the SV, and library properties (e.g. insert size) all vides extensive annotation for SVs identified by OGM. It affect specificity, sensitivity and/or processing speed of the determines population variant frequency using publicly various variant-calling algorithms [10]. available databases, as well as user-created internal data- Chromosomal microarray (CMA) is the established bases. It offers multiple filtration options based on quality method for high-accuracy detection of CNVs, but it can parameters thresholds. It also determines the percentage of only identify gains or losses of genetic material and is virtu- overlap of genes with the SV, as well as distance between ally blind towards identification of balanced rearrangements nearest genes and SV breakpoints, both upstream and such as inversions or translocations. CMA clinical applica- downstream. It offers an option for incorporating RNA-Seq tion is typically limited to CNVs above 25–50 kb, although read counts, which has been shown to enhance variant clas- higher resolution CNV maps have been built and are being sification [6], as well as user-specified disease-specific gene used to design disease-specific paths to diagnostic detection lists extracted from NCBI databases. The final output is of smaller variants (e.g. [12]). Breakpoint resolution is lim- provided in an Excel worksheet, with segregated SV types ited by the density of probes on the array. and inheritance patterns, facilitating filtration and identifi- Novel approaches which analyze single, long DNA mol- cation of pathogenic variants. ecules hold the promise of detecting the previously in- accessible SVs. Long-read sequencing (LRS) technologies Methods such as nanopore-based sequencing (Oxford Nanopore Sample data sets Technologies) or single-molecule real-time sequencing Optically mapped genomes for 8 different reference hu- (Pacific Biosciences) have the potential to both detect man samples were used to construct the internal cohort
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 3 of 16 database for evaluating nanotatoR’s performance. All sample tumor vs. normal) variants’ presence is evaluated in datasets, including “Utah woman” (Genome in a Bottle Con- self-molecules as well as the control sample mole- sortium sample NA12878), “Ashkenazi family” (NA24143 [or cules. For proband-only analysis variants’ presence GM24143]: Mother, NA24149 [or GM24149]: Father, and is evaluated in self-molecules only. NA24385 [or GM24385]: Son), GM11428 (6-year-old female with duplicated chromosome), GM09888 (8-year-old female nanotatoR functions with trichorhinophalangeal syndrome), GM08331 (4-year-old The annotations provided by nanotatoR are currently with chromosome deletion) and GM06226 (6-year-old male subdivided into 5 categories described below: 1) calcula- with chromosome 1–16 translocation and associated 16p tion of SV frequency in external and internal databases; CNV), were obtained from the Bionano Genomics public 2) determination of gene overlaps; 3) integration of gene datasets (https://bionanogenomics.com/library/datasets/). OG expression data; 4) extraction of relevant phenotypic in- M-based genome assembly and variant calling and annota- formation from public databases for 5) variant filtration. tion were performed using Solve version 3.5 (Bionano Gen- Finally, all of the sub-functions are compiled into a Main omics). Subsequently, samples were annotated with function. nanotatoR to examine the performance. Additionally, we tested nanotatoR’s ability to accur- Function 1: Structural variant frequency ately annotate the known disease variants in a previously Variant frequency is one of the most important filtration published cohort of 11 Duchenne Muscular Dystrophy characteristics for the identification of rare, possibly samples [27]. For this, internal cohort frequency calcula- pathogenic, variants. Because OGM is not sequence tions are based on these 11 samples. For gene expression based the average SV breakpoint uncertainty is 3.3 kbp integration, NA12878 fastq files (RNA-Seq) were ob- [20]. As a result, compared with SNV frequency calcula- tained from Sequence Read archive (SRA) (Sample tions, frequency estimates for SVs pose greater difficulty, GSM754335) and aligned to reference hg19, using STAR due to the breakpoint variability between “same” struc- [30]. Read counts were estimated using RSEM [31], re- tural variants identified by different techniques. ported in Transcripts per Million (TPM). Function 1.1 - External databases nanotatoR uses 3 nanotatoR input file formats external databases: Database of Genomic Variants (DGV) nanotatoR was written in R language. The nanotatoR [32], Database of Chromosomal Imbalance and Phenotype pipeline takes as input Bionano-annotated SV files, in in Humans Using Ensembl Resources (DECIPHER) [33] the format of either unmodified SMAP (BNG’s SVcaller and Bionano Genomics control database (BNDB). The re- output) or text (TXT) files that retain information from spective functions are named: DGVfrequency, DECIPHER the SMAPs, but also append additional fields. The two frequency and BNDBfrequency. The 3 datasets are access- main differences between the input files are: ible through the nanotatoR GitHub repository (https:// github.com/VilainLab/nanotatoRexternalDB). a) Enzyme: If a combination of restriction endonucleases BNDB is provided by Bionano Genomics in a subdivided (Nt.BspQI and Nb.BssSI) are used for DNA labeling, set of 4 files based on the type of SVs (indels, duplications, genome assembly and variant calling, the resultant SV inversions, and translocations) for two different human call sets from each enzyme are merged into a single reference genomes (GRCh37/hg19 and GRCh38/hg38). TXT file in an SMAP format (SVmerge function, nanotatoR aggregates the variant files of the user-selected Bionano Genomics). If a single direct labeling enzyme reference genome (hg19 or hg38), into a single format such as DLE1 is used for DNA labeling, the resultant (e.g. TXT) used for frequency calculation. This action is SV call set is kept in a single SMAP file format. Both performed as part of the function BNDBfrequency with file types (SVmerge TXT and SMAP) can serve as the following input parameters: buildBNInternalDB = input files for nanotatoR. TRUE, InternalDBpattern = “hg19” or InternalDBpattern = b) Family: Depending on the availability of family “hg38”. The following steps are used to calculate the fre- members, the SV-containing input files (TXT/ quency of a query SV in external databases: SMAP) contain additional information derived from the Variant Annotation Pipeline (BN_VAP, Bionano 1.1a Variant-to-variant similarity: Estimating the Genomics). For trio analysis (proband, mother, frequency of a query SV first requires determining father), BN_VAP performs molecule checks for SVs whether the variant is the same as the ones found identified in the proband (self-molecules) and in a database of interest. In order for the SVs to be checks whether the SV is present in the parents’ considered “same”, nanotatoR, by default, checks molecules. For duo analysis (proband vs. control whether two independent variants of the same type -any family member or unrelated individual- or (e.g. deletion) are on the same chromosome, have
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 4 of 16 50% or greater size similarity, and if the SV (BNDB) SVs for which the zygosity is unknown by breakpoint start and end positions are within 10 counting the number of alleles as 2. If the query SV kilobase pairs (kbp) for insertions/deletions/ matches with multiple variants in the BNDB from duplications and within 50 kbp for inversions/ the same BNDB sample, nanotatoR counts these as translocations. For example, if there is a deletion on a single variant/sample, with allele count of 2 for chromosome 1 with a breakpoint start at position homozygous/unknown and 1 for heterozygous chr1:350,000 and end at chr1:550,000 on the matches. reference, all deletion variants in chr1:340,000– 1.1dFrequency calculations: for DECIPHER and 560,000 with a size similarity of 50% would be DGV, SV frequency is calculated by dividing the extracted from the database. Similarly, if the variant number of query matched database variants (step was an inversion, nanotatoR would search for 1.1a) by the total number of alleles in the database, variants of the same type and on the same i.e. 2x the number of samples, which are diploid, chromosome, with a breakpoint start between and multiplying with 100 to get percentage chr1:300,000 and chr1:400,000 and breakpoint end frequency (Formula 1). between chr1:500,000 and chr1:600,000. Currently the 50% size similarity cutoff is not implemented by default for inversions and translocations, as sizes have only started to be provided in the SVcaller Formula 1: External public (DGV and DECIPHER) database SV output recently; however, users have an option to frequency calculation. Numerator: number of variants that pass the similarity criteria (step 1.1.a). Denominator is twice the number of run the size similarity, and future releases of samples, i.e. the number of alleles. This ratio is multiplied by 100 to nanotatoR will perform the size similarity express the frequency as a percentage. calculations by default. The percentage similarity parameters (DECIPHER Number of matching variants ðstep 1:1aÞ and BNDB functions: input parameter perc_ External DB SV frequency ¼ X 100 2 X total number of samples similarity; DGV function: input parameter perc_ similarity_DGV) and breakpoint start and end error For BNDB two types of frequency calculations are (DECIPHER and BNG functions insertion, deletion performed: filtered and unfiltered. For filtered frequency and duplication: input parameter win_indel; DGV calculations the following criteria must be met: 1.1a; function insertion, deletion and duplication: win_ 1.1b; 1.1c. For unfiltered variants frequency calculation indel_DGV; DECIPHER and BNG functions only 1.1a and 1.1c criteria are enforced. The resultant inversion and translocation: win_inv_trans; DGV number of identified counts is divided by the number of function inversion and translocation: win_inv_ alleles in BNDB (currently 468 for 234 diploid samples). trans_DGV) are modifiable by the user. The result is multiplied by 100 to get a percentage 1.1bVariant size and confidence score: Two (Formula 2). additional criteria are implemented to select for high-quality variants in BNDB. Bionano’s SVcaller calculates a confidence score for insertions, dele- Formula 2: BNDB database filtered SV frequency calculation. The tions, inversions, and translocations. To calculate al- variants that pass the similarity criterion (step 1.1.a) are filtered with size threshold and quality score (step 1.1.b). The number of variants is lele frequency, nanotatoR takes into account the estimated as mentioned in step 1.1.c. Denominator is the number of BNDB variants above a threshold quality score of alleles. This ratio is multiplied by 100 to get the frequency in percentage. 0.5 for insertions and deletions (indelconf), 0.01 for inversions (invconf) and 0.1 for translocations Number of matching filtered variants ðsteps 1:1a; b; cÞ Filtered SV frequency ¼ X 100 (transconf). These thresholds can be modified by 2 X Number of BNDB samples the user. In addition, nanotatoR filters out SVs below 1 kbp in size to decrease the likelihood of Output: The output is appended to the original input file false positive calls [20]. in individual columns. For DECIPHER, this consists of a 1.1c Zygosity: Variants in BNDB are reported as single column termed “DECIPHER_Freq_Perc”. As DGV homozygous, heterozygous or “unknown” (DGV provides information on number of samples in addition to and DECIPHER do not report zygosity). This is frequency, nanotatoR prints two columns: “DGV_Count” used to refine frequency calculation for BNDB SVs: (with the total number of unique DGV samples containing nanotatoR attributes an allele count of 2 for variants matching the query SV) and “DGV_Freq_Perc” homozygous SVs and 1 heterozygous SVs. (for the percentage calculated using Formula 1). For the Currently, nanotatoR overestimates the frequency BNDB, in addition to “BNG_Freq_Perc_Filtered”, “BNG_ for variants that overlap with reference database Freq_Perc_UnFiltered”, a third column reports “BNG_
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 5 of 16 Homozygotes” (number of homozygous variants that pass Pipeline, which examines individual molecules for sup- the filtration criteria). port of the identified SV [34] (Found_in_self_BSPQI_- molecules = “yes” or Function 1.2 - Internal databases The internal cohort Found_in_self_BSSSI_molecule = “yes”) for SVmerge analysis is designed to calculate variant frequency based datasets, or (Found_in_self_molecules = “yes”) for a on aggregation of SVs for samples ran within an single-enzyme dataset. institution or laboratory and provides parental zygosity information for inherited variants in familial cases. The For family analyses (duos and trios), the function consists of two distinct parts: internalFrequencyTrio_Duo function is used to identify parental/control sample zygosity based on the nanoID 1.2a Building the internal cohort database: coding using criteria described in sections 1.1.a/c (note Individual (solo) SMAP files for each of the samples that, here, the default size similarity percentage used is are concatenated to build an internal database ≥90% as inherited variants are expected to be virtually (buildSVInternalDB = TRUE), which is stored in the identical). Zygosity information for the identified form of a text file. This step creates a unique variants is extracted and appended into two separate sample identifier (nanoID) based on a key provided columns (fatherZygosity and motherZygosity). This that ensures unique sample ID and encodes family functionality is available for both SVmerge (merged relatedness. The nanoID is written as NR < Family outputs from 2 enzymes) and single enzyme labeling. # > . < Relationship #>. For example, the proband in SVs with the same family ID as the query are not a family of three (trio) would be denoted as NR23.1, included in the overall internal frequency calculation as with NR23 denoting the family ID and 1 denoting described in the previous paragraph. Five columns: the proband. For the parents of this proband, the “MotherZygosity”, “FatherZygosity”, “Internal_Freq_ nanoID would be NR23.2 for the mother and Perc_Filtered”, “Internal_Freq_Perc_Unfiltered”, and NR23.3 for the father. Currently, only trio analyses “Internal_Homozygotes” are appended to each of the are supported, future updates will include larger annotated input files (nonrelevant fields contain dashes). family analyses. If multiple projects exist within the same institution and are coded with project-specific Function 2: Gene overlap identifiers nanotatoR will append the project- The gene overlap function (overlapnearestgeneSearch) specific identifier in front of the nanoID (e.g. Pro- identifies known gene and non-coding RNA genomic lo- ject1_NR23.1 and Project2_NR42.1). cations that overlap with or are immediately upstream or 1.2bCalculating internal frequency and determining downstream from the identified SVs. This function takes parental zygosity: For singleton analyses, the function as input an SV-containing file (TXT or SMAP) and a internalFrequency_Solo (for both DLE labeling and modified Browser Extensible Data (BED) file where hu- SVmerge) calculates internal database frequency of man X and Y chromosomes are numbered as 23 and 24 queried SVs based on the same principles explained in respectively. The user has an option of either providing a section 1.1d (Formula 2) for BNDB frequency Bionano-provided modified BED file (inputfmtBed = calculations. However, additional filtration criteria are “BNBED”) or a BED file from UCSC Genome Browser or implemented to increase the accuracy of frequency GENCODE (inputfmtBed = “BED”). For the latter, nanota- estimation. SVs overlapping gaps in hg19/hg38 are toR supports conversion of a BED file into BNBED stand- annotated in the output SMAPs as “nbase” calls (e.g. ard with buildrunBNBedFiles function. The BED file is “deletion_nbase”) and are likely to be false. nanotatoR used to extract location/orientation of genes and overlap filters out “nbase”-containing SVs when estimating this information with SVs (overlapGenes function, called internal frequency. For duplications, inversions, and from overlapnearestgeneSearch). The function calculates translocations nanotatoR evaluates whether chimeric the percentage overlap, by calculating its distance from scores “pass” the thresholds set by the Bionano the breakpoint start (if the gene is partially upstream of SVcaller during de-novo genome assembly [34] the SV) or breakpoint end (if the gene is partially down- ensuring that SVs that “fail” this criterion are stream of the SV), and dividing it by the length of the gene eliminated from internal frequency calculations (calculated by nanotatoR from genomic coordinates infor- (Fail_BSPQI_assembly_chimeric_score = “pass” or mation in the BED file). By default, nanotatoR applies a 3 Fail_BSSSI_assembly_chimeric_score = “pass”) for kbp gene overlap window (breakpoint start – 3 kbp; SVmerge datasets, or breakpoint end + 3 kbp) to account for the typical OGM (Fail_assembly_chimeric_score = “pass”) for a single- breakpoint error [20] when searching for genes overlap- enzyme dataset. Lastly, nanotatoR checks whether the ping insertions, deletions and duplications. For inversions SVs were confirmed with Bionano Variant Annotation and translocations overlapping genes are limited to +/− 10
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 6 of 16 kbp from the breakpoint start/end (both parameters are Function 4: Entrez extract user-selectable). For the nonOverlapGenes function (also The gene_list_generation function assembles a list of genes called from overlapnearestgeneSearch), genes located near, based on the patient’s phenotype and overlaps it with gene but not overlapping with, SVs are reported along with the names that span SVs. User-provided, phenotype-based key- corresponding distances from the SV. Genes are sorted words are used to generate a gene list from the following based on their distance from the SV breakpoints. The databases: ClinVar [35], OMIM (https://omim.org/), GTR default number of reported nonoverlap genes is 3 (also [36], and the NCBI’s Gene database (www.ncbi.nlm.nih. user-selectable). The output produces 3 additional col- gov/gene). The input to the function is a term, which can umns: “OverlapGenes_strand_perc”, “Upstream_nonOver- be provided as a single term input (method = “Single”), a lapGenes_dist_kb” and “Downstream_nonOverlapGenes_ vector of terms (method = “Multiple”), or a text file dist_kb”. (method = “Text”). The output can be a dataframe or text. The rentrez [37] and VarfromPDB [38] R-language pack- ages are used to extract data related to each of the user- Function 3: Expression data integration provided phenotypic terms, from the individual databases. The SVexpression_solo/_duo/_trio functions for For the Gene database rentrez provides the entrez IDs asso- singletons, dyads and trios respectively provide the user ciated with each gene, which are converted in nanotatoR to with tools to integrate tissue-specific gene expression gene symbols using org.Hs.eg.db [37], a Bioconductor pack- values with SVs. The function takes as input a matrix of age. For OMIM, rentrez provides the OMIM record IDs, gene names (first column) and corresponding expression which are used to extract the corresponding disease- values for each sample, with the sample names as col- associated genes from the OMIM ID-to-gene ID conver- umn headers (individual files can be merged by the sion dataset (mim2gene.txt). For GTR, rentrez extracts the RNAseqcombine function for dyads/trios or RNAseqcom- GTR record IDs, which are then used to extract corre- bine_solo function for singletons). The SVexpression_ sponding gene symbols from the downloaded GTR data- solo/_duo/_trio function can take as an input read esti- base. For ClinVar, VarfromPDB is used to extract genes mation from any transcript quantification tool, provided corresponding to the input term. All genes to which the the input format mentioned above is followed. query keyword is attached, irrespective of their clinical sig- To differentiate between sample types (probands and nificance, are extracted; genes of clinical significance (i.e. affected/unaffected parents), we recommend the user those for which Pathogenic/Likely Pathogenic variants are add a code (or pattern) to the file name. For example, reported) are further reported in a separate column. The for the proband of family 23, the expression file name user also has the option to download the ClinVar and GTR would be Sample23_P_expression.txt, where P denotes databases by choosing downloadClinvar = TRUE and proband. Currently, by default nanotatoR recognizes “P” downloadGTR = TRUE, which may improve run times. as proband, “UM/AM” for unaffected/affected mother The user has an option to save the datasets (removeClin- and “UF/AF” for unaffected/affected father. A function var = FALSE and removeGTR = FALSE) or delete the data- to encode more complex intra-family relatedness and base after the analysis is completed (removeClinvar = identify individual samples, termed nanoID, is in devel- TRUE and removeGTR = TRUE). opment and will be available by default in the next re- The output is provided in CSV format with 3 columns: lease of nanotatoR. “Genes”, “Terms”, and “ClinicalSignificance”. The “Terms” Expression values for overlapping and non-overlapping column contains the list of terms associated to each gene SV genes are extracted from the genome-wide expres- and corresponding database from where the association sion matrix and appended into separate columns in the was derived. The “ClinicalSignificance” column contains overall SV input file. For example, an overlap of gene X genes that have clinical significance (Pathogenic/Likely with a SV in the proband, with an expression value of Pathogenic variants) for the associated term, derived from 10, would be represented as gene X (10), and printed in the ClinVar database. The output of entrez extract serves the “OverlapProbandEXP” column. The appended num- as input for the subsequent variant filtration step. ber of columns in the output is dependent on which functions were run. For a trio analysis, 9 columns are Function 5: Variant filtration added: “OverlapProbandEXP”, “OverlapFatherEXP”, and The filtration function has two major functionalities: “OverlapMotherEXP” for overlapping genes, and a simi- categorization of variants into groups (such as de novo, lar set for up- and down-stream non-overlapping genes inherited from mother/father; potential compound (e.g. “NonOverlapUPprobandEXP”, “NonOverlapDNpro- heterozygous; inversions or translocations) and integration bandEXP”). In case of dyads the number of columns of the primary gene list (either provided by the user or would be 6 (with the parent column being either mother generated by nanotatoR as described in section 4) into the or father) and 3 for singletons. input SV-containing file. In this function, genes overlapping
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 7 of 16 and near SVs that are present in the primary gene list are respectively (e.g. printed in separate columns. For genes that overlap with a Found_in_parents_molecules = “mother” or “father”). SV, the output is printed in columns termed “Overlap_PG” ◊ indel_dup_cmpdHET if present in the (primary genes that are common with the genes that over- heterozygous state in parents. This output is a lap SVs) and “Overlap_PG_Terms” (terms and database the combination of Indel_dup_mother and genes are being extracted from). Genes that are up/down- Indel_dup_father datasets. The user can manually stream of, but not overlapping with, the SV are divided into inspect this list to find potential compound “Non_Overlap_UP_PG”,“Non_Overlap_UP_Terms”, “Non_ heterozygote variants, i.e. 2 variants with genomic Overlap_DN_PG”, and “Non_Overlap_DN_Terms”. coordinates overlapping with the same gene, one The final annotated SV calls are divided into multiple present in the mother, the other in the father. sheets: Finally, inversion and translocations are reported as All: contains all identified variant types and follows, for singletons, dyads, or trios: annotations. all_PG_OV: Variants overlapping with the primary Variant Type = “inversion” or “inversion_paired” or gene list. “inversion_partial” or “inversion_repeat”. Mismatch: Variant Type = “MisMatch” in SVmerge Variant Type = “translocation_intrachr” or output (available for dual enzyme labeling output “translocation_interchr” if within a single chromosome only) contains SV calls discordant between the two or between two chromosomes respectively. enzymes datasets (e.g. one called a deletion and the second an insertion). There is a total of 6 variant filtration functions, based on enzyme type (SVmerge for dual-enzyme labeling or single By default, for the rest of the sheets, nanotatoR enzyme) and sample type (singleton, dyad, or trio). For ex- performs self-molecule checks for all SV types (described ample, for a single-enzyme singleton dataset the function in section 1.2b, Found_in_self_ molecules = yes) and is run_bionano_filter_SE_solo. Others are run_bionano_fil- chimeric score quality checks for duplications, inversions ter_SE_duo, run_bionano_filter_SE_trio, run_bionano_fil- and translocations (also described in section 1.2b, Fail_ ter_SVmerge_solo, run_bionano_filter_SVmerge_duo, and BSPQI_assembly_chimeric_score = “pass”) and reports on run_bionano_filter_SVmerge_trio. Variant filtration output zygosity. (These two criteria are henceforth called “default is an Excel file with each of the different output groups nanotatoR filtration”). For indel_dup, this comprises: represented as separate tabs. The output is written in Excel file format using the openxlsx [39] package, with the For singleton studies (solo): Variant following default naming convention “Sample1_solo.xlsx”. type = “insertion”, “deletion”, “duplication”, “duplication_split” or “duplication_inverted”. Main function For dyads: indel_dup_notShared or The main function sequentially runs the available indel_dup_Shared respectively: Insertions, deletions nanotatoR sub-functions by merging the outputs from and duplications not present in a control sample, each step. There are in total 6 main functions depending mother or father on the type of input: nanotatoR_main_Solo_SE; nanota- (Found_in_control_molecules = “no”) or present in a toR_main_Duo_SE; nanotatoR_main_Trio_SE; nanota- control sample, mother or father toR_main_Solo_SVmerge; nanotatoR_Duo_SVmerge and (Found_in_control_molecules = “yes”). nanotatoR_SVmerge_Trio. The Main function takes the For trios, insertions, deletions or duplications are SMAP file, DGV file, BED file, internal database file and reported in: phenotype term list as inputs and provides a single out- ◊ indel_dup_denovo if not present in the parents put file in a form of an Excel spreadsheet. The output lo- (Found_in_parents_BSPQI_molecules/ cation and file name are user-specified. Found_in_parents_BSSSI_molecules = “none” or “-” for SVmerge or Results Found_in_parents_molecules = “none” for single The nanotatoR pipeline, written in R, integrates five enzyme). individual sub-functions based on the enzyme and the sam- ◊ indel_dup_both if present in the proband, as ple type. A visual representation of the steps is shown in well as both parents (e.g. Fig. 1. nanotatoR passed all runtime and space criteria for Found_in_parents_molecules = “both”). the Bioconductor repository (https://www.bioconductor. ◊ indel_dup_mother and indel_dup_father if org/developers/package-guidelines/), where it was accepted inherited from only the mother or father, in the April, 2019 cycle. The Bioconductor link for
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 8 of 16 Fig. 1 Workflow of the nanotatoR pipeline: The nanotatoR pipeline is divided into 3 layers. Input Layer: Takes as input the OGM text or Smap file. Annotation Layer: The annotation layer comprises of five methods. The method that extracts the overlapping genes and the genes near (downstream and upstream) takes as input a BED file, and calculates the overlap percentage and the distance between nearest genes and SV using chromosomal locations. Next, the frequency calculation function calculates external and internal frequency taking DGV, DECIPHER and BNDB database as input for external frequency calculation, while input solo files, merged to form the internal frequency database, are taken as input to calculate the internal frequency. If RNA- Seq data is available, the expression count matrix is taken as input. Finally, output from all these methods as well as a primary gene list created from terms, is integrated, filtered based on quality criteria and written into an Excel file. Output Layer: The output is an Excel workbook, with each tab representing different SV types. The output files and number of tabs depend on the sample type and enzyme type: Singleton samples have 5 tabs for DLE and 6 tabs for SVmerge; 6 tabs for DLE and 7 tabs for SVmerge are created for dyad analyses; Trio analyses have 9 tabs for DLE and 10 tabs for SVmerge nanotatoR is https://bioconductor.org/packages/devel/bioc/ previously described truth sets: a control trio mapped with html/nanotatoR.html, and the latest version update is avail- the single-enzyme technique, a control singleton sample able at https://github.com/VilainLab/nanotatoR. mapped with both DLE and two-enzyme techniques, and a The output is in the form of an Excel workbook cohort of patients, for which we have previously established subdivided into variant types and inheritance modes in the efficacy of OGM to identify the SV causing Duchenne familial cases. The user has an option to either filter the Muscular Dystrophy [27]. data based on input parameters or perform the filtrations steps in the final Excel sheets. Theoretical examples of the Example I: Annotation of a control trio single labeling nanotatoR annotation process and output are illustrated dataset in Fig. 2 for various types of SVs. We used the so-called “Ashkenazi trio” reference data- To demonstrate the various functionalities of nanotatoR, sets, mapped using the DLE labeling methodology, to we present below the annotation results obtained from test the trio analysis function of nanotatoR (expression
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 9 of 16 Fig. 2 nanotatoR annotates genes overlapping or near a SV: a The cartoon shows three hypothetical scenarios: one deletion in the region upstream of Gene X (yellow) which may contain regulatory regions, indicated as solid purple in the reference genome (top) and lilac in the patient’s genome (bottom); one insertion (green) into Gene Y (blue), and a complete deletion of Gene Z (coral in patient genome). RNA-Seq reads are depicted as blue lines below the genes. b nanotatoR annotation snapshot: nanotatoR annotates the three variants with the overlapping genes and percentage overlap, nearest genes upstream and downstream, distance to the breakpoints in kilobases, BNDB frequency, internal frequency, overlap gene expression value (in transcripts per million or TPM), nearest genes expression in TPM and overlapping genes term from NCBI databases data was not available for these samples). Bionano Confidence and frequency filtration SVcaller identified a total of 9387 SVs in the proband To demonstrate the importance of frequency filtration (NA24385), shown in the last tab of the 9-tab report, in identifying rare variants, a manual filtration was termed “all”. The complete, unfiltered nanotatoR output performed using a 1% threshold for external databases file for NA24385 (GM24385) is available in Supplemen- (DGV, DECIPHER and BNDB) and the internal cohort tary Table S1. database. A confidence threshold (0.5 for insertions and deletions, 0.01 for inversions) was also applied. This eliminated 89% of the variants, leaving a total of 1005 nanotatoR filtration and variant type annotation SVs, including 53 inv. variants and 952 indel_dup After nanotatoR filtration (method described in (shown in top right pie chart, Fig. 3). Function 5 Variant Filtration of Materials and Methods section), 8804 variants (93.8%) remained of the original 9387. The variants were annotated based on criteria Inheritance annotation (Fig. 3, bottom right pie chart) described in Methods section 5. SVs were distributed as All inversion variants were inherited. Out of the 952 shown in Fig. 3 (left pie chart). The vast majority insertions, deletions and duplications, only 8 were de novo (8680) were in the “indel_dup” tabs; 114 (out of 279 (“indel_dup_denovo” tab), 794 found in both parents unfiltered) inversions were reported in the “inv” tab, (“indel_dup_both”), 68 in the mother only (“indel_dup_ and 0 (out of 84) translocations in the “trans” tab. All mother”), and 82 in the father only (“indel_dup_father”). the translocations called by the Bionano SVcaller in These numbers can be used to evaluate pathogenicity of this sample were in the categories “trans_interchr_ the SVs. 150 SVs would be reported in the “indel_dup_ common” and “trans_intrachr_common”, which are cmdHet” column (found in either the mother or father, classified as likely false by the Bionano annotation but not both), which can be manually inspected to identify pipeline [34, 40]. potential compound heterozygous SVs.
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 10 of 16 Fig. 3 Filtration and annotation of SV distribution in the NA24385 trio dataset: Out of the total 9387 variants found by SVcaller, 8804 passed the nanotatoR filtration of “Present in self molecules” and “Pass chimeric score” conditions. Of these, 2787 were deletions (dark blue), 5837 were insertions (light blue), 66 were duplications (grey), 114 were inversions (orange) and 0 were translocations (not shown), left pie chart. Note, nanotatoR outputs deletions, insertions and duplications in a single excel sheet “indel_dup”. The number of variants was dramatically reduced after filtering for rare variants and imposing a confidence threshold (top right pie chart). With a threshold of less than 1% internal frequency, DGV frequency, BNDB frequency, and DECIPHER frequency and confidence thresholds of > 0.5 for INDELs, > 0.01 for inversions and > 0.1 for translocations, 1005 rare variants remain. These were further categorized with nanotatoR by inheritance (bottom right pie chart). All 53 inversions were inherited. Of the 952 indel_dup variants, only 8 were de novo, 794 are were identified as indel_dup_both (found in both mother and father), 68 are indel_dup_mother (found in only the mother), and 82 are indel_dup_father. This annotation can be used to evaluate relevance of the variants to the condition studied SVs overlapping with the primary gene list the list of genes with pathogenic variants, as no such As these samples are those of healthy individuals, we variant is currently reported in ClinVar. could not use a disease term to generate a gene list. 169 SVs were found to be overlapping with the primary However, analysis of the genomes with OGM had gene list, and were shown in the “all_PG_OV” Tab. revealed a deletion variant affecting the UGT2B17 gene A deletion in UGT2B17 gene is observed in both the in the son and the mother [41]. A 150 kb deletion on “indel_dup_mother” tab and the “all_PG_OV” tab, as chromosome 4q13.2 spanning the whole UGT2B17 gene expected [41]. The SV overlaps with the UGT2B17 gene has been associated with osteoporosis [42]. To check and 4 pseudogenes UGT2B29P, AC147055.2, AC147055.3, whether our tool can efficiently annotate the variant, we and AC147055.4 as illustrated in Figure S1. used the term ‘osteoporosis’ to generate a primary gene list. This yielded a list of over 307 genes of which only 4 Internal cohort frequency and zygosity calculations for the had pathogenic or likely pathogenic variants in ClinVar, UGT2B17 variant highlighting the importance of this nanotatoR function To investigate the frequency of the variant in the 8- (Supplementary Table S2). The complete extracted gene sample internal cohort database, we first selected all var- list can be used for gene discovery, while the pathogenic iants with within − 10 kb of start breakpoint and + 10 kb list is most efficient to identify variants in genes known of the end breakpoint (i.e. between hg19 genomic coor- to be associated with the proband’s phenotype. Note dinates chr4:69,362,091 and chr4:69,500,860). A total of that while UGT2B17 was accurately extracted into the 6 variants passed the filtration criteria (“GM24385_del_ primary gene list by nanotatoR, as its association with totalData” tab in Table S3). Of these, 3 are from the osteoporosis is reported in OMIM, it does not appear in query family: one is the proband’s variant (all annotation
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 11 of 16 shown in the “GM24385_Variant_UGT2B17” tab in on number of variants as well as network speed (for Table S3) and the other two are found in his mother. Of gene_list_generation function). the maternal variants, one has exactly same start and To generate the primary gene list for the trio sample, end breakpoints as the proband’s. Bionano SVcaller also we downloaded the ClinVar and GTR databases, using called another variant in the mother with the same end the downloadClinvar = TRUE and downloadGTR = breakpoint, and a similar, but not identical, start break- TRUE parameters in the gene_list_generation function. point. Both were retained as they have a size similarity > The gene_list_generation function took ~ 2 min to run 90% with the proband’s variant. As described in for the sample. The time for this function is dependent Methods section 1.2b, nanotatoR selects the variants that on the number of input terms, as well as the pass size similarity and breakpoint criteria, and reports computational/internet bandwidth available to the user. the zygosity for each in the parents (“GM24385_del_ex- To make this process faster, the input parameters ample_Zygosity” tab in Table S3). removeGTR and removeClinvar can be switched to In addition to the family, the variant was found in two FALSE; nanotatoR will then use the pre-downloaded other samples of the internal cohort. The first was database files for subsequent runs. It is recommended to heterozygous, the second homozygous, so the total number download these databases periodically as they get up- of alleles carrying the SV was counted as 3. As the internal dated frequently. control cohort was composed of 8 samples, of which 3 All the other databases (OMIM, DGV, DECIPHER, were part of the Ashkenazi family, the total number of BNDB) must be downloaded manually from the database alleles in the internal cohort was calculated as 10 = 2 x (8– websites or from the nanotatoR database GitHub page 3), where 2 is for diploid genomes, 8 is the total number of (https://github.com/VilainLab/nanotatoRexternalDB). For samples in the cohort and 3 is the number of related this example, the internal frequency database was built individuals. The final internal frequency thus is (3/10) based on a cohort of 8 samples (time ~ 10 min). The time *100 = 30% (for both filtered and unfiltered, as all variants taken for database creation depends on the user’s cohort passed the quality filters). Note that the nanotatoR size. annotation process has detected the erroneous duplicate call made by SVcaller in sample NA12878, where two Example II: Annotation of a control singleton dataset variants with identical characteristics were called under two labeled with several enzymes different SVIndex numbers (rows 6&7, totalData tab, Table We also investigated the OGM datasets available for the S3). Only one is taken into account for internal frequency sample NA12878 to test the annotation effectiveness of calculation, as shown in Table S3, where tabs “GM24385_ nanotatoR on multi-enzyme labeling and integration of del_example_filter” and “GM24385_del_example_unfilt” RNA-Seq data. OGM data is available for three labeling show the samples used for the calculations. enzymes Nt.BspQI, Nb.BssSI and DLE1. The Nt.BspQI Next, we calculated the filtered and unfiltered frequency and Nb.BssSI SMAP outputs were merged using (Formula 2, Function 1.1d) of the deletion overlapping SVmerge. 10,087 and 6814 SVs were reported by SVcal- UGT2B17 in the Bionano reference database. We identified ler for single enzyme (DLE1) and SVmerge output re- a total of 58 variants in BNDB (“GM24385_data_all” tab in spectively (Fig. 4). Supplementary Table S4). Of these 33 (“GM24385_data_ SV annotation for a single enzyme took approximately filtered” and “GM24385_data_unfiltered” tabs of 28 min and 24 min for SVmerge data output. The time Supplementary Table S4) passed the nanotatoR default taken for each of the functions is reported in filtration criteria and were used for frequency calculation. Supplementary Table S5. Currently, the expression data Twelve were homozygotes and 21 heterozygotes for a total aggregation function takes the longest time as it extracts number of variant alleles of 43. The total number of each of the genes overlapping the SVs and finds the samples in BNDB is 234, hence the variant frequency is corresponding expression values from the RNA-Seq calculated as (43/468) *100 = 9.18%. (Note that, here too, datasets. The run time for this function largely depends the number of variants was the same before and after on the number of called SVs and the number of genes filtration, yielding the same frequency value). that overlap with them. The complete nanotatoR-anno- tated output Excel files can be found in Supplementary Tables S6 and S7 for DLE1 labeling and SVmerge Run times (Nt.BspQI/Nb.BssSI) respectively. The SMAPs were annotated on an Intel core i7–6700 For DLE1 labeling, out of the 10,087 variants called by CPU with 16 GB RAM, Windows 10 system. It took ~ the SVcaller, 9584 (~ 5%) remained after default 15 min to annotate the trio sample (Supplementary nanotatoR filtration (“Present in self molecules” and Table S5 has run time for each of the functions “Pass chimeric score”, see Methods function 5) (Fig. 4a). individually). The runtime for nanotatoR is dependent As for the trio genomes, the vast majority of called SVs
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 12 of 16 Fig. 4 Variant distribution for the singleton NA12878 sample. a Unfiltered and filtered variants distribution in DLE and SVmerge datasets: The total number of unfiltered variants for NA12878 DLE are 10,087, out of which 9584 variants are filtered using nanotatoR criteria (found in self molecules, passed chimeric score threshold). For NA12878 SVmerge, out of 6814 variants, 6201 pass the filtration. Mismatches were not considered in this analysis but are shown in Table S7. b SV distribution in the NA12878 DLE and SVmerge filtered datasets: Deletions (dark blue), insertions (light blue), duplications (grey) and inversions (orange) numbers are as shown in the pie charts. Bottom table shows distribution of SVs by type in percentages. For DLE, the majority of the identified SVs were insertions (64.9%), followed by deletions (33.5%), inversions (1.1%) and finally duplications (0.5%). While the total number of variants called is different between DLE and SVmerge, a similar pattern is seen in the SVmerge dataset. Many more duplications and inversions were called in the dual labeling than single DLE labeling method were indels, with a similar number of inversions (110) single enzyme and SVmerge datasets. SV breakpoints and zero translocations. Further breakdown of the indel_ reported in the original publication and in the nanotatoR- dup tab reveals 3207 deletions, 6218 insertions, and 49 annotated data sets are shown in Supplementary Table S8; duplications (Fig. 4b; left panel). For dual-enzyme label- SV type and gene names are highlighted in the all_PG_ ing, fewer variants were called in the SVmerge dataset OV tab in Tables S6 (DLE) and S7 (SVmerge). (6814), of which ~ 9% were filtered out by nanotatoR de- fault filtration. Of the remaining 6201, 5982 are “indel_ Example III: Duchenne muscular dystrophy cohort dup”, 219 “inv”, 0 “trans” and 11 mismatches. Breakdown of We have previously published validation of the OGM the indel SVs between deletions, insertions and duplications technology to identify variants in the DMD gene in a is shown in Fig. 4b, right panel). Proportions of insertions cohort of patients with Duchenne muscular dystrophy and deletions are similar in the two data sets, while the dual [27]. We used the same cohort to test the nanotatoR enzyme labeling called more duplications and inversions annotation pipeline. The gene_list_generation function, than single-enzyme labeling in this example. using “Duchenne muscular dystrophy” as input for the To validate the efficiency of nanotatoR in identifying rentrez tool, used at high stringency, i.e. selecting only genes overlapping with SVs for NA12878, we looked for genes with pathogenic or likely pathogenic variants in four previously published variants [43]. The 4 deletion ClinVar, yielded only one gene as expected for this variants identified in the study overlapped GSTM1, monogenic disorder. Table 1 shows that all of the LCE3B, LCE3C, CR1 and SIGLEC14 genes. nanotatoR’s previously identified variants in DMD cases were automated pipeline was able to identify the same type of correctly annotated. Each of these types of variants was variant (deletions) involving the same genes in both the placed in the correct final Excel output tab with
Bhattacharya et al. BMC Genomics (2021) 22:10 Page 13 of 16 Table 1 Summary of nanotatoR annotation results of Duchenne muscular dystrophy patient cohort Sample ID Overlap Gene Variant Type Clinical Significance Zygosity Internal Frequency DGV/BNDB Frequency CDMD1003_P DMD Deletion Pathogenic Hemizygous 0 0/0 CDMD1155_P DMD Deletion Pathogenic Hemizygous 0 0/0 CDMD1156_P DMD Deletion Pathogenic Hemizygous 0 0/0 CDMD1159_P DMD Deletion Deletion Pathogenic Hemizygous 0 0/0 DMD Unknown Hemizygous 0 0/0 CDMD1131_P DMD Deletion Pathogenic Hemizygous 22% 0/0 CDMD1132_M Carrier Heterozygous CDMD1157_P DMD Deletion Pathogenic Hemizygous 11% 0/0 CDMD1158_M Non-Carrier n/a CDMD1163_P DMD Insertion Pathogenic Hemizygous 0 0/0 CDMD1164_M Carrier Heterozygous CDMD1187_P DMD Inversion Pathogenic Hemizygous 0 0/0 Using nanotatoR we annotated variants that overlapped the DMD gene, evaluated the zygosity and calculated the internal (cohort size 11 samples) and external (DGV/BNDB) frequencies. The details of the variants can be found in Barseghyan et.al. 2017 [27] corresponding frequencies, gene overlap and maternal currently suboptimal, with limited user-defined parame- carrier status. (Annotation for all samples is shown in ters for frequency calculations in external databases, inter- Supplementary Table S9; columns used to create Table section with gene expression datasets, or filtration 1 are highlighted). through primary gene lists. These functions are critical for Note that, as most callers, Bionano SVcaller currently clinical applications to evaluate SV pathogenicity. nanota- identifies zygosity for X and Y chromosome variants in toR, currently takes as input the SMAP/TXT variant file, an XY individual as homozygous rather than databases (internal and/or external), terms list, gene loca- hemizygous, which we corrected in Table 1. As a result tion bed files, and expression values, to provide the user of the erroneous input, internal frequency is currently with comprehensive SV annotations. overestimated for variants on the X chromosome. The field of genome-wide structural variation identifi- Frequency calculations for SVs on the Y chromosome cation is rapidly advancing with LRS and OGM con- are not affected. stantly evolving. However, currently both LRS and OGM Using nanotatoR, we were able to automate steps that technologies have limitations in identification of SNVs, previously had to be taken manually to identify the large SVs > 5 kb (LRS) and small SVs < 1 kb (OGM) [17]. pathogenic SVs in the DMD gene in the data sets: A combination of these various methods on the same navigation to X chromosome location of the DMD gene, genome will likely be necessary for optimal resolution selection of the type of the SV (deletion/insertion, etc..), and accuracy of SV detection [17, 47], which will require filtration by frequency, and curation of gene pathogenicity. the design of integrated platforms able to detect and In addition to the previously reported SVs, with the help classify variants using multiple types of data sets. Simi- of nanotatoR, we identified an additional deletion of larly, accurate determination of the pathogenicity of SVs unknown significance in sample CDMD_1159 that had requires integration of multiple data sets (e.g. OGM and previously been missed. gene expression) as well as tools capable of annotating these various types of data sets. nanotatoR has this func- Conclusions tionality and was able to considerably reduce analysis Structural variants play a major role in various genetic time compared to manual filtration of many steps. Fu- diseases (reviewed in [44]). Due to the technical ture development of nanotatoR will be focused around limitations of short-read-based genome sequencing and adoption of variant annotation file (VCF) format as input microarray techniques, identification of SVs is challenging. files, support for SV calls produced by LRS/SRS tech- Introduction of optical genome mapping and long-read- nologies, additional population frequency databases such based technologies promises to advance the field of SV as gnomAD [48], integration of gene regulatory informa- identification. Although there are tools available for anno- tion in the form of CHIP-seq and microRNA sequencing tation of SVs (AnnotSV [45], Annovar [46]), they do not data, and implementation of automated SV classification take into account OGM criteria (such as self molecules or based on ACMG guidelines [49]. We will also imple- chimeric score) for filtration. This has prompted the de- ment functions for annotation of somatic SVs for better velopment of nanotatoR to help researchers analyze OGM variant prioritization during analysis. In addition, we SV datasets, with high efficiency and precision. The anno- plan to design a graphical interface for easy access and tation pipelines available for the OGM and LRS data are wider adoption.
You can also read