Sequencing, assembly and annotation of the whole-insect genome of Lymantria dispar dispar, the European gypsy moth - Oxford Academic Journals

Page created by Cody Kelley
 
CONTINUE READING
Sequencing, assembly and annotation of the whole-insect genome of Lymantria dispar dispar, the European gypsy moth - Oxford Academic Journals
2
                                                                                                                                    G3, 2021, 11(8), jkab150
                                                                                                                          DOI: 10.1093/g3journal/jkab150
                                                                                                             Advance Access Publication Date: 30 April 2021
                                                                                                                                            Genome Report

Sequencing, assembly and annotation of the whole-insect
genome of Lymantria dispar dispar, the European gypsy
moth
Michael E. Sparks   ,1,* Francois Olivier Hebert,2 J. Spencer Johnston,3 Richard C. Hamelin,2,4 Michel Cusson,2,5

                                                                                                                                                               Downloaded from https://academic.oup.com/g3journal/article/11/8/jkab150/6261075 by guest on 15 October 2021
Roger C. Levesque and Dawn E. Gundersen-Rindal1
                  2

1
  USDA-ARS Invasive Insect Biocontrol and Behavior Laboratory, Beltsville, MD 20705, USA,
2
  Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Quebec City, QC, G1V 0A6, Canada,
3
  Department of Entomology, Texas A&M University, College Station, TX 77843, USA,
4
  Department of Forest and Conservation Sciences, Faculty of Forestry, University of British Columbia, Vancouver, BC V6T 1Z4, Canada, and
5
  Laurentian Forestry Centre, Canadian Forest Service, Natural Resources Canada, Quebec City, QC G1V 4C7, Canada

*Corresponding author: michael.sparks2@usda.gov

Abstract
The European gypsy moth, Lymantria dispar dispar (LDD), is an invasive insect and a threat to urban trees, forests and forest-related indus-
tries in North America. For use as a comparator with a previously published genome based on the LD652 pupal ovary-derived cell line, as
well as whole-insect genome sequences obtained from the Asian gypsy moth subspecies L. dispar asiatica and L. dispar japonica, the
whole-insect LDD genome was sequenced, assembled and annotated. The resulting assembly was 998 Mb in size, with a contig N50 of
662 Kb and a GC content of 38.8%. Long interspersed nuclear elements constitute 25.4% of the whole-insect genome, and a total of
11,901 genes predicted by automated gene finding encoded proteins exhibiting homology with reference sequences in the NCBI NR
and/or UniProtKB databases at the most stringent similarity cutoff level (i.e., the gold tier). These results will be especially useful in develop-
ing a better understanding of the biology and population genetics of L. dispar and the genetic features underlying Lepidoptera in general.

Keywords: whole-genome sequencing; PacBio long-read assembly; Illumina short-read polishing; automated gene finding; Lepidoptera;
European gypsy moth; gypsy moth genomics

Introduction                                                                       earlier mitogenome-based research findings (Djoumad et al. 2017).
                                                                                   These resources will contribute to an improved understanding of
Feeding on over 300 species of woody plants, the European gypsy
                                                                                   lepidopteran genetics in general (Triant et al. 2018).
moth, Lymantria dispar dispar (LDD), poses significant economic
and ecological threats to urban trees and forests of North
America, particularly with respect to forest-related industries                    Materials and methods
(Leonard 1981). Having a representative genome sequence for this
                                                                                   The European gypsy moth genome size was estimated using previ-
forest pest is important to facilitate various aspects of biosurveil-
                                                                                   ously described methods (Johnston et al. 2019) applied to LDD New
lance, including discovery of traits associated with invasiveness
                                                                                   Jersey Standard Strain (NJSS) moth heads obtained from the USDA-
and mapping of population resequencing data to determine sour-
ces and pathways of spread (Roe et al. 2019; Hamelin and Roe                       APHIS Otis Laboratory (Buzzards Bay, MA, USA): four individuals
2020). An Illumina-based genome assembly of biomaterials pre-                      were processed apiece for each of the male and female sexes.
pared from the LDD-derived in vitro cell line, LD652 (Goodwin et al.               Genomic DNA was prepared from a single female pupa, which had
1978), was recently published (Zhang et al. 2019); however, this cell              been reared from an NJSS egg mass—this biomaterial was used for
culture may not accurately portray the comprehensive genomic                       sequencing a total of 96 SMRT-Cells by the University of
and/or genic complements of the intact LDD insect (Sparks et al.                   Washington PacBio Sequencing Services group (Seattle, WA, USA).
2013). Whole-insect genomes of two subspecies of Asian gypsy                       Subread data were extracted in fastq format from bas.h5 archives
moths, L. dispar asiatica (LDA) and L. dispar japonica (LDJ), were also            (without imposition of quality filters) using Pacific Biosciences’
recently sequenced using PacBio technology (Hebert et al. 2019).                   pbh5tools (https://github.com/PacificBiosciences/pbh5tools). Read
This study reports the sequencing of LDD at the whole-insect level                 correction, trimming and assembly were performed using version
using PacBio and Illumina technologies, done specifically to enable                1.9 of the CANU assembler (Koren et al. 2017). Subread sets were
nuclear genome-informed population genetics and comparative                        compiled into BAM-formatted archives using the bax2bam utility
genomics analyses among these subspecies, thereby augmenting                       from Pacific Biosciences’ SMRT Analysis software suite (https://www.

Received: February 05, 2021. Accepted: April 26, 2021
Published by Oxford University Press on behalf of Genetics Society of America 2021. This work is written by US Government employees and is in the public
domain in the US.
2   |   G3, 2021, Vol. 11, No. 8

pacb.com/support/software-downloads) with the following pulse              with a reference protein (also of at least 150 aa in length) such that
feature options enabled: DeletionQV, DeletionTag, InsertionQV,             75% or more of aligned residues are positively similar, and the hit
IPD, MergeQV, SubstitutionQV, PulseWidth and SubstitutionTag.              length-to-subject sequence length ratio is 90% or greater. Silver-tier
BAM-formatted subreads were aligned to assembled genomic                   entries are a minimum of 100 amino acid residues long with a hit
templates using the BLASR long read aligner via the pbalign wrapper        spanning at least 75% of a reference protein sequence’s length (also
(https://github.com/PacificBiosciences/blasr;        https://github.com/   at least 100 aa in length), and bronze-tier entries are similar, but
PacificBiosciences/pbalign), the results of which were subsequently        with the hit covering 30% or more of the reference protein’s length.
used as inputs to the Arrow genome polishing algorithm as imple-               The annotation pipeline was also used to independently analyze
mented in the Pacific Biosciences GenomicConsensus package                 the LDA and LDJ genomes (available from https://osf.io/unz2v/files/ as
(https://github.com/PacificBiosciences/GenomicConsensus).          Three   “lda.genome.v0.3.arrow-polished.fasta” and “ldj.genome.v0.3.arrow-
rounds of Arrow-based assembly polishing were performed.                   polished.fasta”, respectively), as well as the two LD652 cell line assem-
    An additional library was prepared from a single non-sexed,            blies described in Zhang et al. (2019): one of these LD652 assemblies
3rd-instar larval insect reared from an NJSS egg mass. A set of            incorporates information from Novogene-prepared Hi-C libraries

                                                                                                                                                       Downloaded from https://academic.oup.com/g3journal/article/11/8/jkab150/6261075 by guest on 15 October 2021
359,593,922 Illumina PE100 short read pairs was sequenced from             (“LD652_Lymantria_dispar.FINAL.fasta”, obtained from http://prodata.
this library by the Georgia Genomics Facility (Athens, GA, USA)            swmed.edu/download/pub/LD652/) and the other (available from
and aligned to the Arrow-polished template with bowtie2                    GenBank under assembly accession GCA_004115105.1), does not.
(Langmead and Salzberg 2012), the results of which were used as            Predicted protein sequences from each of the gold, silver and bronze
input to the Pilon polishing program operating in its diploid mode         tiers were combined in an assembly-specific manner, and these were
(Walker et al. 2014). The fully polished assembly was then proc-           then globally pooled and clustered with CD-HIT (Fu et al. 2012) using a
essed with Purge Haplotigs (Roach et al. 2018) to minimize resid-          sequence identity threshold of 0.975. Clusters having exactly one rep-
ual heterogeneity; empirically selected read depth cutoffs of low          resentative sequence from each of the five assemblies were identified,
¼ 19, mid ¼ 58, and high ¼ 107 were used. The genome was then              providing a set of putative single-copy gene models universally pre-
submitted to the NCBI WGS division, which filtered for potential           sent in these L. dispar genomes.
vector contaminants and mitochondrial sequences. The NCBI-
vetted assembly was compared with the endopterygota_odb9                   Data availability
database using BUSCO (Sima     ~ o et al. 2015) to assess genome qual-     All data used in the preparation of the final LDD assembly are
ity in terms of known single-copy orthologs. Repetitive content            available at the NCBI under BioProject accession number
was identified using RepeatModeler (Smit and Hubley 2020)—                 PRJNA694036: PacBio subreads and Illumina short-read data are
which wraps the LTRharvest (Ellinghaus et al. 2008), RepeatScout           available at the SRA database (accession numbers SRR13505170-
(Price et al. 2005) and RECON (Bao and Eddy 2002) pipelines—and            SRR13505265, and SRR13518687-SRR13518688, respectively). The
masked using RepeatMasker (Smit et al. 2019).                              Whole Genome Shotgun project has been deposited at DDBJ/
    Gene annotation was conducted using a modified version of              ENA/GenBank under the accession JAEVLZ000000000. The ver-
the insect-specific pipeline described in Hebert et al. (2019) (see        sion described in this paper is JAEVLZ010000000. In addition, the
Figure 1). In addition to extrinsic hints provided to Augustus             whole-genome assembly and automated protein-coding gene
(Stanke et al. 2006) by Exonerate (Slater and Birney 2005) and             finding results (see below) are available at the Open Science
PASA (Haas et al. 2003), proteins predicted by GenomeThreader              Framework (OSF) repository: https://doi.org/10.17605/OSF.IO/
(Gremme et al. 2005), used to align Trinity-based RNA-Seq assem-           JX2VS. Lastly, interested parties are welcome to contact the cor-
blies (Haas et al. 2013) to genomic templates and compute con-             responding author to request relevant data and software.
sensus gene structures from the results, were also used. In
particular, GenomeThreader’s splice site models were retrained
using BSSM4GSQ (Sparks and Brendel 2005) with Bombyx mori data
                                                                           Results and discussion
(obtained from NCBI’s Annotation Release 102; https://www.ncbi.            The fully processed, NCBI-vetted genome assembly of the whole-
nlm.nih.gov/genome/annotation_euk/Bombyx_mori/102/) and its                insect European gypsy moth (i.e., LDD) exhibited a size of 998 Mb,
predictions were post-processed with MetWAMer (https://github.             a contig N50 of 662 Kb and a GC content of 38.8% (see Table 1).
com/scentiant/MetWAMer; Sparks and Brendel 2008) to identify               The assembly size is 1.015 times larger than the 983 Mb mean es-
translation initiation sites. If a GenomeThreader-derived con-             timate obtained from a sample of four females using flow cytom-
sensus gene structure was ideal (i.e., canonical translation start         etry [female: 983.2 6 6.7 Mb (n ¼ 4) and male: 993.3 6 6.2 Mb
and stop codons were present and, for multi-exon models,                   (n ¼ 4)]. Of a total of 2,442 known BUSCOs in the Endopterygota,
introns uniformly exhibited GT donors and AG acceptors), it was            95.5% were complete (93.7% single-copy and 1.8% duplicated),
provided to Augustus as an immutable anchor; otherwise, it was             2.7% were fragmented and 1.8% were missing. Long interspersed
supplied as a hint. Both of the alternatives-from-evidence and             nuclear elements (LINEs) constitute 25.40% of the LDD genome
alternatives-from-sampling Augustus flags were set to false.               (see Table 2). This amount is only slightly less than the 25.74% and
RNA-Seq data used to prepare Trinity assemblies were obtained              26.86% fractions of LINEs observed by the revised annotation pipe-
from Sparks et al. (2013) and other unpublished studies con-               line as applied to LDA and LDJ, respectively, although it is notably
ducted by MES and DEGR.                                                    greater than the 18.96% and 17.34% amounts observed for the
    Protein sequences predicted by the overall pipeline were con-          LD652 assemblies with and without Hi-C library integration, re-
strained to 50 or more amino acids (aa) in length. These were then         spectively (see Table 3). Similarly, although LDD, LDA and LDJ ex-
compared with the NCBI NR and UniProtKB’s Swiss-Prot and TrEMBL            hibit commensurate levels of total repetitive DNA (68.74%, 67.45%
protein databases (UniProt Consortium 2019) using the DIAMOND              and 67.09%, respectively), the LD652 assemblies comprised lesser
aligner in its BLASTp-like mode (Buchfink et al. 2015). Gene models        amounts of such content (with Hi-C ¼ 59.76%, sans Hi-C ¼ 59.66%).
were sorted into gold, silver and bronze tiers based on their encoded         The protein-coding gene annotation pipeline identified 75,018
products’ similarity to exemplar proteins: gold-tier entries are at        unique LDD genes, of which 11,901 sorted to the gold tier, 17,598 to
least 150 aa in length and exhibit a single high-scoring segment pair      the silver, and 26,773 to bronze (the remaining 18,746 “non-podium”
M.E. Sparks et al.   |   3

                                                                                                                                                      Downloaded from https://academic.oup.com/g3journal/article/11/8/jkab150/6261075 by guest on 15 October 2021

Figure 1 L. dispar dispar genome assembly and annotation pipeline. The automated pipeline used to assemble the European gypsy moth genome, as well
as annotate repetitive content and nuclear protein coding genes, is a modified version of the methods used for annotating the Asian gypsy moth
genomes, L. dispar asiatica and L. dispar japonica (Hebert et al. 2019). (The Bombyx mori image is copyright Freepik 2018 and reproduced here with
permission.)
4   |   G3, 2021, Vol. 11, No. 8

Table 1 Genome assembly statistics following initial Canu assembly, each of three rounds of Arrow polishing, a round of Pilon short-
read polishing, primary contig identification using the Purge Haplotigs pipeline, and purging of mitochondrial sequences and putative
vector contaminants by software filters at the NCBI WGS division

                            Unpolished         Arrow polished     Arrow polished   Arrow polished       Pilon     Purge Haplotigs NCBI filters (final
                          assembly (Canu)         (round 1)          (round 2)        (round 3)        polished     processed        version)

N50 (a)                        355,683            355,730            355,728           355,719       354,801         661,876            661,876
N90 (a)                         26,157             26,161             26,163            26,167        26,130          91,592             91,592
GC (%)                          38.88              38.89               38.89             38.89        38.89            38.83             38.83
Total size                 1,280,392,172       1,281,096,304      1,281,192,033     1,281,236,029 1,279,160,664    998,776,906        998,430,749
Min contig                      1,028              1,028               1,028             1,029        1,029            1,039             1,039
Max contig                    8,311,499          8,312,296          8,312,357         8,312,335     8,267,660       8,267,660          8,267,660
Contig count                    16,115             16,115             16,115            16,115        16,115           4,625             4,622
SMRT cells                        96
Subread count (b)            20,008,385

                                                                                                                                                        Downloaded from https://academic.oup.com/g3journal/article/11/8/jkab150/6261075 by guest on 15 October 2021
Bases in subreads (b)     113,177,874,503
Genome size estimate        983,200,000
Sequencing coverage             115.11
a
 N50 was calculated using the definition provided in Miller et al. (2010), with N90 being similarly defined.
b
 Subreads determined using bax2bam (from smrtlink_7.0.0.63985) w/options “--subread” and
“--pulsefeatures¼DeletionQV,DeletionTag,InsertionQV,IPD,MergeQV,SubstitutionQV,PulseWidth,SubstitutionTag”.

Table 2 Repetitive content of the L. dispar dispar genome assembly as identified using RepeatModeler/RepeatMasker

                                                      Number of elements                  Length occupied (bp)            Percentage of sequence (%)

Bases masked                                                                                  686,312,151                            68.74
Retroelements                                         1,688,175                               336,176,424                            33.67
  SINEs                                               150,440                                  12,344,035                             1.24
  Penelope                                            214,022                                  28,041,873                             2.81
  LINEs:                                              1,210,601                               253,631,498                            25.40
     CRE/SLACS                                        1,123                                       414,475                             0.04
     L2/CR1/Rex                                       120,505                                  29,326,164                             2.94
     R1/LOA/Jockey                                    289,120                                 100,744,241                            10.09
     R2/R4/NeSL                                       3,328                                     1,806,751                             0.18
     RTE/Bov-B                                        392,519                                  58,464,284                             5.86
     L1/CIN4                                          0                                                 0                             0.00
  LTR elements:                                       327,134                                  70,200,891                             7.03
     BEL/Pao                                          97,919                                   24,858,437                             2.49
     Ty1/Copia                                        20,415                                    6,797,796                             0.68
     Gypsy/DIRS1                                      114,861                                  23,908,124                             2.39
     Retroviral                                       455                                          56,012                             0.01
DNA transposons                                       410,996                                  75,109,994                             7.52
  hobo-Activator                                      39,043                                   10,065,906                             1.01
  Tc1-IS630-Pogo                                      324,022                                  56,591,826                             5.67
  En-Spm                                              0                                                 0                             0.00
  MuDR-IS905                                          0                                                 0                             0.00
  PiggyBac                                            1,258                                       410,331                             0.04
  Tourist/Harbinger                                   6,160                                       810,421                             0.08
  Other (Mirage, P-element, Transib)                  4,091                                       523,746                             0.05
Rolling-circles                                       746,035                                 123,018,327                            12.32
Unclassified                                          1,043,151                               151,324,752                            15.16
Total interspersed repeats                                                                    562,611,170                            56.35
Small RNA                                             4,846                                       382,666                             0.04
Satellites                                            192                                         130,399                             0.01
Simple repeats                                        3,585                                       169,589                             0.02

genes failed to satisfy the criteria for any of these levels, thus repre-         Given that the LD652 cell line was derived from ovarian tissue
senting either false positive gene calls or novel sequences not yet pre-       over four decades ago (Goodwin et al. 1978) and has undergone
sent in reference databases). Reannotation of the LDA and LDJ                  long-term storage, multiple serial passages and exposure to un-
genomes with this updated annotation system yielded commensu-                  known selective pressures in the interim (Lynn 2000, 2006), the
rate counts of genes encoding proteins corroborated by records in the          possibility exists that loss of chromosomal-level genetic material
NCBI and UniProtKB databases (i.e., gold, silver and bronze tier mod-          may partially explain the discrepancies in genome size, repetitive
els): LDA ¼ 9,581 þ 16,081 þ 27,583 ¼ 53,245 “podium-level” genes,             content and gene content observed between LD652 and LDD
LDJ ¼ 57,331 and LDD ¼ 56,272 (see Table 3). These amounts are                 (Table 3). More likely, however, is that the short-read technology
roughly double what was observed for the LD652 assemblies: 24,345              used to sequence LD652 (Zhang et al. 2019) was comparably less
unique protein-coding gene loci were identified in the Hi-C library-           effective at characterizing such a highly repetitive genome as
based assembly and 26,504 in the other version. Clustering of pooled,          that of L. dispar than were the longer PacBio reads used in this
podium-level sequences identified a total of 3,265 putative single-            study (and in Hebert et al. 2019). What is more, the lower levels of
copy proteins shared across all five of these L. dispar genomes.               BUSCOs detected and fewer protein-coding genes predicted in the
M.E. Sparks et al.     |     5

Table 3 A comparison of descriptive statistics assessing assembly quality, and repetitive element annotation and gene finding results

                                    LDD                         LDA                           LDJ                      LD652 (Hi-C)                    LD652

Size (bp)                      998,430,749                  921,904,509                 1,000,530,546                  788,637,865                  864,963,299
Contig count                       4,622                       8,189                        11,303                       134,187                      135,446
N50                              661,876                      212,365                      137,117                      5,068,179                     249,594
N90                               91,592                       40,935                       37,304                         2,453                       3,108
GC (%)                             38.83                       38.60                        38.51                          35.18                       35.25
LINE elements                   1,210,601                    1,073,465                    1,439,736                      902,373                      879,467
LINEs (%)                          25.40                       25.74                        26.86                          18.96                       17.34
bp masked                      686,312,151                  621,829,597                  671,280,650                   471,271,105                  516,072,603
bp masked (%)                      68.74                       67.45                        67.09                          59.76                       59.66
Gold tier                         11,901                       9,581                        8,834                          5,341                       5,896
Silver tier                       17,598                       16,081                       16,445                         5,289                       5,819
Bronze tier                       26,773                       27,583                       32,052                        13,715                      14,789

                                                                                                                                                                       Downloaded from https://academic.oup.com/g3journal/article/11/8/jkab150/6261075 by guest on 15 October 2021
Non-podium                        18,746                       21,614                       27,695                        22,491                      24,131
BUSCO (%)                          98.2                         96.4                         97.7                          87.7                         93.9

  Assembly sources and methods used are as described in the Materials and Methods section. Reported gold-, silver- and bronze-tier, as well as non-podium level,
values correspond to the number of unique protein-coding loci identified at the respective quality level by the gene annotation pipeline. Reported BUSCO values
correspond to the fraction of complete (i.e., single-copy and duplicated) and fragmented BUSCOs detected.

LD652 genomes suggest that the longer PacBio reads were also                       Conflicts of interest
more effective at recovering coding (and hence, typically less re-
                                                                                   None declared.
petitive) regions of the LDD, LDA and LDJ genomes in a highly
contiguous manner. Resequencing of LD652 using longer-read
technologies will help with resolving this ambiguity.                              Literature cited
    In L. dispar, females constitute the heterogametic sex (Sahara
et al. 2012), which in part motivated the decision to sequence a                   Bao Z, Eddy SR. 2002. Automated de novo identification of repeat se-
single female LDD pupa using long reads—representative se-                            quence families in sequenced genomes. Genome Res. 12:
quence from both sex chromosomes would thereby be captured.                           1269–1276.
Among Asian gypsy moths, females are also flight capable                           Buchfink B, Xie C, Huson DH. 2015. Fast and sensitive protein align-
(Keena et al. 2008; Srivastava et al. 2021), whereas European gypsy                   ment using DIAMOND. Nat Methods. 12:59–60.
moth females lack this capacity for flight. (Gypsy moth males are                  Djoumad A, Nisole A, Zahiri R, Freschi L, Picq S, et al. 2017.
flight capable.) This results in a comparatively slower rate of host                   Comparative analysis of mitochondrial genomes of geographic
range expansion in North America for European gypsy moth pop-                          variants of the gypsy moth, Lymantria dispar, reveals a previously
ulations, and a corresponding increase in concern for preventing                       undescribed genotypic entity. Sci Rep. 7:14245.
the establishment of Asian gypsy moths. The genomic sequence                       Ellinghaus D, Kurtz S, Willhoeft U. 2008. LTRharvest, an efficient and
resources contributed by this study should, upon detailed func-                        flexible software for de novo detection of LTR retrotransposons.
tional comparisons with the genomes of LDA and LDJ in particu-                         BMC Bioinformatics. 9:18.
lar, help identify genetic determinants controlling flight in this                 Fu L, Niu B, Zhu Z, Wu S, Li W. 2012. CD-HIT: accelerated for cluster-
important nuisance insect. Such results could find potential use                      ing the next-generation sequencing data. Bioinformatics. 28:
as molecular diagnostic markers to assist with the identification                     3150–3152.
and interception of Asian gypsy moths in cargo containers, for in-                 Goodwin RH, Tompkins GJ, McCawley P. 1978. Gypsy moth cell lines
stance.                                                                               divergent in viral susceptibility. I. Culture and identification. In
                                                                                      Vitro. 14:485–494.
                                                                                   Gremme G, Brendel V, Sparks ME, Kurtz S. 2005. Engineering a soft-
Acknowledgments                                                                      ware tool for gene structure prediction in higher organisms. Inf
The authors thank Daniel Kuhar for technical assistance with in-                     Softw Technol. 47:965–978.
sect rearing and DNA extraction. The constructive suggestions                      Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, et al. 2003.
provided by two anonymous reviewers substantially improved                           Improving the Arabidopsis genome annotation using maximal
this report. Mention of trade names or commercial products in                        transcript alignment assemblies. Nucleic Acids Res. 31:
this publication is solely for the purpose of providing specific in-                 5654–5666.
formation and does not imply recommendation or endorsement                         Haas BJ, Papanicolaou A, Yassour M, Grabherr M, Blood PD, et al.
by the U.S. Department of Agriculture. The USDA is an equal op-                       2013. De novo transcript sequence reconstruction from RNA-Seq:
portunity provider and employer.                                                      reference generation and analysis with Trinity. Nat Protoc. 8:
                                                                                      1494–1512.
                                                                                   Hamelin RC, Roe AD. 2020. Genomic biosurveillance of forest inva-
Funding                                                                               sive alien enemies: a story written in code. Evol Appl. 13:95–115.
Primary funding for this work came from USDA-ARS base project                      Hebert FO, Freschi L, Blackburn G, Béliveau C, Dewar K, et al. 2019.
8042-22000-315-00-D. Additional sources of funding include                            Expansion of LINEs and species-specific DNA repeats drives ge-
Genome Canada’s Large-Scale Applied Research Project (LSARP                           nome expansion in Asian gypsy moths. Sci Rep. 9:16413.
project # 10106), Biosurveillance of Alien Forest Enemies                          Johnston JS, Bernardini A, Hjelmen CE. 2019. Genome size estimation
(bioSAFE; https://www.biosafegenomics.com/), and a Genomics                           and quantitative cytogenetics in insects. In: S Brown, M Pfrender,
Research and Development Initiative (GRDI; Government of                              editors. Insect Genomics, Methods in Molecular Biology. New
Canada) grant awarded to M.C.                                                         York, NY: Humana Press, p. 15–26.
6   |   G3, 2021, Vol. 11, No. 8

Keena MA, Côté M-J, Grinberg PS, Wallner WE. 2008. World distribu-         completeness with single-copy orthologs. Bioinformatics. 31:
    tion of female flight and genetic variation in Lymantria dispar          3210–3212.
   (Lepidoptera: Lymantriidae). Environ Entomol. 37:636–649.             Slater GSC, Birney E. 2005. Automated generation of heuristics for bi-
Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, et al. 2017.            ological sequence comparison. BMC Bioinformatics. 6:31.
    Canu: scalable and accurate long-read assembly via adaptive          Smit AFA, Hubley R. 2020. RepeatModeler. Version 2.0.1, obtained
    k-mer weighting and repeat separation. Genome Res. 27:                   from https://www.repeatmasker.org/RepeatModeler/
   722–736.,                                                             Smit AFA, Hubley R, Green P. 2019. RepeatMasker. Version 4.1.0,
Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with               obtained from http://www.repeatmasker.org/RepeatMasker/.
   Bowtie 2. Nat Methods. 9:357–359.                                     Sparks ME, Blackburn MB, Kuhar D, Gundersen-Rindal DE. 2013.
Leonard DE. 1981. Bioecology of the gypsy moth. In: Doane CC,                Transcriptome of the Lymantria dispar (gypsy moth) larval midgut
    McManus ML, editors. The Gypsy Moth: Research Toward                     in response to infection by Bacillus thuringiensis. PLoS One. 8:
    Integrated Pest Management. Technical Bulletin 1584—U.S. Dept.           e61190.
                                                                         Sparks ME, Brendel V. 2005. Incorporation of splice site probability

                                                                                                                                                      Downloaded from https://academic.oup.com/g3journal/article/11/8/jkab150/6261075 by guest on 15 October 2021
    of Agriculture (USA). Washington, DC: U.S. Department of
                                                                             models for non-canonical introns improves gene structure pre-
    Agriculture, Forest Service, Science and Education Agency,
                                                                             diction in plants. Bioinformatics. 21:iii20–30.
   Animal and Plant Health Inspection Service, p. 9–29.
                                                                         Sparks ME, Brendel V. 2008. MetWAMer: eukaryotic translation initi-
Lynn DE. 2000. Effects of long- and short-term passage of insect cells
                                                                             ation site prediction. BMC Bioinformatics. 9:381.
    in different culture media on baculovirus replication. J Invertebr
                                                                         Srivastava V, Keena MA, Maennicke GE, Hamelin RC, Griess VC.
   Pathol. 76:164–168.
                                                                             2021. Potential differences and methods of determining gypsy
Lynn DE. 2006. Lepidopteran cell lines after long-term culture in al-
                                                                             moth female flight capabilities: implications for the establish-
    ternative media: comparison of growth rates and baculovirus
                                                                             ment and spread in novel habitats. Forests. 12:103.
   replication. In Vitro Cell Dev Biol Anim. 42:149–152.
                                                                         Stanke M, Schöffmann O, Morgenstern B, Waack S. 2006. Gene pre-
Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for next-gen-
                                                                             diction in eukaryotes with a generalized hidden Markov model
    eration sequencing data. Genomics. 95:315–327.                           that uses hints from external sources. BMC Bioinformatics. 7:62.
Price AL, Jones NC, Pevzner PA. 2005. De novo identification of repeat   Triant DA, Cinel SD, Kawahara AY. 2018. Lepidoptera genomes: cur-
   families in large genomes. Bioinformatics. 21:i351–358.                   rent knowledge, gaps and future directions. Curr Opin Insect Sci.
Roach MJ, Schmidt SA, Borneman AR. 2018. Purge Haplotigs: allelic            25:99–105.
    contig reassignment for third-gen diploid genome assemblies.         UniProt Consortium. 2019. UniProt: a worldwide hub of protein
   BMC Bioinformatics. 19:460.                                               knowledge. Nucleic Acids Res. 47:D506–D515.
Roe AD, Torson AS, Bilodeau G, Bilodeau P, Blackburn GS, et al. 2019.    Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, et al. 2014. Pilon: an
    Biosurveillance of forest insects: part I—integration and applica-       integrated tool for comprehensive microbial variant detection
    tion of genomic tools to the surveillance of non-native forest           and genome assembly improvement. PLoS One. 9:e112963.
   insects. J Pest Sci. 92:51–70.,                                       Zhang J, Cong Q, Rex EA, Hallwachs W, Janzen DH, et al. 2019. Gypsy
Sahara K, Yoshido A, Traut W. 2012. Sex chromosome evolution in              moth genome provides insights into flight capability and viru-
   moths and butterflies. Chromosome Res. 20:83–94.                          s-host interactions. Proc Natl Acad Sci USA. 116:1669–1678.
   ~ o FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov
Sima
    EM. 2015. BUSCO: assessing genome assembly and annotation                                                    Communicating editor: B. Oliver
You can also read