Evolution of Viral Proteins Originated De Novo by Overprinting
Niv Sabath,*,1,2 Andreas Wagner,1,2,3 and David Karlin4
1Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
2The Swiss Institute of Bioinformatics, Basel, Switzerland
3The Santa Fe Institute, Santa Fe, New Mexico
4Oxford University, South Parks Road, Oxford, United Kingdom
*Corresponding author: E-mail: [email protected].
Associate editor: Daniel Falush
Abstract
New protein-coding genes can originate either through modification of existing genes or de novo. Recently, the importance of de novo origination has been recognized in eukaryotes, although eukaryotic genes originated de novo are relatively rare and difficult to identify. In contrast, viruses contain many de novo genes, namely those in which an existing gene has been “overprinted” by a new open reading frame, a process that generates a new protein-coding gene overlapping the ancestral gene. We analyzed the evolution of 12 experimentally validated viral genes that originated de novo and estimated their relative ages. We found that young de novo genes have a different codon usage from the rest of the genome. They evolve rapidly and are under positive or weak purifying selection. Thus, young de novo genes might have strain-specific functions, or no function, and would be difficult to detect using current genome annotation methods that rely on the sequence signature of purifying selection. In contrast to young de novo genes, older de novo genes have a codon usage that is similar to the rest of the genome. They evolve slowly and are under stronger purifying selection. Some of the oldest de novo genes evolve under stronger selection pressure than the ancestral gene they overlap, suggesting an evolutionary tug of war between the ancestral and the de novo gene.
Keywords: overlapping genes, de novo origin, new genes.
© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Mol. Biol. Evol. 29(12):3767–3780 doi:10.1093/molbev/mss179 Advance Access publication July 19, 2012
Introduction
Novel protein-coding genes can have two fundamental origins (reviewed in Long et al. 2003; Babushok et al. 2007; Zhou and Wang 2008; Bornberg-Bauer et al. 2010; Kaessmann 2010; Tautz and Domazet-Loso 2011). In the first, a gene originates by modification of an existing gene, for example, through gene duplication, exon shuffling, gene fusion, horizontal gene transfer, or transposition. In the second, a gene originates de novo. This mechanism was thought to be highly improbable (Ohno 1970; Jacob 1977), but recent studies have provided experimental evidence that it may be frequent. De novo origination can take place in a previously noncoding region, such as an intergenic region (Cai et al. 2008; Toll-Riera et al. 2009b; Li, Zhang, et al. 2010), or an intron (Sorek 2007). However, a gene can also originate de novo from an open reading frame that already encodes a protein, by a mechanism called “overprinting”, in which mutations lead to the expression of a second reading frame overlapping the first one (Ohno 1984; Keese and Gibbs 1992; Rancurel et al. 2009; Li, Dong, et al. 2010). Genome-scale computational analyses or experimental analyses of RNA transcripts have proposed many candidate genes originated de novo through these mechanisms (Levine et al. 2006; Begun et al. 2007; Zhou et al. 2008; Knowles and McLysaght 2009; Chen et al. 2010; Wu et al. 2011; Yang and Huang 2011).
Most studies in this area have focused on eukaryotes, which are not necessarily the best organisms to study de novo genes. First, the incidence of de novo gene origination may be relatively low in eukaryotes, ranging from 2 to 12% of all new gene origination events according to recent estimates (Zhou et al. 2008; Toll-Riera et al. 2009a; Ekman and Elofsson 2010). Second, direct experimental evidence for the expression of the proteins encoded by candidate de novo genes is not always available in eukaryotes; some might be artifacts of genome annotation (Wang et al. 2003). Third, most eukaryotic candidate genes are structurally and functionally poorly characterized. Finally, current protocols to identify genes created de novo from noncoding sequences in eukaryotic genomes focus on genes with similarity to genes already annotated in the genome sequence, whereas some de novo genes may not be currently annotated, even as hypothetical, which would preclude their discovery (Guerzoni and McLysaght 2011).
The identification of de novo genes in viruses does not suffer from these problems, or only to a much lesser extent. This holds especially for genes generated by overprinting. Overlapping genes are very common in viral genomes (Belshaw et al. 2007; Chirico et al. 2010), providing an abundant source of such de novo genes. In addition, in most cases, the expression of their protein product has been proven, and their function is at least partly known (Rancurel et al. 2009). Finally, using overlapping genes allows the identification of proteins originated de novo with high reliability (see later), by avoiding the confounding factors that limit current approaches to identify proteins generated de novo from
noncoding sequences (Guerzoni and McLysaght 2011). For brevity, we will refer here to de novo genes as genes that originated through overprinting.
Viral de novo genes often encode proteins that play a role in viral pathogenicity or spreading, rather than proteins central to viral replication or structure (Li and Ding 2006; Rancurel et al. 2009). The majority of these proteins are predicted to be structurally disordered, i.e., they lack a stable three-dimensional structure (Dyson and Wright 2005; Tompa 2005; Sickmeier et al. 2007), but those that are ordered have intriguing structural features (Rancurel et al. 2009). For instance, the protein p19, originated de novo in the plant virus family Tombusviridae (Rancurel et al. 2009), has a previously unknown tertiary structure and a previously unknown mode of binding to small interfering RNAs (Vargason et al. 2003). This suggests that de novo gene origination can lead to evolutionary innovations in protein structure and function (Rancurel et al. 2009; Bornberg-Bauer et al. 2010; Kaessmann 2010; Abroi and Gough 2011).
A prerequisite to identify a de novo gene is that it must have a monophyletic distribution in one clade, the “focal” clade (fig. 1a, taxa T1, T2, and T3), while being absent from organisms outside this clade (fig. 1a, taxa T4 and T5). We note that this prerequisite is necessary but not sufficient. Genes fulfilling it may have an ancient origin, older than the focal clade, but they might have diverged beyond recognition outside this clade (Elhaik et al. 2006). Alternatively, they may have entered the focal clade through horizontal gene transfer. These confounding factors can be easily excluded for genes that arose by overprinting (fig. 1b, blue arrows) within a pre-existing “ancestral” reading frame (red arrows). Specifically, if the ancestral gene is present outside the focal clade (e.g., taxa T4 and T5 in fig. 1b), one can exclude divergence beyond recognition and horizontal gene transfer, because in either case, the ancestral gene would not be present outside the focal clade.
FIG. 1. Monophyletic distribution of genes originated de novo. (a) A gene that originated de novo (blue arrows) will exhibit a monophyletic distribution among related taxa. However, this distribution could also be the result of divergence of the gene beyond recognition or of acquisition of the gene through horizontal gene transfer (HGT). (b) For a gene that originated de novo (blue arrows) by overprinting an ancestral reading frame (red arrows), these confounding factors can be excluded (see Introduction). Colors are displayed in the electronic version of the article.
Taking the above considerations into account, we ask the following questions about the evolutionary dynamics of de novo genes: Do de novo genes adapt to their genome, and if so, how rapidly? What is their rate of evolution? How do the de novo genes influence the genes that they overlap? Do de novo genes contribute to viral fitness? To answer these questions, we analyzed the evolution of 12 independent, experimentally validated de novo genes in RNA viruses. We estimated their relative age and compared their evolutionary dynamics to that of the ancestral gene from which they originated.
Results
As described in Materials and Methods, we assembled a data set of 12 experimentally validated pairs of overlapping protein-coding genes (table 1), in which the ancestral and the de novo genes could be unambiguously identified from the phylogenetic distribution of their homologs. We predicted the structural and functional organization of their protein products (fig. 2, discussed further later, see also Materials and Methods). All overlapping regions were longer than 220 nucleotides (table 1). Among the 12 de novo genes in our data set, nine genes overlap completely with their ancestral gene, whereas three overlap only partially: the machlomovirus p31 gene, the omegatetravirus p17 gene, and the ilarvirus 2b gene (fig. 2).
Three Quantifiers Are Useful to Describe the Evolutionary Dynamics of Overlapping Genes
We investigated three properties of the ancestral and de novo genes, and the proteins they encode (see Materials and Methods). The first property is the relative sequence divergence, a proxy for the rate at which a protein changes its sequence. Relative divergence values above 1 indicate that a coding region evolves faster than a reference sequence, in our case the full-length sequence of the ancestral gene of the pair considered.
The second property is the selective constraint (dN/dS), which estimates the strength of purifying selection on a gene by its ratio of nonsynonymous to synonymous nucleotide substitutions. Values of dN/dS below 1 are evidence of purifying selection whose strength increases with decreasing dN/dS. In principle, values of dN/dS exceeding 1 might suggest that a gene evolves under positive selection (Nei and Gojobori 1986). However, the method we used to calculate dN/dS (Sabath et al. 2008) does not test statistically for positive selection, which is often limited to few sites within the gene (Nielsen and Yang 1998; Zhang et al. 2005). Therefore, values
of dN/dS above 1 should be taken to indicate either neutral evolution or positive selection.
Table 1. Overlapping Genes in the Study.
Clade | Genome Accession Number | Family | Genus | Species | Taxonomic Distribution of the Overlap | Ancestral Frame | De Novo Frame | Number of Sequences (or Sequence Pairs) in the Analyses (divergence, dN/dS, and CSI) | Length of the Overlapping Region (nt)
1 | NC_007358 | Orthomyxoviridae | Influenzavirus A | Influenza A virus H5N1 | Single species | PB1 | PB1-F2 | 10, 9, and 5 | 273
2 | NC_001366 | Picornaviridae | Cardiovirus | Theilovirus | 3 species in same genus | Polyprotein | Protein L* | 3, 2, and 3 | 468
3 | NC_003045 | Coronaviridae | Betacoronavirus | SARS coronavirus | 2 genotype groups | Nucleocapsid (N) | Protein I | 70, 33, and 11 | 624
4 | NC_005899 | Tetraviridae | Omegatetravirus | Dendrolimus punctatus tetravirus | Whole genus | Capsid protein | p17 | 3, 1, and 3 | 382
5 | NC_004063 | Tymoviridae | Tymovirus | Turnip yellow mosaic virus | Whole genus | Replicase | ORF69 (movement protein) | 120, 1, and 16 | 1,880
6 | NC_004366 | - | Umbravirus | Tobacco bushy top virus | Whole genus | ORF3 (movement protein) | ORF3 (long distance movement protein) | 10, 0, and 5 | 698
7 | NC_003809 | Bromoviridae | Ilarvirus | Spinach latent virus | Whole genus | Polymerase | Protein 2b | 15, 6, and 6 | 308
8 | NC_003627 | Tombusviridae | Machlomovirus | Maize chlorotic mottle virus | Whole genus (contains a single species) | Coat protein | p31 | 3, 0, and 3 | 451
9 | NC_009025 | Dicistroviridae | Aparavirus | Israel acute paralysis virus of bees | Whole genus | Structural polyprotein | Pog | 6, 3, and 4 | 312
10 | NC_001498 | Paramyxoviridae | Morbillivirus | Measles virus | 2 genera in same family | Phosphoprotein (P) | Protein C | 10, 1, and 5 | 561
11 | NC_003977 | Hepadnaviridae | Orthohepadnavirus | Hepatitis B virus | Whole family (contains 2 genera) | Polymerase (P) | Large envelope protein (L) | 28, 13, and 8 | 834
12 | NC_003448 | Nodaviridae | Betanodavirus | Striped Jack nervous necrosis virus | Whole genus | Protein A | Protein B | 10, 9, and 6 | 228
The third and final property is the Codon Similarity Index (CSI), which measures the similarity between the codon usage of a gene and that of the rest of the genome containing it. CSI is based on the same calculations as the Codon Adaptation
cI(r(dnSesoehfdmeeWaenrrxpeMopenav(arficCoaitersAnesogtrsdInei)eac,n.tLloaOseiimnacv1psnoep9tardmea8iarar7lsMmedl).,ddoTeobtetnafuhhblntaeyoloseduudevsssa2o)teet.hsdagsrueetsmmnheeeteepmsaorrseoaefuvrpshirotzeeilegrvotsohefifeltystshcihgoefeeonxdrripofigearnecselunalsunlasotesntsmadlcygoeeegffsaetabstrnsthaiaeelia/sssr
(1atg1phne00ean(cid:3)(cid:3)cieer34mess)4t,)de(r,0aaaal.nnt6rwd2geCe)oSneu-isIxsenihsodldo.ifebewTdarithn(sec6Wltoe%rwmsoi)ltnce,raoagrwglxenhCorgietneSusrnIeedealsveesisagcl(tntou0hievfe.e6dest6dh)(cirePfoaafen<nndrskdietf5fnret.at4chreieen(cid:2)asntttb,c1eo(e0tPPfw(cid:3)bd<<e5ee)et72nwnt..19hoet(cid:2)(cid:2)eavhnnoe
mean relative divergence of the ancestral and de novo genes is
hodbRyfeiNgaWhnAuno(sce4vipena3osogs%tlkgry)aetem,hldnaeaeeshnriooassdewsrtieqdhgutieenhdedaenontsiomcfeefevdeapor.irdenTognivopecoeneetrfrehgbtsieeiesena(tse4cwcnd7hee%edepo,g)nwe.fentnethdoheemeosstneeiRmlteNh(casAetteieo-teddinmeMtipenheatissnteinendtrcisemieaintlaeyst
and Methods). We then plotted the three properties against this estimated age in figures 3 and 4. The horizontal axis of
fgntsgaiheneegalennutdeeeecrrdetssCeis,lvdS(a3ewlIetaiabhv(ncnfieeoodglrnvede.4odsai4tvscrg)ebao.erytignrnNhreevteeonssisrtcpsaCeaeholrSoetng(Ihwfiedciangssnat.ultct3costouaa)ht,ll)etcah,whutesreeileetaodlhllteeraeiffctgttodit.hivrinTveefaephotmdaericoiiorvvonsseenisnrrttsogtigtmfirerlceeanhaeciclnoegoaemtnefxnt(adiolsfieynelsgsodhno.go(ro3otsiwbvhgeuoi)eess-,
scaewwMponrheaaorcBrlertneeeteetsdrhsrltdeopieraintwaofrlofdslne,tgsrahdvw.eenaeWnneldtutseofie.lesMorsuvps(ertsaoeeetlsbehuddxtoe)oaaasdfaimnnsntfa)ehoid.lndyerTesdihtfsseotwiuhnornosegf,poslcdreeavoeoiogrvtgtshraesg,errnewsieinanseehinseof.epcsingreRr(euoble(airpAnglesuereNede3rst)sCaoisienitOinaossndVensidnaieAccplabi)fhnatigterecopuaosgatrtreeenefronlesey4tr--l,
before considering them together and synthesizing our observations.
Relative Divergence
Fvo(isriefegrsddugige)renennisicofi3enavceaofpoanprrrtlerlytyoshethedneoiiptfnrfsaieszifror(oesbnrnlotuteafealfa)rc(n.ohaTcmltheohsevtzoerreuarerlglgoaphrp,erPstothsh<tieoeein0nrre.esl0giln0(arre1eteis2vods)efi)otaasnhneneqddcauoteneehqncfefiueccspaetilaeodintrroisst-
one, suggesting that on average the overlapping regions of the ancestral proteins evolve at a similar rate as the full-length ancestral proteins. In contrast, de novo proteins (blue) show a higher relative divergence than their ancestral overlapping proteins. Accordingly, the slopes of the two regression lines
FIG. 2. Structural and functional organization of the overlapping genes we studied. Proteins encoded by overlapping genes are shown to scale. For each protein pair, the ancestral protein is shown on the bottom and the de novo protein on top. B1, base domain 1; cc, coiled coil; Le, Leader region; PA2, phospholipase A2 domain; RdRP, RNA-dependent RNA polymerase domain; tm, transmembrane segment; z, zinc-binding region. The figure legend distinguishes ordered from disordered protein regions; scale bar, 100 aa.
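The Codon Similarity Index introduced in the Results is described as using the same calculations as the Codon Adaptation Index (Sharp and Li 1987), with the rest of the genome as the reference. The following is a minimal sketch of a CAI-style calculation under that description, not the authors' implementation: the function names, the pseudo-weight for codons absent from the reference, and the exclusion of Met, Trp, and stop codons are assumptions made for illustration.

```python
from collections import Counter
from math import exp, log

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def codons(seq):
    """Split an in-frame nucleotide sequence into codons (RNA is accepted)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs):
    """Weight of each codon: its count divided by the count of the most
    frequent synonymous codon in the reference set (assumed here to be the
    coding sequences of the rest of the genome)."""
    counts = Counter(c for s in reference_seqs for c in codons(s)
                     if c in CODON_TABLE)
    best = {}
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best.get(aa, 0), n)
    return {codon: n / best[CODON_TABLE[codon]] for codon, n in counts.items()}


def codon_similarity_index(gene_seq, weights, pseudo=0.01):
    """Geometric mean of the codon weights over the gene.
    `pseudo` (an assumption) stands in for codons unseen in the reference."""
    logs = []
    for c in codons(gene_seq):
        aa = CODON_TABLE.get(c)
        if aa is None or aa in "*MW":  # skip ambiguous, stop, Met, Trp codons
            continue
        logs.append(log(weights.get(c, pseudo)))
    return exp(sum(logs) / len(logs)) if logs else float("nan")


if __name__ == "__main__":
    # Toy reference in which GCT is the preferred alanine codon.
    w = relative_adaptiveness(["GCTGCTGCTGCC"])
    print(codon_similarity_index("GCTGCT", w))
    print(codon_similarity_index("GCCGCC", w))
```

A gene that uses only the codons preferred by the reference scores close to 1, and a gene that uses rare codons scores lower, which matches the reading of high and low CSI values given for figure 4.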
are significantly different (P < 3.2 × 10^-25). The range of values of the relative divergences between pairs of ancestral proteins is generally narrow or moderate. For instance, the relative divergences of the de novo gene of tymovirus (taxon 5), which encodes the movement protein, range between 1.2 and 2.8 and thus vary by less than a factor of three. Two notable exceptions are the youngest de novo proteins of taxa 1 and 3 (respectively, influenzavirus A PB1-F2 and betacoronavirus protein I), which exhibit a considerable range of divergences (from 0 up to 10.4).
Table 2. Mean Values of Three Evolutionary Properties for Ancestral and De Novo Genes.
Property | Ancestral(a) | De Novo(a) | P
CSI | 0.66 (0.09) | 0.62 (0.11) | 5.4 × 10^-5
Relative divergence | 1.00 (0.40) | 1.85 (1.20) | 7.1 × 10^-34
Selection intensity | 0.46 (0.52) | 0.75 (0.55) | 2.9 × 10^-4
(a) Numbers in parentheses are standard deviations.
FIG. 3. Evolutionary dynamics of ancestral (red) and de novo genes (blue). The vertical axes show (a) relative divergence and (b) selective constraint (dN/dS) for the 12 taxa. The horizontal axis represents the evolutionary distance from the origin of each de novo gene (i.e., the estimated age of genes within the clade). Regression lines are plotted for visualization of general trends. Low dN/dS values represent strong selective constraints (see text). Note that dN/dS in (b) could only be calculated for gene pairs that have less than 50% divergence at the amino acid level (see Materials and Methods). No selective constraint data could be calculated for cases 6 and 8 (bottom panel) as the sequence pairs in these clades have all diverged beyond 50%. Where neighboring groups had similar ages, we shifted their position slightly for visual clarity (groups 5 and 6).
Selective Constraint
Figure 3b presents the selective constraint (dN/dS) for the ancestral and de novo genes in each group of viruses we studied. The ancestral and de novo genes are subject to very different constraints, as attested by the significant difference (P < 4.9 × 10^-13) between the slopes of their respective regression lines, and the fact that the ancestral regression line has a positive slope, whereas the de novo regression line has
a negative slope. The low values of dN/dS (below 1) for most genes indicate that they are under strong selective constraints (purifying selection), i.e., mutations that change the protein sequence are likely to be selected against. In several viruses, the values of dN/dS for the ancestral genes are particularly low, suggesting an extreme functional or structural constraint. For example, the ratio dN/dS of the ancestral gene in viruses 1, 4, and 5 is below 0.05. In contrast, in some cases, either the de novo gene (the influenzavirus A PB1-F2, the betacoronavirus protein I, and the tymovirus movement protein, respectively, from groups 1, 3, and 5) or the ancestral gene (the ilarvirus polymerase and the betanodavirus protein A, from groups 7 and 12, respectively) exhibits dN/dS ≥ 1, which indicates neutral evolution or positive selection. As explained at the beginning of the Results section, the test we employ cannot distinguish between them. For all groups, the trends in relative divergence (fig. 3a) and selective constraints (fig. 3b) were consistent with one another. For example, the de novo proteins of cardioviruses are more highly diverged than the ancestral proteins and they also show a lower selective constraint.
Codon Similarity Index
Figure 4 presents the CSI values of ancestral and de novo genes. The slopes of the two regression lines show a significant difference (P < 0.032). The slope of the ancestral regression line is not significantly different from zero (P > 0.96), whereas the de novo regression line has a positive slope (P < 0.003). Relative to their ancestral overlapping genes, most de novo genes show lower CSI values, although the ranges of CSI values in de novo and ancestral genes overlap markedly in most groups. An exception is omegatetraviruses (taxon 4), in which the CSI values of the de novo p17 genes are all higher than the CSI values of the ancestral capsid genes. The CSI values for homologous genes can take a wide range of values. For instance, the CSI of the betacoronavirus I gene varies from 0.27 to 0.67. Overall, the data suggest that the codon usage of de novo genes becomes slowly assimilated into that of the host genome.
FIG. 4. Codon Similarity Index (CSI) of ancestral (red) and de novo genes (blue). The horizontal axis represents the evolutionary distance from the origin of each de novo gene (as in fig. 3). Regression lines are plotted for visualization of general trends. High CSI values indicate high similarity between the codon usage of a gene and the codon usage of the rest of a genome. Colors are displayed in the electronic version of the paper.
De Novo Genes Have Different Properties Depending on Their Age of Origin
Overall, we found that the differences between ancestral and de novo genes appear to decrease with time. Taxa 1–5, with the shortest distance from the origin (corresponding to the youngest de novo genes), all exhibit a similar pattern (fig. 3): the de novo genes show higher relative divergence with respect to their ancestral overlapping genes and weaker selective constraint.
The pattern that contrasts most with that of taxa 1–5 occurs in taxa 7 and 12 (Ilarvirus and Betanodavirus, respectively), where the de novo genes have a more ancient origin. Here, the de novo genes evolve more slowly and exhibit stronger purifying selection relative to their ancestral overlapping genes (fig. 3). In other words, here mutations that change the sequence of the de novo proteins are more deleterious, on average, than mutations that change the sequence of the overlapping ancestral proteins.
Overall, our observations suggest that older de novo genes are more adapted to their genome and evolve under stronger purifying selection. This inference is supported by experimental data on the fitness effects of mutations in de novo genes of different ages (table 3). Mutations in the youngest de novo genes (taxa 1–3) have little to moderate effect, whereas
Table 3. Evidence of Expression, Function, and Fitness Effect of the De Novo Genes in the Study.
Group | Genus | Evidence for Expression | Function(s) | Fitness Effect When the Novel Gene Is Suppressed | Description of Effect and References
1 | Influenzavirus A | Chen et al. (2001b) | Virulence factor (Zamarin et al. 2006). Involved in regulation of polymerase activity (Mazur et al. 2008). Function seems strain-specific and host-specific and is disputed (Krumbholz et al. 2011). | Little or no effect | Suppression of PB1-F2 neither affected viral replication nor virus loads in the lungs of mice (McAuley et al. 2010).
2 | Cardiovirus | van Eyll and Michiels (2002) | Involved in the establishment of permanent infections of the central nervous system (Chen et al. 1995); antiapoptotic effect in cell culture (Ghadge et al. 1998). | Moderate effect | Suppression of L* decreases the ability of Theiler's virus to induce a chronic infection of the central nervous system (Stavrou et al. 2010).
3 | Betacoronavirus | Senanayake et al. (1992) and Chen et al. (2001a) | Unknown | Little or no effect | Suppression of Protein I expression leads only to a reduced plaque size, suggesting a minor effect on fitness (Fischer et al. 1997).
4 | Omegatetravirus | Hanzlik et al. (1995) | Unknown | Unknown | Unknown
5 | Tymovirus | Weiland and Dreher (1989) | Viral movement through the plant (Bozarth et al. 1992). | Severe effect | A knock-out mutant of the movement protein replicates only at low levels in protoplasts (Weiland and Dreher 1989).
6 | Umbravirus | Ryabov et al. (1998) | Long-distance (systemic) movement in plants (Ryabov et al. 1999); stabilizes viral genomic RNA. | Severe effect | Long-distance movement is abolished in the absence of ORF4 plants (Ryabov et al. 1999).
7 | Ilarvirus | Xin et al. (1998) | Unknown | Unknown | Unknown
8 | Machlomovirus | Scheets (2000) | Unknown | Unknown | Unknown
9 | Aparavirus | | Unknown | Unknown | Unknown
10 | Morbillivirus | Wardrop and Briedis (1991) | Virulence factor (Patterson et al. 2000). | Severe effect | Suppression of C results in much milder symptoms and lower mortality in mice (Patterson et al. 2000).
11 | Orthohepadnavirus | Peterson (1981) | Viral envelope glycoprotein (Beck and Nassal 2007). | Severe effect | Deletions within the S domain of the envelope protein drastically reduce infectivity (Le Duff et al. 2009).
12 | Betanodavirus | Iwamoto et al. (2005) | Blocks RNA interference (Fenner et al. 2006). | Severe effect | Suppression of B2 causes a severe impairment in the intracellular accumulation of viral RNA in cell culture (Fenner et al. 2006).
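The qualitative reading of dN/dS used throughout the Results can be written down as a small helper. This is only a sketch of the interpretation stated in the text (values below 1 indicate purifying selection, values of 1 or more indicate neutral evolution or positive selection, which the authors' test cannot tell apart). It does not estimate dN/dS, which for overlapping genes requires a dedicated method such as that of Sabath et al. (2008). The function names are hypothetical.

```python
def interpret_dnds(dnds):
    """Return the qualitative reading of a dN/dS estimate used in the text."""
    if dnds < 0:
        raise ValueError("dN/dS cannot be negative")
    if dnds < 1.0:
        # The lower the ratio, the stronger the purifying selection.
        return "purifying selection"
    # The text stresses that these two cases cannot be distinguished.
    return "neutral evolution or positive selection"


def more_constrained(ancestral_dnds, de_novo_dnds):
    """Name the gene of an overlapping pair under stronger selective
    constraint, i.e., the one with the lower dN/dS."""
    if ancestral_dnds < de_novo_dnds:
        return "ancestral"
    if de_novo_dnds < ancestral_dnds:
        return "de novo"
    return "equal"


if __name__ == "__main__":
    # Mean values reported in table 2: ancestral 0.46, de novo 0.75.
    print(interpret_dnds(0.46), "/", interpret_dnds(0.75))
    print(more_constrained(0.46, 0.75))
```

With the mean values of table 2, both gene classes fall under purifying selection and the ancestral genes are the more constrained on average, which is the general pattern that the oldest de novo genes reverse.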
suppression of older de novo genes (taxa 6 and 10–12) has severe effects.
Discussion
Several previous studies have examined the codon usage of individual or small groups of overlapping genes (McGeoch et al. 1985; Keese and Gibbs 1992; Pavesi et al. 1997; McVeigh et al. 2000; Lee et al. 2010), their rates of evolution (Mizokami et al. 1997; Sanz et al. 1999; Jordan et al. 2000; Fujii et al. 2001; Nekrutenko et al. 2005; McGirr and Buehring 2006; Hernandez et al. 2010), and selective constraints (Fujii et al. 2001; Hughes et al. 2001; Guyader and Ducray 2002; Li et al. 2004; Hughes and Hughes 2005; Narechania et al. 2005; Campitelli et al. 2006; Holmes et al. 2006; McGirr and Buehring 2006; Obenauer et al. 2006; Pavesi 2006; Suzuki 2006; Pavesi 2007; Zaaijer et al. 2007). Overall, our observations agree with those of previous studies: ancestral and de novo genes differ in these properties. Nevertheless, we were able to improve on these studies in several ways. First, by examining the phylogenetic distribution of overlapping gene pairs, we were able to identify reliably (see later) which of the two genes is ancestral and which one is the de novo gene. Second, our new method allowed us to estimate the relative age of origin of de novo genes and to correlate this age with several quantifiers of their evolutionary dynamics. Thus, it allowed us to analyze how the evolutionary forces affecting de novo genes change over time. Third, we used a larger data set than most studies mentioned earlier, which were carried out on individual genes. Fourth, we used a method specifically tailored to overlapping genes (Sabath et al. 2008) to study the selective constraints these genes are subjected to. Other methods can give misleading results when applied to overlapping genes (Holmes et al. 2006; Suzuki 2006; Pavesi 2007; Sabath et al. 2008).
De Novo Genes Adapt to Their Genomes
Our results suggest that de novo genes do adapt to their genome. More specifically, de novo genes evolve very rapidly shortly after their origin. As they age, they tend to experience increasingly severe selective constraints, and their codon usage tends to approach that of the ancestral gene from which they originate.
Our results are consistent with population genetics theory (Hartl and Clark 1997). Viruses have large population sizes. At such large sizes, natural selection is highly efficient, which has two consequences regarding de novo genes. First, they are likely to become fixed in a population only if they provide some selective advantage. Second, even though a de novo gene might initially only provide a very small fitness benefit, in a large population this fitness benefit can be sufficient to cause the gene’s fixation. Immediately after its origin, the sequence of a de novo gene will typically be far from optimal for the (rudimentary) function it provides, unlike a gene originated through modification of an existing gene. Consequently, one would expect a de novo gene to evolve rapidly shortly after its origin and to become better adapted as it ages, resulting in increased selective constraints and a decreased rate of sequence change.
Our results also suggest that in general, the ancestral genes are more constrained in sequence than the de novo genes which overlap them. However, this pattern can be reversed in old de novo genes. In particular, three de novo genes of our data set, which encode the ilarvirus 2b protein, the morbillivirus C protein, and the betanodavirus B protein (taxa 7, 10, and 12), are subject to stronger selective constraints than their ancestral genes (fig. 3b). We speculate that this reversal could reflect an evolutionary “tug of war” between the two genes over the dominance of the sequence. An ultimate “victory” of a de novo gene in this tug of war would consist in the disappearance of the ancestral gene. We speculate that some viral genes that do not overlap any other gene today may have originated through overprinting but have eventually lost the overlap. A similar scenario has been proposed for an overlapping region of two protein-coding genes within the genome of archaeal Thermoplasma (Rogozin et al. 2002). These overlapping genes are nonoverlapping in other related genomes, possibly because of duplication and consequent loss (Rogozin et al. 2002).
Our Evolutionary Inferences Are Robust and Biologically Coherent
For most overlapping gene pairs in our data set, the identification of the de novo gene is highly reliable, as the ancestral gene has a much wider phylogenetic distribution than the de novo gene. For instance, the de novo movement protein of tymoviruses (taxon 5) has homologs only in this genus, whereas its ancestor, the methyltransferase-guanylyltransferase, has homologs in over a dozen families (Rozanov et al. 1992).
One important caveat of our analysis is that we are unable to estimate the absolute age of de novo genes but only their age relative to the divergence of a viral housekeeping gene that encodes the RNA-dependent RNA polymerase. Our relative age estimates of de novo genes, however, are broadly consistent with their taxonomic distribution (table 1, column 6): as expected, young de novo genes show a more restricted distribution than older de novo genes. The youngest genes (groups 1–3 in fig. 3) are found in less than one genus, with gene 1 occurring only in a single species (these observations are not due to sequencing bias, as each taxon considered contains several species or genera). Conversely, all oldest de novo genes (6–12) are found at least throughout a whole genus (in two genera for cases 10 and 11).
A case where gene age may have been overestimated is that of taxon 3 (betacoronavirus I gene). Our analysis suggested that the I and ORF9b genes have a common origin (see supplementary information, Supplementary Material online, case 3) despite their different lengths (98 aa and 207 aa, respectively) and lack of significant sequence similarity (not shown). This inference is based on assuming functionality for the unannotated ORFs in five closely related genomes (supplementary fig. S4, Supplementary Material online). Consequently, we calculated the age of the I gene by considering
the node common to I and ORF9b (marked with blue line, supplementary fig. S5, Supplementary Material online, group 3). In the alternative scenario where these two genes have independent origins, their estimated age would be reduced. Nevertheless, the overall pattern of the results would remain unchanged (not shown). Another caveat to our study is that RNA secondary structure and selection pressure for high protein expression level may be partly responsible for the differences we observed in codon usage (Plotkin and Kudla 2011).

Finally, the estimated rate of evolution of most ancestral and de novo proteins (fig. 3) is generally consistent with their function or effect on viral fitness (table 3). For instance, the ancestral methyltransferase-guanyltransferase of tymoviruses experiences severe constraints, as expected from an enzyme, whereas the de novo protein I of betacoronaviruses, which is dispensable for replication, experiences low or no constraints. We note two exceptions. First, the orthohepadnavirus replicase gene, encoding an essential reverse transcriptase function, is not subject to very strong selective constraint (e.g., dN/dS between 0.28 and 0.78, taxon 11 in fig. 3b). However, this discrepancy is readily explained. The overlapping region of the replicase gene in fact encodes two domains (fig. 1): a disordered, hypervariable linker and the reverse transcriptase domain. The relaxed selective constraint is the result of a high dN/dS for the linker (average of 0.94) and a very low dN/dS for the reverse transcriptase domain (average of 0.15). Second, the tymovirus movement protein gene has a dN/dS of 1, suggesting an absence of selective pressure, despite encoding an important function that allows the spread of viral RNA between cells. Again, this discrepancy can be attributed to the fact that the movement protein consists of a slowly evolving region (around aa 1–400) and a fast-evolving region (C-terminal 200 aa) (results not shown).

Young De Novo Genes Might Have Strain-Specific Functions and Be Difficult to Detect by Sequence Analysis

Our results have two practical implications. First, they suggest that recently evolved genes might have strain-specific functions, or possibly no function, as suggested previously (Trifonov and Rabadan 2009) for PB1-F2 (taxon 1), the youngest de novo gene in our data set. Experimental studies should thus take this possibility into account. Our current method to estimate dN/dS does not tell us whether the elevated dN/dS values observed in some young de novo genes indicate neutral evolution or positive selection. However, it is possible that they come not only from neutral mutations but also from beneficial mutations subject to positive selection, which would reflect evolutionary adaptations.

Second, our results have implications for the identification of overlapping genes. Current bioinformatics methods to detect overlapping genes use the signature of purifying selection (Firth and Brown 2005, 2006; Sabath et al. 2009; Sabath and Graur 2010). These recently developed methods have had great success and led to discoveries in many viral taxa (Chung et al. 2008; Firth 2008; Firth and Atkins 2008a, b; Sabath et al. 2009; Firth and Atkins 2009a, b, c; Firth and Atkins 2010; Firth et al. 2010). However, the signature of purifying selection is mostly absent in young de novo genes. Thus, the number of young de novo genes may be much larger than it appears, because these methods simply cannot detect many such genes.

Conclusion and Perspectives

In closing, we point to several directions for future research. Approaches to estimate selection pressures in overlapping genes (e.g., Sabath et al. 2008) lag behind those for nonoverlapping genes (reviewed in Anisimova and Kosiol 2009), which can detect lineage-specific and site-specific selection pressures. The development of advanced methods could, for instance, reveal the role of positive selection in de novo gene origination and perhaps predict interactions between proteins encoded by overlapping genes, such as the Rz and Rz1 genes of bacteriophage lambda (Zhang and Young 1999). Finally, further research is needed to shed light on how exactly de novo overlapping genes originate and become established, i.e., the mutational events that result in their expression, their frequency, and their effects on viral fitness.

Materials and Methods

Sequence Analyses

We extracted all fully sequenced viral genomes from the NCBI viral genome database (Bao et al. 2004) in June 2011 and identified all viral proteins annotated in these genomes. All homology searches were carried out against a database of these proteins, using PSI-BLAST (position-specific iterated basic local alignment search tool; Altschul et al. 1997) with an E value cutoff of $10^{-6}$. We performed all multiple sequence alignments using MAFFT (Katoh et al. 2002) and constructed phylogenetic trees with the BIONJ method (Gascuel 1997). We rooted these trees with the mid-point rooting method (Farris 1972). We predicted the domain organization of proteins encoded by overlapping genes using ANNIE (Ooi et al. 2009).

Collection of Viral Overlapping Genes

We identified from the literature a set of 40 overlapping gene pairs for which the expression of a protein product from two reading frames had been experimentally verified. All gene pairs in this data set come from viruses that infect eukaryotes. Among these gene pairs, we selected 29 pairs coming from viruses whose genome encodes an RNA-dependent RNA polymerase (RdRP), to facilitate comparison among clades (see later). We further narrowed the data set to overlapping gene pairs in which we could identify which gene had originated de novo (see procedure described later). In total, we obtained 12 gene pairs that correspond to 12 cases of de novo origin, stemming from 12 families of RNA viruses that met these criteria. The data set shares some genes with a previously published data set (4 cases out of 12: groups 4, 6, 8, and 11 below) (Rancurel et al. 2009). The reason why we could include only a minority of the genes published in the Rancurel data set (4 out of 17) is that we restricted ourselves to considering pairs in which both ancestral and de novo proteins had less than 50% amino acid divergence (percentage of
identity). Table 1 lists, for each gene pair, the species taxonomy, the genome accession number, the names of the overlapping genes, and their lengths. In the rest of the article, we will refer to each case either by its genus or by the number of its clade, as listed in table 1. Table 3 lists bibliographical evidence about the expression, function, and fitness effect of mutations in the de novo gene.

Identifying De Novo and Ancestral Genes

To identify de novo gene candidates, we applied the criterion of monophyly stated in the introduction: one of the genes in an overlapping pair (the ancestral gene) must occur in each member of a viral clade, whereas the other gene (the de novo gene) must be restricted to a single subclade, the focal clade (fig. 1b). To find genes that meet this criterion, we first identified, for each gene pair, homologous protein products in related genomes. (We found no evidence of duplicated genes in the genomes under study, hence all our homologs are orthologs.) We aligned the homologous protein sequences of the ancestral protein (which is more phylogenetically widespread) and constructed their phylogenetic tree. By manual examination of these trees, we identified 12 cases of de novo origin (tables 1 and 3) that met the criterion of monophyly. We scanned the related genomes of the focal clade for unannotated ORFs to overcome missing genes due to faulty annotation. These trees also allowed us to infer the internal node of a tree closest to the origin of the de novo gene. We call this node the last common ancestor (LCA, marked with a blue circle in the hypothetical example of fig. 1b). To ensure that the identification of the LCA is not biased by genome annotation, we manually examined the related genomes for presence of homologous unannotated ORFs. We provide detailed explanations of the challenges in de novo gene and LCA identification in the supplementary information, Supplementary Material online (supplementary figs. S1–S4, Supplementary Material online). The phylogenetic tree and the corresponding genomic maps of the 12 cases are presented in supplementary figure S5, Supplementary Material online. For all gene pairs that met the monophyly criterion, we also determined the DNA sequence alignments corresponding to the amino acid sequence alignments (of the ancestral proteins), to enable the calculations described later.

Estimation of the Relative Age of Origin of the De Novo Genes

To understand the evolutionary dynamics of de novo genes, one needs to estimate their age of origin and to compare this age among different clades. This estimation is made difficult by differences in mutation rate, population dynamics, and selection pressures among different viral genomes and genes (Duffy et al. 2008). To alleviate these difficulties, we calibrated our estimates with a reference molecule, the RdRP protein domain, which is common to all clades in the study (Bruenn 2003). The RdRP domain has a common origin and similar tertiary structure in the clades we study (Bruenn 2003), and thus we assume that as a first approximation it is subject to similar functional and structural constraints. In each genome listed in table 1, we identified the RdRP domain by using HHpred (Soding et al. 2005) against the PFAM database (Finn et al. 2008) with an E-value cutoff of $10^{-10}$. We identified the orthologous RdRP domains within the other genomes of each clade using PSI-BLAST, aligned them, and constructed their phylogenetic trees. Supplementary figure S6, Supplementary Material online, presents these “RdRP trees.”

We defined the focal clade of the RdRP tree as the smallest clade that contains all the taxa found within the focal clade of the phylogenetic tree of the overlapping genes. The LCA for the RdRP tree was defined as earlier. We compared the focal clades in the RdRP tree and the tree of overlapping genes and found that in 9 of 12 cases the focal clades were identical, whereas in three cases (2, 5, and 12, within the genera Cardiovirus, Tymovirus, and Betanodavirus, respectively), we found minor differences (supplementary fig. S6, Supplementary Material online). Overall, this comparison suggested that the RdRP genes and the overlapping gene pairs have similar evolutionary histories. On the basis of the RdRP tree, we thus estimated, as a proxy for the age of a de novo gene, the sequence divergence of the RdRP domain in each focal clade since the origin of the de novo gene, i.e., its accumulated genetic distance $D$ along the tree branches since the LCA. To estimate $D$, we generated 100 bootstrap RdRP trees. For each tree $i$ ($1 \le i \le 100$), we calculated $D_i$, the average length of the phylogenetic tree branches between the LCA and each extant genome that contains the de novo gene. For the example in figure 1b, $D_i$ would be calculated as $[d(\mathrm{LCA}, T_1) + d(\mathrm{LCA}, T_2) + d(\mathrm{LCA}, T_3)]/3$, where $d(\mathrm{LCA}, T_1) = b_1 + b_2$, $d(\mathrm{LCA}, T_2) = b_3 + b_2$, and $d(\mathrm{LCA}, T_3) = b_4$, and $b_1$, $b_2$, $b_3$, and $b_4$ are the branch lengths shown in the figure. Finally, we estimated $D$ as the average over all bootstrap trees, $D = \frac{1}{100} \sum_{i=1}^{100} D_i$. We estimated all branch lengths by the BIONJ method (Gascuel 1997). In supplementary figure S7, Supplementary Material online, we present $D$ and the standard deviation within each group. For convenience, we ordered the clades in table 1 according to increasing $D$.

Analysis of the Evolutionary Properties of Overlapping Genes

For each of our 12 overlapping gene pairs (table 1), we collected the full sequences of the ancestral and the de novo genes, the sequence of the region of the genome where they overlapped, and the sequences of other genes annotated in the genome in which they occur. We also collected this information for homologous overlapping genes in related genomes within the focal clade (see earlier). As all subsequent analyses are carried out on sequences within the focal clade, which is defined by the distribution of the de novo protein rather than the ancestral protein, they are independent of the BLAST cutoff. We used these data to study the following three properties of ancestral and de novo genes:

1) The “relative sequence divergence” between pairs of homologous proteins that the genes encode. We define the sequence divergence between two proteins as the proportion of amino acids in which they differ and the
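The relative-age estimate described under "Estimation of the Relative Age of Origin of the De Novo Genes" can be illustrated with a minimal sketch. It assumes each tree has already been reduced to the list of branch lengths on the path from the LCA to every extant genome carrying the de novo gene. The function names, the dictionary layout, and all numerical branch lengths are hypothetical and chosen only for illustration; only the topology of the first tree follows the figure 1b example (T1 = b1 + b2, T2 = b3 + b2, T3 = b4), and two trees stand in for the 100 bootstrap trees.

```python
from statistics import mean


def tip_distance(path_branch_lengths):
    """d(LCA, T): sum of branch lengths on the path from the LCA to one tip."""
    return sum(path_branch_lengths)


def clade_distance(paths):
    """D_i for one tree: mean LCA-to-tip distance over all tips in the focal clade."""
    return mean([tip_distance(p) for p in paths.values()])


def bootstrap_distance(trees):
    """D: mean of D_i over all bootstrap trees."""
    return mean([clade_distance(t) for t in trees])


# Hypothetical branch lengths arranged as in the figure 1b example.
b1, b2, b3, b4 = 0.10, 0.20, 0.30, 0.60
tree_1 = {"T1": [b1, b2], "T2": [b3, b2], "T3": [b4]}

# A second hypothetical bootstrap replicate (the study used 100 replicates).
tree_2 = {"T1": [0.20, 0.20], "T2": [0.20, 0.20], "T3": [0.40]}

d_1 = clade_distance(tree_1)                 # (0.3 + 0.5 + 0.6) / 3
d_all = bootstrap_distance([tree_1, tree_2])  # mean of the two D_i values
print(f"D_1 = {d_1:.4f}, D = {d_all:.4f}")
```

In practice the path lengths would be read from the bootstrap RdRP trees rather than typed in by hand; the sketch only mirrors the arithmetic of the averaging steps.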