Feature selection from gene expression data
Feature selection from gene expression data Molecular signatures for breast cancer prognosis and inference of gene regulatory networks. Anne-Claire Haury Centre for Computational Biology Mines ParisTech, INSERM U900, Institut Curie PhD Defense, December 14, 2012 1
Gene expression Figure: Central Dogma of Molecular Biology (Source: Wikipedia) => RNA as a proxy to measure gene expression. 3
Microarrays: measuring gene expression Microarrays: one option. RNA hybridized onto a chip. Quantification of gene activity in different conditions at the genome scale. Resulting data: gene expression data. Figure: expression matrix (genes × samples, from low to high expression). 4
Feature selection: extract relevant information Genome ≈ 25,000 genes From gene expression, explain a phenomenon Feature selection: deciding which variables/genes are relevant. Mathematically: Explain response Y using relevant variables amongst (Xj )j=1...p Y = f (X1 , X2 , X3 , X4 . . . , Xp ) + ε Objective 1: more accurate predictors Objective 2: more interpretable predictors Objective 3: faster algorithms 5
Contributions of this thesis Gene Regulatory Network Inference TIGRESS: new method based on local feature selection. Ranked 3rd/29 at DREAM5 challenge. Linear method, competitive with more complex algorithms Molecular signatures for breast cancer prognosis Select biomarkers to predict metastasis/relapse in breast cancer patients. Complete benchmark of feature selection methods. Investigation of the stability issue. 6
Gene Regulatory Network Inference with TIGRESS Figure: inferred gene regulatory network (node labels omitted). ACH, Mordelet, F., Vera-Licona, P. and Vert, J.-P., TIGRESS: Trustful Inference of Gene REgulation using Stability Selection, BMC Systems Biology, 2012, 6:145. 7
Gene Regulatory Networks Gene regulation: control gene expression. Transcription factors (TF) activate or repress target genes (TG). Gene Regulatory Network (GRN): representation of the regulatory interactions between genes. Figure: E. coli regulatory network 8
DREAM network inference challenge: DREAM5 results (AUPR / AUROC per network, and overall score):
GENIE3 (Huynh-Thu et al., 2010): Net 1: 0.291 / 0.815; Net 3: 0.093 / 0.617; Net 4: 0.021 / 0.518; Overall: 40.28
ANOVerence (Kueffner et al., 2012): Net 1: 0.245 / 0.780; Net 3: 0.119 / 0.671; Net 4: 0.022 / 0.519; Overall: 34.02
Naive TIGRESS: Net 1: 0.301 / 0.782; Net 3: 0.069 / 0.595; Net 4: 0.020 / 0.517; Overall: 31.1 9
Purposes Introduce TIGRESS: Trustful Inference of Gene REgulation using Stability Selection. Assess the impact of the parameters. Test and benchmark TIGRESS on several datasets. 10
Outline 1 Methods Regression-based inference TIGRESS Material 2 Results In silico network results In vitro networks results Undirected case: DREAM4 3 Conclusion 11
Regression-based inference: hypotheses Notations ntf transcription factors (TF), ntg target genes (TG), nexp experiments Expression data: X (nexp × (ntf + ntg )). Xg : expression levels of gene g. XG : expression levels of genes in G. Tg : candidate TFs for gene g. Hypotheses 1 The expression level Xg of a TG g is a function of the expression levels XTg of Tg : Xg = fg (XTg ) + ε. 2 A score sg (t) can be derived from fg , for all t ∈ Tg to assess the probability of the interaction (t, g). 12
Regression-based inference: main steps Idea: consider as many problems as TGs (ntg subproblems) subproblem g ⇔ find regulators TFs(g) of gene g 1 For each TG, score all ntf candidate interactions: TG 1 TG 2 TG 3 ... TG ntg TF 1 - 0.23 0 ... 0.11 TF 2 0.97 - 0.03 ... 0 ... ... ... ... ... ... TF ntf 0 0 0 ... 0.76 2 Rank the scores altogether: TF 12 → TG 17 1 TF 23 → TG 5 0.99 TF 2 → TG 1 0.97 ... ... ... ... 3 Threshold to a value or a given number N of edges. 13
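As a minimal sketch of this per-target decomposition (illustrative Python, not the actual TIGRESS code: score_target stands for whatever per-target scoring procedure is plugged in, e.g. the stability-selection scores described next; all names and defaults are assumptions):

```python
import numpy as np

def infer_network(X, tf_idx, score_target, n_edges):
    """Generic regression-based GRN inference.

    X            : (n_exp, n_genes) expression matrix
    tf_idx       : column indices of the candidate transcription factors
    score_target : function(X_tf, x_g) -> one score per candidate TF
    n_edges      : number of edges kept in the final ranking
    """
    n_genes = X.shape[1]
    edges = []
    for g in range(n_genes):
        # One subproblem per target gene: find the regulators of g
        cand = [t for t in tf_idx if t != g]
        s = score_target(X[:, cand], X[:, g])
        edges.extend((float(si), t, g) for si, t in zip(s, cand))
    # Rank all candidate interactions together and threshold to n_edges
    edges.sort(key=lambda e: e[0], reverse=True)
    return [(t, g, si) for si, t, g in edges[:n_edges]]
```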
Adding a linearity assumption TIGRESS’ first hypothesis: regulations are linear: Xg = fg(XTg) + ε = XTg ω^g + ε^g. Consequence: if ω^g_t = 0, no edge between g and t. 14
Adding a sparsity assumption TIGRESS’ second hypothesis: few TFs regulate each TG g: ∀g, ♯{t : ω^g_t ≠ 0} is small ⇒ use LARS so that only L TFs are selected per TG. 15
Stability Selection Problem: LARS efficiency is limited: bad response to correlation; no confidence score for each TF. Solution: Stability Selection with randomized LARS (Bach, 2008; Meinshausen and Bühlmann, 2009): Resample the experiments: run LARS many (e.g. 1,000) times with different training sets. “Resample” the variables: also reweight the variables, Xit ← Wt Xit (1), where Wt ∼ U([α, 1]) for all t = 1...ntf. The smaller α, the more randomized the variables. Get a frequency of selection for each TF (sketched below). 16
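A sketch of the randomized LARS resampling for one target gene, using scikit-learn's lars_path as the base LARS solver (assumptions: half of the experiments are drawn at each run and the weights are applied by simple column scaling; the exact resampling and implementation details in TIGRESS may differ):

```python
import numpy as np
from sklearn.linear_model import lars_path

def selection_frequencies(X_tf, y, L=5, R=1000, alpha=0.4, seed=None):
    """Frequency with which each candidate TF enters the first L LARS steps.

    X_tf  : (n_exp, n_tf) expression of the candidate TFs for one target gene
    y     : (n_exp,) expression of the target gene
    L     : number of LARS steps
    R     : number of randomized runs
    alpha : randomization level, weights drawn from U([alpha, 1])
    """
    rng = np.random.default_rng(seed)
    n_exp, n_tf = X_tf.shape
    counts = np.zeros(n_tf)
    for _ in range(R):
        # Resample the experiments (here: half of them, without replacement)
        rows = rng.choice(n_exp, size=n_exp // 2, replace=False)
        # Reweight each variable by a random factor in [alpha, 1]
        w = rng.uniform(alpha, 1.0, size=n_tf)
        _, active, _ = lars_path(X_tf[rows] * w, y[rows],
                                 method="lar", max_iter=L)
        counts[active] += 1
    return counts / R
```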
Stability Selection path Figure: frequency of selection of each TF over the subsamplings vs. number L of LARS steps (example for one target gene). 17
Scoring Figure: frequency of selection of each TF over the subsamplings vs. number L of LARS steps. 18
Scoring Choose L, then: Original scoring Area scoring (contribution) 18
Scoring Let Ht be the rank of TF t. Then, score = E[φ(Ht )] with Original: φ(h) = 1 if h ≤ L , 0 otherwise Area: φ(h) = L + 1 − h if h ≤ L , 0 otherwise => Area scoring takes the value of the rank into account. 18
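Expressed as code, the two scoring rules applied to the rank H_t at which TF t enters the LARS path (a sketch; ranks are assumed to start at 1, with any TF not selected within L steps getting a rank larger than L):

```python
def original_score(h, L):
    # 1 if the TF entered the path within the first L steps, 0 otherwise
    return 1.0 if h <= L else 0.0

def area_score(h, L):
    # Earlier entry (smaller rank) gives a larger score; 0 beyond L steps
    return float(L + 1 - h) if h <= L else 0.0

# The final score of TF t is the average of phi(H_t) over the randomized runs,
# i.e. an empirical estimate of E[phi(H_t)].
```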
TIGRESS Summary Idea: consider as many problems as TGs (ntg subproblems) subproblem g ⇔ find regulators TFs(g) of gene g 1 For each TG, score all ntf candidate interactions: TG 1 TG 2 TG 3 ... TG ntg LARS TF 1 - 0.23 0 ... 0.11 + Stab. Selection TF 2 0.97 - 0.03 ... 0 => + Choose L ... ... ... ... ... ... + Score TF ntf 0 0 0 ... 0.76 2 Rank the scores altogether: TF 12 → TG 17 1 TF 23 → TG 5 0.99 TF 2 → TG 1 0.97 ... ... ... ... 3 Threshold to a value or a given number N of edges. 19
Parameters TIGRESS needs four parameters to be set: scoring method (original, area, ...); number of runs R: large; randomization level α: between 0 and 1; number of LARS steps L: not obvious. 20
Data (# TFs, # genes, # chips, # edges per network):
DREAM5 Net 1 (in-silico): 195, 1643, 805, 4012
DREAM5 Net 3 (E. coli): 334, 4511, 805, 2066
DREAM5 Net 4 (S. cerevisiae): 333, 5950, 536, 3940
E. coli Net from Faith et al., 2007: 180, 1525, 907, 3812
DREAM4 Multifactorial Net 1: 100, 100, 100, 176
DREAM4 Multifactorial Net 2: 100, 100, 100, 249
DREAM4 Multifactorial Net 3: 100, 100, 100, 195
DREAM4 Multifactorial Net 4: 100, 100, 100, 211
DREAM4 Multifactorial Net 5: 100, 100, 100, 193 21
Outline 1 Methods Regression-based inference TIGRESS Material 2 Results In silico network results In vitro networks results Undirected case: DREAM4 3 Conclusion 22
Impact of the parameters: results on in silico network Figure: overall score as a function of α and L, for area and original scoring and R = 1,000 / 4,000 / 10,000 runs. Area less sensitive than original to α and L. Area systematically outperforms original. The more runs, the better. Best values: α = 0.4, L = 2, R = 10,000. 23
How to choose L? Figure: distribution of the number of TFs per TG among the first 50,000 and first 100,000 predicted edges, for L = 2 and L = 20. L = 2: smaller number of TFs/TG. L = 20: greater number of TFs/TG, more variable, less sparsity. => L should depend on the expected network’s topology. 24
TIGRESS vs state-of-the-art Figure: ROC (TPR vs. FPR) and precision-recall curves for GENIE3, ANOVerence, CLR, ARACNE, Naive TIGRESS and TIGRESS. TIGRESS is competitive with state-of-the-art. 25
Results on E. coli network Figure: ROC and precision-recall curves for GENIE3, CLR, ARACNE and TIGRESS. The Random Forests-based DREAM winner GENIE3 outperforms all other methods. 26
False discovery analysis on E. coli Main false positive pattern found by TIGRESS (figure: "should find" vs. "does find" patterns). Good news: spuriously inferred edges close to true edges. Bad news: confusion (due to linear model?) 27
Undirected case: DREAM4 challenge DREAM4: undirected networks (TFs not known in advance). A posteriori comparison of Default TIGRESS and GENIE3 (AUPR / AUROC):
Network 1: GENIE3 0.154 / 0.745, TIGRESS 0.165 / 0.769
Network 2: GENIE3 0.155 / 0.733, TIGRESS 0.161 / 0.717
Network 3: GENIE3 0.231 / 0.775, TIGRESS 0.233 / 0.781
Network 4: GENIE3 0.208 / 0.791, TIGRESS 0.228 / 0.791
Network 5: GENIE3 0.197 / 0.798, TIGRESS 0.234 / 0.764
Overall scores: GENIE3: 37.48; TIGRESS: 38.85 28
Outline 1 Methods Regression-based inference TIGRESS Material 2 Results In silico network results In vitro networks results Undirected case: DREAM4 3 Conclusion 29
Conclusion Contributions: Automation and adaptation of Stability Selection to GRN inference. Area scoring: better results and less sensitivity to parameters. Insights on the network’s behavior. TIGRESS is: Linear Competitive Parallelizable Available (http://cbio.ensmp.fr/tigress) But outperformed in some cases by random forests: limits of linearity? Perspectives: Adaptive/changeable value for L. Group selection of TFs. Use of time series/knock-out/replicates information. 30
Molecular signatures for breast cancer prognosis ACH, Gestraud, P. and Vert, J.-P., The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures, 2011, PLoS ONE ACH, Jacob, L. and Vert, J.-P., Improving Stability and Interpretability of Molecular Signatures, 2010, arXiv 1001.3109 31
Motivation Prediction of breast cancer outcome Assist breast cancer prognosis based on gene expression. Avoid adjuvant/preventive chemotherapy when not needed. Gene expression signature Data: primary site tumor expression arrays. Among the genome, find the few (50-100) genes sufficient to predict metastasis/relapse. Main challenge: high-dimensional data (few samples, many variables). 32
Background 2002: Van’t Veer et al. publish 70-gene signature MammaPrint. Since then: at least 47 published signatures (Venet et al., 2011). Little overlap, if any (Fan et al., 2006; Thomassen et al., 2007). Many gene sets are equally predictive (Michiels et al., 2005; Ein-Dor et al., 2005), even within the same dataset. Prediction discordances (Reyal et al., 2008) Stability at the functional level? Not sure. Seeking stability Accuracy is not enough to choose the right genes. => Stability as a confidence indicator. 33
Contributions 1 Systematic comparison of feature selection methods in terms of: predictive performance; stability; biological stability and interpretability. 2 Group selection of genes: with predefined groups (Graph Lasso); with latent groups (k-overlap norm). 3 Evaluation of Ensemble methods 34
Evaluation Accuracy: how well selected genes + classifier predict metastatic events on test data. Stability: how similar two lists of genes are. Interpretability: how much biological sense selected genes make. 35
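One way to make the stability criterion concrete (a sketch, assuming stability is measured as the average relative overlap between pairs of selected gene lists; the thesis may use a different normalization):

```python
from itertools import combinations

def stability(signatures):
    """Average fraction of shared genes over all pairs of signatures.

    signatures : list of gene lists (e.g. the 100 genes selected on each dataset)
    """
    pairs = list(combinations(signatures, 2))
    overlaps = [len(set(a) & set(b)) / min(len(a), len(b)) for a, b in pairs]
    return sum(overlaps) / len(pairs)
```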
Data Four public breast cancer datasets from the same technology (Affymetrix U133A); # genes, # samples, # positives:
GSE1456: 12,065 genes, 159 samples, 40 positives
GSE2034: 12,065 genes, 286 samples, 107 positives
GSE2990: 12,065 genes, 125 samples, 49 positives
GSE4922: 12,065 genes, 249 samples, 89 positives 36
Outline 1 A simple start 2 An attempt at enforcing stability: Ensemble methods 3 Using prior knowledge: Graph Lasso 4 Acknowledging latent team work 5 Conclusion 37
Classical feature selection/feature ranking methods
Filters: univariate, fast; only depend on the data; do not use a loss function. Examples: T-test, KL Divergence, Wilcoxon rank-sum test.
Wrappers: learning machine as a criterion; computationally expensive. Examples: SVM RFE, Greedy FS.
Embedded: learning + selecting; possible use of prior knowledge. Examples: Lasso, Elastic Net.
Each of these algorithms returns a ranked list of genes to be thresholded. 38
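For instance, the t-test filter, the simplest method in this comparison, amounts to ranking genes by a univariate statistic (a sketch using SciPy; the exact test variant and preprocessing used in the benchmark may differ):

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_signature(X, y, n_genes=100):
    """Rank genes by a two-sample t-test between classes and keep the top ones.

    X : (n_samples, n_features) expression matrix
    y : (n_samples,) binary outcome, e.g. metastasis/relapse vs. not
    """
    t, _ = ttest_ind(X[y == 1], X[y == 0], axis=0)
    order = np.argsort(-np.abs(t))   # most discriminative genes first
    return order[:n_genes]
```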
First results 100 genes over four databases. Figure: stability vs. accuracy for Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and Elastic Net. 39
First conclusions Random better than tossing a coin. Elastic Net neither more stable nor more accurate than Lasso. Accuracy/Stability trade-off. T-test: both simplest and best. Next step: can we have better stability without decreasing accuracy? 40
Outline 1 A simple start 2 An attempt at enforcing stability: Ensemble methods 3 Using prior knowledge: Graph Lasso 4 Acknowledging latent team work 5 Conclusion 41
Ensemble Methods Run each algorithm R times on subsamples. Get R ranked lists of genes (r^b)_{b=1...R}. Aggregate and get a score for each gene: S(g) = (1/R) Σ_{b=1}^{R} f(r_g^b), with average: f(r) = (p − r)/p; exponential: f(r) = exp(−αr); stability selection: f(r) = δ(r ≤ k). Sort S by decreasing order and threshold to get the final signature (see the sketch below). 42
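A sketch of the aggregation step (assuming each run returns a full ranking of the p genes, with rank 1 the best; the parameter names alpha and k are illustrative):

```python
import numpy as np

def aggregate(ranks, method="average", alpha=0.01, k=100):
    """Combine R rank vectors into one score S(g) per gene.

    ranks : (R, p) array, ranks[b, g] = rank of gene g in run b (1 = best)
    """
    p = ranks.shape[1]
    if method == "average":
        f = (p - ranks) / p                # linear in the rank
    elif method == "exponential":
        f = np.exp(-alpha * ranks)         # sharply favors top-ranked genes
    elif method == "stability":
        f = (ranks <= k).astype(float)     # indicator of being in the top k
    else:
        raise ValueError(method)
    return f.mean(axis=0)                  # S(g) = (1/R) sum_b f(r_g^b)
```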
Results 100 genes over four databases. Figure: stability vs. accuracy for each feature selection method, single-run vs. ensemble (stability selection) versions. 43
Functional stability 100 genes over four databases. Figure: functional stability vs. accuracy for each method, single-run vs. ensemble versions. 44
Ensemble methods: conclusions Expected improvement in stability not happening. Slight improvement in accuracy in some cases. Loss in functional stability. T-test: still the preferred method. Next step: Can we do better by incorporating prior knowledge? 45
Outline 1 A simple start 2 An attempt at enforcing stability: Ensemble methods 3 Using prior knowledge: Graph Lasso 4 Acknowledging latent team work 5 Conclusion 46
Data Expression data: Van’t Veer et al., 2002; Wang et al., 2005. PPI network with 8141 genes (Chuang et al., 2007) Assumption: genes close on the graph behave similarly Idea: instead of selecting single genes, select edges 47
Selecting groups of genes: ℓ1 methods Lasso: selects single genes (Tibshirani, 1996) Why groups? selecting similar genes: improving stability and interpretability; smoothing out noise by "averaging": improving accuracy. Group Lasso (Yuan & Lin, 2006): implies group sparsity for groups of covariates that form a partition of {1...p}. Overlapping group Lasso (Jacob et al., 2009): selects a union of potentially overlapping groups of covariates (e.g. gene pathways). Graph Lasso: uses groups induced by the graph (e.g. edges). 48
Is group sparsity enough? ℓ1 methods work well, but face serious stability issues when groups are correlated. Solution: randomization through stability selection. 49
Accuracy results Test on data from Wang et al., 2005. Figure: balanced accuracy of Lasso, Lasso + stability selection, Graph Lasso and Graph Lasso + stability selection, for signatures learnt on the Van’t Veer and Wang datasets. Neither prior knowledge nor stability selection brings any improvement! 50
Stability results Figure: number of genes in the overlap vs. number of genes in the signatures, for Lasso, Lasso with stability selection, Graph Lasso and Graph Lasso with stability selection. Graph Lasso slightly improves stability. 51
Interpretability Signature obtained using Lasso: 52
Interpretability Signature obtained using Graph Lasso + Stability Selection: 52
Graph Lasso conclusion Graphical prior seems to increase stability and interpretability. However: no change in accuracy. Next step: Grouping increases stability. Now on to accuracy! 53
Outline 1 A simple start 2 An attempt at enforcing stability: Ensemble methods 3 Using prior knowledge: Graph Lasso 4 Acknowledging latent team work 5 Conclusion 54
Latent grouping Grouping genes makes sense. Let the data tell which genes to select together. The k-support norm Introduced by Argyriou et al., 2012. A trade-off between ℓ1 and ℓ2. Equivalent to overlapping group Lasso (Jacob et al., 2009) with all possible groups of size k. Results in selecting groups that are not predefined. 55
Extreme randomization Following Breiman’s random forests: sample both the examples and the covariates. Fewer variables = less correlation. Give each gene a chance to be selected. Extreme randomization, for each of the R runs: bootstrap the samples (classical Ensemble method); sample the covariates: randomly choose 10% of them; run the FS procedure on the restricted data. => Compute the frequency of selection, P(g selected | g preselected), as sketched below. 56
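A sketch of this extreme randomization loop (the select argument stands for any base feature selection procedure; the 10% feature fraction follows the slide, while the bootstrap size and other details are assumptions):

```python
import numpy as np

def extreme_randomization(X, y, select, R=1000, frac_features=0.1, seed=None):
    """Estimate P(g selected | g preselected) under sample and feature resampling.

    X      : (n, p) expression matrix
    y      : (n,) outcome
    select : function(X_sub, y_sub) -> indices selected among the columns of X_sub
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = np.zeros(p)
    preselected = np.zeros(p)
    for _ in range(R):
        rows = rng.choice(n, size=n, replace=True)                 # bootstrap the samples
        cols = rng.choice(p, size=int(frac_features * p), replace=False)
        preselected[cols] += 1
        chosen = select(X[np.ix_(rows, cols)], y[rows])            # run the FS procedure
        selected[cols[chosen]] += 1
    return selected / np.maximum(preselected, 1)                   # conditional frequency
```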
Accuracy or stability? Figure: stability vs. accuracy for Random, T-test, Lasso, Elastic Net and k-support norm (k = 2, 10, 20), single-run vs. Extreme Randomization + stability selection. 57
Stability or redundancy? Figure: stability vs. correlation of the selected genes for the same methods. 58
Conclusions Extreme Randomization improves accuracy. Grouping improves stability. But: both effects do not add up that well (redundancy). T-test still the best trade-off? 59
Outline 1 A simple start 2 An attempt at enforcing stability: Ensemble methods 3 Using prior knowledge: Graph Lasso 4 Acknowledging latent team work 5 Conclusion 60
Signature selection: conclusion Contributions: Step-by-step study of FS methods behavior on several breast cancer datasets. Systematic analysis of accuracy, stability, interpretability. Insights on the accuracy/stability trade-off. What have we learned? Best methods: simple t-test or complex black box Grouping improves gene and functional stability. Randomization improves accuracy (sometimes) but has unwanted effects on stability. Accuracy/Stability Trade-off: Stability => redundancy => lower accuracy. 61
Signature selection: perspectives One unique signature? single breast cancer subtype many samples larger signature Is expression data sufficient? probably not all information is there clinical data: same accuracy (same information?) possibly look at genotype, methylation, clinical and expression Is stability important? not as important as accuracy + prediction concordance possibly not even achievable 62
Conclusion 63
Conclusion Gene expression data: High-dimensional, noisy Possibly contains important information Feature selection: Find the needle in the haystack. Output relevant genes to be studied further. Main issues: Results are not necessarily transferable across datasets. Models rely on hypotheses! Fixing: Testing on many databases. Keeping model hypotheses in mind / not being afraid of black boxes. 64
Acknowledgements Fantine Mordelet, Paola Vera-Licona, Pierre Gestraud, Laurent Jacob 65
The k-support norm It can be shown that:
$$\Omega^{sp}_k(\omega) = \left( \sum_{i=1}^{k-r-1} \big(|\omega|^{\downarrow}_i\big)^2 + \frac{1}{r+1} \Big( \sum_{i=k-r}^{d} |\omega|^{\downarrow}_i \Big)^2 \right)^{1/2}$$
where r is the only integer in {0, . . . , k − 1} satisfying
$$|\omega|^{\downarrow}_{k-r-1} > \frac{1}{r+1} \sum_{i=k-r}^{d} |\omega|^{\downarrow}_i \ge |\omega|^{\downarrow}_{k-r}$$
and |ω|↓_i is the i-th largest value of |ω| (with the convention |ω|↓_0 = +∞). 66
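A small sketch that evaluates this formula directly by searching over r; as a sanity check, it reduces to the ℓ1 norm for k = 1 and to the ℓ2 norm for k = d:

```python
import numpy as np

def k_support_norm(w, k):
    """Evaluate the k-support norm of w by searching for the integer r."""
    u = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]   # |w| sorted decreasingly
    d = len(u)
    assert 1 <= k <= d
    for r in range(k):                                       # r in {0, ..., k-1}
        tail = u[k - r - 1:].sum()                           # sum_{i=k-r}^{d} |w|_i (1-based)
        left = np.inf if k - r - 1 == 0 else u[k - r - 2]    # |w|_{k-r-1}, with |w|_0 = +inf
        if left > tail / (r + 1) >= u[k - r - 1]:
            head = (u[:k - r - 1] ** 2).sum()                # sum of the k-r-1 largest squares
            return float(np.sqrt(head + tail ** 2 / (r + 1)))
    raise RuntimeError("no valid r found")
```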
Link with overlapping Group Lasso The k-support norm is equivalent to the overlapping group Lasso norm
$$\Omega^{sp}_k(\omega) = \min_{v \in \mathbb{R}^{p \times \mathcal{G}_k}} \left\{ \sum_{I \in \mathcal{G}_k} \|v_I\|_2 \;:\; \mathrm{supp}(v_I) \subseteq I, \; \sum_{I \in \mathcal{G}_k} v_I = \omega \right\}$$
where G_k denotes all subsets of {1, . . . , d} of cardinality k. Remark 1: it selects at least k variables. Remark 2: the first selected group consists of the k variables most correlated with the response. 67
ADMM - applied to k-support problem Our problem:
$$\min_{\omega,\beta} \; \hat R_l(\omega) + \frac{\lambda}{2}\,\Omega^{sp}_k(\beta)^2 \quad \text{s.t.} \quad \omega - \beta = 0$$
Augmented Lagrangian:
$$L_\rho(\omega,\beta,\mu) = \hat R_l(\omega) + \frac{\lambda}{2}\,\Omega^{sp}_k(\beta)^2 + \mu'(\omega-\beta) + \frac{\rho}{2}\,\|\omega-\beta\|^2$$
Algorithm: 1. Initialize β(1), ω(1), μ(1). 2. For t = 1, 2, ...:
$$\omega^{(t+1)} = \arg\min_\omega \Big\{ \hat R_l(\omega) + \mu^{(t)T}\omega + \frac{\rho}{2}\,\|\omega - \beta^{(t)}\|^2 \Big\}$$
$$\beta^{(t+1)} = \mathrm{prox}_{\frac{\lambda}{2\rho}\,\Omega^{sp}_k(\cdot)^2}\Big(\omega^{(t+1)} + \frac{\mu^{(t)}}{\rho}\Big)$$
$$\mu^{(t+1)} = \mu^{(t)} + \rho\,\big(\omega^{(t+1)} - \beta^{(t+1)}\big)$$
68
ADMM - optimality conditions Three first-order conditions: Primal condition: ω* − β* = 0. Dual condition 1: ∇R̂_l(ω*) + μ* = 0. Dual condition 2: 0 ∈ (λ/2) ∂Ω^sp_k(β*)² − μ*. Resulting in a definition of the residuals at step t + 1: Primal residual: r^(t+1) = ω^(t+1) − β^(t+1). Dual residual: s^(t+1) = ρ(β^(t+1) − β^(t)). As the algorithm converges, the norms of the residuals tend to zero. 69
ADMM - choice of parameter Parameter ρ is critical in ADMM: it controls how much the variables change. It can be seen as a step size. How to choose it? Adaptive ADMM: one solution is to let it adapt to the problem:
$$\rho^{(t+1)} = \begin{cases} (1+\tau)\,\rho^{(t)} & \text{if } \|r^{(t+1)}\|_2 > \eta\,\|s^{(t+1)}\|_2 \text{ and } t \le t_{\max} \\ \rho^{(t)}/(1+\tau) & \text{if } \eta\,\|r^{(t+1)}\|_2 < \|s^{(t+1)}\|_2 \text{ and } t \le t_{\max} \\ \rho^{(t)} & \text{otherwise} \end{cases}$$
In practice, we use τ = 1, η = 10 and t_max = 100. => Adaptive ADMM forces the primal and dual residuals to be of a similar amplitude. 70
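The adaptive rule as a small helper (a sketch with the slide's τ, η and t_max as defaults; the residual norms are assumed to be computed elsewhere in the ADMM loop):

```python
def update_rho(rho, primal_res, dual_res, t, tau=1.0, eta=10.0, t_max=100):
    """Rescale the ADMM penalty so that primal and dual residuals stay comparable."""
    if t > t_max:
        return rho                     # freeze rho after t_max iterations
    if primal_res > eta * dual_res:
        return (1 + tau) * rho         # primal residual too large: increase rho
    if eta * primal_res < dual_res:
        return rho / (1 + tau)         # dual residual too large: decrease rho
    return rho
```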
Comparison Figure: RDG vs. time (seconds) for k = 1, 5, 10 and 100, comparing ADMM (adaptive), ADMM (ρ = 1, 10, 100) and FISTA. 71
Accuracy vs size of the signature Figure: AUC vs. signature size (0-100 genes) for single-run, ensemble-mean, ensemble-exponential and ensemble-stability selection versions of Random, T-test, Entropy, Bhattacharyya, Wilcoxon, SVM RFE, GFS, Lasso and Elastic Net. 72
Stability vs size of the signature Figure: stability vs. signature size (0-100 genes) for single-run, ensemble-average, ensemble-exponential and ensemble-stability selection versions of the same methods. 73