Big Data at the Large Hadron Collider: ATLAS Data Preservation & Access Policy - 15/07/2014 RWL Jones

Big Data at the Large Hadron Collider: ATLAS Data Preservation & Access Policy
Roger Jones

Context
• As a Director of High End Computing at Lancaster University, I would bring a fairly typical institutional view to the discussion
  – ~6000 research users needing to honour Research Council policies
  – Edinburgh, Oxford, UCL etc. are larger examples of the same

Context
• What I bring that is unusual is my responsibility for Data Preservation & Access for the ATLAS experiment at the Large Hadron Collider
  – >3000 authors, 6 continents, 74 countries, >150 institutes
  – Large & divergent attitudes to data preservation & access across collaborators; the preservation & access policy is a result of delicate negotiation
  – Huge data volume under management: 130 PB, ~Google

Constraints
• The resource levels required for meaningful preservation are already large and above existing budgets
• The data are also complex and require a large software, discovery, database & support infrastructure to use meaningfully
• The lead times for the experiment are huge
  – ATLAS started in 1994 after 10 years of prior planning, first took data in 2009, expected lifetime >20 more years
  – The analysts are also the constructors and data-takers
  – Long and ongoing commitment of effort (~100 days/year/person of non-publishable work) for authorship
  – The rewards are largely in terms of exclusive access

Data formats
• The data is in many formats
  – Trigger level data is not written to storage for most collisions: reduce 40,000,000 collisions a second to 1000
  – Raw data is uncalibrated and meaningless for analysis
    • Even collaboration members cannot access it
  – Reconstructed data is more meaningful, but huge in volume, only exists for ~months
  – Analysis format is more compact, but still huge
    • Requires a lot of tacit data to make useful
  – Most groups have even more compressed & specific formats
• Triage what is useful to store & share
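
The trigger reduction quoted on this slide can be checked with a few lines of arithmetic. This is only an illustrative sketch: the two rates are the figures from the slide, while the function and constant names are invented here and are not part of any ATLAS software.

```python
from typing import Tuple

# Rates quoted on the slide, in collisions per second.
COLLISION_RATE_HZ = 40_000_000
RECORDED_RATE_HZ = 1_000


def trigger_reduction(input_rate_hz: int, output_rate_hz: int) -> Tuple[float, float]:
    """Return (rejection factor, fraction of collisions kept)."""
    rejection_factor = input_rate_hz / output_rate_hz
    kept_fraction = output_rate_hz / input_rate_hz
    return rejection_factor, kept_fraction


factor, kept = trigger_reduction(COLLISION_RATE_HZ, RECORDED_RATE_HZ)
print(f"rejection factor: {factor:,.0f}")
print(f"fraction of collisions kept: {kept:.4%}")
```

With these figures only about one collision in 40,000 is written to storage, which is why trigger level data cannot be preserved for most collisions.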

DPHEP levels for preservation
• Need to preserve data, metadata, PB databases, tacit knowledge
• Preservation levels, in order of increasing cost, complexity and benefits:
  – Level 1 (Documentation): provide additional documentation. Use case: publication-related info search
  – Level 2 (Outreach): preserve the data in a simplified format. Use case: outreach, simple training analyses
  – Level 3 (Technical Preservation Projects): preserve the analysis level software and data format. Use case: full scientific analysis, based on the existing reconstruction
  – Level 4 (Technical Preservation Projects): preserve the reconstruction and simulation software as well as the basic level data. Use case: retain the full potential of the experimental data
• Fully committed to external access for levels 1 & 2
• Levels 3 & 4 mainly for internal use, require large amounts of simulation etc.
• ReCast and Rivet allow scientific reuse that partly spans 1-3

Consequences for preservation
• Data preservation is a real challenge
  – Preserving the bits is the easy part
  – Making it useful requires far more
• Strategy: conserve the recipe, not the pizza
  – Store the minimum real data necessary
  – Store the rest as virtual data, reproducible from the preserved real data
  – Build extensive validation and testing systems to ensure all data is still processable and analyzable
• Commitment: ensure all unique data remains 'live' for the duration of the collaboration
  – Will work with follow-on projects to preserve it beyond that date
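
The "recipe, not the pizza" strategy can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ATLAS implementation: the `Recipe` class, the `TRANSFORMS` registry and the `calibrate_v1` step are all hypothetical names. The idea shown is that only the real input data and a small recipe are stored, and the derived ("virtual") data is regenerated on demand and validated against a checksum recorded when it was first produced.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical registry of named, versioned processing steps (the "recipes").
TRANSFORMS: Dict[str, Callable[[List[float]], List[float]]] = {
    "calibrate_v1": lambda values: [round(v * 1.02, 6) for v in values],
}


def digest(values: List[float]) -> str:
    """Checksum of a dataset, used to validate regenerated output."""
    return hashlib.sha256(json.dumps(values).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Recipe:
    """What is stored instead of the derived dataset itself."""
    input_digest: str     # checksum of the preserved real data
    transform_name: str   # key into TRANSFORMS
    expected_digest: str  # checksum recorded when the output was first made


def make_recipe(real_data: List[float], transform_name: str) -> Recipe:
    """Run the transform once and record only checksums, not the output."""
    derived = TRANSFORMS[transform_name](real_data)
    return Recipe(digest(real_data), transform_name, digest(derived))


def regenerate(real_data: List[float], recipe: Recipe) -> List[float]:
    """Rebuild the virtual data and check it still matches the original."""
    if digest(real_data) != recipe.input_digest:
        raise ValueError("preserved input data has changed")
    derived = TRANSFORMS[recipe.transform_name](real_data)
    if digest(derived) != recipe.expected_digest:
        raise ValueError("regenerated data no longer matches the original")
    return derived


real_data = [10.0, 20.5, 31.25]
recipe = make_recipe(real_data, "calibrate_v1")
derived = regenerate(real_data, recipe)
print(len(derived), "values regenerated and validated")
```

The two checksum comparisons stand in for the validation and testing systems mentioned above: running `regenerate` periodically would flag the moment a software or data change makes the preserved data unprocessable.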

Summary of Data Access Policy
• ATLAS is open to sharing data after a fair period of exclusive access
  – The embargo period is years, the time to do typical precise measurements
  – The ATLAS effort will go to useful and responsible release of data and tools to use it
  – At present this means release for education & outreach, and paper output formats such as paper figures, supporting tables and capturing the results of analyses in RIVET and ReCast
    • The latter allows scientifically meaningful reuse of the data
    • New models can be challenged with fully-understood, calibrated & corrected output from existing analyses
• Later releases of bulk data formats are not excluded, but would require new, additional physical resources & effort into tools

Further comment
• Full release of paper associated data in HEPData
  – More detailed
  – tables
  – figures
  – cross-sections (= probabilities for each process to happen)
  – Detailed efficiency corrected, calibrated outputs of analysis

Outreach, education & Beyond
• The release of limited data for education and outreach has been going on for a long time
  – Simplified formats, not suitable for extracting science
  – Four tailored packages with simplified analyses
• Reproducibility
  – Emerging tools from CERN-IT can be useful for this & outreach, and also to help us to capture & preserve our analyses
• Also investigating scope for releasing non-collision data (e.g. detector aging, 'expensive' radiation simulations) that may be of use to others
  – This should be well received by our funding agencies

Preservation & Access in Big Science
• Small advert for MaRDI-Gross, study of data management policy recommendations for big science (2012)
• http://mardigross.jiscinvolve.org/wp/