Research and Collaboration with CERN

Yandex started to collaborate with the LHCb experiment at CERN in 2011 on a number of projects in various areas of research, and was an Associate member of CERN openlab in 2013 and 2014.

1. Reproducible Experiment Platform

Some of the problems in industrial information retrieval are very similar to those in particle physics and can be addressed using the same techniques. Our own Reproducible Experiment Platform (REP), developed during our collaboration with LHCb, is a software infrastructure supporting a collaborative ecosystem for computational science, and was successfully applied to a similar research environment in physics. It is a Python-based solution that allows research teams to run computational experiments on shared big datasets, obtaining reproducible and repeatable results and consistent comparisons of those results. REP supports many data formats, including ROOT, and can be easily integrated with existing HEP software and analyses. The key features of REP were developed using case studies, including trigger optimization and physics analysis studies at the LHCb experiment. Implementing a REP prototype in information retrieval research resulted in a performance increase of two orders of magnitude.
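To illustrate the kind of workflow REP standardizes, the sketch below trains two classifiers on one shared, seeded data split and compares them with a common metric. It uses plain scikit-learn as a stand-in rather than REP's actual API; the dataset, features and models are hypothetical.

```python
# Sketch of a REP-style reproducible comparison, with scikit-learn standing
# in for REP's wrappers. Dataset, features and models are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so every team member reproduces the same split
X, y = make_classification(n_samples=5000, n_features=10, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

# Train several models on the same split and report one consistent metric:
# the core of a reproducible, comparable computational experiment.
models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=SEED),
    "GradientBoosting": GradientBoostingClassifier(random_state=SEED),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: ROC AUC = {auc:.3f}")
```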

2. Application of data science methods to data analysis in particle physics

As part of the LHCb experiment at CERN, Yandex equips physicists with tools to help them find better explanations of the mysteries of our universe. Our framework and methodology for training and comparing predictive models, EventFilter, for instance, was shown to improve precision for the measurement of the upper limit for the τ → 3μ decay in the LHCb experiment.

Topological trigger performance optimization

The key b-physics trigger algorithm used in the LHCb experiment is the so-called topological trigger. In LHC Run 1, this trigger, utilizing a custom-boosted decision tree algorithm, selected an almost 100%-pure sample of b-hadrons with a typical efficiency of 60-70%. Its output was used in about 60% of the LHCb publications.

Trigger optimization includes studies comparing the performance of different algorithms (AdaBoost, MatrixNet, etc.), carried out to optimize the topological trigger for LHC Run 2. The topological trigger algorithm is designed to select all “interesting” decays of b-hadrons, but it cannot be trained on every individual decay. To find out how to optimize the performance of the classification algorithm on decays that are not used in training, a number of studies were carried out, which resulted in the development of several optimization techniques, including cascading, ensembling and blending. Novel boosting techniques that help to reduce systematic uncertainties in Run 2 measurements have also been developed and implemented recently. After re-optimization, the topological trigger is expected to improve significantly on its Run 1 performance for a wide range of b-hadron decays.
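As a rough illustration of one of the techniques named above, the sketch below shows blending: base classifiers are trained on one part of the data, and a simple meta-classifier combines their held-out predictions. Everything here is hypothetical; MatrixNet is proprietary, so scikit-learn's GradientBoosting stands in for it.

```python
# Minimal blending sketch: train base models on one split, fit a
# meta-classifier on their held-out predictions, evaluate on a third split.
# Models and data are illustrative stand-ins, not the LHCb trigger setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=12, random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=0)
X_base, X_blend, y_base, y_blend = train_test_split(X_tmp, y_tmp,
                                                    test_size=0.4,
                                                    random_state=0)

base_models = [
    AdaBoostClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),  # stand-in for MatrixNet
]
for model in base_models:
    model.fit(X_base, y_base)

# Held-out base predictions become the meta-features for the blender.
meta_blend = np.column_stack(
    [m.predict_proba(X_blend)[:, 1] for m in base_models])
blender = LogisticRegression().fit(meta_blend, y_blend)

meta_test = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in base_models])
print("blended ROC AUC:",
      roc_auc_score(y_test, blender.predict_proba(meta_test)[:, 1]))
```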

3. Development and application of information search and retrieval technologies for working with large datasets
EventIndex

To deal with the challenges of processing and structuring the very large volumes of data generated in the LHCb experiment, we developed LHCb EventIndex, an event search system designed for organizing LHCb events stored on the Grid. Its primary goal is to enable fast selection of subsets of events that fulfil a combination of high-level conditions, such as which trigger fired, or whether the number of muons found in the event equals a certain value. This system, based on open-source NoSQL solutions, was designed with scalability and extensibility in mind. EventIndex can be extended to support other classes of data gathered in LHCb analysis, which increases retrieval efficiency and reduces the load on data storage systems.

Data popularity estimation for data storage management in LHCb

It is impossible to fit all the petabytes of data produced within the lifetime of the LHCb experiment on disk, a cheap and efficient but limited form of data storage. Most of this data, however, does not need to be instantly available after it has been analysed. To understand which datasets can be archived on tape and which should remain instantly available in disk storage, we have been developing a program called Data Popularity Estimator. Our goal is to make the best use of a dataset's past usage information to predict its future popularity. A detailed comparison of various time series analyses, machine learning classifiers, and clustering and regression algorithms showed that this approach will let us significantly reduce disk occupancy for LHCb data. Using this system saves up to about 2.5 PB of disk space out of the approximately 8.9 PB of LHCb data in total.
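The popularity-estimation idea can be sketched as a standard supervised learning task: predict from a dataset's past access history whether it will be accessed again soon, and archive to tape what the model flags as cold. The feature layout, labels and model below are assumptions for illustration, not LHCb's actual scheme.

```python
# Hypothetical popularity-estimation sketch: predict from past weekly
# access counts whether a dataset will be accessed in the next month.
# Synthetic data and hand-crafted features; not the real LHCb pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_datasets, n_weeks = 2000, 26

# Rows are datasets, columns are weekly access counts.
usage = rng.poisson(lam=rng.gamma(1.0, 2.0, size=(n_datasets, 1)),
                    size=(n_datasets, n_weeks))
past, future = usage[:, :-4], usage[:, -4:]
will_be_used = (future.sum(axis=1) > 0).astype(int)  # label: accessed soon?

# Simple features: total past usage, recency-weighted usage, last week.
weights = np.linspace(0.1, 1.0, past.shape[1])  # recent weeks matter more
features = np.column_stack([past.sum(axis=1), past @ weights, past[:, -1]])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:",
      cross_val_score(clf, features, will_be_used, cv=5).mean())
```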

4. Development and application of information search and retrieval technologies for working with large datasets
SkyGrid

To address the limitations of computational grid infrastructure, we developed a system that integrates cloud technologies into grid systems, called SkyGrid.

SkyGrid was developed in collaboration with the proposed SHiP experiment, a new general-purpose fixed-target facility to search for hidden particles, which currently unites the efforts of 41 research institutions in 14 countries. It allowed researchers to integrate their private cloud resources into a single computational infrastructure. Detailed evaluation of the signal and background information based on Monte Carlo simulation is critical to the development of SHiP's experimental technique and essential for demonstrating the reach and sensitivity of the experiment's proposal.

The extremely rare production and rare decays inherent to the physics of hidden sectors, as well as the need to suppress all Standard Model phenomena mimicking the signal, require large computing resources. In this endeavour, SkyGrid's know-how and technology have already made a major contribution to generating billions of events, which will allow SHiP's researchers to make informed decisions on the signal detector's design and proceed towards discoveries of new physical phenomena.
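SkyGrid's internals are not described here, so the toy sketch below only illustrates the underlying idea: exposing heterogeneous compute resources behind one job-submission interface. The worker pool and the placeholder Monte Carlo task are entirely hypothetical.

```python
# Toy analogue of pooling resources behind a single interface. Real SkyGrid
# federates private clouds; here a local thread pool stands in for remote
# workers, and "event generation" is a placeholder computation.
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def generate_events(job_id: int, n_events: int) -> tuple[int, int]:
    """Placeholder Monte Carlo job: count accepted events."""
    rng = random.Random(job_id)  # per-job seed for reproducibility
    accepted = sum(rng.random() < 0.1 for _ in range(n_events))
    return job_id, accepted

# One submission interface, regardless of where each worker lives.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(generate_events, job, 10_000) for job in range(32)]
    total = 0
    for fut in as_completed(futures):
        job_id, accepted = fut.result()
        total += accepted

print("accepted events across all jobs:", total)
```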

5. Improving data analysis methods to suit specific needs in particle physics

An approach to enhancing the performance of a classification solution known as ‘boosting’ is becoming increasingly popular in particle physics. It involves training a series of classifiers, instead of only one, and then efficiently combining the outputs of the trained classifiers.

We explored several novel boosting methods designed to produce a uniform selection efficiency in a chosen multivariate space. Such algorithms have a wide range of applications in particle physics: for instance, a uniform signal selection efficiency helps to avoid false signal peaks in an invariant mass distribution when searching for new particles. The technique we propose eliminates the trade-off between uniformity and efficiency. It can be used for online filtering and data analysis. It is much faster than its predecessor and is available as an open-source solution.

WLCG Grid Tier-II Node

Yandex provides computational resources that are consumed by LHCb as a Worldwide LHC Computing Grid Tier-II node. This is currently the largest Tier-II site for LHCb. It is used for event simulation, as well as event reconstruction and processing.
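Uniform boosting of this kind is implemented in the open-source hep_ml package. The sketch below, using entirely synthetic data and feature names, shows how a flatness-aware gradient boosting classifier is typically set up, assuming hep_ml's documented API.

```python
# Sketch of uniform boosting with the open-source hep_ml package (assuming
# its documented API); data and feature names are synthetic placeholders.
import numpy as np
import pandas as pd
from hep_ml.gradientboosting import UGradientBoostingClassifier
from hep_ml.losses import BinFlatnessLossFunction

rng = np.random.default_rng(0)
n = 5000
data = pd.DataFrame({
    "mass": rng.uniform(5000, 5500, n),  # variable to keep efficiency flat in
    "pt":   rng.exponential(2.0, n),
    "ip":   rng.exponential(1.0, n),
})
labels = rng.integers(0, 2, n)  # synthetic signal/background labels

# Penalize non-uniform background efficiency along the mass variable,
# so the classifier cannot sculpt a false peak into the mass spectrum.
loss = BinFlatnessLossFunction(uniform_features=["mass"],
                               uniform_label=0, n_bins=10)
clf = UGradientBoostingClassifier(loss=loss, n_estimators=100,
                                  train_features=["pt", "ip"])
clf.fit(data, labels)  # the frame must also contain the uniform feature
print(clf.predict_proba(data)[:5])
```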
