Research and Collaboration with CERN

Yandex started to collaborate with the LHCb experiment at CERN in 2011 on a number of projects in various areas of research, and was part of CERN openlab as an Associate member in 2013 and 2014.
Reproducible Experiment Platform

Some of the problems in industrial information retrieval are very similar to those in particle physics and can be addressed using the same techniques. Our Reproducible Experiment Platform (REP), developed during our collaboration with LHCb, is a software infrastructure supporting a collaborative ecosystem for computational science, and it was successfully applied to a similar research environment in physics. It is a Python-based solution that allows research teams to run computational experiments on shared big datasets, obtaining reproducible and repeatable results and consistent comparisons of those results. REP supports many data formats, including ROOT, and can be easily integrated with existing HEP software and analyses. The key features of REP were developed using case studies, including trigger optimization and physics analysis studies at the LHCb experiment. Implementing a REP prototype in information retrieval research resulted in a performance increase of two orders of magnitude.
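The workflow REP standardizes can be illustrated with plain scikit-learn. The sketch below shows only the pattern (fixed seeds, one shared dataset, side-by-side metrics); it does not use the actual REP API, and the dataset and model choices are placeholders.

```python
# Minimal sketch of a REP-style reproducible model comparison.
# Uses plain scikit-learn as a stand-in; REP itself wraps such
# estimators behind a uniform interface and adds reporting tools.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed: identical results on every run

# Shared dataset (placeholder for a common big dataset, e.g. ROOT files)
X, y = make_classification(n_samples=10_000, n_features=20, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)

# Candidate models trained under identical conditions
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=SEED),
    "gbdt": GradientBoostingClassifier(random_state=SEED),
}

# Consistent comparison: same split, same metric for every model
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:>8}: ROC AUC = {auc:.4f}")
```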
Application of data science methods to data analysis in particle physics

As part of the LHCb experiment at CERN, Yandex equips physicists with tools to help them find better explanations of the mysteries of our universe. Our framework and methodology for training and comparing predictive models, EventFilter, for instance, was shown to improve precision for the measurement of the upper limit on the τ → 3μ decay in the LHCb experiment.

Topological trigger performance optimization

The key b-physics trigger algorithm used in the LHCb experiment is the so-called topological trigger. In LHC Run 1, this trigger, utilizing a custom boosted decision tree algorithm, selected an almost 100%-pure sample of b-hadrons with a typical efficiency of 60-70%. Its output was used in about 60% of the LHCb publications.

Trigger optimization includes studies comparing the performance of different algorithms (AdaBoost, MatrixNet, etc.), carried out to optimize the topological trigger for LHC Run 2. The topological trigger algorithm is designed to select all "interesting" decays of b-hadrons, but it cannot be trained on every individual decay. To find out how to optimize the performance of the classification algorithm on decays that are not used in training, a number of studies have been done, resulting in the development of several optimization techniques, including cascading, ensembling and blending. Some novel boosting techniques that help reduce systematic uncertainties in Run 2 measurements have been developed and implemented recently. After re-optimization, the topological trigger is expected to significantly improve on its Run 1 performance for a wide range of b-hadron decays.
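The generalization question above, namely how a classifier trained on some decay modes performs on a mode it never saw, can be probed with a simple experiment. The sketch below uses synthetic data and a stock gradient-boosted classifier as a stand-in for the production trigger; every name and number in it is a placeholder.

```python
# Sketch: probe how a BDT-style trigger classifier generalizes to
# a decay mode excluded from training (synthetic stand-in data).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_mode(shift, n=2000):
    """Toy kinematic features for one b-hadron decay mode."""
    sig = rng.normal(loc=1.0 + shift, scale=1.0, size=(n, 5))
    bkg = rng.normal(loc=0.0, scale=1.0, size=(n, 5))
    X = np.vstack([sig, bkg])
    y = np.r_[np.ones(n), np.zeros(n)]
    return X, y

# Train on two decay modes, hold out a third, unseen mode
X_a, y_a = make_mode(shift=0.0)
X_b, y_b = make_mode(shift=0.3)
X_unseen, y_unseen = make_mode(shift=0.6)

clf = GradientBoostingClassifier(n_estimators=200, random_state=0)
clf.fit(np.vstack([X_a, X_b]), np.r_[y_a, y_b])

# Performance on the unseen mode indicates inclusive trigger behaviour
auc = roc_auc_score(y_unseen, clf.predict_proba(X_unseen)[:, 1])
print(f"ROC AUC on unseen decay mode: {auc:.3f}")
```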
Development and application of information search and retrieval technologies for working with large datasets

EventIndex

To deal with the challenges of processing and structuring the very large volumes of data generated in the LHCb experiment, we developed LHCb EventIndex, an event search system designed for organizing LHCb events stored on the Grid. Its primary goal is to enable fast selection of subsets of events that fulfil a combination of high-level conditions, such as which trigger has fired or whether the number of muons found in the event equals a certain value. This system, based on open-source NoSQL solutions, was designed with scalability and extensibility in mind. EventIndex can be extended to support other classes of data gathered in LHCb analysis, which increases retrieval efficiency and reduces the load on data storage systems.

Data popularity estimation for data storage management in LHCb

It is impossible to fit all the petabytes of data produced within the lifetime of the LHCb experiment on disk, a cheap and efficient but limited form of data storage. Most of this data, however, doesn't need to be instantly available after it has been analysed. To understand which datasets can be archived on tape and which should remain instantly available in a disk storage system, we have been developing a program called Data Popularity Estimator. Our goal is to make optimal use of a dataset's past usage information to predict its future popularity. A detailed comparison of various time series analyses, machine learning classifiers, and clustering and regression algorithms that we carried out showed that this approach can significantly reduce disk occupancy for the LHCb data: using this system saves up to about 2.5 PB of disk space out of a total of approximately 8.9 PB of LHCb data.
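A minimal sketch of the popularity-estimation idea: summarize each dataset's past access history into features, train a classifier to predict whether the dataset will be accessed again, and flag cold datasets for tape. The features, model and threshold below are illustrative assumptions, not the actual Data Popularity Estimator.

```python
# Sketch: predict future dataset popularity from past access counts
# and flag cold datasets for tape archival (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

n_datasets, n_weeks = 5000, 26
# Past weekly access counts per dataset (synthetic stand-in)
history = rng.poisson(lam=rng.gamma(1.0, 2.0, size=(n_datasets, 1)),
                      size=(n_datasets, n_weeks))
# Label: was the dataset accessed at all in the following quarter?
future_accessed = (rng.poisson(history.mean(axis=1) * 13) > 0).astype(int)

# Simple features summarizing each access time series
features = np.column_stack([
    history.sum(axis=1),          # total accesses
    history[:, -4:].sum(axis=1),  # recent accesses (last 4 weeks)
    (history > 0).sum(axis=1),    # number of active weeks
])

clf = RandomForestClassifier(n_estimators=200, random_state=1)
clf.fit(features, future_accessed)

# Archive datasets whose predicted access probability is very low
p_access = clf.predict_proba(features)[:, 1]
to_tape = p_access < 0.05  # threshold is a tunable assumption
print(f"Candidates for tape archival: {to_tape.sum()} of {n_datasets}")
```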
SkyGrid

To address the limitations of computational grid infrastructure, we developed a system that integrates cloud technologies into grid systems, called SkyGrid.

SkyGrid was developed in collaboration with the proposed SHiP experiment, a new general-purpose fixed-target facility to search for hidden particles, which currently unites the efforts of 41 research institutions in 14 countries. It allows researchers to integrate their private cloud resources into a single computational infrastructure.

Detailed evaluation of the signal and background based on Monte Carlo simulation is critical to the development of SHiP's experimental technique and essential for demonstrating the reach and sensitivity of the experiment's proposal. The extremely rare production and decay processes inherent to the physics of hidden sectors, as well as the need to suppress all Standard Model phenomena mimicking the signal, require large computing resources. In this endeavour, SkyGrid's know-how and technology have already made a major contribution to generating billions of events, which will allow SHiP's researchers to make informed decisions on the signal detector's design and proceed towards discoveries of new physical phenomena.
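The core architectural idea, presenting heterogeneous private clouds as a single pool of simulation capacity, can be sketched as one work queue served by per-cloud workers. Everything below (the class name, the task strings) is a hypothetical illustration, not SkyGrid's actual interface.

```python
# Sketch: a single queue of Monte Carlo tasks served by workers that
# stand in for heterogeneous private clouds (hypothetical interface).
import queue
import threading

class CloudWorker(threading.Thread):
    """Pulls simulation tasks from the shared queue; one per cloud."""

    def __init__(self, name: str, tasks: queue.Queue, results: list):
        super().__init__(daemon=True)
        self.cloud_name = name
        self.tasks = tasks
        self.results = results

    def run(self):
        while True:
            try:
                task = self.tasks.get(timeout=1)
            except queue.Empty:
                return  # queue drained, worker exits
            # Placeholder for launching an MC job on this cloud
            self.results.append((self.cloud_name, task, "done"))
            self.tasks.task_done()

tasks: queue.Queue = queue.Queue()
for batch in range(12):
    tasks.put(f"generate_events(batch={batch})")

results: list = []
workers = [CloudWorker(c, tasks, results)
           for c in ("cloud-a", "cloud-b", "cloud-c")]
for w in workers:
    w.start()
tasks.join()  # wait until every batch is simulated
print(f"{len(results)} batches completed across {len(workers)} clouds")
```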
Improving data analysis methods to suit specific needs in particle physics

'Boosting', an approach to enhancing the performance of a classification solution, is becoming more and more popular in particle physics. This approach involves training a series of classifiers, instead of training only one, and then efficiently combining the outputs of the trained classifiers.

We explored several novel boosting methods designed to produce a uniform selection efficiency in a chosen multivariate space. Such algorithms have a wide range of applications in particle physics; a uniform signal selection efficiency, for example, helps to avoid false signal peaks in an invariant mass distribution when searching for new particles. The technique we propose eliminates the trade-off between uniformity and efficiency. It can be used for online filtering and data analysis, is much faster than its predecessor, and is available as an open-source solution.

WLCG Tier-II Node

Yandex provides computational resources that are consumed by LHCb as a Worldwide LHC Computing Grid (WLCG) Tier-II node. This is currently the largest Tier-II site for LHCb. It is used for event simulation, as well as for event reconstruction and processing.
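To give a flavour of how such a uniform-efficiency booster is used in practice, the sketch below assumes the open-source hep_ml package, which implements gradient boosting with a flatness loss in this spirit; the toy data and parameter choices are illustrative, not a prescribed configuration.

```python
# Sketch: gradient boosting with a flatness loss, which penalizes
# non-uniform selection efficiency along the invariant mass.
# Assumes the open-source hep_ml package (pip install hep_ml).
import numpy as np
import pandas as pd
from hep_ml.gradientboosting import UGradientBoostingClassifier
from hep_ml.losses import BinFlatnessLossFunction

rng = np.random.default_rng(2)
n = 5000

# Toy data: two discriminating features plus the invariant mass,
# in which the selection efficiency should stay flat.
data = pd.DataFrame({
    "feat1": np.r_[rng.normal(1, 1, n), rng.normal(0, 1, n)],
    "feat2": np.r_[rng.normal(1, 1, n), rng.normal(0, 1, n)],
    "mass": rng.uniform(5.0, 5.6, 2 * n),
})
labels = np.r_[np.ones(n), np.zeros(n)]

# Penalize signal-efficiency variation across bins of mass
loss = BinFlatnessLossFunction(uniform_features=["mass"],
                               uniform_label=1, n_bins=10)
clf = UGradientBoostingClassifier(loss=loss,
                                  train_features=["feat1", "feat2"],
                                  n_estimators=100)
clf.fit(data, labels)  # 'mass' is used only by the loss, not as input
pred = clf.predict_proba(data)[:, 1]
print(f"mean signal score: {pred[labels == 1].mean():.3f}")
```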