Columnar data analysis with uproot and awkward array

Nikolai Hartmann, LMU Munich
February 19, 2021, Munich ATLAS Belle II Computing meeting
Columnar data analysis - Motivation

Operate on columns - "array-at-a-time" instead of "event-at-a-time"

Advantages:
• Operations can be predefined - no for loops! Most prominent example: numpy (see the sketch after this list)
  → move the slow organisational work out of the event loop
  → write analysis code in python instead of C++
• These operations run on contiguous blocks of memory and are therefore fast (vectorizable, good for the CPU cache)
• Lots of advances in tooling in recent years, since this kind of workflow is essential for data science / machine learning

Disadvantages:
• Arrays need to be loaded into memory → need to process chunk-wise if the amount of data is too large
• Some operations are harder to think about (e.g. combinatorics, nested selections, variable-length lists per event)
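As a toy illustration of "array-at-a-time" versus "event-at-a-time" (the pt values here are made up), the same selection written both ways with numpy:

    import numpy as np

    # event-at-a-time: explicit python loop over made-up pt values (slow)
    pts_list = [45.2, 13.1, 88.0, 22.5]
    selected_loop = [pt for pt in pts_list if pt > 20.0]

    # array-at-a-time: one vectorized operation on the whole column
    # (fast, runs on a contiguous block of memory)
    pts = np.array(pts_list)
    selected = pts[pts > 20.0]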
ATLAS analysis model for Run 3 (+x)
Read DAOD PHYSLITE with uproot

DAOD PHYSLITE has most data split into columns, but:
• Some branches have a higher level of nesting (vector<vector<...>>)
• Those can't be split by ROOT
• Also need to loop through the data to "columnize" → slow in python
  → I have a hack based on numba for now
  → there is now a Forth machine in awkward that will handle this in the future
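For the already-split flat aux branches, reading with uproot is straightforward. A minimal sketch (the file name is hypothetical; xAOD files store event data in a tree called CollectionTree):

    import uproot

    # hypothetical file name; event data lives in "CollectionTree"
    with uproot.open("DAOD_PHYSLITE.example.root") as f:
        tree = f["CollectionTree"]
        # a flat "split" aux branch reads directly into a jagged awkward array
        el_pt = tree["AnalysisElectronsAuxDyn.pt"].array()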
Read DAOD PHYSLITE with uproot

[Figure: loading time for 10000 events in seconds (log scale) for nested vector branches (Jets, Electrons, MET), comparing uproot default, custom deserialization, only decompression, and ROOT TTree::Draw]
Plots from Jim Pivarski: https://github.com/scikit-hep/awkward-1.0/pull/661
(We will probably hear more from him about this topic at vCHEP21)
Intermezzo - why ROOT files have a high compression ratio

Example: data of one basket of the AnalysisElectronsAuxDyn.pt branch.
"Garbage": the header (telling us "this is a vector") and the number of bytes following (redundant)
Intermezzo - why ROOT files have a high compression ratio (cont'd)

Even more true for more deeply nested, structured data.
Example: AnalysisElectronsAuxDyn.trackParticleLinks (a vector<ElementLink<...>>, where ElementLink has 2 members: m_persKey, m_persIndex)
Alternative storage formats

Loading times for all columns of 10k DAOD PHYSLITE events:

Format                 Compression   Dedup. offsets   Size on disk   Execution time
ROOT                   zlib          No               117 MB         6.0 s
ROOT (large baskets)   zlib          No               116 MB         5.0 s
Parquet                snappy        No               121 MB         0.6 s
Parquet                snappy        Yes              118 MB         0.6 s
HDF5                   gzip          No               101 MB         2.0 s
HDF5                   gzip          Yes              89 MB          1.6 s
HDF5                   lzf           No               137 MB         1.5 s
HDF5                   lzf           Yes              113 MB         1.1 s
npz                    zip           No               92 MB          2.0 s
npz                    zip           Yes              82 MB          1.5 s

Parquet seems especially promising, but everything is faster than ROOT
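For reference, writing and reading awkward arrays as Parquet is built into the library. A minimal sketch with a toy events array (not the benchmark setup above):

    import awkward as ak

    # toy stand-in for the PHYSLITE events array
    events = ak.Array([{"Electrons": [{"pt": 45.2}]}, {"Electrons": []}])

    ak.to_parquet(events, "events.parquet")   # pyarrow's default codec is snappy
    events2 = ak.from_parquet("events.parquet")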
Event data models and awkward array

Awkward array has everything we need to represent what we are doing in a columnar fashion:
• Nested records, e.g. Events -> [Electrons -> pt, eta, phi, ..., Jets -> pt, eta, phi, ...]
• Behavior / dynamic quantities, e.g. LorentzVector - can add vectors, calculate invariant masses, etc.
• Cross references via indices, e.g. Events.Electrons.trackParticle references an electron's track particle via indices
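To illustrate how behavior attaches dynamic quantities to named records, here is a minimal sketch (the record name "Particle" and the abs_eta property are made up for illustration):

    import numpy as np
    import awkward as ak

    behavior = {}

    # record-level behavior: applies when accessing a single particle
    class ParticleRecord(ak.Record):
        @property
        def abs_eta(self):
            return np.abs(self.eta)

    # array-level behavior: applies to whole (possibly jagged) arrays of particles
    class ParticleArray(ak.Array):
        @property
        def abs_eta(self):
            return np.abs(self.eta)

    behavior["Particle"] = ParticleRecord
    behavior["*", "Particle"] = ParticleArray

    particles = ak.Array(
        [[{"pt": 45.2, "eta": -1.3}], []],
        with_name="Particle",
        behavior=behavior,
    )
    particles.abs_eta  # dynamic quantity, computed from the stored columns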
Prototype for DAOD PHYSLITE → git

Can already do this:

>>> import awkward as ak
>>> events[ak.num(events.Electrons) >= 1].Electrons.pt[:, 0]
→ filtering on different levels

>>> events.Electrons.trackParticles.z0
→ dynamically create cross references from indices
>>> events.Electrons.trackParticles
>>> events.Electrons.trackParticles.pt
→ dynamically calculate momenta from track parameters

>>> electrons = Events.electrons
>>> jets = Events.jets
>>> electrons.delta_r(electrons.nearest(jets)) < 0.2
→ more advanced LorentzVector calculations
Technical aspects of this

• Class names are attached as metadata to the arrays → separation of data schema and behavior
• To do dynamic cross references, we need a reference to the top-level object (events.Electrons needs to know about events)
• Also want to load columns lazily → very useful for interactive work → using awkward's VirtualArray (see the sketch after this list)
• Cross references also have to work after slicing/indexing the array → need "global" indices

All of these exist in coffea's NanoEvents module → in contact with the developers, working on an implementation for DAOD PHYSLITE
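A minimal sketch of the lazy-loading part with uproot (the file name is hypothetical): uproot.lazy wraps each branch in awkward VirtualArrays, so a column is only read and decompressed when it is first accessed.

    import uproot

    # hypothetical file; each branch becomes an awkward VirtualArray
    events = uproot.lazy("DAOD_PHYSLITE.example.root:CollectionTree")

    # only now is this one branch actually read and decompressed
    pt = events["AnalysisElectronsAuxDyn.pt"]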
Trying to do an actual analysis

[Figure: object counts for Electrons, Jets and Muons at the all / baseline / passOR / signal selection stages, comparing athena/SUSYTools with the columnar analysis]

• Start with some simple object selections on Electrons, Muons, Jets
  → most challenging part: getting all the overlap removal logic correct
• Compare with the SUSYTools framework (athena analysis) → working to some extent
• Many things still missing/unclear → e.g. MET calculation, pileup reweighting, systematics
Performance

Measurement         Total time [s]   Average no. events / s
Athena/SUSYTools    22               2300
Columnar            3.8              13000
Columnar (cached)   1.2              42000

• In all cases: read with a "warm" page cache
• "Cached" for the columnar analysis means the data is already decompressed and deserialized
Scaling tests

• We now have a sample of PHYSLITE ATLAS Run 2 data: 100 TB, 260k files, 1.8e10 events
• Started testing with a 10% subset on LRZ (a chunk-wise reading sketch follows below)
• Want to run this using dask
• Found several issues with (py)xrootd and uproot's xrootd handling along the way
  → mostly fixed now, but still some problems with memory leaks
• Want to test this analysis within the ATLAS Google cloud project
  → working together with Lukas Heinrich and Ricardo Rocha, who did this demo at KubeCon 2019
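A sketch of what the chunk-wise processing looks like with uproot (the paths and the branch filter are hypothetical):

    import uproot

    # iterate in chunks of roughly 100 MB so the full sample never has
    # to fit in memory at once; paths and branch filter are hypothetical
    for chunk in uproot.iterate(
        "root://some.server//path/DAOD_PHYSLITE.*.root:CollectionTree",
        filter_name="AnalysisElectronsAuxDyn.*",
        step_size="100 MB",
    ):
        pass  # run the columnar selections on this chunk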
Backup
An impressive demo

Lukas Heinrich and Ricardo Rocha at KubeCon 2019 → youtube recording, chep talk
Re-performed the Higgs discovery analysis on 70 TB of CMS open data in a live demo
ROOT file storage (graphics from a tutorial by Jim Pivarski)
What does that basket data actually look like? ... and how does uproot read it?

For simple n-tuples, it is actually just the numbers!
• Stored in big-endian format (most significant byte first; most modern processors use little-endian)
• After decompressing, the basket can simply be loaded into a numpy array
• Example for a float (single precision) branch:
  np.frombuffer(basket_data, dtype=">f4")
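A small follow-up on the byte order (toy bytes, not real basket data): converting to the native layout once avoids big-endian overhead in later computations.

    import numpy as np

    # toy example: three float32 values serialized big-endian, as in a basket
    basket_data = np.array([1.5, 2.5, 3.5], dtype=">f4").tobytes()

    values = np.frombuffer(basket_data, dtype=">f4")
    # convert to the native (little-endian on most machines) layout
    native = values.astype(np.float32)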
More complicated for vector branches

Example for vector<float>:

  10-byte header (vector)  | float        | float         | float         | float         | float
  64 0 0 26 0 9 0 0 0 5    | 72 8 207 244 | 71 187 94 243 | 71 144 28 162 | 70 114 142 37 | 70 134 68 95
                           | 140095.81    | 95933.9       | 73785.266     | 15523.536     | 17186.186

• Each event consists of a vector header (telling us how many bytes and numbers follow) and then the actual data
• Fortunately, ROOT stores event offsets at the end of baskets for vector branches
  → can read them and use numpy tricks to skip over the vector headers (see the sketch below)
• Not possible any more for further nesting (vector<vector<...>>)
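A minimal sketch of the "numpy tricks" step: once the flat values and the per-event offsets are known, the jagged array can be assembled directly without a python loop (the float values are taken from the example above; the offsets are made up for a single event):

    import numpy as np
    import awkward as ak

    # flat float values recovered from the basket (from the example above)
    content = np.array(
        [140095.81, 95933.9, 73785.266, 15523.536, 17186.186], dtype=np.float32
    )
    # per-event offsets (made up: one event containing all five values)
    offsets = np.array([0, 5], dtype=np.int64)

    layout = ak.layout.ListOffsetArray64(
        ak.layout.Index64(offsets),
        ak.layout.NumpyArray(content),
    )
    jagged = ak.Array(layout)  # -> one list of five floats per event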
Columnar data analysis with PHYSLITE

Idea:
• Most data is stored in "aux" branches (vector<float> etc.) → easily readable column-wise, also with uproot
• Reconstruction/calibrations are already applied
  → the rest might be "simple" enough to do with plain columnar operations
  → the xAOD EDM should be representable to a large extent in awkward array
  → many things are already solved by CMS in coffea / NanoEvents
Represent the PHYSLITE EDM as an awkward array

{
    "class": "RecordArray",
    "contents": {
        "AnalysisElectrons": {
            "class": "ListOffsetArray64",
            "offsets": "i64",
            "content": {
                "class": "RecordArray",
                "contents": {
                    "pt": "float32",
                    "eta": "float32",
                    "phi": "float32",
                    "m": "float32",
                    "charge": "float32",
                    "ptvarcone30_TightTTVA_pt1000": "float32",
                    (...)
                    "trackParticles": {
                        "class": "ListArray64",
                        "starts": "i64",
                        "stops": "i64",
                        "content": {
                            "class": "IndexedArray64",
                            "index": "i64",
                            "content": {
                                "class": "RecordArray",
                                "contents": {
                                    "phi": "float32",
                                    "d0": "float32",
                                    "z0": "float32",
                                    (...)
                                },
                                "parameters": {
                                    "__record__": "xAODTrackParticle"
                                }
                            }
                        }
                    }
                },
                "parameters": {
                    "__record__": "xAODParticle"
                }
            }
        },
        (...)
    }
}

With this we can do things like:

>>> # pt of the first track particle of each electron in events with at least one electron
>>> Events[ak.num(Events.AnalysisElectrons) >= 1].AnalysisElectrons.trackParticles.pt[:, :, 0]
Awkward combinatorics

ak.cartesian / ak.combinations (graphics from a tutorial by Jim Pivarski)

ak.cartesian can be called with nested=True to keep the structure of the first array
→ can apply a reducer afterwards to regain an array with the same structure
→ e.g. find the closest other particle (min); see the sketch below
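A minimal sketch of that pattern (toy eta values only, so "distance" here is just |Δη| rather than a full ΔR):

    import numpy as np
    import awkward as ak

    # toy events with electrons and jets (eta only, for brevity)
    electrons = ak.Array([[{"eta": 0.1}, {"eta": 1.2}], [{"eta": -0.5}]])
    jets = ak.Array([[{"eta": 0.0}, {"eta": 2.0}], [{"eta": -0.4}, {"eta": 1.0}]])

    # nested=True groups all jets per electron -> axes are (event, electron, jet)
    e, j = ak.unzip(ak.cartesian([electrons, jets], nested=True))
    deta = np.abs(e.eta - j.eta)

    # reduce over the innermost axis to regain one value per electron
    closest = ak.min(deta, axis=2)  # -> [[0.1, 0.8], [0.1]]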
approx. overlap removal

def has_overlap(obj1, obj2, filter_dr):
    """
    Return mask array where obj1 has overlap with obj2
    based on a filter function on deltaR (and pt of the first one)
    """
    obj1x, obj2x = ak.unzip(
        ak.cartesian([obj1[["pt", "eta", "phi"]], obj2[["eta", "phi"]]], nested=True)
    )
    dr = np.sqrt((obj1x.eta - obj2x.eta) ** 2 + delta_phi(obj1x, obj2x) ** 2)
    return ak.any(filter_dr(dr, obj1x.pt), axis=2)


def match_dr(dr, pt, cone_size=0.2):
    return dr < cone_size


def match_boosted_dr(dr, pt, max_cone_size=0.4):
    return dr < np.minimum(*ak.broadcast_arrays(10000.0 / pt + 0.04, max_cone_size))
Alternative: Numba

https://numba.pydata.org

• Just-in-time compiler for python code → just decorate a function with @numba.njit
• Works with numpy arrays
• Works with awkward arrays
• Reading components of awkward arrays works more or less straightforwardly
  → just index like Events[0].AnalysisElectrons[0].pt
• Creating awkward arrays is a bit more difficult - 2 options:
  → use the awkward ArrayBuilder
  → create arrays and offsets separately
• Can be a fallback if it is hard to think about a problem without a loop over events/objects
Overlap removal using numba and ArrayBuilder

@numba.njit
def delta_phi(obj1, obj2):
    return (obj1.phi - obj2.phi + np.pi) % (2 * np.pi) - np.pi


@numba.njit
def delta_r(obj1, obj2):
    return np.sqrt((obj1.eta - obj2.eta) ** 2 + delta_phi(obj1, obj2) ** 2)


@numba.njit
def has_overlap_numba(builder, obj1, obj2, cone_size=0.2):
    # loop over events
    for i in range(len(obj1)):
        builder.begin_list()
        # loop over first object list
        for k in range(len(obj1[i])):
            # loop over second object list
            for l in range(len(obj2[i])):
                if delta_r(obj1[i][k], obj2[i][l]) < cone_size:
                    builder.append(True)
                    break
            else:
                builder.append(False)
        builder.end_list()


def has_overlap(obj1, obj2, cone_size=0.2):
    builder = ak.ArrayBuilder()
    has_overlap_numba(builder, obj1, obj2, cone_size)
    return builder.snapshot()
approx. overlap removal - cont'd

# remove jets overlapping with electrons
evt["jets", "passOR"] = evt.jets.baseline & ~has_overlap(
    evt.jets, evt.electrons[evt.electrons.baseline], match_dr
)

# remove electrons overlapping (boosted cone) with remaining jets (if they pass jvt)
evt["electrons", "passOR"] = evt.electrons.baseline & ~has_overlap(
    evt.electrons, evt.jets[evt.jets.passOR & evt.jets.passJvt], match_boosted_dr
)

... etc