Columnar data analysis with uproot and awkward array
Nikolai Hartmann
LMU Munich
February 19, 2021, Munich ATLAS Belle II Computing meeting
Columnar data analysis - Motivation
Operate on columns - “array-at-a-time” instead of “event-at-a-time”
Advantages:
• Operations are predefined, no for loops! Most prominent example: numpy
→ Move slow organisational code out of the event loop
→ Write analysis code in Python instead of C++
• These operations run on contiguous blocks of memory and are therefore fast (vectorizable, good for CPU caches)
• Lots of advances in tooling in recent years, since this kind of workflow is essential for data science/machine learning
Disadvantages:
• Arrays need to be loaded into memory
→ need to process chunk-wise if the amount of data is too large (see the sketch after this list)
• Some operations are more difficult to think about
(e.g. combinatorics, nested selections, variable-length lists per event)
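As a minimal sketch of chunk-wise processing with uproot.iterate (the file, tree, and branch names here are hypothetical placeholders, as is the process_chunk function):

import uproot

# iterate over the file in chunks so that only one chunk is in memory at a time
for arrays in uproot.iterate(
    "DAOD_PHYSLITE.root:CollectionTree",
    ["AnalysisElectronsAuxDyn.pt"],
    step_size="100 MB",
):
    # `arrays` holds one chunk of events; process_chunk is a stand-in
    process_chunk(arrays["AnalysisElectronsAuxDyn.pt"])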
ATLAS analysis model for Run 3 (+x)
Read DAOD PHYSLITE with uproot
DAOD PHYSLITE has most data split into columns, but
• Some branches have a higher level of nesting (e.g. vector<vector<...>>)
• Those can't be split by ROOT
• Also, need to loop through the data to "columnize" it
→ slow in plain Python
→ I have a hack based on numba for now
→ there is now a Forth machine in awkward that will handle this in the future
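For comparison, the branches that are split into plain columns can be read directly; a minimal sketch (the file name is a placeholder):

import uproot

# read one flat "aux" branch as a jagged awkward array
tree = uproot.open("DAOD_PHYSLITE.root")["CollectionTree"]
electron_pt = tree["AnalysisElectronsAuxDyn.pt"].array()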
Read DAOD PHYSLITE with uproot
[Plot: loading time for 10000 events [s] (log scale, ~10^-2 to ~10^1) for "uproot default", "custom deserialization", "only decompression", and "ROOT TTree::Draw", shown for branches of different nesting levels: vector<vector<...>> (Jets), vector<...> (Electrons), (MET). Plots from Jim Pivarski.]
https://github.com/scikit-hep/awkward-1.0/pull/661
(We will probably hear more from him about this topic at vCHEP 2021)
Intermezzo - why ROOT files have a high compression ratio
Example: Data of one basket of the AnalysisElectronsAuxDyn.pt branch:
"Garbage": the header (telling us "this is a vector") and the number of bytes following (redundant)
Even more true for higher-nested, structured data
Example: AnalysisElectronsAuxDyn.trackParticleLinks
(vector<vector<ElementLink<...>>>; ElementLink has 2 members - m_persKey, m_persIndex)
Alternative storage formats
Loading times for all columns of 10k DAOD PHYSLITE events
Format                Compression  Dedup. offsets  Size on disk  Loading time
ROOT                  zlib         No              117 MB        6.0 s
ROOT (large baskets)  zlib         No              116 MB        5.0 s
Parquet               snappy       No              121 MB        0.6 s
Parquet               snappy       Yes             118 MB        0.6 s
HDF5                  gzip         No              101 MB        2.0 s
HDF5                  gzip         Yes              89 MB        1.6 s
HDF5                  lzf          No              137 MB        1.5 s
HDF5                  lzf          Yes             113 MB        1.1 s
npz                   zip          No               92 MB        2.0 s
npz                   zip          Yes              82 MB        1.5 s
Parquet seems especially promising, but everything is faster than ROOT
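A minimal sketch of the Parquet route with awkward array (using ak.to_parquet / ak.from_parquet from awkward 1.x; the file name is a placeholder):

import awkward as ak

# write an awkward array of events to Parquet and read it back
ak.to_parquet(events, "physlite_events.parquet")
events_from_parquet = ak.from_parquet("physlite_events.parquet")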
Event data models and awkward array
Awkward array has everything we need to represent what we are doing in a columnar fashion:
• Nested records
e.g. Events -> [Electrons -> pt, eta, phi, ..., Jets -> pt, eta, phi ...]
• Behavior/Dynamic quantities
e.g. LorentzVector - can add vectors, calculate invariant masses etc. (see the sketch after this list)
• Cross references via indices
e.g. Events.Electrons.trackParticle represents an electron's track particle via indices
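To make the behavior idea concrete, a minimal sketch of registering an array class for records labeled "xAODParticle" (the class and the derived property are made up for illustration):

import awkward as ak

behavior = {}

class xAODParticleArray(ak.Array):
    @property
    def pt_gev(self):
        # hypothetical derived quantity
        return self["pt"] / 1000.0

# arrays of records carrying parameters={"__record__": "xAODParticle"}
# will use this class when built with behavior=behavior
behavior["*", "xAODParticle"] = xAODParticleArray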
Prototype for DAOD PHYSLITE
→ git
Can already do this:
>>> import awkward as ak
>>> events[ak.num(events.Electrons) >= 1].Electrons.pt[:, 0]
→ filtering on different levels
>>> events.Electrons.trackParticles.z0
→ dynamically create cross references from indices
>>> events.Electrons.trackParticles
>>> events.Electrons.trackParticles.pt
→ dynamically calculate momenta from track parameters
>>> electrons = Events.electrons
>>> jets = Events.jets
>>> electrons.delta_r(electrons.nearest(jets)) < 0.2
→ more advanced LorentzVector calculations
Technical aspects of this
• Class names are attached as metadata to the arrays
→ separation of data schema and behaviour
• To do dynamic cross references, need a reference to the top level object
(events.Electrons needs to know about events)
• Also, want to load columns lazily
→ very useful for interactive work
→ using awkward's VirtualArray
• Cross references also have to work after slicing/indexing the array
→ need “global” indices
All these exist in the coffea NanoEvents module
→ in contact with developers, working on an implementation for DAOD PHYSLITE
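A rough sketch of the NanoEvents entry point in coffea (a PHYSLITE-specific schema is the part being worked on; file and tree names are placeholders):

from coffea.nanoevents import NanoEventsFactory

# build a lazily loaded, behavior-equipped event array from a ROOT file
events = NanoEventsFactory.from_root(
    "DAOD_PHYSLITE.root",
    treepath="CollectionTree",
).events()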
Trying to do an actual analysis
[Bar charts: object counts for Electrons, Jets, and Muons at the selection stages "all", "baseline", "passOR", "signal", comparing the athena/SUSYTools and columnar analyses]
• Start with some simple object selections on Electrons, Muons, Jets
→ most challenging part: getting all the overlap removal logic correct
• Compare with the SUSYTools framework (athena analysis)
→ working to some extent
• Many things still missing/unclear
→ e.g. MET calculation, pileup reweighting, systematics
Performance
Measurement        Total time [s]  Average events / s
Athena/SUSYTools   22              2300
Columnar           3.8             13000
Columnar (cached)  1.2             42000
• In all cases: Read with “warm” page cache
• “Cached” for columnar analysis means data already decompressed and deserialized
Scaling tests
• We now have a sample of PHYSLITE ATLAS Run 2 data:
100 TB, 260k files, 1.8e10 events
• Starting to test with a 10% subset on LRZ
• Want to run this using dask (see the sketch after this list)
• Found several issues with (py)xrootd and uproot's xrootd handling on the way ...
→ mostly fixed now, but still some problems with memory leaks
• Want to test this analysis within the ATLAS Google cloud project
→ working together with Lukas Heinrich and Ricardo Rocha, who did this demo at KubeCon 2019
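A minimal sketch of the intended dask setup (run_selection and physlite_files are hypothetical placeholders for the per-file analysis function and the file list):

from dask.distributed import Client

# connect to a cluster (or start a local one) and map the analysis over all files
client = Client()
futures = client.map(run_selection, physlite_files)
results = client.gather(futures)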
Backup
An impressive demo
Lukas Heinrich and Ricardo Rocha at KubeCon 2019 → youtube recording, chep talk
They re-performed the Higgs discovery analysis on 70 TB of CMS open data in a live demo
ROOT file storage
(graphics from tutorial by Jim Pivarski)
What does the basket data actually look like?
... and how does uproot read it?
For simple n-tuples, actually just the numbers!
• Stored in big-endian format (most significant byte first)
(most processors nowadays use little-endian)
• After decompression, the basket can simply be loaded into a numpy array
• Example for a float (single precision) branch:
np.frombuffer(basket_data, dtype=">f4")
More complicated for vector branches
Example for vector<float>:

10-byte header (vector<float>)   float          float           float           float           float
64 0 0 26 0 9 0 0 0 5          | 72 8 207 244 | 71 187 94 243 | 71 144 28 162 | 70 114 142 37 | 70 134 68 95
                               | 140095.81    | 95933.9       | 73785.266     | 15523.536     | 17186.186
• Each event consists of a vector header (telling us how many bytes and numbers follow) and then the actual data
• Fortunately, ROOT stores event offsets at the end of baskets for vector branches
→ can read them and use numpy tricks to skip over the vector headers (see the sketch below)
• Not possible any more for further nesting (e.g. vector<vector<...>>)
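A minimal sketch of that trick, assuming 10-byte headers, float32 payload, and per-event byte offsets as stored by ROOT (the index construction below still uses a Python loop for clarity; the real trick is fully vectorized):

import numpy as np

def read_vector_float_basket(data, offsets):
    # data: decompressed basket bytes
    # offsets: byte offset of each event within the basket (length n_events + 1)
    header = 10                               # vector header size per event
    starts = offsets[:-1] + header            # payload start of each event
    stops = offsets[1:]                       # payload end of each event
    # gather all payload bytes, skipping the headers
    index = np.concatenate([np.arange(a, b) for a, b in zip(starts, stops)])
    payload = np.frombuffer(data, dtype=np.uint8)[index]
    values = payload.view(">f4")              # big-endian float32
    counts = (stops - starts) // 4            # number of floats per event
    return values, counts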
Columnar data analysis with PHYSLITE
Idea:
• Most data is stored in "aux" branches (plain vector<...>)
→ easily readable column-wise, also with uproot
• Reconstruction/calibrations are already applied
→ the rest might be "simple" enough to do with plain columnar operations
→ the xAOD EDM should be representable to a large extent in awkward array
→ many things already solved by CMS in coffea / NanoEvents
Represent the PHYSLITE EDM as an awkward array
{
    "class": "RecordArray",
    "contents": {
        "AnalysisElectrons": {
            "class": "ListOffsetArray64",
            "offsets": "i64",
            "content": {
                "class": "RecordArray",
                "contents": {
                    "pt": "float32",
                    "eta": "float32",
                    "phi": "float32",
                    "m": "float32",
                    "charge": "float32",
                    "ptvarcone30_TightTTVA_pt1000": "float32",
                    (...)
                    "trackParticles": { }
                },
                "parameters": {
                    "__record__": "xAODParticle"
                }
            }
        },
        (...)
    }
}

The trackParticles field (shown collapsed above) expands to:

{
    "class": "ListArray64",
    "starts": "i64",
    "stops": "i64",
    "content": {
        "class": "IndexedArray64",
        "index": "i64",
        "content": {
            "class": "RecordArray",
            "contents": {
                "phi": "float32",
                "d0": "float32",
                "z0": "float32",
                (...)
            },
            "parameters": {
                "__record__": "xAODTrackParticle"
            }
        }
    }
}
With this we can do things like
>>> # pt of the first track particle of each electron in events with at least one electron
>>> Events[ak.num(Events.AnalysisElectrons) >= 1].AnalysisElectrons.trackParticles.pt[:,:,0]
Awkward combinatorics
[Graphics illustrating ak.cartesian and ak.combinations, from a tutorial by Jim Pivarski]
ak.cartesian can be called with nested=True to keep the structure of the first array
→ a reducer can be applied afterwards to regain an array with the same structure
→ e.g. find the closest other particle (min)
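A minimal sketch of this pattern, with electrons and jets standing in for any two collections:

import awkward as ak
import numpy as np

# pair every electron with every jet, keeping the electron structure (nested=True)
ele, jet = ak.unzip(ak.cartesian([electrons, jets], nested=True))
dphi = (ele.phi - jet.phi + np.pi) % (2 * np.pi) - np.pi
dr = np.sqrt((ele.eta - jet.eta) ** 2 + dphi ** 2)
# reduce over the innermost (jet) axis: deltaR of the closest jet per electron
closest_dr = ak.min(dr, axis=2)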
approx. overlap removal
import awkward as ak
import numpy as np

def delta_phi(obj1, obj2):
    # same formula as on the numba slide below, without the njit decorator
    return (obj1.phi - obj2.phi + np.pi) % (2 * np.pi) - np.pi

def has_overlap(obj1, obj2, filter_dr):
    """
    Return mask array where obj1 has overlap with obj2 based on a filter
    function on deltaR (and pt of the first one)
    """
    obj1x, obj2x = ak.unzip(
        ak.cartesian([obj1[["pt", "eta", "phi"]], obj2[["eta", "phi"]]], nested=True)
    )
    dr = np.sqrt((obj1x.eta - obj2x.eta) ** 2 + delta_phi(obj1x, obj2x) ** 2)
    return ak.any(filter_dr(dr, obj1x.pt), axis=2)

def match_dr(dr, pt, cone_size=0.2):
    return dr < cone_size

def match_boosted_dr(dr, pt, max_cone_size=0.4):
    return dr < np.minimum(*ak.broadcast_arrays(10000.0 / pt + 0.04, max_cone_size))
Alternative: Numba
https://numba.pydata.org
• Just-in-time compiler for python code
→ just decorate a function with @numba.njit
• Works with numpy arrays
• Works with awkward arrays
• Reading components of awkward arrays is more or less straightforward
→ just index like Events[0].AnalysisElectrons[0].pt
• Creating awkward arrays is a bit more difficult - 2 options:
→ use the awkward ArrayBuilder
→ create arrays and offsets separately
• Can be a fallback if it is hard to think about a problem without a loop over events/objects
Overlap removal using numba and ArrayBuilder
import awkward as ak
import numba
import numpy as np

@numba.njit
def delta_phi(obj1, obj2):
    return (obj1.phi - obj2.phi + np.pi) % (2 * np.pi) - np.pi

@numba.njit
def delta_r(obj1, obj2):
    return np.sqrt((obj1.eta - obj2.eta) ** 2 + delta_phi(obj1, obj2) ** 2)

@numba.njit
def has_overlap_numba(builder, obj1, obj2, cone_size=0.2):
    # loop over events
    for i in range(len(obj1)):
        builder.begin_list()
        # loop over first object list
        for k in range(len(obj1[i])):
            # loop over second object list
            for l in range(len(obj2[i])):
                if delta_r(obj1[i][k], obj2[i][l]) < cone_size:
                    builder.append(True)
                    break
            else:
                # inner loop finished without a break: no overlap found
                builder.append(False)
        builder.end_list()

def has_overlap(obj1, obj2, cone_size=0.2):
    builder = ak.ArrayBuilder()
    has_overlap_numba(builder, obj1, obj2, cone_size)
    return builder.snapshot()
approx. overlap removal - cont'd
# remove jets overlapping with electrons
evt["jets", "passOR"] = (
evt.jets.baseline
& (
~has_overlap(
evt.jets,
evt.electrons[evt.electrons.baseline],
match_dr
)
)
)
# remove electrons overlapping (boosted cone) with remaining jets (if they pass jvt)
evt["electrons", "passOR"] = (
evt.electrons.baseline
& (
~has_overlap(
evt.electrons,
evt.jets[evt.jets.passOR & evt.jets.passJvt],
match_boosted_dr
)
)
)
... etc