ATLAS collaboration

Cite as: ATLAS collaboration (2014). Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014. CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.ZBP2.M5T8

The dataset has been built from official ATLAS full-detector simulation, with "Higgs to tautau" events mixed with different backgrounds. The simulator has two parts. In the first, random proton-proton collisions are simulated based on the knowledge that we have accumulated on particle physics. It reproduces the random microscopic explosions resulting from the proton-proton collisions. In the second part, the resulting particles are tracked through a virtual model of the detector. The process yields simulated events with properties that mimic the statistical properties of the real events with additional information on what has happened during the collision, before particles are measured in the detector.

The signal sample contains events in which Higgs bosons (with a fixed mass of 125 GeV) were produced. The background sample was generated by other known processes that can produce events with at least one electron or muon and a hadronic tau, mimicking the signal. For the sake of simplicity, only three background processes were retained for the Challenge. The first comes from the decay of the Z boson (with a mass of 91.2 GeV) into two taus. This decay produces events with a topology very similar to that produced by the decay of a Higgs boson. The second set contains events with a pair of top quarks, which can have a lepton and a hadronic tau among their decay products. The third set involves the decay of the W boson, where one electron or muon and a hadronic tau can appear simultaneously only through imperfections of the particle identification procedure.

Due to the complexity of the simulation process, each simulated event has a weight that is proportional to the conditional density divided by the instrumental density used by the simulator (an importance-sampling flavour). The weights are normalised for integrated luminosity such that, in any region, the sum of the weights of the events falling in the region is an unbiased estimate of the expected number of events falling in the same region during a given fixed time interval. In our case, the weights correspond to the quantity of real data taken during the year 2012. The weights are an artifact of the way the simulation works, so they are not part of the input to the classifier. For the Challenge, weights were provided in the training set so that the AMS could be properly evaluated. Weights were not provided in the qualifying set, since the weight distributions of the signal and background sets are very different and would give away the label immediately. However, in the opendata.cern.ch dataset, weights and labels are provided for the complete dataset.
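
As an illustration of how the weights are meant to be used, the sketch below sums the `Weight` column per `Label` over the events falling in a region. The event dictionaries, the toy values, and the `in_region` predicate are hypothetical and only serve the example; they are not taken from the dataset.

```python
from collections import defaultdict


def expected_yields(events, in_region):
    """Sum event weights per label over the events selected by `in_region`.

    `events` is an iterable of dicts with at least "Weight" and "Label" keys;
    `in_region` is any predicate on an event (a hypothetical selection).
    """
    totals = defaultdict(float)
    for event in events:
        if in_region(event):
            totals[event["Label"]] += event["Weight"]
    return dict(totals)


# Toy events with made-up values, for illustration only.
toy_events = [
    {"Label": "s", "Weight": 0.25, "PRI_jet_num": 2},
    {"Label": "s", "Weight": 0.25, "PRI_jet_num": 2},
    {"Label": "b", "Weight": 1.0, "PRI_jet_num": 2},
    {"Label": "b", "Weight": 2.0, "PRI_jet_num": 3},
    {"Label": "b", "Weight": 4.0, "PRI_jet_num": 0},
]

# Hypothetical region: events with at least two jets.
toy_yields = expected_yields(toy_events, lambda ev: ev["PRI_jet_num"] >= 2)
```

Under these assumptions, each entry of `toy_yields` plays the role of the expected number of signal or background events in the chosen region.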

The evaluation metric is the *approximate median significance* (AMS):

\[ \text{AMS} = \sqrt{2\left((s+b+b_r) \log \left(1 + \frac{s}{b + b_r}\right)-s\right)}\]

where

- $s, b$: unnormalised true positive and false positive rates, respectively,
- $b_r =10$ is the constant regularisation term,
- $\log$ is the natural log.

More precisely, let $(y_1, \ldots, y_n) \in \{\text{b},\text{s}\}^n$ be the vector of true test labels, let $(\hat{y}_1, \ldots, \hat{y}_n) \in \{\text{b},\text{s}\}^n$ be the vector of predicted (submitted) test labels, and let $(w_1, \ldots, w_n) \in {\mathbb{R}^+}^n$ be the vector of weights. Then

\[ s = \sum_{i=1}^n w_i\mathbb{1}\{y_i = \text{s}\} \mathbb{1}\{\hat{y}_i = \text{s}\} \]

and

\[ b = \sum_{i=1}^n w_i\mathbb{1}\{y_i = \text{b}\} \mathbb{1}\{\hat{y}_i = \text{s}\}, \]

where the indicator function $\mathbb{1}\{A\}$ is 1 if its argument $A$ is true and 0 otherwise.
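
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the official scoring code; the function names `weighted_counts` and `ams` are chosen here for the example, and labels are assumed to be the strings `"s"` and `"b"`.

```python
import math


def weighted_counts(y_true, y_pred, weights):
    """Return (s, b): weighted true positives and weighted false positives."""
    s = 0.0
    b = 0.0
    for truth, pred, w in zip(y_true, y_pred, weights):
        if pred == "s":  # only events predicted as signal contribute
            if truth == "s":
                s += w
            elif truth == "b":
                b += w
    return s, b


def ams(s, b, b_reg=10.0):
    """Approximate median significance with regularisation term b_reg."""
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


# Toy labels and weights with made-up values, for illustration only.
toy_true = ["s", "b", "s", "b"]
toy_pred = ["s", "s", "b", "b"]
toy_weights = [1.5, 2.0, 3.0, 4.0]

toy_s, toy_b = weighted_counts(toy_true, toy_pred, toy_weights)
toy_ams = ams(toy_s, toy_b)
```

Note that events predicted as background do not enter $s$ or $b$ at all, so the score depends only on the selected region.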

For more information on the statistical model and the derivation of the metric, see the documentation.

Variable | Description |
---|---|
EventId | A unique integer identifier of the event. |
DER_mass_MMC | The estimated mass $m_{H}$ of the Higgs boson candidate, obtained through a probabilistic phase space integration. |
DER_mass_transverse_met_lep | The transverse mass between the missing transverse energy and the lepton. |
DER_mass_vis | The invariant mass of the hadronic tau and the lepton. |
DER_pt_h | The modulus of the vector sum of the transverse momenta of the hadronic tau and the lepton and the missing transverse energy vector. |
DER_deltaeta_jet_jet | The absolute value of the pseudorapidity separation between the two jets (undefined if PRI_jet_num $\leq$ 1). |
DER_mass_jet_jet | The invariant mass of the two jets (undefined if PRI_jet_num $\leq$ 1). |
DER_prodeta_jet_jet | The product of the pseudorapidities of the two jets (undefined if PRI_jet_num $\leq$ 1). |
DER_deltar_tau_lep | The R separation between the hadronic tau and the lepton. |
DER_pt_tot | The modulus of the vector sum of the missing transverse momentum and the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num $\geq$ 1) and the subleading jet (if PRI_jet_num = 2) (but not of any additional jets). |
DER_sum_pt | The sum of the moduli of the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num $\geq$ 1), the subleading jet (if PRI_jet_num = 2) and the other jets (if PRI_jet_num = 3). |
DER_pt_ratio_lep_tau | The ratio of the transverse momenta of the lepton and the hadronic tau. |
DER_met_phi_centrality | The centrality of the azimuthal angle of the missing transverse energy vector w.r.t. the hadronic tau and the lepton. |
DER_lep_eta_centrality | The centrality of the pseudorapidity of the lepton w.r.t. the two jets (undefined if PRI_jet_num $\leq$ 1). |
PRI_tau_pt | The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the hadronic tau. |
PRI_tau_eta | The pseudorapidity $\eta$ of the hadronic tau. |
PRI_tau_phi | The azimuth angle $\phi$ of the hadronic tau. |
PRI_lep_pt | The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the lepton (electron or muon). |
PRI_lep_eta | The pseudorapidity $\eta$ of the lepton. |
PRI_lep_phi | The azimuth angle $\phi$ of the lepton. |
PRI_met | The missing transverse energy $\overrightarrow{E}^{miss}_{T}$. |
PRI_met_phi | The azimuth angle $\phi$ of the missing transverse energy. |
PRI_met_sumet | The total transverse energy in the detector. |
PRI_jet_num | The number of jets (integer with value of 0, 1, 2 or 3; possible larger values have been capped at 3). |
PRI_jet_leading_pt | The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the leading jet, that is, the jet with the largest transverse momentum (undefined if PRI_jet_num = 0). |
PRI_jet_leading_eta | The pseudorapidity $\eta$ of the leading jet (undefined if PRI_jet_num = 0). |
PRI_jet_leading_phi | The azimuth angle $\phi$ of the leading jet (undefined if PRI_jet_num = 0). |
PRI_jet_subleading_pt | The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the subleading jet, that is, the jet with the second largest transverse momentum (undefined if PRI_jet_num $\leq$ 1). |
PRI_jet_subleading_eta | The pseudorapidity $\eta$ of the subleading jet (undefined if PRI_jet_num $\leq$ 1). |
PRI_jet_subleading_phi | The azimuth angle $\phi$ of the subleading jet (undefined if PRI_jet_num $\leq$ 1). |
PRI_jet_all_pt | The scalar sum of the transverse momenta of all the jets of the event. |
Weight | The event weight $w_{i}$. |
Label | The event label (string) $y_{i}$ $\in$ $\{s,b\}$ (s for signal, b for background). |
KaggleSet | String specifying to which Kaggle set the event belongs: "t": training, "b": public leaderboard, "v": private leaderboard, "u": unused. |
KaggleWeight | Weight normalised within each Kaggle dataset. |
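
To make the relationship between the primitive (`PRI_`) and derived (`DER_`) variables more concrete, the sketch below recomputes two derived quantities from primitives and lists which jet-related columns are undefined for a given `PRI_jet_num`, following the descriptions above. It assumes the standard definition of the R separation, $\Delta R = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$ with $\Delta\phi$ wrapped into $[-\pi, \pi)$, which is not spelled out in this record. The function names are hypothetical, and the sketch makes no assumption about how undefined values are encoded in the files.

```python
import math

# Columns undefined when there is no jet (PRI_jet_num = 0).
LEADING_JET_COLUMNS = [
    "PRI_jet_leading_pt", "PRI_jet_leading_eta", "PRI_jet_leading_phi",
]
# Columns undefined when there are fewer than two jets (PRI_jet_num <= 1).
TWO_JET_COLUMNS = [
    "DER_deltaeta_jet_jet", "DER_mass_jet_jet", "DER_prodeta_jet_jet",
    "DER_lep_eta_centrality",
    "PRI_jet_subleading_pt", "PRI_jet_subleading_eta", "PRI_jet_subleading_phi",
]


def undefined_jet_columns(jet_num):
    """Return the columns documented as undefined for this PRI_jet_num."""
    if jet_num == 0:
        return LEADING_JET_COLUMNS + TWO_JET_COLUMNS
    if jet_num == 1:
        return list(TWO_JET_COLUMNS)
    return []


def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """R separation between two objects (assumed standard definition)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def pt_ratio_lep_tau(event):
    """Ratio of the lepton and hadronic tau transverse momenta."""
    return event["PRI_lep_pt"] / event["PRI_tau_pt"]


# Toy event with made-up values, for illustration only.
toy_event = {
    "PRI_tau_pt": 40.0, "PRI_tau_eta": 1.0, "PRI_tau_phi": 0.5,
    "PRI_lep_pt": 60.0, "PRI_lep_eta": -2.0, "PRI_lep_phi": 0.5,
}
toy_delta_r = delta_r(toy_event["PRI_tau_eta"], toy_event["PRI_tau_phi"],
                      toy_event["PRI_lep_eta"], toy_event["PRI_lep_phi"])
toy_ratio = pt_ratio_lep_tau(toy_event)
```

Comparing such recomputed values with the stored `DER_deltar_tau_lep` and `DER_pt_ratio_lep_tau` columns is one way to check that a file has been read correctly.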

The events were selected from simulated events passing the single electron or single muon trigger. Each event has an identified electron or muon and an identified hadronic tau, and should not have a b-tagged jet.

Repeating the ATLAS "Higgs to tautau" analysis (as documented in the reference document, see documentation) on the dataset allows the event yields quoted for signal and background to be reproduced approximately. The event yields cannot be reproduced exactly because data-driven corrections have not been applied (see documentation for more details).

This dataset is an extended version of the data provided for the Higgs Boson Machine Learning Challenge on Kaggle. For more information

The open data are released under the Creative Commons CC0 waiver. Neither the experiment(s) (ATLAS) nor CERN endorse any works, scientific or otherwise, produced using these data. All releases will have a unique DOI that you are requested to cite in any applications or publications.