
Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014

ATLAS collaboration

Cite as: ATLAS collaboration (2014). Dataset from the ATLAS Higgs Boson Machine Learning Challenge 2014. CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.ZBP2.M5T8

Dataset Derived Datascience ATLAS CERN-LHC


Description

The dataset has been built from official ATLAS full-detector simulation, with "Higgs to tautau" events mixed with different backgrounds. The simulator has two parts. In the first, random proton-proton collisions are simulated based on the knowledge that we have accumulated on particle physics. It reproduces the random microscopic explosions resulting from the proton-proton collisions. In the second part, the resulting particles are tracked through a virtual model of the detector. The process yields simulated events with properties that mimic the statistical properties of the real events with additional information on what has happened during the collision, before particles are measured in the detector.

The signal sample contains events in which Higgs bosons (with a fixed mass of 125 GeV) were produced. The background sample was generated by other known processes that can produce events with at least one electron or muon and a hadronic tau, mimicking the signal. For the sake of simplicity, only three background processes were retained for the Challenge. The first comes from the decay of the Z boson (with a mass of 91.2 GeV) into two taus. This decay produces events with a topology very similar to that produced by the decay of a Higgs boson. The second set contains events with a pair of top quarks, which can have a lepton and a hadronic tau among their decay products. The third set involves the decay of the W boson, where one electron or muon and a hadronic tau can appear simultaneously only through imperfections of the particle identification procedure.

Due to the complexity of the simulation process, each simulated event has a weight that is proportional to the conditional density divided by the instrumental density used by the simulator (an importance-sampling flavour). The weights are normalised for integrated luminosity such that, in any region, the sum of the weights of the events falling in the region is an unbiased estimate of the expected number of events falling in the same region during a given fixed time interval. In our case, the weights correspond to the quantity of real data taken during the year 2012. The weights are an artifact of the way the simulation works, so they are not part of the input to the classifier. For the Challenge, weights were provided in the training set so that the AMS could be properly evaluated. Weights were not provided in the qualifying set, since the weight distributions of the signal and background sets are very different and would give away the label immediately. However, in the opendata.cern.ch dataset, weights and labels are provided for the complete dataset.
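As a rough illustration of how the weights are meant to be used, the sketch below sums the weights of the events that fall in a region, separately for each label. The column names `Weight`, `Label` and `DER_mass_MMC` come from the dataset semantics; the helper name `expected_yields`, the toy rows and the mass window are made up for the example and carry no physical meaning.

```python
from collections import defaultdict


def expected_yields(events, in_region):
    """Sum the event weights per label over the events selected by in_region.

    events    : iterable of dicts with at least "Weight" and "Label" keys
    in_region : function taking an event dict and returning True or False
    """
    totals = defaultdict(float)
    for event in events:
        if in_region(event):
            totals[event["Label"]] += float(event["Weight"])
    return dict(totals)


# Toy rows with invented values, only to show the mechanics.
toy_events = [
    {"DER_mass_MMC": 124.0, "Weight": 0.002, "Label": "s"},
    {"DER_mass_MMC": 91.0, "Weight": 1.5, "Label": "b"},
    {"DER_mass_MMC": 130.0, "Weight": 2.0, "Label": "b"},
    {"DER_mass_MMC": 120.0, "Weight": 0.004, "Label": "s"},
]

# Hypothetical region: an arbitrary window on the estimated mass.
yields = expected_yields(
    toy_events, lambda e: 100.0 <= e["DER_mass_MMC"] <= 150.0
)
print(yields)
```

Each entry of `yields` estimates the expected number of events of that label in the region for the time interval the weights are normalised to.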

The evaluation metric is the approximate median significance (AMS):

\[ \text{AMS} = \sqrt{2\left((s+b+b_r) \log \left(1 + \frac{s}{b + b_r}\right)-s\right)}\]

where

  • $s, b$: unnormalised true positive and false positive rates, respectively,
  • $b_r =10$ is the constant regularisation term,
  • $\log$ is the natural log.

More precisely, let $(y_1, \ldots, y_n) \in \{\text{b},\text{s}\}^n$ be the vector of true test labels, let $(\hat{y}_1, \ldots, \hat{y}_n) \in \{\text{b},\text{s}\}^n$ be the vector of predicted (submitted) test labels, and let $(w_1, \ldots, w_n) \in {\mathbb{R}^+}^n$ be the vector of weights. Then

\[ s = \sum_{i=1}^n w_i\mathbb{1}\{y_i = \text{s}\} \mathbb{1}\{\hat{y}_i = \text{s}\} \]

and

\[ b = \sum_{i=1}^n w_i\mathbb{1}\{y_i = \text{b}\} \mathbb{1}\{\hat{y}_i = \text{s}\}, \]

where the indicator function $\mathbb{1}\{A\}$ is 1 if its argument $A$ is true and 0 otherwise.
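The definitions above translate fairly directly into code. The following is a minimal sketch, not the official scoring script: the function name `ams` and the toy inputs are assumptions, while the formula, the labels `s` and `b`, and the regularisation term $b_r = 10$ are taken from the text.

```python
import math


def ams(y_true, y_pred, weights, b_reg=10.0):
    """Approximate median significance for predicted labels.

    y_true, y_pred : sequences of "s" / "b" labels (true and predicted)
    weights        : sequence of non-negative event weights
    b_reg          : constant regularisation term b_r
    """
    s = 0.0  # weighted true positives: true "s" predicted as "s"
    b = 0.0  # weighted false positives: true "b" predicted as "s"
    for y, y_hat, w in zip(y_true, y_pred, weights):
        if y_hat != "s":
            continue  # only events selected as signal contribute
        if y == "s":
            s += w
        else:
            b += w
    radicand = 2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s)
    return math.sqrt(radicand)


# Toy example with invented labels and weights.
y_true = ["s", "b", "s", "b"]
y_pred = ["s", "s", "b", "b"]
weights = [10.0, 20.0, 5.0, 100.0]
score = ams(y_true, y_pred, weights)
print(score)
```

In this toy example only the first two events are selected as signal, so $s = 10$ and $b = 20$. Events predicted as background do not enter the score at all.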

For more information on the statistical model and the derivation of the metric, see the documentation.

Dataset characteristics

818238 events. 1 file. 186.5 MiB in total.

Dataset semantics

Variable Description
EventId A unique integer identifier of the event.
DER_mass_MMC The estimated mass $m_{H}$ of the Higgs boson candidate, obtained through a probabilistic phase space integration.
DER_mass_transverse_met_lep The transverse mass between the missing transverse energy and the lepton.
DER_mass_vis The invariant mass of the hadronic tau and the lepton.
DER_pt_h The modulus of the vector sum of the transverse momentum of the hadronic tau, the lepton and the missing transverse energy vector.
DER_deltaeta_jet_jet The absolute value of the pseudorapidity separation between the two jets (undefined if PRI_jet_num $\leq$ 1).
DER_mass_jet_jet The invariant mass of the two jets (undefined if PRI_jet_num $\leq$ 1).
DER_prodeta_jet_jet The product of the pseudorapidities of the two jets (undefined if PRI_jet_num $\leq$ 1).
DER_deltar_tau_lep The R separation between the hadronic tau and the lepton.
DER_pt_tot The modulus of the vector sum of the missing transverse momenta and the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num $\geq$ 1) and the subleading jet (if PRI_jet_num = 2) (but not of any additional jets).
DER_sum_pt The sum of the moduli of the transverse momenta of the hadronic tau, the lepton, the leading jet (if PRI_jet_num $\geq$ 1), the subleading jet (if PRI_jet_num = 2) and the other jets (if PRI_jet_num = 3).
DER_pt_ratio_lep_tau The ratio of the transverse momenta of the lepton and the hadronic tau.
DER_met_phi_centrality The centrality of the azimuthal angle of the missing transverse energy vector w.r.t. the hadronic tau and the lepton.
DER_lep_eta_centrality The centrality of the pseudorapidity of the lepton w.r.t. the two jets (undefined if PRI_jet_num $\leq$ 1).
PRI_tau_pt The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the hadronic tau.
PRI_tau_eta The pseudorapidity $\eta$ of the hadronic tau.
PRI_tau_phi The azimuth angle $\phi$ of the hadronic tau.
PRI_lep_pt The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the lepton (electron or muon).
PRI_lep_eta The pseudorapidity $\eta$ of the lepton.
PRI_lep_phi The azimuth angle $\phi$ of the lepton.
PRI_met The missing transverse energy $\overrightarrow{E}^{miss}_{T}$.
PRI_met_phi The azimuth angle $\phi$ of the missing transverse energy.
PRI_met_sumet The total transverse energy in the detector.
PRI_jet_num The number of jets (integer with value of 0, 1, 2 or 3; possible larger values have been capped at 3).
PRI_jet_leading_pt The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the leading jet, that is the jet with largest transverse momentum (undefined if PRI_jet_num = 0).
PRI_jet_leading_eta The pseudorapidity $\eta$ of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_leading_phi The azimuth angle $\phi$ of the leading jet (undefined if PRI_jet_num = 0).
PRI_jet_subleading_pt The transverse momentum $\sqrt{p^{2}_{x} + p^{2}_{y}}$ of the subleading jet, that is, the jet with the second largest transverse momentum (undefined if PRI_jet_num $\leq$ 1).
PRI_jet_subleading_eta The pseudorapidity $\eta$ of the subleading jet (undefined if PRI_jet_num $\leq$ 1).
PRI_jet_subleading_phi The azimuth angle $\phi$ of the subleading jet (undefined if PRI_jet_num $\leq$ 1).
PRI_jet_all_pt The scalar sum of the transverse momentum of all the jets of the events.
Weight The event weight $w_{i}$
Label The event label (string) $y_{i}$ $\in$ $\{s,b\}$ (s for signal, b for background).
KaggleSet String specifying to which Kaggle set the event belongs: "t": training, "b": public leaderboard, "v": private leaderboard, "u": unused.
KaggleWeight Weight normalised within each Kaggle dataset.
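Several variables in the table are only defined for some jet multiplicities, and a few derived quantities follow from the primitive ones. The sketch below encodes only the conditions stated in the table (other variables may also be undefined in practice, and how undefined values are stored in the file is not assumed here). The helper names are hypothetical. The $\Delta R$ formula, $\sqrt{\Delta\eta^2 + \Delta\phi^2}$ with $\Delta\phi$ wrapped into $[-\pi, \pi]$, is the standard definition and is assumed to be the one behind DER_deltar_tau_lep.

```python
import math

# Variables the table marks as undefined if PRI_jet_num = 0.
LEADING_JET_FEATURES = [
    "PRI_jet_leading_pt",
    "PRI_jet_leading_eta",
    "PRI_jet_leading_phi",
]

# Variables the table marks as undefined if PRI_jet_num <= 1.
JET_PAIR_FEATURES = [
    "DER_deltaeta_jet_jet",
    "DER_mass_jet_jet",
    "DER_prodeta_jet_jet",
    "DER_lep_eta_centrality",
    "PRI_jet_subleading_pt",
    "PRI_jet_subleading_eta",
    "PRI_jet_subleading_phi",
]

# Meaning of the KaggleSet codes, as listed in the table.
KAGGLE_SETS = {
    "t": "training",
    "b": "public leaderboard",
    "v": "private leaderboard",
    "u": "unused",
}


def undefined_features(jet_num):
    """Return the variables the table lists as undefined for this jet count."""
    undefined = []
    if jet_num == 0:
        undefined += LEADING_JET_FEATURES
    if jet_num <= 1:
        undefined += JET_PAIR_FEATURES
    return undefined


def delta_r(eta1, phi1, eta2, phi2):
    """Standard R separation between two objects (assumed definition)."""
    d_eta = eta1 - eta2
    # math.remainder wraps the azimuthal difference into [-pi, pi].
    d_phi = math.remainder(phi1 - phi2, 2.0 * math.pi)
    return math.hypot(d_eta, d_phi)


print(undefined_features(1))
print(delta_r(0.5, 3.0, -0.5, -3.0))
```

For an event with PRI_jet_num of 2 or 3, `undefined_features` returns an empty list, since all jet variables in the table are then defined.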

External links

Go to the Higgs Boson Machine Learning Challenge on Kaggle

http://higgsml.lal.in2p3.fr

How were these data selected?

The events were selected from simulated events passing the single electron or single muon trigger. Each event has an identified electron or muon and an identified hadronic tau, and should not have a b-tagged jet.

How were these data validated?

Repeating the ATLAS "Higgs to tautau" analysis (as documented in the reference document, see documentation) on the dataset approximately reproduces the event yields quoted for signal and background. The event yields cannot be reproduced exactly because data-driven corrections have not been applied (see documentation for more details).

How can you use these data?

This dataset is an extended version of the data provided for the Higgs Boson Machine Learning Challenge on Kaggle. For more information, go to the Higgs Machine Learning documentation.


      

Disclaimer

These open data are released under the Creative Commons Zero v1.0 Universal license.


Neither the experiment(s) (ATLAS) nor CERN endorse any works, scientific or otherwise, produced using these data.

This release has a unique DOI that you are requested to cite in any applications or publications.
