Boosted top tagging is an essential binary classification task for experiments at the Large Hadron Collider (LHC) to measure the properties of the top quark. The ATLAS Top Tagging Open Data Set is a publicly available data set for the development of machine learning (ML) based boosted top tagging algorithms. The data are split into two orthogonal sets, named train and test, which are stored in the HDF5 file format and contain 42 million and 2.5 million jets, respectively. Both sets are composed of equal parts signal (jets initiated by a boosted top quark) and background (jets initiated by light quarks or gluons). For each jet, the data set contains:
The four-vectors of the constituent particles
15 high-level summary quantities evaluated on the jet
The four-vector of the whole jet
A training weight
A signal (1) vs background (0) label.
There is one rule for using this data set: the contribution of any jet to a loss function should always be weighted by that jet's training weight. Apart from this, a model may separate the signal jets from the background by whatever means necessary.
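As an illustration of the weighting rule, the following is a minimal sketch of a weighted binary cross-entropy in plain Python. The function name `weighted_bce`, the toy values, and the choice to normalise by the sum of weights are assumptions made for this example; they are not prescribed by the data set, and any loss function may be used provided each jet's contribution is multiplied by its training weight.

```python
import math


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Binary cross-entropy with each jet's term scaled by its training weight.

    labels  : 1 for signal, 0 for background
    probs   : predicted probability that the jet is signal
    weights : per-jet training weights
    Normalising by the sum of weights is an assumption of this sketch.
    """
    total = 0.0
    weight_sum = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        per_jet = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += w * per_jet  # the training weight scales this jet's contribution
        weight_sum += w
    return total / weight_sum


# Toy values for illustration only; these are not taken from the data set.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.7]
weights = [1.0, 0.5, 2.0, 0.25]
loss = weighted_bce(labels, probs, weights)
```

With all weights equal, this reduces to the ordinary mean cross-entropy, and a jet with zero weight does not affect the result. Most ML frameworks accept per-sample weights in their built-in losses, which serves the same purpose.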
The open data are released under the Creative Commons CC0 waiver. Neither the experiment(s) (ATLAS) nor CERN endorse any works, scientific or otherwise, produced using these data.
All releases will have a unique DOI that you are requested to cite in any applications or