ATLAS top tagging open data set with systematic uncertainties

Name: ATLAS top tagging open data set with systematic uncertainties
Creator: ATLAS collaboration
Published: 2024

Cite as: ATLAS collaboration (2024). ATLAS top tagging open data set with systematic uncertainties. CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.SOAY.LABE

Dataset Derived Simulated Datascience ATLAS CERN-LHC

Description

Boosted top tagging is an essential binary classification task for experiments at the Large Hadron Collider (LHC) to measure the properties of the top quark. The ATLAS Top Tagging Open Data Set is a publicly available dataset for the development of machine learning (ML) based boosted top tagging algorithms. The dataset consists of a nominal piece used for the training and evaluation of algorithms, and a systematic piece used for estimating the size of the systematic uncertainties produced by an algorithm. The nominal data is split into two orthogonal sets, named train and test. The systematically varied data is split into many more pieces that, in most cases, should only be used for evaluation. Both nominal sets are composed of equal parts signal (jets initiated by a boosted top quark) and background (jets initiated by light quarks or gluons).

A brief overview of these datasets is as follows. For more detailed information see arXiv:2407.20127.

train_nominal - 92,820,427 jets, equal parts signal and background
test_nominal - 10,306,813 jets, equal parts signal and background
esup - 10,032,472 jets with the cluster energy scale up systematic variation active, equal parts signal and background
esdown - 10,032,472 jets with the cluster energy scale down systematic variation active, equal parts signal and background
cer - 10,040,653 jets with the cluster energy resolution systematic variation active, equal parts signal and background
cpos - 10,032,472 jets with the cluster energy position systematic variation active, equal parts signal and background
teg - 7,421,204 jets with the track efficiency global systematic variation active, 30% signal jets
tej - 7,017,046 jets with the track efficiency in jets systematic variation active, 32% signal jets
tfl - 5,907,310 jets with the track fake rate loose systematic variation active, 18% signal jets
tfj - 6,977,371 jets with the track fake rate in jets systematic variation active, 32% signal jets
bias - 10,011,330 jets with the track bias systematic variation active, 52% signal jets
ttbar_pythia - 193,792 jets from Pythia-simulated events containing Standard Model top-antitop quark pair production, all signal jets
ttbar_herwig - 180,811 jets from Herwig-simulated events containing Standard Model top-antitop quark pair production, all signal jets
cluster - 5,000,004 jets simulated using the Sherpa cluster-based hadronization model, all background jets
string - 5,000,001 jets simulated using the Lund string-based hadronization model, all background jets
angular - 4,900,000 jets simulated using the Herwig angular-ordered parton shower model, all background jets
dipole - 4,900,000 jets simulated using the Herwig dipole parton shower model, all background jets
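
Several of the systematic samples are not class balanced, which matters when comparing metrics across them. As a rough illustration, the sketch below records a few of the sample sizes and signal fractions quoted above and derives approximate per-class jet counts. The dictionary layout and the helper name approx_class_counts are hypothetical, and because the quoted percentages are rounded the resulting counts are only approximate.

```python
# Jet counts and signal fractions copied from the overview list above.
# Only a subset is shown; the layout is illustrative, not an official API.
SAMPLES = {
    "test_nominal": (10_306_813, 0.50),
    "esup": (10_032_472, 0.50),
    "teg": (7_421_204, 0.30),
    "tfl": (5_907_310, 0.18),
    "bias": (10_011_330, 0.52),
    "ttbar_pythia": (193_792, 1.00),  # all signal
    "cluster": (5_000_004, 0.00),     # all background
}


def approx_class_counts(name):
    """Return approximate (signal, background) jet counts for a sample."""
    n_jets, signal_fraction = SAMPLES[name]
    n_signal = round(n_jets * signal_fraction)
    return n_signal, n_jets - n_signal
```

Samples that are entirely signal or entirely background cannot be used on their own for metrics that need both classes, such as background rejection at a fixed signal efficiency.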

For each jet, the datasets contain:

The four-vectors of the constituent particles
15 high-level summary quantities evaluated on the jet
The four-vector of the whole jet
A training weight (nominal only)
PYTHIA shower weights (nominal only)
A signal (1) vs background (0) label
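
To illustrate how constituent four-vectors relate to a jet-level four-vector, the sketch below sums constituents given in (pT, eta, phi, E) coordinates. The coordinate convention, the function name jet_four_vector and the use of plain Python lists are assumptions made for this example; the stored format should be checked against the files themselves, and the jet four-vector stored in the dataset may differ from a plain sum, for example through calibration. At the scale of this dataset an array library would normally be used instead of Python loops.

```python
import math


def jet_four_vector(pt, eta, phi, energy):
    """Sum constituent four-vectors given as (pT, eta, phi, E) sequences.

    Returns the summed (pT, eta, phi, E). The input convention is an
    assumption of this sketch, not a statement about the file layout.
    """
    # Convert each constituent to Cartesian momentum components and sum them.
    px = sum(p * math.cos(f) for p, f in zip(pt, phi))
    py = sum(p * math.sin(f) for p, f in zip(pt, phi))
    pz = sum(p * math.sinh(e) for p, e in zip(pt, eta))
    total_energy = sum(energy)

    jet_pt = math.hypot(px, py)
    if jet_pt == 0.0:
        raise ValueError("summed transverse momentum is zero; eta is undefined")

    # Convert the summed momentum back to collider coordinates.
    jet_eta = math.asinh(pz / jet_pt)
    jet_phi = math.atan2(py, px)
    return jet_pt, jet_eta, jet_phi, total_energy
```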

There are two rules for using this data set: the contribution to a loss function from any jet should always be weighted by the training weight, and any performance claim is incomplete without an estimate of the systematic uncertainties via the method illustrated in this repository. The ideal model shows high performance but also small systematic uncertainties.
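
The first rule can be sketched with a weighted binary cross-entropy, in which each jet's loss term is multiplied by its training weight. This is a minimal example and not the implementation used in the repository: the function name weighted_bce, the choice of binary cross-entropy and the normalization by the sum of weights are all assumptions made here.

```python
import math


def weighted_bce(scores, labels, weights, eps=1e-12):
    """Weighted mean binary cross-entropy.

    scores  : predicted signal probabilities in [0, 1]
    labels  : 1 for signal, 0 for background
    weights : per-jet training weights
    """
    total = 0.0
    weight_sum = 0.0
    for p, y, w in zip(scores, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        # Each jet's contribution is scaled by its training weight.
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    # Normalizing by the weight sum is a design choice of this sketch.
    return total / weight_sum
```

Most ML frameworks accept per-sample weights directly in their loss functions, which achieves the same effect without a hand-written loop.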

Dataset characteristics

205,774,178 events. 2,020 files. 170.3 GiB in total.

External links

ATLAS paper arXiv:2407.20127

How can you use these data?

This dataset supersedes an earlier data release which did not include data for estimating systematic uncertainties.

A detailed explanation of this dataset, with examples demonstrating how to train a tagger and assess systematic uncertainties, is provided in this repository.
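
As a rough idea of what assessing a systematic uncertainty can look like, the sketch below fixes a score cut at a chosen signal efficiency on the nominal test set, computes the background rejection on the nominal and on a systematically varied sample, and reports the relative shift. This is one common approach and is not necessarily the method used in the repository, which remains the authoritative reference. All function names are hypothetical, jets are left unweighted for simplicity, and keeping the cut fixed at its nominal value is an assumption of this example.

```python
def threshold_at_efficiency(signal_scores, efficiency):
    """Score cut that keeps roughly `efficiency` of the signal jets."""
    ordered = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(efficiency * len(ordered)))
    return ordered[n_keep - 1]


def background_rejection(background_scores, threshold):
    """Inverse of the fraction of background jets at or above the cut."""
    n_pass = sum(1 for s in background_scores if s >= threshold)
    if n_pass == 0:
        return float("inf")
    return len(background_scores) / n_pass


def relative_shift(nominal, varied):
    """Fractional change of a metric between nominal and varied samples."""
    return (varied - nominal) / nominal
```

A tagger whose background rejection moves little between the nominal and the varied samples is less sensitive to that source of uncertainty.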

If this dataset is used in a publication, please cite this dataset record along with the accompanying paper arXiv:2407.20127.

Disclaimer

These open data are released under the Creative Commons Zero v1.0 Universal license.

Neither the experiment (ATLAS) nor CERN endorses any works, scientific or otherwise, produced using these data.

This release has a unique DOI that you are requested to cite in any applications or publications.