ATLAS $t\bar{t}$ simulation for ML-based jet flavour tagging (JetSet)

ATLAS collaboration

Cite as: ATLAS collaboration (2025). ATLAS $t\bar{t}$ simulation for ML-based jet flavour tagging (JetSet). CERN Open Data Portal. DOI:10.7483/OPENDATA.ATLAS.QG8W.TO8P

Dataset Derived Simulated Datascience ATLAS 13.6TeV pp CERN-LHC


Description

Flavour tagging, the task of identifying heavy-flavour jets, is essential for many physics analyses at the ATLAS experiment. This publicly released dataset can be used to train and evaluate machine learning models for jet flavour tagging, as described in arXiv:2505.19689. It aims to broaden interest in, and encourage further development of, innovative machine learning techniques that improve flavour-tagging performance.

The dataset consists of approximately 50 million events from simulated top quark pair production at a centre-of-mass energy of 13.6 TeV. It is stored in HDF5 format and contains structured event-level, jet-level, track-level and truth hadron information. This dataset is designed to be compatible with the flavour-tagging algorithm development pipeline used at ATLAS, and is supported by accompanying instructions and example configurations provided in open-source repositories.
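Because the files are plain HDF5, their layout can be inspected before committing to the full ATLAS pipeline. The sketch below is not taken from the accompanying repositories. It assumes the third-party `h5py` package is installed and deliberately assumes nothing about the group or dataset names inside the file: it walks whatever it finds and reports each dataset's shape and, for structured (compound) datasets, the field names. The helper names `describe` and `summarise_file` are hypothetical.

```python
def describe(node, prefix=""):
    """Recursively list the datasets below an HDF5 group-like node.

    Returns a list of (path, shape, field_names) tuples. A node is treated
    as a dataset if it exposes .shape and .dtype, and as a group otherwise
    (groups must expose .items(), as h5py.File and h5py.Group do).
    """
    rows = []
    for name, child in node.items():
        path = f"{prefix}/{name}"
        if hasattr(child, "shape") and hasattr(child, "dtype"):
            # Structured dtypes carry field names; plain dtypes give None.
            rows.append((path, tuple(child.shape), child.dtype.names))
        else:
            rows.extend(describe(child, path))
    return rows


def summarise_file(filename):
    """Open an HDF5 file read-only and describe every dataset in it."""
    import h5py  # assumption: h5py is installed; imported lazily on purpose

    with h5py.File(filename, "r") as handle:
        return describe(handle)


# Example usage (assumes the smallest file has been downloaded locally):
# for path, shape, fields in summarise_file("mc-flavtag-ttbar-small.h5"):
#     print(path, shape, fields)
```

Starting with the smallest file keeps this exploration quick. The dataset names and fields that the walk prints are the authoritative ones, and the documentation in the accompanying repository explains what they mean.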

To improve usability, the dataset is split into three mutually exclusive HDF5 files:

  • mc-flavtag-ttbar-small.h5 — ~1.36 million events (~5.6 million jets)
  • mc-flavtag-ttbar-medium.h5 — ~6.23 million events (~25.6 million jets)
  • mc-flavtag-ttbar-large.h5 — ~41.1 million events (~168 million jets)

Downloading all three files will provide access to the complete dataset. The smaller subsets are useful for quick exploration or prototyping workflows.

Dataset characteristics

48698675 events. 3 files. 100.4 GiB in total.
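The rounded per-file counts quoted above can be checked against these totals with simple arithmetic. The sketch below uses only numbers stated on this page. The per-file counts are the approximate values from the file list, so small differences from the exact event total are expected, and the variable names are illustrative only.

```python
# Approximate (events, jets) per file, as quoted in the file list.
files = {
    "mc-flavtag-ttbar-small.h5": (1_360_000, 5_600_000),
    "mc-flavtag-ttbar-medium.h5": (6_230_000, 25_600_000),
    "mc-flavtag-ttbar-large.h5": (41_100_000, 168_000_000),
}

EXACT_EVENTS = 48_698_675          # exact total from the dataset characteristics
TOTAL_BYTES = 100.4 * 2**30        # 100.4 GiB expressed in bytes

approx_events = sum(events for events, _ in files.values())
approx_jets = sum(jets for _, jets in files.values())

# Relative gap between the summed rounded counts and the exact total.
relative_gap = abs(approx_events - EXACT_EVENTS) / EXACT_EVENTS

# Average jet multiplicity and average storage per event.
jets_per_event = approx_jets / approx_events
bytes_per_event = TOTAL_BYTES / EXACT_EVENTS

for name, (events, jets) in files.items():
    share = events / approx_events
    print(f"{name}: {share:.1%} of events, {jets / events:.2f} jets/event")

print(f"summed events: {approx_events:,} (gap to exact total: {relative_gap:.3%})")
print(f"overall: {jets_per_event:.2f} jets/event, {bytes_per_event:.0f} bytes/event")
```

The rounded counts sum to within a small fraction of a percent of the exact event total, which is consistent with the three files being mutually exclusive and together forming the complete dataset.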

How can you use these data?

A detailed explanation of this dataset and instructions for pre-processing, training, and evaluation workflows are provided in the accompanying GitLab repository. If you use this dataset in a publication, please cite this dataset record along with the accompanying ATLAS paper describing GN2, an ATLAS flavour-tagging algorithm with a transformer-like architecture.

Transforming Jet Flavour: Documentation and training pipeline

ATLAS-FTAG-2023-05

arXiv:2505.19689

Disclaimer

These open data are released under the Creative Commons Zero v1.0 Universal license.

Neither the experiment(s) (ATLAS) nor CERN endorse any works, scientific or otherwise, produced using these data.

This release has a unique DOI that you are requested to cite in any applications or publications.
