Cite as: Wunsch, Stefan; (2019). Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012. CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.GV20.PR5T
Keywords: Software, Analysis, Workflow, CMS, CERN-LHC
This analysis uses data and simulation of events at the CMS experiment from 2012 with the goal of studying decays of a Higgs boson into two tau leptons in the final state of a muon and a hadronically decaying tau lepton. The analysis loosely follows the setup of the official CMS analysis published in 2014.
The purpose of the original CMS analysis was to establish the existence of the Higgs boson decaying into two tau leptons. Since performing this analysis properly, with full consideration of all systematic uncertainties, is an enormously complex task, we reduce it to a qualitative study of the kinematics and properties of such events without a statistical analysis. However, as you can explore in this record, even such a reduced analysis is complex and requires extensive physics knowledge, which makes this a perfect first look into the procedures required to claim evidence for, or the existence of, a new particle.
Two example results produced by this analysis can be seen below. The plots show the data recorded by the detector compared to the estimation of the contributing processes, which are explained in the following. The analysis implements the visualization of 34 such observables.
Signal process
The physical process of interest, often called the signal, is the production of a Higgs boson that decays into two tau leptons. The main production modes of the Higgs boson are gluon fusion and vector boson fusion, indicated in the plots with the labels $gg\rightarrow H$ and $qq\rightarrow H$, respectively. See below the two Feynman diagrams that describe these processes at leading order.
Tau decay modes
The tau lepton has a very short lifetime of about 290 femtoseconds, after which it decays into other particles. With a probability of about 20% each, the tau lepton decays into a muon or an electron plus two neutrinos. All other decay modes consist of a combination of hadrons, such as pions and kaons, and a single neutrino. You can find a full overview and the exact numbers here. This analysis considers only tau lepton pairs in which one tau lepton decays into a muon and two neutrinos and the other tau lepton decays hadronically, whereas the official CMS analysis considered additional decay channels.
Background processes
Besides the Higgs boson decaying into two tau leptons, many other processes can produce very similar signatures in the detector, which have to be taken into account to draw any conclusions from the data. In the following, the most prominent processes with a signature similar to the signal are presented. Except for the QCD multijet process, the analysis estimates the contribution of the background processes using simulated events.
$Z\rightarrow\tau\tau$: The most prominent background process is the Z boson decaying into two tau leptons. The leading-order production is the so-called Drell-Yan process, in which a quark-antiquark pair annihilates. Because the Z boson, like the Higgs boson, decays directly into two tau leptons, this process is very hard to distinguish from the signal.
$Z\rightarrow ll$: Besides decaying into two tau leptons, the Z boson decays with the same probability into two electrons or two muons. Although this process does not contain any genuine tau lepton, a tau can be reconstructed by mistake; electrons and jets are the objects likely to be misidentified as the hadronic decay of a tau lepton.
$W+\mathrm{jets}$: W bosons are frequently produced at the LHC and can decay into any lepton. If a muon from a W boson is selected together with a tau misidentified from a jet, an event signature similar to the signal can occur. However, this process can be strongly suppressed by a cut in the event selection on the transverse mass built from the muon and the missing transverse energy, as done in the published CMS analysis.
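The transverse mass used in such a requirement is commonly defined from the transverse momentum of the muon, the missing transverse energy, and the azimuthal angle $\Delta\phi$ between them (the exact cut value applied in the event selection is defined in the analysis code):

$m_T = \sqrt{2\,p_T^{\mu}\,E_T^{\mathrm{miss}}\,\left(1 - \cos\Delta\phi\right)}$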
$t\overline{t}$: Top-antitop pairs are produced at the LHC by quark-antiquark annihilation or gluon fusion. Because a top quark decays immediately and almost exclusively into a W boson and a bottom quark, additional misidentifications result in signal-like signatures in the detector, similar to the $W+\mathrm{jets}$ process explained above. However, identifying jets originating from bottom quarks, and subsequently removing such events, reduces this background effectively.
$\mathrm{QCD}$: The QCD multijet background describes events with a large number of jets, which occur very frequently at the LHC. Such events can be falsely selected for the analysis due to misidentifications. Because a proper simulation of this process is complex and computationally expensive, its contribution is not estimated from simulation but from data itself. To this end, we select tau pairs with the same selection as for the signal, but with the changed requirement that both tau lepton candidates have the same charge. Then, all known processes taken from simulation are subtracted from this histogram. Using the resulting histogram as an estimate of the QCD multijet process is possible because the production of misidentified tau lepton candidates is independent of the charge.
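The subtraction described above corresponds to a simple operation on histograms. The following is a minimal sketch in PyROOT; the file and histogram names are placeholders for illustration and are not the ones used in histograms.py.

import ROOT

# Open the file with the histograms of the same-sign control region.
# File and histogram names are placeholders, not the ones used in histograms.py.
f = ROOT.TFile.Open("histograms_same_sign.root")

# Start from the data histogram of the same-sign selection ...
qcd = f.Get("data_ss").Clone("QCD")
qcd.SetDirectory(0)

# ... and subtract all processes that are estimated from simulation.
for name in ["ZTT_ss", "ZLL_ss", "W_ss", "TT_ss"]:
    qcd.Add(f.Get(name), -1.0)

f.Close()

# The remaining counts serve as the estimate of the QCD multijet
# contribution in the opposite-sign signal region.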
Analysis steps
The analysis steps follow the typical workflow of such an analysis at CMS. An overview of these steps is given in the following, and the full details can be found in the analysis code. A detailed technical description of how to run the analysis is given separately in the respective section below.
For completeness, the following points describe the missing steps needed to extract a meaningful statistical result from the data. Please note that these steps typically take up most of the time of a physics analysis and are not included in this example.
How to interpret the results
Please note that a proper analysis of data always requires a thorough study of all uncertainties. Since this analysis does not include any systematic uncertainties, the resulting plots should be interpreted only qualitatively. For a valid physics measurement, the differences between the data and the sum of all processes would have to be explained within the uncertainties. Also note that the counts for the simulated Higgs boson events are scaled up to make the expected signal contribution visible by eye.
The analysis can be run with the following datasets:
GluGluToHToTauTau dataset in reduced NanoAOD format for education and outreach
VBF_HToTauTau dataset in reduced NanoAOD format for education and outreach
DYJetsToLL dataset in reduced NanoAOD format for education and outreach
TTbar dataset in reduced NanoAOD format for education and outreach
W1JetsToLNu dataset in reduced NanoAOD format for education and outreach
W2JetsToLNu dataset in reduced NanoAOD format for education and outreach
W3JetsToLNu dataset in reduced NanoAOD format for education and outreach
Run2012B_TauPlusX dataset in reduced NanoAOD format for education and outreach
Run2012C_TauPlusX dataset in reduced NanoAOD format for education and outreach
Set up your system with the required software and code
The analysis can be run with a plain ROOT installation, either taken from the CMS Open Data VM environment or installed locally; go to root.cern for instructions on how to install the software on your system.
In case your system has a CVMFS installation, or you use the CMS Open Data VM environment, you can set up the needed software by sourcing an LCG software release with the following setup script:
source /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/setup.sh
Note that you may have to replace x86_64-slc6-gcc8-opt with a platform matching your system.
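To verify that the environment provides ROOT together with its Python bindings, a quick sanity check (not part of the analysis itself) is, for example:

python -c 'import ROOT; print(ROOT.gROOT.GetVersion())'

This should print the version of the ROOT installation found in the environment.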
To download the analysis code, you can either use the web browser directly or the following command.
git clone https://github.com/cms-opendata-analyses/HiggsTauTauNanoAODOutreachAnalysis -b 2012
Step 1: Reduce and pre-process the initial datasets
This step is implemented in the file skim.cxx and is written in C++ for performance reasons. To compile the program, run the following command. Note that you may need to change the compiler based on your system.
g++ -g -O3 -o skim skim.cxx $(root-config --cflags --libs)
The compilation produces a binary, which can be executed as follows.
./skim
The initial datasets are retrieved over the network using the XRootD protocol, and due to the large size of the datasets the runtime depends mainly on your network connection. To speed up multiple executions of the analysis, you can download the files and point to these local copies in the code.
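Whether the files are read remotely via XRootD or from local copies, ROOT opens them through the same interface, so switching to local files only requires changing the paths in the code. Below is a minimal sketch in PyROOT to check that a file is accessible; the URL is a placeholder to be replaced by one of the file URLs from the dataset records above or by the path of a local copy, and the tree name Events is the one used by the NanoAOD format.

import ROOT

# Replace the placeholder with one of the file URLs from the dataset records
# above, or with the path of a local copy of the file.
path = "root://<server>//<path>/GluGluToHToTauTau.root"

f = ROOT.TFile.Open(path)
tree = f.Get("Events")  # NanoAOD stores the events in a TTree named "Events"
print("Number of events: %d" % tree.GetEntries())
f.Close()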
Step 2: Produce histograms
The next step is implemented in Python in the file histograms.py. Run the following command to process the previously produced reduced datasets.
python histograms.py
The script produces the file histograms.root, which contains the histograms. You can have a look at the plain histograms with the ROOT browser, opened with the command rootbrowse histograms.root.
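As an alternative to the graphical browser, the contents of histograms.root can also be listed directly from Python. A minimal sketch with PyROOT, assuming only that the file contains plain histograms at the top level as described above:

import ROOT

# Print the name and number of entries of every histogram stored in the file.
f = ROOT.TFile.Open("histograms.root")
for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if obj.InheritsFrom("TH1"):
        print("%s: %.0f entries" % (obj.GetName(), obj.GetEntries()))
f.Close()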
Step 3: Combine histograms to plots
To combine the histograms produced in the previous step into meaningful plots, run the following command.
python plot.py
The Python script generates a png and a pdf image file for each variable, which can be viewed with a program of your choice. The final plots of the analysis are also included in the source code repository.
GNU General Public License (GPL) version 3