Cite as: Wunsch, Stefan; (2019). Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012. CERN Open Data Portal. DOI:10.7483/OPENDATA.CMS.GV20.PR5T
Keywords: Software, Analysis, Workflow, CMS, CERN-LHC
This analysis uses data and simulation of events at the CMS experiment from 2012 with the goal of studying decays of a Higgs boson into two tau leptons in the final state of a muon and a hadronically decaying tau lepton. The analysis loosely follows the setup of the official CMS analysis published in 2014.
The purpose of the original CMS analysis was to establish the existence of the Higgs boson decaying into two tau leptons. Since performing this analysis properly, with full consideration of all systematic uncertainties, is an enormously complex task, we reduce it to a qualitative study of the kinematics and properties of such events without a statistical analysis. However, as you can explore in this record, even such a reduced analysis is complex and requires extensive physics knowledge, which makes this a perfect first look into the procedures required to claim evidence for, or the existence of, a new particle.
Two example results produced by this analysis can be seen below. The plots show the data recorded by the detector compared to the estimation of the contributing processes, which are explained in the following. The analysis implements the visualization of 34 such observables.
Signal process
The physical process of interest, often called the signal, is the production of a Higgs boson that decays into two tau leptons. The main production modes of the Higgs boson are gluon fusion and vector boson fusion, indicated in the plots with the labels $gg\rightarrow H$ and $qq\rightarrow H$, respectively. See below the two Feynman diagrams that describe these processes at leading order.
Tau decay modes
The tau lepton has a very short lifetime of about 290 femtoseconds, after which it decays into other particles. With a probability of about 20% each, the tau lepton decays into a muon or an electron plus two neutrinos. All other decay modes consist of a combination of hadrons, such as pions and kaons, and a single neutrino. You can find a full overview and the exact numbers here. This analysis considers only tau lepton pairs in which one tau lepton decays into a muon and two neutrinos and the other tau lepton decays hadronically, whereas the official CMS analysis considered additional decay channels.
Background processes
Besides the Higgs boson decaying into two tau leptons, many other processes can produce very similar signatures in the detector, which have to be taken into account to draw any conclusions from the data. In the following, the most prominent processes with a signature similar to the signal are presented. Except for the QCD multijet process, the analysis estimates the contribution of the background processes using simulated events.
$Z\rightarrow\tau\tau$: The most prominent background process is the Z boson decaying into two tau leptons. The leading-order production is the so-called Drell-Yan process, in which a quark-antiquark pair annihilates. Because the Z boson, like the Higgs boson, decays directly into two tau leptons, this process is very hard to distinguish from the signal.
$Z\rightarrow ll$: Besides decaying into two tau leptons, the Z boson decays with the same probability into two electrons or two muons. Although this process does not contain any genuine tau lepton, a tau can be reconstructed by mistake; electrons and jets are the objects likely to be misidentified as the hadronic decay of a tau lepton.
$W+\mathrm{jets}$: W bosons are frequently produced at the LHC and can decay into any lepton. If a muon from a W boson is selected together with a tau misidentified from a jet, an event signature similar to the signal can occur. However, this process can be strongly suppressed by a cut in the event selection on the transverse mass built from the muon and the missing transverse energy, as done in the published CMS analysis.
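The transverse mass used in such a requirement is commonly defined from the transverse momentum of the muon, the missing transverse energy, and the azimuthal angle $\Delta\phi$ between them (the exact cut value applied in the event selection is defined in the analysis code):

$m_T = \sqrt{2\,p_T^{\mu}\,E_T^{\mathrm{miss}}\,\left(1 - \cos\Delta\phi\right)}$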
$t\overline{t}$: Top-antitop pairs are produced at the LHC by quark-antiquark annihilation or gluon fusion. Because a top quark decays immediately and almost exclusively into a W boson and a bottom quark, additional misidentifications result in signal-like signatures in the detector, similar to the $W+\mathrm{jets}$ process explained above. However, identifying jets originating from bottom quarks, and subsequently removing such events, reduces this background effectively.
$\mathrm{QCD}$: The QCD multijet background describes events with a large number of jets, which occur very frequently at the LHC. Such events can be falsely selected for the analysis due to misidentifications. Because a proper simulation of this process is complex and computationally expensive, its contribution is not estimated from simulation but from data itself. To this end, we select tau pairs with the same selection as for the signal, but with the changed requirement that both tau lepton candidates have the same charge. Then, all known processes taken from simulation are subtracted from this histogram. Using the resulting histogram as an estimate of the QCD multijet process is possible because the production of misidentified tau lepton candidates is independent of the charge.
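The subtraction described above corresponds to a simple operation on histograms. The following is a minimal sketch in PyROOT; the file and histogram names are placeholders for illustration and are not the ones used in histograms.py.

import ROOT

# Open the file with the histograms of the same-sign control region.
# File and histogram names are placeholders, not the ones used in histograms.py.
f = ROOT.TFile.Open("histograms_same_sign.root")

# Start from the data histogram of the same-sign selection ...
qcd = f.Get("data_ss").Clone("QCD")
qcd.SetDirectory(0)

# ... and subtract all processes that are estimated from simulation.
for name in ["ZTT_ss", "ZLL_ss", "W_ss", "TT_ss"]:
    qcd.Add(f.Get(name), -1.0)

f.Close()

# The remaining counts serve as the estimate of the QCD multijet
# contribution in the opposite-sign signal region.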
Analysis steps
The analysis steps follow the typical workflow of such an analysis at CMS. An overview of these steps is given in the following, and the full details can be found in the analysis code. A detailed technical description of how to run the analysis is given separately in the respective section below.
For completeness, the following points describe the missing steps needed to extract a meaningful statistical result from the data. Please note that these steps typically take up most of the time of a physics analysis and are not included in this example.
How to interpret the results
Please note that a proper analysis of data always requires a thorough study of all uncertainties. Since this analysis does not include any systematic uncertainties, the resulting plots should be interpreted only qualitatively. For a valid physics measurement, the differences between the data and the sum of all processes would have to be explained within the uncertainties. Also note that the counts for the simulated Higgs boson events are scaled up to make the expected signal contribution visible by eye.
The analysis can be run with the following datasets:
GluGluToHToTauTau dataset in reduced NanoAOD format for education and outreach
VBF_HToTauTau dataset in reduced NanoAOD format for education and outreach
DYJetsToLL dataset in reduced NanoAOD format for education and outreach
TTbar dataset in reduced NanoAOD format for education and outreach
W1JetsToLNu dataset in reduced NanoAOD format for education and outreach
W2JetsToLNu dataset in reduced NanoAOD format for education and outreach
W3JetsToLNu dataset in reduced NanoAOD format for education and outreach
Run2012B_TauPlusX dataset in reduced NanoAOD format for education and outreach
Run2012C_TauPlusX dataset in reduced NanoAOD format for education and outreach
Set up your system with the required software and code
The analysis can be run with a plain ROOT installation, either taken from the CMS Open Data VM environment or installed locally; go to root.cern for instructions on how to install the software on your system.
In case your system has a CVMFS installation, or you use the CMS Open Data VM environment, you can set up the needed software by sourcing an LCG software release with the following setup script:
source /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc8-opt/setup.sh
Note that you may have to replace x86_64-slc6-gcc8-opt with a platform matching your system.
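To verify that the environment provides ROOT together with its Python bindings, a quick sanity check (not part of the analysis itself) is, for example:

python -c 'import ROOT; print(ROOT.gROOT.GetVersion())'

This should print the version of the ROOT installation found in the environment.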
To download the analysis code, you can either use the web browser directly or the following command.
git clone https://github.com/cms-opendata-analyses/HiggsTauTauNanoAODOutreachAnalysis -b 2012
Step 1: Reduce and pre-process the initial datasets
This step is implemented in the file skim.cxx and is written in C++ for performance reasons. To compile the program, run the following command. Note that you may need to change the compiler based on your system.
g++ -g -O3 -o skim skim.cxx $(root-config --cflags --libs)
The compilation produces a binary, which can be executed as follows.
./skim
The initial datasets are retrieved over the network using the XRootD protocol, and due to the large size of the datasets the runtime depends mainly on your network connection. To speed up multiple executions of the analysis, you can download the files and point to these local copies in the code.
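Whether the files are read remotely via XRootD or from local copies, ROOT opens them through the same interface, so switching to local files only requires changing the paths in the code. Below is a minimal sketch in PyROOT to check that a file is accessible; the URL is a placeholder to be replaced by one of the file URLs from the dataset records above or by the path of a local copy, and the tree name Events is the one used by the NanoAOD format.

import ROOT

# Replace the placeholder with one of the file URLs from the dataset records
# above, or with the path of a local copy of the file.
path = "root://<server>//<path>/GluGluToHToTauTau.root"

f = ROOT.TFile.Open(path)
tree = f.Get("Events")  # NanoAOD stores the events in a TTree named "Events"
print("Number of events: %d" % tree.GetEntries())
f.Close()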
Step 2: Produce histograms
The next step is implemented in Python in the file histograms.py. Run the following command to process the previously produced reduced datasets.
python histograms.py
The script produces the file histograms.root, which contains the histograms. You can have a look at the plain histograms with the ROOT browser, opened with the command rootbrowse histograms.root.
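As an alternative to the graphical browser, the contents of histograms.root can also be listed directly from Python. A minimal sketch with PyROOT, assuming only that the file contains plain histograms at the top level as described above:

import ROOT

# Print the name and number of entries of every histogram stored in the file.
f = ROOT.TFile.Open("histograms.root")
for key in f.GetListOfKeys():
    obj = key.ReadObj()
    if obj.InheritsFrom("TH1"):
        print("%s: %.0f entries" % (obj.GetName(), obj.GetEntries()))
f.Close()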
Step 3: Combine histograms to plots
To combine the histograms produced in the previous step into meaningful plots, run the following command.
python plot.py
The Python script generates a png and a pdf image file for each variable, which can be viewed with a program of your choice. The final plots of the analysis are also included in the source code repository.
GNU General Public License (GPL) version 3