CMS completes 2010-2011 proton-proton data release

2020-08-27 by CMS Collaboration

The CMS experiment at CERN is happy to announce a new batch of open data from the Large Hadron Collider (LHC), this time encompassing all the data recorded by CMS in 2011. With this release, all proton-proton collision data recorded by CMS in 2010 and 2011 are now in the public domain, free to use and available to anyone. This is the fifth data release by CMS since 2014, and it adds to the open data already available from the LHC’s first data-taking period (“Run 1”), which lasted from 2010 to 2012.

This release consists of 18 new datasets and brings the integrated luminosity, the measure of the amount of collected data, to 5 inverse femtobarns for the 2011 data, in addition to the 11 inverse femtobarns of 2012 data already available. With more data at hand, studies using these public data can now be carried out with increased precision.

CMS encourages the use of these data and eagerly looks forward to the first workshop for users of CMS open data, scheduled for the end of September. The goal of the workshop is to lower the threshold for accessing these data and to make them easier to use. “This first workshop, which will be held virtually due to the current circumstances, will also serve to establish collaborative ties with the external theoretical-physics community and improve our documentation,” says Edgar Carrera, a professor at the Universidad San Francisco de Quito (Ecuador), a member of the CMS Open Data team and the lead organiser of the workshop.

In addition to the virtual machine images, CMS now provides Docker container images with the necessary software pre-installed. These container images are hosted centrally on Docker Hub. Advice on their usage has been added to the guide page on the CERN Open Data portal as well as to the new user forum.

Together with this release, CMS provides an additional example of usage and validation for an earlier release: the special “low luminosity” data measured in the CASTOR calorimeter. As a novelty, its code repository includes a computational workflow description that can be tested automatically using GitHub Actions and can also be executed as a complete workflow in a Kubernetes setup. Hans Van Haevermaet, a postdoctoral researcher at the University of Antwerp (Belgium) who studied the early CASTOR calorimeter data and prepared them for the release, is delighted by these new developments: “Having the possibility to provide not only the open data, but also clear instructions on how to properly use them is very important for data preservation. Thanks to state-of-the-art software tools that facilitate automated workflows it is now even more convenient to perform analyses with open data and explore their physics.”

Other such workflows are available, and some are implemented in REANA, a CERN platform for reusable analysis. We are striving for the best interoperability of the CMS open data, which calls for enabling CMS analysis workflows, initially designed for computing infrastructure specific to high-energy physics (HEP), to run in modern cloud computing environments. “Open Data only really unveils their full power if one knows how to use them, but the computing power demands for a high-energy physics analysis are substantial”, says CERN research physicist Clemens Lange, who is exploring the analysis reproducibility effort within CMS, an effort from which open data users will increasingly benefit in the future. “By having the analysis workflows run using software containers, anyone can make use of them, be it on their institute’s computing cluster or on public cloud compute platforms.”

Also new in this release are enhanced search functions for the simplified datasets widely used in education. This functionality was suggested by the participants of the workshop for teachers on the use of open data, and it makes it easier for educators to navigate the CERN Open Data portal. The datasets for education can now be searched by the number of events or by the contents of the datasets.

All CMS open data are released into the public domain under the Creative Commons CC0 waiver via the CERN Open Data portal. The portal is openly developed on GitHub by the CERN Information Technology team in cooperation with the experimental collaborations. CMS would like to thank CERN for providing resources and expertise to build and maintain the portal. We would also like to acknowledge the continuous effort of many of our collaboration members who have helped us release this latest batch of CMS open data.