QCDGE:
Small Molecule
Big Data
Infinite Truth

QCDGE: Quantum Chemistry Database with Ground- and Excited-State Properties

Driven by rapid advances in deep learning, the demand for large, high-quality databases in chemical research has grown significantly. We developed a quantum-chemistry database of 443,106 small organic molecules containing up to 10 heavy atoms (C, N, O and F). Ground-state geometry optimizations and frequency calculations for all compounds were performed at the B3LYP/6-31G* level with the BJD3 dispersion correction, while excited-state single-point calculations were conducted at the ωB97X-D/6-31G* level.

In total, 27 molecular properties, covering geometric, thermodynamic, electronic and energetic quantities, were gathered from these calculations. We also established a comprehensive protocol for constructing a high-volume quantum-chemistry database. The QCDGE database contains a substantial volume of data, exhibits high chemical diversity and, most importantly, includes excited-state information. This database, along with its construction protocol, is expected to have a significant impact on machine learning applications across different fields of chemistry, especially in excited-state research.

How to access?

If you are downloading over the public internet, please use url.

If you are downloading over the LAN, please use url.

The following files are provided:

  1. final_all.csv: This CSV file summarizes the basic information of all molecules.
  2. final_all.hdf5: All data of the QCDGE database.
  3. A_9.hdf5: A subset of the QCDGE database whose molecules were collected from the QM9 dataset and contain fewer than 10 heavy atoms (C, N, O, F).
  4. A_10.hdf5: A subset of the QCDGE database whose molecules were collected from the GDB-11 database and contain exactly 10 heavy atoms (C, N, O, F).
  5. B_9.hdf5: A subset of the QCDGE database whose molecules were collected from the PubChemQC database and contain fewer than 10 heavy atoms (C, N, O, F).
  6. B_10.hdf5: A subset of the QCDGE database whose molecules were collected from the PubChemQC database and contain exactly 10 heavy atoms (C, N, O, F).
  7. SHA512SUM: The SHA-512 hash strings of all HDF5 files.
  8. extract_data.py: This script allows for extracting molecular properties from the QCDGE database.
  9. README: More details on accessing and retrieving the QCDGE data.
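
After downloading, the HDF5 files can be checked against SHA512SUM before use. The following is a minimal sketch using only the Python standard library. It assumes that SHA512SUM follows the usual `sha512sum` output layout (one `<hex digest>  <filename>` pair per line), which is not confirmed here; the function names are illustrative and are not part of extract_data.py.

```python
import hashlib
from pathlib import Path


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_file(path):
    """Parse lines of the ASSUMED form '<hex digest>  <filename>' into a dict."""
    expected = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # sha512sum marks binary-mode entries with a leading '*'
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_downloads(checksum_file, data_dir="."):
    """Map each listed file to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, digest in parse_checksum_file(checksum_file).items():
        target = Path(data_dir) / name
        if target.is_file():
            results[name] = sha512_of_file(target) == digest
        else:
            results[name] = None
    return results


if __name__ == "__main__":
    if Path("SHA512SUM").is_file():
        labels = {True: "OK", False: "MISMATCH", None: "missing"}
        for name, ok in verify_downloads("SHA512SUM").items():
            print(f"{name}: {labels[ok]}")
    else:
        print("SHA512SUM not found in the current directory")
```

Run from the download directory, the script reports each listed file as OK, MISMATCH or missing. Files are hashed in chunks so that large files such as final_all.hdf5 do not need to fit in memory.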

How to cite?

Our manuscript was published in Sci. Data and can be accessed at url.