Experimental dipole moment prediction with a progressive 2D–3D hybrid framework and twin-pair-based diagnostic evaluation


Uğurlu S. Y., He S.

JOURNAL OF CHEMINFORMATICS, vol.20, no.10, pp.200-252, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 20 Issue: 10
  • Publication Date: 2026
  • DOI: 10.1186/s13321-026-01215-4
  • Journal Name: JOURNAL OF CHEMINFORMATICS
  • Journal Indexes: Academic Search Ultimate (EBSCO), Natural Science Collection (ProQuest), Biological Science Database (ProQuest), Health Research Premium Collection (ProQuest), Scopus, Materials Science & Engineering Collection (ProQuest), Pharma Collection (ProQuest), Technology Collection (ProQuest), Aerospace Database, Science Citation Index Expanded (SCI-EXPANDED), Directory of Open Access Journals
  • Page Numbers: pp.200-252
  • Affiliated with Akdeniz University: Yes

Abstract

Accurate prediction of molecular dipole moments is essential for modeling electrostatic interactions in solvation, molecular recognition, and materials research. Although modern three-dimensional learning models can achieve near-saturated performance on low-noise quantum-chemical benchmarks, their performance often decreases on experimentally compiled datasets, where measurement noise, stereochemical ambiguity, and conformer variability create a substantial gap between idealized theoretical data and practical experimental labels.

In this study, we present a progressive two-dimensional/three-dimensional hybrid framework for experimental dipole-moment prediction under heterogeneous label conditions. The framework was constructed in two stages. First, a strong two-dimensional tabular CatBoost model based on SMILES-derived multi-view fingerprints was enriched with physicochemical, dipole-related, and conformer-dependent three-dimensional shape descriptors. This introduced an initial level of two-dimensional and three-dimensional integration and improved predictive performance compared with the two-dimensional representation alone. Second, the enriched tabular predictor was linearly fused with a geometry-aware three-dimensional graph neural network operating on conformer-derived molecular graphs, providing an additional performance gain through complementary structural learning.
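The linear fusion in the second stage can be illustrated with a minimal sketch. The abstract does not state how the fusion weight is chosen, so the grid search on a validation set below is an assumption, and the function names (`fuse`, `select_weight`) are hypothetical. The inputs are plain lists of predictions from the tabular model and the graph model.

```python
from typing import List, Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fuse(tab_pred: Sequence[float], gnn_pred: Sequence[float], w: float) -> List[float]:
    """Convex linear fusion: w * tabular + (1 - w) * graph model."""
    return [w * a + (1.0 - w) * b for a, b in zip(tab_pred, gnn_pred)]


def select_weight(
    y_val: Sequence[float],
    tab_val: Sequence[float],
    gnn_val: Sequence[float],
    steps: int = 20,
) -> Tuple[float, float]:
    """Assumed selection rule: scan w over [0, 1] and keep the lowest validation MAE."""
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        err = mae(y_val, fuse(tab_val, gnn_val, w))
        if err < best_err:  # strict comparison keeps the first best weight
            best_w, best_err = w, err
    return best_w, best_err


# Toy validation data (invented for illustration, not from the paper).
y_val = [1.0, 2.0, 3.0]
tab_val = [1.2, 2.2, 3.2]   # tabular model overshoots
gnn_val = [0.8, 1.8, 2.8]   # graph model undershoots
w, err = select_weight(y_val, tab_val, gnn_val)
```

In this toy case the two models err in opposite directions, so an intermediate weight beats either model alone, which is the kind of complementarity the fusion step relies on.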

On the experimental test set, the final hybrid framework achieved R² = 0.844 with MAE = 0.399 D, outperforming the nonparametric fingerprint k-nearest-neighbor baseline (R² = 0.65). Its performance also remained close to the similarity-conditioned diagnostic reference range estimated from twin-pair dispersion, with R² values of approximately 0.86–0.91 under Tanimoto similarity thresholds of 0.95, 0.97, 0.98, and 0.99.
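One way to read a twin-pair reference is sketched below. The abstract does not give the exact estimator, so this is an assumption: molecules whose fingerprints reach a Tanimoto threshold are treated as twins, and each twin's label is used as the "prediction" for its partner, which gives an R² that reflects label dispersion among near-identical structures. Fingerprints are represented as sets of on-bit indices, and `twin_pair_reference` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when all targets are equal")
    return 1.0 - ss_res / ss_tot


def twin_pair_reference(
    fps: Sequence[Set[int]], labels: Sequence[float], threshold: float
) -> Optional[float]:
    """Assumed estimator: R^2 obtained when each twin's label predicts its partner's."""
    targets: List[float] = []
    proxies: List[float] = []
    for i, j in combinations(range(len(fps)), 2):
        if tanimoto(fps[i], fps[j]) >= threshold:
            # Count the pair in both directions so the estimate is symmetric.
            targets += [labels[i], labels[j]]
            proxies += [labels[j], labels[i]]
    if not targets:
        return None  # no twin pairs at this threshold
    return r2_score(targets, proxies)


# Toy data (invented): two twin pairs whose labels differ by 0.2 D each.
fps = [{1, 2, 3, 4}, {1, 2, 3, 4}, {10, 11}, {10, 11}]
labels = [1.0, 1.2, 3.0, 3.2]
reference_r2 = twin_pair_reference(fps, labels, threshold=0.95)
```

Raising the threshold leaves fewer but more similar pairs, so a reference computed this way would be expected to vary with the threshold, as the reported range does.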

Scientific Contribution

The main scientific contribution of this study is the development of an experimentally grounded progressive two-dimensional/three-dimensional hybrid learning framework for dipole-moment prediction under heterogeneous experimental labels. The proposed framework integrates enriched tabular chemical representations, geometry-aware graph learning, and twin-pair-based diagnostic evaluation.

To contextualize model performance in this noise-affected experimental setting, twin-pair analysis was combined with a test-to-train similarity-stratified evaluation protocol. In addition, baseline comparisons and component-wise interpretability analyses were used to clarify the roles of representation diversity and learner complementarity.
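A test-to-train similarity-stratified evaluation can be sketched as follows. The bin edges, the use of the maximum Tanimoto similarity to any training molecule, and the per-bin MAE are illustrative assumptions, since the abstract does not specify them. Fingerprints are again sets of on-bit indices, and `stratified_mae` is a hypothetical name.

```python
from typing import Dict, List, Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def stratified_mae(
    test_fps: Sequence[Set[int]],
    train_fps: Sequence[Set[int]],
    y_true: Sequence[float],
    y_pred: Sequence[float],
    edges: Tuple[float, ...] = (0.0, 0.4, 0.7, 1.0),  # assumed bin edges
) -> Dict[Tuple[float, float], float]:
    """MAE per bin of each test molecule's nearest-training-neighbor similarity."""
    errors: Dict[Tuple[float, float], List[float]] = {}
    for fp, t, p in zip(test_fps, y_true, y_pred):
        sim = max(tanimoto(fp, tr) for tr in train_fps)
        for lo, hi in zip(edges, edges[1:]):
            # Bins are half-open [lo, hi); the last bin also includes its upper edge.
            if lo <= sim < hi or (hi == edges[-1] and sim == hi):
                errors.setdefault((lo, hi), []).append(abs(t - p))
                break
    return {k: sum(v) / len(v) for k, v in errors.items()}


# Toy data (invented): one training molecule, three test molecules.
train_fps = [{1, 2, 3, 4}]
test_fps = [{1, 2, 3, 4}, {1, 2, 9, 10}, {7, 8}]
y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 2.5, 2.0]
per_bin = stratified_mae(test_fps, train_fps, y_true, y_pred)
```

Reporting the error per similarity bin separates performance on near-duplicates of training molecules from performance on structurally novel ones, which a single pooled metric would hide.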

Relative to three recently published dipole-prediction studies, the proposed framework remains competitively positioned. However, these comparisons should be interpreted cautiously because the underlying datasets, label sources, molecular domains, and noise regimes are not directly matched. Unlike previous studies that mainly focused on lower-noise theoretical benchmarks or narrower molecular domains, the present work targets the comparatively underexplored Stenutz experimental compilation.