JOURNAL OF CHEMINFORMATICS, vol. 20, no. 10, pp. 200-252, 2026 (SCI-Expanded, Scopus)
Accurate prediction of molecular dipole moments is essential for modeling electrostatic interactions in solvation, molecular recognition, and materials research. Although modern three-dimensional learning models can achieve near-saturated accuracy on low-noise quantum-chemical benchmarks, their accuracy often degrades on experimentally compiled datasets, where measurement noise, stereochemical ambiguity, and conformer variability create a substantial gap between idealized theoretical data and practical experimental labels.
In this study, we present a progressive two-dimensional/three-dimensional hybrid framework for experimental dipole-moment prediction under heterogeneous label conditions. The framework was constructed in two stages. First, a strong two-dimensional tabular CatBoost model based on SMILES-derived multi-view fingerprints was enriched with physicochemical, dipole-related, and conformer-dependent three-dimensional shape descriptors. This introduced an initial level of two-dimensional and three-dimensional integration and improved predictive performance compared with the two-dimensional representation alone. Second, the enriched tabular predictor was linearly fused with a geometry-aware three-dimensional graph neural network operating on conformer-derived molecular graphs, providing an additional performance gain through complementary structural learning.
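The second-stage fusion can be illustrated with a minimal sketch. The abstract states only that the two predictors are "linearly fused", so the form used here (a single convex weight `w` chosen by grid search on a validation set) is an assumption, and the names `fuse` and `pick_weight` and all numbers are hypothetical.

```python
from typing import List, Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fuse(tab_pred: Sequence[float], gnn_pred: Sequence[float], w: float) -> List[float]:
    """Convex linear fusion: w * tabular + (1 - w) * GNN (assumed form)."""
    return [w * a + (1.0 - w) * b for a, b in zip(tab_pred, gnn_pred)]


def pick_weight(y_val: Sequence[float], tab_val: Sequence[float],
                gnn_val: Sequence[float], steps: int = 20) -> float:
    """Grid-search w in [0, 1] to minimize validation MAE."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda w: mae(y_val, fuse(tab_val, gnn_val, w)))


# Toy validation data, made up for illustration (not from the paper).
y_val = [1.0, 2.0, 3.0]
tab_val = [1.2, 2.2, 3.2]  # hypothetical tabular predictions, biased high
gnn_val = [0.8, 1.8, 2.8]  # hypothetical GNN predictions, biased low

w_best = pick_weight(y_val, tab_val, gnn_val)
fused = fuse(tab_val, gnn_val, w_best)
```

In this toy case the two models err in opposite directions, so the blend beats either model alone. This is the kind of complementarity a linear fusion can exploit.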
On the experimental test set, the final hybrid framework achieved R² = 0.844 and MAE = 0.399 D, outperforming the nonparametric fingerprint k-nearest-neighbor baseline (R² = 0.65). Its performance also remained close to the similarity-conditioned diagnostic reference range estimated from twin-pair dispersion, which gave R² values of approximately 0.86–0.91 at Tanimoto similarity thresholds of 0.95, 0.97, 0.98, and 0.99.
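For reference, the reported metrics and a fingerprint k-nearest-neighbor baseline can be sketched as follows. This is an illustrative reconstruction, not the authors' code: fingerprints are represented as sets of on-bit indices, and the choice of `k`, the unweighted averaging, and the function names are all assumptions.

```python
from typing import FrozenSet, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def knn_predict(query: FrozenSet[int], train_fps: Sequence[FrozenSet[int]],
                train_y: Sequence[float], k: int = 3) -> float:
    """Average the labels of the k most similar training molecules (assumed rule)."""
    ranked = sorted(range(len(train_fps)),
                    key=lambda i: tanimoto(query, train_fps[i]), reverse=True)
    top = ranked[:k]
    return sum(train_y[i] for i in top) / len(top)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, in the same units as the labels (debye here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data, made up for illustration (not from the paper).
train_fps = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({7, 8, 9})]
train_y = [1.0, 3.0, 10.0]
pred = knn_predict(frozenset({1, 2, 3}), train_fps, train_y, k=2)
```

Because such a baseline has no learned parameters, its R² mainly reflects how far nearest-neighbor similarity in fingerprint space alone can carry the prediction.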
Scientific Contribution
The main scientific contribution of this study is the development of an experimentally grounded progressive two-dimensional/three-dimensional hybrid learning framework for dipole-moment prediction under heterogeneous experimental labels. The proposed framework integrates enriched tabular chemical representations, geometry-aware graph learning, and twin-pair-based diagnostic evaluation.
To contextualize model performance in this noise-affected experimental setting, twin-pair analysis was combined with a test-to-train similarity-stratified evaluation protocol. In addition, baseline comparisons and component-wise interpretability analyses were used to clarify the roles of representation diversity and learner complementarity.
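One plausible way to formalize the two diagnostics is sketched below. The abstract does not give the exact estimators, so both functions are assumptions. The first treats the two labels in a near-duplicate ("twin") pair as independent noisy readings of roughly the same true value, so that half the mean squared label difference estimates the noise variance. The second bins each test molecule by its maximum Tanimoto similarity to the training set, using hypothetical bin edges.

```python
from typing import FrozenSet, List, Optional, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def twin_pair_r2_reference(labels: Sequence[float], fps: Sequence[FrozenSet[int]],
                           threshold: float) -> Optional[float]:
    """Noise-limited R^2 reference: 1 - sigma_noise^2 / Var(y) (assumed estimator)."""
    n = len(labels)
    sq_diffs = [(labels[i] - labels[j]) ** 2
                for i in range(n) for j in range(i + 1, n)
                if tanimoto(fps[i], fps[j]) >= threshold]
    if not sq_diffs:
        return None  # no twin pairs at this threshold
    # E[(y_i - y_j)^2] is about 2 * sigma_noise^2 for independent noise.
    noise_var = sum(sq_diffs) / (2 * len(sq_diffs))
    mean_y = sum(labels) / n
    total_var = sum((y - mean_y) ** 2 for y in labels) / n
    if total_var == 0:
        return None  # undefined when all labels are identical
    return 1.0 - noise_var / total_var


def stratify_by_train_similarity(test_fps: Sequence[FrozenSet[int]],
                                 train_fps: Sequence[FrozenSet[int]],
                                 edges: Sequence[float] = (0.4, 0.7)) -> List[int]:
    """Bin index per test molecule: number of edges its nearest-train similarity reaches."""
    bins = []
    for q in test_fps:
        nearest = max(tanimoto(q, t) for t in train_fps)
        bins.append(sum(nearest >= e for e in edges))
    return bins


# Toy data, made up for illustration (not from the paper).
fps = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 4}),
       frozenset({5, 6}), frozenset({7, 8})]
labels = [1.0, 1.2, 3.0, 5.0]
reference = twin_pair_r2_reference(labels, fps, threshold=0.95)
bins = stratify_by_train_similarity(
    [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({9})],
    [frozenset({1, 2, 3, 4})])
```

Metrics such as R² and MAE can then be reported per bin, which separates near-duplicate test cases from genuinely novel ones. Under the stated noise assumption, the twin-pair reference indicates how much of the remaining error could be attributed to label noise rather than to the model.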
Relative to three recently published dipole-prediction studies, the proposed framework remains competitively positioned. However, these comparisons should be interpreted cautiously because the underlying datasets, label sources, molecular domains, and noise regimes are not directly matched. Unlike previous studies that mainly focused on lower-noise theoretical benchmarks or narrower molecular domains, the present work targets the comparatively underexplored Stenutz experimental compilation.