PDBMINER: RANK AGGREGATION-BASED SELECTION OF QUALIFIED PROTEIN DATA BANK STRUCTURES

Uğurlu S. Y.

XII. International Health, Engineering and Sciences Congress, Tashkent, Uzbekistan, 10-12 April 2026, pp. 326-340 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
City of Publication: Tashkent
Country of Publication: Uzbekistan
Pages: pp. 326-340
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Akdeniz University: Yes

Abstract

The selection of appropriate Protein Data Bank (PDB) structures is essential for structure-based computational studies such as molecular docking, virtual screening, and molecular dynamics simulations. The quality of the selected structure significantly impacts the reliability and reproducibility of computational outcomes. Researchers typically focus on a limited set of criteria, particularly resolution, for structure selection. However, structural quality encompasses multiple parameters, including crystallographic resolution, R-free, R-observed, R-work, data completeness, mutation status, and structural age. This complexity implies that a PDB entry may excel in one aspect but be inadequate in another, making PDB selection a multi-criteria decision problem.

To address this challenge, we developed PDBminer, an automated pipeline designed to systematically identify and prioritize high-quality PDB structures associated with a given UniProt accession. The method retrieves all related PDB entries through the UniProt REST API, extracts structural and experimental quality metrics directly from the PDB files, and applies configurable quality filters. Multiple structural criteria (resolution, R-free, R-observed, R-work, data completeness, B-factor statistics, mutation status, and structural age) are then integrated using a rank aggregation framework to produce a single unified ranking of candidate structures. An optional redundancy filtering step removes highly overlapping entries based on UniProt residue coverage, so that only representative structures are retained.
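The ranking and redundancy steps described above can be illustrated with a minimal sketch. The abstract does not state which aggregation rule PDBminer uses, so the mean-rank (Borda-style) rule here is an assumption chosen for simplicity. The criterion names, the direction of each criterion, the `max_overlap` threshold, the entry identifiers, and all metric values are likewise hypothetical placeholders, not data from the paper.

```python
from statistics import mean

# Hypothetical criteria: name -> True if lower values are better (assumption).
CRITERIA = {
    "resolution": True,     # in angstroms, lower is better
    "r_free": True,         # lower is better
    "r_work": True,         # lower is better
    "completeness": False,  # in percent, higher is better
    "year": False,          # newer structures preferred (assumption)
}

def rank_by(entries, key, lower_is_better):
    """Return {pdb_id: rank} with 1 = best; tied values share their average rank."""
    ordered = sorted(entries, key=lambda e: e[key], reverse=not lower_is_better)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][key] == ordered[i][key]:
            j += 1
        average_rank = (i + 1 + j + 1) / 2
        for entry in ordered[i:j + 1]:
            ranks[entry["pdb_id"]] = average_rank
        i = j + 1
    return ranks

def aggregate_ranks(entries, criteria=CRITERIA):
    """Mean-rank aggregation: average each entry's rank over all criteria."""
    per_criterion = [rank_by(entries, k, low) for k, low in criteria.items()]
    scores = {e["pdb_id"]: mean(r[e["pdb_id"]] for r in per_criterion)
              for e in entries}
    # Lower mean rank is better; the identifier breaks ties deterministically.
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))

def drop_redundant(ranked_ids, coverage, max_overlap=0.9):
    """Greedy filter: walk the ranking and skip an entry whose covered
    residues overlap an already kept entry by more than max_overlap."""
    kept = []
    for pdb_id in ranked_ids:
        residues = coverage[pdb_id]
        redundant = False
        for other in kept:
            smaller = min(len(residues), len(coverage[other])) or 1
            if len(residues & coverage[other]) / smaller > max_overlap:
                redundant = True
                break
        if not redundant:
            kept.append(pdb_id)
    return kept

# Made-up entries for illustration only.
entries = [
    {"pdb_id": "entry_a", "resolution": 1.8, "r_free": 0.22, "r_work": 0.18,
     "completeness": 99.0, "year": 2015},
    {"pdb_id": "entry_b", "resolution": 1.5, "r_free": 0.25, "r_work": 0.21,
     "completeness": 95.0, "year": 2020},
    {"pdb_id": "entry_c", "resolution": 2.6, "r_free": 0.28, "r_work": 0.24,
     "completeness": 90.0, "year": 2008},
]
coverage = {
    "entry_a": set(range(1, 101)),    # residues 1-100
    "entry_b": set(range(1, 96)),     # residues 1-95, fully inside entry_a
    "entry_c": set(range(200, 301)),  # residues 200-300, a different region
}

ranked = aggregate_ranks(entries)
kept = drop_redundant([pdb_id for pdb_id, _ in ranked], coverage)
```

In this toy input, `entry_b` has the best resolution but `entry_a` wins on most other criteria, which is the kind of trade-off that motivates aggregating ranks rather than sorting on one metric. Working with ranks instead of raw values also avoids having to put angstroms, percentages, and years on a common scale.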

Widely studied proteins such as p53 (UniProt: P04637) and the SARS-CoV-2 replicase polyprotein (UniProt: P0DTD1) have 286 and 2422 associated PDB structures, respectively. Even when only a handful of structural metrics are considered (resolution, completeness, R-free, R-observed, R-work, mutation status, and sequence coverage), manually collecting and comparing these data across hundreds to thousands of entries can require days of effort. Moreover, identifying optimal structures means solving a multi-criteria optimization problem over these seven structural quality features, which may take weeks of manual analysis and still fail to guarantee a systematic and unbiased selection. In contrast, PDBminer automatically evaluates all available structures and produces a ranked list of candidate PDB entries within minutes.

The final output is a ranked dataset exported as a structured report containing structural metrics, ligand information, and quality indicators for each candidate PDB entry. By combining automated data retrieval, multi-criteria evaluation, and rank aggregation, PDBminer provides a robust and reproducible strategy for selecting optimal PDB structures for structural bioinformatics and computational drug discovery studies. The framework enables systematic prioritization of reliable protein models and serves as a decision-support tool for structure-based computational research.