What is it about?
We manually curated the structure-based data in a publicly available physicochemical property dataset. Building on that experience, we developed an automated procedure in KNIME to process multiple other datasets. We then built QSAR prediction models and examined how data curation influenced their statistical performance.
Why is it important?
Data quality matters. For the development of QSAR prediction models, this paper shows how data curation influences the statistical performance of the resulting models, and why the upfront investment in checking and validating the data is worthwhile. This work focused only on the chemical structures, not the actual property values, and even this made a measurable difference to the algorithmic performance.
Read the Original
This page is a summary of: An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling, SAR and QSAR in Environmental Research, November 2016, Taylor & Francis,
DOI: 10.1080/1062936x.2016.1253611.
Resources
The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists
This is a poster delivered at the Fall ACS conference in Philadelphia regarding the process and impact of data curation
An examination of data quality on QSAR Modeling in regards to the environmental sciences
Presentation given at UNC Chapel Hill
PHYSPROP Curated training and test sets etc.
The PHYSPROP analysis folder contains the training set, test set, KNIME workflow and other related data for the paper.