Biological Information in SARs
Following up on our work on small molecule data, we extended the paradigm of structure-activity relationships (SARs) of non-congeneric compounds to integrate biological data. In contrast to previous work on the Human Tumor Cell Line Screening data from the National Cancer Institute (NCI), we propose a full-fledged predictive model of growth inhibition that is applicable to new compounds and new cell lines. The model predicts a baseline activity from a chemical structure, which is then modulated by biological information. The prediction task is technically challenging because the data are multi-relational, high-dimensional (gene expression), and graph-structured (chemical structure). We show that including biological information leads to a statistically significant improvement in prediction quality. The data used in the paper are available online.
Publication:
[RRK06] Richter, L, Rückert, U, and Kramer, S (2006). Learning a Predictive Model for Growth Inhibition from the NCI DTP Human Tumor Cell Line Screening Data: Does Gene Expression Make a Difference? In: Pacific Symposium on Biocomputing, vol. 11, pp. 596-607.
Data:
Available Datasets:
- The compound file from the DTP Human Tumor Cell Line Screen project we used for our analysis: CANA00SD.txt.gz
- The indices for the training set compounds: traincompoundNumbers.txt
- The indices for the test set compounds: testcompoundNumbers.txt
- The training set cell lines: trainCell.txt
- The test set cell lines: testCell.txt
- The cell lines that each represent one tissue: trainTissue.txt
- Gene expression of the selected cell lines according to Scherf (1375 genes), with missing values interpolated: expression.csv.scherf.staffed
- Our filtered version of gi50_sep03.txt.gz: gi50_optimal.txt.gz
- Pearson correlation coefficients between all cell lines used and the training cell lines: trainCorr.csv
- Pearson correlation coefficients between all cell lines used and the cell lines that each represent one tissue: trainTissueCorr.csv
Please cite [RRK06] if you are using the data in a paper.
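For readers who want to see what a correlation table such as trainCorr.csv might contain, the following is a minimal sketch of computing Pearson correlation coefficients between the expression profiles of all cell lines and those of a reference subset. It is an illustration under assumptions, not the procedure used to produce the files above: the function name, the array layout (rows are cell lines, columns are genes), and the random stand-in data are all hypothetical, and the actual file formats are not described here.

```python
import numpy as np


def pearson_against_reference(expr_all, expr_ref):
    """Pearson correlation of every row in expr_all with every row in expr_ref.

    Assumed layout (hypothetical): rows are cell lines, columns are genes.
    Returns an array of shape (n_all, n_ref).
    """
    a = np.asarray(expr_all, dtype=float)
    r = np.asarray(expr_ref, dtype=float)
    # Center each expression profile on its own mean.
    a_c = a - a.mean(axis=1, keepdims=True)
    r_c = r - r.mean(axis=1, keepdims=True)
    # Scale each centered profile to unit length.
    a_n = a_c / np.linalg.norm(a_c, axis=1, keepdims=True)
    r_n = r_c / np.linalg.norm(r_c, axis=1, keepdims=True)
    # The dot product of centered unit vectors is the Pearson coefficient.
    return a_n @ r_n.T


# Random stand-in data: 6 cell lines with 1375 genes each (not the real data).
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 1375))
train_idx = [0, 1, 2, 3]  # hypothetical positions of the training cell lines

corr = pearson_against_reference(expr, expr[train_idx])
print(corr.shape)
```

Each row of `corr` describes one cell line by its similarity to the reference cell lines, which is one plausible way to represent a new cell line in terms of known ones.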
