TY - JOUR
T1 - A Turing test for artificial expression data
AU - Maier, Robert
AU - Zimmer, Ralf
AU - Küffner, Robert
PY - 2013/10/15
Y1 - 2013/10/15
N2 - Motivation: The lack of reliable, comprehensive gold standards complicates the development of many bioinformatics tools, particularly for the analysis of expression data and biological networks. Simulation approaches can provide provisional gold standards, such as regulatory networks, for the assessment of network inference methods. However, this just defers the problem, as it is difficult to assess how closely simulators emulate the properties of real data. Results: In analogy to Turing's test discriminating humans and computers based on responses to questions, we systematically compare real and artificial systems based on their gene expression output. Different expression data analysis techniques such as clustering are applied to both types of datasets. We define and extract distributions of properties from the results, for instance, distributions of cluster quality measures or transcription factor activity patterns. Distributions of properties are represented as histograms to enable the comparison of artificial and real datasets. We examine three frequently used simulators that generate expression data from parameterized regulatory networks. We identify features distinguishing real from artificial datasets that suggest how simulators could be adapted to better emulate real datasets and, thus, become more suitable for the evaluation of data analysis tools. Availability: See http://www2.bio.ifi.lmu.de/~kueffner/attfad/ and the supplement for precomputed analyses; other compendia can be analyzed via the CRAN package attfad. The full datasets can be obtained from http://www2.bio.ifi. lmu.de/~kueffner/attfad/data.tar.gz. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
AB - Motivation: The lack of reliable, comprehensive gold standards complicates the development of many bioinformatics tools, particularly for the analysis of expression data and biological networks. Simulation approaches can provide provisional gold standards, such as regulatory networks, for the assessment of network inference methods. However, this just defers the problem, as it is difficult to assess how closely simulators emulate the properties of real data. Results: In analogy to Turing's test discriminating humans and computers based on responses to questions, we systematically compare real and artificial systems based on their gene expression output. Different expression data analysis techniques such as clustering are applied to both types of datasets. We define and extract distributions of properties from the results, for instance, distributions of cluster quality measures or transcription factor activity patterns. Distributions of properties are represented as histograms to enable the comparison of artificial and real datasets. We examine three frequently used simulators that generate expression data from parameterized regulatory networks. We identify features distinguishing real from artificial datasets that suggest how simulators could be adapted to better emulate real datasets and, thus, become more suitable for the evaluation of data analysis tools. Availability: See http://www2.bio.ifi.lmu.de/~kueffner/attfad/ and the supplement for precomputed analyses; other compendia can be analyzed via the CRAN package attfad. The full datasets can be obtained from http://www2.bio.ifi. lmu.de/~kueffner/attfad/data.tar.gz. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
UR - https://www.scopus.com/pages/publications/84885653547
U2 - 10.1093/bioinformatics/btt438
DO - 10.1093/bioinformatics/btt438
M3 - Article
C2 - 23956305
AN - SCOPUS:84885653547
SN - 1367-4803
VL - 29
SP - 2603
EP - 2609
JO - Bioinformatics
JF - Bioinformatics
IS - 20
ER -