A Machine Learning Approach to Test Data Generation

A Case Study in Evaluation of Gene Finders

Henning Christiansen, Christina Mackeprang Dahmcke

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Abstract

Programs for gene prediction in computational biology are examples of systems for which authentic test data are difficult to acquire, as such data require years of extensive research. This has led to test methods based on semi-artificially produced test data, often produced by ad hoc techniques complemented by statistical models such as Hidden Markov Models (HMMs). The quality of such a test method depends on how well the test data reflect the regularities in known data and how well they generalize these regularities. So far, only very simplified and generalized artificial data sets have been tested, and a more thorough statistical foundation is required.

We propose using logic-statistical modelling methods from machine learning to analyze existing, manually marked-up data, integrated with the generation of new, artificial data. More specifically, we suggest using the PRISM system developed by Sato and Kameya. Based on logic programming extended with random variables and parameter learning, PRISM appears to be a powerful modelling environment that subsumes HMMs and a wide range of other methods, all embedded in a declarative language. We illustrate these principles here by showing parts of a model under development for genetic sequences, and we outline initial experiments that produce test data for evaluating existing gene finders, exemplified by GENSCAN, HMMGene and genemark.hmm.
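
As a rough illustration of the idea of generating artificial test data from a statistical model, the following Python sketch samples a DNA sequence together with its "true" annotation from a two-state HMM. It is not the authors' PRISM model: the state names, the transition and emission probabilities, and the function names are all hypothetical, chosen only to show how a generative model yields sequences whose correct annotation is known by construction. In the approach described above, such probabilities would be learned from marked-up data rather than fixed by hand.

```python
import random

# Hypothetical toy parameters, chosen only for illustration.
# A learned model would estimate these from annotated sequences.
TRANSITIONS = {
    "intergenic": {"intergenic": 0.95, "coding": 0.05},
    "coding": {"coding": 0.90, "intergenic": 0.10},
}
EMISSIONS = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "coding": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}


def draw(rng, distribution):
    """Sample one outcome from a {outcome: probability} mapping."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_annotated_sequence(length, seed=None):
    """Return (sequence, labels): a DNA string and the hidden state
    that emitted each base, which serves as the reference annotation."""
    rng = random.Random(seed)
    state = "intergenic"  # assumed start state
    bases, labels = [], []
    for _ in range(length):
        bases.append(draw(rng, EMISSIONS[state]))
        labels.append(state)
        state = draw(rng, TRANSITIONS[state])
    return "".join(bases), labels


if __name__ == "__main__":
    seq, labels = generate_annotated_sequence(60, seed=1)
    print(seq)
    # Mark coding positions with "C" and intergenic positions with "-".
    print("".join("C" if s == "coding" else "-" for s in labels))
```

A gene finder could then be run on the generated sequence and its predictions compared position by position with the returned labels.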


Original language: English
Title of host publication: Proc. International Conference on Machine Learning and Data Mining MLDM'2007: Lecture Notes in Artificial Intelligence
Number of pages: 15
Volume: 4571
Publisher: Springer
Publication date: 2007
Pages: 741-755
ISBN (Print): 978-3-540-73498-7
Publication status: Published - 2007
Event: International Conference on Machine Learning and Data Mining MLDM'2007 - Leipzig, Germany
Duration: 18 Jul 2007 to 20 Jul 2007

Conference

Conference: International Conference on Machine Learning and Data Mining MLDM'2007
Country: Germany
City: Leipzig
Period: 18/07/2007 to 20/07/2007
Series: Lecture Notes in Artificial Intelligence
Number: 4571
ISSN: 0302-9743

Keywords

  • bioinformatics
  • sequence analyses
  • software testing
  • machine learning

Cite this

Christiansen, H., & Dahmcke, C. M. (2007). A Machine Learning Approach to Test Data Generation: A Case Study in Evaluation of Gene Finders. In Proc. International Conference on Machine Learning and Data Mining MLDM'2007: Lecture Notes in Artificial Intelligence (Vol. 4571, pp. 741-755). Springer. Lecture notes in artificial intelligence, No. 4571
@inproceedings{adb48880fd3711db8d23000ea68e967b,
title = "A Machine Learning Approach to Test Data Generation: A Case Study in Evaluation of Gene Finders",
abstract = "Programs for gene prediction in computational biology are examples of systems for which authentic test data are difficult to acquire, as such data require years of extensive research. This has led to test methods based on semi-artificially produced test data, often produced by {\em ad hoc} techniques complemented by statistical models such as Hidden Markov Models (HMMs). The quality of such a test method depends on how well the test data reflect the regularities in known data and how well they generalize these regularities. So far, only very simplified and generalized artificial data sets have been tested, and a more thorough statistical foundation is required. We propose using logic-statistical modelling methods from machine learning to analyze existing, manually marked-up data, integrated with the generation of new, artificial data. More specifically, we suggest using the PRISM system developed by Sato and Kameya. Based on logic programming extended with random variables and parameter learning, PRISM appears to be a powerful modelling environment that subsumes HMMs and a wide range of other methods, all embedded in a declarative language. We illustrate these principles here by showing parts of a model under development for genetic sequences, and we outline initial experiments that produce test data for evaluating existing gene finders, exemplified by GENSCAN, HMMGene and genemark.hmm.",
keywords = "bioinformatik, sekvensanalyse, softwaretest, maskinindl{\ae}ring, bioinformatics, sequence analyses, software testing, machine learning",
author = "Henning Christiansen and Dahmcke, {Christina Mackeprang}",
year = "2007",
language = "English",
isbn = "978-3-540-73498-7",
volume = "4571",
pages = "741--755",
booktitle = "Proc. International Conference on Machine Learning and Data Mining MLDM'2007",
publisher = "Springer",

}


TY - GEN
T1 - A Machine Learning Approach to Test Data Generation
T2 - A Case Study in Evaluation of Gene Finders
AU - Christiansen, Henning
AU - Dahmcke, Christina Mackeprang
PY - 2007
Y1 - 2007
AB - Programs for gene prediction in computational biology are examples of systems for which authentic test data are difficult to acquire, as such data require years of extensive research. This has led to test methods based on semi-artificially produced test data, often produced by ad hoc techniques complemented by statistical models such as Hidden Markov Models (HMMs). The quality of such a test method depends on how well the test data reflect the regularities in known data and how well they generalize these regularities. So far, only very simplified and generalized artificial data sets have been tested, and a more thorough statistical foundation is required. We propose using logic-statistical modelling methods from machine learning to analyze existing, manually marked-up data, integrated with the generation of new, artificial data. More specifically, we suggest using the PRISM system developed by Sato and Kameya. Based on logic programming extended with random variables and parameter learning, PRISM appears to be a powerful modelling environment that subsumes HMMs and a wide range of other methods, all embedded in a declarative language. We illustrate these principles here by showing parts of a model under development for genetic sequences, and we outline initial experiments that produce test data for evaluating existing gene finders, exemplified by GENSCAN, HMMGene and genemark.hmm.
KW - bioinformatik
KW - sekvensanalyse
KW - softwaretest
KW - maskinindlæring
KW - bioinformatics
KW - sequence analyses
KW - software testing
KW - machine learning
M3 - Article in proceedings
SN - 978-3-540-73498-7
VL - 4571
SP - 741
EP - 755
BT - Proc. International Conference on Machine Learning and Data Mining MLDM'2007
PB - Springer
ER -
