An Analysis of Data Sets Used to Train and Validate Cost Prediction Systems

Carolyn Mair, Martin Shepperd, Magne Jorgensen

Research output: Contribution to journal (Article)



OBJECTIVE - the aim of this investigation is to build up a picture of the nature and type of data sets being used to develop and evaluate different software project effort prediction systems. We believe this to be important since there is a growing body of published work that seeks to assess different prediction approaches. Unfortunately, results to date are rather inconsistent, so we are interested in the extent to which this might be explained by the use of different data sets. METHOD - we performed an exhaustive search, from 1980 onwards, of three software engineering journals for research papers that used project data sets to compare cost prediction systems. RESULTS - this identified a total of 50 papers that used, one or more times, a total of 74 unique project data sets. We observed that some of the better known and publicly accessible data sets were used repeatedly, making them potentially disproportionately influential. Such data sets also tend to be amongst the oldest, with potential problems of obsolescence. We also note that only about 70 [...] extracting relevant information from research papers has been time consuming due to different styles of presentation and levels of contextual information. CONCLUSIONS - we believe there are two lessons to learn. First, the community needs to consider the quality and appropriateness of the data set being utilised; not all data sets are equal. Second, we need to assess the way results are presented in order to facilitate meta-analysis, and whether a standard protocol would be appropriate.
Original language: English
Pages (from-to): 1-6
Number of pages: 6
Journal: ACM SIGSOFT Software Engineering Notes
Issue number: 4
Publication status: Published - 2006
Externally published: Yes

