Junwei Li, University of Louisiana at Lafayette
Yun Yang, University of Louisiana at Lafayette
One prevalent method for evaluating the results of automated software analysis tools is to compare the tools' output to the judgment of human experts. This evaluation strategy is commonly assumed in the field of software clone detector research. We report our experiences from a study in which several human judges tried to establish "reference sets" of function clones for several medium-sized software systems written in C. The study followed a process typical for inter-coder reliability assurance, wherein coders discuss classification discrepancies until consensus is reached. A high level of disagreement was found for reference sets made specifically for reengineering task contexts. The results, although preliminary, raise questions about limitations of prior clone detector evaluations and other similar tool evaluations. Implications are drawn for future work on reference data generation, tool evaluations, and benchmarking efforts.
Citation:
Andrew Walenstein, Nitin Jyoti, Junwei Li, Yun Yang, Arun Lakhotia, "Problems Creating Task-relevant Clone Detection Reference Data," 10th Working Conference on Reverse Engineering (WCRE 2003), p. 285, 2003