In this paper we study statistically sound ways of comparing classifiers in the absence of fully reliable reference data. Building on previously published partial frameworks, we explore a more comprehensive approach to comparing and ranking classifiers that is robust to incomplete, erroneous or missing reference evaluation data. We show that a generalized McNemar’s test gives reliable confidence measures for the ranking of two classifiers, under the assumption that a better-than-random reference classifier exists. We extend its use to cases where its traditional formulation is notoriously unstable. We also provide a computational framework that allows it to be applied to large amounts of data. Our classifier evaluation model is generic and applies to any set of binary classifiers. We have tested and validated it specifically on synthetic data and on real data from document image binarization.
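For orientation, the following is a minimal sketch of the classical (not the generalized) McNemar's test for two binary classifiers scored against a reference labelling. It is not the method of the paper: the function names `discordant_counts` and `mcnemar_p_value`, the `exact_below` threshold of 25, and the switch to an exact binomial test for small discordant counts are illustrative assumptions. The small-count regime is where the chi-square approximation is commonly regarded as unreliable.

```python
import math


def discordant_counts(pred_a, pred_b, reference):
    """Count samples on which exactly one classifier agrees with the reference.

    Returns (b, c): b = A agrees and B disagrees, c = A disagrees and B agrees.
    All three inputs are equal-length sequences of binary labels.
    """
    b = c = 0
    for a, other, ref in zip(pred_a, pred_b, reference):
        if a == ref and other != ref:
            b += 1
        elif a != ref and other == ref:
            c += 1
    return b, c


def mcnemar_p_value(b, c, exact_below=25):
    """Two-sided p-value for the null hypothesis that A and B err equally often.

    exact_below is a hypothetical threshold (an assumption of this sketch):
    below it, an exact binomial test replaces the chi-square approximation.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence either way
    if n < exact_below:
        # Exact test: under the null, b ~ Binomial(n, 0.5).
        tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
        return min(1.0, 2 * tail)
    # Chi-square statistic with continuity correction, 1 degree of freedom.
    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))
```

A small p-value indicates that the two classifiers disagree with the reference at different rates; the sign of `b - c` then indicates which of the two agrees with it more often.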
This paper presents a number of statistically grounded performance evaluation metrics capable of evaluating binary classifiers in the absence of annotated Ground Truth. These metrics are generic and can be applied to any type of classifier, but are experimentally validated on binarization algorithms. The statistically grounded metrics were applied and compared with metrics based on annotated data. The approach achieves statistically significant, better-than-random results in classifier selection, and our evaluation metrics, which require no Ground Truth, correlate highly with traditional metrics. The experiments were conducted on images from the DIBCO binarization contests held between 2009 and 2013.