In this paper, are presented a number of statistically grounded performance evaluation metrics capable of evaluating binary classifiers in absence of annotated Ground Truth. These metrics are generic and can be applied to any type of classifier but are experimentally validated on binarization algorithms. The statistically grounded metrics were applied and compared with metrics based on annotated data. This approach has statistically significant better than random results in classifiers selection, and our evaluation metrics requiring no Ground Truth have high correlation with traditional metrics. The experiments were conducted on the images from the DIBCO binarization contests between 2009 and 2013.