Label evaluation is currently done with strict measures only. We should add a non-strict evaluation as well and overhaul the evalapp.
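
To make the request concrete, here is a rough sketch of how a strict and a non-strict measure could sit side by side. Everything in it is an assumption for illustration: this note does not define "non-strict", so the sketch reads it as case-insensitive token overlap (Jaccard) above a threshold, and it treats labels as plain strings. The names `strict_match`, `relaxed_match`, and `accuracy` are hypothetical and not part of the evalapp.

```python
from typing import Callable, Iterable, Tuple


def strict_match(predicted: str, gold: str) -> bool:
    """Strict measure: the label must equal the gold label exactly."""
    return predicted == gold


def relaxed_match(predicted: str, gold: str, threshold: float = 0.5) -> bool:
    """Assumed non-strict measure: case-insensitive token overlap.

    Counts as a match when the Jaccard similarity of the two token
    sets reaches `threshold`. The threshold value is an arbitrary choice.
    """
    pred_tokens = set(predicted.lower().split())
    gold_tokens = set(gold.lower().split())
    if not pred_tokens or not gold_tokens:
        # Two empty labels match each other; an empty label matches nothing else.
        return pred_tokens == gold_tokens
    overlap = len(pred_tokens & gold_tokens)
    union = len(pred_tokens | gold_tokens)
    return overlap / union >= threshold


def accuracy(
    pairs: Iterable[Tuple[str, str]],
    match: Callable[[str, str], bool],
) -> float:
    """Fraction of (predicted, gold) pairs accepted by `match`."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(match(p, g) for p, g in pairs) / len(pairs)


# Made-up example pairs, purely for illustration.
example_pairs = [
    ("neural networks", "neural networks"),
    ("Neural Networks", "neural networks"),
    ("deep neural networks", "neural networks"),
    ("databases", "neural networks"),
]

strict_score = accuracy(example_pairs, strict_match)
relaxed_score = accuracy(example_pairs, relaxed_match)
```

Reporting both scores from the same pairs would show how much of the strict error comes from near misses rather than wrong labels, which may be useful input for the evalapp overhaul.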