Compare training datasets by the performance of models trained on them, measured on standard benchmarks. A higher score indicates better training data.
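
As a minimal sketch of this comparison, assume each candidate dataset has already been used to train a model and that model has been scored on a set of benchmarks where higher is better. The function name `rank_datasets`, the dataset and benchmark labels, and the unweighted mean used to aggregate scores are all assumptions made for illustration. The numbers are placeholders, not real results.

```python
from statistics import mean


def rank_datasets(scores):
    """Rank datasets by mean benchmark score, best first.

    scores: dict mapping dataset name -> dict of benchmark name -> score.
    Every score is assumed to be on a higher-is-better scale.
    Returns a list of (dataset name, mean score) pairs.
    """
    averages = {
        name: mean(per_benchmark.values())
        for name, per_benchmark in scores.items()
    }
    # Sort by the averaged score, highest first.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder scores for two hypothetical datasets (illustration only).
example_scores = {
    "dataset_a": {"bench_1": 0.62, "bench_2": 0.48},
    "dataset_b": {"bench_1": 0.70, "bench_2": 0.52},
}

ranking = rank_datasets(example_scores)
best_dataset = ranking[0][0]  # dataset with the highest mean score
```

An unweighted mean treats every benchmark as equally important. If the benchmarks use different scales, the scores would need to be normalized before averaging.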