QA LLM evaluation datasets for task coverage, label quality, leakage risk, difficulty balance, and owner signoff
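
As a rough illustration, the five checks named above could be gathered into one report. This is a minimal sketch, not a definitive implementation: the `EvalItem` schema, the `qa_report` function, the use of unanimous annotator agreement as a proxy for label quality, and word n-gram overlap as a proxy for leakage are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalItem:
    # Hypothetical schema; these field names are assumptions.
    task: str            # task category the item is meant to cover
    prompt: str          # text shown to the model
    labels: list[str]    # one label per independent annotator
    difficulty: str      # e.g. "easy", "medium", "hard"


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def qa_report(items, required_tasks, training_docs, owner_signed_off, n=8):
    """Summarize the five QA checks for a list of EvalItem objects."""
    # Task coverage: required tasks that have no items at all.
    present = {item.task for item in items}
    missing_tasks = sorted(set(required_tasks) - present)

    # Label quality (proxy): share of items whose annotators all agree.
    unanimous = sum(1 for item in items if len(set(item.labels)) == 1)
    agreement_rate = unanimous / len(items) if items else 0.0

    # Leakage risk (proxy): items sharing any n-gram with a training doc.
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    leaked = [i for i, item in enumerate(items)
              if ngrams(item.prompt, n) & train_grams]

    # Difficulty balance: item count per difficulty bucket.
    difficulty_counts = dict(Counter(item.difficulty for item in items))

    return {
        "missing_tasks": missing_tasks,
        "agreement_rate": agreement_rate,
        "leaked_item_indices": leaked,
        "difficulty_counts": difficulty_counts,
        "owner_signed_off": bool(owner_signed_off),
    }
```

Owner signoff is passed in as a plain flag because it is a human decision rather than something computed from the data. The n-gram length `n` is a tunable assumption: shorter n-grams flag more items but produce more false positives.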