Automatically grade LLM-generated responses for factual accuracy, instruction adherence, coherence, completeness, and sa