Auto-BenchmarkCard: Automated Synthesis of Benchmark Documentation. Aris Hofmann, Inge Vejsbjerg, et al. 2026. AAAI 2026. Demo paper.
Risk Atlas Nexus: A System for Managing AI Risks. Inge Vejsbjerg, Rahul Nair, et al. 2026. AAAI 2026. Demo paper.
Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment. Srishti Yadav, Jasmina Gajcin, et al. 2026. AAAI 2026. Workshop paper.
BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks. Anna Sokol, Elizabeth Daly, et al. 2025. NeurIPS 2025. Conference paper.
FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models. Radu Marinescu, Debarun Bhattacharjya, et al. 2025. EMNLP 2025. Paper.
Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist. Elizabeth Daly, Erik Miehling, et al. 2025. EMNLP 2025. Demo paper.
Optimistic Exploration for Risk-Averse Constrained Reinforcement Learning. Radu Marinescu, Elizabeth Daly, et al. 2025. ECAI 2025. Conference paper.
Localizing Persona Representations in LLMs. Celia Cintas, Miriam Rateike, et al. 2025. AIES 2025. Conference paper.
Localizing Persona Representations in LLMs. Celia Cintas, Miriam Rateike, et al. 2025. COLM 2025. Workshop paper.
EvalAssist: Insights on Task-Specific Evaluations and AI-assisted Judgement Strategy Preferences. Zahra Ashktorab, Michael Desmond, et al. 2025. UIST 2025. Conference paper.