⚗️

Lila Benchmark Review

Human-in-the-loop quality evaluation

Drop your v3-results-*.json file here

Supports v3 benchmark datasets (1,400+ outputs)