This benchmark framework evaluates whether LLM agents can learn and adapt in complex stateful environments where actions modify persistent state, entities have cross-references, and workflows span ...