- Compare one-shot prompting vs. full agentic loop (with profiling feedback).
- Measure Success@1, Success@3, Success@5, and tokens used.
- Analyze quality of intermediate reasoning steps.
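
The metrics above could be computed along the following lines. This is a minimal sketch, not the evaluation harness itself. It assumes Success@k means "at least one of the first k attempts on a task succeeded", which this plan does not define explicitly. The function names `success_at_k` and `mean_tokens`, the task IDs, and all outcome values are hypothetical placeholders.

```python
from typing import Dict, Sequence


def success_at_k(attempts: Dict[str, Sequence[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts.

    Assumption: attempts are stored in the order they were made, one
    boolean per attempt, keyed by a task ID.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if not attempts:
        raise ValueError("no tasks to score")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


def mean_tokens(token_counts: Dict[str, Sequence[int]], k: int) -> float:
    """Average number of tokens spent per task over the first k attempts."""
    if not token_counts:
        raise ValueError("no tasks to score")
    totals = [sum(counts[:k]) for counts in token_counts.values()]
    return sum(totals) / len(totals)


# Hypothetical outcomes for one condition (e.g. the agentic loop).
# These are made-up placeholder values, not measured results.
example_outcomes = {
    "task_a": [False, False, True, False, False],
    "task_b": [False, False, False, False, False],
    "task_c": [True, False, False, False, False],
    "task_d": [False, False, False, False, True],
}

for k in (1, 3, 5):
    print(f"Success@{k}: {success_at_k(example_outcomes, k):.2f}")
```

Running the same two functions on both the one-shot and the agentic-loop results keeps the comparison consistent: each condition is scored with the same definition of success and the same attempt budget.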