Support GAIA benchmark #1911

Merged: 19 commits, May 24, 2024

Conversation

@Jiayi-Pan (Contributor) commented May 20, 2024

See Issue #1865
This PR introduces the GAIA benchmark as part of the evaluation harness. Currently, this is a draft version with known limitations and bugs:

  1. File handling: Some questions come with attached files (e.g., png, MOV, xml, xlsx, txt, json). At present, we simply move them to the workspace and inform the agent that the file is available (a minimal sketch of this step appears below the repro command).
    • To reach a good score on GAIA, the agent needs to handle these files properly, and we should consider adding vision support to the agents.
  2. Agent hang issue: The agent hangs and doesn't stop after triggering finish.
    • This issue has been discussed with @xingyaoww, and we believe it might be a bug in the agent implementation or the browser integration.
      (screenshot)

To reproduce error 2, run

python ./evaluation/gaia/run_infer.py \
--level 2023_level1 \
--data-split validation \
--eval-n-limit 1
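
And here is a minimal sketch of the file-handling step from point 1 above (hypothetical helper and file names; the actual logic presumably lives in evaluation/gaia/run_infer.py):

import shutil
from pathlib import Path

# Hypothetical sketch: copy a question's attached file into the agent
# workspace, then build the note that tells the agent where the file is.
def prepare_attachment(src: str, workspace: str = '/workspace') -> str:
    dest = Path(workspace) / Path(src).name
    shutil.copy(src, dest)
    return (
        f'The question comes with an attached file, available at {dest}. '
        'You may need to inspect it to answer.'
    )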

@li-boxuan (Collaborator) commented May 20, 2024

@Jiayi-Pan

I got the following error when trying to reproduce:

ERROR:root:<class 'datasets.exceptions.DatasetNotFoundError'>: Dataset 'gaia-benchmark/GAIA' doesn't exist on the Hub or cannot be accessed. If the dataset is private or gated, make sure to log in with huggingface-cli login or visit the dataset page at https://huggingface.co/datasets/gaia-benchmark/GAIA to ask for access.

Is it intended to be private for now?

UPDATE: Never mind, I see it's public; I just needed to accept the conditions.
UPDATE 2: I can reproduce the issue!
UPDATE 3: It looks like there's a bug with the browser env. I'll look into it and push a fix to your branch directly to unblock you, then create a separate PR to fix the problem on main. ETA: 24 hours (it's midnight PT right now).
UPDATE 4: A fix has been pushed to this branch directly. I also opened #1933 to fix it on main.
UPDATE 5: Fix pushed to main and merged back into this branch.

@li-boxuan (Collaborator)

@Jiayi-Pan Could you please try again? I just pushed a fix to your branch.

@frankxu2004 (Collaborator)

@Jiayi-Pan I think for point 1, the agent has the ability to open file:// URLs inside the browser, and the browser observation will return a representation (a screenshot as well, though there's no multimodal model yet). Maybe we could still claim that these are possible even for non-text-only formats such as MOV.

@xingyaoww (Collaborator)

@frankxu2004 But can the browser (in the app container) actually access files in the sandbox correctly, though?

@frankxu2004 (Collaborator) commented May 20, 2024

That's a very good point!! Maybe we need to expose a static web server hosting all the files under the workspace, so the browser can access them through http://SANDBOX_HOST:PORT/workspace/file.mov.
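
A minimal sketch of that idea using only the Python standard library (the port and served directory are assumptions):

import functools
import http.server
import socketserver

PORT = 8000  # assumed port, standing in for SANDBOX_HOST:PORT above

# Serve everything under /workspace as static files, so the browser could
# fetch e.g. http://SANDBOX_HOST:8000/file.mov
Handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory='/workspace'
)
with socketserver.TCPServer(('', PORT), Handler) as httpd:
    httpd.serve_forever()

Equivalently, one could run python -m http.server 8000 --directory /workspace from inside the sandbox.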

@yufansong (Collaborator) commented May 20, 2024

I am a little confused about the first point. Could someone explain more? Thanks in advance for any explanation.

I see in the codebase that the browser converts the screenshot into base64. Will we pass the base64 image into the LLM, or what will it be used for?

we should consider adding vision support to the agents.

What vision support do you want the agents to have? Do you mean agents being able to access/open other kinds of files (e.g., png, MOV, xml, xlsx, txt, json)? I found #1914; maybe this is what you mean?

That's a very good point!! Maybe we need to expose a static web server hosting all the files under the workspace, so the browser can access them through http://SANDBOX_HOST:PORT/workspace/file.mov.

I don't understand why a web server helps in this case. Do you mean moving the files into the web server so that gymnasium can access the file URL just like browsing the web?

@frankxu2004 (Collaborator)

@yufansong

I see in the codebase that the browser converts the screenshot into base64. Will we pass the base64 image into the LLM, or what will it be used for?

Currently we don't use any multimodal LLMs, so the screenshot is for frontend showcase only (the browser tab in the frontend shows what the current browser state looks like).
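
For illustration, a minimal sketch of that conversion (hypothetical file name):

import base64

# Read a screenshot from disk and produce the kind of base64 data URL the
# frontend can render directly in an <img> tag.
with open('screenshot.png', 'rb') as f:
    encoded = base64.b64encode(f.read()).decode('utf-8')
data_url = f'data:image/png;base64,{encoded}'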

I don't understand why a web server helps in this case. Do you mean moving the files into the web server so that gymnasium can access the file URL just like browsing the web?

Since the browser already has screenshot support in our codebase, as well as support for more complex file handling (e.g., the browser can open PDF files, which is much harder from the command line), allowing the browser to access files inside the sandbox's workspace should enable such scenarios. Also, if a multimodal LLM takes in browser screenshot images in the future, this could be a way of unifying visual observation of various file types. Still, we could do filetype-specific handling as the other PR you mentioned is doing.

As for the web server idea: originally I thought the files in the sandbox would not be directly accessible to the browser running on the host. But after checking, it seems /workspace is mounted from the host, so the browser may be able to open these files directly with file://.
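
A minimal sketch of that last point (assuming the mount means paths match on both sides):

from pathlib import Path

# Turn a workspace path into a file:// URL the host browser can open directly.
url = Path('/workspace/file.mov').as_uri()
print(url)  # file:///workspace/file.mov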

@li-boxuan (Collaborator)

But after checking, it seems /workspace is mounted from the host, so the browser may be able to open these files directly with file://.

That's sweet! Otherwise, hosting a web server just to serve static files under /workspace seems like an anti-pattern to me.

@Jiayi-Pan (Contributor, Author)

Thanks everyone for the discussion! And thanks to @li-boxuan for fixing the agent hang bug.

After a few more bug fixes, I believe the GAIA evaluation harness is now pretty much complete, although there's still work to be done to develop a high-performing agent on GAIA.

Should we merge this PR now and continue agent development in other threads? For instance, we have an ongoing PR focused on multimodal understanding, #1914.

@Jiayi-Pan (Contributor, Author)

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multimodal understanding skills by itself.
The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.
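
For illustration, a sketch of what the agent effectively ran (hypothetical file name), after pip install python-docx:

from docx import Document

# Extract the plain text of every paragraph in the attached DOCX file.
doc = Document('/workspace/question_attachment.docx')
text = '\n'.join(paragraph.text for paragraph in doc.paragraphs)
print(text)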

@Jiayi-Pan marked this pull request as ready for review May 23, 2024 04:36
@neubig (Contributor) commented May 23, 2024

@Jiayi-Pan this is amazing:

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multimodal understanding skills by itself.
The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.

@xingyaoww I'd like to get your review on this if possible, as the evals guru.

@xingyaoww (Collaborator) left a review:

Overall this looks great to me. Just a few small things need to be tweaked; once those are addressed, I can help test-run the inference and merge it.

evaluation/gaia/run_infer.py (resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
@Jiayi-Pan (Contributor, Author)

Thanks @xingyaoww for the review. I've addressed all the issues.

@yufansong (Collaborator) left a review:

Mostly LGTM.

evaluation/gaia/README.md (outdated; resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
@iFurySt (Collaborator) commented May 24, 2024

I tried to run it yesterday; it looks fine.

@xingyaoww (Collaborator) left a review:

LGTM! I added run_infer.sh, updated README.md to reflect that, and added a prompt that instructs the model to enclose its answer in a <solution> tag for ease of parsing (a sketch of that parsing idea follows). I can confirm this works on the first example of 2023_level1.
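
A minimal sketch of that parsing idea (assumed response text; the actual parsing code in this PR may differ):

import re

# Pull the final answer out of the <solution> tag the prompt asks for.
response = 'After checking the page, I conclude: <solution>42</solution>'
match = re.search(r'<solution>(.*?)</solution>', response, re.DOTALL)
answer = match.group(1).strip() if match else None
print(answer)  # 42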

But I think we still need to greatly improve the browser primitive functions (cc @frankxu2004):
(screenshot)

On the 2nd instance of GAIA, I found that the agent tries to scroll down the web page, but that primitive does not exist, which causes issues.

@xingyaoww (Collaborator) commented May 24, 2024

I think we can get this merged and improve it in future PRs.

I'd appreciate it if anyone could take a quick look at my newest changes. If they look good, we can merge this one.

@yufansong (Collaborator) left a review:

@xingyaoww the other new changes look good to me. I left one question.

if isinstance(act, MessageAction) and act.source == 'agent':
model_answer = act.content
if isinstance(act, CmdRunAction) and act.source == 'agent':
model_answer_raw = act.thought
Collaborator (inline comment on the snippet above):

Should we also add break here?

Collaborator (reply):

Good catch!! Fixed in the new commit; it will now auto-merge.
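
For reference, a hypothetical reconstruction of the fixed loop (the actual commit may differ; history, MessageAction, and CmdRunAction are the harness's own names from the snippet above):

# Walk the event history from the end so the agent's final action is seen
# first, and stop scanning as soon as an answer has been captured.
model_answer = ''
model_answer_raw = ''
for act in reversed(history):
    if isinstance(act, MessageAction) and act.source == 'agent':
        model_answer = act.content
        break
    if isinstance(act, CmdRunAction) and act.source == 'agent':
        model_answer_raw = act.thought
        break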

@xingyaoww enabled auto-merge (squash) May 24, 2024 11:15
@xingyaoww merged commit 2d52298 into OpenDevin:main May 24, 2024
18 checks passed