Support GAIA benchmark #1911

Merged: 19 commits, May 24, 2024

Conversation

@Jiayi-Pan (Contributor) commented May 20, 2024

See Issue #1865
This PR introduces the GAIA benchmark as part of the evaluation harness. Currently, this is a draft version with known limitations and bugs:

  1. File handling: Some questions come with attached files (e.g., png, MOV, xml, xlsx, txt, json). At present, we simply move them to the workspace and inform the agent that the file is available (a minimal sketch of this step appears below the repro command).
    • To reach a good score on GAIA, the agent needs to handle these files properly, and we should consider adding vision support to the agents.
  2. Agent hang issue: The agent hangs and doesn't stop after triggering finish.
    • This issue has been discussed with @xingyaoww, and we believe it might be a bug in the agent implementation or the browser integration.
      (screenshot)

To reproduce error 2, run

python ./evaluation/gaia/run_infer.py \
--level 2023_level1 \
--data-split validation \
--eval-n-limit 1
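
And here is a minimal sketch of the file-handling step from point 1 above (hypothetical helper and file names; the actual logic presumably lives in evaluation/gaia/run_infer.py):

import shutil
from pathlib import Path

# Hypothetical sketch: copy a question's attached file into the agent
# workspace, then build the note that tells the agent where the file is.
def prepare_attachment(src: str, workspace: str = '/workspace') -> str:
    dest = Path(workspace) / Path(src).name
    shutil.copy(src, dest)
    return (
        f'The question comes with an attached file, available at {dest}. '
        'You may need to inspect it to answer.'
    )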

@li-boxuan (Collaborator) commented May 20, 2024

@Jiayi-Pan

I got the following error when trying to reproduce:

ERROR:root:<class 'datasets.exceptions.DatasetNotFoundError'>: Dataset 'gaia-benchmark/GAIA' doesn't exist on the Hub or cannot be accessed. If the dataset is private or gated, make sure to log in with huggingface-cli login or visit the dataset page at https://huggingface.co/datasets/gaia-benchmark/GAIA to ask for access.

Is it intended to be private for now?

UPDATE: Never mind, I see it's public; I just needed to accept the conditions.
UPDATE 2: I can reproduce the issue!
UPDATE 3: It looks like there's a bug with the browser env. I'll look into it and push a fix to your branch directly to unblock you, then create a separate PR to fix the problem on main. ETA: 24 hours (it's midnight PT right now).
UPDATE 4: A fix has been pushed to this branch directly. I also opened #1933 to fix it on main.
UPDATE 5: Fix pushed to main and merged back into this branch.

@li-boxuan (Collaborator)

@Jiayi-Pan Could you please try again? I just pushed a fix to your branch.

@frankxu2004 (Collaborator)

@Jiayi-Pan I think for point 1, the agent has the ability to open file:// URLs inside the browser, and the browser observation will return a representation (a screenshot as well, though there's no multimodal model yet). Maybe we could still claim that these are possible even for non-text-only formats such as MOV.

@xingyaoww (Collaborator)

@frankxu2004 But can the browser (in the app container) actually access files in the sandbox correctly, though?

@frankxu2004 (Collaborator) commented May 20, 2024

That's a very good point!! Maybe we need to expose a static web server hosting all the files under the workspace, so the browser can access them through http://SANDBOX_HOST:PORT/workspace/file.mov.
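
A minimal sketch of that idea using only the Python standard library (the port and served directory are assumptions):

import functools
import http.server
import socketserver

PORT = 8000  # assumed port, standing in for SANDBOX_HOST:PORT above

# Serve everything under /workspace as static files, so the browser could
# fetch e.g. http://SANDBOX_HOST:8000/file.mov
Handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory='/workspace'
)
with socketserver.TCPServer(('', PORT), Handler) as httpd:
    httpd.serve_forever()

Equivalently, one could run python -m http.server 8000 --directory /workspace from inside the sandbox.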

@yufansong (Collaborator) commented May 20, 2024

I am a little confused about the first point. Could someone explain more? Thanks in advance for any explanation.

I see in the codebase that the browser converts the screenshot into base64. Will we pass the base64 image into the LLM, or what will it be used for?

we should consider adding vision support to the agents.

What vision support do you want the agents to have? Do you mean agents being able to access/open other kinds of files (e.g., png, MOV, xml, xlsx, txt, json)? I found #1914; maybe this is what you mean?

That's a very good point!! Maybe we need to expose a static web server hosting all the files under the workspace, so the browser can access them through http://SANDBOX_HOST:PORT/workspace/file.mov.

I don't understand why a web server helps in this case. Do you mean moving the files into the web server so that gymnasium can access the file URL just like browsing the web?

@frankxu2004 (Collaborator)

@yufansong

I see in the codebase that the browser converts the screenshot into base64. Will we pass the base64 image into the LLM, or what will it be used for?

Currently we don't use any multimodal LLMs, so the screenshot is for frontend showcase only (the browser tab in the frontend shows what the current browser state looks like).
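
For illustration, a minimal sketch of that conversion (hypothetical file name):

import base64

# Read a screenshot from disk and produce the kind of base64 data URL the
# frontend can render directly in an <img> tag.
with open('screenshot.png', 'rb') as f:
    encoded = base64.b64encode(f.read()).decode('utf-8')
data_url = f'data:image/png;base64,{encoded}'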

I don't understand why a web server helps in this case. Do you mean moving the files into the web server so that gymnasium can access the file URL just like browsing the web?

Since the browser already has screenshot support in our codebase, as well as support for more complex file handling (e.g., the browser can open PDF files, which is much harder from the command line), allowing the browser to access files inside the sandbox's workspace should enable such scenarios. Also, if a multimodal LLM takes in browser screenshot images in the future, this could be a way of unifying visual observation of various file types. Still, we could do filetype-specific handling as the other PR you mentioned is doing.

As for the web server idea: originally I thought the files in the sandbox would not be directly accessible to the browser running on the host. But after checking, it seems /workspace is mounted from the host, so the browser may be able to open these files directly with file://.
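
A minimal sketch of that last point (assuming the mount means paths match on both sides):

from pathlib import Path

# Turn a workspace path into a file:// URL the host browser can open directly.
url = Path('/workspace/file.mov').as_uri()
print(url)  # file:///workspace/file.mov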

@li-boxuan (Collaborator)

But after checking, it seems /workspace is mounted from the host, so the browser may be able to open these files directly with file://.

That's sweet! Otherwise, hosting a web server just to serve static files under /workspace seems like an anti-pattern to me.

@Jiayi-Pan (Contributor, Author)

Thanks everyone for the discussion! And thanks to @li-boxuan for fixing the agent hang bug.

After a few more bug fixes, I believe the GAIA evaluation harness is now pretty much complete, although there's still work to be done to develop a high-performing agent on GAIA.

Should we merge this PR now and continue agent development in other threads? For instance, we have an ongoing PR focused on multimodal understanding, #1914.

@Jiayi-Pan (Contributor, Author)

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multimodal understanding skills by itself.
The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.
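
For illustration, a sketch of what the agent effectively ran (hypothetical file name), after pip install python-docx:

from docx import Document

# Extract the plain text of every paragraph in the attached DOCX file.
doc = Document('/workspace/question_attachment.docx')
text = '\n'.join(paragraph.text for paragraph in doc.paragraphs)
print(text)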

@Jiayi-Pan marked this pull request as ready for review May 23, 2024 04:36
@neubig (Contributor) commented May 23, 2024

@Jiayi-Pan this is amazing:

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multimodal understanding skills by itself.
The agent decided to first install the python-docx package and then use it within Jupyter to help understand the docx document.

@xingyaoww I'd like to get your review on this if possible, as the evals guru.

@xingyaoww (Collaborator) left a review:

Overall this looks great to me. Just a few small things need to be tweaked; once those are addressed, I can help test-run the inference and merge it.

evaluation/gaia/run_infer.py (resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
@Jiayi-Pan (Contributor, Author)

Thanks @xingyaoww for the review. I've addressed all the issues.

@yufansong (Collaborator) left a review:

Mostly LGTM.

evaluation/gaia/README.md (outdated; resolved)
evaluation/gaia/run_infer.py (outdated; resolved)
@iFurySt (Collaborator) commented May 24, 2024

I tried to run it yesterday; it looks fine.

@xingyaoww (Collaborator) left a review:

LGTM! I added run_infer.sh, updated README.md to reflect that, and added a prompt that instructs the model to enclose its answer in a <solution> tag for ease of parsing (a sketch of that parsing idea follows). I can confirm this works on the first example of 2023_level1.
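
A minimal sketch of that parsing idea (assumed response text; the actual parsing code in this PR may differ):

import re

# Pull the final answer out of the <solution> tag the prompt asks for.
response = 'After checking the page, I conclude: <solution>42</solution>'
match = re.search(r'<solution>(.*?)</solution>', response, re.DOTALL)
answer = match.group(1).strip() if match else None
print(answer)  # 42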

But I think we still need to greatly improve the browser primitive functions (cc @frankxu2004):
(screenshot)

On the 2nd instance of GAIA, I found that the agent tries to scroll down the web page, but that primitive does not exist, which causes issues.

@xingyaoww (Collaborator) commented May 24, 2024

I think we can get this merged and improve it in future PRs.

I'd appreciate it if anyone could take a quick look at my newest changes. If they look good, we can merge this one.

@yufansong (Collaborator) left a review:

@xingyaoww the other new changes look good to me. I left one question.

if isinstance(act, MessageAction) and act.source == 'agent':
model_answer = act.content
if isinstance(act, CmdRunAction) and act.source == 'agent':
model_answer_raw = act.thought
Collaborator (inline comment on the snippet above):

Should we also add break here?

Collaborator (reply):

Good catch!! Fixed in the new commit; it will now auto-merge.
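
For reference, a hypothetical reconstruction of the fixed loop (the actual commit may differ; history, MessageAction, and CmdRunAction are the harness's own names from the snippet above):

# Walk the event history from the end so the agent's final action is seen
# first, and stop scanning as soon as an answer has been captured.
model_answer = ''
model_answer_raw = ''
for act in reversed(history):
    if isinstance(act, MessageAction) and act.source == 'agent':
        model_answer = act.content
        break
    if isinstance(act, CmdRunAction) and act.source == 'agent':
        model_answer_raw = act.thought
        break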

@xingyaoww enabled auto-merge (squash) May 24, 2024 11:15
@xingyaoww merged commit 2d52298 into OpenDevin:main May 24, 2024
18 checks passed