Support GAIA benchmark #1911
Conversation
I got the following error when trying to reproduce:
Is it intended to be private for now? UPDATE: never mind, I see it's public, and I just need to accept the conditions.
@Jiayi-Pan Could you please try again? I just pushed a fix to your branch.
@Jiayi-Pan I think for 1, the agent has the ability to open file:// inside the browser, and the browser observation will return a representation (a screenshot too, though there's no multimodal model yet). Maybe we could still claim that these are possible even for non-text formats such as MOV.
@frankxu2004 But can the browser (on the host) access the files inside the sandbox?
That's a very good point!! Maybe we need to expose a static web server hosting all the files under workspace, so the browser can access them through http://SANDBOX_HOST:PORT/workspace/file.mov
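For illustration, a minimal sketch of that idea using Python's standard library, assuming a `/workspace` directory and port 8000 (both placeholders, not the project's actual configuration; note the served URL would be `http://SANDBOX_HOST:8000/file.mov`, without the `/workspace` prefix):

```python
# Hypothetical sketch: serve the sandbox workspace as static files over HTTP
# so the browser could fetch e.g. http://SANDBOX_HOST:8000/file.mov.
# The directory and port are assumptions, not the project's actual setup.
import functools
from http.server import HTTPServer, SimpleHTTPRequestHandler

handler = functools.partial(SimpleHTTPRequestHandler, directory='/workspace')
HTTPServer(('0.0.0.0', 8000), handler).serve_forever()
```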
I am a little confused about the first point. Could someone explain more? Thanks in advance for any explanation. I see that in the codebase, the browser converts the screenshot into base64. Will we pass the base64 image into the LLM, or what will it be used for?
What vision support do you want the agents to have? Do you mean agents being able to access/open other kinds of files (e.g., png, MOV, xml, xlsx, txt, json)? I found #1914; maybe this is what you mean?
I didn't understand why a web server can help in this case. Do you mean moving the file onto the web server so that gymnasium can access the file URL just like browsing the web?
Currently we don't use any multimodal LLMs, so the screenshot is for frontend showcase only (the browser tab in the frontend showing what the current browser state looks like).
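To make the future direction concrete, here is a hedged sketch of how a base64 screenshot could be packaged for a multimodal LLM, using the common OpenAI-style vision message format; `screenshot_b64` is a placeholder for the string the browser observation returns, and none of this is the project's actual code:

```python
# Sketch only: wrap a base64-encoded browser screenshot in the OpenAI-style
# vision message format. Field names follow that public API.
def build_vision_message(task_text: str, screenshot_b64: str) -> dict:
    return {
        'role': 'user',
        'content': [
            {'type': 'text', 'text': task_text},
            {
                'type': 'image_url',
                'image_url': {'url': f'data:image/png;base64,{screenshot_b64}'},
            },
        ],
    }
```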
Since the browser already has screenshot support in our codebase, as well as support for more complex file handling (e.g., the browser can open PDF files, which is harder from the command line), allowing the browser to access files inside the sandbox's workspace should enable such scenarios. Also, if a multimodal LLM takes in browser screenshot images in the future, this could be a way of unifying visual observation of various files. Still, we could do filetype-specific handling as the other PR you mentioned is doing. As for the web server: originally I thought the files in the sandbox would not be directly accessible to the browser running on the host. But after checking, it seems like /workspace is mounted from the host, so the browser may be able to open these files directly with file://
That's sweet! Otherwise we'd have to host a web server just to serve static files under /workspace.
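If `/workspace` really is host-mounted, a small helper like the following could build the `file://` URL for the browser (a sketch; the helper name and mount path are my assumptions):

```python
# Hypothetical helper: map a path under the mounted workspace to a file://
# URL the host browser can open directly. Assumes /workspace exists on host.
from pathlib import Path

def workspace_file_url(relative_path: str, workspace_root: str = '/workspace') -> str:
    return (Path(workspace_root) / relative_path).resolve().as_uri()

print(workspace_file_url('report.pdf'))  # file:///workspace/report.pdf
```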
Thanks everyone for the discussion! And thanks to @li-boxuan for fixing the agent hang bug. After a few more bug fixes, I believe the GAIA evaluation harness is now pretty much complete, although there's still work to be done to develop a high-performing agent on GAIA. Should we merge this PR now and continue agent development in other threads? For instance, we have an ongoing PR focused on multimodal understanding, #1914.
One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space that it can develop multimodal understanding skills on its own.
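As an illustration of the kind of self-directed file handling described here, the agent could, for example, install `python-docx` in the sandbox and read the DOCX directly (a sketch; the file path is a placeholder):

```python
# Sketch of what an agent might do on its own to "understand" a DOCX file:
# pip install python-docx, then extract the text. The path is illustrative.
from docx import Document  # provided by the python-docx package

doc = Document('/workspace/question.docx')
for paragraph in doc.paragraphs:
    if paragraph.text.strip():
        print(paragraph.text)
```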
@Jiayi-Pan this is amazing:
@xingyaoww I'd like to get your review on this if possible, as the evals guru. |
Overall looks great to me - just a few small things need to be tweaked. Once these are addressed, I can help test-run the inference and merge it.
Thanks @xingyaoww for the review. I've addressed all the issues.
Mostly LGTM
I tried to run it yesterday; looks fine.
LGTM! I added `run_infer.sh`, updated `README.md` to reflect that, and added a prompt that instructs the model to enclose its answer in a `<solution>` tag for ease of parsing. I can confirm this works on the first example of `2023_level1`.
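For reference, answer extraction from such a tag could be as simple as the regex below (a sketch under my own naming; the harness's actual parsing may differ):

```python
# Minimal sketch of parsing the <solution> tag out of the model's answer.
import re

def extract_solution(answer: str) -> str | None:
    match = re.search(r'<solution>(.*?)</solution>', answer, re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_solution('Final: <solution>42</solution>'))  # -> '42'
```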
But I think we still need to greatly improve the browser primitive functions (cc @frankxu2004):
On the 2nd instance of GAIA, I found that the agent tries to scroll down the web page, but that primitive does not exist, which causes issues.
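For illustration, a scroll primitive of the sort being requested could be a thin wrapper over Playwright's mouse wheel API (the names here are hypothetical, not the project's actual browser action space):

```python
# Hypothetical scroll primitive sketched over Playwright; not project code.
from playwright.sync_api import Page

def scroll_page(page: Page, delta_y: int = 500) -> None:
    # Positive delta_y scrolls down, negative scrolls up.
    page.mouse.wheel(0, delta_y)
```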
I think we can get this merged, and improve it in future PRs:
I'd appreciate it if anyone could take a quick look at my newest changes. If they look good, we can merge this one.
@xingyaoww the other new changes look good to me. Left one question.
```python
if isinstance(act, MessageAction) and act.source == 'agent':
    model_answer = act.content
if isinstance(act, CmdRunAction) and act.source == 'agent':
    model_answer_raw = act.thought
```
Should we also add a `break` here?
Good catch!! Fixed in the new commit - will now do an auto-merge.
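For readers following along, the fixed scan presumably looks something like this (a reconstruction from the snippet above; `state.history` and the reverse iteration are my assumptions, not necessarily the actual commit):

```python
# Hypothetical reconstruction of the fix: stop scanning once the agent's
# final answer is found, instead of walking the rest of the history.
for act in reversed(state.history):  # assumed container of past actions
    if isinstance(act, MessageAction) and act.source == 'agent':
        model_answer = act.content
        break
    if isinstance(act, CmdRunAction) and act.source == 'agent':
        model_answer_raw = act.thought
        break
```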
See Issue #1865
This PR introduces the GAIA benchmark as part of the evaluation harness. Currently, this is a draft version with known limitations and bugs:
To reproduce error 2, run