Stagehand Evaluator & first agent evals #668
base: main
Conversation
🦋 Changeset detected. Latest commit: b7caabf. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
      height: 768,
    },
  },
},
can this mess with existing evals? should we make it an agent category property?
should be fine
dotenv.config();
export interface EvaluateOptions { |
need to move types
PR Summary
This PR introduces a new Evaluator class for LLM-based task assessment and adds agent-focused evaluation tasks, enabling more scalable and flexible testing of model capabilities.
- Added `Evaluator` class in `evals/evaluator.ts` with single/batch evaluation methods using LLM screenshot analysis
- Added 5 new agent evaluation tasks in `/evals/tasks/agent/` for testing form filling and flight search capabilities
- Added dynamic model selection in `taskConfig.ts` with support for multiple providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras)
- Added viewport configuration in `initStagehand.ts` for consistent Browserbase session rendering
- Added optional `requestId` in `LLMClient.ts` for more flexible API usage
13 file(s) reviewed, 12 comment(s)
const response = await llmClient.createChatCompletion({
  logger: () => {},
  options: {
logic: Empty logger function could miss important LLM errors. Consider using the stagehand logger instead.
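A minimal sketch of that suggestion, assuming the Evaluator holds a reference to the Stagehand instance and its logger accepts the same log-line shape used elsewhere in the evals:

```ts
const response = await llmClient.createChatCompletion({
  // Forward LLM log lines to the Stagehand logger instead of dropping them.
  logger: (logLine) => this.stagehand.logger(logLine),
  options: {
    // ...unchanged options from the original call
  },
});
```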
finalResults.length = 0; // Clear any potentially partially filled results
for (let i = 0; i < questions.length; i++) {
style: Using array.length = 0 for clearing is not idiomatic. Consider finalResults = []
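For reference, a minimal sketch of the reassignment approach (requires `finalResults` to be declared with `let`; the loop body stands in for the existing fallback handling):

```ts
finalResults = []; // discard any partially filled results by reassigning
for (let i = 0; i < questions.length; i++) {
  // ...same per-question fallback handling as before
}
```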
  ? input.name.split("/").pop() // Get the last part of the path for nested tasks
  : input.name;
logic: pop() could return undefined if input.name ends with '/', causing runtime errors
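A possible guard, as a sketch; the variable name is illustrative and the fallback behavior is not part of the diff:

```ts
// pop() yields "" (typed as string | undefined) when input.name ends with "/";
// fall back to the full name in that case.
const taskName = input.name.split("/").pop() || input.name;
```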
if (provider) {
  return ALL_EVAL_MODELS.filter((model) =>
    filterModelByProvider(model, provider),
  );
}
style: Provider filtering is applied to ALL_EVAL_MODELS but not to DEFAULT_AGENT_MODELS, which could lead to inconsistent behavior when provider filtering is enabled.
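A sketch of one way to keep the two paths consistent; the `useAgentModels` flag is hypothetical, while `DEFAULT_AGENT_MODELS`, `ALL_EVAL_MODELS`, and `filterModelByProvider` come from this diff:

```ts
if (provider) {
  // Apply the same provider filter regardless of which model list is in use.
  const models = useAgentModels ? DEFAULT_AGENT_MODELS : ALL_EVAL_MODELS;
  return models.filter((model) => filterModelByProvider(model, provider));
}
```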
My only comment would be that, in general, for LLM-as-a-judge you should enforce JSON output with a schema rather than freeform text. This drives more determinism in my experience, even if it's just {"passed": true}. You can also then pass response_mime_type: "application/json" in Gemini's case, which does token filtering; I believe this is one of the largest drivers of Gemini's data extraction capabilities. Other than that, great idea.

EDIT: Just realized you are essentially doing that already, but providing YES | NO as options inside a JSON object. Apologies.
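As a sketch of the schema-enforced variant, assuming the `llmClient`'s `createChatCompletion` accepts a `response_model` with a zod schema the way the extract path does (the wiring here is illustrative, not part of this PR):

```ts
import { z } from "zod";

// Constrain the judge to a structured verdict instead of freeform text.
const judgeSchema = z.object({
  evaluation: z.enum(["YES", "NO"]),
  reasoning: z.string(),
});

const response = await llmClient.createChatCompletion({
  logger: () => {},
  options: {
    messages, // same messages array as the original call
    response_model: { name: "EvaluationResult", schema: judgeSchema },
  },
});
```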
Thanks for the pointer on the mime type @peytoncasper! I was hoping to integrate that next into the Gemini client abstraction.
  typeof parsedResult.evaluation !== "string" ||
  typeof parsedResult.reasoning !== "string"
) {
  throw new Error(
can we add some custom errors?
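One possible shape for such an error; the class name and extra field are illustrative rather than part of this PR:

```ts
export class EvaluatorParseError extends Error {
  constructor(
    message: string,
    public readonly rawResponse?: string,
  ) {
    super(message);
    this.name = "EvaluatorParseError";
  }
}

// In the validation branch shown above:
// throw new EvaluatorParseError(
//   "Evaluator response must contain string 'evaluation' and 'reasoning' fields",
//   JSON.stringify(parsedResult),
// );
```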
@@ -322,6 +388,7 @@ const generateFilteredTestcases = (): Testcase[] => {
  logger,
  llmClient,
  useTextExtract: shouldUseTextExtract,
  modelName: input.modelName,
Why do we need to pass in the modelName? Is this because agent won't work with external client providers like aiSDK, etc.?
why
Deterministic evals require high maintenance. A more balanced approach, and the one most frontier labs use to evaluate models, is LLM-as-a-judge. Although not as deterministic as our current evals, this new category scales better while randomizing the potential judging error. Introducing:
Stagehand Evaluator
what changed
Created a new class `Evaluator` which reuses a Stagehand object to leverage the `llmClient` functionality, sending a screenshot (currently; maybe text-only in the future) and asking a question or a set of questions (batch).

Example usage:
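A sketch of what this could look like; the method names (`evaluate`, `batchEvaluate`), option fields, and import path are assumptions based on the PR summary rather than the exact API:

```ts
import { Evaluator } from "../../evaluator"; // path is illustrative

// Assumes an initialized Stagehand instance; the Evaluator reuses its llmClient
// and screenshots the current page to answer each question.
const evaluator = new Evaluator(stagehand);

// Single question
const result = await evaluator.evaluate({
  question: "Is the departure city set to San Francisco?",
});

// Batch of questions in one call
const results = await evaluator.batchEvaluate({
  questions: [
    "Is the departure city set to San Francisco?",
    "Is the arrival city set to New York?",
  ],
});
```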
Evaluator will then return a response in the following format:
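The shape below is inferred from the parsing code in this PR; the example values are illustrative:

```ts
interface EvaluationResult {
  evaluation: "YES" | "NO";
  reasoning: string;
}

// e.g.
const example: EvaluationResult = {
  evaluation: "YES",
  reasoning: "The departure city field shows San Francisco as requested.",
};
```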
Also added a few evals for agent that use `Evaluator` as the judge.

test plan