Stagehand Evaluator & first agent evals #668

Open · wants to merge 7 commits into main

Conversation

miguelg719 (Collaborator) commented Apr 15, 2025

why

Deterministic evals require high maintenance. A more balanced approach, and the one most frontier labs use for evaluating models, is LLMs as judges. Although not as deterministic as our current evals, this new category scales better, and its judging errors are randomized rather than systematic. Introducing:

Stagehand Evaluator

what changed

Created a new Evaluator class that reuses a stagehand object to leverage its llmClient functionality. It sends a screenshot (currently; a text-only mode may come later) to the LLM along with a question, or a set of questions (batch).

Example usage:

  const evaluator = new Evaluator(stagehand);

  // single evaluation
  const result = await evaluator.evaluate({
    question: "Is the form name input filled with 'John Smith'?"
  });

  // batch evaluate
  const results = await evaluator.batchEvaluate({
    questions: [
      "Is the form name input filled with 'John Smith'?",
      "Is the form email input filled with '[email protected]'?",
      "Is the 'Are you the domain owner?' option selected as 'No'?",
    ],
    strictResponse: true, // whether to throw an error when an evaluation is not YES | NO
  });

Evaluator will then return a response in the following format:

{
   evaluation: "YES" | "NO",
   reasoning:  "reasoning about the evaluation process"
}
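
Expressed as a TypeScript type, that shape would look roughly like this (a minimal sketch; the actual exported type name in evals/evaluator.ts may differ):

  // Sketch of the result shape returned by evaluate()/batchEvaluate();
  // the exported type name in the PR may differ.
  export interface EvaluationResult {
    evaluation: "YES" | "NO";
    reasoning: string;
  }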

Also added a few agent evals that use the Evaluator as the judge.

test plan

  • Tested with local evals
  • Tested with BB evals

changeset-bot (bot) commented Apr 15, 2025

🦋 Changeset detected

Latest commit: b7caabf

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:

  Name                        Type
  @browserbasehq/stagehand    Minor


Diff context (viewport configuration in initStagehand.ts):

      height: 768,
    },
  },
},
miguelg719 (Collaborator, Author):

can this mess with existing evals? should we make it an agent category property?

Member:

should be fine


Diff context (evals/evaluator.ts):

dotenv.config();

export interface EvaluateOptions {
miguelg719 (Collaborator, Author):

need to move types

miguelg719 marked this pull request as ready for review on April 15, 2025 at 21:44.
greptile-apps (bot) left a comment

PR Summary

This PR introduces a new Evaluator class for LLM-based task assessment and adds agent-focused evaluation tasks, enabling more scalable and flexible testing of model capabilities.

  • Added Evaluator class in evals/evaluator.ts with single/batch evaluation methods using LLM screenshot analysis
  • Added 5 new agent evaluation tasks in /evals/tasks/agent/ for testing form filling and flight search capabilities
  • Added dynamic model selection in taskConfig.ts with support for multiple providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras)
  • Added viewport configuration in initStagehand.ts for consistent browserbase session rendering
  • Added optional requestId in LLMClient.ts for more flexible API usage

13 file(s) reviewed, 12 comment(s)

Comment on lines +119 to +121
const response = await llmClient.createChatCompletion({
  logger: () => {},
  options: {

logic: Empty logger function could miss important LLM errors. Consider using the stagehand logger instead.
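
A minimal sketch of that suggestion, assuming the Evaluator keeps a reference to the Stagehand instance and that instance exposes its logger (the this.stagehand.logger access below is an assumption, not a confirmed API):

  // Sketch only: forward LLM log lines to the Stagehand logger instead of
  // dropping them. `this.stagehand.logger` is assumed here, not confirmed.
  const response = await llmClient.createChatCompletion({
    logger: (line) => this.stagehand.logger(line),
    options: {
      // ...request options unchanged from the original call
    },
  });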

Comment on lines +348 to +349
finalResults.length = 0; // Clear any potentially partially filled results
for (let i = 0; i < questions.length; i++) {

style: Using array.length = 0 for clearing is not idiomatic. Consider finalResults = []
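
For illustration, the two approaches side by side (how finalResults is declared is assumed):

  // In-place clear: works even when the array is declared with `const`.
  finalResults.length = 0;

  // Reassignment, as suggested: reads more clearly, but requires `finalResults`
  // to be declared with `let` rather than `const`.
  finalResults = [];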

Comment on lines +310 to +311
? input.name.split("/").pop() // Get the last part of the path for nested tasks
: input.name;

logic: pop() could return undefined if input.name ends with '/', causing runtime errors
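
A defensive version might fall back to the full name (a sketch; the ternary's condition and the fallback choice are illustrative, not taken from the PR):

  // If the name ends with "/", split(...).pop() yields "" (and TypeScript types
  // pop() as possibly undefined), so fall back to the full name.
  const taskName = input.name.includes("/")
    ? input.name.split("/").pop() || input.name
    : input.name;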

Comment on lines +117 to 121
if (provider) {
  return ALL_EVAL_MODELS.filter((model) =>
    filterModelByProvider(model, provider),
  );
}

style: Provider filtering is applied to ALL_EVAL_MODELS but not to DEFAULT_AGENT_MODELS, which could lead to inconsistent behavior when provider filtering is enabled.
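
One way to keep the behavior consistent would be to run both lists through the same filter (a sketch; how DEFAULT_AGENT_MODELS is consumed downstream, and that both lists share a model type, are assumptions):

  // Apply the provider filter to both model lists so filtering behaves the same
  // for regular eval models and agent models.
  const applyProviderFilter = (models: typeof ALL_EVAL_MODELS) =>
    provider
      ? models.filter((model) => filterModelByProvider(model, provider))
      : models;

  const evalModels = applyProviderFilter(ALL_EVAL_MODELS);
  const agentModels = applyProviderFilter(DEFAULT_AGENT_MODELS);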

peytoncasper commented Apr 18, 2025

My only comment would be that, in general, for LLM-as-a-Judge you should enforce JSON output with a schema rather than freeform text. In my experience this drives more determinism, even if it's just

{"passed": True}

You can then also pass response_mime_type: "application/json" in Gemini's case, which does token filtering. I believe this is one of the largest drivers of Gemini's data-extraction capabilities.

Other than that, great idea

EDIT

Just realized you are essentially doing that already by providing YES | NO as options inside a JSON object. Apologies.
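
For reference, a minimal sketch of enforcing a JSON schema with Google's generative AI SDK (package, option names, and model name below reflect my understanding of @google/generative-ai and are assumptions; the Stagehand Gemini client abstraction may expose this differently):

  import { GoogleGenerativeAI, SchemaType } from "@google/generative-ai";

  // Sketch: constrain Gemini to emit JSON matching a schema instead of
  // freeform text. Not the PR's implementation.
  const genAI = new GoogleGenerativeAI(process.env.GOOGLE_API_KEY!);
  const model = genAI.getGenerativeModel({
    model: "gemini-1.5-pro",
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: {
        type: SchemaType.OBJECT,
        properties: {
          evaluation: { type: SchemaType.STRING, description: "YES or NO" },
          reasoning: { type: SchemaType.STRING },
        },
        required: ["evaluation", "reasoning"],
      },
    },
  });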

miguelg719 (Collaborator, Author) commented Apr 18, 2025

Thanks for the pointer on the mime type @peytoncasper! I was hoping to integrate that into the Gemini client abstraction next.

  typeof parsedResult.evaluation !== "string" ||
  typeof parsedResult.reasoning !== "string"
) {
  throw new Error(
Member:

can we add some custom errors?
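
A minimal sketch of what a custom error could look like here (the class name and message are illustrative, not part of the PR):

  // Illustrative only: a dedicated error type makes evaluator parse failures
  // easy to catch and distinguish from other runtime errors.
  export class EvaluatorResponseParseError extends Error {
    constructor(rawResponse: string) {
      super(
        `Evaluator response did not match the expected { evaluation, reasoning } shape: ${rawResponse}`,
      );
      this.name = "EvaluatorResponseParseError";
    }
  }

  // Usage at the validation site:
  // throw new EvaluatorResponseParseError(JSON.stringify(parsedResult));
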

@@ -322,6 +388,7 @@ const generateFilteredTestcases = (): Testcase[] => {
logger,
llmClient,
useTextExtract: shouldUseTextExtract,
modelName: input.modelName,
Member:

Why do we need to pass in the modelName? Is this because agent won't work with external client providers like aiSDK, etc.?
