Stagehand Evaluator & first agent evals #668
base: main
Conversation
🦋 Changeset detected. Latest commit: b7caabf. The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
      height: 768,
    },
  },
},
can this mess with existing evals? should we make it an agent category property?
should be fine
dotenv.config();
export interface EvaluateOptions { |
need to move types
PR Summary
This PR introduces a new Evaluator class for LLM-based task assessment and adds agent-focused evaluation tasks, enabling more scalable and flexible testing of model capabilities.
- Added `Evaluator` class in `evals/evaluator.ts` with single/batch evaluation methods using LLM screenshot analysis
- Added 5 new agent evaluation tasks in `/evals/tasks/agent/` for testing form filling and flight search capabilities
- Added dynamic model selection in `taskConfig.ts` with support for multiple providers (Google, Anthropic, OpenAI, Together, Groq, Cerebras)
- Added viewport configuration in `initStagehand.ts` for consistent Browserbase session rendering
- Added optional `requestId` in `LLMClient.ts` for more flexible API usage
13 file(s) reviewed, 12 comment(s)
const response = await llmClient.createChatCompletion({
  logger: () => {},
  options: {
logic: Empty logger function could miss important LLM errors. Consider using the stagehand logger instead.
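A minimal sketch of that suggestion, assuming the Evaluator holds a reference to the Stagehand instance and its logger accepts the same log-line shape used elsewhere in the evals:

```ts
const response = await llmClient.createChatCompletion({
  // Forward LLM log lines to the Stagehand logger instead of dropping them.
  logger: (logLine) => this.stagehand.logger(logLine),
  options: {
    // ...unchanged options from the original call
  },
});
```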
finalResults.length = 0; // Clear any potentially partially filled results
for (let i = 0; i < questions.length; i++) {
style: Using array.length = 0 for clearing is not idiomatic. Consider finalResults = []
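For reference, a minimal sketch of the reassignment approach (requires `finalResults` to be declared with `let`; the loop body stands in for the existing fallback handling):

```ts
finalResults = []; // discard any partially filled results by reassigning
for (let i = 0; i < questions.length; i++) {
  // ...same per-question fallback handling as before
}
```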
  ? input.name.split("/").pop() // Get the last part of the path for nested tasks
  : input.name;
logic: pop() could return undefined if input.name ends with '/', causing runtime errors
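A possible guard, as a sketch; the variable name is illustrative and the fallback behavior is not part of the diff:

```ts
// pop() yields "" (typed as string | undefined) when input.name ends with "/";
// fall back to the full name in that case.
const taskName = input.name.split("/").pop() || input.name;
```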
if (provider) {
  return ALL_EVAL_MODELS.filter((model) =>
    filterModelByProvider(model, provider),
  );
}
style: Provider filtering is applied to ALL_EVAL_MODELS but not to DEFAULT_AGENT_MODELS, which could lead to inconsistent behavior when provider filtering is enabled.
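A sketch of one way to keep the two paths consistent; the `useAgentModels` flag is hypothetical, while `DEFAULT_AGENT_MODELS`, `ALL_EVAL_MODELS`, and `filterModelByProvider` come from this diff:

```ts
if (provider) {
  // Apply the same provider filter regardless of which model list is in use.
  const models = useAgentModels ? DEFAULT_AGENT_MODELS : ALL_EVAL_MODELS;
  return models.filter((model) => filterModelByProvider(model, provider));
}
```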
My only comment would be that, in general, for LLM-as-a-judge you should enforce JSON output with a schema rather than freeform text. This drives more determinism in my experience, even if it's just {"passed": true}. You can also then pass response_mime_type: "application/json" in Gemini's case, which does token filtering; I believe this is one of the largest drivers of Gemini's data extraction capabilities. Other than that, great idea.

EDIT: Just realized you are essentially doing that already, but providing YES | NO as options inside a JSON object. Apologies.
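As a sketch of the schema-enforced variant, assuming the `llmClient`'s `createChatCompletion` accepts a `response_model` with a zod schema the way the extract path does (the wiring here is illustrative, not part of this PR):

```ts
import { z } from "zod";

// Constrain the judge to a structured verdict instead of freeform text.
const judgeSchema = z.object({
  evaluation: z.enum(["YES", "NO"]),
  reasoning: z.string(),
});

const response = await llmClient.createChatCompletion({
  logger: () => {},
  options: {
    messages, // same messages array as the original call
    response_model: { name: "EvaluationResult", schema: judgeSchema },
  },
});
```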
Thanks for the pointer on the mime type @peytoncasper! I was hoping to integrate that next into the Gemini client abstraction.
  typeof parsedResult.evaluation !== "string" ||
  typeof parsedResult.reasoning !== "string"
) {
  throw new Error(
can we add some custom errors?
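One possible shape for such an error; the class name and extra field are illustrative rather than part of this PR:

```ts
export class EvaluatorParseError extends Error {
  constructor(
    message: string,
    public readonly rawResponse?: string,
  ) {
    super(message);
    this.name = "EvaluatorParseError";
  }
}

// In the validation branch shown above:
// throw new EvaluatorParseError(
//   "Evaluator response must contain string 'evaluation' and 'reasoning' fields",
//   JSON.stringify(parsedResult),
// );
```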
@@ -322,6 +388,7 @@ const generateFilteredTestcases = (): Testcase[] => {
  logger,
  llmClient,
  useTextExtract: shouldUseTextExtract,
  modelName: input.modelName,
Why do we need to pass in the modelName? Is this because agent won't work with external client providers like aiSDK, etc.?
why
Deterministic evals require high maintenance. A more balanced approach, and the one most frontier labs use to evaluate models, is LLM-as-a-judge. Although not as deterministic as our current evals, this new category scales better while randomizing the potential judging error. Introducing:
Stagehand Evaluator
what changed
Created a new class `Evaluator` which reuses a Stagehand object to leverage the `llmClient` functionality, sending a screenshot (currently; maybe text-only in the future) and asking a question or a set of questions (batch).

Example usage:
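A sketch of what this could look like; the method names (`evaluate`, `batchEvaluate`), option fields, and import path are assumptions based on the PR summary rather than the exact API:

```ts
import { Evaluator } from "../../evaluator"; // path is illustrative

// Assumes an initialized Stagehand instance; the Evaluator reuses its llmClient
// and screenshots the current page to answer each question.
const evaluator = new Evaluator(stagehand);

// Single question
const result = await evaluator.evaluate({
  question: "Is the departure city set to San Francisco?",
});

// Batch of questions in one call
const results = await evaluator.batchEvaluate({
  questions: [
    "Is the departure city set to San Francisco?",
    "Is the arrival city set to New York?",
  ],
});
```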
Evaluator will then return a response in the following format:
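The shape below is inferred from the parsing code in this PR; the example values are illustrative:

```ts
interface EvaluationResult {
  evaluation: "YES" | "NO";
  reasoning: string;
}

// e.g.
const example: EvaluationResult = {
  evaluation: "YES",
  reasoning: "The departure city field shows San Francisco as requested.",
};
```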
Also added a few evals for agent that use `Evaluator` as the judge.

test plan