Self-debugging in CodeExecutionAgent #6207


Closed · 1 task done
ekzhu opened this issue Apr 4, 2025 · 3 comments · Fixed by #6306

Labels: code-execution (execute generated code), proj-agentchat
Milestone: 0.4.x-python

Comments

@ekzhu (Collaborator) commented Apr 4, 2025

Confirmation

  • I confirm that I am a maintainer and so can use this template. If I am not, I understand this issue will be closed and I will be asked to use a different template.

Issue body

Follow-up to #6098: add an auto-debugging loop to CodeExecutionAgent so it automatically tries to regenerate code when there is an error.

ekzhu added the code-execution and proj-agentchat labels Apr 4, 2025
ekzhu added this to the 0.4.x-python milestone Apr 4, 2025
@Ethan0456 (Contributor) commented

Hi @ekzhu,

Below is a proposed approach along with a few questions I had while thinking about it:

Questions

  1. State management: Should we store each retry's code and result in model_context, or only update the agent context after a successful execution?
  2. User feedback during retries: Should we yield full code + execution result at every retry step, or keep it simple with messages like "Code execution failed with exit code ..., retrying..."?
  3. User intervention possibility: If we opt to yield full code + result, it opens up the opportunity to delegate control to a user_agent (if present), allowing for interactive feedback to improve the code generation trajectory. Would that align with the intended design?
```python
model_result: CreateResult | None = None
execution_result: CodeExecutionEvent | None = None

for attempt in range(max_code_retries):
    # Generate (or regenerate) code with the model.
    async for inference_output in self._call_llm(
        model_client=model_client,
        model_client_stream=model_client_stream,
        system_messages=system_messages,
        model_context=model_context,
        agent_name=agent_name,
        cancellation_token=cancellation_token,
    ):
        if isinstance(inference_output, CreateResult):
            model_result = inference_output
        else:
            # Streaming chunk event
            yield inference_output

    assert model_result is not None, "No model result was produced."

    ...

    inferred_text_message: CodeGenerationEvent = CodeGenerationEvent(
        content=str(model_result.content),
        code_blocks=self._extract_markdown_code_blocks(model_result.content),
        source=agent_name,
    )

    # Execute the generated code if present; stop retrying on success.
    execution_result = await self.execute_code_block([inferred_text_message], cancellation_token)
    if execution_result.result.exit_code == 0:
        break
```

@ekzhu (Collaborator, Author) commented Apr 6, 2025

> State management: Should we store each retry's code and result in model_context, or only update the agent context after a successful execution?

For simplicity, let's store all events in the model context, including both unsuccessful and successful ones.
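As a minimal sketch of what that could look like inside the retry loop, assuming autogen-core's ChatCompletionContext API (async add_message) and the AssistantMessage/UserMessage model types, and reusing the variable names from the proposal above (including the assumption that the execution result exposes .result.output alongside .result.exit_code):

```python
from autogen_core.models import AssistantMessage, UserMessage

# Inside each attempt, successful or not:
# record the generated code as an assistant message...
await model_context.add_message(
    AssistantMessage(content=str(model_result.content), source=agent_name)
)
# ...and the execution output as a user message, so the next
# regeneration attempt can see the error it has to fix.
await model_context.add_message(
    UserMessage(content=execution_result.result.output, source="code_executor")
)
```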

> User feedback during retries: Should we yield full code + execution result at every retry step, or keep it simple with messages like "Code execution failed with exit code ..., retrying..."?

Yield the full code and result for transparency. Users don't trust models at this point in most scenarios.
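In code terms, that could be as simple as yielding both events on every attempt rather than a terse status string; the variable names below come from the proposal sketch above:

```python
# Surface the full generated code and the raw execution result to the
# caller on each attempt, not just a "retrying..." notice.
yield inferred_text_message   # CodeGenerationEvent: the generated code blocks
yield execution_result        # CodeExecutionEvent: exit code and output
```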

> User intervention possibility: If we opt to yield full code + result, it opens up the opportunity to delegate control to a user_agent (if present), allowing for interactive feedback to improve the code generation trajectory. Would that align with the intended design?

User intervention should happen outside of the agent; it should be part of a team. After a certain number of unsuccessful attempts, the agent should just stop, reflect, and pass control to the next agent, which could be the user if the team is orchestrated that way.
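For illustration, a team-level arrangement along those lines might look like the sketch below. It assumes the 0.4.x autogen_agentchat / autogen_ext APIs (the agent class is named CodeExecutorAgent there, even though this issue calls it CodeExecutionAgent); the exact constructor parameters are assumptions, not taken from this issue.

```python
from autogen_agentchat.agents import CodeExecutorAgent, UserProxyAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.code_executors.local import LocalCommandLineCodeExecutor
from autogen_ext.models.openai import OpenAIChatCompletionClient

model_client = OpenAIChatCompletionClient(model="gpt-4o")

# Agent that generates and executes code; once its internal retry budget
# is exhausted it stops, reflects, and yields the turn like any other agent.
coder = CodeExecutorAgent(
    "coder",
    code_executor=LocalCommandLineCodeExecutor(work_dir="coding"),
    model_client=model_client,
)

# The human sits outside the agent, as a separate team member.
user = UserProxyAgent("user")

# Round-robin orchestration: the coder attempts the task (with self-debugging),
# then the user can give interactive feedback, and the cycle repeats.
team = RoundRobinGroupChat([coder, user])
```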

@ekzhu (Collaborator, Author) commented Apr 6, 2025

I think a simple loop with a counter may not address all cases.

At the end of each iteration, it should use the model to determine whether the code error can be fixed, and if not, exit the loop and return a final response to the caller. We can use structured output for this to ensure the model output can be used in the loop condition.

If the code error can be fixed, it attempts another iteration.

If the code execution succeeds, then apply a final reflection.

It still needs a maximum retry count to avoid the model going off the rails.

Let's experiment with this using a few examples, especially data science and data analytics scenarios.
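As a rough sketch of that control flow (not the actual CodeExecutionAgent implementation), the "can this error be fixed?" decision can be modeled as structured output with Pydantic; the generate/execute/decide/reflect callables here are hypothetical placeholders for the model and executor calls:

```python
from typing import Awaitable, Callable

from pydantic import BaseModel


class RetryDecision(BaseModel):
    """Structured output the model produces after a failed execution."""
    is_fixable: bool
    reason: str


class ExecResult(BaseModel):
    exit_code: int
    output: str


async def self_debug_loop(
    task: str,
    generate: Callable[[str], Awaitable[str]],             # LLM writes code for the task
    execute: Callable[[str], Awaitable[ExecResult]],        # executor runs the code
    decide: Callable[[str], Awaitable[RetryDecision]],      # structured-output call
    reflect: Callable[[str, ExecResult], Awaitable[str]],   # final reflection / summary
    max_retries: int = 3,
) -> str:
    code, result = "", ExecResult(exit_code=1, output="not run")
    for _ in range(max_retries):
        code = await generate(task)
        result = await execute(code)
        if result.exit_code == 0:
            # Success: finish with a final reflection over the trajectory.
            return await reflect(code, result)
        # Ask the model, via structured output, whether the error is fixable.
        decision = await decide(
            f"The code failed with:\n{result.output}\nCan this be fixed?"
        )
        if not decision.is_fixable:
            break  # Judged unfixable: stop early instead of burning retries.
    # Out of retries or unfixable: still return a final response to the caller.
    return await reflect(code, result)
```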

Ethan0456 added a commit to Ethan0456/autogen that referenced this issue Apr 15, 2025
ekzhu closed this as completed in aad6caa Apr 22, 2025