Python AI Agent Framework for Gymnasium Grid Worlds

Reverse engineered prompt

Build me a Python experimental AI agent framework for simple Gymnasium grid worlds, inspired by general game playing agents like SIMA.

I want the agent to follow a natural language mission like “go to the green square and stop”. It should work in a perceive, imagine, plan, act loop. It observes the current scene, uses a vision language model to describe what it sees, simulates possible next actions with an internal world model, asks a language model to choose a smart next move, then acts in the environment and remembers what happened.
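As a rough illustration of the loop described above, here is a minimal sketch that runs without Gymnasium or any model. Everything in it is an assumption made for illustration: `GridWorld` is a toy stand-in for a real environment, `perceive` stands in for the vision language model, `predict` is a trivial world model, and `plan` replaces the language model with a greedy distance rule.

```python
from dataclasses import dataclass

# Hypothetical action set; a real Gymnasium grid world defines its own.
ACTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1), "stop": (0, 0)}


def predict(pos, action, size):
    """Toy world model: predicted next position, clamped to the grid."""
    dx, dy = ACTIONS[action]
    x = min(max(pos[0] + dx, 0), size - 1)
    y = min(max(pos[1] + dy, 0), size - 1)
    return (x, y)


@dataclass
class GridWorld:
    """Toy stand-in for a Gymnasium grid environment (not the real API)."""
    agent: tuple = (0, 0)
    goal: tuple = (2, 1)
    size: int = 4

    def observe(self):
        return {"agent": self.agent, "goal": self.goal}

    def step(self, action):
        self.agent = predict(self.agent, action, self.size)
        return self.observe()


def perceive(obs):
    """Stub for the vision language model: describe the scene as text."""
    return f"agent at {obs['agent']}, green square at {obs['goal']}"


def imagine(obs, size):
    """Simulate every candidate action with the world model."""
    return {action: predict(obs["agent"], action, size) for action in ACTIONS}


def plan(mission, scene, imagined, obs):
    """Stub for the planning model. A real planner would send the mission,
    the scene text, and the imagined outcomes to an LLM; this one just
    picks the move whose imagined outcome is closest to the goal."""
    goal = obs["goal"]
    if obs["agent"] == goal:
        return "stop"
    moves = {a: p for a, p in imagined.items() if a != "stop"}
    return min(moves, key=lambda a: abs(moves[a][0] - goal[0]) + abs(moves[a][1] - goal[1]))


def run_episode(env, mission, max_steps=20):
    """Perceive, imagine, plan, act, then remember what happened."""
    memory = []
    obs = env.observe()
    for step in range(max_steps):
        scene = perceive(obs)
        imagined = imagine(obs, env.size)
        action = plan(mission, scene, imagined, obs)
        obs = env.step(action)
        memory.append({"step": step, "scene": scene, "action": action, "result": obs["agent"]})
        if action == "stop":
            break
    return memory
```

For example, with the default toy grid:

```python
memory = run_episode(GridWorld(), "go to the green square and stop")
print([record["action"] for record in memory])  # ['right', 'right', 'down', 'stop']
```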

Please make it modular so I can swap the environment, memory, model runtimes, and agent logic later. Include configuration for local Ollama models, with separate models for perception and planning. Save readable logs of the agent’s thoughts and actions, and record short videos or gifs of runs.
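One way the model configuration could look is sketched below, assuming settings are read from environment variables. `OLLAMA_HOST` and the default address `http://localhost:11434` follow Ollama's own conventions; the variable names `PERCEPTION_MODEL` and `PLANNING_MODEL`, the `OllamaConfig` class, and the default model names are placeholders chosen for this example, not something the prompt specifies.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class OllamaConfig:
    """Hypothetical settings object: one host, two separately chosen models."""
    host: str
    perception_model: str  # vision language model used to describe the scene
    planning_model: str    # language model used to choose the next action


def load_config(env=None):
    """Build the config from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return OllamaConfig(
        host=env.get("OLLAMA_HOST", "http://localhost:11434"),
        # Placeholder defaults; substitute whichever models are pulled locally.
        perception_model=env.get("PERCEPTION_MODEL", "llava"),
        planning_model=env.get("PLANNING_MODEL", "llama3"),
    )
```

Passing a plain dictionary instead of the real environment keeps the loader easy to exercise in tests, for example `load_config({"PLANNING_MODEL": "mistral"})`.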

Set it up with clear install steps using uv, an example env config, a runnable main entry point, and tests that check the architecture without needing live LLM calls. Look up current docs online if you need to.
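To illustrate what a test without live LLM calls might look like, here is a small sketch using the standard library's `unittest`. The `Planner` protocol, its `choose_action` method, and `FakePlanner` are hypothetical names invented for this example; the idea is only that a scripted fake can stand in for the model runtime while the test checks the interface.

```python
import unittest
from typing import Protocol, runtime_checkable


@runtime_checkable
class Planner(Protocol):
    """Hypothetical interface that any planning runtime would implement."""

    def choose_action(self, mission: str, scene: str) -> str: ...


class FakePlanner:
    """Returns scripted actions and records its inputs; no model is called."""

    def __init__(self, script):
        self.script = list(script)
        self.calls = []

    def choose_action(self, mission, scene):
        self.calls.append((mission, scene))
        return self.script.pop(0)


class ArchitectureTest(unittest.TestCase):
    def test_fake_satisfies_protocol(self):
        # runtime_checkable protocols only check that the method exists.
        self.assertIsInstance(FakePlanner(["stop"]), Planner)

    def test_planner_receives_mission_and_scene(self):
        planner = FakePlanner(["right", "stop"])
        actions = [planner.choose_action("go to the green square", f"scene {i}") for i in range(2)]
        self.assertEqual(actions, ["right", "stop"])
        self.assertEqual(planner.calls[0], ("go to the green square", "scene 0"))


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArchitectureTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The same fake can be handed to the agent in place of a live planner, so the loop's wiring is tested while model quality is left to manual runs.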


hemantjuyal/SIMA2-Agent — reverse-engineered prompt
