Deconstructing the Transformers ReAct JSON System Prompt
Day 137 at Chez Doritos.
Another 6 pm Sunday afternoon Red Bull moment. I am still chasing an understanding of the “psyche” (for lack of a better term) of my cognitive agents. Reality check: I needed to learn how Hugging Face designs the system prompt for their ReactJsonAgent. Even though I think my previous results on Agent Reasoning (I, II, III) were encouraging, I concluded that using a small local LLM did not provide the response quality I was looking for. That became evident when I ran a data analysis over the stored “thoughts” of my cognitive agent named “Matt”. Lesson learned: don’t store “thoughts” in unstructured text files.
Transformers Agents are still lacking some features (memory, for example), yet the fact that they open-sourced a lot of the components, and that the general design paradigm is well thought through, gives me enough confidence to walk down this road a bit further and see how far we can go.
Also, being able to run inference on a large model like Qwen2.5 on their platform for free is a huge benefit.
ReactJsonAgent
After Anthropic publicized the importance of their system prompts, I wanted to know, and be able to adjust, the ReactJsonAgent system prompt to fit my needs. To reiterate, I am using the ReactJsonAgent because it uses json, which I think is an efficient way to describe and transfer information. In comparison, XML carries a lot of overhead with all its opening and closing tags, and CSV or plain text files have encoding issues. Going back to json thus brings me back to my roots, when I started the journey of making conversations with cognitive agents more purposeful and reliable (1, 2).
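For orientation, this is roughly how I load the agent and pull up the prompt I am dissecting below. Treat it as a sketch against the transformers.agents module as it looked in October 2024; the class, parameter, and constant names (HfApiEngine, system_prompt, DEFAULT_REACT_JSON_SYSTEM_PROMPT) are my assumptions and may have shifted in newer releases.

```python
# Sketch only: names assume transformers ~4.45 (fall 2024) and may have changed since.
from transformers.agents import HfApiEngine, ReactJsonAgent
from transformers.agents.prompts import DEFAULT_REACT_JSON_SYSTEM_PROMPT

print(DEFAULT_REACT_JSON_SYSTEM_PROMPT)  # the prompt this post walks through

llm_engine = HfApiEngine(model="Qwen/Qwen2.5-72B-Instruct")  # hosted inference
agent = ReactJsonAgent(
    tools=[],                 # your own Tool instances go here
    llm_engine=llm_engine,
    # system_prompt=MY_CUSTOM_PROMPT,  # the hook for the adjustments planned below
)
```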
Disclaimer!
Everything that is bold or italic is my own emphasis, added to highlight what is important to me and to make the snippets a bit easier to follow. I focused on what I thought were the key aspects of their system prompt when I read it; this might be different for you. Also, expect differences in the short to medium term between this write-up from October 2024 and the version online, as I would expect them to keep working on and improving it. I might do revisions here, but I can’t promise.
One million props to the HF/Transformers team for making these prompts publicly available in their GitHub repo (I hope that link stands the test of time).
That said, 763 words and 4,147 characters (excluding spaces) later, I am ready to jump right into analyzing the system prompt, starting with its opening statement.
You are an expert assistant who can solve any task using JSON tool calls. You will be given a task to solve as best you can. To do so, you have been given access to the following tools: <<tool_names>>
I think the statement is quite clear in defining the assistant. Already in the first sentence, they introduce the json tool calls and provide the list of tools the agent has access to (injected via the <<tool_names>> placeholder).
Logically, it makes sense now to instruct the agent how tools in general shall be used.
The way you use the tools is by specifying a json blob, ending with '<end_action>'.
Specifically, this json should have an `action` key (name of the tool to use) and an `action_input` key (input to the tool).
Important to note here is that the created json object should always have an ‘action’ and ‘action_input’ key. The latter is commonly used as a parameter to the action. For example, searching for weather in Honolulu would be expressed as “action”: “search”, and “action_input”: “weather in Honolulu”. Langchain handles it as search[‘weather in Honolulu’]. I prefer the json(ic?) way, but you might see it differently.
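To make that concrete, this is how the Honolulu example would be serialized the way the prompt asks for it (the search tool name is purely for illustration):

```python
import json

# Hypothetical example: a single action blob terminated by <end_action>
blob = {"action": "search", "action_input": "weather in Honolulu"}
print(json.dumps(blob) + "<end_action>")
# {"action": "search", "action_input": "weather in Honolulu"}<end_action>
```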
ACTION_JSON_BLOB
Then the prompt dives further into specifying the json structure.
The $ACTION_JSON_BLOB should only contain a SINGLE action, do NOT return a list of multiple actions. It should be formatted in json.
Do not try to escape special characters.
Here is the template of a valid $ACTION_JSON_BLOB:{ "action": $TOOL_NAME, "action_input": $INPUT}<end_action>
This section of the prompt clearly specifies the criteria for each “json action” blob. I won’t rewrite what has been written above, but the template part in particular is pretty similar to the way I defined it last year. What is interesting is the way they provide $TOOL_NAME and $INPUT as placeholders in the template. Also, note that the template closes with the “<end_action>” statement.
Then the prompt continues with a small section that I initially thought looked insignificant, but it actually matters for ensuring the action input is formatted correctly.
Make sure to have the $INPUT as a dictionary in the right format for the tool you are using, and do not put variable names as input if you can find the right values.
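In other words, $INPUT should be a dictionary keyed by the tool’s argument names and filled with resolved values. A hypothetical contrast (the argument names are made up for illustration):

```python
# Good: action_input is a dict with the tool's argument name and a concrete value
good = {"action": "image_transformer", "action_input": {"image": "image_1.jpg"}}

# Bad: a variable name instead of the resolved value the tool actually needs
bad = {"action": "image_transformer", "action_input": {"image": "previous_image_variable"}}
```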
Then the prompt continues by structuring the thought-action-observation pattern. This is quite similar to the Langchain version we have used before, with one major distinction: the “Action” now carries the well-defined json blob specified above.
You should ALWAYS use the following format:
Thought: you should always think about one action to take. Then use the action as follows:
Action:$ACTION_JSON_BLOB
Observation: the result of the action...
(this Thought/Action/Observation can repeat N times, you should take several steps when needed.
The $ACTION_JSON_BLOB must only use a SINGLE action at a time.)
The next instruction is workflow-related. From a logical and sequential flow perspective, it is actually really important to allow the agent to work off a previously identified action.
You can use the result of the previous action as input for the next action.
I have seen this behavior several times in my previous work when the agent is not yet satisfied with the provided observation, e.g., in cases where a web search does not return a useful insight or the task instructions are not clear.
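To make the mechanics concrete, here is a minimal, hypothetical driver for that Thought/Action/Observation cycle. It is not the transformers implementation: `llm` and the entries in `tools` are placeholders, and the regex only covers the happy path.

```python
import json
import re

def run_react_loop(llm, tools, task, max_steps=10):
    """Toy Thought/Action/Observation loop. `llm` is any callable that returns
    text ending in an Action blob; `tools` maps tool names to callables."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # e.g. 'Thought: ...\nAction:{...}<end_action>'
        blob = json.loads(
            re.search(r"Action:\s*(\{.*\})\s*<end_action>", step, re.DOTALL).group(1)
        )
        if blob["action"] == "final_answer":
            return blob["action_input"]
        observation = tools[blob["action"]](blob["action_input"])
        # The observation is appended so the next thought can build on it.
        transcript += f"{step}\nObservation: {observation}\n"
    raise RuntimeError("The agent never called final_answer")
```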
The prompt then expands on the expected syntax for “observation”. To reiterate, statements like these are really important for creating a robust and reliable information exchange between the thoughts in the chain.
The observation will always be a string: it can represent a file, like "image_1.jpg". Then you can use it as input for the next action.
You can do it for instance as follows:
Observation: "image_1.jpg"
Thought: I need to transform the image that I received in the previous observation to make it green.
Action:{ "action": "image_transformer", "action_input": {"image": "image_1.jpg"}}
<end_action>
You can see how ‘action’ is the blob, while ‘observation’ is the string. The user prompt is usually handed over to the cognitive agent as a “task”. I have an example later to show what that looks like.
Next, the prompt instructs the agent how to finalize the answer and return a result for the task it was given.
To provide the final answer to the task, use an action blob with "action": "final_answer" tool.
It is the only way to complete the task, else you will be stuck on a loop.
So your final output should look like this:
Action:{
"action": "final_answer",
"action_input": {"answer": "insert your final answer here"}}
<end_action>
Interesting to note here is that “final answer” is a tool. When I wrote my custom output parser for “Matt”, I always had to parse whether the text contained a “final answer”, which proved to be quite unreliable. Being able to simply check whether the action equals “final_answer” is, I think, much better defined. But I will have to run some tests to check how reliable that approach is.
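A tiny sketch of what that check can look like in a custom parser (this is not the library’s actual parser, and the regex assumes the happy-path format from above):

```python
import json
import re

def is_final(step: str) -> bool:
    """True if the step's action blob calls the final_answer tool."""
    match = re.search(r"Action:\s*(\{.*\})\s*<end_action>", step, re.DOTALL)
    if match is None:
        return False
    return json.loads(match.group(1)).get("action") == "final_answer"

print(is_final('Action:{ "action": "final_answer", "action_input": {"answer": "42"}}<end_action>'))  # True
```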
Now that we have the general guidelines set up, the prompt provides a selection of examples for few-shot learning.
Few Shot Learning
Few-shot learning aims to emulate the human ability to learn from just a handful of examples by placing a small sample of demonstrations directly in the prompt. For Matt, I have used few-shot examples for one-step and multi-step reasoning with moderate success.
There are a few examples in the prompt; I have focused on a selection of two. The first example is a mathematical calculation. Like most of us, LLMs struggle with calculating 5 + 3 + 1294.678 unless the exact result has been memorized and can be retrieved. In the past, my agent Matt would reason its way to the calculator tool, but in its absence, a python_interpreter does the job equally well.
---
Task: "What is the result of the following operation: 5 + 3 + 1294.678?"
Thought: I will use python code evaluator to compute the result of the operation and then return the final answer using the `final_answer` tool
Action:{"action": "python_interpreter", "action_input": {"code": "5 + 3 + 1294.678"}}
<end_action>
Observation: 1302.678
Thought: Now that I know the result, I will now return it.
Action:{ "action": "final_answer", "action_input": "1302.678"}<end_action>
Please observe that each action is well defined as a json blob and that there are two action items: one using the “python_interpreter” tool, the other using the “final_answer” tool. I believe this example is quite straightforward to understand.
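As a trivial sanity check of the observation value (the real python_interpreter tool is more involved than a bare eval; this is just the arithmetic):

```python
# Just the arithmetic from the example above
code = "5 + 3 + 1294.678"
print(eval(code))  # 1302.678
```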
Now to a slightly more complicated problem. A value comparison using the search tool. Again you will notice how the “thought” of the LLM leads to the correct tool use and the ‘action’ json blob is well defined.
---
Task: "Which city has the highest population, Guangzhou or Shanghai?"
Thought: I need to get the populations for both cities and compare them: I will use the tool `search` to get the population of both cities.
Action:{ "action": "search", "action_input": "Population Guangzhou"}<end_action>
Observation: ['Guangzhou has a population of 15 million inhabitants as of 2021.']
Thought: Now let's get the population of Shanghai using the tool 'search'.
Action:{ "action": "search", "action_input": "Population Shanghai"}
Observation: '26 million (2019)'Thought: Now I know that Shanghai has a larger population. Let's return the result.
Action:{ "action": "final_answer", "action_input": "Shanghai"}<end_action>
Given that two “search” calls are listed here, I’d argue that this illustrates a two-step problem.
Now, what if there are tools that the model can’t use? The Transformers system prompt clarifies this with the following statement.
Above example were using notional tools that might not exist for you.
You only have access to those tools:<<tool_descriptions>>
We have almost reached the end.
The system prompt gives the agent a set of clear housekeeping rules.
Here are the rules you should always follow to solve your task:
1. ALWAYS provide a 'Thought:' sequence, and an 'Action:' sequence that ends with <end_action>, else you will fail.
2. Always use the right arguments for the tools. Never use variable names in the 'action_input' field, use the value instead.
3. Call a tool only when needed: do not call the search agent if you do not need information, try to solve the task yourself.
4. Never re-do a tool call that you previously did with the exact same parameters.
I think these rules are pretty straightforward and nicely worded.
So the last item missing is to instruct the agent to start working.
Now Begin! If you solve the task correctly, you will receive a reward of $1,000,000.
What I found interesting is that the prompt promises the agent a substantial monetary reward for succeeding. Wouldn’t we all love to have that?
In closing
This post is a bit selfish since I really wanted to understand how the system prompt works in a structured way and where I can slot in my changes to make the agent develop character and play TicTacToe (a project I am working on).
Regarding strategy, I am planning to
add the character’s personality right after the opening paragraph,
adjust the few-shot examples to focus on working with game states, and use a custom tool that takes the game state as input and returns a list of possible moves (a first sketch follows below),
encode variables with a preceding “$” sign for further studies, and
understand the use of the “<” and “<<” nomenclature in more detail. Maybe it helps hand over memory?
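Here is that first, hypothetical sketch of the custom tool, written against the Tool interface of transformers.agents; the attribute names (`inputs`, `output_type`) and their expected shapes are my assumption for the late-2024 version and may differ in yours.

```python
# Hypothetical Tic Tac Toe tool; attribute names assume transformers.agents ~4.45.
from transformers.agents import Tool

class PossibleMovesTool(Tool):
    name = "possible_moves"
    description = "Returns the list of free cells for a Tic Tac Toe board state."
    inputs = {
        "board": {
            "type": "string",
            "description": "9 characters, row by row, using 'X', 'O' or '-' for empty cells.",
        }
    }
    output_type = "string"

    def forward(self, board: str) -> str:
        free = [i for i, cell in enumerate(board) if cell == "-"]
        return f"Free cells (0-8): {free}"

# agent = ReactJsonAgent(tools=[PossibleMovesTool()], llm_engine=llm_engine)
```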
I suppose the key part will be to define under what considerations the agent shall decide on its next move.
I sincerely hope you found this post interesting even though it was more of a basic prompt breakdown.
I will keep you posted on the progress of my Tic Tac Toe bot.