A short story of using QWen3.6:35b-a3b with codex

05 Jul 2026 - tsp
Last update 05 Jul 2026
Reading time 19 mins

So we all know OpenAI's codex is totally awesome. Especially with GPT 5.5 it is amazing as soon as you know how to use it (which means: don’t tell it “write me an app” but handle it like a senior developer would handle their juniors). But then we also all know that any subscription runs out of credits very fast, and when one uses an on demand plan - for example via the platform API - one goes broke just as fast because it gets expensive. Especially with the looping /goal one can accumulate cost very quickly.

In addition it is already very simple to run local large language models on consumer hardware, for example using ollama, even on small scale systems. These models are pretty amazing for small scale local jobs that usually do not require long term planning: knowledge graph extraction, embedding vectors, simple chats, interpreting your e-mails and similar applications. Running locally trades the continuous cost of subscriptions for upfront hardware cost and electricity cost.

The main problem with self hosted LLMs at the moment is available VRAM (or RAM on shared memory systems like Apple Silicon or the new Ryzen AI series). But even on larger consumer systems one is pretty limited. To run a large model like GLM 5.2 on your own hardware in a reasonable way you would need around 8x H100 80GB or 4 to 8 H200 141GB GPUs, a high end x86 CPU with AVX512, more system memory than VRAM (i.e. we are talking about 1TB of RAM) - and of course a mainboard, power supplies and cooling to run that system. This is technically doable but usually in the price range of around 200.000 - 300.000 EUR for the GPUs alone. This is not feasible for private use (at least if you are not a billionaire - and if you are, you will most likely know that this would be a total waste of money for personal use and would still fall back to a hosted cloud service).

Smaller models though work very well on consumer hardware. The 35B MoE models run on a single 24GB VRAM GPU, 70B models on somewhat larger GPUs. In my experience they are noticeably weaker at long-range planning, maintaining coherence over many reasoning steps, and robust recovery from mistakes (which is likely attributable to missing emergent properties of larger scale models). Still, many projects have shown that even small scale models can operate perfectly well for specific tasks when run in agentic frameworks that exploit planning and refinement loops. That is exactly the idea behind using such models with orchestrators like codex. For more complex tasks a growing context can introduce problems with these models though, as we will see in the bugs section of this article.

TL;DR: You can use small scale models with codex and sometimes the output is acceptable. It's not at all comparable with the performance of frontier models like GPT 5.4 or GPT 5.5 though. If you want a robust and efficient assistant for software development, don’t bother using the ollama backend. If you can accept many quirks, runaway context and the occasional injected deletion, and want to play around, using a smaller model with codex might get interesting.

Model Deployment
Configuration using ollama Directly
Encountered bugs
Appendix: What the Model Parameters Mean
Conclusion
References

Model Deployment

I assume in the following that we are going to use ollama as the runtime for the large language models since this is the easiest runtime available for end users. One could of course also use vLLM, which is the runtime of choice for distributed MoE models, or the server that ships with llama.cpp.

The best model that I have used so far on smaller scale systems like the dual 12GB RTX3060 setup was in my opinion qwen3.6:35b-a3b, which is also directly available in ollama's model library. This is a mixture of experts model with 35B parameters in total, of which only about 3B are active for each token. Note that OpenAI themselves, who released codex, usually suggest (and target codex at) their own gpt-oss model. To install the qwen3.6-35B-A3B model on ollama one can simply pull it via

ollama pull qwen3.6:35b-a3b

Unfortunately the default parameters for this model are not really suited for operation with codex, especially due to the thresholds for repeated token output. The default parameters I found on the latest version I played with were:

Model
  architecture       qwen35moe
  parameters         36.0B
  context length     262144
  embedding length   2048
  quantization       Q4_K_M

Capabilities
  completion
  vision
  tools
  thinking

Parameters
  presence_penalty   1.5
  repeat_penalty     1
  temperature        1
  top_k              20
  top_p              0.95
  min_p              0

License
  Apache License Version 2.0, January 2004

An explanation of the parameters can be found in the appendix. The main problems for useful execution with codex were:

The temperature. A high temperature flattens the probability distribution and thereby allows the sampler to get creative. Since this sometimes leads to swapped keywords (like continue instead of return) or similar mistakes, it’s a good idea to keep the temperature low for coding tasks.
The context size (num_ctx). A huge context is good. But it also needs a huge amount of VRAM for the KV cache. One may want to limit the context window to 64k (65536) or 128k (131072) to keep the size of the KV cache manageable.
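The sampling parameters top_k, top_p and min_p, which are also tuned for creativity rather than for coding, again due to the probability of swapping keywords.
The penalties. The presence penalty (1.5) is too high for code, where identifiers and keywords have to repeat consistently, while the repeat penalty (1, which means no penalty at all) is too low in some cases.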

This led to a modified Modelfile for the model. I used the following parameters and also overrode the system prompt with some very simple instructions that turned out to be sufficient:

FROM qwen3.6:35b-a3b

PARAMETER num_ctx 65536
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER top_p 0.8
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.08
PARAMETER repeat_last_n 1024
PARAMETER presence_penalty 0

SYSTEM """
You are a coding agent. Be precise and concise.
Do not repeat yourself.
Do not emit long hidden reasoning.
When using tools, produce valid tool calls only.
When you are uncertain, summarize the uncertainty and stop rather than looping.
"""

This can then be used to create a new model identity:

ollama create qwen3.6:35b-a3b-codex -f Modelfile

This turned out to work way better for me than the unmodified model.

Configuration using `ollama` Directly

So first - how can one use this when codex communicates directly via the responses API with ollama? One needs:

A profile configuration
A metadata JSON file
More patience than using cloud GPT 5.5 for sure

The profile configuration resides in ~/.codex/NAME.config.toml (I personally used ~/.codex/qwen.config.toml). Note that the model_catalog_json path and name are arbitrary, the base_url of course has to point to the ollama instance at the correct port.

model_provider = "ollama-local"
model_context_window = 65536
model_auto_compact_token_limit = 56000
model_catalog_json = "/usr/home/USERNAME/.codex/qwen-models.json"

[model_providers.ollama-local]
name = "Ollama local"
base_url = "http://198.51.100.1:1234/v1"
wire_api = "responses"
requires_openai_auth = false
supports_websockets = false
stream_idle_timeout_ms = 3000000
stream_max_retries = 1

What these settings do is:

Restrict the context window size to something manageable. One has to make sure this is compatible with num_ctx from before.
Set an auto compactification limit after which the context will get compactified. Note that multiple compactification steps can really degrade model performance massively.
Point to the model_catalog_json that contains metadata about the model itself.
Override the model provider configuration to not use the OpenAI authentication token and to not try to use the WebSockets API.
Point to the proper backend, which can be overridden by an environment variable as shown later.
Massively increase the stream idle timeout.

In addition the metadata file specified under model_catalog_json is required to provide capability information about the model to the codex runtime. It tells codex about properties of the template, reasoning support and reasoning levels of the model, the context window configuration (one again has to ensure that this is consistent with the num_ctx of the model) as well as embedded tool support:

{
  "models": [
    {
      "slug": "qwen3.6:35b-a3b-codex",
      "display_name": "Qwen3.6 35B A3B Codex",
      "description": "Local Qwen3.6 35B A3B via Ollama, tuned for Codex OSS.",
      "provider": "ollama-local",
      "visibility": "list",
      "supported_in_api": true,
      "priority": 100,

      "default_reasoning_level": "low",
      "supported_reasoning_levels": [
        {
          "effort": "low",
          "description": "Fast local reasoning"
        },
        {
          "effort": "medium",
          "description": "Balanced local reasoning"
        }
      ],

      "supports_reasoning_summaries": false,
      "default_reasoning_summary": "none",
      "support_verbosity": true,
      "default_verbosity": "low",

      "shell_type": "shell_command",
      "apply_patch_tool_type": "freeform",
      "web_search_tool_type": "text_and_image",
      "supports_parallel_tool_calls": false,
      "supports_image_detail_original": false,

      "context_window": 32768,
      "max_context_window": 32768,
      "effective_context_window_percent": 85,

      "truncation_policy": {
        "mode": "tokens",
        "limit": 10000
      },

      "experimental_supported_tools": [],
      "input_modalities": ["text"],
      "supports_search_tool": false,

      "base_instructions": "You are Codex, a coding agent. Work in short, precise steps. Use tools carefully. Do not repeat yourself. When tool calls are needed, emit valid tool calls only. If you are stuck, summarize the blocker and stop instead of looping."
    }
  ]
}
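Since the context size now lives in three places (num_ctx in the Modelfile, model_context_window in the profile and context_window in the catalog), a small script can report where they disagree. The following is only a sketch: the function names are made up for this example, the profile is read with a simple regular expression instead of a real TOML parser to stay dependency free, and it only reports differences. Which value codex actually prefers when the files disagree has not been verified here.

```python
import json
import re


def read_int(text, pattern):
    """Return the first integer captured by pattern, or None if it is absent."""
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else None


def check_context_settings(modelfile, profile_toml, catalog_json):
    """Collect warnings about context sizes that do not fit together."""
    num_ctx = read_int(modelfile, r"^PARAMETER\s+num_ctx\s+(\d+)")
    window = read_int(profile_toml, r"^model_context_window\s*=\s*(\d+)")
    compact = read_int(profile_toml, r"^model_auto_compact_token_limit\s*=\s*(\d+)")
    warnings = []

    if num_ctx and window and window > num_ctx:
        warnings.append(f"profile window {window} exceeds num_ctx {num_ctx}")
    if window and compact and compact >= window:
        warnings.append(f"compaction limit {compact} is not below profile window {window}")

    for model in json.loads(catalog_json)["models"]:
        ctx = model.get("context_window")
        if ctx is None:
            continue
        slug = model.get("slug", "?")
        if num_ctx and ctx > num_ctx:
            warnings.append(f"{slug}: catalog window {ctx} exceeds num_ctx {num_ctx}")
        if window and ctx != window:
            warnings.append(f"{slug}: catalog window {ctx} differs from profile window {window}")
        if compact and compact >= ctx:
            warnings.append(f"{slug}: compaction limit {compact} is not below catalog window {ctx}")
    return warnings


# Reduced versions of the three files from this article.
modelfile = "FROM qwen3.6:35b-a3b\nPARAMETER num_ctx 65536\n"
profile = "model_context_window = 65536\nmodel_auto_compact_token_limit = 56000\n"
catalog = '{"models": [{"slug": "qwen3.6:35b-a3b-codex", "context_window": 32768}]}'

for warning in check_context_settings(modelfile, profile, catalog):
    print(warning)
```

With the values shown in this article the sketch reports two things: the catalog lists 32768 tokens while the profile and the Modelfile use 65536, and the compaction limit of 56000 lies above the catalog value. A catalog value below num_ctx is the safe direction with respect to overflow, but it is worth deciding deliberately which number one wants.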

Having those files in place it is possible to launch codex using

codex --oss -p qwen -m qwen3.6:35b-a3b-codex

or, setting the CODEX_OSS_BASE_URL via the environment instead:

env CODEX_OSS_BASE_URL="http://198.51.100.1:1234/v1" codex --oss -p qwen -m qwen3.6:35b-a3b-codex

This is already enough to use the model.

Encountered bugs

stream closed before response.completed

Unfortunately this was the error that appeared most of the time when something failed. The exact codex output is

stream disconnected before completion: stream closed before response.completed

This turned out to have several causes:

A response that is too slow, because loading the model took long or because the system was blocked by a concurrent user of the same backend. Increase the idle timeout in such cases. Usually the idle timeout only plays a role if the backend is used concurrently by different clients.
The model failing due to context overflow when the backend runs with a smaller context window than stated in the metadata JSON. This can be solved by providing correct metadata in the JSON and limiting the context size in the profile configuration.
The model aborting due to repeated patterns. This seems to happen often when the context grows too far with qwen models.

The most common error is the last one - the model aborting due to repeated patterns. This happens especially when the context window reaches a larger size, causing Qwen's output to degenerate into a thesaurus-like continuation. This is one of the things that happens when running small language models with a larger context. To approach those problems one can:

Reduce the context size
Trigger compactification earlier
Modify the system prompt and the prompts to split the tasks into smaller parts
Solve easier tasks
Reduce reasoning complexity

This is one of the major drawbacks of small self hosted models in comparison to well tuned large scale cloud models.

Tool calls not working

This also happened from time to time using qwen models in codex. At some point after they ran for some time they seem to stop producing proper tool calling output. Then the console gets flooded with output like

<function=exec_command>
<parameter=cmd>
python3 -m pytest tests/ -v --tb=short
</parameter>
</function>
</tool_call>

It's pretty obvious that this happens whenever the model stops emitting the initial <tool_call> tag - for whatever reason. Like most bugs, this also gets more frequent with a growing context.
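To make the structure of that leaked text explicit, here is a minimal sketch of a parser that extracts the function name and its parameters whether or not the opening <tool_call> tag is present. Nothing in this article suggests that codex does this; the function name parse_tool_calls and the regular expressions are made up for illustration and only cover the shape of the output shown above.

```python
import re

# Patterns for the tag shapes seen in the leaked output above.
FUNCTION_RE = re.compile(r"<function=([\w.-]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([\w.-]+)>(.*?)</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract (name, arguments) pairs, with or without the <tool_call> wrapper."""
    calls = []
    for name, body in FUNCTION_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAMETER_RE.findall(body)}
        calls.append((name, arguments))
    return calls


leaked = """<function=exec_command>
<parameter=cmd>
python3 -m pytest tests/ -v --tb=short
</parameter>
</function>
</tool_call>"""

print(parse_tool_calls(leaked))
# [('exec_command', {'cmd': 'python3 -m pytest tests/ -v --tb=short'})]
```

The intent of the model is fully recoverable from the text. What is missing is only the opening tag that lets the chat template recognise the block as a tool call instead of ordinary output.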

Qwen-coder loves to delete

It seems that qwen - as soon as the context grows - loves to approach problems with a simple solution: it often suggests deleting the entire codebase over a simple indentation error. This behaviour is well known from earlier small models when the context grows. It behaves like many beginners in this case - it does not try to understand bugs but simply drops everything and wants to start over.

Appendix: What the Model Parameters Mean

Temperature

This is the best-known parameter. The model computes probabilities

$$P(t_i)$$

for each possible next token. Temperature rescales these probabilities before sampling:

$$P'(t_i) \propto P(t_i)^{\frac{1}{T}}$$

where $T$ is the temperature.

Low temperature (0–0.3)

In this configuration the highest probability token almost always wins. For example, if the network outputs

Probability	Token
0.82	`return`
0.09	`yield`
0.05	`break`
0.04	`continue`

At temperature 0.1 the model will nearly always emit return, which is excellent for programming.

Temperature = 1

The distribution is unchanged. The model occasionally chooses the second or third best option (inserting yield or break instead of return).

High temperature (>1)

The distribution becomes flatter, which is amazing for creative writing. Unlikely tokens may be sampled. Absolutely terrible for programming.
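The three regimes can be reproduced with a few lines of Python applied to the example table from above. This is a minimal sketch of the rescaling formula only; the function name is made up, and a temperature of exactly 0 (which runtimes usually treat as greedy decoding) is not covered by the formula and is rejected here.

```python
def apply_temperature(probabilities, temperature):
    """Rescale with P'(t) proportional to P(t)^(1/T) and renormalise to sum 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {token: p ** (1.0 / temperature) for token, p in probabilities.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}


# The example distribution from the table above.
example = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}

for temperature in (0.1, 1.0, 2.0):
    rescaled = apply_temperature(example, temperature)
    print(temperature, {token: round(p, 4) for token, p in rescaled.items()})
```

At T = 0.1 practically all probability mass ends up on return, at T = 1 the table is reproduced unchanged, and at T = 2 the probability of return drops to roughly 0.56 while the three alternatives gain weight.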

top_k

This restricts the probabilistic selection to the k most likely tokens. If one applies top_k = 2 to the above example, the model will only choose between the two options return and yield.

top_p

This filter does not select a fixed number of best candidates - candidates are selected by adding up their probabilities, starting with the most likely token. As long as the total probability in the selected pool stays below top_p another token is added to the candidate list. This is also called nucleus sampling.

min_p

This option drops every token that has a probability below $P_\mathrm{maxtoken} \cdot \mathrm{min}_p$, where $P_\mathrm{maxtoken}$ is the probability of the most likely token. In many cases this produces more stable results than filtering via top_p.
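The three filters can be compared directly on the example distribution from the temperature section. The functions below are a simplified sketch with made-up names. They only return the surviving candidates; a real sampler renormalises the remaining probabilities afterwards and may apply the filters in a different order.

```python
def ranked(probabilities):
    """Tokens sorted from most to least likely."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


def top_k_filter(probabilities, k):
    """Keep only the k most likely tokens."""
    return dict(ranked(probabilities)[:k])


def top_p_filter(probabilities, top_p):
    """Add tokens in order of likelihood until their total reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, p in ranked(probabilities):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


def min_p_filter(probabilities, min_p):
    """Drop tokens below min_p times the probability of the best token."""
    threshold = max(probabilities.values()) * min_p
    return {token: p for token, p in probabilities.items() if p >= threshold}


example = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}

print("top_k=2    ", list(top_k_filter(example, 2)))
print("top_p=0.8  ", list(top_p_filter(example, 0.8)))
print("top_p=0.95 ", list(top_p_filter(example, 0.95)))
print("min_p=0.05 ", list(min_p_filter(example, 0.05)))
```

For this distribution top_k = 2 keeps return and yield, top_p = 0.8 keeps only return (0.82 alone already reaches the limit), top_p = 0.95 keeps return, yield and break, and min_p = 0.05 drops only continue since the threshold is 0.82 * 0.05 = 0.041.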

repeat_penalty

This factor is used as a divisor for the score of every token that has recently been emitted (in llama.cpp based runtimes it is applied to the raw scores, the logits, and not to the final probabilities). This discourages the repeated emission of the same token without suppressing it completely. Typical values are in the range of 1 to 1.15. If the value is too high, legitimate repetition gets suppressed.

repeat_last_n

This is the sliding window that repeat_penalty is applied to. Larger values prevent long loops but can slightly reduce consistency because earlier identifiers are also penalized.

presence_penalty

This is similar to repeat_penalty but does not depend on how often the token was encountered: it is applied once to every token that appeared at least once in the sliding window. This reduces the probability of word repetition in text. Large presence penalties are usually undesirable for programming because identifiers and keywords often need to repeat consistently.

frequency_penalty

This is a penalty that scales with the number of times a token has been reused. Again, not optimal for code.
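How the three penalties act together can be sketched as follows. This assumes llama.cpp style semantics (the repeat penalty acts multiplicatively on the raw logits, presence and frequency penalties are subtracted); other runtimes may differ in detail. The function name is made up, and recent_tokens stands for the last repeat_last_n tokens of the output, which the caller is expected to cut out.

```python
from collections import Counter


def apply_penalties(logits, recent_tokens, repeat_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of logits with penalties applied to recently seen tokens."""
    counts = Counter(recent_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        score = adjusted[token]
        # Multiplicative part: shrink positive scores, push negative ones further down.
        score = score / repeat_penalty if score > 0 else score * repeat_penalty
        # Additive part: a flat amount once per seen token plus an amount per occurrence.
        score -= presence_penalty + frequency_penalty * count
        adjusted[token] = score
    return adjusted


logits = {"a": 2.0, "b": -1.0, "c": 0.5}
window = ["a", "a", "b"]  # tokens inside the repeat_last_n window

print(apply_penalties(logits, window, repeat_penalty=2.0))
print(apply_penalties(logits, window, presence_penalty=1.0))
print(apply_penalties(logits, window, frequency_penalty=0.5))
```

The token c never appeared in the window and keeps its score in all three cases. With presence_penalty both a and b lose the same amount, while with frequency_penalty the token a loses twice as much as b because it appeared twice.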

Context length (num_ctx)

This is the maximum size of the attention window. Increasing the context length increases the short term memory of the model and with it its context dependent behaviour. It also increases the KV cache memory approximately linearly and the computational cost of standard attention roughly quadratically.
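The linear growth of the KV cache can be estimated with a one line formula for a standard transformer. The sketch below uses a hypothetical architecture (40 layers, 8 KV heads, head dimension 128, 16 bit values); these are not the real numbers of qwen3.6, and models with hybrid or linear attention layers or a quantized KV cache need less memory than this estimate.

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for the keys and values of a standard transformer, in bytes."""
    # Factor 2: one key and one value vector per layer, KV head and position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Hypothetical architecture, chosen only to illustrate the scaling.
for num_ctx in (65536, 131072, 262144):
    size = kv_cache_bytes(num_ctx, n_layers=40, n_kv_heads=8, head_dim=128)
    print(f"num_ctx={num_ctx:>6}: {size / 2**30:.0f} GiB")
```

With these made-up numbers the cache takes 10 GiB at 64k, 20 GiB at 128k and 40 GiB at the full 262144 tokens, in addition to the model weights. This is why limiting num_ctx matters so much on 24GB class hardware.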

Maximum response size (num_predict)

This is the maximum number of tokens an LLM is allowed to generate in a single response. Reducing it prevents too much gibberish from being generated in a single response, while it of course also reduces the amount of information that an LLM can generate. It is primarily used as a safeguard to prevent infinite loops, cap costs and protect system resources.

Conclusion

It is entirely possible to run complex long range planning orchestrators and agent frameworks like codex using self hosted models. It takes some work to get them up and running, and neither the quality of the output nor the long range planning will be what one is used to from frontier models like GPT 5.4 or GPT 5.5. To achieve that level of reasoning and long term planning one would need a large scale model like GLM 5.2 on very expensive hardware, which is not economical at all for private use unless the privacy of running a local model is the primary reason. When one thinks about buying such hardware one also has to take into account that models will grow over time - one's hardware won't.

If, on the other hand, one wants to play around, needs agents that do long running jobs or wants to hammer an LLM with requests, running an LLM offline is often an interesting approach. The gains are obvious:

One owns the hardware and has a copy of the model weights. No company will ever decide that the model is obsolete and not usable anymore.
The data never leaves your facility.
The only costs incurred are electricity and the upfront payment for the hardware, which is especially interesting when one uses fully automated agentic solutions that can produce a huge number of requests, burning tokens and thus money on typical cloud backends.

Quality of the generated code

First the positive:

The repository was structured coherently. It separated configuration, job models, scheduling, network clients, daemonization and CLI as well as tests.
Unit tests looked largely plausible and were readable.
Type hints and docstrings were used in Python code

And the negative:

The model often produced indentation errors in Python and was unable to fix them.
The model forgot to wire separate components of a small scale project together.
It was not able to parse simple structures like crontabs, not even with a /goal.
It generated test cases but these were broken.
It had inconsistencies between loading and saving state - which again deviated from the spec.
When performing git operations it also included artifacts that never should go into a repository (credentials, __pycache__, etc.).

Overall the model was comparable to a junior developer who can draft a service shaped codebase quickly but does not reliably close the loop by running, testing and reconciling design assumptions with actual runtime behaviour.

The resulting code was way below production quality, parts of it not even working at all. LLMs on this scale - in contrast to large scale models like GPT 5.4 or GPT 5.5 - are merely some kind of auto-complete …

References

Runtimes:
- Ollama is the runtime of my choice when I want small scale models that just run with near zero configuration and setup effort.
- vLLM is the runtime of choice when running persistent models distributed over multiple backends (for example complex MoE models).
- llama.cpp, which is also the backend of ollama, is another option to run models using its builtin server. In addition it includes tools for finetuning and low rank adaptation.
Models:
- qwen3-coder, also available in ollama's model library in 30b and 480b variants, is one of the most capable medium scale coding models. It's based on a mixture of experts architecture.
- qwen3.6-35B-A3B, also available in the ollama model library, is a successor of the previous models. A blog post by qwen describes the model in detail.
- gpt-oss is OpenAI's open weights model. It is available in a 20b and a 120b variant. This is also the default model for codex in --oss mode.
- z-AI's GLM 5.2 would be the current frontier open weights model. If you have way too much money or infrastructure this would be the model of choice for sure.
Commercial models:
- The current frontier models that work best with codex are for sure OpenAI's GPT 5.4 and GPT 5.5.
Agentic frameworks:
- codex, which is actually open source and available on GitHub, is the frontier coding agent at this time.
- Claude Code is the largest closed competitor at this time.
- Kilo and Cline are two nice open source coding agents that offer plugins for IDEs like code-oss (which is the open variant of Microsoft's VSCode).
- OpenClaw follows a slightly different target. It is also an agentic framework and tries to generalize from coding to a personalized assistant. I’ve personally not encountered any case where I preferred OpenClaw over codex though.
- OpenCode is a pretty well developed alternative to codex.
Running your own LLMs:
- My mini-apigw in case one wants to execute the code using the OpenAI API on shared hardware.

