A short story of using Qwen3.6:35b-a3b with codex

05 Jul 2026 - tsp
Last update 05 Jul 2026

So we all know OpenAI's codex is totally awesome. Especially with GPT 5.5 it is amazing as soon as you know how to use it (which means: don’t tell it “write me an app” but handle it like a senior developer would handle their juniors). But we also all know that any subscription runs out of credits very fast, and when one uses an on-demand plan - for example via the platform API - one goes broke very fast because it gets expensive. Especially with the looping /goal one can accumulate cost very quickly.

In addition it is already very simple to run local large language models on consumer hardware, for example using ollama, even on small scale systems. These models are pretty amazing for small scale local jobs that usually do not require long term planning: knowledge graph extraction, embedding vectors, simple chats, interpreting your e-mails and similar applications. Running locally trades the continuous cost of subscriptions for upfront hardware cost plus electricity.

The main problem with self hosted LLMs at the moment is available VRAM (or RAM on shared memory systems like Apple Silicon or the new Ryzen AI series). Even on larger consumer systems one is pretty limited. To run a large model like GLM 5.2 on your own hardware in a reasonable way you would need around 8x H100 80GB or 4 to 8 H200 141GB GPUs, a high end x86 CPU with AVX512, more system memory than VRAM (i.e. we are talking about 1TB of RAM) - and of course a mainboard, power supplies and cooling to run that system. Technically doable, but usually in the price range of around 200,000 - 300,000 EUR for the GPUs alone. This is not feasible for private use (at least if you are not a billionaire - and if you are, you most likely know that this would be a total waste of money for personal use and would still fall back to a hosted cloud service).

Smaller models, though, work very well on consumer hardware. The 35B MoE models run on a single 24GB VRAM GPU, 70B models on somewhat larger GPUs. In my experience they are noticeably weaker at long-range planning, maintaining coherence over many reasoning steps, and robust recovery from mistakes (which is likely attributable to missing emergent properties of larger scale models). Still, many projects have shown that even small scale models can operate perfectly well for specific tasks when run in agentic frameworks that exploit planning and refinement loops. And that is exactly the idea behind using such models with orchestrators like codex. For more complex tasks a growing context can introduce problems with these models though, as we will see in the bugs section of this article.
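As a rough plausibility check for the 24GB figure, the size of the quantized weights can be estimated from the parameter count and the quantization width. This is only a sketch: the function name is made up for illustration, the 4.85 bits per weight for Q4_K_M is an assumed average rather than a measured value, and KV cache and runtime overhead are ignored entirely.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    return n_params * bits_per_weight / 8 / 2**30


# 36.0e9 parameters is the size ollama reports for qwen3.6:35b-a3b.
# 4.85 bits per weight is an ASSUMED average for Q4_K_M quantization.
qwen_gib = weight_footprint_gib(36.0e9, 4.85)
print(f"approx. weight footprint: {qwen_gib:.1f} GiB")
```

Whatever remains of the VRAM after the weights has to hold the KV cache, which is one reason why the usable context is far smaller than what the model nominally supports.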

TL;DR: You can use small scale models with codex and sometimes the output is acceptable. It's totally not comparable with the performance of frontier models like GPT 5.4 or GPT 5.5 though. If you want a robust and efficient assistant for software development, don’t bother using the ollama backend. If you can accept many quirks, runaway context and the occasional injected deletion, and want to play around, using a smaller model with codex might get interesting.

Model Deployment

In the following I assume we are going to use ollama as the runtime for the large language models, since this is the easiest runtime available for end users. One could of course also use vLLM, which is the runtime of choice for distributed MoE models, or the llama.cpp server.

The best model that I have used so far on smaller scale systems like the dual 12GB RTX3060 setup was, in my opinion, qwen3.6:35b-a3b, which is also directly available in ollama's model library. This is a mixture of experts model with only 3B parameters active at any time, while being a 35B model overall. Note that OpenAI themselves usually suggest (and target codex at) their own gpt-oss model. To install the qwen3.6:35b-a3b model on ollama one can simply pull it via

ollama pull qwen3.6:35b-a3b

Unfortunately the default parameters for this model are not really suited for operation with codex, especially due to the thresholds for repeated token output. The default parameters I found in the latest version I played with were:

Model
  architecture       qwen35moe
  parameters         36.0B
  context length     262144
  embedding length   2048
  quantization       Q4_K_M

Capabilities
  completion
  vision
  tools
  thinking

Parameters
  presence_penalty   1.5
  repeat_penalty     1
  temperature        1
  top_k              20
  top_p              0.95
  min_p              0

License
  Apache License
  Version 2.0, January 2004

An explanation of the parameters can be found in the appendix. The main problems for useful execution with codex were:

This led to a modified Modelfile for the model. I used the following parameters, already overriding the system prompt with some very simple instructions that turned out to be sufficient:

FROM qwen3.6:35b-a3b

PARAMETER num_ctx 65536
PARAMETER temperature 0.2
PARAMETER top_k 40
PARAMETER top_p 0.8
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.08
PARAMETER repeat_last_n 1024
PARAMETER presence_penalty 0

SYSTEM """
You are a coding agent. Be precise and concise.
Do not repeat yourself.
Do not emit long hidden reasoning.
When using tools, produce valid tool calls only.
When you are uncertain, summarize the uncertainty and stop rather than looping.
"""

This can then be used to create a new model identity:

ollama create qwen3.6:35b-a3b-codex -f Modelfile

This turned out to work way better for me than the unmodified model.

Configuration using ollama Directly

So first - how can one use this when codex communicates directly via the responses API with ollama? One needs:

The profile configuration resides in ~/.codex/NAME.config.toml (I personally used ~/.codex/qwen.config.toml). Note that the model_catalog_json path and name are arbitrary, the base_url of course has to point to the ollama instance at the correct port.

model_provider = "ollama-local"
model_context_window = 65536
model_auto_compact_token_limit = 56000
model_catalog_json = "/usr/home/USERNAME/.codex/qwen-models.json"

[model_providers.ollama-local]
name = "Ollama local"
base_url = "http://198.51.100.1:1234/v1"
wire_api = "responses"
requires_openai_auth = false
supports_websockets = false
stream_idle_timeout_ms = 3000000
stream_max_retries = 1

What these settings do is:

In addition, the metadata file specified under model_catalog_json is required to provide capability information about the model to the codex runtime. It tells codex about properties of the template, reasoning support and reasoning levels of the model, the context window configuration (one again has to ensure that this is consistent with the num_ctx of the model) as well as embedded tool support:

{
  "models": [
    {
      "slug": "qwen3.6:35b-a3b-codex",
      "display_name": "Qwen3.6 35B A3B Codex",
      "description": "Local Qwen3.6 35B A3B via Ollama, tuned for Codex OSS.",
      "provider": "ollama-local",
      "visibility": "list",
      "supported_in_api": true,
      "priority": 100,

      "default_reasoning_level": "low",
      "supported_reasoning_levels": [
        {
          "effort": "low",
          "description": "Fast local reasoning"
        },
        {
          "effort": "medium",
          "description": "Balanced local reasoning"
        }
      ],

      "supports_reasoning_summaries": false,
      "default_reasoning_summary": "none",
      "support_verbosity": true,
      "default_verbosity": "low",

      "shell_type": "shell_command",
      "apply_patch_tool_type": "freeform",
      "web_search_tool_type": "text_and_image",
      "supports_parallel_tool_calls": false,
      "supports_image_detail_original": false,

      "context_window": 32768,
      "max_context_window": 32768,
      "effective_context_window_percent": 85,

      "truncation_policy": {
        "mode": "tokens",
        "limit": 10000
      },

      "experimental_supported_tools": [],
      "input_modalities": ["text"],
      "supports_search_tool": false,

      "base_instructions": "You are Codex, a coding agent. Work in short, precise steps. Use tools carefully. Do not repeat yourself. When tool calls are needed, emit valid tool calls only. If you are stuck, summarize the blocker and stop instead of looping."
    }
  ]
}

With those files in place, codex can be launched using

codex --oss -p qwen -m qwen3.6:35b-a3b-codex

or, setting the CODEX_OSS_BASE_URL via the environment instead:

env CODEX_OSS_BASE_URL="http://198.51.100.1:1234/v1" codex --oss -p qwen -m qwen3.6:35b-a3b-codex

This is already enough to use the models.

Encountered bugs

stream closed before response.completed

Unfortunately this was the error that appeared most of the time when something failed. The exact codex output is

stream disconnected before completion: stream closed before response.completed

This turned out to have several different causes:

The most common error is the last one - the model aborting due to repeated patterns. This happens especially when the context grows large, causing Qwen's output to degenerate into a thesaurus-like continuation. This is one of the things that happen when running small large language models with a large context. To approach those problems one:

This is one of the major drawbacks of small self-hosted models in comparison to well tuned large scale cloud models.
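To make the idea of "repeated patterns" concrete, here is a minimal sketch of how a client could detect a degenerating output by counting repeated n-grams in the most recent tokens. This is a hypothetical helper, not what codex or ollama do internally, and the thresholds are arbitrary assumptions. It only catches literal repetition, not a chain of synonyms.

```python
def looks_degenerate(tokens: list[str], ngram: int = 4, window: int = 200,
                     max_repeats: int = 5) -> bool:
    """Return True if any n-gram occurs more than max_repeats times in the tail."""
    tail = tokens[-window:]
    counts: dict[tuple[str, ...], int] = {}
    for i in range(len(tail) - ngram + 1):
        key = tuple(tail[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


looping = "fix the bug now".split() * 10
normal = "run the tests then read the traceback and patch the failing function".split()
print(looks_degenerate(looping), looks_degenerate(normal))
```

Whitespace-separated words stand in for real tokens here; a real check would operate on whatever units the stream delivers.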

Tool calls not working

This also happened from time to time when using Qwen models in codex. After running for some time they seem to stop producing proper tool calling output. Then the console gets flooded with output like

<function=exec_command>
<parameter=cmd>
python3 -m pytest tests/ -v --tb=short
</parameter>
</function>
</tool_call>

It's pretty obvious that this happens whenever the model stops emitting the initial <tool_call> tag - for whatever reason. Like most of these bugs, this also happens with a growing context.
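The leaked text is still well structured, so the intended call can be recovered from it. The following sketch parses such an orphaned block with regular expressions. It is a hypothetical helper for illustration only: it is not how codex parses tool calls, and whether a recovered call could be fed back into a running session is not covered here.

```python
import re

# Matches one function block and, inside it, each named parameter.
FUNCTION_RE = re.compile(r"<function=([\w.-]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([\w.-]+)>(.*?)</parameter>", re.DOTALL)


def recover_tool_calls(text: str) -> list[dict]:
    """Extract tool calls from output that lacks the opening <tool_call> tag."""
    calls = []
    for name, body in FUNCTION_RE.findall(text):
        arguments = {key: value.strip() for key, value in PARAMETER_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls


leaked = """<function=exec_command>
<parameter=cmd>
python3 -m pytest tests/ -v --tb=short
</parameter>
</function>
</tool_call>
"""
print(recover_tool_calls(leaked))
```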

Qwen-coder loves to delete

It seems Qwen - as soon as the context grows - loves to approach problems with a simple solution: it often suggests deleting the entire codebase over a simple indentation error. This behaviour is well known from earlier small models when the context grows. It behaves like many beginners in this case - it does not try to understand bugs but simply drops everything and wants to start over.

Appendix: What the Model Parameters Mean

Temperature

This is the best-known parameter. The model computes probabilities

$$ P(t_i) $$

for each possible next token. Temperature rescales these probabilities before sampling:

$$ P'(t_i) \propto P(t_i)^{\frac{1}{T}} $$

where $T$ is the temperature.

Low temperature (0–0.3)

In this configuration the highest probability token almost always wins. For example, suppose the network outputs

Probability  Token
0.82         return
0.09         yield
0.05         break
0.04         continue

At temperature 0.1 the model will nearly always emit return, which is excellent for programming.

Temperature = 1

The distribution is unchanged. The model occasionally chooses the second or third best option (inserting yield or break instead of return).

High temperature (>1)

The distribution becomes flatter, which is amazing for creative writing. Unlikely tokens may be sampled. Absolutely terrible for programming.
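The three regimes can be reproduced with a few lines that apply the rescaling formula from above to the example distribution. This is a sketch on plain probabilities; real runtimes apply the temperature to logits before the softmax, which is mathematically equivalent.

```python
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Rescale probabilities as P'(t) proportional to P(t)^(1/T), then renormalize."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


example = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}
for t in (0.1, 1.0, 2.0):
    rescaled = apply_temperature(example, t)
    print(t, {tok: round(p, 4) for tok, p in rescaled.items()})
```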

top_k

This restricts probabilistic selection to the k most probable tokens. If one applies top_k = 2 to the above example data set, the model will only choose between the two options return and yield.
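As a minimal sketch, the filter keeps the k most probable tokens and renormalizes the remaining probabilities. The function name is made up for illustration.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


example = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}
print(top_k_filter(example, 2))
```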

top_p

This filter does not fix the number of candidates. Instead candidates are selected, most probable first, by adding up their probabilities: as long as the total probability of the selected pool stays below top_p, another token is added to the candidate list. This is also called nucleus sampling.
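A minimal sketch of this selection rule on the example distribution from the temperature section, with a made-up function name:

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Add tokens, most probable first, while the pool's total stays below top_p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        if cumulative >= top_p:
            break
        kept[tok] = p
        cumulative += p
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


example = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}
print(list(top_p_filter(example, 0.8)))
print(list(top_p_filter(example, 0.95)))
```

With top_p = 0.8 only return survives, because its probability alone already exceeds the threshold. With the default of 0.95 the pool grows to three tokens.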

min_p

This option drops every token whose probability is below $\mathrm{min}_p \cdot P_\mathrm{max}$, where $P_\mathrm{max}$ is the probability of the most likely token. In many cases this produces more stable results than filtering via top_p.
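A minimal sketch with a made-up function name, using the example distribution from the temperature section and the min_p value from the Modelfile above:

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


example = {"return": 0.82, "yield": 0.09, "break": 0.05, "continue": 0.04}
print(list(min_p_filter(example, 0.05)))
```

The threshold here is 0.05 * 0.82 = 0.041, so continue (0.04) is dropped while break (0.05) survives. Because the threshold scales with the top probability, the filter is strict when the model is confident and lenient when it is not.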

repeat_penalty

This factor is used as a divisor for the score of every token that has recently been emitted. This discourages the repeated emission of the same token without suppressing it entirely. Typical values are in the range of 1 to 1.15. If set too high, legitimate repetition gets suppressed.

repeat_last_n

This is the sliding window that repeat_penalty is applied to. Larger values prevent long loops but can slightly reduce consistency because earlier identifiers are also penalized.
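The interplay of repeat_penalty and repeat_last_n can be sketched as follows. This follows the scheme used by llama.cpp style samplers, where the penalty acts on logits: positive logits are divided and negative logits multiplied, so the score is lowered in both cases. That ollama behaves exactly like this is an assumption, and treating non-positive window sizes as "disabled" is a simplification of this sketch.

```python
def apply_repeat_penalty(logits: dict[str, float], history: list[str],
                         repeat_penalty: float, repeat_last_n: int) -> dict[str, float]:
    """Lower the logit of every token seen in the last repeat_last_n tokens."""
    # Non-positive window sizes simply disable the penalty in this sketch.
    recent = set(history[-repeat_last_n:]) if repeat_last_n > 0 else set()
    penalized = {}
    for tok, logit in logits.items():
        if tok in recent:
            # Dividing a positive logit or multiplying a negative one lowers it.
            logit = logit / repeat_penalty if logit > 0 else logit * repeat_penalty
        penalized[tok] = logit
    return penalized


logits = {"foo": 2.0, "bar": -1.0, "baz": 1.0}
history = ["foo", "bar", "x", "y"]
print(apply_repeat_penalty(logits, history, 1.08, 1024))
print(apply_repeat_penalty(logits, history, 1.08, 2))
```

In the second call the window covers only the last two tokens, so foo and bar are no longer penalized.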

presence_penalty

This is similar to repeat_penalty, but it is not applied every time the token is encountered. Instead it is applied once to every token that appeared at least once in the sliding window. This reduces the probability of word repetition in text. Large presence penalties are usually undesirable for programming because identifiers and keywords often need to repeat consistently.

frequency_penalty

This is a penalty that scales with the number of times a token has been reused. Again, not optimal for code.
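The difference between the two penalties is easiest to see side by side. The sketch below uses the additive form documented for the OpenAI API, where the presence penalty is subtracted once and the frequency penalty once per occurrence. That ollama applies them in exactly this form is an assumption.

```python
from collections import Counter


def apply_presence_frequency(logits: dict[str, float], history: list[str],
                             presence_penalty: float,
                             frequency_penalty: float) -> dict[str, float]:
    """Subtract a flat penalty for seen tokens plus one that grows with the count."""
    counts = Counter(history)
    return {
        tok: logit
        - (presence_penalty if counts[tok] > 0 else 0.0)
        - frequency_penalty * counts[tok]
        for tok, logit in logits.items()
    }


logits = {"self": 3.0, "tmp": 3.0, "new": 3.0}
history = ["self", "self", "self", "tmp"]
print(apply_presence_frequency(logits, history, 1.5, 0.0))
print(apply_presence_frequency(logits, history, 0.0, 0.5))
```

With the default presence_penalty of 1.5 an identifier like self is pushed down as soon as it has been used once, which illustrates why the Modelfile above sets it to 0.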

Context length (num_ctx)

This is the maximum size of the attention window. Increasing the context length increases the short term memory of the model as well as its context dependent behaviour. It also increases KV cache memory approximately linearly, while the computational cost of standard attention grows roughly quadratically.
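The linear growth of the KV cache follows directly from its layout in a standard transformer: one key and one value vector per layer, per KV head and per position. The sketch below uses purely hypothetical architecture numbers (40 layers, 4 KV heads, head dimension 128, 16 bit values). They are placeholders and not the real values of qwen3.6:35b-a3b.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for standard attention; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


# HYPOTHETICAL architecture values, chosen only to illustrate the scaling.
for n_ctx in (32768, 65536, 262144):
    size = kv_cache_bytes(n_ctx, n_layers=40, n_kv_heads=4, head_dim=128)
    print(f"num_ctx={n_ctx}: {size / 2**30:.1f} GiB")
```

Doubling num_ctx doubles the cache, which is why running at the full nominal context length is usually out of reach on consumer GPUs.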

Maximum response size (num_predict)

This is the maximum number of tokens an LLM is allowed to generate in a single response. Reducing it prevents too much gibberish from being generated in a single response, while of course also reducing the amount of information the LLM can produce. It is primarily used as a safeguard to prevent infinite loops, cap costs and protect system resources.
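Besides setting it in the Modelfile, num_predict can be passed per request through the options field of ollama's native /api/generate endpoint. The sketch below only builds the request. The helper name is made up, and the host and port are taken from the configuration example above, so they are assumptions about your setup.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str,
                           num_predict: int) -> urllib.request.Request:
    """Build a non-streaming /api/generate request with a capped response size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request(
    "http://198.51.100.1:1234", "qwen3.6:35b-a3b-codex", "Say hello.", 256
)
# Sending it requires a reachable ollama instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["response"])
```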

Conclusion

It is entirely possible to run complex long range planning orchestrators and agent frameworks like codex using self hosted models. It takes some work to get them up and running, and neither the quality of the output nor the long range planning will be what one is used to from frontier models like GPT 5.4 or GPT 5.5. To achieve that level of reasoning and long term planning one would need a large scale model like GLM 5.2 on very expensive hardware, which is totally uneconomical for private use unless the privacy of running a local model is the primary reason. When one thinks about buying such hardware one also has to take into account that models will grow over time - one's hardware won't.

If, on the other hand, one wants to play around, needs agents that do long running jobs or wants to hammer an LLM with requests, running an LLM offline is often an interesting approach. The gains are obvious:

Quality of the generated code

First the positive:

And the negative:

Overall the model was comparable to a junior developer who can draft a service shaped codebase quickly but does not reliably close the loop by running, testing and reconciling design assumptions with actual runtime behaviour.

The resulting code was way below production quality, with parts not working at all. LLMs on this scale - in contrast to large scale models like GPT 5.4 or GPT 5.5 - are merely some kind of auto-complete …


This article is tagged: Programming, Artificial Intelligence, System administration, Administration, Large Language Models, Machine learning



Dipl.-Ing. Thomas Spielauer, Wien (webcomplainsQu98equt9ewh@tspi.at)

This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/
