TL;DR (Too Long; Didn't Read)
You can find the project on GitHub and PyPI.
Introduction - The Problem Nobody Wants to Admit
Modern AI applications often behave as if they were the only process in the universe. Every new AI service, notebook, or microservice starts up, loads its favorite model, and assumes that no one else dares to touch the GPU or system resources like RAM. The result is chaos: multiple workloads thrashing the same hardware. Applications today also very often lack proper error handling - if resources are exhausted or GPU buffers are lost due to competition, they simply crash as if error handling had never been invented this century. It's a silent epidemic of resource arrogance. It is reminiscent of the days of Java EE applications that tended to hog all of a server's RAM without actually needing it.
Commercial and open source solutions exist - Langfuse, Helicone, and open projects like TU Wien's Aqueduct, which we use in front of vLLM clusters at the university. These gateways are capable, but they're heavy: complex multi-component architectures with databases, dashboards, and web-based configuration layers. They're great for large institutions but overkill for small labs, hobby projects, or local offline clusters where you just want control and load distribution without bureaucracy and administrative overhead.
That's why mini-apigw exists: a tiny, transparent, locally controlled OpenAI-compatible gateway designed to bridge multiple model backends - OpenAI (ChatGPT), Anthropic (Claude), Ollama (offline models), xAI (Grok), Google (Gemini) and soon vLLM as well as Fooocus - while adding governance and arbitration features missing from the modern AI ecosystem.

Motivation - Why I Built mini-apigw
I wanted a service that sits quietly between clients and model backends and fixes the problems most people don't even notice until it's too late (I obviously ran into them personally):
- Uniform interface: Everything speaks the OpenAI API, including tools like LibreChat, my own orchestration agents, and external tools. You do not have to implement different backends for Ollama, Anthropic, vLLM, and others in every single application; the gateway handles the translation to the different protocols.
- Multiple backends: Mix and match OpenAI cloud models with local Ollama and vLLM without rewriting client code. The appropriate backend is selected by model name, so you can use the same client transparently.
- Resource arbitration: Prevent GPU overload by serializing and queuing heavy model workloads. This is especially interesting when multiple applications access different backends that run on the same hardware and dynamically load their models. If they run in sequence the resources are there; if they run in parallel they reclaim each other's memory, and due to the lost buffers most AI applications today simply crash.
- Per-app governance: Define API keys, model access lists, token quotas, and aliases. In addition, the middleware can perform complete tracing of all requests and responses if configured for an application. There is no need to implement this in every app and trust it; you just set the configuration option and get a list of JSON objects containing all interactions with the backends.
- Hot reconfiguration: Reload settings without restart, because uptime matters - even in a small lab or hobby setting.
- Minimal dependencies: No database except an optional one for persistent accounting, no dashboard, no heavy template engines, no external service dependencies - just FastAPI and a JSON configuration directory.
In short: I wanted something that behaves like a modern version of an old-school UNIX daemon, abstracting the LLM and image generation services (and later on additional services) from the actual backends. No unnecessary web interface, no unnecessary moving parts. Configuration through files. Trace logs into files. Simplicity (as simple as possible, though not too simple).
Design Philosophy - Minimalism Meets Control
mini-apigw lives in ~/.config/mini-apigw/ and uses three JSON files:
- daemon.json for runtime and logging options
- backends.json for definitions of the connected APIs (OpenAI, Ollama; later hopefully Anthropic, vLLM, Fooocus, etc.)
- apps.json for per-app keys, permissions, and limits
It can reload its configuration on SIGHUP like a traditional Unix daemon, report status via local-only admin endpoints, and forward all /v1/* calls exactly as the OpenAI API does - so existing clients need no modification except for the base URL.
This minimalism also reduces the exposed attack surface dramatically. No admin web UI means no authentication endpoint to break into. That makes it unsuitable for large multi-user teams but perfect for controlled environments or single-node setups.
Architecture - A Single Front Door for Many Models

mini-apigw exposes a single /v1/ endpoint structure and routes each request to the appropriate backend according to the model name, aliases, or policy rules. Backends can be cloud APIs or local inference servers. The gateway supports both streaming and non-streaming responses and can transform metadata and logging as needed.
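For example (a sketch assuming the gateway listens on TCP port 8080 and an application key like the secretkey1 from the demo configuration shown later), a chat request is just a standard OpenAI-style call, and the model field alone decides which backend serves it:
$ curl http://localhost:8080/v1/chat/completions \
    -H "Authorization: Bearer secretkey1" \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
Swapping the model for gpt-4o-mini would send the identical request to the OpenAI backend instead, without any change on the client side.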
The most important feature, however, is arbitration.
Arbitration - The Missing Layer in Today's AI Ecosystem
Most AI frameworks today assume exclusive GPU access. In multi-user, multi-service setups, that assumption fails spectacularly. Two LLMs start loading, each thinks it owns the GPU, and everything crashes - sometimes even the whole system, given the state of current GPU computation frameworks.
mini-apigw introduces sequence groups - small arbitration queues per group of backends that serialize model loads and executions. The effect is similar to thread pools or database connection pools: predictable throughput, no GPU thrashing, and clean recovery.
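Conceptually (this is an illustrative sketch, not the gateway's actual code) the arbitration behaves like one asyncio lock per sequence group; forward() below is a placeholder standing in for the real upstream call:
import asyncio

# One lock per sequence group: requests for backends in the same group run
# strictly one after another, requests for other groups run in parallel.
_sequence_locks: dict[str, asyncio.Lock] = {}

async def forward(backend: dict, request: dict) -> dict:
    # Placeholder for the actual upstream call to the backend.
    await asyncio.sleep(1)
    return {"backend": backend["name"], "echo": request}

async def dispatch(backend: dict, request: dict) -> dict:
    group = backend.get("sequence_group")
    if group is None:
        # Backends without a sequence group are not serialized.
        return await forward(backend, request)
    lock = _sequence_locks.setdefault(group, asyncio.Lock())
    async with lock:
        # Only one request per sequence group touches the shared GPU at a time.
        return await forward(backend, request)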
It's a rediscovery of a concept that mainstream software engineering once valued deeply. Java EE, for example, had sophisticated thread pools and resource managers ensuring fairness and throughput under load. Modern AI software, in contrast, is a jungle of processes fighting for VRAM without a referee. mini-apigw brings that sanity back.
Quick Tour - Configuration and Usage
Installation of the application is simple thanks to PyPI:
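$ pip install mini-apigw   # assuming the package is published under the project name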
One can also install the gateway from the cloned GitHub repository, editable in place, for development:
$ git clone git@github.com:tspspi/mini-apigw.git
$ cd mini-apigw
$ pip install -e .
Start the gateway manually:
$ mini-apigw --config ~/.config/mini-apigw/
Unix domain socket & reverse proxy: In my personal deployment I usually run the daemon bound to a Unix domain socket and expose it through Apache (or another reverse proxy) using ProxyPass and ProxyPassReverse to the local socket. This has several operational advantages:
- Single vhost per service: Apache can route multiple backends and vhosts easily while keeping the gateway accessible under the usual HTTP hostnames with TLS termination. It's much easier for a sysadmin to manage virtual hosts this way than to reconfigure services to listen on many different TCP ports, and it's also easier to configure TLS termination on a single server than to configure a bunch of services and keep them up to date.
- Proven and working QoS solutions: With modules like mod_qos, Apache provides proven real-world rate limiting. You can just drop them in with your usual configuration without having to implement anything in the API gateway.
- Reduced network attack surface: The gateway itself does not open a public TCP port; only the trusted reverse proxy does TLS termination and public exposure. Apache's HTTP implementation has been tested for decades - running a load of small HTTP (or even HTTPS) servers and keeping them up to date across a myriad of different implementations increases the attack surface massively, especially with younger libraries that are quickly hacked together by small communities or single developers (this includes my API gateway, of course).
- Filesystem permissions: Unix sockets can be protected efficiently and very simply by filesystem permissions, restricting which system users and services may talk to the gateway.
- Easier integration: Reverse proxy features such as HTTP auth, access logging, rate limiting, and client certificate checks remain available at the proxy layer.
A typical configuration of the reverse proxy - in this case Apache httpd - may look like the following:
<VirtualHost *:80>
    ServerName api.example.com
    ServerAdmin complains@example.com
    DocumentRoot /usr/www/www.example.com/www/

    ProxyTimeout 600
    ProxyPass / "unix:/var/run/mini-apigw.sock|http://localhost/" connectiontimeout=10 timeout=600
    ProxyPassReverse / "unix:/var/run/mini-apigw.sock|http://localhost/"

    <LocationMatch "^/(admin|stats)">
        AuthType Basic
        AuthName "mini-apigw admin"
        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
        <RequireAll>
            Require valid-user
            Require ip 127.0.0.1 ::1 192.168.1.0/24
        </RequireAll>
    </LocationMatch>
</VirtualHost>
<VirtualHost *:443>
    ServerName api.example.com
    ServerAdmin complains@example.com
    DocumentRoot /usr/www/www.example.com/www/

    ProxyTimeout 600
    ProxyPass / "unix:/var/run/mini-apigw.sock|http://localhost/" connectiontimeout=10 timeout=600
    ProxyPassReverse / "unix:/var/run/mini-apigw.sock|http://localhost/"

    SSLEngine on
    SSLOptions +StdEnvVars
    # SSLVerifyClient optional
    SSLVerifyDepth 5
    SSLCertificateFile "/usr/www/www.example.com/conf/ssl.cert"
    SSLCertificateKeyFile "/usr/www/www.example.com/conf/ssl.key"
    SSLCertificateChainFile "/usr/www/www.example.com/conf/ssl.cert"
    # SSLCACertificateFile "/usr/www/www.example.com/conf/ca01_01.cert"

    <LocationMatch "^/(admin|stats)">
        AuthType Basic
        AuthName "mini-apigw admin"
        AuthUserFile "/usr/local/etc/httpd/miniapigw-admin.htpasswd"
        <RequireAll>
            Require valid-user
            Require ip 127.0.0.1 ::1 192.0.2.0/24
        </RequireAll>
    </LocationMatch>
</VirtualHost>
mini-apigw can also listen directly on TCP (IPv6 by default, legacy IPv4 if configured) when that is preferred, but for controlled server deployments the Unix domain socket + reverse proxy pattern tends to make more sense from a system administration perspective.
Configuration
Daemon configuration (daemon.json)
The main daemon configuration daemon.json defines:
- Where the daemon listens. In the following example this is a unix_socket (and the port is unused). One can alternatively specify an ipv4 or ipv6 field containing the listen addresses.
- Where the admin endpoint is located and who can access it.
- The logging section specifies a log file into which the daemon writes its generic and access logs. Currently there is no support for syslog.
- The database configuration specifies an optional PostgreSQL database. If it is specified, an accounting log is written into the database.
- The reload option allows one to enable or disable reloading of the configuration files via a SIGHUP handler.
{
  "listen": {
    "unix_socket": "/usr/home/tsp/miniapigw/gw.sock",
    "port": 8080
  },
  "admin": {
    "stats_networks": [ "127.0.0.1/32", "::1/128", "192.0.2.0/24" ]
  },
  "logging": {
    "level": "INFO",
    "redact_prompts": false,
    "access_log": true,
    "file": "/var/log/miniapigw.log"
  },
  "database": {
    "host": "192.0.2.3",
    "database": "database_name",
    "username": "database_user",
    "password": "database_secret"
  },
  "reload": {
    "enable_sighup": true
  },
  "timeouts": {
    "default_connect_s": 60,
    "default_read_s": 600
  }
}
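If one prefers a plain TCP listener instead of the Unix domain socket, the listen block uses the ipv4 or ipv6 fields mentioned above instead. The exact shape shown below (a list of addresses plus the shared port) is only my assumption for illustration; check the project documentation for the authoritative schema:
"listen": {
  "ipv6": [ "::1" ],
  "port": 8080
}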
Application configuration (apps.json)
The apps.json file contains configuration for the applications that can access the API gateway.
- Each application has an app_id (this should be machine readable; I'd not recommend special characters there) as well as a name. The app_id has to be unique, while the name can be any description of the application.
- The api_keys array contains a list of API keys. These are transparent bearer tokens; at the moment they are not parsed by the gateway in any way. They also have to be uniquely assigned to one application (i.e. the same API key must not be given to different applications).
- The policy allows one to specify which models are allowed for this application (this can include alias definitions from the backend configuration). If the allow whitelist is not used, models can instead be blocked via the deny blacklist.
- The cost_limit enforces a rough resource cap on each application. This is particularly useful when designing automatic systems, to prevent them from running havoc and billing thousands of EUR/USD to your credit card. It's good to have safeguards in place.
- The trace configuration allows you to define a JSONL log file into which every request is logged. Depending on the three configuration options below, it logs different aspects of the requests. This allows one to trace what an application has been doing without having to implement logging in the application itself. If imagedir is specified, all graphics generated by this application are also archived in the given directory to keep a trace of what has been generated.
{
  "apps": [
    {
      "app_id": "demo",
      "name": "Demo application",
      "api_keys": [
        "secretkey1",
        "secretkey2"
      ],
      "policy": {
        "allow": [ "llama3.2", "gpt-4o-mini", "gpt-oss", "llama3.2:latest", "text-embedding-3-small", "nomic-embed-text", "dall-e-3" ],
        "deny": []
      },
      "cost_limit": {
        "period": "day",
        "limit": 10.0
      },
      "trace": {
        "file": "/var/log/miniapigw/logs/demo.jsonl",
        "imagedir": "/var/log/miniapigw/logs/demo.images/",
        "includeprompts": true,
        "includeresponse": true,
        "includekeys": true
      }
    }
  ]
}
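Since the trace is plain JSONL, inspecting what an application has been doing needs nothing more than a few lines of Python. The field names are whatever the gateway emits, so this sketch simply prints the raw entries:
import json

# Each line of the trace file is one JSON object describing a single interaction.
with open("/var/log/miniapigw/logs/demo.jsonl") as trace:
    for line in trace:
        entry = json.loads(line)
        print(entry)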
Backend configuration (backends.json)
This is the place where one defines which backends are available and which sequence groups they belong to. In addition, aliases are defined here. The file is one large JSON dictionary.
The aliases section is a simple dictionary mapping arbitrary strings to actual model names. The model name later selects the backend. In the following example one can see that some aliases are used to select model sizes or versions. In addition, a transparent name called blogembed is used. This is a technique I also use on my personal gateway to select the embeddings used by the tools operating on this blog. All tools use the transparent name blogembed when querying the gateway; if I ever want to switch to a different embedding, I only have to change the mapping in the alias. The tools detect the different size of the embeddings and regenerate their indices.
The next section is sequence_groups. This is a dictionary that contains one entry per so-called sequence group. All requests that go to backends belonging to the same sequence group are executed serially, never in parallel. Other requests may be processed in parallel.
The following list of backends is then the main configuration. As one can see, every backend has:
- A type that selects the code used to communicate with and translate to/from this backend.
- A name for logging purposes.
- Connection parameters like base_url, the api_key required to access the remote host, etc. For backends like Fooocus one will also be able to specify things like selected styles, models, refiners, and other parameters.
- The supports list, which defines which models are exposed for the different operations. Those are exposed to the routing framework. The selection of the backend operates on the model names used here - a client requesting for example gpt-4o-mini for chat will be routed to the openai-primary backend, while a client requesting llama3.2:latest for completion will be routed to ollama-local.
- The cost configuration, which allows one to specify how much each request costs per token. This is not fully implemented yet and is part of the safeguard against runaway applications.
{
  "aliases": {
    "llama3.2": "llama3.2:latest",
    "gpt-oss": "gpt-oss:20b",
    "llama3.2-vision": "llama3.2-vision:latest",
    "blogembed": "mxbai-embed-large:latest"
  },
  "sequence_groups": {
    "local_gpu_01": {
      "description": "Serialized work for local GPU tasks"
    }
  },
  "backends": [
    {
      "type": "openai",
      "name": "openai-primary",
      "base_url": "https://api.openai.com/v1",
      "api_key": "YOUROPENAI_PLATFORM_KEY",
      "concurrency": 1,
      "supports": {
        "chat": [ "gpt-4o-mini" ],
        "embeddings": [ "text-embedding-3-small" ],
        "images": [ "dall-e-3" ]
      },
      "cost": {
        "currency": "usd",
        "unit": "1k_tokens",
        "models": {
          "gpt-4o-mini": { "prompt": 0.002, "completion": 0.004 },
          "text-embedding-3-small": { "prompt": 0.0001, "completion": 0.0 }
        }
      }
    },
    {
      "type": "ollama",
      "name": "ollama-local",
      "base_url": "http://192.0.2.1:8182",
      "sequence_group": "local_gpu_01",
      "concurrency": 1,
      "supports": {
        "chat": [ "llama3.2:latest", "gpt-oss:20b", "llama3.2-vision:latest" ],
        "completions": [ "llama3.2:latest" ],
        "embeddings": [ "nomic-embed-text", "mxbai-embed-large:latest" ]
      },
      "cost": {
        "models": {
          "llama3.2:latest": { "prompt": 0.0, "completion": 0.0 },
          "gpt-oss:20b": { "prompt": 0.0, "completion": 0.0 },
          "nomic-embed-text": { "prompt": 0.0, "completion": 0.0 }
        }
      }
    }
  ]
}
Then any OpenAI-compatible client can use it transparently:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-...")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Explain quantum tunneling."}]
)
mini-apigw will automatically pick the right backend (Ollama in this case) and manage concurrency and logging.
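As a second sketch (assuming the alias configuration above and that the application's allow list permits the alias), requesting embeddings through the transparent blogembed name ends up at the mxbai-embed-large model on the ollama-local backend:
embeddings = client.embeddings.create(
    model="blogembed",
    input=["Some paragraph from the blog."]
)
print(len(embeddings.data[0].embedding))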
Creating and Using API keys
To ease the creation of API keys - these are only transparent bearer tokens, so really just arbitrary strings - the mini-apigw client implements the token command. It creates a random access token that can then be used in the application configuration. At the time of writing, API tokens are treated as opaque sequences of bytes; at a later stage they will be JWTs that include permissions for the given clients to allow end-to-end authorization.
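For example, generating a fresh key that can then be pasted into the api_keys array of an application:
$ mini-apigw token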
Note that API keys should never be sent over plain HTTP except on the local network or over the Unix domain socket. Always use HTTPS.
Starting and Stopping the Services, Reloading Configuration
Starting the service can be done via the mini-apigw command line interface using the start subcommand (or without any subcommand), or via an rc.init script in case one runs on FreeBSD. Stopping and reloading the configuration can be done using two distinct mechanisms:
- Signals, like a traditional Unix daemon: on SIGHUP the daemon reloads its configuration from the JSON files, on SIGTERM the daemon shuts down.
- An HTTP administrative interface that is exposed via the command line interface's stop and reload commands.
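In practice that boils down to one of the following (the PID file path shown is the default used by the rc script below):
$ mini-apigw reload                                             # reload via the admin interface
$ kill -HUP  $(cat /usr/local/etc/mini-apigw/mini-apigw.pid)    # reload via signal
$ kill -TERM $(cat /usr/local/etc/mini-apigw/mini-apigw.pid)    # shut down via signal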
The rc.init script also supports checking the status of the daemon using the PID file.
#!/bin/sh
# PROVIDE: mini_apigw
# REQUIRE: LOGIN
# KEYWORD: shutdown
. /etc/rc.subr
name="mini_apigw"
rcvar="mini_apigw_enable"
load_rc_config $name
: ${mini_apigw_enable:="NO"}
: ${mini_apigw_command:="/usr/local/bin/mini-apigw"}
: ${mini_apigw_config_dir:="/usr/local/etc/mini-apigw"}
: ${mini_apigw_user:="mini-apigw"}
: ${mini_apigw_pidfile:="${mini_apigw_config_dir}/mini-apigw.pid"}
: ${mini_apigw_unix_socket:="${mini_apigw_config_dir}/mini-apigw.sock"}
: ${mini_apigw_flags:=""}
: ${mini_apigw_timeout:="10"}
command="${mini_apigw_command}"
pidfile="${mini_apigw_pidfile}"
required_files="${mini_apigw_config_dir}/daemon.json"
extra_commands="reload status"
start_cmd="${name}_start"
stop_cmd="${name}_stop"
reload_cmd="${name}_reload"
status_cmd="${name}_status"
mini_apigw_build_args()
{
    _subcmd="$1"
    shift

    _cmd="${command} ${_subcmd} --config-dir \"${mini_apigw_config_dir}\""
    if [ -n "${mini_apigw_unix_socket}" ]; then
        _cmd="${_cmd} --unix-socket \"${mini_apigw_unix_socket}\""
    fi
    for _arg in "$@"; do
        _cmd="${_cmd} ${_arg}"
    done
    if [ -n "${mini_apigw_flags}" ]; then
        _cmd="${_cmd} ${mini_apigw_flags}"
    fi
    echo "${_cmd}"
}

mini_apigw_run()
{
    _cmd=$(mini_apigw_build_args "$@")
    if [ "$(id -un)" = "${mini_apigw_user}" ]; then
        /bin/sh -c "${_cmd}"
    else
        su -m "${mini_apigw_user}" -c "${_cmd}"
    fi
}

mini_apigw_start()
{
    mini_apigw_run start
}

mini_apigw_stop()
{
    mini_apigw_run stop --timeout "${mini_apigw_timeout}"
}

mini_apigw_reload()
{
    mini_apigw_run reload --timeout "${mini_apigw_timeout}"
}

mini_apigw_status()
{
    if [ ! -f "${pidfile}" ]; then
        echo "${name} is not running"
        return 1
    fi

    _pid=$(cat "${pidfile}" 2>/dev/null)
    if [ -z "${_pid}" ]; then
        echo "${name} pidfile exists but is empty"
        return 1
    fi

    if kill -0 "${_pid}" 2>/dev/null; then
        echo "${name} running as pid ${_pid}"
        return 0
    fi

    echo "${name} pidfile exists but process not running"
    return 1
}

run_rc_command "$1"
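Assuming the script above is installed as /usr/local/etc/rc.d/mini_apigw, enabling and controlling the daemon is the usual FreeBSD routine:
# /etc/rc.conf
mini_apigw_enable="YES"

$ service mini_apigw start
$ service mini_apigw status
$ service mini_apigw reload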
Security Considerations
- Please note that API keys are stored in plain text in the apps.json configuration file. This will be fixed in later iterations; it is of course bad design that has been chosen for simplicity for now. A quick fix later on will be to store only hashes here. This is on the to-do list (and may already be done by the time you read this article).
- The API keys for the backends have to be stored in plain text in the configuration files. There is no way to prevent this.
- The API keys are passed in plain text in the HTTP headers. If you use a public network you have to use TLS. Never ever use plain HTTP over any public or not fully trusted network!
Grafana for Visualization
A nice side effect of the PostgreSQL database that mini-apigw creates is that you can use tools like Grafana to visualize resource usage. The following example shows one of the first tests:

The queries I used are extremely simple and are presented below.
Requests per Model:
SELECT
$__timeGroup(created_at, $__interval) AS time,
model,
COUNT(*) AS requests
FROM requests
WHERE
$__timeFilter(created_at)
GROUP BY 1, 2
ORDER BY 1;
Runtime per Model:
SELECT
$__timeGroup(created_at, $__interval) AS time,
model,
SUM(latency_ms)/1000 AS latency
FROM requests
WHERE
$__timeFilter(created_at)
GROUP BY 1, 2
ORDER BY 1;
Requests per Application:
SELECT
$__timeGroup(created_at, $__interval) AS time,
(app_id || ' - ' || model) AS metric,
COUNT(*) AS value
FROM requests
WHERE $__timeFilter(created_at)
GROUP BY 1, 2
ORDER BY 1;
Runtime per Application:
SELECT
$__timeGroup(created_at, $__interval) AS time,
(app_id || ' - ' || model) AS metric,
SUM(latency_ms)/1000 AS latency
FROM requests
WHERE $__timeFilter(created_at)
GROUP BY 1, 2
ORDER BY 1;
Tokens per Model:
SELECT
$__timeGroup(created_at, $__interval) AS time,
model,
SUM(COALESCE(total_tokens, COALESCE(prompt_tokens, 0) + COALESCE(completion_tokens, 0))) AS tokens
FROM requests
WHERE
$__timeFilter(created_at)
GROUP BY 1, 2
ORDER BY 1;
Conclusion - Control, Simplicity, and Fairness
mini-apigw is a reminder that simplicity and control can coexist. It's about reclaiming responsibility for resources and infrastructure
in a world where every AI application assumes it is alone. It's not a massive platform - it's a scalpel: small, precise, and reliable.
When others build towers of YAML and Kubernetes operators - or require loads of virtual environments and Docker containers to be deployed
without any control over their content - sometimes all you need is a well-behaved little daemon that keeps the peace between your models.
This utility has been designed to solve a given simple task in a simple environment. It will never scale to a huge cluster and it will not
scale to any worldwide operation; it has never been designed to do so. It is there to solve a small local problem, and it has worked
flawlessly so far.
References
The following are useful resources when working with a utility like my mini-apigw:
- GitHub presence of mini-apigw
- PyPI package of mini-apigw
- Commercial LLM providers:
- Open source models:
- Ollama is an easy-to-use runtime for open models that you can run locally if you have enough resources. It allows quick loading and unloading of models.
- vLLM is a runtime for permanently loaded open models, often used in professional environments where models stay loaded all the time.
- Fooocus is a very easy-to-use Gradio-based solution for running SDXL and other Stable Diffusion image generation models.
Interesting books about the topic of Large Language Models (LLMs)
Note: these links are Amazon affiliate links; this page's author profits from qualified purchases.