Part 1. Why API Costs Hit $10 a Day — Server Design and Model Strategy

Series: Can AI Agents Actually Run Business Operations? — Part 1

OpenClaw has been getting a lot of attention as an AI agent platform since late 2025. But how usable is it in real business work? And how does it hold up from a security standpoint?

In this series, I share what I learned from using OpenClaw on an actual client engagement — rough edges, failures, and course corrections included. I am not an OpenClaw expert; this is one engineer working through the real problems. I hope it helps people who are evaluating OpenClaw or similar AI agent platforms for adoption, as a kind of “here is what to watch out for” reference.

In this post (Part 1), I walk through why my API bill came in 5-10x higher than expected on day one. The main traps: OpenClaw does not auto-compress context, and Cron and HEARTBEAT can run the same task twice. I also describe the 3-tier model strategy (including a local LLM) that brought the numbers back under control.

For the overall project summary, please see the project page.


1. Server Selection and Base Setup

We chose Hetzner Cloud for a 24/7 agent server.

| Item | Value |
| --- | --- |
| Location | Helsinki (hel1) |
| OS | Ubuntu 24.04 |
| Spec | 4 vCPU / 16 GB RAM / 150 GB SSD |
| Monthly cost | about €12 |

Latency is not a big deal for long-running agents, so the European region was fine. Cloud was the right fit because procuring dedicated hardware was not realistic for a client project, we needed a sandbox separated from the internal network, and 24/7 uptime was a requirement.


2. OpenClaw Installation — Things to Watch

What --accept-risk actually means

OpenClaw onboarding runs like this:

npm install -g openclaw@latest
openclaw onboard --install-daemon --non-interactive --accept-risk

If you leave out --accept-risk, you get:

Non-interactive setup requires explicit risk acknowledgement.

OpenClaw runs with full host access, so it asks for explicit risk acknowledgement. The agent can read and write files and run arbitrary commands. You should understand this before deciding to install it.

Browser settings on a server without GUI

On a GUI-less server running as root, two settings are needed:

openclaw config set browser.noSandbox true
openclaw config set browser.headless true

You also need to install Google Chrome Stable in advance:

wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb -O /tmp/chrome.deb
apt-get install -y /tmp/chrome.deb

Exposing the web dashboard

OpenClaw has a built-in web management UI. To access it from outside, I used sslip.io to turn the server IP into a domain, then set up Nginx as a reverse proxy with Let’s Encrypt.

One setting to be careful about: OpenClaw requires device pairing as a brute-force protection. For external access, this has to be disabled:

openclaw config set gateway.controlUi.dangerouslyDisableDeviceAuth true

The name has dangerously in it for a reason. I compensated with three layers of protection: Nginx Basic Auth, a gateway token, and HTTPS.


3. Security Settings — Do Them First

If you run OpenClaw on a cloud server, please do these on day one.

  • Full host access: --accept-risk is literal. The agent can read/write files and run commands freely.
  • .env file permissions: The file is created with 644 (world-readable) by default. It contains API keys, so chmod 600 is required.
  • Template files: Be careful not to accidentally write real API keys into git-tracked template files like .env.example.

chmod 600 /root/.openclaw/.env

It is easy to skip these while you are focused on getting things to work. I would recommend doing them before anything else.
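The permission check above can also be scripted so it is easy to rerun. This is a minimal sketch using only the Python standard library. The function names are my own, and the path you pass in (for example /root/.openclaw/.env from the command above) is up to you.

```python
import os
import stat


def env_mode_is_private(path: str) -> bool:
    """Return True if no group/other permission bits are set (600 or stricter)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any bit in the group or other triplets means the file is exposed
    # beyond its owner, which is what the default 644 does.
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def tighten_env(path: str) -> None:
    """Equivalent of `chmod 600 <path>`: owner read/write only."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Example usage (path taken from the chmod command in this section):
#   if not env_mode_is_private("/root/.openclaw/.env"):
#       tighten_env("/root/.openclaw/.env")
```

Running the check from a deploy script means a recreated .env file does not silently go back to 644.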


4. Cost Surprise — 5-10x Over Budget

The day after setup, I checked OpenClaw’s /usage dashboard and found costs were much higher than expected.

  • Expected: $1-2/day ($30-60/month)
  • Actual: $8-10/day ($240-300/month)

I broke down the causes and found three issues stacking up.

Cause 1: Context token bloat (the biggest factor)

The main session had grown to about 310,000 tokens.

agent:main:main │ 310k/200k (155%)

OpenClaw does not auto-compress conversation history. Unlike in Claude Code, everything stays in the session, from setup logs to error messages, and every new message resends the full 310K tokens to the API.

| Session state | Cost per message | Cost per 100 messages |
| --- | --- | --- |
| 310K tokens | $0.023 | $2.30 |
| 40K tokens (after reset) | $0.003 | $0.30 |

Even if you are using a cheap model, a large context will eat that advantage.
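The arithmetic behind the table can be sketched in a few lines. The price of $0.075 per million input tokens is an assumption that I back-derived from the table ($0.023 / 310K tokens); it is not a quoted provider price, and output tokens are ignored.

```python
# Assumed input price in USD per million tokens, back-derived from the
# table above. Not an official price; replace with your provider's rate.
ASSUMED_USD_PER_MTOK = 0.075


def input_cost_usd(context_tokens: int,
                   usd_per_million_tokens: float,
                   messages: int = 1) -> float:
    """Input-side cost when the whole context is resent on every message."""
    return context_tokens / 1_000_000 * usd_per_million_tokens * messages


bloated = input_cost_usd(310_000, ASSUMED_USD_PER_MTOK)  # about $0.023
trimmed = input_cost_usd(40_000, ASSUMED_USD_PER_MTOK)   # about $0.003
print(f"per message: ${bloated:.3f} vs ${trimmed:.3f}")
print(f"per 100 messages: ${bloated * 100:.2f} vs ${trimmed * 100:.2f}")
```

The per-100-message figure comes out a few cents above the table's $2.30 because the table multiplies the already rounded per-message cost. Either way the ratio is the same: the bloated session costs about 7.75 times more per message on the same model.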

Cause 2: Cron jobs running all the time

A monitoring cron running every 30 minutes was hitting the API 48 times a day.

Cause 3: Cron and HEARTBEAT both running

This one is the hardest to notice. OpenClaw has two scheduling mechanisms:

| Mechanism | Session | How to stop |
| --- | --- | --- |
| Cron | Isolated session | Manage with openclaw cron |
| HEARTBEAT | Inside main session | Edit HEARTBEAT.md |

If you stop cron but leave the same task in HEARTBEAT.md, it keeps running. In practice, after I had disabled all crons, web_search calls continued.

web_search: 186 calls
web_fetch:   33 calls

The HEARTBEAT auto-processing was triggering web searches more than I realized.
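One way to catch this overlap is to scan HEARTBEAT.md for the tasks you just disabled in cron. The sketch below is a plain keyword search. The keywords are hypothetical examples, and I am assuming HEARTBEAT.md is free-form text, so a match is only a hint to go and read the file.

```python
def tasks_still_in_heartbeat(heartbeat_text: str,
                             disabled_keywords: list[str]) -> list[str]:
    """Return the keywords of disabled cron tasks that still appear in the text."""
    lowered = heartbeat_text.lower()
    # Case-insensitive substring match; deliberately simple.
    return [kw for kw in disabled_keywords if kw.lower() in lowered]


# Example usage (file location is an assumption; adjust to your workspace):
#   from pathlib import Path
#   text = Path("HEARTBEAT.md").read_text(encoding="utf-8")
#   print(tasks_still_in_heartbeat(text, ["web_search", "monitoring"]))
```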


5. Countermeasures — Compress Sessions, Review Scheduled Tasks

Immediate steps and their effect:

| Action | Effect |
| --- | --- |
| Session compression (/reset) | 310K → 21K tokens (93% reduction) |
| Disable all cron jobs | Background API calls went to zero |
| Remove monitoring tasks from HEARTBEAT | Stopped the 30-minute auto-search |

Before vs. after:

| Metric | Before | After |
| --- | --- | --- |
| Main session tokens | 310,000 | 21,000 |
| Background API calls | 48/day | 0/day |
| Estimated monthly cost | $240-300 | $10-20 |
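To avoid drifting back to 310K, the session line shown earlier (agent:main:main │ 310k/200k (155%)) can be parsed and compared against a threshold. This is a sketch: I only know the one line format quoted in this post, so the regular expression is an assumption about the rest of the status output, and the 50% threshold is an arbitrary example.

```python
import re

# Matches lines shaped like: "agent:main:main │ 310k/200k (155%)"
_USAGE = re.compile(
    r"^(?P<session>\S+)\s*[│|]\s*"
    r"(?P<used>\d+(?:\.\d+)?)k/(?P<limit>\d+(?:\.\d+)?)k"
)


def parse_usage(line: str):
    """Return (session, used_tokens, limit_tokens), or None if the line does not match."""
    m = _USAGE.match(line.strip())
    if m is None:
        return None
    used = int(float(m["used"]) * 1000)
    limit = int(float(m["limit"]) * 1000)
    return m["session"], used, limit


def needs_reset(line: str, threshold: float = 0.5) -> bool:
    """True when the session uses at least `threshold` of its context limit."""
    parsed = parse_usage(line)
    if parsed is None:
        return False
    _, used, limit = parsed
    return used / limit >= threshold
```

Feeding it the line from my bloated session returns True, which is the point where I would now run /reset.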

6. Local LLM — Ollama + Gemma3 4B

To fundamentally solve the 24/7 cron cost problem (which would have been $50-100/month on its own), I added a local LLM.

Model selection

| Model | Result | Note |
| --- | --- | --- |
| Qwen3 4B | Not adopted | Thinking mode always on. A 3-line answer took 2,500 tokens and 3 minutes |
| Gemma3 4B | Adopted | The same task took 60 tokens and 3 seconds |

Qwen3’s chat template force-injects a <think> tag. On CPU inference, this slows things down significantly. I tried /no_think and a custom Modelfile, but the effect was limited.

In my view, Thinking-forced models are not a good fit for CPU-only inference environments.

Server resource usage

Hetzner: 4vCPU (AMD EPYC) / 16GB RAM / no GPU

Gemma3 4B loaded:
  Disk:    3.3 GB
  RAM:     4.2 GB (CPU inference)
  Free RAM: 10 GB (plenty of room for OpenClaw + OS)
  Speed:   14-15 tok/s

A 16GB-RAM server is enough. Even without a GPU, response speed is workable for routine tasks.
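If you want to measure the speed on your own server, Ollama's local HTTP API reports the generated token count and generation time in its response. The sketch below assumes Ollama is listening on its default port 11434 and that the model was pulled under the tag gemma3:4b. The helper names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def tokens_per_second(result: dict) -> float:
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens and eval_duration is
    the generation time in nanoseconds.
    """
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(prompt: str, model: str = "gemma3:4b") -> dict:
    """Send one non-streaming generation request to the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


# Example usage (requires a running Ollama server):
#   result = generate("Summarize in three lines: ...")
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

Comparing this number between two candidate models on the same prompt is a quick way to spot a model that spends most of its tokens on thinking.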


7. The 3-Tier Model Strategy

Based on this experience, we settled on a 3-tier strategy that matches model tier to task characteristics.

| Tier | Model | Use | Cost band |
| --- | --- | --- | --- |
| Tier 1 | Claude Sonnet / Opus | Code generation, deep analysis, inconsistency detection | medium-high |
| Tier 2 | Gemini 3 Flash | Orchestration, web summarization, HTML extraction | low |
| Tier 3 | Gemma3 4B (local) | Heartbeat response, routine processing | $0 |

Execution strategy

  1. Code-First: Convert rule checks into Python code and run them without an LLM (zero execution cost)
  2. Smart Routing: Low-load tasks go to Gemini Flash, heavy tasks go to Claude Sonnet
  3. Local Preference: Keep routine work on Ollama + Gemma3
  4. Code generation is Sonnet only: Gemini Flash tends to fall back to hardcoding (details in Part 2), so it is not suitable for code generation
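The four rules above can be written down as a small lookup table. This is an illustration of the routing idea only, not how OpenClaw routes internally. The task labels are hypothetical names I made up, and the model IDs are the ones used in the prompt later in this section.

```python
from enum import Enum


class Tier(Enum):
    CODE = "plain Python, no LLM"
    LOCAL = "gemma3:4b (Ollama)"
    FLASH = "google/gemini-3-flash-preview"
    SONNET = "anthropic/claude-sonnet-4-6"


# Hypothetical task labels mapped to the rules in the list above.
_ROUTES = {
    "rule_check": Tier.CODE,         # 1. Code-First
    "orchestration": Tier.FLASH,     # 2. Smart Routing (low load)
    "web_summary": Tier.FLASH,
    "html_extraction": Tier.FLASH,
    "deep_analysis": Tier.SONNET,    # 2. Smart Routing (heavy)
    "heartbeat": Tier.LOCAL,         # 3. Local Preference
    "routine": Tier.LOCAL,
    "code_generation": Tier.SONNET,  # 4. Code generation is Sonnet only
}


def route(task_kind: str) -> Tier:
    """Pick a tier; unknown tasks fall back to the default model (Flash)."""
    return _ROUTES.get(task_kind, Tier.FLASH)
```

Defaulting unknown tasks to the cheap hosted tier matches the default model I set, so a new task type never lands on the expensive tier by accident.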

Let OpenClaw configure itself

Rather than editing config files directly from Claude Code, it worked better to prompt OpenClaw itself and let it update its own settings.

openclaw agent --agent main --message 'Please change the default model to google/gemini-3-flash-preview.
Use models based on task difficulty:
- Normal: google/gemini-3-flash-preview
- Medium: anthropic/claude-sonnet-4-6
- Hard: anthropic/claude-opus-4-6'

OpenClaw updated its own default model and recorded the routing strategy in SOUL.md.


8. Takeaways

  1. OpenClaw does not auto-compress context. Unlike Claude Code, it grows without bound unless you /reset manually or configure a rule.
  2. Cron and HEARTBEAT are different mechanisms. If you stop one, always check the other.
  3. Cheap models do not save you from large contexts. Gemini Flash is low-cost per token, but if you send 310K tokens every message, it is not cheap anymore. Cost = price × tokens × calls — you have to evaluate all three.
  4. Run /reset right after setup. Setup sessions accumulate a lot of trial-and-error context that you do not need going forward.
  5. Avoid models that force thinking mode on CPU inference. They may be fine on GPU, but in my tests they were too slow to be practical on a CPU-only server.
  6. Check /usage daily. A one-day delay in noticing an anomaly can mean a $10 difference.
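For the daily check in the last item, even a tiny comparison against a baseline helps. The sketch below assumes you copy the daily totals from /usage by hand into a dict (I am not assuming any export API), and the factor of 2 is an arbitrary example.

```python
def days_over_baseline(daily_usd: dict[str, float],
                       baseline_usd: float,
                       factor: float = 2.0) -> list[str]:
    """Return the days whose cost exceeds baseline_usd * factor."""
    limit = baseline_usd * factor
    return [day for day, cost in daily_usd.items() if cost > limit]


# Example with the numbers from this post: I expected at most $2/day
# and saw roughly $9/day right after setup.
#   days_over_baseline({"day 1": 9.0, "day 2": 1.5}, baseline_usd=2.0)
```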

9. Post-Setup Checklist

  • Check cost on /usage (establish a baseline)
  • Check token count per session with openclaw status
  • Run openclaw cron list to find unneeded crons
  • Review HEARTBEAT.md for unnecessary tasks
  • Verify .env has 600 permissions
  • Run /reset after setup is complete
  • Set a monthly spend cap in your API provider dashboard
  • Browser settings: headless: true + noSandbox: true

In Part 2, I will share an experiment where we gave AI models only the rule specification (in Japanese text) and asked them to generate Python check code on their own. Comparing three models, we found that Gemini Flash had been hardcoding the expected answers.
