)
python vision_server.py
"""LitServe vision API — MiniCPM-V 4.6 photo understanding for OpenClaw."""
from __future__ import annotations

import os
from pathlib import Path

import litserve as ls
from dotenv import load_dotenv

import vision_backend as vb

load_dotenv()
PORT = int(os.getenv("PORT", "8002"))
STRUCTURED_PROMPT = """Analyze the image and answer the user's question.
Return markdown with these sections when relevant:
## Summary
(one paragraph)
## Details
(bullet points)
## Text found
(any visible text, or "none")
## Suggested reply
(a short message suitable for Telegram/WhatsApp)
"""


class VisionPhotoAPI(ls.LitAPI):
    def setup(self, device):
        self.model = vb.VISION_MODEL

    def decode_request(self, request):
        # Fall back to a generic question when the caller sends no query.
        return {
            "query": (request.get("query") or "What is in this photo?").strip(),
            "image_path": (request.get("image_path") or "").strip(),
        }

    def predict(self, inputs):
        path = Path(inputs["image_path"]).expanduser().resolve()
        prompt = f"{STRUCTURED_PROMPT}\n\nUser question: {inputs['query']}"
        try:
            answer = vb.chat_vision(prompt, [path])
            return {"output": answer, "model": self.model, "image_path": str(path)}
        except vb.VisionError as exc:
            # Report backend failures in the response body under an "error" key.
            return {"error": str(exc), "model": self.model}

    def encode_response(self, output):
        return output


if __name__ == "__main__":
    # timeout=False turns off LitServe's request timeout.
    server = ls.LitServer(VisionPhotoAPI(), accelerator="auto", timeout=False)
    print(f"Vision API on http://127.0.0.1:{PORT}/predict (model: {vb.VISION_MODEL})")
    server.run(port=PORT)
Server prints: Vision API on http://127.0.0.1:8002/predict
Request shape:
{
"query": "What is the total on this receipt?",
"image_path": "/absolute/path/to/receipt.png"
}
Response:
{
"output": "## Summary\n...",
"model": "minicpm-v4.6",
"image_path": "..."
}
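The request shape above asks for an absolute image path, which matters because the server resolves the path from its own working directory rather than the caller's. Here is a minimal sketch of a helper that builds the request body that way. The name `build_payload` is hypothetical and not part of the guide's code; the default question mirrors the fallback used in `decode_request`.

```python
import json
from pathlib import Path


def build_payload(image_path: str, query: str = "What is in this photo?") -> bytes:
    """Return the JSON body for POST /predict as UTF-8 bytes.

    The path is made absolute on the client side so it points at the same
    file no matter which directory the server was started from.
    """
    absolute = Path(image_path).expanduser().resolve()
    payload = {"query": query.strip(), "image_path": str(absolute)}
    return json.dumps(payload).encode("utf-8")


# Illustrative call; the file does not need to exist to build the body.
body = build_payload("samples/receipt.png", "What is the total on this receipt?")
print(body.decode("utf-8"))
```

This only helps when the client and the server share a filesystem, which is the case for the local setup described here.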
Sample image the API reads:
Test with client.py:
python client.py --image samples/receipt.png --query "OCR this receipt"
client.py
#!/usr/bin/env python3
"""CLI client for the vision photo API."""
from __future__ import annotations

import argparse
import json
import os
import urllib.request

DEFAULT_URL = os.environ.get("VISION_API_URL", "http://127.0.0.1:8002")


def main() -> None:
    p = argparse.ArgumentParser(description="Query local MiniCPM-V vision API")
    p.add_argument("--image", required=True, help="Path to image file")
    p.add_argument("--query", default="Describe this photo in detail.")
    p.add_argument("--url", default=f"{DEFAULT_URL.rstrip('/')}/predict")
    args = p.parse_args()
    body = json.dumps({"query": args.query, "image_path": args.image}).encode()
    req = urllib.request.Request(
        args.url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=180) as resp:
        data = json.loads(resp.read().decode())
    # Print the answer if present, otherwise the error, otherwise the raw payload.
    print(data.get("output") or data.get("error") or data)


if __name__ == "__main__":
    main()
Expected sections in the output: Summary, Details, Text found, Suggested reply.
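Because the output is markdown with fixed `##` headings, it is easy to split into sections before doing anything else with it. The sketch below is one way to do that, assuming the model follows the prompt's heading format. `split_sections` is a hypothetical helper and the sample text is made up for illustration.

```python
import re


def split_sections(markdown: str) -> dict[str, str]:
    """Map each '## Heading' in the model output to the text below it."""
    sections: dict[str, str] = {}
    current = None
    lines: list[str] = []
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            # A new heading closes the previous section, if any.
            if current is not None:
                sections[current] = "\n".join(lines).strip()
            current = match.group(1)
            lines = []
        elif current is not None:
            lines.append(line)
    if current is not None:
        sections[current] = "\n".join(lines).strip()
    return sections


# Made-up sample in the shape the prompt asks for.
sample = (
    "## Summary\nA grocery receipt.\n\n"
    "## Details\n- 3 items\n\n"
    "## Text found\nTOTAL 10.75\n\n"
    "## Suggested reply\nYour receipt total is $10.75"
)
parsed = split_sections(sample)
print(parsed["Suggested reply"])
```

Text that appears before the first heading is ignored, and a response with no headings yields an empty dict, so callers should be ready to fall back to the raw output.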
Part 3 — Install OpenClaw
Terminal B:
cd guides/openclaw-minicpm-v
source ./use-node22.sh
npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw models set ollama/minicpm-v4.6
use-node22.sh
#!/usr/bin/env bash
set -euo pipefail
export NVM_DIR="${NVM_DIR:-$HOME/.nvm}"
if [[ -s "$NVM_DIR/nvm.sh" ]]; then
  . "$NVM_DIR/nvm.sh"
  nvm use "$(cat "$(dirname "$0")/.nvmrc")"
else
  echo "nvm not found — install Node 22+" >&2
  exit 1
fi
echo "Node: $(node -v)"
The openclaw.json config sets the primary model and VISION_API_URL.
// Merge into ~/.openclaw/openclaw.json
{
  agents: {
    defaults: {
      model: { primary: "ollama/minicpm-v4.6" },
      skills: ["vision-photo"],
    },
  },
  models: {
    providers: {
      ollama: {
        apiKey: "ollama-local",
        baseUrl: "http://127.0.0.1:11434",
        api: "ollama",
        timeoutSeconds: 300,
        models: [
          {
            id: "minicpm-v4.6",
            name: "MiniCPM-V 4.6",
            reasoning: false,
            input: ["text", "image"],
            contextWindow: 256000,
            maxTokens: 8192,
            params: { keep_alive: "15m" },
          },
        ],
      },
    },
  },
  skills: {
    entries: {
      "vision-photo": {
        enabled: true,
        env: {
          VISION_API_URL: "http://127.0.0.1:8002",
        },
      },
    },
  },
}
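"Merge" here means folding these keys into whatever the file already contains, not overwriting it. A rough sketch of that idea in Python follows. It works on already-loaded dicts, because the config above uses comments and unquoted keys that Python's `json` module will not parse. `deep_merge` is a hypothetical helper, not an OpenClaw API, and the `existing` dict is invented for illustration.

```python
import copy


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict: nested dicts merge key by key, other values are replaced."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the patch replace the existing value.
            merged[key] = copy.deepcopy(value)
    return merged


# Invented stand-in for a config that already has unrelated settings.
existing = {
    "agents": {"defaults": {"model": {"primary": "some/other-model"}}},
    "unrelated": {"keep": True},
}
# A subset of the settings shown above.
patch = {
    "agents": {
        "defaults": {
            "model": {"primary": "ollama/minicpm-v4.6"},
            "skills": ["vision-photo"],
        }
    },
    "skills": {
        "entries": {
            "vision-photo": {
                "enabled": True,
                "env": {"VISION_API_URL": "http://127.0.0.1:8002"},
            }
        }
    },
}
merged = deep_merge(existing, patch)
print(merged["agents"]["defaults"]["model"]["primary"])
```

Note the design choice that lists are replaced rather than combined. If your existing config already lists other skills, you would want to union the `skills` list instead.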
Looking for a faster setup? TechLatest offers a pre-configured OpenClaw environment that includes the gateway, agent runtime, and common dependencies, allowing developers to focus on building skills and automations instead of infrastructure setup.
Link: https://techlatest.net/support/openclaw-support/
Part 4 — Install vision-photo skill
chmod +x install-skill.sh skills/vision-photo/scripts/*.sh
./install-skill.sh
openclaw gateway restart
The skill tells the agent to run:
vision_query.sh "/path/to/image.jpg" "user question"
See skills/vision-photo/SKILL.md.
Part 5 — Telegram / WhatsApp
Follow the OpenClaw channels docs for your platform. Keep DM pairing enabled for security.
When a user sends a photo:
- OpenClaw saves media to a local path
- Agent invokes vision-photo with path + caption
- LitServe returns structured markdown
- Agent sends suggested reply to the channel
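The last step of the flow above, going from the API response to the message that is actually sent, can be sketched as follows. This is an illustration rather than what the OpenClaw agent does internally. `pick_reply` is a hypothetical helper and the wording of the error message is invented; the `output` and `error` keys come from the server code in Part 2.

```python
import re

# Matches the body of the "## Suggested reply" section, up to the next
# "## " heading or the end of the text.
_REPLY_RE = re.compile(
    r"^##[ \t]*Suggested reply[ \t]*\n(.*?)(?=^##\s|\Z)",
    flags=re.MULTILINE | re.DOTALL | re.IGNORECASE,
)


def pick_reply(response: dict) -> str:
    """Choose the text to send back to the channel from a /predict response."""
    if "error" in response:
        return f"Sorry, I could not read that image: {response['error']}"
    output = response.get("output", "")
    match = _REPLY_RE.search(output)
    if match and match.group(1).strip():
        return match.group(1).strip()
    # No usable section found: fall back to the full model output.
    return output.strip()


print(pick_reply({"output": "## Summary\nA receipt.\n\n## Suggested reply\nYour receipt total is $10.75\n"}))
```

Falling back to the full output keeps the user from getting an empty message when the model skips the section.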
Example channel reply from the demo receipt:
Your receipt total is $10.75
Part 6 — Smoke test
./test-local.sh
Runs: Ollama check → sample image → API health → skill script query.
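The arrows describe a fail-fast sequence: each step only makes sense if the previous one passed. The contents of test-local.sh are not shown here, so the following is only a Python sketch of the same pattern. `run_checks` and `http_ok` are hypothetical helpers, and the two URLs are assumptions based on Ollama's model-listing endpoint and LitServe's default health route.

```python
import urllib.request
from typing import Callable


def run_checks(checks: list[tuple[str, Callable[[], bool]]]) -> list[tuple[str, bool]]:
    """Run named checks in order and stop after the first failure."""
    results = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            # Connection errors and HTTP errors both count as a failed check.
            ok = False
        results.append((name, ok))
        if not ok:
            break
    return results


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200


# Assumed endpoints; adjust to your setup. Nothing is contacted until
# run_checks(LOCAL_CHECKS) is called.
LOCAL_CHECKS = [
    ("Ollama check", lambda: http_ok("http://127.0.0.1:11434/api/tags")),
    ("API health", lambda: http_ok("http://127.0.0.1:8002/health")),
]
```

Calling `run_checks(LOCAL_CHECKS)` returns a list of `(name, passed)` pairs that ends at the first failing step, which tells you where to start debugging.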
For a step-by-step walkthrough and complete implementation details, check out the full guide here.
Deploy the Complete Stack on TechLatest
You can deploy the entire private photo assistant stack using TechLatest AI infrastructure:
- Open WebUI + Ollama for local multimodal inference
- OpenClaw for agent orchestration and messaging integrations
- JupyterHub for experimentation and evaluation
- AWS, Azure, and GCP deployment options
This allows developers to build privacy-first vision assistants without spending hours configuring infrastructure and dependencies.
Conclusion
You’ve successfully built a private multimodal assistant that can see, read, and understand images directly from your messaging channels.
Using OpenClaw as the orchestration layer, MiniCPM-V 4.6 as the vision model, and LitServe as the local inference API, you’ve created a workflow where users can send a photo and receive structured, actionable insights without relying on external vision services.
The architecture is intentionally simple:
- OpenClaw handles agent orchestration and channel integrations
- MiniCPM-V 4.6 provides image understanding and OCR capabilities
- LitServe exposes a lightweight local API
- The vision-photo skill connects everything
Because the entire stack runs locally, sensitive screenshots, receipts, documents, and personal photos never leave your infrastructure. Whether you’re building customer support agents, document processing workflows, field inspection tools, or personal AI assistants, the same pattern can be extended with additional skills and automation.
From a single image to a complete conversation, your OpenClaw agent now has eyes.