Build a Private Photo Assistant on Telegram with OpenClaw + MiniCPM-V 4.6


python vision_server.py

vision_server.py

"""LitServe vision API — MiniCPM-V 4.6 photo understanding for OpenClaw."""
from __future__ import annotations

import os
from pathlib import Path

import litserve as ls
from dotenv import load_dotenv

import vision_backend as vb

load_dotenv()

PORT = int(os.getenv("PORT", "8002"))

# Prompt prefix that asks the model for a fixed set of markdown sections.
STRUCTURED_PROMPT = """Analyze the image and answer the user's question.
Return markdown with these sections when relevant:
## Summary (one paragraph)
## Details (bullet points)
## Text found (any visible text, or "none")
## Suggested reply (a short message suitable for Telegram/WhatsApp)
"""


class VisionPhotoAPI(ls.LitAPI):
    def setup(self, device):
        self.model = vb.VISION_MODEL

    def decode_request(self, request):
        # Fall back to a generic question when the caller sends none.
        return {
            "query": (request.get("query") or "What is in this photo?").strip(),
            "image_path": (request.get("image_path") or "").strip(),
        }

    def predict(self, inputs):
        # The path is resolved on the server, relative to its working directory.
        path = Path(inputs["image_path"]).expanduser().resolve()
        prompt = f"{STRUCTURED_PROMPT}\n\nUser question: {inputs['query']}"
        try:
            answer = vb.chat_vision(prompt, [path])
            return {"output": answer, "model": self.model, "image_path": str(path)}
        except vb.VisionError as exc:
            return {"error": str(exc), "model": self.model}

    def encode_response(self, output):
        return output


if __name__ == "__main__":
    server = ls.LitServer(VisionPhotoAPI(), accelerator="auto", timeout=False)
    print(f"Vision API on http://127.0.0.1:{PORT}/predict (model: {vb.VISION_MODEL})")
    server.run(port=PORT)

With the default PORT, the server prints: Vision API on http://127.0.0.1:8002/predict (model: minicpm-v4.6)

Request shape:

{
 "query": "What is the total on this receipt?",
 "image_path": "/absolute/path/to/receipt.png"
}
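The server resolves image_path on its own filesystem, so a relative path is interpreted against the server's working directory, not the client's. A minimal sketch of building the request body with an absolute path follows; build_payload is a hypothetical helper for illustration and is not part of the guide's code.

```python
import json
from pathlib import Path


def build_payload(image: str, query: str = "What is in this photo?") -> bytes:
    """Return the JSON body for POST /predict with an absolute image path."""
    # Hypothetical helper: resolve the path on the client so the server
    # does not interpret it relative to its own working directory.
    abs_path = Path(image).expanduser().resolve()
    payload = {"query": query.strip(), "image_path": str(abs_path)}
    return json.dumps(payload).encode()


if __name__ == "__main__":
    body = build_payload("samples/receipt.png", "What is the total on this receipt?")
    print(body.decode())
```

This only helps when client and server share a filesystem, which is the case for the local setup described here.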

Response:

{
 "output": "## Summary\n...",
 "model": "minicpm-v4.6",
 "image_path": "..."
}
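A failed request comes back in a different shape: predict() returns an "error" key instead of "output" when the backend raises VisionError. The sketch below shows one way a caller might handle both shapes; reply_text and its message wording are assumptions, not something the guide defines.

```python
def reply_text(response: dict) -> str:
    """Pick the text to show the user from a /predict response."""
    # Success responses carry "output"; failures carry "error".
    if response.get("output"):
        return response["output"]
    if response.get("error"):
        return f"Vision request failed: {response['error']}"
    # Neither key present: surface the raw payload rather than guessing.
    return f"Unexpected response: {response!r}"


if __name__ == "__main__":
    print(reply_text({"output": "## Summary\n...", "model": "minicpm-v4.6"}))
    print(reply_text({"error": "image not found", "model": "minicpm-v4.6"}))
```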

[Figure: sample image the API reads]

Test with client.py:

python client.py --image samples/receipt.png --query "OCR this receipt"

client.py

#!/usr/bin/env python3
"""CLI client for the vision photo API."""
from __future__ import annotations

import argparse
import json
import os
import urllib.request

DEFAULT_URL = os.environ.get("VISION_API_URL", "http://127.0.0.1:8002")


def main() -> None:
    p = argparse.ArgumentParser(description="Query local MiniCPM-V vision API")
    p.add_argument("--image", required=True, help="Path to image file")
    p.add_argument("--query", default="Describe this photo in detail.")
    p.add_argument("--url", default=f"{DEFAULT_URL.rstrip('/')}/predict")
    args = p.parse_args()

    # The path is sent as-is; the server resolves it against its own working directory.
    body = json.dumps({"query": args.query, "image_path": args.image}).encode()
    req = urllib.request.Request(
        args.url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=180) as resp:
        data = json.loads(resp.read().decode())
    # Print the answer, or the error, or the raw response as a last resort.
    print(data.get("output") or data.get("error") or data)


if __name__ == "__main__":
    main()

Expected sections in the output: Summary, Details, Text found, Suggested reply.
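Since the output is markdown with predictable "## " headings, a caller can split it into sections. The following is a rough sketch under that assumption; split_sections is a hypothetical helper, and the model may omit or rename sections, so callers should treat every key as optional.

```python
import re


def split_sections(markdown: str) -> dict:
    """Map each '## Heading' in the model output to its body text."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        # Match level-2 headings only, e.g. "## Summary".
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
        # Lines before the first heading are ignored.
    return {name: body.strip() for name, body in sections.items()}


if __name__ == "__main__":
    sample = "## Summary\nA grocery receipt.\n## Text found\nTOTAL 10.75\n"
    for name, body in split_sections(sample).items():
        print(f"{name}: {body}")
```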

Part 3 — Install OpenClaw

Terminal B:

cd guides/openclaw-minicpm-v
source ./use-node22.sh
npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw models set ollama/minicpm-v4.6

use-node22.sh

#!/usr/bin/env bash
set -euo pipefail
export NVM_DIR="${NVM_DIR:-$HOME/.nvm}"
if [[ -s "$NVM_DIR/nvm.sh" ]]; then
  . "$NVM_DIR/nvm.sh"
  nvm use "$(cat "$(dirname "$0")/.nvmrc")"
else
  echo "nvm not found — install Node 22+" >&2
  exit 1
fi
echo "Node: $(node -v)"

The following openclaw.json fragment sets the primary model and VISION_API_URL.

// Merge into ~/.openclaw/openclaw.json
{
  agents: {
    defaults: {
      model: { primary: "ollama/minicpm-v4.6" },
      skills: ["vision-photo"],
    },
  },
  models: {
    providers: {
      ollama: {
        apiKey: "ollama-local",
        baseUrl: "http://127.0.0.1:11434",
        api: "ollama",
        timeoutSeconds: 300,
        models: [
          {
            id: "minicpm-v4.6",
            name: "MiniCPM-V 4.6",
            reasoning: false,
            input: ["text", "image"],
            contextWindow: 256000,
            maxTokens: 8192,
            params: { keep_alive: "15m" },
          },
        ],
      },
    },
  },
  skills: {
    entries: {
      "vision-photo": {
        enabled: true,
        env: {
          VISION_API_URL: "http://127.0.0.1:8002",
        },
      },
    },
  },
}
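"Merge" here means adding these keys to whatever your existing config already contains, not replacing the file. The sketch below illustrates that idea as a recursive merge on already-loaded dictionaries. It is an illustration only: deep_merge is a hypothetical helper, the "workspace" key in the demo is made up, and the fragment's unquoted keys mean Python's json module cannot parse the file directly.

```python
import copy


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a new dict with patch merged into base, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the patch replace the existing value.
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Made-up existing config, only to show that unrelated keys survive.
    existing = {"agents": {"defaults": {"workspace": "~/work"}}}
    patch = {
        "agents": {"defaults": {"model": {"primary": "ollama/minicpm-v4.6"}}},
        "skills": {"entries": {"vision-photo": {"enabled": True}}},
    }
    print(deep_merge(existing, patch))
```

Note that lists are replaced wholesale in this sketch, so an existing skills list would be overwritten rather than extended; merge those by hand if you already have other skills enabled.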

Looking for a faster setup? TechLatest offers a pre-configured OpenClaw environment that includes the gateway, agent runtime, and common dependencies, allowing developers to focus on building skills and automations instead of infrastructure setup.

Link: https://techlatest.net/support/openclaw-support/

Part 4 — Install vision-photo skill

chmod +x install-skill.sh skills/vision-photo/scripts/*.sh
./install-skill.sh
openclaw gateway restart

The skill tells the agent to run:

vision_query.sh "/path/to/image.jpg" "user question"

See skills/vision-photo/SKILL.md.

Part 5 — Telegram / WhatsApp

Follow the OpenClaw channels docs for your platform, and keep DM pairing enabled for security.

When a user sends a photo:

OpenClaw saves media to a local path
Agent invokes vision-photo with path + caption
LitServe returns structured markdown
Agent sends suggested reply to the channel
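The four steps above can be sketched as a single function. This is a deterministic illustration, not how OpenClaw works internally: in the real setup the agent decides what to send. handle_photo is a hypothetical name, and ask_vision stands in for the HTTP call to the LitServe /predict endpoint.

```python
from typing import Callable


def handle_photo(media_path: str, caption: str, ask_vision: Callable[[dict], dict]) -> str:
    """Turn one incoming photo into the text that goes back to the channel."""
    # Step 2: pass the saved path plus the caption as the question.
    request = {"query": caption or "What is in this photo?", "image_path": media_path}
    # Step 3: ask_vision is a stand-in for the POST to /predict.
    response = ask_vision(request)
    if "error" in response:
        return f"Sorry, I could not read that image: {response['error']}"
    # Step 4: prefer the "## Suggested reply" section, else the full output.
    output = response.get("output", "")
    marker = "## Suggested reply"
    if marker in output:
        tail = output.split(marker, 1)[1]
        # Stop at the next heading, if the model wrote one after the reply.
        reply = tail.split("\n## ", 1)[0].strip()
        if reply:
            return reply
    return output.strip()


if __name__ == "__main__":
    def fake_vision(request: dict) -> dict:
        # Canned response so the sketch runs without a model.
        return {"output": "## Summary\nA receipt.\n## Suggested reply\nYour receipt total is $10.75\n"}

    print(handle_photo("/tmp/receipt.png", "What is the total?", fake_vision))
```

Passing the vision call in as a parameter keeps the flow testable without a running model.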

Example channel reply from the demo receipt:

Your receipt total is **$10.75**

Part 6 — Smoke test

./test-local.sh

Runs: Ollama check → sample image → API health → skill script query.
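The contents of test-local.sh are not shown here, so the following is only a guess at what the first and third checks might look like: a plain TCP reachability probe for the two local services. port_open is a hypothetical helper, and the ports come from the configuration earlier in this guide (Ollama on 11434, the vision API on 8002).

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in [("Ollama", 11434), ("Vision API", 8002)]:
        status = "up" if port_open("127.0.0.1", port) else "DOWN"
        print(f"{name:<10} 127.0.0.1:{port} {status}")
```

An open port only shows that something is listening; it does not confirm the model is loaded or that /predict returns a valid answer.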

For a step-by-step walkthrough and complete implementation details, see the full guide.

Deploy the Complete Stack on TechLatest

You can deploy the entire private photo assistant stack using TechLatest AI infrastructure:

Open WebUI + Ollama for local multimodal inference
OpenClaw for agent orchestration and messaging integrations
JupyterHub for experimentation and evaluation
AWS, Azure, and GCP deployment options

This allows developers to build privacy-first vision assistants without spending hours configuring infrastructure and dependencies.

Conclusion

You’ve successfully built a private multimodal assistant that can see, read, and understand images directly from your messaging channels.

Using OpenClaw as the orchestration layer, MiniCPM-V 4.6 as the vision model, and LitServe as the local inference API, you’ve created a workflow where users can send a photo and receive structured, actionable insights without relying on external vision services.

The architecture is intentionally simple:

OpenClaw handles agent orchestration and channel integrations
MiniCPM-V 4.6 provides image understanding and OCR capabilities
LitServe exposes a lightweight local API
The vision-photo skill connects everything

Because the entire stack runs locally, sensitive screenshots, receipts, documents, and personal photos never leave your infrastructure. Whether you’re building customer support agents, document processing workflows, field inspection tools, or personal AI assistants, the same pattern can be extended with additional skills and automation.

From a single image to a complete conversation, your OpenClaw agent now has eyes.
