The first step was to create a Dockerfile and a main.py file to download, install, and run the model, then serve it using FastAPI. After some minor debugging to find the right download URL, we hit a wall: llama-cpp-python did not yet support Gemma4. After going in circles for a while, I asked Gemini whether switching to Ollama would solve the issue. It said yes! After a few more rounds of debugging, we had a working deployment of Gemma4 running entirely on a free-tier HuggingFace Space. No credit card required!
Do check your excitement, though, as there are massive limitations. The model takes about 3 to 4 minutes to process a prompt and start responding. However, given that it's running on completely free infrastructure, I think it's worth it. You can check out this deployment here. You can also prompt it directly at https://ismizo-gemma4.hf.space/api/generate. See the curl example below.
curl -X POST https://ismizo-gemma4.hf.space/api/generate -H "Content-Type: application/json" -d '{
  "model": "gemma4",
  "prompt": "Explain the concept of a Dyson Sphere in three short sentences.",
  "stream": true
}'
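For readers who would rather call the endpoint from Python, here is a minimal sketch using only the standard library. It assumes the Space behaves like a stock Ollama server, which streams one JSON object per line, each carrying a `response` text fragment and a `done` flag. The helper names `iter_fragments` and `generate` are my own, and the 600-second timeout is a guess based on the multi-minute latency mentioned above.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Endpoint from the curl example; swap in your own Space URL.
URL = "https://ismizo-gemma4.hf.space/api/generate"


def iter_fragments(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the server marks the final chunk with done=true


def generate(prompt: str, model: str = "gemma4", timeout: float = 600.0) -> str:
    """Send one prompt and return the full streamed answer as a string."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    # An HTTP response object can be iterated line by line as bytes.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return "".join(iter_fragments(response))
```

Calling `generate("Explain the concept of a Dyson Sphere in three short sentences.")` should return the whole answer once streaming finishes. Keeping the parsing in `iter_fragments` separate from the network call makes it easy to check against canned lines without waiting on the Space.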
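To try this out on your own free-tier HuggingFace Space, you'll need the Dockerfile and entrypoint.sh file shown below. Deploy a new free-tier Space using the blank Docker option, then upload these two files. Note that the Dockerfile reads a build secret named HF_TOKEN, so you will likely need to add a HuggingFace access token under that name in the Space's settings. After some time, you'll have your own instance of Gemma4 running on HuggingFace.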
If you can afford $8 per month and want more performance, I recommend using Cloud Run. You can follow this article on how to deploy any LLM you want from Ollama to Cloud Run.
I hope this is useful to at least someone.
Good luck and enjoy!
Dockerfile
FROM python:3.10-slim
# 1. Install only what we need for the model download
RUN apt-get update && apt-get install -y \
    curl wget \
    && rm -rf /var/lib/apt/lists/*
# 2. THE BULLETPROOF FIX: Copy the binary directly from the official Ollama image
COPY --from=ollama/ollama:latest /usr/bin/ollama /usr/bin/ollama
WORKDIR /app
# 3. Download the Gemma 4 GGUF
RUN --mount=type=secret,id=HF_TOKEN,mode=0444 \
    wget --header="Authorization: Bearer $(cat /run/secrets/HF_TOKEN)" \
    https://huggingface.co/bartowski/google_gemma-4-E2B-it-GGUF/resolve/main/google_gemma-4-E2B-it-Q4_K_M.gguf \
    -O model.gguf
# 4. Create the Modelfile with CPU Performance Parameters
RUN printf "FROM ./model.gguf\nPARAMETER num_ctx 2048\nPARAMETER num_thread 2\nPARAMETER num_batch 256\nPARAMETER num_keep 500" > Modelfile
# 5. Configure for Hugging Face Spaces (Port 7860)
EXPOSE 7860
ENV OLLAMA_HOST=0.0.0.0:7860
ENV OLLAMA_KEEP_ALIVE=-1
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
entrypoint.sh
#!/bin/bash
# Start Ollama server in the background
ollama serve &
# Wait until the server is responsive on port 7860
until curl -s localhost:7860 > /dev/null; do
    echo "Waiting for Ollama server..."
    sleep 2
done
# Create the model using the local GGUF
ollama create gemma4 -f Modelfile
# Keep the script running
wait
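
Because the Space needs a while to build and register the model, it can help to poll from your own machine until it is ready. The sketch below is one way to do that in Python with only the standard library. It assumes the Space exposes Ollama's model listing endpoint at `/api/tags` and that the listing has the usual `{"models": [{"name": ...}]}` shape. The helper names (`fetch_tags`, `has_model`, `wait_for_model`) and the retry settings are my own choices, not part of the deployment.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumed URL: the Space host from this post plus Ollama's listing route.
TAGS_URL = "https://ismizo-gemma4.hf.space/api/tags"


def fetch_tags(url: str = TAGS_URL) -> dict:
    """Return the parsed JSON body of the model listing endpoint."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def has_model(tags: dict, name: str) -> bool:
    """True if `name` is listed, with or without a tag such as ':latest'."""
    listed = [m.get("name", "") for m in tags.get("models", [])]
    return any(n == name or n.split(":")[0] == name for n in listed)


def wait_for_model(
    name: str,
    fetch: Callable[[], dict] = fetch_tags,
    attempts: int = 30,
    delay: float = 10.0,
) -> bool:
    """Poll until the model is listed or the attempts run out."""
    for _ in range(attempts):
        try:
            if has_model(fetch(), name):
                return True
        except (OSError, ValueError):
            pass  # Space still building, server not up, or non-JSON reply
        time.sleep(delay)
    return False
```

With the defaults, `wait_for_model("gemma4")` checks every 10 seconds for about five minutes and returns `False` if the model never shows up. Passing `fetch` as a parameter keeps the polling logic checkable without touching the network.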