Why Run AI Locally?
As a DevOps Engineer in Nepal, I've experimented with every AI coding assistant out there. Claude, Copilot, Cursor — they're great, but they all share one problem: your code leaves your machine. For teams handling sensitive infrastructure, that's a non-starter.
Enter Ollama — a free, open-source tool that runs AI models entirely on your laptop. No API keys, no subscriptions, no data leaving your device. And with Google's Gemma 4:e2b model, you get a surprisingly capable coding assistant for free.
In this guide, I'll walk you through setting up Ollama + Gemma 4:e2b and integrating it with your workflow — all at zero cost.
What You'll Need
| Requirement | Details |
|---|---|
| RAM | 8 GB minimum, 16 GB recommended |
| Storage | ~4 GB for the Gemma 4:e2b model |
| OS | Linux, macOS, or Windows (WSL2) |
| Internet | Only for the initial download |
Most modern laptops in Nepal — even budget ones — can handle this. I run it on my development machine alongside Docker containers without issues.
Step 1: Install Ollama
Ollama makes installation dead simple. Open your terminal and run:
# Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
# Windows (WSL2)
# Install WSL2 first, then run the Linux command above
Verify the installation:
ollama --version
You should see the Ollama version. The service runs in the background automatically.
Step 2: Download the Gemma 4:e2b Model
Gemma 4 is Google's latest open-weight model. The e2b variant is optimized for coding tasks — perfect for a local development assistant.
ollama pull gemma4:e2b
This downloads roughly 2–4 GB. On Nepal's average fiber connection (50–100 Mbps), it takes about 5–10 minutes.
Verify it's ready:
ollama list
You should see gemma4:e2b in the list.
Step 3: Chat with Your Local AI
Test the model right away:
ollama run gemma4:e2b "Write a Dockerfile for a Node.js Express app"
You'll get a response generated entirely on your machine. No cloud. No API calls. No cost.
Interactive Chat Mode
For an ongoing conversation:
ollama run gemma4:e2b
Then just type your questions. Exit with /bye.
Step 4: Use Ollama as an API Server
Ollama runs a local API server at http://localhost:11434. This means any tool that supports OpenAI-compatible APIs can use your local model.
# Test the API
curl http://localhost:11434/api/generate -d '{
"model": "gemma4:e2b",
"prompt": "Explain what Kubernetes does in 2 sentences",
"stream": false
}'
Integrating with VS Code / IDE
Several VS Code extensions support Ollama out of the box:
- Install Continue or Twinny from the VS Code marketplace
- Set the provider to Ollama
- Set the model to
gemma4:e2b - Start coding with AI assistance
These extensions give you:
- Code completion
- Inline explanations
- Refactoring suggestions
- Test generation
All running locally, all free.
Step 5: Optimize for Your Hardware
If Gemma 4:e2b feels slow, here are some tuning tips:
Limit GPU/CPU Usage
# Set number of layers to offload to GPU (if available)
OLLAMA_NUM_GPU_LAYERS=20 ollama serve
# Limit CPU threads
OLLAMA_NUM_PARALLEL=2 ollama serve
Use a Smaller Model
If gemma4:e2b is too heavy, try lighter alternatives:
ollama pull gemma4:1b # 1 billion parameters — runs on anything
ollama pull qwen2.5:3b # Great for coding, moderate RAM usage
Why This Matters for Developers in Nepal
Let me be honest — most AI coding tools are subscription-based, and paying $20/month for GitHub Copilot or $200/month for Claude Pro adds up fast when you're earning in NPR.
Running Ollama locally gives you:
- Zero ongoing costs — download once, use forever
- Privacy — your code never leaves your machine
- Offline access — works without internet (great for Nepal's occasional connectivity issues)
- No rate limits — ask as many questions as you want
- Full control — swap models, tweak parameters, fine-tune if needed
For DevOps Engineers, system administrators, and developers in Nepal, this is a game-changer. You get AI-assisted coding without the monthly bill.
Limitations — Be Realistic
Gemma 4:e2b is impressive but not magical:
- It won't match Claude Sonnet or GPT-4 on complex reasoning
- It can hallucinate on obscure topics
- Context window is smaller than cloud models
But for everyday coding — writing scripts, explaining code, generating boilerplate, debugging — it's genuinely useful. And it's free.
What's Next?
Once you're comfortable with Ollama and Gemma 4, explore:
- Multi-model setups — Run different models for different tasks
- Custom system prompts — Tailor responses for your workflow
- CI/CD integration — Use Ollama in pipelines for automated code review
- Docker deployment — Containerize Ollama for team access
The local AI ecosystem is evolving fast. Getting started now puts you ahead of the curve.
Building AI-powered infrastructure or need help setting up local tooling for your team in Nepal? Let's talk.