Oort Labs Team

Exposed AI servers: how LLMjacking happens (and how to stop it)

Running Ollama, vLLM, or an MCP server? Here’s how self-hosted LLMs end up exposed, what attackers do, and the practical hardening steps that reduce risk.

AI Security · Infrastructure Security · LLM · Zero Trust

TL;DR

  • Exposed inference servers are the new cryptojacking target: attackers steal GPU time (“LLMjacking”) and sometimes the host.
  • Most exposures are self-inflicted: binding to 0.0.0.0, port-forwarding from WSL/Docker, and assuming VLANs are a security boundary.
  • Treat your inference server like a database: no public bind, authentication at the edge, network allowlists, patching, and logging.
  • Agentic deployments (MCP/tools/“exec”) raise the stakes: exposure can turn into full remote code execution.
  • A minimal safe pattern is: inference on 127.0.0.1, a hardened reverse proxy with auth, and access only over VPN/ZTNA.

Self-hosting LLMs used to be a hobbyist thing: run a local model, test a prompt, shut it down. In 2026 it’s different. Teams are hosting inference engines (Ollama, vLLM, SGLang, OpenAI-compatible APIs) on VMs and GPU nodes because it’s cheaper, faster, and feels “more private” than an external API.

The problem is that inference servers are still servers - and when they get exposed, they’re exposed like any other unauthenticated backend.

If you’ve ever done something like this:

export OLLAMA_HOST=0.0.0.0

…to make a local UI or container “just work”, you’ve touched the core failure mode. It’s not that 0.0.0.0 is evil. It’s that it’s easy to do accidentally, and hard to notice later - especially once it’s running on a cloud VM with a public IP.

When these incidents show up, the story is usually boring:

  • “We exposed it temporarily for testing.”
  • “We put it on a VLAN so it was fine.”
  • “We added TLS, so it was secure.”

It’s boring right up until your GPU node is pegged at 100% with prompts you didn’t send.

This post breaks down:

  • how self-hosted LLM endpoints end up on the public internet,
  • what attackers do when they find them,
  • and the hardening steps that actually change your risk profile.

If you’re building agentic systems, pair this with our post on prompt injection and the introduction to Silo.

What “exposed AI server” means in practice

An exposed AI server is simply an inference endpoint that accepts requests from an untrusted network.

Common examples:

  • an Ollama API listening on port 11434 on a public IP
  • a vLLM OpenAI-compatible API on port 8000 reachable from the internet
  • an internal MCP server reachable from any laptop on the corporate network

Sometimes the impact is “just” stolen compute. Sometimes it’s worse: model theft, credential exposure, and full host compromise.

Self-hosted LLM security: why exposure keeps happening

Most inference engines start safe-ish by default (loopback only, no external bind). Exposure usually happens when someone tries to make it “work across machines” and reaches for the quickest fix.

Here are the patterns we see most often:

| Exposure pattern | Why it happens | Why it's dangerous |
| --- | --- | --- |
| Binding to 0.0.0.0 | Someone wants Docker containers or coworkers to reach the model | You've made the endpoint globally reachable if the host has a public interface |
| Port forwarding (WSL/Docker/SSH) | Local dev workflows need bridging | You accidentally bypass host firewall assumptions |
| "It's on a VLAN" | Teams confuse segmentation with authentication | Any compromised device on that segment can query the model |
| "It's behind a reverse proxy" (without auth) | Proxy added for TLS or convenience | TLS without identity still allows anyone to use it |

If your model server has no auth (many don’t, by design), you have to add the security boundary yourself.

Ollama security and vLLM security: safer deployment patterns

If you only take one thing from this post, take this: make the safe thing the default, so nobody has to remember to be careful.

Here are practical patterns that work across most environments.

Docker: bind the port to loopback

If you must publish a port, publish it to 127.0.0.1, not the world:

# Only accessible from the host machine
docker run \
	-p 127.0.0.1:11434:11434 \
	ollama/ollama

This single detail prevents “I opened the port” from turning into “the internet can reach it.”
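The same idea applies if you run the container through Compose. A minimal sketch (image and port taken from the example above):

```yaml
# docker-compose.yml - publish the port to loopback only
services:
  ollama:
    image: ollama/ollama
    ports:
      - "127.0.0.1:11434:11434"   # only the host itself can connect
```

A bare `"11434:11434"` here is equivalent to `-p 11434:11434` and binds to all interfaces.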

systemd: be explicit about the bind address

If you’re configuring Ollama with environment variables, prefer an explicit loopback bind:

export OLLAMA_HOST=127.0.0.1:11434

If you intentionally bind to 0.0.0.0, do it with your eyes open and pair it with firewall rules + authenticated ingress.
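If Ollama runs as a systemd service, the bind address belongs in a unit drop-in rather than a shell profile, so it survives restarts and reboots. A sketch, assuming the service is named `ollama.service` as installed by the official script:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# create it with: sudo systemctl edit ollama
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
```

Then `sudo systemctl daemon-reload && sudo systemctl restart ollama`, and verify the bind with `ss -ltnp`.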

Kubernetes: avoid public Services by default

In Kubernetes, the most common accidental exposure is using type: LoadBalancer or a public NodePort for convenience.

Safer defaults:

  • Service is ClusterIP
  • access goes through an Ingress that enforces identity (or via internal network only)

If your threat model includes “any workstation on the corporate network might be compromised,” treat the cluster network as untrusted too.
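A minimal sketch of that safer default (names and labels are placeholders; assumes a vLLM pod labeled `app: vllm` serving on port 8000):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm
spec:
  type: ClusterIP   # no external IP; reachable only inside the cluster
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
```

Anything external then has to come through an Ingress or gateway, which is exactly where you enforce identity.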

How to tell if you’re exposed (a 5-minute check)

You don’t need an audit program to get a fast signal. You just need to answer two questions:

  1. Is anything listening on a non-loopback interface?
  2. Can something untrusted reach it?

Step 1: find listening ports

On macOS (dev laptops):

lsof -nP -iTCP -sTCP:LISTEN | egrep ':(11434|8000|8080)\b' || true

On Linux (most VMs):

ss -ltnp | egrep ':(11434|8000|8080)\b' || true

If you see 0.0.0.0:11434 or *:8000, that’s a red flag. If you only see 127.0.0.1:11434, that’s usually what you want.

Step 2: confirm external reachability

From another machine on the same network (or a controlled external host), test the port explicitly:

curl -sS -m 2 http://YOUR_HOSTNAME_OR_IP:11434/ | head

If you get a response, assume scanners will too.

Step 3: check your cloud firewall / security group

Even if the service is listening broadly, a strict firewall can still save you. In AWS terms, you want inbound rules that look like:

  • allow 11434/8000 only from VPN/ZTNA CIDRs
  • deny from 0.0.0.0/0

If you’re not sure, treat “open to the world” as the default until proven otherwise.
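In Terraform, that intent can be sketched roughly like this (security group name and VPN CIDR are placeholders). Note that AWS security groups are default-deny inbound, so "deny from 0.0.0.0/0" is implicit as long as no broad allow rule exists:

```hcl
# Allow the inference port only from the VPN/ZTNA range.
resource "aws_security_group_rule" "inference_from_vpn" {
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 11434
  to_port           = 11434
  cidr_blocks       = ["10.8.0.0/16"] # placeholder: your VPN range
  security_group_id = aws_security_group.inference.id
}
```

The dangerous pattern to grep for in existing configs is `cidr_blocks = ["0.0.0.0/0"]` on an inference port.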

What attackers do once they find an exposed endpoint

Think of exposed inference servers as high-value, low-friction infrastructure. Attackers don’t need to phish anyone or exploit a browser. They just send HTTP requests.

In the real world, that tends to fall into three buckets:

1) Steal compute (LLMjacking)

The simplest attack is also the most common kind of abuse: attackers route their own prompts through your GPUs.

Why this works as a business model:

  • inference is expensive and easy to monetize
  • stolen endpoints can be “resold” as cheap API access
  • the victim pays the bill (cloud GPU costs, power, or degraded performance)

2) Escalate from “free inference” to “owned host”

Once a service is exposed, vulnerabilities in the inference engine become remotely reachable.

For example, several Ollama issues have been publicly tracked in the NVD, including:

  • CVE-2024-37032: path traversal via digest validation issues (“Probllama”)
  • CVE-2024-39721: resource exhaustion/DoS via blocking file reads
  • CVE-2024-39722: path traversal leading to file existence disclosure

And supply-chain patterns show up too. One example: CVE-2024-50050 covers a remote code execution risk in Meta's Llama Stack from deserializing pickle data received over sockets (since changed to JSON).

The exact exploit chain varies by project and version. The takeaway is consistent: exposure turns implementation bugs into remote compromise.

3) Pivot with “agents” (MCP/tools/exec)

An LLM server that only generates text is already risky when exposed. An agentic server that can run tools is worse.

If a model can:

  • read local files
  • call internal APIs
  • execute shell commands
  • access secrets from environment variables

…then exposing the endpoint is not “someone got free chat completions.” It’s closer to exposing a remote admin API.

OWASP frames this clearly: prompt injection + excessive agency is where incidents happen, because untrusted text can steer actions. If you haven’t read it yet, OWASP’s LLM Top 10 is worth having in every AI project doc.

The minimal safe architecture for self-hosted LLMs

You don’t need a perfect setup on day one. You need a setup that eliminates the catastrophic failure modes.

Here’s the baseline pattern that holds up:

  1. Bind inference to loopback (127.0.0.1) or a private-only interface.
  2. Put a hardened reverse proxy in front for TLS + authentication.
  3. Restrict network access (VPN/ZTNA, IP allowlists, security groups).
  4. Run as least privilege (dedicated user, minimal filesystem access).
  5. Log and monitor (requests, auth failures, unusual prompt volume).

If you want a simple rule of thumb: your inference port should not be reachable from the open internet, ever. If you need multi-user access, put it behind identity and network control.

Reverse proxy notes (don’t break streaming)

LLM responses often stream token-by-token using SSE or chunked responses. Reverse proxies can accidentally buffer those responses and “hang” the UI.

Here’s an Nginx skeleton that keeps streaming working. (You still need to add authentication - OAuth2, mTLS, or basic auth - based on your environment.)

location / {
	proxy_pass http://127.0.0.1:11434;
	proxy_http_version 1.1;
	proxy_set_header Connection "";
	proxy_buffering off;
	proxy_read_timeout 600s;
}

If you can’t add strong identity easily, don’t publish the endpoint at all. Put it behind a VPN tunnel and treat it as internal-only.
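If basic auth is the pragmatic choice, the additions to the skeleton above are small (realm and file path are placeholders; create the file with `htpasswd -c /etc/nginx/ollama.htpasswd someuser`):

```nginx
location / {
	# reject anonymous callers before proxying
	auth_basic           "inference";
	auth_basic_user_file /etc/nginx/ollama.htpasswd;

	proxy_pass http://127.0.0.1:11434;
	proxy_http_version 1.1;
	proxy_set_header Connection "";
	proxy_buffering off;
	proxy_read_timeout 600s;
}
```

Basic auth over TLS is a floor, not a ceiling - but it already separates "anyone who can reach the port" from "anyone who holds a credential."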

If you think you’ve been LLMjacked (what to do next)

If you discover an exposed endpoint, don’t stop at “close the port.” Assume someone may have used it.

Here’s a practical response sequence:

  1. Contain: block inbound access at the firewall/security group first (fastest lever).
  2. Snapshot evidence: pull logs (proxy logs, service logs, cloud flow logs) before you redeploy.
  3. Rotate exposed secrets: treat any secrets on that host as potentially compromised.
  4. Patch and redeploy: update the inference engine and dependencies, then re-launch on a clean image.
  5. Look for secondary access: check for unusual SSH keys, cron jobs, new users, or outbound connections.

The uncomfortable truth: when an unauthenticated service is exposed, you rarely get to know “it was only used for chat.” Plan your response accordingly.
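Step 5 is the one teams skip. A read-only triage sketch for a Linux host (run as root; paths assume a stock distro layout, so adapt as needed):

```shell
#!/bin/sh
# Read-only checks for signs of secondary access. Nothing here modifies state.

# 1. SSH keys that shouldn't be there
for f in /root/.ssh/authorized_keys /home/*/.ssh/authorized_keys; do
  if [ -f "$f" ]; then
    printf '== %s ==\n' "$f"
    cat "$f"
  fi
done

# 2. Cron entries for every account (requires root for other users)
for u in $(cut -d: -f1 /etc/passwd); do
  crontab -l -u "$u" 2>/dev/null | sed "s|^|[$u] |"
done

# 3. Accounts that can actually log in (shell is not nologin/false)
awk -F: '$7 !~ /(nologin|false)$/ {print $1, $3, $7}' /etc/passwd

# 4. Current established TCP connections (look for unexpected outbound)
ss -tnp 2>/dev/null || true
```

Diff the output against a known-good host or a pre-incident snapshot rather than eyeballing it in isolation.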

A practical hardening checklist

If you want one “publisher-approved” checklist to hand to an engineering team, it’s this:

  • Exposure: confirm no inference ports are reachable from the public internet.
  • Identity: require authentication outside the model (proxy/gateway), not “secret prompt instructions.”
  • Network: allowlist callers (VPN/ZTNA + security group rules). Block the world by default.
  • Patching: track versions and patch cadence (treat inference engines like any other internet-exposed service).
  • Agent safety: if tools exist, add a policy gate and least-privilege tool permissions.
  • Secrets: avoid mounting broad credential directories; don’t run agents with developer workstation access.
  • Observability: log request volume, origins, error spikes, and long-running sessions.

If you’re shipping this into a team, add one more operational item: scan your own IP space. You can do it with internal tooling or external monitoring, but the important part is having an alert for “we accidentally exposed 11434 again.”
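A crude version of that scan needs nothing beyond `nc` (host list and ports below are placeholders; a real deployment would feed this from inventory and page on any hit):

```shell
#!/bin/sh
# Flag any host in our ranges answering on a known inference port.
HOSTS="10.0.1.10 10.0.1.11"   # placeholder: your GPU/inference hosts
PORTS="11434 8000 8080"

for h in $HOSTS; do
  for p in $PORTS; do
    # nc -z: connect-only probe, no data sent; -w 1: one-second timeout
    if nc -z -w 1 "$h" "$p" 2>/dev/null; then
      echo "OPEN $h:$p - verify this is intentional"
    fi
  done
done
```

Run it from outside the trusted network too: "closed from the office" and "closed from the internet" are different claims.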

Key takeaways

  • “Self-hosted” does not mean “private” if the bind address is public.
  • Segmentation helps, but it’s not authentication.
  • If the server has no auth, you must supply the auth boundary.
  • Agentic systems convert “bad prompts” into real-world actions. That’s why you must constrain tools and privileges.

FAQ

Is binding to 0.0.0.0 always wrong?

Not always - but it’s almost always the moment you should slow down. If you bind broadly, you need compensating controls: strict firewall rules, authenticated ingress, and strong logging.

“It’s behind TLS” - isn’t that enough?

TLS protects the connection. It does not tell you who is connecting. If anyone can reach the endpoint, anyone can use it.

If I put it behind a VPN, am I safe?

You’re safer. But you still need least privilege and patching, because internal compromise is a real scenario. VPN/ZTNA is a great first layer, not a reason to stop.

Do I need to worry about prompt injection if I’m only self-hosting?

Yes - especially if your deployment is agentic (tools/MCP). Prompt injection is about untrusted instructions in context, not about where the model runs. See Prompt injection is the new SQL injection.

Further reading