LLM AI Server with llama.cpp

Name: LLM AI Server with llama.cpp
Availability: InStock
Author: Mick Lab - Mitsuo Kuroda

Mick Lab - Mitsuo Kuroda

Everyone

1K+

Downloads

Everyone

Learn more

About this app

1. What This App Can Do
This app is a fully local LLM AI Server for Android, enabling private, offline text and multimodal generation.
It supports Hugging Face search, GGUF downloads, local model loading, and now MTP decoding for faster inference on supported models.
User friendly UI and 9 languages are supported.

Compatible families include Gemma‑4 / Gemma‑4 Vision, Qwen / Qwen Vision, Mistral, LLaMA, Phi, Bonsai etc.

A built‑in Ollama‑compatible and OpenAI‑compatible API server provides /api/chat, /api/generate, /v1/chat/completions, /v1/models, and new embedding endpoints (/api/embed, /api/embeddings, /v1/embeddings).
Models can be accessed by apps on the device or by other devices on the same network.

The WebUI supports model switching, parameter editing, log viewing, and PWA installation for home‑screen launch.
For multimodal models, compatible mmproj/projector files are automatically detected.

MCP and Function Calling enable tool‑augmented workflows and structured responses.
Structured output is supported via GBNF and JSON Schema.

2. Intended Users and Supported Devices
Designed for users who want a private, fully local LLM environment:

- Offline‑first users
- Developers integrating a local backend
- Advanced users needing fine‑grained sampling control
- Researchers testing inference behavior
- Privacy‑focused users

The app runs on a wide range of Android devices.
Performance can be tuned via context size, threads, batch size, GPU offload, and optional MTP speculative decoding.

3. Key Features
- Hugging Face model search & download
- Local GGUF loading
- Fully offline LLM server
- Gemma‑4 Vision & Qwen Vision projector detection
- MTP speculative decoding (optional, model‑dependent)
- Detailed parameter control (Mirostat, DRY, XTC, Min‑p, Typical‑p, penalties, dynamic temperature)
- Integrated Ollama / OpenAI‑compatible API server
└ Accessible by local apps and other devices on the same network
- Embedding API
- Structured output via GBNF / JSON Schema
- Automatic template selection
- Streaming / non‑streaming output
- Timestamped logs
- Enhanced WebUI with multi‑language support (9 languages)
- PWA support
- MCP and Function Calling
- Improved GPU inference performance

4. Getting Started
1. Open Settings.
2. Search Hugging Face for a GGUF model or import a local file.
3. Adjust parameters, including optional MTP settings.
4. Tap Save Config, then SAVE & CLOSE to load the model.

5. Settings Highlights
- Save / load / delete configurations
- Hugging Face search, URL input, or local import
- Parameter control (context, temperature, penalties, Mirostat, DRY, XTC)
- MTP settings (draft count, head selection)
- Streaming toggle
- Auto/custom templates
- API/WebUI port settings
- MCP and Function Calling
- Log level selection
- Language switching (9 languages)
- Manual & privacy policy access

6. Prompt Templates and Stop Sequences
Templates are auto‑selected from GGUF metadata or filename.
Supported families include Gemma, Qwen, Mistral, LLaMA, Phi, Bonsai, etc.

Gemma‑4 may repeat short phrases; stronger penalties or explicit anti‑repetition instructions help.

7. API Server Capabilities
Available endpoints:

- /api/chat
- /api/generate
- /api/tags
- /v1/chat/completions
- /v1/models
- /api/embed
- /api/embeddings
- /v1/embeddings
- /props, /slots
- WebUI at http://:/

One generation runs at a time; up to 10 requests queue.
Android 13+ may require notification permission.

8. How This App Stands Out
- Direct Hugging Face search & download
- Multi‑language support
- Web UI
- Reliable GGUF loading
- Multimodal Gemma‑4 Vision & Qwen Vision support
- Detailed parameter & MTP control
- Integrated Ollama / OpenAI‑compatible API server
- Embedding API
- Structured output via GBNF / JSON Schema
- PWA support
- Optimized GPU inference
- MCP and Function Calling

Updated on

Jul 25, 2026

Data safety

Safety starts with understanding how developers collect and share your data. Data privacy and security practices may vary based on your use, region, and age. The developer provided this information and may update it over time.