
server : add VSCode's Github Copilot Chat support #12896


Merged: 2 commits merged into master from gg/vscode-integration on Apr 11, 2025

Conversation

@ggerganov (Member) commented Apr 11, 2025

Overview

VSCode recently added support for using local models with GitHub Copilot Chat:

https://code.visualstudio.com/updates/v1_99#_bring-your-own-key-byok-preview

This PR makes llama-server compatible with this feature.

Usage

  • Start a llama-server on port 11434 with an instruct model of your choice (a quick sanity check for the running server is sketched after this list). For example, using Qwen 2.5 Coder Instruct 3B:

    # downloads ~3GB of data
    
    llama-server \
        -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF \
        --port 11434 -fa -ngl 99 -c 0
  • In VSCode -> Chat -> Manage models -> select "Ollama" (not sure why it is called like this):

    (screenshot)

  • Select the available model from the list and click "OK":

    (screenshot)

  • Enjoy local AI assistance using vanilla llama.cpp:

    (screenshot)

  • Advanced context reuse for faster prompt reprocessing can be enabled by adding --cache-reuse 256 to the llama-server command

  • Speculative decoding is also supported. Simply start llama-server like this, for example:

    llama-server \
        -m  ./models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
        -md ./models/qwen2.5-1.5b-coder-instruct/ggml-model-q4_0.gguf \
        --port 11434 -fa -ngl 99 -ngld 99 -c 0 --cache-reuse 256
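Before configuring VSCode, it can be worth checking that the server from the first step is actually reachable on port 11434. A minimal sketch, assuming the default llama-server HTTP endpoints (/health and the OpenAI-compatible /v1/models; verify against your build):

    # should answer with a small JSON status object once the model is loaded
    curl http://localhost:11434/health

    # lists the loaded model through the OpenAI-compatible API
    curl http://localhost:11434/v1/models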

@ggerganov merged commit c94085d into master Apr 11, 2025
50 checks passed
@ggerganov deleted the gg/vscode-integration branch April 11, 2025 20:37
@ExtReMLapin (Contributor) commented:

select "Ollama" (not sure why it is called like this):

Sounds like someone just got Edison'd 🤡

@ericcurtin (Collaborator) commented Apr 16, 2025

There are a lot of tools like this that work but don't explicitly mention llama.cpp; open-webui is another one (ramalama serve is just vanilla llama-server, but we try to make it easier to use and easier to pull accelerator runtimes and models):

https://github.com/open-webui/docs/pull/455/files

In RamaLama we are going to create a proxy that forks llama-server processes to mimic Ollama, to make everyday llama-server even easier to use.

With most tools, if you select a generic OpenAI endpoint, llama-server works.
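For illustration, a minimal sketch of such a generic OpenAI-style request against a local llama-server (assuming the /v1/chat/completions endpoint and the port 11434 used above):

    curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "messages": [
                {"role": "user", "content": "Write a C function that reverses a string."}
              ]
            }'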

colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025:
* server : add VSCode's Github Copilot Chat support

* cont : update handler name
@kabakaev commented:
@ggerganov, it seems the GET /api/tags API is missing.

At least, my vscode-insiders with github.copilot version 1.308.1532 (updated 2025-04-25, 18:46:22) requests /api/tags and gets an HTTP 404 response.
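For reference, one way to reproduce this from the command line (a sketch; exact output depends on the server version):

    # Ollama-style model listing that the Copilot Chat BYOK flow queries
    curl -i http://localhost:11434/api/tags
    # at the time of this comment, llama-server responds with HTTP/1.1 404 Not Found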
