
server : add VSCode's Github Copilot Chat support #12896


Merged: 2 commits merged into master from gg/vscode-integration on Apr 11, 2025

Conversation

@ggerganov (Member) commented Apr 11, 2025

Overview

VSCode recently added support for using local models with GitHub Copilot Chat:

https://code.visualstudio.com/updates/v1_99#_bring-your-own-key-byok-preview

This PR makes llama-server compatible with this feature.

Usage

  • Start a llama-server on port 11434 with an instruct model of your choice (a quick sanity check for the running server is sketched after this list). For example, using Qwen 2.5 Coder Instruct 3B:

    # downloads ~3GB of data
    
    llama-server \
        -hf ggml-org/Qwen2.5-Coder-3B-Instruct-Q8_0-GGUF \
        --port 11434 -fa -ngl 99 -c 0
  • In VSCode -> Chat -> Manage models -> select "Ollama" (not sure why it is called like this):

    (screenshot)

  • Select the available model from the list and click "OK":

    (screenshot)

  • Enjoy local AI assistance using vanilla llama.cpp:

    (screenshot)

  • Advanced context reuse for faster prompt reprocessing can be enabled by adding --cache-reuse 256 to the llama-server command

  • Speculative decoding is also supported. Simply start llama-server like this, for example:

    llama-server \
        -m  ./models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
        -md ./models/qwen2.5-1.5b-coder-instruct/ggml-model-q4_0.gguf \
        --port 11434 -fa -ngl 99 -ngld 99 -c 0 --cache-reuse 256
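Before configuring VSCode, it can be worth checking that the server from the first step is actually reachable on port 11434. A minimal sketch, assuming the default llama-server HTTP endpoints (/health and the OpenAI-compatible /v1/models; verify against your build):

    # should answer with a small JSON status object once the model is loaded
    curl http://localhost:11434/health

    # lists the loaded model through the OpenAI-compatible API
    curl http://localhost:11434/v1/models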

@ggerganov merged commit c94085d into master Apr 11, 2025
50 checks passed
@ggerganov deleted the gg/vscode-integration branch April 11, 2025 20:37
@ExtReMLapin (Contributor) commented:

select "Ollama" (not sure why it is called like this):

Sounds like someone just got Edison'd 🤡

@ericcurtin (Collaborator) commented Apr 16, 2025

There are a lot of tools like this that work but don't explicitly mention llama.cpp; open-webui is another one (ramalama serve is just vanilla llama-server, but we try to make it easier to use and easier to pull accelerator runtimes and models):

https://github.com/open-webui/docs/pull/455/files

In RamaLama we are going to create a proxy that forks llama-server processes to mimic Ollama, to make everyday llama-server even easier to use.

With most tools, if you select a generic OpenAI endpoint, llama-server works.
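For illustration, a minimal sketch of such a generic OpenAI-style request against a local llama-server (assuming the /v1/chat/completions endpoint and the port 11434 used above):

    curl http://localhost:11434/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "messages": [
                {"role": "user", "content": "Write a C function that reverses a string."}
              ]
            }'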

colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025:
* server : add VSCode's Github Copilot Chat support

* cont : update handler name
@kabakaev commented:
@ggerganov, it seems the GET /api/tags API is missing.

At least, my vscode-insiders with github.copilot version 1.308.1532 (updated 2025-04-25, 18:46:22) requests /api/tags and gets an HTTP 404 response.
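For reference, one way to reproduce this from the command line (a sketch; exact output depends on the server version):

    # Ollama-style model listing that the Copilot Chat BYOK flow queries
    curl -i http://localhost:11434/api/tags
    # at the time of this comment, llama-server responds with HTTP/1.1 404 Not Found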
