
Roadmap Apr 2023 #784

Pinned
Apr 5, 2023 · 7 comments · 9 replies
Discussion options

High-prio

  • Project llama : add LoRA support

Add capabilities for low-rank adaptation of LLaMA models and derivatives (a rough sketch of the LoRA idea is shown right after this list)

  • Project ggml : improve integer quantization

    Make the inference of quantized models faster and more accurate

  • Project ggml : improve threading implementation

    Better utilization of the available CPU resources via improved thread management

  • Start implementing inference of other models and extend ggml operators

    For now, I think it is best to implement basic inference examples in the ggml repo, similar to GPT-2, GPT-J, Cerebras-GPT. There is no need for dedicated repos like llama.cpp, unless a new very cool model appears (edit: I think it just appeared SAM)

  • Add llama_state to allow parallel text generation sessions with a single model

Should be done in a similar way to how it is done in whisper.cpp
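
As a rough sketch of the low-rank adaptation idea mentioned above (plain numpy, not the eventual llama.cpp implementation; all shapes and names are illustrative), merging a LoRA adapter amounts to adding a scaled product of two small matrices to the base weights:

import numpy as np

# Illustrative shapes only: a base weight matrix W (d_out x d_in) and a LoRA
# adapter A (r x d_in), B (d_out x r) with rank r much smaller than d.
d_out, d_in, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
A = rng.standard_normal((r, d_in)).astype(np.float32)
B = rng.standard_normal((d_out, r)).astype(np.float32)

# Merged weights: W' = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * (B @ A)
print(W_merged.shape)  # (512, 512)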

Low-prio


Replies: 7 comments 9 replies

Comment options

My understanding based on the linked discussion is that there will be some "partial" models available, where only some of the tensors are provided. How do we get such models? Are they out yet?

FYI: Outside of LoRA models, Vicuna is also distributed by its creators as a diff between llama-13b and their fine-tuned weights: https://huggingface.co/lmsys/vicuna-13b-delta-v0

See the README as well:

Vicuna Weights
We release Vicuna weights as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the Vicuna weights. Instructions:
Get the original LLaMA weights in the huggingface format by following the instructions here.
Use the following scripts to get Vicuna weights by applying our delta. It will automatically download delta weights from our Hugging Face account.
NOTE: Our released weights are only compatible with the latest main branch of huggingface/transformers. We install the correct version of transformers when fastchat is installed.
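
As an illustration of what "applying the delta" means, the operation is essentially an element-wise sum of the two checkpoints' tensors. The following is only a hedged sketch of the idea with placeholder single-file paths, not the official conversion script (the real weights are Hugging Face checkpoint directories and should be converted with the scripts mentioned in the README):

import torch

# Placeholder file names for illustration only
base = torch.load("llama-13b.pth", map_location="cpu")
delta = torch.load("vicuna-13b-delta.pth", map_location="cpu")

# Vicuna weights = original LLaMA weights + released delta, tensor by tensor
merged = {name: tensor + delta[name] for name, tensor in base.items()}

torch.save(merged, "vicuna-13b.pth")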

0 replies
Comment options

GG->GOAT.

0 replies
Comment options

For now, I think it is best to implement basic inference examples in the ggml repo, similar to GPT-2, GPT-J, Cerebras-GPT. There is no need for dedicated repos like llama.cpp, unless a new very cool model appears (edit: I think it just appeared SAM)

I read your tweet yesterday @ggerganov, and it suddenly crossed my mind to ask whether image segmentation is also part of your roadmap. It turns out it is now one of your high-prio items. Ever since the day you published whisper.cpp, I have been thinking it would be amazing to combine ggml with an image inference model. I have a project that is not specifically about ML, and I don't have much ML experience, but image segmentation is its core functionality. If you decide to implement SAM, I think it will definitely be useful for my little project. How can I keep up with the updates? Thanks

3 replies
Comment options

ggerganov Apr 7, 2023
Maintainer Author

I am very interested in implementing SAM inference with ggml. Had a quick look at the code and I think it is definitely possible and wouldn't be too much work. Curious if 4-bit quantization of the Encoder would work. I just need to find some time to focus, and when I have it running I will post an update on Twitter. Hopefully in the following days
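
For context on what 4-bit quantization involves here, ggml-style formats store one scale per small block of weights plus 4-bit integer codes. The numpy sketch below is a simplified illustration of that block-wise scheme (function names are mine), not the actual Q4_0 memory layout:

import numpy as np

def quantize_q4_like(x: np.ndarray, block: int = 32):
    """Quantize a 1-D float array to 4-bit codes with one scale per block.

    Simplified illustration; the real ggml 4-bit formats pack two codes per
    byte and use a specific in-memory layout.
    """
    x = x.reshape(-1, block)
    # One scale per block so that the largest magnitude maps into -8..7
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = amax / 7.0
    scale[scale == 0] = 1.0
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
codes, scale = quantize_q4_like(w)
err = np.abs(w - dequantize(codes, scale)).mean()
print(f"mean abs quantization error: {err:.4f}")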

Comment options

ggerganov Apr 10, 2023
Maintainer Author

This week I will try to implement SAM inference using ggml. Yesterday, started working on it here: ggml-org/ggml#74

There are a couple of things not clear yet:

  • what is the difference between global and non-global attention in the Encoder
  • how to implement the positional encoding

I hope I will be able to figure these out and have efficient inference available soon. I think that using Accelerate BLAS for the Encoder will provide a significant performance boost, similar to what we had in whisper.cpp.

Anyway, in the meantime, I will not be paying too much attention to llama.cpp in order to focus on SAM.
Hope that the collaborators will help out with the new PRs

Edit: SAM will have to wait until we improve the quantization. It seems much more important at the moment

Comment options

I'm not sure if this is something that you find interesting, but another piece of the ecosystem that I'd personally love to see is "bark.cpp" (https://github.com/suno-ai/bark) -- inference on the CPU is very slow with the distributed weights and inference code.

Comment options

#370 (comment) to allow parallel text generation sessions with a single model

Does that mean that the context can be serialized to disk (e.g. after initial prompt evaluation) and deserialized later on?

If not, could you please point me to the structures responsible for storing the hidden state, so that I can implement initial prompt caching? It would be useful for parameter tuning.
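
For what it's worth, the kind of prompt caching being asked about could conceptually look like the sketch below. Every name here is hypothetical (llama.cpp did not expose such an API at the time of this discussion); it only illustrates evaluating the prompt once, persisting the resulting state, and restoring it later:

import pickle

def cache_prompt(model, prompt: str, path: str) -> None:
    # Hypothetical: eval_prompt() would run the prompt and return the mutable
    # state (KV cache, token history) needed to resume generation later.
    state = model.eval_prompt(prompt)
    with open(path, "wb") as f:
        pickle.dump(state, f)

def generate_from_cache(model, path: str, n_predict: int) -> str:
    with open(path, "rb") as f:
        state = pickle.load(f)
    model.set_state(state)  # hypothetical: restore the saved hidden state
    return model.generate(n_predict=n_predict)  # hypothetical generate call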

1 reply
Comment options

Comment options

Please turn llama.cpp into an HTTP service, as it is hard to interface with its raw power without going through a Python binding, which is very, very slow.

3 replies
Comment options

Yeah, I think the option to run a web API would make integration into other things much easier, similar to how it is with automatic1111's webui.

Comment options

I have a simple version. check here: https://github.com/howard0su/llama.cpp/blob/web/examples/web/worker.cpp

I implemented the web interface in order to use the Vicuna web frontend. If there is more interest, I can polish it.

I didn't finish it, as I would like to have llama_state first so that the web server can support multiple users.
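
Until a proper server exists in the repo, a bare-bones version of this idea can be put together by shelling out to the compiled binary from a small HTTP handler. A hedged sketch using only the Python standard library (binary and model paths are placeholders; it spawns one process per request, with no streaming and no shared state, so it is not the multi-user design discussed above):

import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

LLAMA_BIN = "./main"                        # placeholder path to the compiled llama.cpp binary
MODEL = "./models/7B/ggml-model-q4_0.bin"   # placeholder model path

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        prompt = body.get("prompt", "")
        n_predict = str(body.get("n_predict", 128))
        # Spawn one llama.cpp process per request; simple but not scalable
        out = subprocess.run(
            [LLAMA_BIN, "-m", MODEL, "-p", prompt, "--n_predict", n_predict],
            capture_output=True, text=True,
        )
        payload = json.dumps({"completion": out.stdout}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()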

Comment options

You can also check out https://github.com/go-skynet/llama-cli, which supports multiple models and has an OpenAI-compatible API.

Comment options

Hi, I found that llama_state was edited out of the roadmap.

Is it not in the plan anymore? I am interested because I am trying to build an API service.

Thanks.

0 replies
Comment options

I'm confused about the relationship between LangChain and the LLM. It seems to me that LangChain just talks to the model, so it would be an easy fit for llama.cpp. The agents part could be done at a later date; I'm more interested in prepping the model with a document before doing the lookup (if I'm understanding how it works correctly).

2 replies
Comment options

@iplayfast
I don't know if this is of any use to you, but we subclassed the LLM class from LangChain so that we can use llama.cpp within a LangChain pipeline (see https://python.langchain.com/en/latest/modules/models/llms/examples/custom_llm.html for reference).

This is rudimentary code with a lot of room for improvement, but it currently works for us.
Basically, you need a compiled version of llama.cpp, which gets called from Python via a subprocess routine. The model and executable paths can be passed via a .env file.

import subprocess
from typing import List, Optional

from langchain.llms.base import LLM
from dotenv import dotenv_values

config = dotenv_values(".env")


class LLamaLLM(LLM):
    # Paths to the compiled llama.cpp binary and the model come from the .env file
    llamaExecutablePath = config['LLAMA_EXECUTABLE_PATH']
    modelPath = config['MODEL_PATH']
    # Sampling parameters are passed to the binary as command-line strings
    threads = '8'
    temp = '0.2'
    topK = '10000'
    topP = '0.95'
    repeatLastN = '64'
    repeatPenalty = '1.3'
    promptSize = '4096'
    nPredict = '-1'

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        print(prompt)
        args = [
            self.llamaExecutablePath,
            '-m', self.modelPath,
            '-t', self.threads,
            '--temp', self.temp,
            '--top_k', self.topK,
            '--top_p', self.topP,
            '--repeat_last_n', self.repeatLastN,
            '--repeat_penalty', self.repeatPenalty,
            '--ctx_size', self.promptSize,
            '--n_predict', self.nPredict,
            '-p', prompt,
        ]
        # Run the llama.cpp binary as a subprocess and capture its stdout
        process = subprocess.Popen(args,
                                   stdout=subprocess.PIPE,
                                   universal_newlines=True)
        result = ''
        while True:
            return_code = process.poll()
            if return_code is not None:
                print(return_code)
                # Process has finished, read the rest of the output
                for output in process.stdout.readlines():
                    result += output
                break
        return result
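
A hypothetical usage sketch, assuming the .env file defines LLAMA_EXECUTABLE_PATH and MODEL_PATH and a LangChain version in which LLM instances are directly callable:

llm = LLamaLLM()
# The LLM base class makes the instance callable with a prompt string
print(llm("Q: What is the capital of Bulgaria? A:"))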
Comment options

Thanks, just noticed this now. I'll try it out.
