
Roadmap Apr 2023 #784

Pinned
Apr 5, 2023 · 7 comments · 9 replies
Discussion options

High-prio

  • Project llama : add LoRA support

Add capabilities for low-rank adaptation of LLaMA models and derivatives (a rough sketch of the LoRA idea is shown right after this list)

  • Project ggml : improve integer quantization

    Make the inference of quantized models faster and more accurate

  • Project ggml : improve threading implementation

    Better utilization of the available CPU resources via improved thread management

  • Start implementing inference of other models and extend ggml operators

    For now, I think it is best to implement basic inference examples in the ggml repo, similar to GPT-2, GPT-J, Cerebras-GPT. There is no need for dedicated repos like llama.cpp, unless a new very cool model appears (edit: I think it just appeared SAM)

  • Add llama_state to allow parallel text generation sessions with a single model

Should be done in a similar way to how it is done in whisper.cpp
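
As a rough sketch of the low-rank adaptation idea mentioned above (plain numpy, not the eventual llama.cpp implementation; all shapes and names are illustrative), merging a LoRA adapter amounts to adding a scaled product of two small matrices to the base weights:

import numpy as np

# Illustrative shapes only: a base weight matrix W (d_out x d_in) and a LoRA
# adapter A (r x d_in), B (d_out x r) with rank r much smaller than d.
d_out, d_in, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)
A = rng.standard_normal((r, d_in)).astype(np.float32)
B = rng.standard_normal((d_out, r)).astype(np.float32)

# Merged weights: W' = W + (alpha / r) * B @ A
W_merged = W + (alpha / r) * (B @ A)
print(W_merged.shape)  # (512, 512)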

Low-prio


Replies: 7 comments 9 replies

Comment options

My understanding based on the linked discussion is that there will be some "partial" models available, where only some of the tensors are provided. How do we get such models? Are they out yet?

FYI: Outside of LoRA models, Vicuna is also distributed by its creators as a diff between llama-13b and their fine-tuned weights: https://huggingface.co/lmsys/vicuna-13b-delta-v0

See the README as well:

Vicuna Weights
We release Vicuna weights as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the Vicuna weights. Instructions:
Get the original LLaMA weights in the huggingface format by following the instructions here.
Use the following scripts to get Vicuna weights by applying our delta. It will automatically download delta weights from our Hugging Face account.
NOTE: Our released weights are only compatible with the latest main branch of huggingface/transformers. We install the correct version of transformers when fastchat is installed.
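
As an illustration of what "applying the delta" means, the operation is essentially an element-wise sum of the two checkpoints' tensors. The following is only a hedged sketch of the idea with placeholder single-file paths, not the official conversion script (the real weights are Hugging Face checkpoint directories and should be converted with the scripts mentioned in the README):

import torch

# Placeholder file names for illustration only
base = torch.load("llama-13b.pth", map_location="cpu")
delta = torch.load("vicuna-13b-delta.pth", map_location="cpu")

# Vicuna weights = original LLaMA weights + released delta, tensor by tensor
merged = {name: tensor + delta[name] for name, tensor in base.items()}

torch.save(merged, "vicuna-13b.pth")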

0 replies
Comment options

GG->GOAT.

0 replies
Comment options

For now, I think it is best to implement basic inference examples in the ggml repo, similar to GPT-2, GPT-J, Cerebras-GPT. There is no need for dedicated repos like llama.cpp, unless a new very cool model appears (edit: I think it just appeared SAM)

I read your tweet yesterday @ggerganov, and it suddenly crossed my mind to ask whether image segmentation is also part of your roadmap. It turns out it is now one of your high-prio items. Ever since the day you published whisper.cpp, I have been thinking it would be amazing to combine ggml with an image inference model. I have a project that is not specifically about ML, and I don't have much ML experience, but image segmentation is its core functionality. If you decide to implement SAM, I think it will definitely be useful for my little project. How can I keep up with the updates? Thanks

3 replies
Comment options

ggerganov Apr 7, 2023
Maintainer Author

I am very interested in implementing SAM inference with ggml. Had a quick look at the code and I think it is definitely possible and wouldn't be too much work. Curious if 4-bit quantization of the Encoder would work. I just need to find some time to focus, and when I have it running I will post an update on Twitter. Hopefully in the following days
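
For context on what 4-bit quantization involves here, ggml-style formats store one scale per small block of weights plus 4-bit integer codes. The numpy sketch below is a simplified illustration of that block-wise scheme (function names are mine), not the actual Q4_0 memory layout:

import numpy as np

def quantize_q4_like(x: np.ndarray, block: int = 32):
    """Quantize a 1-D float array to 4-bit codes with one scale per block.

    Simplified illustration; the real ggml 4-bit formats pack two codes per
    byte and use a specific in-memory layout.
    """
    x = x.reshape(-1, block)
    # One scale per block so that the largest magnitude maps into -8..7
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = amax / 7.0
    scale[scale == 0] = 1.0
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (codes.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
codes, scale = quantize_q4_like(w)
err = np.abs(w - dequantize(codes, scale)).mean()
print(f"mean abs quantization error: {err:.4f}")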

Comment options

ggerganov Apr 10, 2023
Maintainer Author

This week I will try to implement SAM inference using ggml. Yesterday, started working on it here: ggml-org/ggml#74

There are a couple of things not clear yet:

  • what is the difference between global and non-global attention in the Encoder
  • how to implement the positional encoding

I hope I will be able to figure these out and have efficient inference available soon. I think that using Accelerate BLAS for the Encoder will provide a significant performance boost, similar to what we had in whisper.cpp.

Anyway, in the meantime, I will not be paying too much attention to llama.cpp in order to focus on SAM.
Hope that the collaborators will help out with the new PRs

Edit: SAM will have to wait until we improve the quantization. It seems much more important at the moment

Comment options

I'm not sure if this is something that you find interesting, but another piece of the ecosystem that I'd personally love to see is "bark.cpp" (https://github.com/suno-ai/bark) -- inference on the CPU is very slow with the distributed weights and inference code.

Comment options

#370 (comment) to allow parallel text generation sessions with a single model

Does that mean that the context can be serialized to disk (e.g. after initial prompt evaluation) and deserialized later on?

If not, could you please point me to the structures responsible for storing the hidden state, so that I can implement initial prompt caching? It would be useful for parameter tuning.
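
For what it's worth, the kind of prompt caching being asked about could conceptually look like the sketch below. Every name here is hypothetical (llama.cpp did not expose such an API at the time of this discussion); it only illustrates evaluating the prompt once, persisting the resulting state, and restoring it later:

import pickle

def cache_prompt(model, prompt: str, path: str) -> None:
    # Hypothetical: eval_prompt() would run the prompt and return the mutable
    # state (KV cache, token history) needed to resume generation later.
    state = model.eval_prompt(prompt)
    with open(path, "wb") as f:
        pickle.dump(state, f)

def generate_from_cache(model, path: str, n_predict: int) -> str:
    with open(path, "rb") as f:
        state = pickle.load(f)
    model.set_state(state)  # hypothetical: restore the saved hidden state
    return model.generate(n_predict=n_predict)  # hypothetical generate call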

1 reply
Comment options

Comment options

Please turn llama.cpp into an HTTP service, as it is hard to interface with its raw power without going through a Python binding, which is very, very slow.

3 replies
Comment options

Yeah, I think the option to run a web API would make integration into other things much easier, similar to how it is with automatic1111's webui.

Comment options

I have a simple version. check here: https://github.com/howard0su/llama.cpp/blob/web/examples/web/worker.cpp

I implemented the web interface in order to use the Vicuna web frontend. If there is more interest, I can polish it.

I didn't finish it, as I would like to have llama_state first so that the web server can support multiple users.
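
Until a proper server exists in the repo, a bare-bones version of this idea can be put together by shelling out to the compiled binary from a small HTTP handler. A hedged sketch using only the Python standard library (binary and model paths are placeholders; it spawns one process per request, with no streaming and no shared state, so it is not the multi-user design discussed above):

import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

LLAMA_BIN = "./main"                        # placeholder path to the compiled llama.cpp binary
MODEL = "./models/7B/ggml-model-q4_0.bin"   # placeholder model path

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        prompt = body.get("prompt", "")
        n_predict = str(body.get("n_predict", 128))
        # Spawn one llama.cpp process per request; simple but not scalable
        out = subprocess.run(
            [LLAMA_BIN, "-m", MODEL, "-p", prompt, "--n_predict", n_predict],
            capture_output=True, text=True,
        )
        payload = json.dumps({"completion": out.stdout}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()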

Comment options

You can also check out https://github.com/go-skynet/llama-cli, which supports multiple models and has an OpenAI-compatible API.

Comment options

Hi, I found that llama_state was edited out of the roadmap.

Is it not in the plan anymore? I am interested because I am trying to build an API service.

Thanks.

0 replies
Comment options

I'm confused about the relationship between LangChain and the LLM. It seems to me that LangChain just talks to the model, so it would be an easy fit for llama.cpp. The agents part could be done at a later date; I'm more interested in prepping the model with a document before doing the lookup (if I'm understanding how it works correctly).

2 replies
Comment options

@iplayfast
I don't know if this is of any use to you, but we subclassed the LLM class from LangChain so that we can use llama.cpp within a LangChain pipeline (see https://python.langchain.com/en/latest/modules/models/llms/examples/custom_llm.html for reference).

This is rudimentary code with a lot of room for improvement, but it currently works for us.
Basically, you need a compiled version of llama.cpp, which gets called from Python via a subprocess routine. The model and executable paths can be passed via a .env file.

import subprocess
from typing import List, Optional

from langchain.llms.base import LLM
from dotenv import dotenv_values

config = dotenv_values(".env")


class LLamaLLM(LLM):
    # Paths to the compiled llama.cpp binary and the model come from the .env file
    llamaExecutablePath = config['LLAMA_EXECUTABLE_PATH']
    modelPath = config['MODEL_PATH']
    # Sampling parameters are passed to the binary as command-line strings
    threads = '8'
    temp = '0.2'
    topK = '10000'
    topP = '0.95'
    repeatLastN = '64'
    repeatPenalty = '1.3'
    promptSize = '4096'
    nPredict = '-1'

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        print(prompt)
        args = [
            self.llamaExecutablePath,
            '-m', self.modelPath,
            '-t', self.threads,
            '--temp', self.temp,
            '--top_k', self.topK,
            '--top_p', self.topP,
            '--repeat_last_n', self.repeatLastN,
            '--repeat_penalty', self.repeatPenalty,
            '--ctx_size', self.promptSize,
            '--n_predict', self.nPredict,
            '-p', prompt,
        ]
        # Run the llama.cpp binary as a subprocess and capture its stdout
        process = subprocess.Popen(args,
                                   stdout=subprocess.PIPE,
                                   universal_newlines=True)
        result = ''
        while True:
            return_code = process.poll()
            if return_code is not None:
                print(return_code)
                # Process has finished, read the rest of the output
                for output in process.stdout.readlines():
                    result += output
                break
        return result
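
A hypothetical usage sketch, assuming the .env file defines LLAMA_EXECUTABLE_PATH and MODEL_PATH and a LangChain version in which LLM instances are directly callable:

llm = LLamaLLM()
# The LLM base class makes the instance callable with a prompt string
print(llm("Q: What is the capital of Bulgaria? A:"))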
Comment options

Thanks, just noticed this now. I'll try it out.
