Roadmap Apr 2023 #784
-
High-prio
-
Project llama : add LoRA support
Add capabilities for low-rank adaptation of LLaMA models and derivatives
-
Project ggml : improve integer quantization
Make the inference of quantized models faster and more accurate (a rough block-wise sketch is included after this list)
-
Project ggml : improve threading implementation
Better utilization of the available CPU resources via improved thread management
-
Start implementing inference of other models and extend ggml operators
For now, I think it is best to implement basic inference examples in the ggml repo, similar to GPT-2, GPT-J, Cerebras-GPT. There is no need for dedicated repos like llama.cpp, unless a very cool new model appears (edit: I think it just appeared - SAM)
-
Add llama_state to allow parallel text generation sessions with a single model
Should be done in a similar way to how it is done in whisper.cpp
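To make the quantization item a bit more concrete, here is a rough, simplified sketch of block-wise integer quantization in numpy. It only illustrates the idea of one scale per small block of weights; the exact ggml formats, block layout, and rounding differ:

```python
import numpy as np

QK = 32  # elements per quantization block (ggml uses blocks of 32 for its 4-bit types)

def quantize_blockwise(weights):
    # Symmetric 4-bit-style block quantization: one float scale per block of QK weights.
    blocks = weights.reshape(-1, QK)
    scales = np.abs(blocks).max(axis=1) / 7.0   # map the max magnitude onto the int range [-7, 7]
    scales[scales == 0] = 1.0                   # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales[:, None]), -7, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales[:, None]).reshape(-1)

# quick round-trip check
w = np.random.randn(4 * QK).astype(np.float32)
q, s = quantize_blockwise(w)
print("max abs error:", np.abs(dequantize_blockwise(q, s) - w).max())
```

The round-trip check at the end shows the kind of per-weight error such a scheme introduces; the accuracy work is largely about picking better per-block scales.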
Low-prio
-
My understanding based on the linked discussion is that there will be some "partial" models available, where only some of the tensors are provided. How do we get such models? Are they out yet?
FYI: Outside of LoRA models, Vicuna is also distributed by its creators as a diff between llama-13b and their fine-tuned weights: https://huggingface.co/lmsys/vicuna-13b-delta-v0
See README as well:
Vicuna Weights
We release Vicuna weights as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the Vicuna weights. Instructions:
Get the original LLaMA weights in the huggingface format by following the instructions here.
Use the following scripts to get Vicuna weights by applying our delta. It will automatically download delta weights from our Hugging Face account.
NOTE: Our released weights are only compatible with the latest main branch of huggingface/transformers. We install the correct version of transformers when fastchat is installed.
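For what it's worth, applying such a delta is conceptually just an element-wise addition per tensor, and a LoRA adapter is the low-rank special case of the same idea. A rough sketch (the names here are made up; the real conversion scripts walk the huggingface checkpoints):

```python
import numpy as np

def apply_delta(base_weights, delta_weights):
    # base_weights, delta_weights: dicts mapping tensor name -> numpy array of the same shape
    return {name: base_weights[name] + delta_weights[name] for name in base_weights}

def apply_lora(weight, lora_a, lora_b, scale=1.0):
    # LoRA is the low-rank special case: the update is scale * (B @ A) with a small inner
    # dimension r, instead of shipping a full-size delta for every tensor
    return weight + scale * (lora_b @ lora_a)
```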
-
GG->GOAT.
-
For now, I think it is best to implement basic inference examples in the ggml repo, similar to GPT-2, GPT-J, Cerebras-GPT. There is no need for dedicated repos like llama.cpp, unless a very cool new model appears (edit: I think it just appeared - SAM)
I read your tweet yesterday @ggerganov, and it suddenly crossed my mind to wonder whether image segmentation is also part of your roadmap. Turns out it's your high-prio now. I've been wondering since the first day you published whisper.cpp that it would definitely be amazing to combine ggml with an image inference model. I have a project that's not specifically ML, and I don't have much experience with ML, but image segmentation is its core functionality. If you decide to implement SAM, I think it will definitely be useful for my little project. How can I keep in touch with updates? Thanks
-
I am very interested in implementing SAM inference with ggml. Had a quick look at the code and I think it is definitely possible and wouldn't be too much work. Curious if 4-bit quantization of the Encoder would work. I just need to find some time to focus and when I have it running, will post an update on Twitter. Hopefully in the following days
-
This week I will try to implement SAM inference using ggml. Yesterday, I started working on it here: ggml-org/ggml#74
There are a couple of things not clear yet:
- what is the difference between global and non-global attention in the Encoder (see the sketch below)
- how to implement the positional encoding
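As far as I can tell from the reference PyTorch code, the "non-global" blocks run self-attention only within fixed-size windows of the patch grid, while the few global blocks attend over the whole grid. A rough numpy sketch of the window partitioning step (my own naming, purely illustrative, not the ggml API):

```python
import numpy as np

def window_partition(x, window_size):
    # x: (H, W, C) patch-embedded feature map; returns (num_windows, window_size, window_size, C)
    H, W, C = x.shape
    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    x = np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)))        # pad so the grid divides evenly
    Hp, Wp = x.shape[:2]
    x = x.reshape(Hp // window_size, window_size, Wp // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

# e.g. a 64x64 grid of patches with 14x14 windows -> 25 padded windows of shape (14, 14, C)
print(window_partition(np.zeros((64, 64, 256), dtype=np.float32), 14).shape)
```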
I hope I will be able to figure these out and have efficient inference available soon. I think that using Accelerate BLAS for the Encoder will provide a significant performance boost, similar to what we had in whisper.cpp.
Anyway, in the meantime, I will not be paying too much attention to llama.cpp in order to focus on SAM.
Hope that the collaborators will help out with the new PRs
Edit: SAM will have to wait until we improve the quantization. It seems much more important at the moment
-
I'm not sure if this is something that you find interesting, but another piece of the ecosystem that I'd personally love to see is "bark.cpp" (https://github.com/suno-ai/bark) -- inference on CPU is very slow with the distributed weights and inference code.
-
#370 (comment) to allow parallel text generation sessions with a single model
Does that mean that the context can be serialized to disk (e.g. after the initial prompt evaluation) and deserialized later on?
If not, could you please point me to the structures responsible for storing the hidden state, so I can implement initial prompt caching? It would be useful for parameter tuning.
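To illustrate what I mean (purely conceptual numpy, not the actual llama.cpp structures): after the prompt has been evaluated, the accumulated per-layer key/value tensors are the state that would need to be saved and restored, roughly like this:

```python
import numpy as np

def save_kv_cache(kv_cache, path):
    # kv_cache: dict mapping layer index -> (keys, values) numpy arrays
    # accumulated while evaluating the initial prompt
    arrays = {}
    for i, (k, v) in kv_cache.items():
        arrays[f"k{i}"] = k
        arrays[f"v{i}"] = v
    np.savez(path, **arrays)

def load_kv_cache(path, n_layers):
    data = np.load(path)
    return {i: (data[f"k{i}"], data[f"v{i}"]) for i in range(n_layers)}
```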
-
Oh wait!
https://github.com/ggerganov/llama.cpp/pull/685/files
It's actually kv_cache!
-
Please turn llama.cpp into an HTTP service, as it is hard to interface with its raw power without going through the Python bindings, which are very slow.
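In the meantime, a minimal sketch of such a service could just wrap the compiled main binary with the Python standard library. The binary path, model path, and port below are assumptions, and only a few basic flags are passed:

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

LLAMA_BIN = "./main"                        # assumed path to the compiled llama.cpp binary
MODEL = "models/7B/ggml-model-q4_0.bin"     # assumed model path

class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # expects a JSON body like {"prompt": "...", "n_predict": 128}
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        out = subprocess.run(
            [LLAMA_BIN, "-m", MODEL,
             "-p", body["prompt"],
             "-n", str(body.get("n_predict", 128))],
            capture_output=True, text=True,
        )
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"text": out.stdout}).encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), CompletionHandler).serve_forever()
```

This spawns one process per request, so it is only a stopgap; a real server would keep the model loaded and need something like llama_state for concurrent sessions.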
-
Yeah, I think the option to run a web API would make integration into other things much easier, similar to how it is with automatic1111's webui.
-
I have a simple version. check here: https://github.com/howard0su/llama.cpp/blob/web/examples/web/worker.cpp
I implemented the web interface in order to use the Vicuna web frontend. If there is more interest, I can polish it.
I didn't finish it, as I would like to have llama_state first so that the web server can support multiple users.
-
You can also check out https://github.com/go-skynet/llama-cli; it supports multiple models and has an OpenAI-compatible API.
-
Hi, I found that llama_state was edited out of the roadmap.
Is it not in the plan anymore? I am interested because I am trying to build an API service.
Thanks.
-
I'm confused about the relationship between langchain and the LLM. It seems to me that langchain is just talking to the model and would be an easy fit for llama.cpp. The agents part could be done at a later date; I'm more interested in prepping the model from a document before doing the lookup (if I'm understanding how it works correctly).
-
@iplayfast
I don't know if this is of any use to you, but we subclassed the LLM class from langchain to be able to use this within a langchain pipeline (see https://python.langchain.com/en/latest/modules/models/llms/examples/custom_llm.html for reference).
This is rudimentary code and has a lot of room for improvement, but it currently works for us.
Basically, you need a compiled version of llama.cpp, and it gets called from Python via a subprocess routine. The model and executable paths are read from a .env file.
```python
import subprocess

from langchain.llms.base import LLM
from dotenv import dotenv_values

config = dotenv_values(".env")


class LLamaLLM(LLM):
    llamaExecutablePath = config['LLAMA_EXECUTABLE_PATH']
    modelPath = config['MODEL_PATH']
    threads = '8'
    temp = '0.2'
    topK = '10000'
    topP = '0.95'
    repeatLastN = '64'
    repeatPenalty = '1.3'
    promptSize = '4096'
    nPredict = '-1'

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop) -> str:
        print(prompt)
        args = [
            self.llamaExecutablePath,
            '-m', self.modelPath,
            '-t', self.threads,
            '--temp', self.temp,
            '--top_k', self.topK,
            '--top_p', self.topP,
            '--repeat_last_n', self.repeatLastN,
            '--repeat_penalty', self.repeatPenalty,
            '--ctx_size', self.promptSize,
            '--n_predict', self.nPredict,
            '-p', prompt,
        ]
        process = subprocess.Popen(args, stdout=subprocess.PIPE, universal_newlines=True)
        result = ''
        while True:
            return_code = process.poll()
            if return_code is not None:
                print(return_code)
                # Process has finished, read rest of the output
                for output in process.stdout.readlines():
                    result += output
                break
        return result
```
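Usage then looks something like this (assuming the .env provides valid LLAMA_EXECUTABLE_PATH and MODEL_PATH entries and the model matches your llama.cpp build):

```python
llm = LLamaLLM()
print(llm("Q: What does LoRA stand for? A:"))
```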
-
Thanks, just noticed this now. I'll try it out.