Getting Started with Self-Hosting Checkpoints
After you have fine-tuned your model, you can download the checkpoints from the studio. Once you have downloaded and unzipped the checkpoints, you can load them with any of the inference engines below.
Inference Engines
You can load the checkpoints and run inference with Hugging Face transformers, vLLM, or Ollama.
Hugging Face
You can load the checkpoints with the Hugging Face transformers library and run inference as follows.
Full Fine-Tuning
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = 'path/to/your/finetuned/model/checkpoint'

# load the fine-tuned checkpoint and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

SYSTEM_PROMPT = """You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."""
USER_PROMPT = """Title: Lemon Drizzle Cake
Ingredients: ["200g unsalted butter", "200g caster sugar", "4 eggs", "200g self-raising flour", "1 tsp baking powder", "zest of 1 lemon", "100ml lemon juice", "150g icing sugar"]
Generic ingredients:"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# format and tokenize the chat prompt
inputs = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, return_dict=True, return_tensors="pt")
inputs = inputs.to(model.device)

# generate and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=1000, use_cache=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
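LoRA Fine-Tuning
If you fine-tuned with LoRA, the downloaded checkpoint is typically an adapter that is applied on top of the base model rather than a full set of weights. The snippet below is a minimal sketch assuming the checkpoint is a PEFT-style adapter directory; base_model_path and adapter_path are placeholder paths you should replace with your own.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# placeholder paths (assumption): base_model_path is the model you fine-tuned from,
# adapter_path is the downloaded LoRA adapter directory
base_model_path = 'path/to/base/model'
adapter_path = 'path/to/your/finetuned/lora/checkpoint'

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
base_model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16, device_map="auto")

# apply the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_path)
# optionally merge the adapter into the base weights for faster inference:
# model = model.merge_and_unload()

# the rest of the inference code is the same as in the full fine-tuning example above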
vLLM
You can also load the checkpoints with the vLLM library and run inference as follows.
Full Fine-Tuning
from vllm import LLM, SamplingParams

model_path = 'path/to/your/finetuned/model/checkpoint'

# load the fine-tuned checkpoint with vLLM's offline inference engine
llm = LLM(model=model_path, tokenizer=model_path)

SYSTEM_PROMPT = """You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."""
USER_PROMPT = """Title: Lemon Drizzle Cake
Ingredients: ["200g unsalted butter", "200g caster sugar", "4 eggs", "200g self-raising flour", "1 tsp baking powder", "zest of 1 lemon", "100ml lemon juice", "150g icing sugar"]
Generic ingredients:"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# chat applies the model's chat template before generating
outputs = llm.chat(
    messages=conversation,
    sampling_params=SamplingParams(temperature=0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
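LoRA Fine-Tuning
If the checkpoint is a LoRA adapter, vLLM can apply it on top of the base model at request time via a LoRARequest. This is a minimal sketch, assuming the adapter is in a format vLLM can load; base_model_path, adapter_path, and the adapter name "finetuned-adapter" are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# placeholder paths (assumption): base_model_path is the model you fine-tuned from,
# adapter_path is the downloaded LoRA adapter directory
base_model_path = 'path/to/base/model'
adapter_path = 'path/to/your/finetuned/lora/checkpoint'

# enable LoRA support in the engine; you may also need to set max_lora_rank
# to match the rank of your adapter
llm = LLM(model=base_model_path, enable_lora=True)

# conversation is the same list of system/user messages as in the example above
outputs = llm.chat(
    messages=conversation,
    sampling_params=SamplingParams(temperature=0, max_tokens=256),
    lora_request=LoRARequest("finetuned-adapter", 1, adapter_path),
)
print(outputs[0].outputs[0].text)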
Ollama
You can also run the checkpoints locally with Ollama.
To create a model, first write a Modelfile that points at the checkpoint:
Full Fine-Tuning
FROM path/to/finetuned/model/checkpoint
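LoRA Fine-Tuning
For a LoRA checkpoint, Ollama's Modelfile has an ADAPTER instruction that applies an adapter on top of a base model. The lines below are a sketch: the FROM value and the adapter path are placeholders, and whether the adapter loads directly depends on the adapter formats your Ollama version supports.
# base model the adapter was trained from (placeholder)
FROM base-model
# path to the downloaded LoRA adapter (placeholder)
ADAPTER path/to/finetuned/lora/checkpoint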
Then run the following command to create the model.
ollama create my-model -f Modelfile
Then you can use the model for inference:
from ollama import Client

# connect to a locally running Ollama server
client = Client(host='http://localhost:11434')

SYSTEM_PROMPT = """You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."""
USER_PROMPT = """Title: Lemon Drizzle Cake
Ingredients: ["200g unsalted butter", "200g caster sugar", "4 eggs", "200g self-raising flour", "1 tsp baking powder", "zest of 1 lemon", "100ml lemon juice", "150g icing sugar"]
Generic ingredients:"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

res = client.chat(model='my-model', messages=conversation)
print(res.message.content)