Getting Started with Self-Hosting Checkpoints

After you have fine-tuned your model, you can download the checkpoints from the studio. Unzip the download, then load the extracted checkpoint with any of the inference engines described below.
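
As a minimal sketch, assuming the download is a standard .zip archive (the archive name below is hypothetical), you can extract it with Python's standard library:

import zipfile

# Hypothetical archive name; replace with the file you downloaded from the studio.
archive_path = "finetuned-model-checkpoint.zip"
extract_dir = "path/to/your/finetuned/model/checkpoint"

with zipfile.ZipFile(archive_path) as archive:
    archive.extractall(extract_dir)

The extracted directory is the model path used in the examples below.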


Inference Engines

The following inference engines can all load the extracted checkpoint: Hugging Face transformers, vLLM, and Ollama.

Hugging Face

You can load the checkpoint and run inference with the Hugging Face transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch


model_path = 'path/to/your/finetuned/model/checkpoint'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")

SYSTEM_PROMPT = """You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."""

USER_PROMPT = """Title: Lemon Drizzle Cake

Ingredients: ["200g unsalted butter", "200g caster sugar", "4 eggs", "200g self-raising flour", "1 tsp baking powder", "zest of 1 lemon", "100ml lemon juice", "150g icing sugar"]

Generic ingredients:"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# apply the chat template and tokenize the conversation
inputs = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, return_dict=True, return_tensors="pt")

# move the tokenized inputs onto the same device as the model
inputs = inputs.to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1000, use_cache=False)
# decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
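
Note that device_map="auto" requires the accelerate package to be installed, and torch_dtype=torch.bfloat16 assumes hardware with bfloat16 support; on older GPUs you can pass torch.float16 instead.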

vLLM

You can also load the checkpoint and run inference with vLLM:

from vllm import LLM, SamplingParams

model_path = 'path/to/your/finetuned/model/checkpoint'
llm = LLM(model=model_path, tokenizer=model_path)

SYSTEM_PROMPT = """You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."""

USER_PROMPT = """Title: Lemon Drizzle Cake

Ingredients: ["200g unsalted butter", "200g caster sugar", "4 eggs", "200g self-raising flour", "1 tsp baking powder", "zest of 1 lemon", "100ml lemon juice", "150g icing sugar"]

Generic ingredients:"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

# llm.chat applies the model's chat template; temperature=0 selects greedy (deterministic) decoding
outputs = llm.chat(
    messages=conversation,
    sampling_params=SamplingParams(temperature=0, max_tokens=256),
)
print(outputs[0].outputs[0].text)

Ollama

You can also serve the checkpoint locally with Ollama and query it from the ollama Python client.

First, create a Modelfile that points at the extracted checkpoint:

FROM path/to/finetuned/model/checkpoint

Then run the following command to create the model.

ollama create my-model -f Modelfile

You can then run inference through the Python client:

from ollama import Client

client = Client(host='http://localhost:11434')  # Ollama's default local endpoint

SYSTEM_PROMPT = """You are a helpful recipe assistant. You are to extract the generic ingredients from each of the recipes provided."""

USER_PROMPT = """Title: Lemon Drizzle Cake

Ingredients: ["200g unsalted butter", "200g caster sugar", "4 eggs", "200g self-raising flour", "1 tsp baking powder", "zest of 1 lemon", "100ml lemon juice", "150g icing sugar"]

Generic ingredients:"""

conversation = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]
res = client.chat(model='my-model', messages=conversation)
print(res.message.content)