Welcome to the Prem AI cookbook section. In this recipe, we are going to implement a custom Retrieval Augmented Generation (RAG) pipeline that can search through ML-related arXiv papers and answer questions about them. We are going to use PremAI, Qdrant and DSPy for this recipe.

If you are new to Qdrant, it is an open-source vector database and similarity search engine. You can also host Qdrant locally.

If you are not familiar with DSPy, check out our introductory recipe on using DSPy. We have covered many introductory concepts there. You can also check out DSPy documentation for more information.

To give a nice visualization, we use Streamlit, and here is how the final app would look:

So without further ado, let’s get started. You can find the full source code here.

Objective

This recipe aims to show developers and users how to get started with Prem’s Generative AI Platform and build different use cases around it. We will build a simple RAG pipeline using the tools mentioned above to search through relevant ML-related papers on arXiv and answer user questions, citing the papers the answers are drawn from. At a high level, here are the steps:

  1. Download a sample dataset from HuggingFace for our experiment. We will use ML-ArXiv-Papers, which contains a vast subset of Machine Learning papers. This dataset includes the title of the paper and the abstract.

  2. Once downloaded, we do some preprocessing, which includes converting the data into the proper format and splitting the dataset into smaller batches.

  3. We get the embeddings using Prem Embeddings and initialize a Qdrant Collection to store those embeddings and their corresponding data.

  4. After this, we connect the Qdrant collection with DSPy and build a simple RAG Module.

  5. Finally, we test this with some sample questions.

Sounds interesting, right? Let’s start by installing and importing all the essential packages.

Setting up the project

Let’s start by creating a virtual environment and installing dependencies.

python3 -m venv .venv
source .venv/bin/activate

Next, we need to install some dependencies. You can check out all the dependencies in the requirements.txt file. To run the Qdrant engine, you need to have Docker installed. You can pull and run Qdrant’s official Docker image using the following command:

docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

Where:

  • REST API runs at: localhost:6333
  • Web UI runs at: localhost:6333/dashboard
  • gRPC API runs at: localhost:6334
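
Before moving on, it can help to confirm that the container is actually listening on those ports. The following is a minimal sketch using only the Python standard library; the helper name `is_port_open` is hypothetical and not part of Qdrant or this recipe's source code. It only checks that a TCP connection can be opened, not that the service behind the port is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the docker command above.
    for name, port in [("REST API / Web UI", 6333), ("gRPC API", 6334)]:
        status = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {status}")
```

If a port is reported as not reachable, check that the Docker container is still running and that the `-p` port mappings match.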

Once all the dependencies are installed, we import the following packages:

Python
import os
from tqdm.auto import tqdm
from typing import List, Union
from datasets import load_dataset

All the Qdrant-related imports:

Python
from qdrant_client import models
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from qdrant_client.models import PointStruct

All DSPy-PremAI and DSPy-Qdrant related imports:

Python
import dspy
from dspy import PremAI
from dspy.retrieve.qdrant_rm import QdrantRM
from dsp.modules.sentence_vectorizer import PremAIVectorizer

We define some constants: the PremAI project ID, the embedding model we are going to use, the name of the Hugging Face dataset, the name of the Qdrant collection (which can be any name you like), and the URL of the Qdrant server through which we access the DB.

Python
PROJECT_ID           = 1234
EMBEDDING_MODEL_NAME = "mistral-embed"
COLLECTION_NAME      = "arxiv-ml-papers-collection"
QDRANT_SERVER_URL    = "http://localhost:6333"
DATASET_NAME         = "CShorten/ML-ArXiv-Papers"

The project ID used here is a dummy ID. Make sure you have an account on the Prem AI Platform, a valid project ID, and an API key. Additionally, you need to have at least one repository ID as a last requirement.
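
The embedding code later in this recipe assumes that the API key is available in the `PREMAI_API_KEY` environment variable. As a minimal sketch, a small guard like the one below can fail early with a clear message when the variable is missing. The helper name `require_env` is hypothetical and not part of the Prem SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return value


if __name__ == "__main__":
    try:
        require_env("PREMAI_API_KEY")
        print("PREMAI_API_KEY is set")
    except RuntimeError as err:
        print(err)
```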

Loading dataset from HF and preprocessing it

In our first step, we download the dataset. The dataset has a title column and an abstract column, holding the title and abstract of each paper. We are going to fetch those columns. For the sake of this tutorial, we also take a smaller subset (say, 1000 rows) and convert it into a list of dictionaries in the following format:

JSON
[
    {"title":"title-of-paper", "abstract":"abstract-of-paper"}
]

Later, we will write a simple function that uses the Prem Vectorizer from DSPy to convert a text or a list of texts into embeddings. The Prem Vectorizer internally uses the Prem SDK to extract embeddings from text and is compatible with the DSPy ecosystem. First, let’s load the dataset:

Python
dataset = load_dataset(DATASET_NAME)["train"]
dataset

Output

>>> Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'title', 'abstract'],
    num_rows: 117592
})

As we can see, the features include two columns whose names start with “Unnamed”. We drop them by selecting only the title and abstract columns, and we also take a subset of the rows (1000 in our case). Finally, we convert the result into a dict.

Python
dataset_dict = (
    dataset.select(range(1000))
           .select_columns(["title", "abstract"])
           .to_dict()
)

Right now, this dict is not in the list-of-dictionaries format shown above. It is in this format:

{
    "title": ["title-paper-1", "title-paper-2", "..."],
    "abstract": ["abstract-paper-1", "abstract-paper-2", "..."]
}

So we need to convert it to the format we want, which makes it easier to get the embeddings and insert them into Qdrant.

Python
import json

# Turn the column-oriented dict into a list of {"title", "abstract"} records
dataset = [
    {"title": title, "abstract": abstract}
    for title, abstract in zip(dataset_dict["title"], dataset_dict["abstract"])
]

print(json.dumps(dataset[0], indent=4))
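
To see the conversion in isolation, here is a small sketch on toy data. The helper `columns_to_records` is a hypothetical, generic version of the list comprehension above; it additionally checks that all columns have the same length, which `zip` alone would silently ignore.

```python
from typing import Dict, List


def columns_to_records(columns: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Convert {"col": [v1, v2, ...]} into [{"col": v1}, {"col": v2}, ...]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"Columns have different lengths: {sorted(lengths)}")
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


toy_columns = {
    "title": ["title-paper-1", "title-paper-2"],
    "abstract": ["abstract-paper-1", "abstract-paper-2"],
}
records = columns_to_records(toy_columns)
print(records[0])
```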

Creating embeddings of the dataset

We write a simple function to get embeddings from text. We initialize the PremAI vectorizer and then use it to compute the embeddings. By default, the PremAI vectorizer returns a numpy.ndarray, which we convert into a list of lists, since that is easier to upload to Qdrant.

Python
# We assume you have PREMAI_API_KEY set as an environment variable.

premai_vectorizer = PremAIVectorizer(
    project_id=PROJECT_ID, model_name=EMBEDDING_MODEL_NAME
)

def get_embeddings(
    premai_vectorizer: PremAIVectorizer,
    documents: Union[str, List[str]]
):
    """Gets embeddings using Prem Embeddings"""
    # Wrap a single string in a list so the vectorizer always receives a batch
    documents = [documents] if isinstance(documents, str) else documents
    embeddings = premai_vectorizer(documents)
    return embeddings.tolist()
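
Because the Qdrant collection is later created with a fixed vector size (1024 for mistral-embed), it can be useful to check the dimensionality of the embeddings before uploading them. The following is a minimal sketch; `check_embedding_size` is a hypothetical helper, and the vectors used here are placeholders rather than real Prem embeddings.

```python
from typing import Sequence


def check_embedding_size(
    embeddings: Sequence[Sequence[float]], expected_size: int
) -> None:
    """Raise ValueError if any embedding does not have `expected_size` dimensions."""
    for i, vector in enumerate(embeddings):
        if len(vector) != expected_size:
            raise ValueError(
                f"Embedding {i} has {len(vector)} dimensions, expected {expected_size}."
            )


# Placeholder vectors standing in for the output of get_embeddings(...)
fake_embeddings = [[0.0] * 1024, [0.1] * 1024]
check_embedding_size(fake_embeddings, expected_size=1024)
print("All embeddings have the expected size")
```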

Uploading mini-batches of embeddings to Qdrant

Qdrant sometimes raises a request timeout error when the number of embeddings to upload is large. To prevent this, we are going to do the following:

  1. Create mini-batches of the dataset

  2. Get the embeddings for all the abstracts in that mini-batch

  3. Iterate over the docs and their corresponding embeddings, and create Qdrant Points. In short, a Point is the central entity in Qdrant: a record consisting of a vector and an optional payload, on which Qdrant can perform all sorts of operations.

  4. Finally, upload the points to our Qdrant collection. A collection is a named set of points (vectors with a payload) among which we can do operations like search.

But before doing the steps above, we need to initialize the Qdrant client and create a collection. Since we use mistral-embed, the embedding size is 1024. This varies with the embedding model.

Python
qdrant_client = QdrantClient(url=QDRANT_SERVER_URL)
embedding_size = 1024 

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(
        size=embedding_size,
        distance=models.Distance.COSINE,
    ),
)

# make a simple function to create mini batches

def make_mini_batches(lst, batch_size):
    return [lst[i:i + batch_size] for i in range(0, len(lst), batch_size)]

# Iterate over the batches, get the embeddings and upload them

batch_size=8
document_batches = make_mini_batches(dataset, batch_size=batch_size)
start_idx = 0


for batch in tqdm(document_batches, total=len(document_batches)):    
    points = []
    docs_to_pass = [b["abstract"] for b in batch]
    embeddings = get_embeddings(premai_vectorizer, documents=docs_to_pass)
    for idx, (document, embedding) in enumerate(zip(batch, embeddings)):
        points.append(
            models.PointStruct(id=idx+start_idx, vector=embedding, payload=document)
        )
    qdrant_client.upload_points(collection_name=COLLECTION_NAME, points=points)
    start_idx += batch_size
print("All Uploaded")
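
The loop above relies on every point getting a unique ID across batches. The sketch below reproduces only the batching and ID bookkeeping on toy data, without calling Prem or Qdrant, so the numbering can be checked in isolation. The generator `assign_point_ids` is a hypothetical helper; it advances the offset by the actual batch length, which yields the same IDs as adding `batch_size` when only the last batch is shorter.

```python
from typing import Dict, Iterator, List, Tuple


def make_mini_batches(lst: List[Dict], batch_size: int) -> List[List[Dict]]:
    """Split a list into consecutive chunks of at most `batch_size` items."""
    return [lst[i:i + batch_size] for i in range(0, len(lst), batch_size)]


def assign_point_ids(batches: List[List[Dict]]) -> Iterator[Tuple[int, Dict]]:
    """Yield (point_id, document) pairs with IDs running 0..N-1 across all batches."""
    start_idx = 0
    for batch in batches:
        for idx, document in enumerate(batch):
            yield idx + start_idx, document
        start_idx += len(batch)


docs = [{"title": f"paper-{i}"} for i in range(19)]
batches = make_mini_batches(docs, batch_size=8)
ids = [point_id for point_id, _ in assign_point_ids(batches)]

print([len(batch) for batch in batches])  # [8, 8, 3]
print(ids == list(range(19)))  # True
```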

Congratulations if you have made it this far. In the later part of this tutorial, we will use this collection with DSPy and PremAI LLMs to create a simple RAG module. If you are unfamiliar with DSPy, check out our introductory tutorial on DSPy.

Building our RAG pipeline using DSPy, PremAI and Qdrant

We are going to start by initializing our DSPy-PremAI object as our LLM and using DSPy-Qdrant as our retriever. This retriever does all the heavy lifting of doing a nearest neighbour search for us and returns the top-k matched documents, which we will pass as our context to our LLM to answer our question.

Python
PROJECT_ID = 1234
EMBEDDING_MODEL = "mistral-embed"
COLLECTION_NAME = "arxiv-ml-papers-collection"
QDRANT_SERVER_URL = "http://localhost:6333"

# PremAI LLM; a low temperature keeps the answers more deterministic
model = PremAI(project_id=PROJECT_ID, temperature=0.1, max_tokens=1000)

# Qdrant retriever that embeds queries with the same model used for the documents
qdrant_client = QdrantClient(url=QDRANT_SERVER_URL)
qdrant_retriever_model = QdrantRM(
    COLLECTION_NAME, qdrant_client, k=3,
    vectorizer=PremAIVectorizer(project_id=PROJECT_ID, model_name=EMBEDDING_MODEL),
    document_field="abstract"
)

dspy.settings.configure(lm=model, rm=qdrant_retriever_model)
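
To make the idea of a nearest neighbour search concrete, here is a brute-force sketch of top-k retrieval by cosine similarity on toy 2-dimensional vectors. This is only an illustration of the concept: it is not how Qdrant or QdrantRM is implemented (Qdrant uses an index rather than scanning every vector), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs of the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
hits = top_k([1.0, 0.1], vectors, k=2)
print([index for index, _ in hits])  # [0, 2]
```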

Before moving forward, let’s do a quick sanity check to see whether our retriever returns relevant results.

Python
retrieve = dspy.Retrieve(k=3)
question = "Principal Component Analysis"
topK_passages = retrieve(question).passages

print(f"Top {retrieve.k} passages for question: {question}\n")
print(topK_passages)

The retriever appears to return relevant results. Now let’s build our simple RAG pipeline using DSPy.

Define a DSPy Signature and the RAG Module

The first building block of our RAG pipeline is a DSPy Signature. In short, a signature describes the input and output fields without making you write big and messy prompts. You can also think of it as a prompt blueprint. Once you have created this blueprint, DSPy can optimize the resulting prompt during its optimization step (not covered in this recipe).

In our case, we should have the following parameters:

  1. context: This will be an InputField which will contain all the retrieved passages.
  2. question: This will be another InputField which will contain the user query.
  3. answer: This will be the OutputField which contains the answer generated by the LLM.

Python
class GenerateAnswer(dspy.Signature):
    """Think and Answer questions based on the context provided."""
    context = dspy.InputField(desc="May contain relevant facts about user query")
    question = dspy.InputField(desc="User query")
    answer = dspy.OutputField(desc="Answer in one or two lines")
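
To illustrate the "prompt blueprint" idea, the sketch below renders a set of named input and output fields into a plain-text prompt. This is a simplified, hypothetical stand-in written for this explanation: the `Field` class and `render_prompt` function are not DSPy APIs, and the layout is not the prompt format DSPy actually produces.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Field:
    """A named field with a short description (hypothetical stand-in for DSPy fields)."""
    name: str
    desc: str


def render_prompt(
    instructions: str,
    inputs: List[Field],
    outputs: List[Field],
    values: Dict[str, str],
) -> str:
    """Build a prompt: instructions, filled-in input fields, then empty output fields."""
    lines = [instructions, ""]
    for field in inputs:
        lines.append(f"{field.name.capitalize()} ({field.desc}): {values[field.name]}")
    for field in outputs:
        lines.append(f"{field.name.capitalize()} ({field.desc}):")
    return "\n".join(lines)


prompt = render_prompt(
    "Think and Answer questions based on the context provided.",
    inputs=[
        Field("context", "May contain relevant facts about user query"),
        Field("question", "User query"),
    ],
    outputs=[Field("answer", "Answer in one or two lines")],
    values={
        "context": "PCA projects data onto directions of maximal variance.",
        "question": "What does PCA do?",
    },
)
print(prompt)
```

The point of the sketch is only that the field names and descriptions carry enough structure to generate a prompt, so you describe what goes in and what comes out instead of hand-writing the prompt text.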

After this, we define the overall RAG pipeline inside a single class, called a Module in DSPy. Generally, Modules in DSPy represent:

  1. Ways of running prompting techniques like Chain of Thought or ReAct. We are going to use ReAct in our case.
  2. A workflow made up of multiple composable steps.
  3. Building blocks that can be attached or chained together to form a single module. This gives us better modularity and helps us write cleaner LLM orchestration pipelines.

Now, let’s implement our RAG module.

Python
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Retriever configured through dspy.settings (the Qdrant retriever above)
        self.retrieve = dspy.Retrieve()
        # ReAct module bound to our signature
        self.generate_answer = dspy.ReAct(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

As you can see in the code above, we first define our retriever and then bind our signature to the ReAct module, which takes this blueprint and generates a better prompt containing the same input and output fields that we defined in the base signature. In the forward step (i.e., when we call the RAG module object), we first retrieve the contexts from the retriever and then use them to generate the answer from our signature. Finally, we return a prediction containing both the context and the answer, so that we can see which abstracts were retrieved.
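
The retrieve-then-generate flow of the forward step can be sketched without DSPy, Qdrant or an LLM. In the example below, `fake_retrieve` and `fake_generate` are hypothetical stand-ins: the first ranks a tiny in-memory corpus by word overlap, and the second just echoes the top passage. Only the control flow mirrors the module above.

```python
from typing import Callable, Dict, List


def answer_with_context(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[List[str], str], str],
    k: int = 3,
) -> Dict[str, object]:
    """Retrieve k passages, generate an answer from them, and return both."""
    context = retrieve(question, k)
    answer = generate(context, question)
    return {"context": context, "answer": answer}


CORPUS = [
    "PCA reduces dimensionality.",
    "Graphs model relations.",
    "Manifolds are locally Euclidean.",
]


def fake_retrieve(question: str, k: int) -> List[str]:
    """Stand-in retriever: rank passages by the number of words shared with the question."""
    words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(words & set(doc.lower().rstrip(".").split())),
        reverse=True,
    )
    return ranked[:k]


def fake_generate(context: List[str], question: str) -> str:
    """Stand-in LLM: echo the best passage instead of generating text."""
    if not context:
        return "No context retrieved."
    return f"Based on {len(context)} passage(s): {context[0]}"


result = answer_with_context("what are manifolds", fake_retrieve, fake_generate, k=2)
print(result["answer"])
print(result["context"])
```

Returning the context alongside the answer is what lets you inspect which passages the answer was based on.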

Testing our DSPy pipeline with an example prompt

We are almost there. As our final step, let’s test our pipeline with a sample query.

Python
query = "What are some latest research done on manifolds and graphs"
rag_pipeline = RAG()
prediction = rag_pipeline(query)

print("LLM's answer:")
print(prediction.answer)
print("----------------")

print("Contexts retrieved and inserted to LLM:")
print(prediction.context)

You can also return more metadata, such as the paper title or paper link. This metadata would not be passed as context; it serves as a reference so that users can find the relevant papers.
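
One simple way to attach such metadata, sketched below under the assumption that the original title/abstract records are still available in memory, is to look up each retrieved abstract in a dictionary keyed by abstract text. The helper `attach_titles` is hypothetical; in a real app you might instead read the title from the payload stored in Qdrant.

```python
from typing import Dict, List


def attach_titles(
    passages: List[str], records: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Pair each retrieved abstract with its paper title, if the abstract is known."""
    title_by_abstract = {record["abstract"]: record["title"] for record in records}
    return [
        {"title": title_by_abstract.get(passage, "unknown title"), "abstract": passage}
        for passage in passages
    ]


records = [
    {"title": "title-paper-1", "abstract": "abstract-paper-1"},
    {"title": "title-paper-2", "abstract": "abstract-paper-2"},
]
sources = attach_titles(["abstract-paper-2"], records)
print(sources)
```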

Congratulations, now you know how to make a basic RAG pipeline using PremAI, DSPy and Qdrant.

Creating the Streamlit web app (optional)

In this section, we are going to show you how to create a simple Streamlit app as shown above. You can find the full code here.

We will not explain this code in full, since it reuses the boilerplate from our Chat With PDF and Chat with SQL Tables recipes. You can refer to those recipes for an extended explanation of the Streamlit chat boilerplate.

We start by writing the code that builds the overall pipeline. Here is how it looks:

In the code above, we initialize two retrievers. The first is registered in the DSPy settings; it does the actual retrieval and places the results in the LLM’s context. The second retrieves the titles of the papers (for the same contexts) so that we can show them as the returned sources. This means we need to make a slight change to our DSPy module.

We are now all set to write our Streamlit function for chatting with the documents inside the collection. First, however, we write one small function to list all the available collections.

Python
def get_all_collections(client: QdrantClient):
    return [collection.name for collection in client.get_collections().collections] 
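
To try this helper without a running Qdrant server, you can pass it a stub object that mimics the shape of the response the function relies on: an object whose `get_collections()` result has a `collections` list of items with a `name` attribute. The `FakeQdrantClient` class below is a hypothetical test double, not part of qdrant_client, and the helper is repeated here without its type annotation so that the snippet is self-contained.

```python
from types import SimpleNamespace
from typing import List


def get_all_collections(client) -> List[str]:
    """Same logic as above: collect the name of every collection the client reports."""
    return [collection.name for collection in client.get_collections().collections]


class FakeQdrantClient:
    """Test double exposing only the get_collections() method used by the helper."""

    def __init__(self, names: List[str]):
        self._names = names

    def get_collections(self) -> SimpleNamespace:
        return SimpleNamespace(
            collections=[SimpleNamespace(name=name) for name in self._names]
        )


client = FakeQdrantClient(["arxiv-ml-papers-collection", "another-collection"])
print(get_all_collections(client))
```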

Now we build our Streamlit sidebar to select the collection from the available Qdrant collections.

Congratulations if you have made it this far. Check out our other tutorials and our blog for more use cases and content.