Welcome to the Prem AI cookbook section. In this recipe, we are going to implement a custom Retrieval Augmented Generation (RAG) pipeline that can answer questions and search through ML-related arXiv papers. We are going to use PremAI, Qdrant and DSPy for this recipe.
If you are not yet familiar with Qdrant, it is an open-source vector database and similarity search engine. You can also host Qdrant locally.
This recipe aims to show developers and users how to get started with Prem’s Generative AI Platform and build different use cases around it. We will build a simple RAG pipeline using the tools mentioned above to search through relevant ML-related papers on arXiv and answer user questions while citing the papers the answers come from. At a high level, here are the steps:
Download a sample dataset from HuggingFace for our experiment. We will use ML-ArXiv-Papers, which contains a vast subset of Machine Learning papers. This dataset includes the title of the paper and the abstract.
Once downloaded, we do some preprocessing (which includes converting the data into proper formats and converting the dataset into smaller batches)
We get the embeddings using Prem Embeddings and initialize a Qdrant Collection to store those embeddings and their corresponding data.
After this, we connect the Qdrant collection with DSPy and build a simple RAG Module.
Finally, we test this with some sample questions.
Sounds interesting, right? Let’s start by creating a virtual environment and installing the dependencies.
python3 -m venv .venv
source .venv/bin/activate
Up next, we need to install some dependencies. You can check out all the dependencies in the requirements.txt file. To install the Qdrant engine, you need to have Docker installed. You can pull and run Qdrant’s official Docker image using the following command:
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant
Where:
REST API will run in: localhost:6333
Web UI will run in: localhost:6333/dashboard
GRPC API will run in: localhost:6334
Once all the dependencies are installed, we import the following packages
Python
import os
from tqdm.auto import tqdm
from typing import List, Union
from datasets import load_dataset
All the qdrant related imports
Python
from qdrant_client import models
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
from qdrant_client.models import PointStruct
All DSPy-PremAI and DSPy-Qdrant related imports
Python
import dspy
from dspy import PremAI
from dspy.retrieve.qdrant_rm import QdrantRM
from dsp.modules.sentence_vectorizer import PremAIVectorizer
We define some constants, which include the PremAI project ID, the embedding model we are going to use, the name of the Hugging Face dataset, the name of the Qdrant collection (which can be any name you like), and the Qdrant server URL through which we are going to access the DB.
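The constants themselves are not shown in this recipe, so here is a minimal sketch with placeholder values. Only `mistral-embed` and the local Qdrant URL come from the text. The project ID, the `PREMAI_PROJECT_ID` environment variable, the dataset identifier and the collection name are assumptions that you should replace with your own values.

```python
import os

# Assumption: the project ID is read from a hypothetical PREMAI_PROJECT_ID
# environment variable; 1234 is only a placeholder.
PROJECT_ID = int(os.environ.get("PREMAI_PROJECT_ID", "1234"))

# Embedding model used later in this recipe (1024-dimensional vectors).
EMBEDDING_MODEL_NAME = "mistral-embed"

# Assumption: Hugging Face identifier of the ML-ArXiv-Papers dataset.
DATASET_NAME = "CShorten/ML-ArXiv-Papers"

# Any name works here; this one is purely illustrative.
COLLECTION_NAME = "arxiv-research-paper"

# REST endpoint of the local Qdrant container started above.
QDRANT_SERVER_URL = "http://localhost:6333"
```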
In our very first step, we need to download the dataset. It has a title column and an abstract column, which hold the title and abstract of each paper, and these are the columns we are going to fetch. Inside the features, there are also two columns named “Unnamed”, so we remove them first. For the sake of this tutorial, we then take a smaller subset (1000 rows in our case) and convert it into a list of dictionaries.
After this, we are going to write a simple function that uses Prem Vectorizer from DSPy to convert a text or a list of texts into embeddings. Prem Vectorizer internally uses Prem SDK to extract embeddings from text and is compatible with the DSPy ecosystem.
Each row of the converted dataset has the following format:
{"title":"Learning from compressed observations","abstract":" The problem of statistical learning is to construct a predictor of a random\nvariable $Y$ as a function of a related random variable $X$ on the basis of an\ni.i.d. training sample from the joint distribution of $(X,Y)$. Allowable\npredictors are drawn from some specified class, and the goal is to approach\nasymptotically the performance (expected loss) of the best predictor in the\nclass. We consider the setting in which one has perfect observation of the\n$X$-part of the sample, while the $Y$-part has to be communicated at some\nfinite bit rate. The encoding of the $Y$-values is allowed to depend on the\n$X$-values. Under suitable regularity conditions on the admissible predictors,\nthe underlying family of probability distributions and the loss function, we\ngive an information-theoretic characterization of achievable predictor\nperformance in terms of conditional distortion-rate functions. The ideas are\nillustrated on the example of nonparametric regression in Gaussian noise.\n"}
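The download and preprocessing step described above could be sketched as follows. This is a minimal sketch, not the recipe's original code: the helper names `clean_rows` and `load_papers` are hypothetical, and the `train` split name is an assumption about the dataset.

```python
from itertools import islice
from typing import Dict, Iterable, List


def clean_rows(rows: Iterable[Dict], limit: int = 1000) -> List[Dict]:
    """Keep the first `limit` rows and drop every column whose name starts with 'Unnamed'."""
    return [
        {key: value for key, value in row.items() if not key.startswith("Unnamed")}
        for row in islice(rows, limit)
    ]


def load_papers(dataset_name: str, limit: int = 1000) -> List[Dict]:
    """Download the dataset and return cleaned rows (needs network access)."""
    from datasets import load_dataset  # imported lazily; only needed for the download

    # Assumption: the papers live in the "train" split.
    train_split = load_dataset(dataset_name, split="train")
    return clean_rows(train_split, limit=limit)


# Tiny offline check with a made-up row shaped like the real ones.
sample = [{"Unnamed: 0": 0, "title": "T", "abstract": "A"}]
print(clean_rows(sample))
```

With the constants defined earlier, `dataset = load_papers(DATASET_NAME)` would give the list of title/abstract dictionaries that the rest of the recipe iterates over.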
We write a simple function to get embeddings from text. We initialize the PremAI vectorizer and then call it on the documents. By default, the PremAI vectorizer returns a numpy.ndarray, which we convert into a list of lists because that is easier to upload to Qdrant.
Python
# We assume you have PREMAI_API_KEY set as an environment variable.
premai_vectorizer = PremAIVectorizer(
    project_id=PROJECT_ID, model_name=EMBEDDING_MODEL_NAME
)

def get_embeddings(
    premai_vectorizer: PremAIVectorizer,
    documents: Union[str, List[str]]
):
    """Gets embeddings using Prem Embeddings"""
    # Wrap a single string in a list so both input types are handled the same way.
    documents = [documents] if isinstance(documents, str) else documents
    embeddings = premai_vectorizer(documents)
    return embeddings.tolist()
Qdrant sometimes returns a request timed out error when the number of embeddings to upload is huge. So, to prevent this issue, we are going to do the following:
Create mini-batches of the dataset
Get the embeddings for all the abstracts in that mini-batch
Iterate over the docs and their corresponding embeddings and create Qdrant points. In short, a point is the central entity in Qdrant: a vector with an optional payload, on which Qdrant can run operations such as search.
Finally, upload the point to our Qdrant collection. A collection is a structure in Qdrant where we keep a set of points (vectors) among which we can do operations like search.
But before doing all the steps mentioned above, we need to initialize the qdrant client and make a collection. Since we use mistral-embed, the embedding size is 1024. This can vary when using different embedding models.
Python
qdrant_client = QdrantClient(url=QDRANT_SERVER_URL)
embedding_size = 1024

qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=models.VectorParams(
        size=embedding_size,
        distance=models.Distance.COSINE,
    ),
)

# make a simple function to create mini batches
def make_mini_batches(lst, batch_size):
    return [lst[i:i + batch_size] for i in range(0, len(lst), batch_size)]

# Iterate over batches, get embeddings and upload
batch_size = 8
document_batches = make_mini_batches(dataset, batch_size=batch_size)
start_idx = 0

for batch in tqdm(document_batches, total=len(document_batches)):
    points = []
    docs_to_pass = [b["abstract"] for b in batch]
    embeddings = get_embeddings(premai_vectorizer, documents=docs_to_pass)
    for idx, (document, embedding) in enumerate(zip(batch, embeddings)):
        points.append(
            models.PointStruct(id=idx + start_idx, vector=embedding, payload=document)
        )
    qdrant_client.upload_points(collection_name=COLLECTION_NAME, points=points)
    start_idx += batch_size

print("All Uploaded")
Congratulations if you have made it this far. In the later part of this tutorial, we will use this collection with DSPy and PremAI LLMs to create a simple RAG module. If you are unfamiliar with DSPy, check out our introductory tutorial on DSPy.
Building our RAG pipeline using DSPy, PremAI and Qdrant
We are going to start by initializing our DSPy-PremAI object as our LLM and using DSPy-Qdrant as our retriever. This retriever does all the heavy lifting of doing a nearest neighbour search for us and returns the top-k matched documents, which we will pass as our context to our LLM to answer our question.
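The initialization code for this step is not shown in the recipe, so the following is only a sketch of how it could look with the DSPy release line that ships `dspy.PremAI` and `QdrantRM` (the imports used above). The helper name `configure_dspy` is hypothetical. The `document_field="abstract"` and `k=3` values are assumptions based on the payload format and the top-3 output shown in this recipe, and some DSPy releases may also require a `model` argument for `dspy.PremAI`.

```python
def configure_dspy(project_id, collection_name, qdrant_client, vectorizer, k=3):
    """Register a PremAI LLM and a Qdrant retriever as the DSPy defaults.

    Hypothetical helper: the arguments are the objects created earlier in
    this recipe (project ID, collection name, Qdrant client, Prem vectorizer).
    """
    # Imported here so the function can be defined without the packages installed.
    import dspy
    from dspy.retrieve.qdrant_rm import QdrantRM

    # The API key is assumed to be available as PREMAI_API_KEY.
    llm = dspy.PremAI(project_id=project_id)

    retriever = QdrantRM(
        qdrant_collection_name=collection_name,
        qdrant_client=qdrant_client,
        k=k,
        document_field="abstract",  # payload key whose text is returned as a passage
        vectorizer=vectorizer,      # must be the same model used when uploading
    )

    dspy.settings.configure(lm=llm, rm=retriever)
    return llm, retriever


# Example (needs a running Qdrant server and a PremAI API key):
# configure_dspy(PROJECT_ID, COLLECTION_NAME, qdrant_client, premai_vectorizer)
# print(dspy.Retrieve(k=3)("Principal Component Analysis").passages)
```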
Top 3 passages for question: Principal Component Analysis
'[" In many physical, statistical, biological and other investigations it is\ndesirable to approximate a system of points by objects of lower dimension\nand/or complexity. For this purpose, Karl Pearson invented principal component\nanalysis in 1901 and found 'lines and planes of closest fit to system of\npoints'. The famous k-means algorithm solves the approximation problem too, but\nby finite sets instead of lines and planes. This chapter gives a brief\npractical introduction into the methods of construction of general principal\nobjects, i.e. objects embedded in the 'middle' of the multidimensional data\nset. As a basis, the unifying framework of mean squared distance approximation\nof finite datasets is selected. Principal graphs and manifolds are constructed\nas generalisations of principal components and k-means principal points. For\nthis purpose, the family of expectation/maximisation algorithms with nearest\ngeneralisations is presented. Construction of principal graphs with controlled\ncomplexity is based on the graph grammar approach.\n", " In this paper, we study the application of sparse principal component\nanalysis (PCA) to clustering and feature selection problems. Sparse PCA seeks\nsparse factors, or linear combinations of the data variables, explaining a\nmaximum amount of variance in the data while having only a limited number of\nnonzero coefficients. PCA is often used as a simple clustering technique and\nsparse factors allow us here to interpret the clusters in terms of a reduced\nset of variables. We begin with a brief introduction and motivation on sparse\nPCA and detail our implementation of the algorithm in d'Aspremont et al.\n(2005). We then apply these results to some classic clustering and feature\nselection problems arising in biology.\n", ' We present three generalisations of Kernel Principal Components Analysis\n(KPCA) which incorporate knowledge of the class labels of a subset of the data\npoints. The first, MV-KPCA, penalises within class variances similar to Fisher\ndiscriminant analysis. The second, LSKPCA is a hybrid of least squares\nregression and kernel PCA. The final LR-KPCA is an iteratively reweighted\nversion of the previous which achieves a sigmoid loss function on the labeled\npoints. We provide a theoretical risk bound as well as illustrative experiments\non real and toy data sets.\n']'
Seems like we are getting some good relevant answers. Now let’s jump right in to make our simple RAG pipeline using DSPy.
The very first building block of our RAG pipeline is to build a DSPy Signature. In short, a signature explains the input and output fields without making you write big and messy prompts. You can also think of this as a prompt blueprint. Once you have created this blueprint, DSPy internally tries to optimize the prompt during optimization (we will come to that later).
In our case, we should have the following parameters:
context: This will be an InputField which will contain all the retrieved passages.
question: This will be another InputField which will contain user query
answer: This will be the OutputField which contains the answer generated by the LLM.
Python
class GenerateAnswer(dspy.Signature):
    """Think and Answer questions based on the context provided."""

    context = dspy.InputField(desc="May contain relevant facts about user query")
    question = dspy.InputField(desc="User query")
    answer = dspy.OutputField(desc="Answer in one or two lines")
After this, we will define the overall RAG pipeline inside a single class, also called Modules in DSPy. Generally, Modules in DSPy represent:
Ways of running some prompting techniques like Chain of Thought or ReAct. We are going to use Chain of Thought for our case.
Building a workflow, which consists of multiple composable steps.
You can even chain multiple modules to form a single module. This gives us better modularity and keeps LLM orchestration pipelines cleaner to define.
In our RAG module, we first define our retriever and then bind our signature to the ChainOfThought module, which takes this blueprint and generates a better prompt that keeps the same input and output fields we defined in our base signature.
In the forward step (i.e., when we call the RAG module object), we will first retrieve all the contexts from the retriever and then use this context to generate the answer from our signature. After this, we will return the predictions in a good format containing the context and the answer so that we can see what abstracts were retrieved.
We are almost there. As our final step, let’s test our pipeline with a sample question.
Python
query = "What are some latest research done on manifolds and graphs"
rag_pipeline = RAG()
prediction = rag_pipeline(query)

print("LLM's answer:")
print(prediction.answer)
print("----------------")
print("Contexts retrieved and inserted to LLM:")
print(prediction.context)
LLM’s answer:
The recent research on manifolds and graphs includes the analysis of the behavior of graph Laplacians at points near or on the boundary of a manifold, the minimax rate of convergence for estimating a manifold given a noisy sample, and the development of multi-manifold learning algorithms such as M-Isomap and D-C Isomap.
Contexts retrieved and inserted to LLM:
["In manifold learning, algorithms based on graph Laplacians constructed from\ndata have received considerable attention both in practical applications and\ntheoretical analysis. In particular, the convergence of graph Laplacians\nobtained from sampled data to certain continuous operators has become an active\nresearch topic recently. Most of the existing work has been done under the\nassumption that the data is sampled from a manifold without boundary or that\nthe functions of interests are evaluated at a point away from the boundary.\nHowever, the question of boundary behavior is of considerable practical and\ntheoretical interest. In this paper we provide an analysis of the behavior of\ngraph Laplacians at a point near or on the boundary, discuss their convergence\nrates and their implications and provide some numerical results. It turns out\nthat while points near the boundary occupy only a small part of the total\nvolume of a manifold, the behavior of graph Laplacian there has different\nscaling properties from its behavior elsewhere on the manifold, with global\neffects on the whole manifold, an observation with potentially important\nimplications for the general problem of learning on manifolds.\n","We find the minimax rate of convergence in Hausdorff distance for estimating\na manifold M of dimension d embedded in R^D given a noisy sample from the\nmanifold. We assume that the manifold satisfies a smoothness condition and that\nthe noise distribution has compact support. We show that the optimal rate of\nconvergence is n^{-2/(2+d)}. Thus, the minimax rate depends only on the\ndimension of the manifold, not on the dimension of the space in which M is\nembedded.\n","In many physical, statistical, biological and other investigations it is\ndesirable to approximate a system of points by objects of lower dimension\nand/or complexity. 
For this purpose, Karl Pearson invented principal component\nanalysis in 1901 and found 'lines and planes of closest fit to system of\npoints'. The famous k-means algorithm solves the approximation problem too, but\nby finite sets instead of lines and planes. This chapter gives a brief\npractical introduction into the methods of construction of general principal\nobjects, i.e. objects embedded in the 'middle' of the multidimensional data\nset. As a basis, the unifying framework of mean squared distance approximation\nof finite datasets is selected. Principal graphs and manifolds are constructed\nas generalisations of principal components and k-means principal points. For\nthis purpose, the family of expectation/maximisation algorithms with nearest\ngeneralisations is presented. Construction of principal graphs with controlled\ncomplexity is based on the graph grammar approach.\n"]
You can even return more metadata, such as the paper title or paper link. These would not be passed as context to the LLM; they serve as references for the user.
Congratulations, now you know how to make a basic RAG pipeline using PremAI, DSPy and Qdrant.
In this section, we are going to show you how to create a simple Streamlit app. You can find the full code here.
We are not going to explain this code in full, since it reuses the boilerplate from the Chat With PDF and Chat with SQL Tables recipes. You can refer to those recipes for an extended explanation of the Streamlit chat boilerplate.
We start by writing the code that sets up the overall pipeline.
The pipeline initializes two retrievers. The first one is registered in the DSPy settings; it does the actual retrieval and puts the results inside the LLM’s context. The second retriever is responsible for retrieving the titles of the papers (for the same contexts) so that we can show them as the returned sources. This means we need to make a slight change in our DSPy module.
OK, we are now all set to write our Streamlit function for chatting with the documents inside the collection. However, we first write one small function to list all the available collections.
Python
def get_all_collections(client: QdrantClient):
    return [collection.name for collection in client.get_collections().collections]
Now we build our Streamlit sidebar to select the collection from the available Qdrant collections.
Python
import time

import streamlit as st

# ---- Sidebar ---- #
with st.sidebar:
    all_collections = get_all_collections(client=qdrant_client)
    selected_collection = st.selectbox(label="Select your collection", options=all_collections)
    if selected_collection is None:
        st.error("No collections found")
    else:
        st.success(f"You will be chatting with Collection: {selected_collection}")


def chat(pipeline):
    # Replay the conversation stored in the session so far.
    if "messages" not in st.session_state:
        st.session_state.messages = []

    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    if prompt := st.chat_input("Please write your query"):
        user_content = {"role": "user", "content": prompt}
        st.session_state.messages.append(user_content)
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            message_placeholder = st.empty()
            full_response = ""
            while not full_response:
                with st.spinner("Thinking ...."):
                    try:
                        response, titles = pipeline(prompt)
                        response_str = response.answer
                        response_contexts = response.context
                        response_meta = [
                            {"title": title, "abstract": abstract}
                            for title, abstract in zip(titles, response_contexts)
                        ]
                    except Exception:
                        response_str = "Failed to respond"
                        response_meta = []

                # Write the answer character by character to mimic streaming.
                fr = ""
                full_response = str(response_str)
                for i in full_response:
                    time.sleep(0.01)
                    fr += i
                    message_placeholder.write(fr + "▌")
                message_placeholder.write(f"{full_response}")

                # Show each retrieved paper as an expandable source.
                if response_meta is not None and len(response_meta) > 0:
                    for meta in response_meta:
                        title = meta["title"]
                        abstract = meta["abstract"]
                        with st.expander(label=title):
                            st.write(abstract)
                else:
                    st.warning("No contexts found")
            st.session_state.messages.append({"role": "assistant", "content": full_response})


# ---- Main UI ---- #
# Runs after chat() is defined so that the call below can resolve it.
if selected_collection is None:
    st.error("Please set up Qdrant Engine properly. No Collections found.")
else:
    pipeline = setup_retriever_and_llm(collection_name=selected_collection)
    chat(pipeline=pipeline)
Congratulations if you have made it this far. Check out our other tutorials and our blog for more such use cases and content.