Paper search and QnA on ArXiv papers
Welcome to the Prem AI cookbook section. In this recipe, we are going to implement a custom Retrieval Augmented Generation (RAG) pipeline that can answer questions and search through ML-related arXiv papers. We are going to use PremAI, Qdrant and DSPy for this recipe.
If you are not yet familiar with Qdrant, it is an open-source vector database and similarity search engine. You can also host Qdrant locally.
If you are not familiar with DSPy, check out our introductory recipe on using DSPy. We have covered many introductory concepts there. You can also check out the DSPy documentation for more information.
To give a nice visualization, we use Streamlit, and here is how the final app would look:
So without further ado, let’s get started. You can find the full source code here.
Objective
This recipe aims to show developers and users how to get started with Prem’s Generative AI Platform and build different use cases around it. We will build a simple RAG pipeline using the tools mentioned above to search through relevant ML-related papers on arXiv and answer user questions, citing the papers the answers draw on. At a high level, here are the steps:
- Download a sample dataset from HuggingFace for our experiment. We will use ML-ArXiv-Papers, which contains a vast subset of Machine Learning papers. This dataset includes the title of the paper and the abstract.
- Once downloaded, we do some preprocessing (which includes converting the data into proper formats and converting the dataset into smaller batches).
- We get the embeddings using Prem Embeddings and initialize a Qdrant Collection to store those embeddings and their corresponding data.
- After this, we connect the Qdrant collection with DSPy and build a simple RAG Module.
- Finally, we test this with some sample questions.
Sounds interesting, right? Let’s start by installing and importing all the essential packages.
Setting up the project
Let’s start by creating a virtual environment and installing dependencies.
Next, install the dependencies. You can check out all of them in the requirements.txt file. To run the Qdrant engine, you need to have Docker installed. You can run Qdrant’s official Docker image using the following command:
Where:
- REST API will run in: localhost:6333
- Web UI will run in: localhost:6333/dashboard
- GRPC API will run in: localhost:6334
Once all the dependencies are installed, we import the following packages:
All the Qdrant-related imports:
All the DSPy-PremAI and DSPy-Qdrant related imports:
We define some constants, which include the PremAI project ID, the embedding model we are going to use, the name of the HuggingFace dataset, the name of the Qdrant collection (which can be any name you like), and the Qdrant server URL through which we are going to access the DB.
Prem AI offers a variety of models, including SOTA LLMs and embedding models (see the list here), so you can experiment with all of them.
The project ID we used is a dummy ID. Make sure you have an account on the Prem AI Platform, a valid project ID, and an API key. Additionally, you need at least one repository-id as a last requirement.
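As a rough sketch, the constants described above might look like the following. The original values were not preserved on this page, so every value here is a placeholder or an assumption: the project ID is a dummy, the dataset path CShorten/ML-ArXiv-Papers is our best guess at the HuggingFace identifier, and the collection name and the PREMAI_API_KEY environment variable name are purely illustrative.

```python
import os

# All values below are placeholders; replace them with your own.
PROJECT_ID = 1234                           # dummy PremAI project ID
EMBEDDING_MODEL_NAME = "mistral-embed"      # embedding model used in this recipe
EMBEDDING_SIZE = 1024                       # vector size for mistral-embed
DATASET_NAME = "CShorten/ML-ArXiv-Papers"   # assumed HuggingFace dataset path
COLLECTION_NAME = "arxiv-ml-papers"         # hypothetical; any name works
QDRANT_SERVER_URL = "http://localhost:6333" # REST endpoint of the local Qdrant

# Assumed environment variable name; avoids hard-coding the API key.
PREMAI_API_KEY = os.environ.get("PREMAI_API_KEY", "")
```

Reading the API key from the environment keeps it out of the source file, which matters once the code is shared or committed.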
Loading dataset from HF and preprocessing it
In our very first step, we need to download the dataset. The dataset comprises a title and an abstract column, which hold the title and abstract of each paper. We are going to fetch those columns. We are also going to take a smaller subset (let’s say 1000 rows) just for the sake of this tutorial and convert it into a list of dictionaries in the following format:
After this, we are going to write a simple function that uses Prem Vectorizer from DSPy to convert a text or list of texts to its embedding. Prem Vectorizer internally uses Prem SDK to extract embeddings from text and is compatible with the DSPy ecosystem.
Output
As we can see, the features include two columns named “Unnamed”, so we are going to remove them first and also take a subset of the rows (in our case, 1000 rows). Finally, we convert this into a dict.
Right now, this dict is not in the list-of-dictionaries format shown above. It is in this format:
So, we need to convert this to the format we want, which makes it easier to get the embeddings and insert them into the Qdrant DB.
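The conversion can be sketched in plain Python as follows. This is a minimal illustration, assuming the dict maps each column name to a list of values and that the target format is one dictionary per paper; the function name columns_to_records and the toy data are ours, not part of the original recipe.

```python
def columns_to_records(columns):
    """Turn {"title": [...], "abstract": [...]} into
    [{"title": ..., "abstract": ...}, ...]."""
    keys = list(columns)
    lengths = {len(columns[key]) for key in keys}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of rows")
    # zip(*columns) walks the columns in lockstep, yielding one row at a time
    return [dict(zip(keys, row)) for row in zip(*(columns[key] for key in keys))]


# Toy data standing in for the real 1000-row subset
columns = {
    "title": ["Paper A", "Paper B"],
    "abstract": ["Abstract of A.", "Abstract of B."],
}
records = columns_to_records(columns)
print(records[0])
```

Each record now carries everything needed for one paper, so it can later be stored as-is alongside its embedding.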
Creating embeddings of the dataset
We write a simple function to get the embedding from the text. It is super simple; we initialize the premai vectorizer and then use it to get the embedding. By default, the premai vectorizer returns a numpy.ndarray, which we convert into a list of lists so that it is easier to upload to Qdrant.
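The shape of such a function can be sketched without any external service. In the sketch below, embedder is a hypothetical callable standing in for the premai vectorizer, and fake_embedder exists only so the example runs; neither is the recipe’s actual code.

```python
def get_embeddings(texts, embedder):
    """Embed one text or a list of texts and return a list of lists of floats.

    `embedder` is any callable that maps a list of strings to a sequence of
    vectors (in the recipe, this role is played by the premai vectorizer).
    """
    if isinstance(texts, str):
        texts = [texts]  # normalise a single text into a one-item batch
    vectors = embedder(texts)
    # numpy arrays expose .tolist(); other sequences are converted row by row
    if hasattr(vectors, "tolist"):
        return vectors.tolist()
    return [[float(value) for value in vector] for vector in vectors]


def fake_embedder(texts):
    """Hypothetical stand-in: a 3-dimensional vector derived from the text."""
    return [(len(text), text.count(" "), 1) for text in texts]


embeddings = get_embeddings("graph neural networks", fake_embedder)
```

Accepting both a single string and a list keeps the call site the same for one-off queries and for whole mini-batches.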
Uploading mini-batches of embeddings to Qdrant
Qdrant sometimes returns a request timed out error when the number of embeddings to upload is huge. So, to prevent this issue, we are going to do the following:
- Create mini-batches of the dataset.
- Get the embeddings for all the abstracts in that mini-batch.
- Iterate over the docs and their corresponding embeddings and create Qdrant Points. In short, a Qdrant Point acts like a central entity, mostly a vector, and Qdrant can do all sorts of operations on it.
- Finally, upload the points to our Qdrant collection. A collection is a structure in Qdrant where we keep a set of points (vectors) among which we can do operations like search.
But before doing all the steps mentioned above, we need to initialize the Qdrant client and make a collection. Since we use mistral-embed, the embedding size is 1024. This can vary when using different embedding models.
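The mini-batch loop described above can be outlined as follows. This is a sketch under stated assumptions, not the recipe’s code: the embed and upload callables are hypothetical stand-ins for the Prem embedding call and the Qdrant upload, and the points are plain dicts whose id, vector and payload keys merely mirror the fields of a Qdrant point.

```python
from itertools import islice
from uuid import uuid4


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def build_points(docs, vectors):
    """Pair each document with its vector in a point-like dict."""
    return [
        {"id": str(uuid4()), "vector": vector, "payload": doc}
        for doc, vector in zip(docs, vectors)
    ]


def upload_in_batches(docs, embed, upload, batch_size=100):
    """Embed and upload `docs` one mini-batch at a time; return points sent."""
    total = 0
    for batch in batched(docs, batch_size):
        vectors = embed([doc["abstract"] for doc in batch])
        points = build_points(batch, vectors)
        upload(points)  # in the recipe, this is where the Qdrant upload happens
        total += len(points)
    return total


# Hypothetical stand-ins so the sketch runs without PremAI or Qdrant
docs = [{"title": f"Paper {i}", "abstract": f"Abstract {i}"} for i in range(250)]
uploaded_batches = []
count = upload_in_batches(
    docs,
    embed=lambda texts: [[0.0] * 1024 for _ in texts],
    upload=uploaded_batches.append,
)
```

With 250 documents and a batch size of 100, the upload callable is invoked three times, so no single request carries the whole dataset.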
Congratulations if you have made it this far. In the later part of this tutorial, we will use this collection with DSPy and PremAI LLMs to create a simple RAG module. If you are unfamiliar with DSPy, check out our introductory tutorial on DSPy.
Building our RAG pipeline using DSPy, PremAI and Qdrant
We are going to start by initializing our DSPy-PremAI object as our LLM and using DSPy-Qdrant as our retriever. This retriever does all the heavy lifting of doing a nearest neighbour search for us and returns the top-k matched documents, which we will pass as our context to our LLM to answer our question.
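To make the idea of a top-k nearest neighbour search concrete, here is a brute-force illustration in plain Python. It assumes cosine similarity as the metric (the collection’s actual distance setting is not shown on this page) and scans every point, whereas Qdrant relies on an index rather than a full scan; the function names and toy vectors are ours.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vector, points, k=3):
    """Return the k points whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, p["vector"]), p) for p in points]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [point for _, point in scored[:k]]


# Toy 2-dimensional points; real embeddings have 1024 dimensions here
points = [
    {"vector": [1.0, 0.0], "payload": {"abstract": "about transformers"}},
    {"vector": [0.0, 1.0], "payload": {"abstract": "about clustering"}},
    {"vector": [0.9, 0.1], "payload": {"abstract": "about attention"}},
]
matches = top_k([1.0, 0.0], points, k=2)
```

The abstracts stored in the payloads of the returned points are what end up as context for the LLM.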
Now, before moving forward, let’s do a quick sanity check on whether our retriever is returning relevant results.
Seems like we are getting some good relevant answers. Now let’s jump right in to make our simple RAG pipeline using DSPy.
Define a DSPy Signature and the RAG Module
The very first building block of our RAG pipeline is a DSPy Signature. In short, a signature describes the input and output fields without making you write big and messy prompts. You can also think of this as a prompt blueprint. Once you have created this blueprint, DSPy internally tries to improve the prompt during its optimization step (we will come to that later).
In our case, we should have the following parameters:
- context: This will be an InputField which will contain all the retrieved passages.
- question: This will be another InputField which will contain the user query.
- answer: This will be the OutputField which contains the answer generated by the LLM.
After this, we will define the overall RAG pipeline inside a single class, called a Module in DSPy. Generally, Modules in DSPy represent:
- Ways of running some prompting techniques like Chain of Thought or ReAct. We are going to use Chain of Thought for our case.
- Building a workflow, which constitutes multiple composable steps.
- You can even attach / chain multiple modules to form a single module. This gives us better modularity and helps us write cleaner LLM orchestration pipelines.
Now, let’s implement our RAG module.
As you can see in the above code, we first define our retriever and then bind our signature with the ChainOfThought Module, which uses this blueprint to generate a better prompt that still contains the same input and output fields we defined in our base signature.
In the forward step (i.e., when we call the RAG module object), we will first retrieve all the contexts from the retriever and then use this context to generate the answer from our signature. After this, we will return the predictions in a good format containing the context and the answer so that we can see what abstracts were retrieved.
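The retrieve-then-generate flow of the forward step can be outlined in plain Python. This is deliberately not DSPy code: SimpleRAG, Prediction, fake_retrieve and fake_generate are hypothetical names, with the two callables standing in for the Qdrant retriever and the PremAI LLM so that the sketch runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    context: list  # retrieved abstracts
    answer: str    # generated answer


class SimpleRAG:
    """Plain-Python outline of the retrieve-then-generate flow."""

    def __init__(self, retrieve, generate, num_passages=3):
        self.retrieve = retrieve  # callable: (question, k) -> list of passages
        self.generate = generate  # callable: (context, question) -> answer
        self.num_passages = num_passages

    def forward(self, question):
        context = self.retrieve(question, self.num_passages)
        answer = self.generate(context, question)
        # Return both, so the caller can see which abstracts were used
        return Prediction(context=context, answer=answer)

    __call__ = forward  # calling the object runs the forward step


# Hypothetical stand-ins for the retriever and the LLM
def fake_retrieve(question, k):
    corpus = ["Abstract on attention.", "Abstract on GANs.", "Abstract on RL."]
    return corpus[:k]


def fake_generate(context, question):
    return f"Answer to '{question}' based on {len(context)} passages."


rag = SimpleRAG(fake_retrieve, fake_generate, num_passages=2)
prediction = rag("What is attention?")
```

Returning the context next to the answer is what later makes it possible to show the user where an answer came from.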
Testing our DSPy pipeline with an example prompt
We are almost there. As our final step, let’s test our pipeline with a sample question.
You can even return more metadata, such as the paper title and paper link. These would not be passed as context; they serve as references so that the user can find the relevant papers.
Congratulations, now you know how to make a basic RAG pipeline using PremAI, DSPy and Qdrant.
Creating the Streamlit web app (optional)
In this section, we are going to show you how to create a simple Streamlit app as shown above. You can find the full code here.
We do not explain this code in full, since we reuse boilerplate code from the Chat With PDF and Chat with SQL Tables recipes. You can refer to those recipes for an extended explanation of the Streamlit chat boilerplate.
We start by writing the code that builds the overall pipeline. Here is how that looks:
As you can see in the above code, we have initialized two retrievers. The first is registered in the DSPy settings; it does the actual retrieval and puts the results inside the LLM’s context. The second retrieves the titles of the papers (for the same contexts) so that we can show them as the returned sources. This means we need to make a slight change to our DSPy module.
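The idea of keeping LLM context and user-facing sources apart can be sketched as follows. Note the swap: instead of two retrievers, this illustration assumes a single set of search hits whose payload carries both the title and the abstract; the function name and the hit structure are hypothetical.

```python
def split_context_and_sources(hits):
    """Separate what the LLM reads (abstracts) from what the user sees (titles)."""
    context = [hit["payload"]["abstract"] for hit in hits]
    sources = []
    for hit in hits:
        title = hit["payload"].get("title", "Untitled")
        if title not in sources:  # keep retrieval order, drop duplicates
            sources.append(title)
    return context, sources


# Hypothetical search hits; the payload keys match the dataset's columns
hits = [
    {"payload": {"title": "Paper A", "abstract": "Abstract of A."}},
    {"payload": {"title": "Paper B", "abstract": "Abstract of B."}},
]
context, sources = split_context_and_sources(hits)
```

Only the abstracts go into the prompt, while the titles are displayed next to the answer as references.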
We are now all set to write our Streamlit function to chat with the documents inside the collection. However, we first write a small function to list all the available collections.
Now we build our Streamlit sidebar to select the collection from the available Qdrant collections.
Congratulations if you have made it this far. Check out our other tutorials and our blog for more use cases and content.