How to develop a Generalized RAG Pipeline for Text, Images, and Structured Data [closed]

Question 1

I'm trying to find a general RAG solution for documents that combine text, images, charts, tables, and so on, in formats such as .docx, .xlsx, and .pdf.

The requirements for the answers:

Some answers are just images.
Some answers contain only text and need to be absolutely accurate because they describe a process.
Other answers do not need to be absolutely accurate but must still be logically consistent; this is something I am already working on.

The features of the documents:

Some documents in DOCX and Excel formats contain only text; this is the simplest form. My task is to determine the embedding model and LLM, in addition to selecting hyperparameters such as chunk size, chunk overlap, etc., and experimenting to find the appropriate values
If the documents have more complex content, such as DOCX files containing text and images, or PDF files containing text, images, charts, tables, etc., I haven't found a general solution to handle them yet.
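
For the text-only case, the two hyperparameters named above can be illustrated with a minimal character-based splitter. This is only a sketch for experimenting with values: the function name `chunk_text` and the default sizes are assumptions, and a real pipeline would more likely split on tokens or sentences.

```python
def chunk_text(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for size, overlap in [(4, 0), (4, 2)]:
        print(size, overlap, chunk_text("abcdefghij", size, overlap))
```

A larger overlap repeats more context between neighbouring chunks at the cost of more chunks to embed, which is the trade-off to measure when tuning these values.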

Below are some documents I have read but don't fully understand; I'm not sure how they can help me.

I want to be able to outline a pipeline to answer questions according to the requirements of my system. Any help would be greatly appreciated!

System:

The LLM runs locally (Llama 3.1 13N Instruct, Qwen2-7B-Instruct, ...)

Question 2

Do you want a specific RAG methodology for the pipeline, such as RAG-Fusion?

Question 3

@DerekRoberts, of course. I think it belongs to the later stage, after retrieval has been processed.


Question 9

Yes, it can be a PDF, DOCX, or XLSX file.

D.lola (2,322 reputation; 2 gold, 9 silver, 22 bronze badges) · Accepted Answer · 2024-08-08 10:15:32Z

Here is sample code for implementing RAG-Fusion. You will have to adapt it to your own requirements. It serves as a guide for JSON files; other formats such as PDFs and images can be handled by following the same procedure.

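def determine_extension(file):
    # str.endswith needs a tuple to test several suffixes at once
    if file.endswith((".jpg", ".png")):
        send_image_to_rag_classifier(file)  # placeholder, not defined in this answer
    elif ...:  # other extensions, elided in the original
        ...
    else:
        ...
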
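
The extension check in `determine_extension` could be generalized into a dispatch table, so that each format gets its own handler. The sketch below is only an illustration: every handler name is hypothetical and returns a label instead of doing real extraction.

```python
from pathlib import Path
from typing import Callable


# Hypothetical handlers: each one stands in for a real extractor.
def handle_image(path: Path) -> str:
    return f"image:{path.name}"


def handle_pdf(path: Path) -> str:
    return f"pdf:{path.name}"


def handle_docx(path: Path) -> str:
    return f"docx:{path.name}"


def handle_xlsx(path: Path) -> str:
    return f"xlsx:{path.name}"


def handle_json(path: Path) -> str:
    return f"json:{path.name}"


# Map lowercase suffixes to handlers; extend this table per format.
HANDLERS: dict[str, Callable[[Path], str]] = {
    ".jpg": handle_image,
    ".png": handle_image,
    ".pdf": handle_pdf,
    ".docx": handle_docx,
    ".xlsx": handle_xlsx,
    ".json": handle_json,
}


def route_file(file: str) -> str:
    """Pick a handler from the file suffix, ignoring case."""
    path = Path(file)
    handler = HANDLERS.get(path.suffix.lower())
    if handler is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return handler(path)


if __name__ == "__main__":
    print(route_file("reports/Chart.PNG"))
    print(route_file("reports/process.docx"))
```

Unknown suffixes raise an error here rather than being skipped silently, which makes gaps in format coverage visible early.
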
"""Implement RAG fusion using the LangChain library."""
import json
import logging
from operator import itemgetter
from pathlib import Path
from typing import Any

from dotenv import find_dotenv, load_dotenv
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.vectorstores import VectorStoreRetriever
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveJsonSplitter

logger = logging.getLogger(__name__)

# Read OPENAI_API_KEY from the environment (.env file)
load_dotenv(find_dotenv())

# Prompt used to generate several search queries from the user question,
# one per line. It should reference {question}; the wording is up to you.
SYSTEM_PROMPT = """
    Your prompt
    """


def collect_data_files(filepath: Path) -> list[Path]:
    """Walk through the file path and collect the files.

    Args:
        filepath: The directory to walk through.
    Returns:
        list: List of files.
    """
    # The original body was omitted. Collecting *.json files recursively is an
    # assumption, based on retrieve_data parsing every file as JSON.
    store_file = sorted(Path(filepath).rglob("*.json"))
    return store_file


def retrieve_data(data) -> list[Document]:
    """Load every JSON file under `data` and split it into chunks.

    Args:
        data: The directory that holds the JSON files.
    Returns:
        list: List of documents.
    """
    splitter = RecursiveJsonSplitter(max_chunk_size=300)
    documents: list[Document] = []
    for file in collect_data_files(data):
        with open(file, "r", encoding="utf-8") as f:
            content = json.load(f)
        # create_documents expects a list of JSON objects, so wrap the parsed
        # file; extend (not assign) so that earlier files are not discarded
        documents.extend(splitter.create_documents(texts=[content], convert_lists=True))
    return documents


def vectorstore_db(data) -> VectorStoreRetriever:
    """Create a vector store from the data.

    Args:
        data: The data to be indexed.
    Returns:
        VectorStoreRetriever: The vector store retriever.
    """
    # The original body was omitted. Indexing the chunks in Chroma with OpenAI
    # embeddings is an assumption that matches the imports above.
    vectorstore = Chroma.from_documents(
        documents=retrieve_data(data), embedding=OpenAIEmbeddings()
    )
    vector_retriever = vectorstore.as_retriever()
    return vector_retriever


def get_unique_union_of_documents(docs: list[list[Document]]) -> list[Any]:
    """Get the unique union of the documents.

    Args:
        docs: One list of retrieved documents per generated query.
    Returns:
        list: The unique union of the documents.
    """
    # The original body was omitted. Assumed here: deduplicate on the chunk
    # text, keeping first-seen order. Each chunk is a JSON string produced by
    # RecursiveJsonSplitter, so json.loads restores the underlying data.
    unique_union = list(
        dict.fromkeys(doc.page_content for sublist in docs for doc in sublist)
    )
    return [json.loads(doc) for doc in unique_union]


class RAGFusion:
    """Implement the RAG fusion.

    Args:
        data: The data to be used for the RAG fusion.
    """

    def __init__(self, data) -> None:
        self.data = data

    def __call__(self, question: str) -> str:
        """Answer a question with the RAG fusion chain.

        Args:
            question: The question to be answered.
        Returns:
            str: The answer to the question.
        """
        try:
            # Step 1: generate several search queries from the question
            prompt_for_rag_fusion = ChatPromptTemplate.from_template(SYSTEM_PROMPT)
            generate_query = (
                prompt_for_rag_fusion
                | ChatOpenAI(temperature=0.5, max_tokens=4096)
                | StrOutputParser()
                | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
            )
            # Note: this rebuilds the index on every call; cache it for repeated use
            vb = vectorstore_db(self.data)
            # Step 2: run every query through the retriever and merge the hits
            retrieval_chain = generate_query | vb.map() | get_unique_union_of_documents
            chat_template = """
            Answer the following question based on the data and context provided.
            Context: {context}
            Question: {question}
            """
            # Step 3: answer the original question from the merged context
            prompt = ChatPromptTemplate.from_template(chat_template)
            llm = ChatOpenAI(temperature=0.5, max_tokens=4096)
            final_rag_fusion = (
                {"context": retrieval_chain, "question": itemgetter("question")}
                | prompt
                | llm
                | StrOutputParser()
            )
            return final_rag_fusion.invoke({"question": question})
        except Exception:
            # Log with traceback and re-raise instead of silently returning None
            logger.exception("An error occurred while answering the question")
            raise
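
The chain above merges the per-query results as a unique union. RAG-Fusion as it is commonly described goes one step further and re-ranks the merged results with reciprocal rank fusion (RRF). A minimal sketch of that step follows, assuming each retrieval returns an ordered list of document ids; the function name and the constant `k = 60` (a commonly used default) are illustrative choices, not part of the code above.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each hit adds 1 / (k + rank) to its document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # One ranked list per generated query (ids are made up for the example)
    results = [["a", "b", "c"], ["b", "a", "d"], ["b", "c"]]
    for doc_id, score in reciprocal_rank_fusion(results):
        print(doc_id, round(score, 4))
```

Documents that appear near the top of several lists end up first, so the fused order can be passed to the final prompt in place of the unordered union.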

Hi, thanks for your answer, and sorry for the late reply. In the retrieve_data function, if a file contains images, charts, etc., can RecursiveJsonSplitter handle them? In the RAGFusion class, what is the difference between chat_template and SYSTEM_PROMPT?
@happy your question has been closed, so until Stack Overflow reopens it I may not be able to help you. If you don't mind, I can edit the question once Stack Overflow allows me to; it says there are too many edits at the moment. I would also advise you to keep your questions concise.
Well, you can either create a new question and send the URL here, or wait for Stack Overflow to allow me to edit it.
@happy of course, I have submitted it for review. Let us wait. The code I submitted in the answer section is a month of hard work studying LLM architectures. I am trying to contribute my share to the community, which has helped me in the past. The voting is just to create awareness so that other users can find it quickly.
