DocumentChain Class

The DocumentChain class is designed to facilitate the implementation of the Retrieval-Augmented Generation (RAG) method for information retrieval from various documents such as PDFs, text files, etc. This class encapsulates methods and attributes necessary to process and retrieve information from a collection of documents using advanced NLP models and techniques.

Attributes

  • docslist

    A list that holds the loaded documents after processing.

  • modelobject

    Placeholder for the language model used for downstream tasks (e.g., question answering).

  • chainobject

    Placeholder for the RAG chain which connects retrieval to downstream tasks.

  • retrieverobject

    An instance of a retriever that handles document indexing and retrieval.

  • bge_embeddingsobject

    An embedding model from Hugging Face used for generating embeddings for text.

  • vectorstoreobject

    A storage system for embeddings, which enables efficient querying and retrieval.

  • storeobject

    A key-value document store to hold chunked documents.

  • donebool

    A flag indicating if the initialization has been completed.

  • doc_nameslist

    A list of all document names that have been loaded.

  • directorystr

    The directory path where document files are stored.

Methods

  • __init__()

    Initializes an empty DocumentChain with defaults and directory to "./documents".

  • init_embedding_model()

    Initializes the embedding model with pre-trained Hugging Face embeddings.

  • load_documents()

    Loads all the documents from the specified directory and processes them into the docs attribute.

  • init_retriever()

    Initializes the retriever with a vector store and document store, and adds processed documents to the retriever.

  • init_model()

    Placeholder for initializing the downstream language model.

  • init_chain()

    Orchestrates the initialization of all components necessary for the RAG method, including the embedding model, downstream model, document loading, and retriever.

  • add_new_document(filename: str)

    Adds a new document to the retriever. The document is specified by the filename, and it is loaded, processed, and added to the store.

Usage Example

Below is an example demonstrating how to use the DocumentChain class:

# Initialize the DocumentChain
doc_chain = DocumentChain()

# Initialize all components
doc_chain.init_chain()

# Check if the initialization is complete
if doc_chain.done:
    print("Document Chain is ready for use.")

# Add a new document to the retriever
new_filename = "new_document.txt"
doc_chain.add_new_document(new_filename)

Note: External dependencies such as HuggingFaceBgeEmbeddings, TextLoader, PyPDFLoader, document splitters, Chroma, LLM_model, ParentDocumentRetriever should be properly installed and imported for the DocumentChain class to function correctly.

Important Considerations

  • Make sure that all required dependencies are satisfied.

  • The directory set in directory must exist and contain the documents to be processed.

  • Loading large sets of documents may be time-consuming and could require substantial memory, depending on document size and quantity.

  • The actual retrieval and downstream use of the RAG chain are not implemented within the provided methods.

If you encounter issues or have any questions regarding the use of the DocumentChain class, please refer to the project developer or support documentation.