Public app: Document searcher, powered by ChatGPT

Feel free to ask anything about the Document searcher app in this topic. I will do my best to answer any question, including questions about the source code. Also make sure to check out this interesting article about the app.

Some background on the app: I built it in a couple of days. It works in roughly two steps: extract all the text from the PDF, then create the embeddings.
To extract the text I used the library pdfplumber. I then split the text into smaller parts that ChatGPT can handle, using the RecursiveCharacterTextSplitter from LangChain.

        # Read PDF
        documents = []
        splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150, separators=["\n"])
        for pdf_file in params.pdf_uploaded:
            with pdfplumber.open(io.BytesIO(pdf_file.file.getvalue_binary())) as pdf:
                n_pages = len(pdf.pages)
                for page_number, page in enumerate(pdf.pages, start=1):
                    progress_message(f"Setting up vector dataframe for...")
                    UserMessage.info(f"Reading page {page_number}/{n_pages} from {pdf_file.filename}")
                    page_text_split = splitter.split_text(page.extract_text(stream=True))
                    for split_text in page_text_split:
                        documents.append(
                            Document(
                                page_content=split_text,
                                metadata={"source": pdf_file.filename, "page_number": page_number},
                            )
                        )
                    page.flush_cache()
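
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified stand-in for the splitter. It is not how RecursiveCharacterTextSplitter works internally (that one splits on the given separators first); `split_with_overlap` is a hypothetical helper that just cuts fixed-size character windows.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1500, chunk_overlap: int = 150) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small numbers make the overlap easy to see
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence that is cut at a chunk boundary still appears in full in at least one of the neighbouring chunks, as long as it is shorter than the overlap.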

For creating the embeddings, the library LangChain came in handy, as it does a lot of the heavy lifting. See the code below:

        embeddings = OpenAIEmbeddings(openai_api_key=API_KEY)
        embedded_documents = []
        for document in documents:
            # n_pages from the reading loop only holds the page count of the last PDF, so it is left out here
            UserMessage.info(f"Embedding page {document.metadata['page_number']} of "
                             f"{document.metadata['source']}")
            embedded_documents.append(
                {
                    "text": document.page_content,
                    "embeddings": embeddings.embed_query(document.page_content),
                    "page_number": document.metadata["page_number"],
                    "source": document.metadata["source"],
                }
            )
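
The post does not show how the stored embeddings are matched to a question. A common approach is to embed the question in the same way and rank the chunks by cosine similarity. The sketch below assumes the `embedded_documents` structure from above; `cosine_similarity` and `top_k_chunks` are hypothetical helpers, and the two-dimensional vectors are toy values standing in for real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(question_embedding: Sequence[float], embedded_documents: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k chunks whose embeddings are most similar to the question embedding."""
    ranked = sorted(
        embedded_documents,
        key=lambda doc: cosine_similarity(question_embedding, doc["embeddings"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data in the same shape as embedded_documents
toy_documents = [
    {"text": "Pile length is 12 m.", "embeddings": [1.0, 0.0], "page_number": 3, "source": "spec.pdf"},
    {"text": "Payment is due within 30 days.", "embeddings": [0.0, 1.0], "page_number": 7, "source": "contract.pdf"},
]
best = top_k_chunks([0.9, 0.1], toy_documents, k=1)
print(best[0]["source"], best[0]["page_number"])
```

With real data, the question embedding would come from something like `embeddings.embed_query(question)`, so that the question and the chunks live in the same vector space.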

Finally, the prompt is constructed with the following function:

def get_question_with_context(current_question, context):
    """Wrap the question and the retrieved context in a single user message for the conversation."""
    prompt = f"""
             Question: {current_question["content"]}  \n
             Answer the question based on the context below. When you don't know the answer, say "I don't know the
             answer, based on the provided context."  \n
             Context: {context}\n
             """
    question_with_context = {"role": "user", "content": prompt}
    return question_with_context
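
How the `context` argument is assembled is not shown here. One possible way, assuming the most relevant chunks have already been selected, is to join them into a single string and label each chunk with its source and page so the answer can be traced back. `build_context` is a hypothetical helper, not part of the app's published code.

```python
from typing import Dict, List


def build_context(chunks: List[Dict]) -> str:
    """Join chunks into one context string, labelling each with its source and page."""
    parts = []
    for chunk in chunks:
        label = f"[{chunk['source']}, page {chunk['page_number']}]"
        parts.append(f"{label}\n{chunk['text']}")
    return "\n\n".join(parts)


# Toy chunks in the same shape as the embedded documents (embeddings omitted)
retrieved = [
    {"text": "Pile length is 12 m.", "page_number": 3, "source": "spec.pdf"},
    {"text": "Payment is due within 30 days.", "page_number": 7, "source": "contract.pdf"},
]
context = build_context(retrieved)
print(context)
```

The resulting string could then be passed as `context` to `get_question_with_context`, together with a question message such as `{"role": "user", "content": "How long are the piles?"}`.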

That’s it! With very little code it is possible to create a very powerful app, and sharing it is really easy with VIKTOR.

If you have any questions, please comment in this topic!

On VIKTOR’s website, an interesting (though admittedly less technical) article is posted regarding this document searcher application!

This may help you explain to your colleagues and managers what value LLM-based apps like this one can bring to your organisation.

Check it out here :rocket:

Great app. I can really see this app helping us save a lot of time by not having to manually scour all the engineering and contract documents for our projects. It would also reduce the inevitable sloppy mistakes that result from the large amounts of text we have to process.

Would it be possible for us to make this into a ‘simple’-type app where we can create entities for all the projects we have? I would like to prepare the embeddings for the different entities, so our project teams can quickly start asking questions whenever they like.

Hi Lloyd, thank you for your kind words.

It is indeed possible to make this into a simple app type, so you don’t have to recreate your embeddings each time you open the app. I am finishing up the public repo and will edit this post when it is finished.

Hi Jell, great example! Thanks for sharing the code and providing a full working demo. That’s a great way to demonstrate the value.

We have implemented a very similar app on VIKTOR with three modules: 1/ doc query, 2/ pre-processed embeddings for our corporate documents, and 3/ the standard chatbot everybody loves to use.

We have received great feedback from our first users, and we are planning more developments and use cases.

I’m looking forward to reading about the storage options as this has been one of our limiting factors.
