Feel free to ask anything about the Document searcher app in this topic. I will do my best to answer any question. The source code is fully available on our GitHub page. Also make sure to check out this interesting article about the app. I also gave a webinar about this application; please check it out on our YouTube channel.
A little bit of background information about the app. I spent a couple of days building it. Roughly, there are two steps: extract all the text from the PDF and create the embeddings.
For extracting the text from the PDF I used the pypdf library. Then I split the text into smaller parts that Azure OpenAI can handle, using the RecursiveCharacterTextSplitter from Langchain.
# Splitter is used to chunk the document, so the chunk size doesn't become too big for Azure OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from pypdf import PdfReader

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150, separators=["\n"])
documents = []
with pdf_file.open_binary() as pdf_opened:
    reader = PdfReader(pdf_opened)
    for page_number, page in enumerate(reader.pages):
        page_extracted_text = page.extract_text()
        page_extracted_text_split = splitter.split_text(page_extracted_text)
        for split_text in page_extracted_text_split:
            documents.append(
                Document(
                    page_content=split_text,
                    metadata={
                        "page_number": page_number + 1,
                        "source": entity_name.split(".")[0],
                    },
                )
            )
For creating the embeddings, the Langchain library came in handy: it handles a lot of the heavy lifting in the chunking that is necessary for creating the embeddings. See below for the code:
# Embed chunks
from openai import AzureOpenAI

embedded_documents = []
API_KEY, ENDPOINT = get_API_key()
client = AzureOpenAI(
    api_key=API_KEY, api_version="2023-10-01-preview", azure_endpoint=ENDPOINT, max_retries=MAX_RETRIES
)
for document in documents:
    UserMessage.info(f"Embedding page {document.metadata['page_number']} for {document.metadata['source']}")
    embedding = get_embedding(client, document.page_content)
    embedded_documents.append(
        {
            "text": document.page_content,
            "embeddings": embedding,
            "page_number": document.metadata["page_number"],
            "source": document.metadata["source"],
        }
    )
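The `get_embedding` helper is not shown in the snippet above. A minimal sketch of what it could look like, assuming the standard Azure OpenAI embeddings call; the deployment name `text-embedding-ada-002` is an assumption here, use your own deployment name:

```python
def get_embedding(client, text, deployment="text-embedding-ada-002"):
    """Embed a single chunk of text via Azure OpenAI.

    `deployment` is assumed to be the name of your embedding
    deployment; replace it with your own.
    """
    response = client.embeddings.create(input=[text], model=deployment)
    # The response contains one embedding per input; we sent one chunk
    return response.data[0].embedding
```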
Finally, the prompt is constructed, using the following function:
def get_question_with_context(current_question, context):
    """Build a user message that asks the question against the provided context."""
    prompt = f"""
    Question: {current_question['content']} \n
    Answer the question based on the context below. When you don't know the answer, say "I don't know the
    answer, based on the provided context." \n
    Context: {context}\n
    """
    question_with_context = {"role": "user", "content": prompt}
    return question_with_context
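One step in between is not shown above: how the `context` string is retrieved from the embedded chunks. A minimal sketch of such a retrieval step over the `embedded_documents` list, using plain cosine similarity; the `get_context` name and the `top_k` parameter are my own for illustration, not taken from the app:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def get_context(question_embedding, embedded_documents, top_k=3):
    """Return the text of the top_k chunks most similar to the question."""
    ranked = sorted(
        embedded_documents,
        key=lambda doc: cosine_similarity(question_embedding, doc["embeddings"]),
        reverse=True,
    )
    return "\n".join(doc["text"] for doc in ranked[:top_k])
```

The question itself is embedded with the same `get_embedding` call as the document chunks, and the resulting context string is passed to `get_question_with_context`.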
That’s it! With very little code it is possible to create a very powerful app, and sharing the app is really easy using VIKTOR.
If you have any questions, please comment in this topic!