Feel free to ask anything about the Document searcher app in this topic. I will do my best to answer any question. The source code is fully available on our GitHub page. Also make sure to check out this interesting article about the app. I also gave a webinar about this application; please check it out on our YouTube channel.
A little bit of background information about the app. I spent a couple of days building it. Roughly, there are two steps: extract all the text from the PDF and create the embeddings.
For extracting the text from the PDF I used the pypdf library. Then I split the text into smaller parts that Azure OpenAI can handle, using the RecursiveCharacterTextSplitter from Langchain.
# Splitter is used to chunk the document, so the chunk size doesn't become too big for Azure OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from pypdf import PdfReader

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150, separators=["\n"])
documents = []
with pdf_file.open_binary() as pdf_opened:
    reader = PdfReader(pdf_opened)
    for page_number, page in enumerate(reader.pages):
        page_extracted_text = page.extract_text()
        page_extracted_text_split = splitter.split_text(page_extracted_text)
        for split_text in page_extracted_text_split:
            documents.append(
                Document(
                    page_content=split_text,
                    metadata={
                        "page_number": page_number + 1,
                        "source": entity_name.split(".")[0],
                    },
                )
            )
For creating the embeddings, the Langchain library came in handy: it handles a lot of the heavy lifting in the chunking that is necessary for creating the embeddings. See below for the code:
# Embed chunks
from openai import AzureOpenAI

embedded_documents = []
API_KEY, ENDPOINT = get_API_key()
client = AzureOpenAI(
    api_key=API_KEY, api_version="2023-10-01-preview", azure_endpoint=ENDPOINT, max_retries=MAX_RETRIES
)
for document in documents:
    UserMessage.info(f"Embedding page {document.metadata['page_number']} for {document.metadata['source']}")
    embedding = get_embedding(client, document.page_content)
    embedded_documents.append(
        {
            "text": document.page_content,
            "embeddings": embedding,
            "page_number": document.metadata["page_number"],
            "source": document.metadata["source"],
        }
    )
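The `get_embedding` helper is not shown in the snippet above. A minimal sketch of what it could look like, assuming the standard Azure OpenAI embeddings call; the deployment name `text-embedding-ada-002` is an assumption here, use your own deployment name:

```python
def get_embedding(client, text, deployment="text-embedding-ada-002"):
    """Embed a single chunk of text via Azure OpenAI.

    `deployment` is assumed to be the name of your embedding
    deployment; replace it with your own.
    """
    response = client.embeddings.create(input=[text], model=deployment)
    # The response contains one embedding per input; we sent one chunk
    return response.data[0].embedding
```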
Finally, the prompt is constructed, using the following function:
def get_question_with_context(current_question, context):
    """Build a user message that asks the question against the provided context."""
    prompt = f"""
    Question: {current_question['content']} \n
    Answer the question based on the context below. When you don't know the answer, say "I don't know the
    answer, based on the provided context." \n
    Context: {context}\n
    """
    question_with_context = {"role": "user", "content": prompt}
    return question_with_context
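One step in between is not shown above: how the `context` string is retrieved from the embedded chunks. A minimal sketch of such a retrieval step over the `embedded_documents` list, using plain cosine similarity; the `get_context` name and the `top_k` parameter are my own for illustration, not taken from the app:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def get_context(question_embedding, embedded_documents, top_k=3):
    """Return the text of the top_k chunks most similar to the question."""
    ranked = sorted(
        embedded_documents,
        key=lambda doc: cosine_similarity(question_embedding, doc["embeddings"]),
        reverse=True,
    )
    return "\n".join(doc["text"] for doc in ranked[:top_k])
```

The question itself is embedded with the same `get_embedding` call as the document chunks, and the resulting context string is passed to `get_question_with_context`.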
That’s it! With very little code it is possible to create a very powerful app, and sharing the app is really easy using VIKTOR.
If you have any questions, please comment in this topic!