Feel free to ask anything about the Document searcher app in this topic. I will do my best to answer any question, for example about the source code. Also make sure to check out this interesting article about the app.
A bit of background on the app: I spent a couple of days building it. There are roughly two steps: extract all the text from the PDF, and create the embeddings.
For extracting the text from the PDF I used the pdfplumber library. Then I split the text into smaller parts that ChatGPT can handle, using the RecursiveCharacterTextSplitter from LangChain.
# Read PDF
import io

import pdfplumber
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
from viktor import UserMessage, progress_message  # VIKTOR SDK helpers

documents = []
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150, separators=["\n"])
for pdf_file in params.pdf_uploaded:
    with pdfplumber.open(io.BytesIO(pdf_file.file.getvalue_binary())) as pdf:
        n_pages = len(pdf.pages)
        for page_number, page in enumerate(pdf.pages, start=1):
            progress_message("Setting up vector dataframe for...")
            UserMessage.info(f"Reading page {page_number}/{n_pages} from {pdf_file.filename}")
            page_text_split = splitter.split_text(page.extract_text(stream=True))
            for split_text in page_text_split:
                documents.append(
                    Document(
                        page_content=split_text,
                        metadata={"source": pdf_file.filename, "page_number": page_number},
                    )
                )
            page.flush_cache()
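For intuition about what `chunk_size=1500` and `chunk_overlap=150` mean, here is a deliberately simplified, character-only sketch. It is not LangChain's actual separator-aware algorithm (the real splitter prefers to break on the given separators), but it shows the overlap idea: each chunk repeats the tail of the previous one so context is not lost at chunk boundaries.

```python
# Simplified sketch (NOT LangChain's implementation): pack characters into
# chunks of at most `chunk_size`, stepping back `chunk_overlap` characters
# after each chunk so consecutive chunks share some text.
def split_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 150) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - chunk_overlap  # step back to create the overlap
    return chunks

chunks = split_text("x" * 4000, chunk_size=1500, chunk_overlap=150)
print(len(chunks), [len(c) for c in chunks])  # → 3 [1500, 1500, 1300]
```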
For creating the embeddings, the LangChain library came in handy: it does a lot of the heavy lifting. See the code below:
embeddings = OpenAIEmbeddings(openai_api_key=API_KEY)
embedded_documents = []
for document in documents:
    UserMessage.info(f"Embedding page {document.metadata['page_number']}/{n_pages} for "
                     f"{document.metadata['source']}")
    embedded_documents.append(
        {
            "text": document.page_content,
            "embeddings": embeddings.embed_query(document.page_content),
            "page_number": document.metadata["page_number"],
            "source": document.metadata["source"],
        }
    )
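The post doesn't show how the `context` for the prompt is found. Presumably the question is embedded the same way (with `embed_query`) and matched against the stored vectors by cosine similarity. Here is a minimal sketch of that retrieval step, with made-up 3-dimensional vectors standing in for the real OpenAI embeddings:

```python
import math

# Hypothetical retrieval step (assumed, not shown in the post): pick the chunk
# whose stored embedding is most similar to the question's embedding.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

embedded_documents = [
    {"text": "Chunk about pumps", "embeddings": [1.0, 0.0, 0.0], "page_number": 1},
    {"text": "Chunk about valves", "embeddings": [0.0, 1.0, 0.0], "page_number": 2},
]
question_embedding = [0.9, 0.1, 0.0]  # would come from embeddings.embed_query(question)

best = max(embedded_documents, key=lambda d: cosine_similarity(question_embedding, d["embeddings"]))
print(best["text"])  # → Chunk about pumps
```

In the real app the best-scoring chunks (not just one) would be concatenated into the `context` string passed to the prompt.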
Finally, the prompt is constructed with the following function:
def get_question_with_context(current_question, context):
    """Combine the question with the retrieved context and return it as a conversation message."""
    prompt = f"""
    Question: {current_question["content"]} \n
    Answer the question based on the context below. When you don't know the answer, say "I don't know the
    answer, based on the provided context." \n
    Context: {context}\n
    """
    question_with_context = {"role": "user", "content": prompt}
    return question_with_context
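The function only builds the message; it still has to be sent to the chat model. A sketch of how that might look, where the system message, model name, and the commented-out OpenAI call are my assumptions rather than the app's actual code:

```python
# Usage sketch (assumed, not from the post): append the constructed message
# to the conversation and send it to the chat model.
question_with_context = {
    "role": "user",
    "content": (
        "Question: What is the design pressure?\n"
        "Answer the question based on the context below. When you don't know the answer, "
        'say "I don\'t know the answer, based on the provided context."\n'
        "Context: The design pressure of the vessel is 10 bar.\n"
    ),
}
messages = [
    {"role": "system", "content": "You answer questions about the uploaded documents."},
    question_with_context,
]
# response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(messages[-1]["role"])  # → user
```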
That’s it! With very little code you can create a very powerful app, and sharing the app is really easy using VIKTOR.
If you have any questions, please comment in this topic!