Here we use BERT (Bidirectional Encoder Representations from Transformers) to answer questions over PDF documents: we extract the text from each imported PDF and run a question-answering pipeline on it. BERT, released by Google, was the first deeply bidirectional pretrained language model.
Table of Contents :
- Installation and Import Libraries
- Extract Text from PDFs
- Preprocess the Text
- Chunk Large Text (if needed)
- Question-Answering Model
- Save The Text
- Function to Work with the GPU-Enabled Model
- Apply the Model
- Extra : Transformers on single paragraph
#! pip install PyPDF2 transformers
# os and re are part of the Python standard library, so they need no pip install
import os
import re
import PyPDF2
from transformers import pipeline
def extract_pdf_text_all_pages(path_of_pdf):
    with open(path_of_pdf, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        full_text = ""
        for page in reader.pages:  # iterate over all pages
            full_text += page.extract_text() + "\n"
    return full_text

def preprocess_text(text):
    # Replace runs of whitespace and newlines with a single space
    text = re.sub(r'\s+', ' ', text)
    # Remove special characters (keep word characters and spaces)
    text = re.sub(r'[^\w\s]', '', text)
    return text.strip()
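To see what the two regex substitutions actually do, here is a quick self-contained check (the sample string is made up for illustration). Note that the second substitution also strips punctuation such as commas and question marks, which is worth keeping in mind if your PDFs rely on it.

```python
import re

def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace/newlines to one space
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation and special characters
    return text.strip()

raw = "Heat,\n  by definition,\tis energy in transit!"
print(preprocess_text(raw))  # -> Heat by definition is energy in transit
```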
def chunk_text(text, max_length=1000):
    words = text.split()
    chunked_words = [' '.join(words[i:i + max_length]) for i in range(0, len(words), max_length)]
    return chunked_words

Figure reference:
The figure shows a bidirectional LSTM: it has one additional layer compared with a standard LSTM, which processes the sequence in the reverse direction. The output at each step therefore combines context from both directions, giving a more informative result.
from transformers import pipeline
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
qa_model = pipeline("question-answering", model=model_name, device=0) # device=0 for GPU, device=-1 for CPU
def get_answers_from_text(question, context):
    return qa_model(question=question, context=context)

def process_text_in_chunks(question, text_chunks):
    answers = []
    for chunk in text_chunks:
        result = get_answers_from_text(question, chunk)
        answers.append(result['answer'])
    return answers

Here we load the pre-trained question-answering model (BERT) with GPU support (if available), create a function that returns the answer for a given question and text context, and process the text chunk by chunk if necessary.
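Each call to the QA pipeline returns a dict with an `answer` and a confidence `score`, so instead of keeping one answer per chunk you can keep only the highest-scoring one. A minimal sketch, using a stub `fake_qa_model` (an assumption, standing in for the real pipeline so the snippet runs without downloading a model):

```python
def get_best_answer(question, text_chunks, qa_model):
    # qa_model mimics transformers' QA pipeline: returns {'answer': ..., 'score': ...}
    results = [qa_model(question=question, context=chunk) for chunk in text_chunks]
    return max(results, key=lambda r: r['score'])

# Hypothetical stub for illustration only: "answer" is the chunk's first word,
# "score" is just the chunk length
def fake_qa_model(question, context):
    return {'answer': context.split()[0], 'score': len(context)}

chunks = ["short chunk", "a much longer chunk of text"]
best = get_best_answer("What is heat?", chunks, fake_qa_model)
print(best['answer'])  # -> a
```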
# Save the answers to a text file
def save_answers_to_file(answers, output_file="answers.txt"):
    with open(output_file, "w") as file:
        for i, answer in enumerate(answers):
            file.write(f"Answer from chunk {i+1}: {answer}\n")
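A quick round-trip check of the save step (writing to a temporary directory so nothing in the working directory is touched; the sample answers are taken from the output below):

```python
import os
import tempfile

def save_answers_to_file(answers, output_file="answers.txt"):
    with open(output_file, "w") as file:
        for i, answer in enumerate(answers):
            file.write(f"Answer from chunk {i+1}: {answer}\n")

path = os.path.join(tempfile.gettempdir(), "answers_demo.txt")
save_answers_to_file(["potential energy", "constant temperature"], path)
with open(path) as f:
    print(f.read())
```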
def main(directory_path, question):
    # Gather text from every PDF in the directory
    all_pdfs_text = ""
    for filename in os.listdir(directory_path):
        if filename.lower().endswith(".pdf"):
            all_pdfs_text += extract_pdf_text_all_pages(os.path.join(directory_path, filename)) + "\n"
    preprocessed_text = preprocess_text(all_pdfs_text)
    text_chunks = chunk_text(preprocessed_text, max_length=1000)
    answers = process_text_in_chunks(question, text_chunks)
    save_answers_to_file(answers)
    for i, answer in enumerate(answers):
        print(f"Answer from chunk {i+1}: {answer}")

# Example usage
directory_path = './'
question = "What is Heat?"  # you can change this question according to your PDF
main(directory_path, question)

Answer from chunk 1: FqNpNVeHqNpNVdqNdpN
Answer from chunk 2: potential energy
Answer from chunk 3: soft spheres are connected by a spring
Answer from chunk 4: momentum is conserved
Answer from chunk 5: constant temperature

You can change the chunk size in the chunking step to suit your needs. Notice that for the question "What is Heat?" we get answers such as "potential energy" and "constant temperature", even though no sentence in the PDF states this exactly.
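One caveat of the fixed-size chunking used above is that an answer can be split across a chunk boundary. A common refinement (not part of the original code, sketched here as a suggestion) is to let consecutive chunks overlap by a fixed number of words:

```python
def chunk_text_overlap(text, max_length=1000, overlap=100):
    # Each chunk starts (max_length - overlap) words after the previous one,
    # so neighbouring chunks share `overlap` words
    words = text.split()
    step = max_length - overlap
    return [' '.join(words[i:i + max_length]) for i in range(0, len(words), step)]

text = ' '.join(str(i) for i in range(2000))
chunks = chunk_text_overlap(text, max_length=1000, overlap=100)
print(len(chunks), len(chunks[0].split()), len(chunks[-1].split()))  # -> 3 1000 200
```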
from transformers import pipeline
qa_pipeline = pipeline("question-answering")
paragraph = """For the geometrical optimization of the reactant and product states, and the TS, you should use the B3LYP functional along with the D3 version of Grimme’s dispersion correction with Becke- Johnson damping. You should use the def2-SVP basis set that is of double zeta quality. You should use the SMD solvent model to emulate the experimental conditions. You should employ tight convergence criteria for geometrical optimization and TS search. You can use either the NEB-TS method in ORCA or the QST2+IRC method in Gaussian 16."""
# Example questions
questions = ["What is ORCA?",
"What details can you extract from the text?",
"How does the paragraph relate to the topic?",]
for question in questions:
result = qa_pipeline(question=question, context=paragraph)
print(f"Question: {question}\nAnswer: {result['answer']}\n")