Under today’s data collection standards, research organizations frequently rely on manual methods, often hiring multiple data collectors to obtain reliable data at scale. This approach is both financially burdensome and time-consuming, significantly slowing down the research process. There have been attempts to automate this process with large language models, but the outputs these models produce on their own are not reliable. To solve this problem, we built a highly accurate automated data collector by grounding the outputs of large language models in factual information scraped from the internet. We use Retrieval-Augmented Generation (RAG) to ground responses in web-scraped data and to compute a similarity score between incoming data points and the organizations we collect data from. This tool enables:
- Precise similarity scoring between input queries and source organizations
- Verification of data points against real-world sources
- Reliable automation that maintains accuracy while reducing costs
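The similarity scoring above can be sketched as cosine similarity between embedding vectors. This is an illustrative, library-free sketch: the function names are ours, and Pragma's production scoring runs through Pinecone over OpenAI embeddings rather than plain Python lists.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query_vec: list[float], org_vecs: dict[str, list[float]]) -> str:
    """Return the organization whose embedding is most similar to the query."""
    return max(org_vecs, key=lambda name: cosine_similarity(query_vec, org_vecs[name]))
```

In practice the vectors come from the embeddings API and the nearest-neighbor search runs inside the vector database; the math, however, is the same.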
Our solution emerged from recognizing the urgent need for a pragmatic tool that could bridge the gap between manual reliability and automated efficiency, particularly in data-intensive fields like finance and education.
Pragma automates the data collection process by gathering accurate and relevant information from verified online sources. Users can log into their accounts, enter organizations or topics they wish to research, and specify the questions they want answered. Pragma then compiles these inputs into structured search queries, retrieves reliable content from the web, processes it, and presents the information in a user-friendly interface. The platform also stores data for users, ensuring continuity and ease of access in future sessions.
Pragma’s frontend is built with React and Tailwind, while the backend uses a Node.js/Express server. Our FastAPI applications handle the entire data retrieval and structuring pipeline:
- Google’s Search API identifies top resources, which are then scraped with Python’s requests and BeautifulSoup.
- This unstructured content is filtered and formatted into JSON objects capturing each resource's title, date, and content, with metadata extracted via regular expressions and datetime objects for added accuracy.
- OpenAI’s Vector Embeddings API vectorizes the content, which we store in Pinecone for efficient similarity search and real-time query responsiveness.
- OpenAI’s Completions API and Databricks then organize and filter results before presenting them to the user.
- MongoDB Atlas stores all data linked to the user account for future access capabilities.
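The filtering and structuring step above can be sketched as follows. The sample text and field names are illustrative; the real pipeline feeds in pages fetched with requests and parsed with BeautifulSoup, and stores richer metadata.

```python
import json
import re
from datetime import datetime

# Illustrative raw text as it might come back from a scraped page;
# the production pipeline pulls this via requests + BeautifulSoup.
RAW = """Acme University Annual Report
Published: 2024-03-15
Tuition for the 2024 academic year was increased by 4 percent..."""

DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def structure_page(raw: str) -> dict:
    """Turn unstructured page text into the JSON object shape described above:
    title, date (normalized through datetime), content, and simple metadata."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    title = lines[0] if lines else ""
    match = DATE_RE.search(raw)
    date = None
    if match:
        # Round-trip through datetime to reject malformed dates.
        date = datetime.strptime(match.group(1), "%Y-%m-%d").date().isoformat()
    content = "\n".join(lines[2:]) if len(lines) > 2 else ""
    return {"title": title, "date": date, "content": content,
            "metadata": {"char_count": len(content)}}

record = structure_page(RAW)
print(json.dumps(record, indent=2))
```

Each resulting JSON object is then what gets embedded and upserted into the vector database.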
Beyond the code itself, we used tools like Figma, Terraform, and several Databricks services to support our UX and DevOps work.
- Databricks: We made extensive use of Databricks to support the RAG process with multiple LLM functions. Databricks services and resources helped us build our AI application through:
- Cleaning data and creating a vector database for retrieval
- PySpark to clean and format the structured data
- Databricks Jumpstart Package
- Setting up the AI Gateway
- Using ChatDataBricks for completion after retrieval for generating a response for the user
- MongoDB Atlas: We used MongoDB in several ways. First, we set up a cluster and two collections in our remote Atlas directory. We used MongoDB together with JWT for user authentication. Additionally, we relied on MongoDB's efficient querying and storage to save the large amounts of data fetched on the frontend. Because we can store that data in MongoDB, users have the option to view and return to their old data collections.
- Terraform: Because our application has multiple components (a frontend React app, a backend Express app, and two FastAPI apps), we had to run many tests for our deployments, which Terraform made possible.
- .Tech Domains: Since pragma.tech was taken, we decided to take getpragma.tech which fits our slogan Get Pragmatic, Get Data.
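The retrieval-then-completion flow described under Databricks can be sketched as a prompt-assembly step. The function name and prompt wording here are our illustrative assumptions; the assembled string would be sent to ChatDatabricks through the AI Gateway, which we omit since it requires live credentials.

```python
def build_rag_prompt(question: str, retrieved: list[dict]) -> str:
    """Assemble retrieved chunks into a grounded prompt so the completion
    model answers only from scraped sources, reducing hallucination."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk['title']} ({chunk['date']}):\n{chunk['content']}"
        for i, chunk in enumerate(retrieved)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources by their [number]. If the answer is not in the "
        "sources, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Grounding the model this way is what keeps completions tied to the scraped content rather than to the model's pre-training.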
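The JWT-based authentication mentioned under MongoDB Atlas can be sketched with a minimal HS256 signer and verifier using only the standard library. This is purely illustrative (the secret and payload are placeholders); production code would use a maintained library such as PyJWT.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # placeholder; real deployments load this from config

def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(payload: dict) -> str:
    """Create an HS256 JWT: base64url(header).base64url(payload).signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_token(token: str) -> dict | None:
    """Return the payload if the signature checks out, else None."""
    try:
        header, body, sig = token.split(".")
    except ValueError:
        return None
    signing_input = f"{header}.{body}".encode()
    expected = _b64url(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None
    padded = body + "=" * (-len(body) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

On login the backend signs a token for the authenticated user, and subsequent requests for stored collections are verified against it before querying MongoDB.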
One of the key challenges was minimizing the hallucinations that can arise with LLMs, especially given the large amounts of data they are pre-trained on. Deciding how to structure the web-scraped content also required extensive testing and validation. Implementing a reliable vector database and querying it efficiently enough to support real-time data requests was technically challenging, though Pinecone helped us considerably there. Similarly, ensuring performance across various API integrations while building and deploying our own API app was a challenge we had to spend time learning to tackle.
We’re proud to have created an automated data collection platform built on a fully working RAG pipeline, from contextual web scraping and embeddings to retrieval and completions. We’re equally proud of our successful integration of multiple APIs, including but not limited to the Google Search API and OpenAI's embeddings and completions APIs, coupled with a robust data processing pipeline.
We gained valuable insights into how language can be represented as a set of relations between vectors. We learned techniques such as embedding and querying, web scraping, and managing data storage for quick access. Additionally, we deepened our understanding of APIs like Google Search, Pinecone, and OpenAI’s suite, as well as the backend workflow required to manage large volumes of structured data.
Our next steps include refining our data validation processes by adding an optional human-in-the-loop review system for further accuracy. The model is currently prompt-engineered to provide accurate and reliable information for education and finance; we aim to expand into fields like healthcare, law, and other data-intensive industries. In addition, we want to support downloads in multiple file formats beyond CSV, such as JSON, plain text, and Excel.
We acknowledge the use of ChatGPT (GPT-3.5/4) in the following aspects of our project development:
- Generating documentation templates
- Assisting in code design and structure
- Troubleshooting coding errors
- Suggesting fixes for integration issues
- Identifying logical flaws in algorithms