Lab 4: Creating a RAG application to Search a PDF

Learning Objectives

Configure a vector store to use with the application
Read, chunk and ingest a pdf file
Implement logic to perform a similarity search on the vector store
Create a plugin to perform RAG using the memory store

Visual Studio Code

In this lab you will use a Semantic Kernel Vector Store Connector to provide semantic searching capability.

Open the labs\lab4\src\start\SK-Workshop-Lab4 folder in VS Code
In the file explorer, expand the SK-Workshop-Lab4 subfolder and right click on it and select Open in Integrated Terminal

In this lab we are going to use the Microsoft.SemanticKernel.Connectors.SqlServer connector to store vector embeddings in an Azure SQL Database. In order to do this, we need to install the package first.

Add the package reference to the project by running the following command in the terminal:

dotnet add package Microsoft.SemanticKernel.Connectors.SqlServer --version 1.31.0-alpha

In the Programs.cs file, replace line 19 with the following code:

var semanticTextMemory = new MemoryBuilder()
    .WithSqlServerMemoryStore(builder.Configuration.GetConnectionString("SqlAzureDB")!)
    .WithTextEmbeddingGeneration(builder.Configuration.GetConnectionString("OpenAI")!)
    .Build();

The MemoryBuilder will build a ISemanticTextMemory which is basically a wrapper the provides the ability to interact with an underlying vector store and a text embedding service. The WithSqlServerMemoryStore extension method is provided by the package you just installed. The WithTextEmbeddingGeneration extension method is in the Configuration/ConfigurationExtensions.cs file in the project. It follows the same pattern we've been using with the AddChatCompletionService extension method in order to easily switch between using OpenAI or Azure OpenAI.

Next we need to populate the vector store.

Populate the Vector Store with embeddings from a PDF file

In order to keep things clean in this example, we've put the majority of the memory store logic in the MemoryStore class. The MemoryExtensions.cs file includes a couple of extension methods used to simplify the memory store code.

Open the Memory\MemoryStore.cs file. Replace line 13 with a TABLE_NAME value that will be unique for the day - such as your initials and memory. For example:

private const string TABLE_NAME = "jh-memory";

This is so you will be able to see the data ingested, otherwise it will be using a table that is already filled.

Locate the PopulateAsync method on line 24 and paste the following code in that method:

var chunkCount = 0;
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

foreach (var file in Directory.GetFiles(assetsDir, "*.pdf"))
{
    Console.WriteLine($"Generating chunks for {file}...");
                
    var docId = Path.GetFileNameWithoutExtension(file);

    // See if file already ingested
    if (await semanticTextMemory.TableExistsAsync(TABLE_NAME) 
        && await semanticTextMemory.RecordExistsAsync(TABLE_NAME, GenerateKey(docId, 1, 0)))
    {
        Console.WriteLine($"Document {docId} already exists.");
        continue;
    }
    
    using var pdf = PdfDocument.Open(file);
    foreach (var page in pdf.GetPages())
    {
        var pageText = GetPageText(page);
        var paragraphs = TextChunker.SplitPlainTextParagraphs([pageText], 500, 100, null, text => tokenizer.CountTokens(text));
        
        for (int i = 0; i < paragraphs.Count; i++)
        {
            var paragraph = paragraphs[i];

            var textToEmbed = paragraph;
            var title = $"Document {docId}, Page {page.Number}, Chunk {i}";

            await semanticTextMemory.SaveInformationAsync(TABLE_NAME, textToEmbed, GenerateKey(docId, page.Number, i), title);
            chunkCount++;

            Console.WriteLine($"Saved {title}...");
        }
    }
}

if (chunkCount > 0)
{
    Console.WriteLine($"Generated {chunkCount}.");
}

There is a lot going on in this one method, let's look at the important stuff.

The line 27: var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o"); uses TiktokenTokenizer to create a "gpt-4o" compatible tokenizer so we can split on tokens and not arbitrary characters.

The loop starting on line 29: foreach (var file in Directory.GetFiles(assetsDir, "*.pdf")) starts looping though any pdf files found in the assets directory.

Lines 36 and 37 use the extension method in the MemoryExtensions.cs file to determine if the file has already been saved in the vector store to prevent duplicates when the sample is run multiple times.

Line 43: using var pdf = PdfDocument.Open(file); uses the UglyToad.PdfPig library to open the PDF file.

Line 44: foreach (var page in pdf.GetPages()) starts the loop parsing each page in the PDF file into a page object.

Line 45: var pageText = GetPageText(page); cleans the text up a bit

Line 46: var paragraphs = TextChunker.SplitPlainTextParagraphs([pageText], 500, 100, null, text => tokenizer.CountTokens(text)); takes the pageText and splits it into a max of 500 tokens with a 100 token overlap.

Line 45 starts the loop through all the paragraphs returned from the TextChunker.

Line 56: await semanticTextMemory.SaveInformationAsync(TABLE_NAME, textToEmbed, GenerateKey(docId, page.Number, i), title); uses the underlying memory storage connector to get the embeddings from Azure OpenAI for the textToEmbed and saves to the Azure SQL Database table.

Let's now see it in action.

In the Program.cs file, replace line 45 with the following lines:

var assetsDir = PathUtils.FindAncestorDirectory("assets");
await memoryStore.PopulateAsync(assetsDir);

This will find the assets folder up a few levels in the source tree and start the memory store population.

Now run your application and look over the console output. This will take a minute or two.

dotnet run

The output should look something like this:

Generating chunks for D:\...\rag-vector-agent-semantic-kernel\labs\lab4\assets\MSFT_FY25Q1_10Q.pdf...
Saved Document MSFT_FY25Q1_10Q, Page 1, Chunk 0...
Saved Document MSFT_FY25Q1_10Q, Page 1, Chunk 1...
Saved Document MSFT_FY25Q1_10Q, Page 2, Chunk 0...
....
Saved Document MSFT_FY25Q1_10Q, Page 65, Chunk 0...
Saved Document MSFT_FY25Q1_10Q, Page 66, Chunk 0...
Generated 152.

Question:

Hit enter a couple of times in the terminal in order to have the chatbot loop exit.

The PDF file is now in the database.

Look at the Azure SQL Database table

Open the appsettings.Local.json file and locate the SqlAzureDB connection string on line 6.

It will have this syntax:

"SqlAzureDB": "Server=[db server].database.windows.net;Database=[database];User Id=[user];Password=[password];Persist Security Info=True;MultipleActiveResultSets=True;TrustServerCertificate=True;",

[db server] - You'll need the whole server name to login
[database] - This will be the database to connect to
[user] - This is the sql server user you'll used to login
[password] - the password for the user

Using SQL Server Management Studio (SSMS)

Open SSMS and connect to the Azure SQL database

Enter the full server name from the connection string (ie. dbsvr-sk-workshop1.database.windows.net;)
Enter the user in the Login and password in the Password inputs, click Connect.

NOTE: You may also need to select Optional for Encryption and check the Trust server certificate.

Expand the Object Browser tree to find the table you provided the name for earlier (ie. jh-memory)
Expand the columns and you'll see the embedding column is of the new VECTOR type

Right click on the table name and select the Select Top 1000 Rows

This will show you the records in the memory table to give you and idea of how the data is stored.

Using Azure Data Studio

Open Azure Data Studio and click the New Connection button

In the Connection Details panel, select the Connection String and paste past the SqlAzureDB connection string into the Connection string text box.
You can optionally give the connection an name.

Click the Connect button
In the Connections panel, expand the tree until you find the table you provided the name for earlier (ie. jh-memory)
Expand the columns and you'll see the embedding column is of the new VECTOR type

Right click on the table name and select the Select Top 1000

This will show you the records in the memory table to give you and idea of how the data is stored.

Add logic to MemoryStore to preform a semantic search

In the Memory\MemoryStore.cs file. Replace line 19 with the following code:

var searchItems = semanticTextMemory.SearchAsync(TABLE_NAME, query, 3);
await foreach (var item in searchItems)
{
    results.Add(item.Metadata.Text);
}

Line 19 var searchItems = semanticTextMemory.SearchAsync(TABLE_NAME, query, 3); performs the semantic search using the underlying connector. The last parameter is the number of results to return, in this case we are only returning 3.

The loop on line 20 iterates all the results and builds a list of the item text to return to the calling logic.

Create a plugin that performs RAG with the vector store

In the Plugins folder, create a file named PdfRetrieverPlugin.cs
Paste the following code into the file and save it:

using Memory;
using Microsoft.SemanticKernel;
using System.ComponentModel;
using System.ComponentModel.DataAnnotations;
using System.Text.Json;

namespace Plugins;

public class PdfRetrieverPlugin(MemoryStore memoryStore)
{
    [KernelFunction, Description("Searches for company information.")]
    public async Task<string> RetrieveAsync([Description("User's message"), Required] string question, Kernel kernel)
    {
        var rewriter = kernel.Plugins["QueryRewritePlugin"];

        var searchQuery = await kernel.InvokeAsync(rewriter["Rewrite"],
            new()
            {
                { "question", question }
            });

        var searchResults = await memoryStore.SearchAsync(searchQuery.ToString());

        var rag = kernel.Plugins["Prompts"];

        var llmResult = await kernel.InvokeAsync(rag["BasicRAG"],
            new() {
                { "question", question },
                { "context", JsonSerializer.Serialize(searchResults) }
            }
        );

        return llmResult.ToString();
    }
}

If you compare this to the WebRetrieverPlugin we created in the last lab (below), you'll notice there are only a three of changes (other than their descriptions):

MemoryStore is being passed in the constructor
The memoryStore SearchAsync is used instead of the WebSearchEngingPlugin's SearchAsync
Manually serialize the searchResults to a JSON array string

In Program.cs, replace line 37 with the following line to import the new plugin:

kernel.ImportPluginFromType<PdfRetrieverPlugin>();

Run your application and start to interact with Microsoft's latest S1 filing PDF file.

dotnet run

Some example questions to ask:

What was Microsoft's income in the last quarter?
What was Microsoft's Cloud division's income?
What were some industry trends mentioned?
Was OpenAI mentioned in the S1?
What did it say about OpenAI?

The output should look something like this:

Document MSFT_FY25Q1_10Q already exists.

Question: What was Microsoft's income in the last quarter?
Microsoft's net income for the quarter ended September 30, 2024, was $24,667 million.

Question: What was Microsoft's Cloud division's income?
The operating income for Microsoft's Intelligent Cloud division in the last quarter was $10.5 billion.

Question: What were some industry trends mentioned?
The industry trends mentioned in Microsoft's last quarter report include:

- The industry is dynamic and highly competitive, with frequent changes in both technologies and business models.
- Each industry shift is an opportunity to conceive new products, new technologies, or new ideas that can further transform the industry and Microsoft's business.
- Microsoft engages in a broad range of research and development activities to identify and address the changing demands of customers and users, industry trends, and competitive forces.

Question: Was OpenAI mentioned in the S1?
Yes, OpenAI was mentioned in Microsoft's S1.

Question: What did it say about OpenAI?
Microsoft's S1 mentioned that AI may be developed by Microsoft or others, including their strategic partner, OpenAI.

Question:

High-Level Summary

Focus

Creating a RAG application to search a PDF using a vector store.

Objectives

Configure a vector store to use with the application
Read, chunk and ingest a PDF file
Implement logic to perform a similarity search on the vector store
Create a plugin to perform RAG using the memory store

Additional Exercises

Experiment with different chunk sizes and overlaps to see their influence on the output

Further Ideas

Explore different vector store connectors and their performance

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lab 4: Creating a RAG application to Search a PDF

Learning Objectives

Visual Studio Code

Populate the Vector Store with embeddings from a PDF file

Look at the Azure SQL Database table

Using SQL Server Management Studio (SSMS)

Using Azure Data Studio

Add logic to MemoryStore to preform a semantic search

Create a plugin that performs RAG with the vector store

High-Level Summary

Focus

Objectives

Additional Exercises

Further Ideas

FilesExpand file tree

readme.md

Latest commit

History

readme.md

File metadata and controls

Lab 4: Creating a RAG application to Search a PDF

Learning Objectives

Visual Studio Code

Populate the Vector Store with embeddings from a PDF file

Look at the Azure SQL Database table

Using SQL Server Management Studio (SSMS)

Using Azure Data Studio

Add logic to MemoryStore to preform a semantic search

Create a plugin that performs RAG with the vector store

High-Level Summary

Focus

Objectives

Additional Exercises

Further Ideas