A serverless PDF processing pipeline that uses Azure Functions, Durable Functions, and Azure AI services to process PDFs in parallel, extract embeddings, and generate metadata.
- Azure Storage: PDF uploads and results storage
- Azure Functions: Event-driven PDF processing using Durable Functions
- Azure AI Search: Vector storage for embeddings
- Azure OpenAI: Text embeddings generation
- Event Grid: Triggers function execution on new PDF uploads
- Application Insights: Monitoring and logging
Before you start, you should have the following tools installed:
- Azure Account (create free account with $200 credit)
- Terraform >= 1.5
- Azure CLI (az)
- Python 3.11+
- Git
- GitHub Account (for Actions CI/CD - optional)
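As an alternative to checking each tool by hand, the command-line prerequisites above can be checked from a short script. This is a minimal sketch: the executable names in `REQUIRED_TOOLS` are assumptions (for example, some systems expose Python as `python3` rather than `python`), and the script only confirms that each tool is on PATH, not that its version is recent enough.

```python
import shutil

# Assumed executable names for the CLI prerequisites; adjust for your system.
REQUIRED_TOOLS = ["terraform", "az", "python", "git"]


def find_missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.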
You can verify that the tools are installed correctly by checking their versions:
terraform --version
az --version
python --version
git --version
git clone <repository-url>
cd <repository-name>
Authenticate with Azure using the CLI:
az login
az account set --subscription "<your-subscription-id>"
Verify authentication:
az account show
Navigate to the Terraform directory:
cd terraform
Create your local variables file:
cp terraform.tfvars.example terraform.tfvars
Edit terraform.tfvars with your desired values:
environment = "dev" # or "staging", "prod"
project_name = "pdf-processor"
location = "UK South" # Change as needed
enable_eventgrid_subscription = false # IMPORTANT: Keep false initially
terraform init
This downloads the required Azure provider and initialises the working directory.
terraform plan -out=tfplan
Review the planned resources. You should see:
- Resource Group
- Storage Account & Containers
- OpenAI Service & Embedding Model
- Document Intelligence Service
- AI Search Service
- Application Insights & Log Analytics
- Function App Service Plan
- Linux Function App
- NO Event Grid subscription (because enable_eventgrid_subscription = false)
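The review of the plan can also be scripted. The sketch below reads the machine-readable form of the saved plan (`terraform show -json tfplan`) and counts the resources that would be created, grouped by type. It assumes the Event Grid subscription's resource type contains both "eventgrid" and "subscription"; the exact type name depends on how the Terraform configuration in this repository declares it.

```python
import json
import subprocess


def summarize_plan(plan):
    """Map each resource type to the number of resources the plan would create."""
    counts = {}
    for change in plan.get("resource_changes", []):
        if "create" in change["change"]["actions"]:
            counts[change["type"]] = counts.get(change["type"], 0) + 1
    return counts


def creates_eventgrid_subscription(plan):
    """Return True if the plan would create an Event Grid subscription."""
    # Assumption: the resource type contains both words, whatever its exact name.
    return any(
        "eventgrid" in rtype and "subscription" in rtype
        for rtype in summarize_plan(plan)
    )


def load_plan(plan_file="tfplan", terraform_dir="."):
    """Run `terraform show -json` on a saved plan file and parse the result."""
    result = subprocess.run(
        ["terraform", "show", "-json", plan_file],
        cwd=terraform_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Run from the terraform directory, `creates_eventgrid_subscription(load_plan())` would be expected to return `False` for this first deployment, while `summarize_plan(load_plan())` lists the resource types to compare against the list above.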
terraform apply tfplan
After successful deployment, note the outputs:
terraform output
Save these values; you will need them for the function app configuration.
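If you would rather capture the outputs programmatically than copy them by hand, `terraform output -json` prints every output as JSON. The sketch below flattens that structure into a plain name-to-value dictionary. The output name in the usage note is taken from the secrets section of this README; any other output names depend on the Terraform configuration. Note that the JSON form includes sensitive values in plain text, so avoid committing or logging it.

```python
import json
import subprocess


def flatten_outputs(raw_json):
    """Turn `terraform output -json` text into a plain {name: value} dict."""
    return {name: entry["value"] for name, entry in json.loads(raw_json).items()}


def read_outputs(terraform_dir="."):
    """Run `terraform output -json` in `terraform_dir` and flatten the result."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=terraform_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return flatten_outputs(result.stdout)
```

For example, `read_outputs()["function_app_name"]`, run from the terraform directory, would return the deployed Function App name, assuming that output is defined.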
Before enabling Event Grid, you have to deploy the function app code. This prevents the Event Grid trigger from firing before the function is ready.
cd ../function_app
pip install -r requirements.txt
cd ..
Update terraform/terraform.tfvars:
enable_eventgrid_subscription = true
Plan and apply:
cd terraform
terraform plan -out=tfplan
terraform apply tfplan
This creates the Event Grid subscription that will trigger your function on PDF uploads.
Edit terraform/terraform.tfvars:
# Change container names if you would like
pdf_uploads_container_name = "my-pdf-uploads"
pdf_results_container_name = "my-pdf-results"
# Change AI Search tier if needed
ai_search_sku = "basic"
# Adjust logging retention
log_retention_days = 90
# Change Python version
python_version = "3.11"
Then reapply:
terraform apply
- Set up local environment:
cd function_app
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Configure local settings:
Update function_app/local.settings.json with values from terraform output.
- Run locally:
func start
- Create GitHub Secrets:
In your GitHub repo settings, go to Secrets and variables -> Actions -> New repository secret and add these secrets:
AZURE_FUNCTIONAPP_NAME = terraform output function_app_name
AZURE_FUNCTIONAPP_PUBLISH_PROFILE = #(see below)
AZURE_STORAGE_CONNECTION_STRING = terraform output storage_connection_string
To get the publish profile:
Use either of these two methods:
az functionapp deployment list-publishing-profiles \
--name <function-app-name> \
--resource-group <resource-group-name> \
--xml
OR
Go to Azure Portal -> Your Function App -> Get publish profile (this downloads a file), then open the file and continue with the next step.
Copy the entire XML output and paste it as the AZURE_FUNCTIONAPP_PUBLISH_PROFILE GitHub secret.
- Push to main branch:
git push
GitHub Actions automatically:
- Builds the function app package
- Uploads it as an artifact
- Deploys to Azure
- Uploads test PDFs to the container
The .github/workflows/deploy.yml workflow handles:
- Build Phase:
  - Checkout code
  - Set up Python 3.11
  - Create deployment zip
  - Upload as artifact
- Deploy Phase:
  - Download artifact
  - Deploy to Azure Functions
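To illustrate what the "Create deployment zip" step might involve, here is a minimal Python sketch that packages a function app directory. It is not the workflow's actual implementation, which lives in .github/workflows/deploy.yml. The exclusion lists are assumptions: virtual environments, caches, and local.settings.json are left out because they are local-only files.

```python
import zipfile
from pathlib import Path

# Assumed exclusions: local-only directories and files that should not ship.
EXCLUDED_DIRS = {"venv", ".venv", "__pycache__", ".git"}
EXCLUDED_FILES = {"local.settings.json"}


def build_package(source_dir, zip_path):
    """Zip `source_dir` into `zip_path`; return the archived relative paths."""
    source = Path(source_dir)
    target = Path(zip_path).resolve()
    archived = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            relative = path.relative_to(source)
            if not path.is_file() or path.resolve() == target:
                continue  # skip directories and the archive itself
            if EXCLUDED_DIRS.intersection(relative.parts):
                continue  # skip anything inside an excluded directory
            if relative.name in EXCLUDED_FILES:
                continue
            archive.write(path, relative.as_posix())
            archived.append(relative.as_posix())
    return archived
```

Calling `build_package("function_app", "package.zip")` from the repository root would produce an archive with paths relative to function_app, so files such as host.json sit at the top level of the zip.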
To enable/disable the PDF upload step, edit .github/workflows/deploy.yml.
In Azure Portal:
- Go to Function App → Monitoring → Log stream
az storage blob upload \
--account-name <storage-account-name> \
--container-name pdf-uploads \
--name test.pdf \
--file /path/to/test.pdf \
--account-key <storage-key>
curl "https://<function-app-url>/api/processingStatus"
az storage blob list \
--account-name <storage-account-name> \
--container-name pdf-results \
--account-key <storage-key>
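The status check above can also be made from Python using only the standard library. This is a sketch under assumptions: it targets the same /api/processingStatus path as the curl command, sends no function key or other authentication, and returns the raw response body because the response format is not described here.

```python
import urllib.request


def status_url(function_app_url):
    """Build the status endpoint URL from a Function App host name or base URL."""
    base = function_app_url.strip().rstrip("/")
    if not base.startswith(("http://", "https://")):
        base = "https://" + base
    return base + "/api/processingStatus"


def fetch_status(function_app_url, timeout=30):
    """Return (HTTP status code, response body text) from the status endpoint."""
    url = status_url(function_app_url)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `fetch_status("<function-app-url>")` after uploading a test PDF returns the status code and body for inspection. `urlopen` raises `urllib.error.HTTPError` on 4xx and 5xx responses, which may indicate that the endpoint requires a key.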