015 - Extract #417

eveleighoj · 2025-10-27T18:18:16Z

eveleighoj
Oct 27, 2025
Maintainer

Introduction

The extract tool was developed by I.AI, who are looking to MHCLG to take over its hosting and management. This ODP explains how we will host that infrastructure.

Status

Draft

Summary And Content

Extract is a tool for creating data from documents. This ODP may be too detailed, but it aims to lay out all of the components and work required to get Extract working in MHCLG infrastructure. There are four main technical things that need to happen:

Produce the Extract Python package
Create the Extract Service using the extract package
Identify authentication & authorisation solutions to control extract access
Build functionality in applications to use the extract service

I have attempted to explain the above and put key acceptance criteria against them in the key deliverables section below.

Contents

Key deliverables
Expected System

Key Deliverables

We want to migrate the current system developed by I.AI into MHCLG. There are two main components which we expect I.AI to deliver:

Extract

An open source Python library which provides tools to extract data from documents. This should be the primary entrypoint for others to gain value from the work completed by I.AI

Acceptance Criteria:

It must be open source: specifically, available on GitHub in a public repository and licensed correctly. It should be installable directly from GitHub and preferably published to PyPI with appropriate versioning.
It should have appropriate testing, including evidence to show the accuracy of data extraction.
It should be properly documented to allow anyone to use it.

Extract Service

A service that provides an API which other MHCLG systems can use to let users extract data from documents.

Acceptance Criteria:

It should have an API that is usable for submitting extractions and retrieving results.
Do results need to be editable?
The service should be deployed via Infrastructure as Code, specifically Terraform. This will allow simpler deployment into MHCLG cloud platforms. The planning data team manage other infrastructure in Terraform, so using the same IaC language will make handover easier.
The service should be deployed into AWS. The planning data team currently manage infrastructure in AWS, so aligning with this again makes handover easier.
The service should be able to operate at scale, so autoscaling and other techniques need to be applied.
Key scaling limitations or bottlenecks should be detailed, e.g. model usage limits.

In addition to the above, MHCLG and I.AI are working together to produce two additional pieces of work to support use of the extract service.

Authentication & Authorisation

Currently MHCLG does not have an authentication or authorisation solution ready for a product like extract. I.AI have an internal solution which is currently being used within the alpha, but it has limitations. This area is fairly unknown and likely needs investigation of the requirements needed now and in the future.

Acceptance Criteria:

It must be able to authenticate users in both departments and local authorities.
It must have a layer of authorisation that, at minimum, allows us to control which organisations have access.
Does it need to control individual user access?
Do users need to be able to control access?

Application Code

A key part of using the extract service is producing both front-end and back-end code which can interact with the extract API. The functionality should be available through a single API. This code should be usable in any Express.js application, including test apps and the providers service in planning data.

Acceptance Criteria:

Code must be usable in the providers service, specifically the provide application in the submit repo.
Code must use the extract API to perform the tasks.
There should be a more detailed list of acceptance criteria focussed on users that doesn't need repeating here.

Expected System

System Context

Above is the expected System context once extract has migrated to MHCLG. There are some key points to cover:

The extract service consists of the Extract API system. It's worth noting that this does not offer front-end components, instead focussing on being an API that one or more systems can connect to. This may help with portability down the line if additional systems require access to this functionality.
The Provider Service from the Planning Data Service is how Data Providers can use the extract tool for now. Screens and usage will be focussed on their needs. The front-end functionality will be built into that service, NOT extract.
Extract will require access to Google Gemini and/or other systems; this is represented on the diagram by the AI Model system. This is expected to remain external, and for now there are no plans to host our own version of this model in MHCLG.
An authentication solution is required. We have separated this from authorisation on the diagram as we expect to use an external GDS-run service such as One Login.
Authorisation is again a separate service. This is because authentication will happen with a general service, but we then need to control who has access to specific tools. There are a lot of solutions for this, ranging from simple email checking to a more complex app where users can be managed.
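To illustrate the "simple email checking" end of that range, here is a minimal sketch of organisation-level authorisation applied after a separate service has authenticated the user. The organisation names, email domains, and function names are all hypothetical and are not taken from any existing I.AI or MHCLG code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allow-list: organisation name -> email domains that belong to it.
ALLOWED_ORGANISATIONS = {
    "example-department": {"department.example.gov.uk"},
    "example-council": {"council.example.gov.uk"},
}


@dataclass(frozen=True)
class AuthenticatedUser:
    """A user whose identity has already been confirmed by the authentication service."""
    email: str


def organisation_for(user: AuthenticatedUser, allowed=ALLOWED_ORGANISATIONS) -> Optional[str]:
    """Return the allowed organisation matching the user's email domain, if any."""
    local_part, separator, domain = user.email.rpartition("@")
    if not separator or not local_part:
        return None  # not a usable email address
    domain = domain.lower()
    for organisation, domains in allowed.items():
        if domain in domains:
            return organisation
    return None


def is_authorised(user: AuthenticatedUser, allowed=ALLOWED_ORGANISATIONS) -> bool:
    """Organisation-level check: access is granted only to allow-listed organisations."""
    return organisation_for(user, allowed) is not None
```

A check like this only controls which organisations have access. Per-user control, or letting providers manage their own users, would need the more complex end of the range.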

Key big open questions:

Which authentication service will be used? So far GOV.UK One Login or GOV.UK internal access have been identified as possible solutions.
What will we do about an authorisation system? There are multiple efforts across MHCLG to solve this, and we need to talk to the teams involved; this could be hosted inside the planning data platform. It is important that providers can also control access, as we cannot manage users forever, but in beta we may want it locked down.
How does the extract service use LLMs? Conversations are going through security & cloud platforms.
What services are we going to use for user analytics? Do we have standards for this?
Which AWS accounts will be responsible for hosting the architecture?

Extract API System

Now let's explore the containers required in the extract service:

And an infrastructure diagram which includes supporting infrastructure:

Questions:

How are we loading files from the front-end app to the bucket for back-end access? I assume it's presigned URLs; are these generated by the API or the web app?
How does database access by the API/worker work?
Fill in the services we're hosting that support the worker.
Authentication needs adding to the diagrams.
How is the worker contacted, and how does it use AI models?

Extract Containers

Here let's break the Extract architecture down by container, adding details for each one. Remember, a container isn't strictly a Docker container but a container as defined in the C4 model!

Extract API

I.AI repo: extract-app
digital-land repo:

This container is a simple FastAPI application. It accepts requests and queues them into the task queue. Once requests are complete, it is responsible for retrieving the details for the front-end application.
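The submit-then-retrieve flow described here could be sketched roughly as follows. This is an illustration only, not the I.AI implementation: it uses just the Python standard library, with `queue.Queue` standing in for the task queue and a dictionary standing in for the database. The class, method, and status names are hypothetical.

```python
import queue
import uuid
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExtractApi:
    """In-memory stand-in for the API container (hypothetical names throughout)."""
    task_queue: queue.Queue = field(default_factory=queue.Queue)
    results: dict = field(default_factory=dict)

    def submit(self, document_key: str) -> str:
        """Accept a request, record it as queued, and place a task on the queue."""
        task_id = str(uuid.uuid4())
        self.results[task_id] = {"status": "queued", "data": None}
        self.task_queue.put({"task_id": task_id, "document_key": document_key})
        return task_id

    def get_result(self, task_id: str) -> dict:
        """Return the current status and any extracted data for a task."""
        return self.results.get(task_id, {"status": "not_found", "data": None})


def run_one_task(api: ExtractApi, extract: Callable[[str], dict]) -> None:
    """Stand-in for a worker: take one task off the queue and store its result."""
    task = api.task_queue.get_nowait()
    api.results[task["task_id"]] = {
        "status": "complete",
        "data": extract(task["document_key"]),
    }
```

A calling application would submit a document, keep the returned task ID, and poll `get_result` until the status changes from `queued` to `complete`.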

Alerting & Monitoring

CI/CD

Performance & Scalability

Security

Extract Task Queue

I.AI repos: extract-app
digital-land repos:

The SQS queue onto which the API drops tasks and from which the workers pull them.
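Because SQS message bodies are text, the API and workers need an agreed serialisation for tasks. A minimal sketch of one option, a JSON envelope, is shown here. The field names and the `task_type` values are assumptions for illustration, not the format used in extract-app.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExtractTask:
    """Hypothetical task envelope shared by the API and the workers."""
    task_id: str
    document_key: str
    task_type: str  # assumed values, e.g. "segment" or "georeference"


def encode_task(task: ExtractTask) -> str:
    """Serialise a task to the JSON string that would be sent as the message body."""
    return json.dumps(asdict(task), sort_keys=True)


def decode_task(body: str) -> ExtractTask:
    """Parse a message body back into a task; unexpected fields raise TypeError."""
    return ExtractTask(**json.loads(body))
```

Keeping the envelope in one shared definition means the API and workers cannot silently drift apart on field names.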

Extract Worker

Segmenter

Georeferencer

Data Flow Diagram

STEP 1 - moving application side code into provide

The API stays as an external service to MHCLG.
How can the API be reached by the provide web application?
All functionality needs to be on the API; we will need to limit the functionality that the app records/does.
We still need to answer the authentication question.

eveleighoj · 2026-01-15T14:54:02Z

eveleighoj
Jan 15, 2026
Maintainer Author

Are there any dependencies on other I.AI repos outside of plannin-extract and extract-app?

1 reply

apricot13 Jan 15, 2026

there's these two packages

https://pypi.org/project/i-dot-ai-utilities/

https://www.npmjs.com/package/@i-dot-ai-npm/utilities
