
Add script to find terms without an example #358

Open
bact wants to merge 21 commits into w3c:dev from bact:find-terms-without-examples

@bact
Collaborator

@bact bact commented Aug 14, 2025

Pull Request

From DPV 2.3, new concepts should have an example (see 13 Aug 2025 meeting).

This 295_find_terms_without_example.py script will find terms without an example in the examples directory.

Update: The script will also find terms used in an example but undefined in vocab files (a potential typo).

Run without options on the command line, it shows the number of terms that have an example (or that have a parent with an example):

Namespace                  Class w/ Examples Prop. w/ Examples
-------------------------- ----------------- -----------------
ai                               4 / 183           1 / 10     
dpv                            442 / 962         113 / 144    
eu-aiact                         7 / 105           2 / 2      
eu-dga                          20 / 62            5 / 5      
eu-ehds                         17 / 61            0 / 0      
eu-gdpr                         87 / 217           6 / 6      
eu-nis2                          0 / 12            0 / 0      
eu-rights                        1 / 137           0 / 0      
justifications                  20 / 66            0 / 0      
legal-eu                         8 / 21            0 / 0      
loc                            146 / 5270          0 / 4      
p7012                           38 / 136           4 / 21     
pd                              70 / 221           0 / 0      
risk                            62 / 491          13 / 43     
sector-education                 7 / 49            0 / 0      
sector-finance                  10 / 43            0 / 0      
sector-health                    3 / 59            0 / 0      
sector-infra                     1 / 47            0 / 0      
sector-law                       2 / 94            0 / 0      
sector-publicservices            3 / 15            0 / 0      
tech                            14 / 127           5 / 52     

Top parents among classes without examples (excluding 'loc:'):
     49  risk:RiskMatrix7x7
     25  risk:RiskMatrix5x5
     25  risk:ServiceRelatedConsequence
     24  justifications:LegalProcessImpaired
     20  tech:Actor
     17  dpv:CryptographicMethods
     16  dpv:SecurityMethod
     16  risk:Discrimination
     16  dpv:PublicBenefit
     16  dpv:DataTransferLegalBasis

Top parents among properties without examples:
     18  tech:hasActor
      6  risk:controls
      5  dpv:hasData
      4  ai:hasAI
      4  skos:altLabel
      3  tech:hasInputData
      3  ai:hasData
      3  risk:resolves
      2  risk:reduces
      2  tech:hasInput  

Run with the -v option, it prints all the terms without an example:

==== Properties without examples ====

ai:hasAISystem                           ⊂ ai:hasAI
ai:hasCapability                         ⊂ ai:hasAI
ai:hasData                               ⊂ dpv:hasData
ai:hasGPAIModel                          ⊂ ai:hasModel
ai:hasModel                              ⊂ ai:hasAI
ai:hasTechnique                          ⊂ ai:hasAI
ai:hasTestingData                        ⊂ ai:hasData, tech:hasInputData
ai:hasTrainingData                       ⊂ ai:hasData, tech:hasInputData
ai:hasValidationData                     ⊂ ai:hasData, tech:hasInputData
dpv:hasConformanceStatus
dpv:hasData
dpv:hasDataSubjectScale                  ⊂ dpv:hasScale
...

This is not very useful yet, since it can't distinguish new terms (from one version to another).
Eventually, once we have "sinceVersion" information (see #359), we may be able to show only new terms without an example.
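
The counting described above (a term counts as covered when it, or one of its parents, appears in an example) can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the function names and the toy data are hypothetical, and it assumes "one of their parents" is read transitively.

```python
from collections import Counter


def has_example(term, parents, used, _seen=None):
    """Return True if `term` or any of its ancestors appears in an example."""
    seen = _seen if _seen is not None else set()
    if term in seen:  # guard against cycles in the hierarchy
        return False
    seen.add(term)
    if term in used:
        return True
    return any(has_example(p, parents, used, seen) for p in parents.get(term, ()))


def coverage_by_namespace(parents, used):
    """Map each prefix to a (covered, total) pair."""
    totals, covered = Counter(), Counter()
    for term in parents:
        prefix = term.split(":", 1)[0]
        totals[prefix] += 1
        if has_example(term, parents, used):
            covered[prefix] += 1
    return {ns: (covered[ns], totals[ns]) for ns in sorted(totals)}


def top_parents_without_examples(parents, used, n=10):
    """Count how often each direct parent occurs among uncovered terms."""
    counts = Counter()
    for term, direct_parents in parents.items():
        if not has_example(term, parents, used):
            counts.update(direct_parents)
    return counts.most_common(n)


# Hypothetical toy hierarchy (term -> direct parents); not real vocabulary data.
parents = {
    "dpv:hasData": [],
    "ai:hasData": ["dpv:hasData"],
    "ai:hasTrainingData": ["ai:hasData"],
    "ai:hasAI": [],
    "ai:hasModel": ["ai:hasAI"],
}
used = {"ai:hasData"}  # terms found in example files

print(coverage_by_namespace(parents, used))
print(top_parents_without_examples(parents, used))
```

Counting direct parents of uncovered terms is what makes a "top parents" list useful: one example for a frequently listed parent would cover many children at once.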

Find terms without an example in examples directory

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@coolharsh55
Collaborator

coolharsh55 commented Aug 14, 2025

This is cool, I'll run it on 2.2 later. I think this is taking all the RDF outputs and checking which terms occur (anywhere) in examples?

We also have the Examples CSV/RDF which contains dct:subject for what the example is about e.g.

dct:subject dpv:Purpose ;
It's not enough that the term is mentioned in the example, because the description of the example will also explain something about the concept. So would it be easier to maintain if we check the dct:subject of examples and then ensure that there is an example for the concept (ideal) or its parent (rdf:type)?
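
A rough sketch of the dct:subject idea, for illustration only. It pulls subjects out of Turtle text with a regular expression, which assumes a simple `dct:subject a, b ;` layout with prefixed names; a real implementation would more likely read the Examples CSV/RDF with an RDF parser. The function names, the `dex:Example` type, and the sample data are assumptions.

```python
import re

# Object list of a dct:subject statement, up to the ';' or '.' ending the line.
# Assumes objects are prefixed names; full IRIs and literals are not handled.
SUBJECT_RE = re.compile(r"dct:subject\s+([^;]+?)\s*[;.]\s*$", re.MULTILINE)


def subjects_of(turtle_text):
    """Collect every term that some example declares as its dct:subject."""
    subjects = set()
    for match in SUBJECT_RE.finditer(turtle_text):
        for obj in match.group(1).split(","):
            subjects.add(obj.strip())
    return subjects


def concepts_without_subject_example(concepts, parents, subjects):
    """Concepts where neither the concept nor a direct parent is a subject."""
    missing = []
    for concept in sorted(concepts):
        candidates = {concept, *parents.get(concept, ())}
        if not candidates & subjects:
            missing.append(concept)
    return missing


# Hypothetical example metadata, made up for this sketch.
sample = '''
ex:E1 a dex:Example ;
    dct:title "Purpose example" ;
    dct:subject dpv:Purpose, dpv:hasPurpose ;
    dct:description "mentions dpv:Consent in passing" .
'''

subjects = subjects_of(sample)
concepts = {"dpv:Purpose", "dpv:Consent", "dpv:Marketing"}
parents = {"dpv:Marketing": ["dpv:Purpose"]}
print(concepts_without_subject_example(concepts, parents, subjects))
```

Here dpv:Consent is reported as missing even though it is mentioned in the description, which is the distinction raised above between mentioning a term and being an example of it.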

@coolharsh55
Collaborator

We can use #12 for the general discussion on what to use for example / use-case and how to script tests around it.

@bact
Collaborator Author

bact commented Aug 14, 2025

> This is cool, I'll run it on 2.2 later. I think this is taking all the RDF outputs and checking which terms occur (anywhere) in examples?

Yes. Since the TTLs in the examples/ directory are not in full form, I just match terms with a regular expression (without actual Turtle/RDF parsing). I will put this in a code comment.

I will check the dex. I looked at it before but only saw a description and a link to an actual TTL example file, so I used the TTLs in /examples/ (at root) instead. I will look at the code again.
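
For readers unfamiliar with the approach, regex-based matching of prefixed terms (no Turtle parsing) might look like the sketch below. The pattern and the `find_terms` helper are assumptions made for illustration, not the pattern used in the script.

```python
import re

# prefix:LocalName, where the prefix may contain dashes (e.g. eu-gdpr).
# This naive pattern ignores Turtle syntax entirely, so it also matches
# inside strings and IRIs.
TERM_RE = re.compile(r"\b([A-Za-z][\w-]*):([A-Za-z]\w*)")


def find_terms(text, prefixes):
    """Return prefixed terms in `text` whose prefix is a known vocabulary."""
    return {f"{p}:{n}" for p, n in TERM_RE.findall(text) if p in prefixes}


# Hypothetical snippet; no @prefix declarations are needed for this to work.
snippet = """
ex:Policy a dpv:Process ;
    dpv:hasPurpose dpv:ServiceProvision ;
    dpv:hasPersonalData pd:EmailAddress .
"""

print(sorted(find_terms(snippet, {"dpv", "pd"})))
```

The trade-off is that this works on files a parser would reject, but it cannot tell a term from text that merely looks like one.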

@coolharsh55
Copy link
Collaborator

What do you mean by full form? They should be valid as Turtle, except for the namespaces, which are taken from the CSVs.

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact
Collaborator Author

bact commented Aug 14, 2025

> What do you mean by full form? They should be valid as Turtle, except for the namespaces, which are taken from the CSVs.

Sorry, I should have used another word.
It is valid, but since we don't declare the ex namespace anywhere in the TTL, rdflib can't parse it. I got an error at that point.
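
One possible workaround, sketched here under assumptions, is to prepend the missing `@prefix` declarations before handing the text to a parser. The helper name and the namespace table are hypothetical (in the repository the namespaces come from the CSVs), and the detection is itself regex-based, so it only handles `@prefix` style declarations.

```python
import re

DECLARED_RE = re.compile(r"^\s*@prefix\s+([\w-]*):", re.MULTILINE)
# A prefix in use: a name followed by ':' and a letter, not inside a longer token.
USED_RE = re.compile(r"(?<![\w:/#-])([A-Za-z][\w-]*):(?=[A-Za-z])")


def add_missing_prefixes(turtle_text, namespaces):
    """Prepend @prefix lines for known prefixes that are used but undeclared."""
    declared = set(DECLARED_RE.findall(turtle_text))
    used = set(USED_RE.findall(turtle_text))
    missing = [p for p in sorted(used - declared) if p in namespaces]
    header = "".join(f"@prefix {p}: <{namespaces[p]}> .\n" for p in missing)
    return header + turtle_text


# Assumed namespace table for illustration only.
namespaces = {
    "dpv": "https://w3id.org/dpv#",
    "ex": "https://example.com/",
}

print(add_missing_prefixes("ex:Policy a dpv:Process .\n", namespaces))
```

The returned text could then be passed to a Turtle parser; whether that is preferable to emitting fully conformant files in the first place is a separate question.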

@coolharsh55
Collaborator

I see. It's possible to make them fully conformant Turtle files. My worry was that this might take up too much space in the HTML, but I can truncate the namespaces there via code. Please open an issue for this and I'll implement it later. Do we need this for v2.2, though, or can it be done for v2.3? I prefer later, as this might break stuff.

@bact
Collaborator Author

bact commented Aug 14, 2025

I prefer 2.3. No rush since it will take more time to actually have more examples anyway.

bact added 2 commits August 14, 2025 22:36
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact bact added the code label Aug 15, 2025
@coolharsh55 coolharsh55 added this to the dpv v2.3 milestone Aug 17, 2025
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact
Collaborator Author

bact commented Aug 19, 2025

The code is updated to cover cases like the one in #371.

This is what it reports for the 2.2 draft:

==== Terms used in examples but NOT defined in vocabulary files ====
2021-05-28T12:24                         in: E0023.ttl
2022-09-06T15:36                         in: E0016.ttl
dpv-gdpr:SCCsByCommission                in: E0025.ttl
dpv-juris:Ireland                        in: E0019.ttl
dpv:CompanyA                             in: E0035.ttl
dpv:CompanyB                             in: E0035.ttl
dpv:ContractAccepted                     in: E0077.ttl
dpv:ContractOfferReceived                in: E0077.ttl
dpv:ContractPartiallyAccepted            in: E0077.ttl
dpv:ContractUnfulfilled                  in: E0077.ttl
dpv:Controller                           in: E0032.ttl
dpv:Email                                in: E0015.ttl
dpv:FraudPreventionDetection             in: E0031.ttl, E0034.ttl, E0041.ttl, E0065.ttl
dpv:Harm                                 in: E0027.ttl
dpv:IE                                   in: E0049.ttl
dpv:Identifier                           in: E0044.ttl
dpv:Incident                             in: E0069.ttl
dpv:JointControllerAgreement             in: E0034.ttl
dpv:LossOfData                           in: E0068.ttl, E0069.ttl
dpv:SomeContract                         in: E0078.ttl
dpv:Subsidiary                           in: E0038.ttl
dpv:SystemicMonitoring                   in: E0013.ttl
dpv:TransferStatistics                   in: E0024.ttl
dpv:hasConsequenceOfFailure              in: E0061.ttl
dpv:hasProcessingContext                 in: E0013.ttl
dpv:hasStorageLocation                   in: E0072.ttl
dpv:hasThirdPartyRecipient               in: E0034.ttl
dpv:isImplementedByUsingTechnology       in: E0060.ttl, E0064.ttl
exA:TVServiceOptimisation                in: E0004.ttl
exB:TVServiceOptimisation                in: E0004.ttl
exB:TVSignalOptimisation                 in: E0004.ttl
is:dpv                                   in: E0004.ttl
iso:IE                                   in: E0019.ttl
legal-eu:GDPR                            in: E0036.ttl, E0055.ttl, E0067.ttl
loc:USA                                  in: E0060.ttl
nace:M72                                 in: E0008.ttl
new-profile:Anonymise                    in: E0030.ttl
new-profile:Use                          in: E0030.ttl
pd:Email                                 in: E0006.ttl, E0022.ttl, E0023.ttl, E0026.ttl, E0072.ttl
policy:1                                 in: E0030.ttl
risk:DataBreachReport                    in: E0063.ttl
risk:MisuseBreachedInformation           in: E0068.ttl, E0069.ttl, E0071.ttl
risk:halts                               in: E0086.ttl

Note that not all of them are actually undefined. A few may just be typos, some may be intentional, and some just come from another vocabulary (e.g., odrl).
For example:

  • The first two in the list are legit literals (datetime)
  • is:dpv in E0004.ttl, from text "the common ancestor is:dpv:OptimisationForConsumer" (a space is needed after is:dpv)
  • iso:IE in: E0019.ttl, this looks intentional
  • exA/B:TV* in E0004.ttl, also looks intentional.
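
The report above boils down to a set difference between terms seen in example files and terms defined in the vocabularies. A minimal sketch, assuming the terms have already been extracted per file; the function name, the `ignore_prefixes` parameter, and the toy input are hypothetical.

```python
from collections import defaultdict


def undefined_terms(terms_by_file, defined, ignore_prefixes=("ex",)):
    """Map each undefined term to the sorted list of files that use it."""
    report = defaultdict(list)
    for filename in sorted(terms_by_file):
        for term in terms_by_file[filename]:
            prefix = term.split(":", 1)[0]
            if term in defined or prefix in ignore_prefixes:
                continue
            report[term].append(filename)
    return dict(sorted(report.items()))


# Toy input: file name -> terms found in that file.
terms_by_file = {
    "E0068.ttl": {"dpv:LossOfData", "dpv:Purpose"},
    "E0069.ttl": {"dpv:LossOfData"},
    "E0004.ttl": {"ex:Thing"},
}
defined = {"dpv:Purpose"}

for term, files in undefined_terms(terms_by_file, defined).items():
    # Layout similar to the report above; the column width is a guess.
    print(f"{term:<40} in: {', '.join(files)}")
```

Terms from other vocabularies (such as odrl) would need either their own prefix in `ignore_prefixes` or their definitions added to `defined`.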

bact added 3 commits August 19, 2025 17:28
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@coolharsh55
Collaborator

Thanks @bact -- looking super helpful. For undefined terms in examples #358 (comment) -- these should be fixed, yes? Same issue as #371?

@bact
Collaborator Author

bact commented Aug 20, 2025

I think so. Mind that some of them are false positives, due to limitations of regex matching. For example, "policy:1" is from odrl:uid <https://example.com/policy:1> which I think should be fine.

@coolharsh55
Collaborator

Thanks, I'll change what I can find from these. There is code somewhere in the existing setup that distinguishes between "DPV concepts" and others based on namespaces in order to generate RDF or HTML. Later when looking at this, I'll see how to use that here as then we won't have to rely on regex and broken/invalid RDF will also be flagged automatically.

@bact
Collaborator Author

bact commented Aug 20, 2025

I can take some of these.
I already looked at legal-eu:GDPR, pd:Email, and a few more in PR #373.
The rest look more complicated.

@coolharsh55
Collaborator

Thanks @bact -- added the changes in 22222be. Can you please mark the unresolved ones, or ideally move them to #372, and I'll take a look.

@bact
Collaborator Author

bact commented Aug 20, 2025

I will move the unresolved ones to #372 to better track that.

Update: Done. All remaining undefined terms are here: #372 (comment)

bact added 3 commits August 20, 2025 12:11
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
bact added 3 commits August 21, 2025 13:44
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@coolharsh55
Collaborator

Tested this for fixing typos, brilliant stuff @bact - very helpful! Some minor issues which are easily recognised and ignored:

2021-05-28T12:24                         in: E0023.ttl <--- ignore strings i.e. starting with ""
2022-09-06T15:36                         in: E0016.ttl <--- ignore strings i.e. starting with ""
dpv:DataLoss                             in: E0068.ttl, E0069.ttl
eu-gdpr:A6                               in: E0072.ttl <--- terms can contain dashes e.g. A6-xyz
eu-gdpr:A7                               in: E0072.ttl <--- terms can contain dashes e.g. A7-xyz
exA:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
exB:TVSignalOptimisation                 in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
legal-eu:law                             in: E0036.ttl, E0055.ttl, E0067.ttl, E0072.ttl, E0076.ttl <--- terms can contain dashes
legal-ie:DPA                             in: E0036.ttl <--- terms can contain dashes
legal-ie:law                             in: E0036.ttl <--- terms can contain dashes
nace:72                                  in: E0008.ttl <--- to add nace prefix in our list
policy:1                                 in: E0030.ttl <--- syntax error in IRI, fixed
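
The cases listed above (quoted strings, dashes in term names, example prefixes, text inside IRIs) could be handled along the lines of the sketch below. It is not the regex from the PR; the patterns, names, and sample text are assumptions for illustration.

```python
import re

# Quoted literals (long and short form) and IRIs are blanked out before matching.
STRING_RE = re.compile(r'"""[\s\S]*?"""|"(?:[^"\\\n]|\\.)*"')
IRI_RE = re.compile(r"<[^<>\s]*>")
# Prefix and local name may both contain dashes; a trailing dash is excluded.
TERM_RE = re.compile(r"(?<![\w:-])([A-Za-z][\w-]*):([A-Za-z0-9](?:[\w-]*\w)?)")
# Example prefixes: ex, exA, ..., exF.
EXAMPLE_PREFIX_RE = re.compile(r"ex[A-F]?")


def find_terms(text):
    """Return prefixed terms outside strings and IRIs, skipping example prefixes."""
    text = STRING_RE.sub('""', text)
    text = IRI_RE.sub("<>", text)
    terms = set()
    for prefix, name in TERM_RE.findall(text):
        if EXAMPLE_PREFIX_RE.fullmatch(prefix):
            continue
        terms.add(f"{prefix}:{name}")
    return terms


# Sample text made up for this sketch.
sample = '''
ex:Policy a dpv:Process ;
    dct:created "2021-05-28T12:24"^^xsd:dateTime ;
    dpv:hasLegalBasis eu-gdpr:A6-1-a ;
    odrl:uid <https://example.com/policy:1> ;
    dct:description "uses pd:Email in text" .
'''

print(sorted(find_terms(sample)))
```

Blanking strings and IRIs first keeps the term pattern itself simple. Because a prefix must start with a letter and cannot begin mid-token, a datetime such as 2021-05-28T12:24 would not match even outside quotes.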

@bact
Collaborator Author

bact commented Aug 24, 2025

The numbering of the script (2xx) is a bit confusing though since there is one part (the check of possible undefined terms in HTML) that should be run after the HTML generation script (300).

We can think more about this for 2.3.

@coolharsh55
Collaborator

@bact How about -- we make all 4xx numbered scripts be for testing -- so any tests go there, including the current 290 for SHACL and future ones like a fork of OOPS!/FOOPS! specifically for DPV that I'm planning to write. The current logic is that 1xx is data retrieval from GSheets, 2xx is RDF output, and 3xx is HTML output, then 9xx is for releases. In the future, we will need further numbers for tools/implementations e.g. if we host something interactive or provide a library or something -- these can take up 5xx -- 8xx.
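
As a small illustration of the numbering convention described above, a script's stage can be read from the first digit of its file name. The `stage_of` helper and the second file name are hypothetical.

```python
import re

# First digit of the three-digit script number -> stage, as described above.
STAGES = {
    1: "data retrieval",
    2: "RDF output",
    3: "HTML output",
    4: "testing",
    9: "release",
}
RESERVED = range(5, 9)  # 5xx to 8xx: future tools/implementations


def stage_of(script_name):
    """Return the pipeline stage implied by a script's numeric prefix."""
    match = re.match(r"(\d)\d{2}_", script_name)
    if not match:
        raise ValueError(f"no three-digit prefix in {script_name!r}")
    digit = int(match.group(1))
    if digit in RESERVED:
        return "tools/implementations (reserved)"
    if digit not in STAGES:
        raise ValueError(f"unassigned number range in {script_name!r}")
    return STAGES[digit]


print(stage_of("295_find_terms_without_example.py"))
print(stage_of("410_hypothetical_test.py"))  # hypothetical file name
```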

@bact
Collaborator Author

bact commented Aug 24, 2025

What about

  • 1xx data retrieval from source
  • 2xx RDF generation (for machine)
  • 3xx HTML generation (for human)
  • 4xx Test
    • 41x Test retrieved data
    • 42x Test generated RDF
    • 43x Test generated HTML

But this will make the possible numbers for each category limited to only 9 and difficult to allocate in a forward-compatible way (keep same numbers in the future).

@coolharsh55
Collaborator

Isn't the issue here that the script tests both RDF and HTML, so it is unclear whether it should be 42x or 43x? I think 4xx should not follow the sub-numbering: what if we have more than 10 RDF test files? There very likely won't be a lot of code here, and we can put the workflow in the README or wiki if it isn't simple/intuitive. So the primary focus should be on usability for people working with/on it (e.g. you and me).

@bact
Collaborator Author

bact commented Aug 25, 2025

You're right. Let's keep 4xx simple. Thanks.

bact and others added 2 commits September 9, 2025 20:48
Do not match strings (surrounded with quotes) and datetime-lookalike

Co-Authored-By: Harshvardhan Pandit <me@harshp.com>
@bact
Collaborator Author

bact commented Sep 9, 2025

> Tested this for fixing typos, brilliant stuff @bact - very helpful! Some minor issues which are easily recognised and ignored:
>
> 2021-05-28T12:24                         in: E0023.ttl <--- ignore strings i.e. starting with ""
> 2022-09-06T15:36                         in: E0016.ttl <--- ignore strings i.e. starting with ""
> dpv:DataLoss                             in: E0068.ttl, E0069.ttl
> eu-gdpr:A6                               in: E0072.ttl <--- terms can contain dashes e.g. A6-xyz
> eu-gdpr:A7                               in: E0072.ttl <--- terms can contain dashes e.g. A7-xyz
> exA:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
> exB:TVServiceOptimisation                in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
> exB:TVSignalOptimisation                 in: E0004.ttl <--- example prefixes can be of the form ex, exA, ... exF
> legal-eu:law                             in: E0036.ttl, E0055.ttl, E0067.ttl, E0072.ttl, E0076.ttl <--- terms can contain dashes
> legal-ie:DPA                             in: E0036.ttl <--- terms can contain dashes
> legal-ie:law                             in: E0036.ttl <--- terms can contain dashes
> nace:72                                  in: E0008.ttl <--- to add nace prefix in our list
> policy:1                                 in: E0030.ttl <--- syntax error in IRI, fixed

Fixed the regex as suggested.

Run against the latest code in the main repo:

Terms used in examples but NOT defined in vocabulary files (2)
----------------------------------------------------------
dpv:DataLoss                             in: E0068.ttl, E0069.ttl
nace:72                                  in: E0008.ttl

dpv:DataLoss should be risk:DataLoss; I will fix it in another PR. Update: see PR #383.

@bact
Collaborator Author

bact commented Feb 4, 2026

Run against 2.3-dev on 2026-02-04 11:15 UTC

Classes (inc. parents) with examples: 1091 / 8487
Properties (inc. parents) with examples: 150 / 287

Namespace                  Class w/ Examples Prop. w/ Examples
-------------------------- ----------------- -----------------
ai                               5 / 185           1 / 10     
dpv                            448 / 962         113 / 144    
eu-aiact                         7 / 105           2 / 2      
eu-dga                          20 / 62            5 / 5      
eu-ehds                         17 / 61            0 / 0      
eu-gdpr                         94 / 217           6 / 6      
eu-nis2                          0 / 12            0 / 0      
eu-rights                        1 / 137           0 / 0      
justifications                  20 / 66            0 / 0      
legal-eu                         8 / 21            0 / 0      
loc                            146 / 5270          0 / 4      
p7012                           38 / 136           4 / 21     
pd                              70 / 221           0 / 0      
risk                            71 / 492          14 / 43     
sector-education                 7 / 49            0 / 0      
sector-finance                  10 / 43            0 / 0      
sector-health                    3 / 59            0 / 0      
sector-infra                     1 / 47            0 / 0      
sector-law                       2 / 94            0 / 0      
sector-publicservices            3 / 15            0 / 0      
tech                            14 / 127           5 / 52  

Top parents among classes without examples
(excluding child with prefixes: loc)
-------------------------------------------------------------
     49  risk:RiskMatrix7x7
     25  risk:ServiceRelatedConsequence
     25  risk:RiskMatrix5x5
     24  justifications:LegalProcessImpaired
     20  tech:Actor
     17  dpv:CryptographicMethods
     16  dpv:PublicBenefit
     16  risk:Discrimination
     16  dpv:SecurityMethod
     15  dpv:DataTransferLegalBasis

Top parents among properties without examples
-------------------------------------------------------------
     18  tech:hasActor
      6  risk:controls
      5  dpv:hasData
      4  ai:hasAI
      4  skos:altLabel
      3  ai:hasData
      3  tech:hasInputData
      3  risk:resolves
      2  tech:hasInput
      2  dpv:hasTechnicalOrganisationalMeasure

Terms used in examples but NOT defined in vocabulary files (1)
----------------------------------------------------------
nace:72                                  in: E0008.ttl

Terms referenced in HTML but NOT defined in vocabulary files (0)
------------------------------------------------------------
Not found.

@coolharsh55
Collaborator

brilliant work, thanks @bact -- I'd like to have this integrated into dev from the start for 2.4 and to have as many examples as we can as the big push is to have guides, use-cases, and examples.

@bact bact modified the milestones: dpv v2.3, dpv v2.4 Feb 6, 2026