KnowledgeCaptureAndDiscovery · dgarijo · Dec 17, 2025 · Dec 17, 2025 · Dec 17, 2025 · dgarijo
diff --git a/docs/readmefile.md b/docs/readmefile.md
@@ -0,0 +1,163 @@
+The following metadata fields can be extracted from a readme.md file.   
+Unlike other file formats (pom, cargo, cabal...), README documents do not follow a formal specification. They are free-form text files, usually written in Markdown or reStructuredText, and their structure varies widely across projects. SOMEF applies heuristics to identify common sections (e.g., Title, Description, Installation, Usage, License) and extracts metadata accordingly.
+
+| Software metadata category     |    SOMEF metadata JSON path            | README.MD metadata file field     |
+|--------------------------------|----------------------------------------|----------------------------------------|
+| acknowledgement                |     acknowledgement[i].result.value      |  headers with acknowledgement        |
+| citation                       |     citation[i].result.value      |  headers with citation, reference, cite. Extracts BibTeX  **(1)**       |
+| contact                       |     contact[i].result.value      |  headers with contact       |
+| contributing_guidelines                      |     contributing_guidelines[i].result.value      |  headers with contributing     |
+| contributors                       |     contributors[i].result.value      |  headers with contributor      |
+| description                       |     description[i].result.value      |  headers with description, introduction, basics, initiation, overview      |
+| documentation                  |     documentation[i].result.value      |  GitHub or GitLab URL of the documentation folder **(2)**, headers with documentation, Read the Docs project with the same name as the repository, Read the Docs links in badges, wiki links in badges and text       |
+| download                       |     download[i].result.value      |  headers with download       |
+| executable_example                      |     executable_example[i].result.value      |  extracts Binder links from badges   **(3)**    |
+| faq                       |     faq[i].result.value      |  headers with faq, errors, problems   |
+| full_title                       |     full_title[i].result.value      |  extracts the full title   **(4)** |
+| homepage                       |     homepage[i].result.value      |  homepage from badges  **(5)**  |
+| identifier                    |     identifier[i].result.value         |     extracted directly from badges or retrieved from Zenodo with the latest DOI **(6)**, SWH identifiers **(7)**         |
+| images                       |     images[i].result.value      |  other images in the README apart from the logo   |
+| installation                      |     installation[i].result.value      |  headers with installation, install, setup, prepare, preparation, manual, guide       |
+| license                       |     license[i].result.value      |  headers with license      |
+| logo                       |     logo[i].result.value      |   looks for images in badges and text **(8)**  |
+| package_distribution                       |    package_distribution[i].result.value      |  PyPI or latest PyPI version in badges   **(9)**   |
+| related_documentation                  |     related_documentation[i].result.value      |   Read the Docs project with a different name from the repository     |
+| run                       |     run[i].result.value      |  headers with run, execute       |
+| readme_url                     |     readme_url[i].result.value         |     raw.githubusercontent.com URL of the README **(11)**         |
+| related_papers                   |     related_papers[i].result.value         |    looks for arXiv references throughout the text **(10)**       |
+| repository_status                       |     repository_status[i].result.value      |  badges with Project status **(12)**  |
+| requirements                     |     requirements[i].result.value      |  headers with requirement, prerequisite, dependency, dependent      |
+| support                       |     support[i].result.value      |  headers with support, help, report   |
+| support_channels               |     support_channels[i].result.value      |  extracts Gitter, Reddit and Discord links from badges and text  **(13)**  |
+| usage                       |     usage[i].result.value      |  headers with usage, example, implement, implementation, demo, tutorial, start, started      |
+
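A minimal sketch of how the header heuristics in the table could work, assuming plain substring matching on ATX-style Markdown headers. The `HEADER_KEYWORDS` map and `classify_header` are hypothetical names that copy a few rows of the table; they are not SOMEF's actual implementation, which may use a different matching strategy.

```python
import re

# Hypothetical keyword map, copied from a few rows of the table above.
HEADER_KEYWORDS = {
    "installation": ["installation", "install", "setup", "prepare",
                     "preparation", "manual", "guide"],
    "usage": ["usage", "example", "implement", "implementation",
              "demo", "tutorial", "start", "started"],
    "requirements": ["requirement", "prerequisite", "dependency", "dependent"],
    "license": ["license"],
}

# ATX header: one to six '#' characters, then the header text.
HEADER_RE = re.compile(r"^#{1,6}\s+(.*?)\s*#*\s*$")


def classify_header(line):
    """Return the categories whose keywords occur in a Markdown header line."""
    match = HEADER_RE.match(line)
    if not match:
        return []
    text = match.group(1).lower()
    return [category for category, words in HEADER_KEYWORDS.items()
            if any(word in text for word in words)]
```

Substring matching keeps the sketch short, but it can over-match; a real heuristic would likely need word boundaries or stemming.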
+
+------
+
+**(1)** 
+- Example:
+```bib
+@inproceedings{garijo2017widoco,
+  title={WIDOCO: a wizard for documenting ontologies},
+  author={Garijo, Daniel},
+  booktitle={International Semantic Web Conference},
+  pages={94--102},
+  year={2017},
+  organization={Springer, Cham},
+  doi = {10.1007/978-3-319-68204-4_9},
+  funding = {USNSF ICER-1541029, NIH 1R01GM117097-01},
+  url={http://dgarijo.com/papers/widoco-iswc2017.pdf}
+}
+```
+- Result:
+```
+{
+    "result": {
+        "value": "@inproceedings{garijo2017widoco,\n    url = {http://dgarijo.com/papers/widoco-iswc2017.pdf},\n    funding = {USNSF ICER-1541029, NIH 1R01GM117097-01},\n    doi = {10.1007/978-3-319-68204-4_9},\n    organization = {Springer, Cham},\n    year = {2017},\n    pages = {94--102},\n    booktitle = {International Semantic Web Conference},\n    author = {Garijo, Daniel},\n    title = {WIDOCO: a wizard for documenting ontologies},\n}",
+        "type": "Text_excerpt",
+        "format": "bibtex",
+        "doi": "10.1007/978-3-319-68204-4_9",
+        "title": "WIDOCO: a wizard for documenting ontologies",
+        "author": "Garijo, Daniel",
+        "url": "http://dgarijo.com/papers/widoco-iswc2017.pdf"
+    }
+}
+```
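As an illustration of how the `doi`, `title`, `author` and `url` keys in the result could be derived from the BibTeX text, here is a small sketch using a regular expression. `bibtex_fields` is a hypothetical helper, it only handles flat `key = {value}` pairs without nested braces, and SOMEF may rely on a dedicated BibTeX parser instead.

```python
import re

# Matches flat `key = {value}` pairs; nested braces are not handled.
FIELD_RE = re.compile(r"(\w+)\s*=\s*\{([^{}]*)\}")


def bibtex_fields(entry, wanted=("doi", "title", "author", "url")):
    """Extract selected fields from a single BibTeX entry."""
    fields = {key.lower(): value for key, value in FIELD_RE.findall(entry)}
    return {key: fields[key] for key in wanted if key in fields}
```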
+
+
+**(2)** 
+- Example for GitHub:
+```
+f"https://github.com/{owner}/{repo_name}/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
+```
+- Example for GitLab:
+```
+f"https://{domain_gitlab}/{owner}/{repo_name}/-/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
+```
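The two templates above can be wrapped in one function. This is a sketch only: `docs_url` and its `gitlab_domain` parameter are hypothetical names, while the URL layouts are taken from the f-strings above.

```python
from urllib.parse import quote


def docs_url(owner, repo_name, default_branch, docs_path, gitlab_domain=None):
    """Build the documentation folder URL for a GitHub or GitLab repository."""
    # Percent-encode the branch name; '/' is kept as-is by default.
    branch = quote(default_branch)
    if gitlab_domain:
        return (f"https://{gitlab_domain}/{owner}/{repo_name}"
                f"/-/tree/{branch}/{docs_path}")
    return f"https://github.com/{owner}/{repo_name}/tree/{branch}/{docs_path}"
```

Quoting the branch matters for names that contain spaces or other reserved characters.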
+
+**(3)** 
+- Example: `[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/user/repo/HEAD)`
+- Result: `"value": "https://mybinder.org/v2/gh/user/repo/HEAD"`
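Several categories in the table read linked badges of the form `[![alt](image)](target)`. The sketch below shows one way to pull such badges out of a README and keep the Binder targets. The regular expression and the function names are assumptions for illustration, not SOMEF's code.

```python
import re

# Linked badge: [![alt text](image URL)](target URL)
BADGE_RE = re.compile(r"\[!\[([^\]]*)\]\(([^)\s]+)\)\]\(([^)\s]+)\)")


def badges(markdown):
    """Return (alt_text, image_url, target_url) for each linked badge."""
    return BADGE_RE.findall(markdown)


def binder_links(markdown):
    """Return the target URLs of badges that point to mybinder.org."""
    return [target for _, _, target in badges(markdown)
            if "mybinder.org" in target]
```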
+
+**(4)** 
+- Example: `# WIzard for DOCumenting Ontologies (WIDOCO)`
+- Result:
+```
+"full_title": [
+  {
+    "result": {
+      "type": "String",
+      "value": "WIzard for DOCumenting Ontologies (WIDOCO)"
+    },
+    "confidence": 1,
+    "technique": "regular_expression",
+    "source": "https://raw.githubusercontent.com/dgarijo/Widoco/master/README.md"
+  }
+]
+```
+
+**(5)** 
+- Example: `[![Project homepage](https://img.shields.io/badge/homepage-project-blue)](https://myproject.org)`
+- Result: `"value": "https://myproject.org"`
+
+
+**(6)** 
+- Example: `[![DOI](https://zenodo.org/badge/11427075.svg)](https://doi.org/10.5281/zenodo.11093793)`
+- Result: `"value": "https://doi.org/10.5281/zenodo.11093793"`
+
+**(7)** 
+- Example: `[![SWH](https://archive.softwareheritage.org/badge/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7/)](https://archive.softwareheritage.org/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7;origin=...)`
+- Result: `"value": "https://archive.softwareheritage.org/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7"`
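In the example, the result drops the `;origin=...` qualifier and keeps only the core identifier. A sketch of that normalization, assuming the standard SWHID layout (`swh:1:<type>:<40 hex digits>`); `swh_identifier_url` is a hypothetical helper name.

```python
import re

# Core SWHID: scheme version 1, object type, 40-character hex hash.
SWHID_RE = re.compile(r"swh:1:(?:cnt|dir|rel|rev|snp):[0-9a-f]{40}")


def swh_identifier_url(url):
    """Return the archive URL of the core SWHID, without qualifiers."""
    match = SWHID_RE.search(url)
    if match is None:
        return None
    return f"https://archive.softwareheritage.org/{match.group(0)}"
```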
+
+
+**(8)** 
+- Example: `![Logo](src/main/resources/logo/logo2.png)`
+- Result: `"value": "https://raw.githubusercontent.com/dgarijo/Widoco/master/src/main/resources/logo/logo2.png"`
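The logo example turns a repository-relative path into a raw content URL. A possible sketch, assuming a GitHub repository; `resolve_image` and its parameters are hypothetical names.

```python
from urllib.parse import quote


def resolve_image(owner, repo_name, ref, path):
    """Turn a repository-relative image path into a raw.githubusercontent.com URL."""
    # Absolute URLs are returned unchanged.
    if path.startswith(("http://", "https://")):
        return path
    return (f"https://raw.githubusercontent.com/"
            f"{owner}/{repo_name}/{ref}/{quote(path)}")
```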
+
+**(9)** 
+- Example: `[![PyPI](https://badge.fury.io/py/somef.svg)](https://badge.fury.io/py/somef)`
+- Result: `"value": "https://pypi.org/project/somef"`
+
+
+**(10)** 
+- Example: 
+```
+[Yulun Zhang](http://yulunzhang.com/), [Yapeng Tian](http://yapengtian.org/), [Yu Kong](http://www1.ece.neu.edu/~yukong/), [Bineng Zhong](https://scholar.google.de/citations?user=hvRBydsAAAAJ&hl=en), and [Yun Fu](http://www1.ece.neu.edu/~yunfu/), "Residual Dense Network for Image Super-Resolution", CVPR 2018 (spotlight), [[arXiv]](https://arxiv.org/abs/1802.08797) 
+```
+- Result: `"value": "https://arxiv.org/abs/1802.08797"`
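Scanning the whole text for arXiv links can be sketched with a single regular expression. The pattern below only covers `arxiv.org/abs/` URLs with new-style identifiers such as `1802.08797`, and `arxiv_links` is a hypothetical name; SOMEF's actual pattern may be broader.

```python
import re

# New-style arXiv identifiers: YYMM.NNNN or YYMM.NNNNN, optional version.
ARXIV_RE = re.compile(r"https?://arxiv\.org/abs/\d{4}\.\d{4,5}(?:v\d+)?")


def arxiv_links(text):
    """Return the distinct arXiv abstract URLs in order of appearance."""
    seen = []
    for url in ARXIV_RE.findall(text):
        if url not in seen:
            seen.append(url)
    return seen
```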
+
+
+**(11)** 
+- Example:
+```
+f"https://raw.githubusercontent.com/{owner}/{repo_name}/{repo_ref}/{urllib.parse.quote(partial)}" 
+```
+
+**(12)** 
+- Example:
+```
+ [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active) 
+```
+- Result:
+```
+"value": "https://www.repostatus.org/#active",
+"description": "Active \u2013 The project has reached a stable, usable state and is being actively developed."
+```
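The status badge carries both values of the result: the link target becomes `value` and the alt text after `Project Status:` becomes `description`. A sketch under that assumption; `repository_status` and the regular expression are illustrative only.

```python
import re

# repostatus.org badge: alt text starts with "Project Status:",
# and the link target is https://www.repostatus.org/#<status>.
STATUS_RE = re.compile(
    r"\[!\[Project Status:\s*([^\]]+)\]\([^)\s]+\)\]"
    r"\((https?://www\.repostatus\.org/#\w+)\)"
)


def repository_status(markdown):
    """Return the status URL and description from a repostatus.org badge."""
    match = STATUS_RE.search(markdown)
    if match is None:
        return None
    return {"value": match.group(2), "description": match.group(1).strip()}
```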
+
+**(13)** 
+- Example:
+```
+[![Gitter chat](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/myproject/community)
+[Reddit](https://www.reddit.com/r/myproject)
+[Discord](https://discord.com/invite/xyz789)
+```
+- Result:
+```
+"value": "https://gitter.im/myproject/community"
+....
+"value": "https://www.reddit.com/r/myproject"
+.....
+"value": "https://discord.com/invite/xyz789"
+```
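One way to collect the support channels is to gather every link target and keep those hosted on a known chat or forum domain. The domain set and `support_channels` below are assumptions based on the three services named in the table.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of domains treated as support channels.
CHANNEL_DOMAINS = {"gitter.im", "reddit.com", "discord.com"}

# Any Markdown link or image target: (...http URL...)
URL_RE = re.compile(r"\((https?://[^)\s]+)\)")


def support_channels(markdown):
    """Return the distinct links that point to a known support channel."""
    found = []
    for url in URL_RE.findall(markdown):
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in CHANNEL_DOMAINS and url not in found:
            found.append(url)
    return found
```

Matching on the exact host skips the badge image itself (`badges.gitter.im`) and keeps only the link target.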
+
+
diff --git a/docs/requerimentstxt.md → docs/requirementstxt.md b/docs/requerimentstxt.md → docs/requirementstxt.md
diff --git a/docs/supported_languages.md b/docs/supported_languages.md
@@ -0,0 +1,33 @@
+
+SOMEF recognizes the programming languages used in a software repository by inspecting
+well-known configuration files, dependency descriptors and executable artifacts.
+For the extraction details of each file type, follow the corresponding link.
+
+
+
+| Language  | Supported Files |
+|-----------|----------------------------|
+| Haskell | [`*.cabal`](./cabal.md) |  
+| Java | [`pom.xml`](./pom.md) | 
+| JavaScript | [`package.json`](./packagejson.md), [`bower.json`](./bower.md) | 
+| Julia | [`Project.toml`](./julia.md) | 
+| PHP | [`composer.json`](./composer.md) |  
+| Python | [`setup.py`](./setuppy.md), [`pyproject.toml`](./pyprojecttoml.md), [`requirements.txt`](./requirementstxt.md) | 
+| R | [`DESCRIPTION`](./description.md) | 
+| Ruby | [`*.gemspec`](./gemspec.md) |  
+| Rust | [`Cargo.toml`](./cargo.md) |  
+
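The table amounts to a mapping from file name patterns to languages. A minimal sketch of that lookup, assuming matching on the base file name only; `LANGUAGE_FILES` mirrors the table and `detect_languages` is a hypothetical name, not SOMEF's API.

```python
from fnmatch import fnmatchcase

# File name patterns per language, copied from the table above.
LANGUAGE_FILES = {
    "Haskell": ["*.cabal"],
    "Java": ["pom.xml"],
    "JavaScript": ["package.json", "bower.json"],
    "Julia": ["Project.toml"],
    "PHP": ["composer.json"],
    "Python": ["setup.py", "pyproject.toml", "requirements.txt"],
    "R": ["DESCRIPTION"],
    "Ruby": ["*.gemspec"],
    "Rust": ["Cargo.toml"],
}


def detect_languages(filenames):
    """Return the sorted languages whose descriptor files are present."""
    detected = set()
    for name in filenames:
        for language, patterns in LANGUAGE_FILES.items():
            if any(fnmatchcase(name, pattern) for pattern in patterns):
                detected.add(language)
    return sorted(detected)
```

`fnmatchcase` is used instead of `fnmatch` so that the result does not depend on the operating system's case rules.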
+---
+
+SOMEF also detects the following files to recognize build instructions, workflows or executable examples:
+
+
+| Language  | Supported Files                    | Software metadata category  |
+|-----------|------------------------------------|-----------------------------|
+| Docker    |  `Dockerfile`, `docker-compose.yml` | has_build_file |
+| Jupyter Notebook |  `*.ipynb`                | executable_example |
+| Ontologies | `*.ttl`, `*.owl`, `*.nt`, `*.xml`, `*.jsonld` | ontologies |
+| Shell |  `*.sh`                                | has_script_file   |
+| YAML |  `*.yml`, `*.yaml` | continuous_integration, workflows |
+
+
diff --git a/docs/supported_metadata_files.md b/docs/supported_metadata_files.md
@@ -2,82 +2,6 @@
 
 This project supports extracting metadata from specific types of files commonly used to declare authorship and contribution in open source repositories.
 
-## Supported files of authors.
-
-The following filenames are recognized and processed automatically:
-
-* `AUTHORS`
-* `AUTHORS.md`
-* `AUTHORS.txt`
-
-These files are expected to be located at the root of the repository. Filenames are matched case-insensitively.
-
-## Purpose and Format
-
-These files typically contain a list of individuals and/or organizations that have contributed to the project. While there is no universal standard for formatting, a widely referenced convention is Google's guidance:
-
-🔗 [Google Open Source: Authors Files Protocol](https://opensource.google/documentation/reference/releasing/authors/)
-
-The content may be structured as:
-
-* Simple plain text, with one contributor per line.
-* Markdown-formatted text (`.md` files).
-* Lines including contributor names, emails (e.g., `Name <email>`), and sometimes affiliations.
-
-### Examples of Valid Entries
-
-```text
-Jane Doe <jane@example.com>
-John Smith
-Acme Corporation <acme@mail.com>
-Google Inc.
-```
-
-### Examples of NON Valid Entries
-
-```text
-JetBrains <>
-Microsoft
-Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung
-scrawl - Top contributor
-Tom
-```
-## What Is Read vs. Discarded
-
-When processing these files, the parser will:
-
-**Include** lines that:
-
-* Contain person names, optionally with emails (`Name <email>`).
-* Clearly refer to organizations (e.g., "Google LLC", "OpenAI Inc.").
-
-**Discard** lines that:
-
-* Are headers, decorative separators, or markdown formatting (`#`, `*`, `=`, etc.).
-* Contain only URLs or links.
-* Are single words with no email and no organizational keyword (e.g., `JetBrains <>`).
-* Are markdown or structured noise (`---`, `{}`, etc.).
-* Contain more than four words and are not recognized as organizations — to avoid capturing generic or descriptive sentences (e.g., This line not is an author).
-
-### Special Cases
-
-* Entries with only a first name and an email are accepted but must not assign an empty `last_name`.
-* Lines starting with `-` or `*` are considered lists, but only parsed if the content matches expected author patterns.
-* Blocks enclosed in `{}` are stripped before parsing.
-* Any line matching known organization suffixes (`Inc.`, `LLC`, `Ltd.`, `Corporation`) is treated as an organization, even if no email is present.
-* Some organization names (e.g., Open Source Initiative) may be mistakenly treated as person names if they do not contain a company designator or email. To improve detection, it is recommended to use names like Open Source Initiative Inc.
-* In such cases, only the meaningful part (typically the name) is extracted before any descriptive annotations.
-For example, the line:
-Tom Smith (Tom) - Project leader 2010-2018
-Will be interpreted as:
-{
-  "type": "Person",
-  "name": "Tom Smith",
-  "value": "Tom Smith",
-  "given_name": "Tom",
-  "last_name": "Smith"
-}
-
 
 ## Supported Metadata Files in SOMEF
 
@@ -90,6 +14,7 @@ SOMEF can extract metadata from a wide range of files commonly found in software
 | `bower.json`       | JavaScript (Bower)         | Package descriptor used for configuring packages that can be used as a dependency for Bower-managed front-end projects. |  <div align="center">[🔍](./bower.md)</div>| [📄](https://github.com/bower/spec/blob/master/json.md)| |[Example](https://github.com/juanjemdIos/somef/blob/master/src/somef/test/test_data/repositories/js-template/bower.json) |
 | `package.json`     | JavaScript / Node.js       | Defines metadata, scripts, and dependencies for Node.js projects |  <div align="center">[🔍](./packagejson.md)| [📄](https://docs.npmjs.com/cli/v10/configuring-npm/package-json)| 10.9.4|[Example](https://github.com/npm/cli/blob/latest/package.json) | 
 | `codemeta.json`       |        JSON-LD              | Metadata file for research software using JSON-LD vocabulary | <div align="center">[🔍](./codemetajson.md)</div> | [📄](https://github.com/codemeta/codemeta/blob/master/crosswalk.csv)| [v3.0](https://w3id.org/codemeta/3.0)|[Example](https://github.com/codemeta/codemeta/blob/master/codemeta.json) |
+| `README.md` | Markdown                     | Main documentation file of the repository |  <div align="center">[🔍](./readmefile.md)</div>| | |[Example](https://github.com/KnowledgeCaptureAndDiscovery/somef/blob/master/README.md) |
 | `composer.json`    | PHP                        | Manifest file serves as the package descriptor used in PHP projects. | <div align="center">[🔍](./composer.md)</div>| [📄](https://getcomposer.org/doc/04-schema.md)| [2.8.12](https://getcomposer.org/changelog/2.8.12)|[Example](https://github.com/composer/composer/blob/main/composer.json) |
 | `juliaProject.toml`   | Python                     | Defines the package metadata and dependencies for Julia projects, used by the Pkg package manager.|  <div align="center">[🔍](./julia.md)</div>| [📄](https://docs.julialang.org/en/v1/)| |[Example](https://github.com/JuliaLang/TOML.jl/blob/master/Project.toml) | 
 | `pyproject.toml`   | Python                     | Modern Python project configuration file used by tools like Poetry and Flit |  <div align="center">[🔍](./pyprojecttoml.md)</div>| [📄](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/)| |[Example](https://github.com/KnowledgeCaptureAndDiscovery/somef/blob/master/pyproject.toml) | 

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -4,25 +4,23 @@ nav:
   - Install: install.md
   - Usage: usage.md
   - Output: output.md
-  - Supported formats:
-    - Authors: author.md
-    - Bower: bower.md
-    - Cabal: cabal.md
-    - Cargo: cargo.md
-    - Codemeta: codemetajson.md
-    - Composer: composer.md
-    - Description: description.md
-    - Gemespec: gemspec.md
-    - Julia Projects: julia.md
-    - Package JSON: packagejson.md
-    - Pom: pom.md
-    - Pyproject: pyprojecttoml.md
-    - Requirements: requirementstxt.md
-    - Setup: setuppy.md
-
+  - Supported metadata files: supported_metadata_files.md
+  - Supported languages: supported_languages.md
+    # - Authors: author.md
+    # - Bower: bower.md
+    # - Cabal: cabal.md
+    # - Cargo: cargo.md
+    # - Codemeta: codemetajson.md
+    # - Composer: composer.md
+    # - Description: description.md
+    # - Gemespec: gemspec.md
+    # - Julia Projects: julia.md
+    # - Package JSON: packagejson.md
+    # - Pom: pom.md
+    # - Pyproject: pyprojecttoml.md
+    # - Requirements: requirementstxt.md
+    # - Setup: setuppy.md
 
-  - Contributing: contributing.md
-  - Changelog: changelog.md 
 theme:
   name: material