163 changes: 163 additions & 0 deletions docs/readmefile.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,163 @@
The following metadata fields can be extracted from a README.md file.
Unlike other file formats (pom, cargo, cabal...), README documents do not follow a formal specification. They are free-form text files, usually written in Markdown or reStructuredText, and their structure varies widely across projects. SOMEF applies heuristics to identify common sections (e.g., Title, Description, Installation, Usage, License) and extracts metadata accordingly.

| Software metadata category | SOMEF metadata JSON path | README.MD metadata file field |
|--------------------------------|----------------------------------------|----------------------------------------|
| acknowledgement | acknowledgement[i].result.value | headers with acknowledgement |
| citation | citation[i].result.value | headers with citation, reference, cite. Extracts BibTeX entries **(1)** |
| contact | contact[i].result.value | headers with contact |
| contributing_guidelines | contributing_guidelines[i].result.value | headers with contributing |
| contributors | contributors[i].result.value | headers with contributor |
| description | description[i].result.value | headers with description, introduction, basics, initiation, overview |
| documentation | documentation[i].result.value | GitHub or GitLab documentation URL **(2)**, headers with documentation, Read the Docs project with the same name as the repository, Read the Docs links in badges, wiki links in badges and text |
| download | download[i].result.value | headers with download |
| executable_example | executable_example[i].result.value | extracts Binder links from badges **(3)** |
| faq | faq[i].result.value | headers with faq, errors, problems |
| full_title | full_title[i].result.value | extracts the full title **(4)** |
| homepage | homepage[i].result.value | homepage from badges **(5)** |
| identifier | identifier[i].result.value | extracted directly from badges or retrieved from Zenodo using the latest DOI **(6)**, SWH identifiers **(7)** |
| images | images[i].result.value | other images in the README apart from the logo |
| installation | installation[i].result.value | headers with installation, install, setup, prepare, preparation, manual, guide |
| license | license[i].result.value | headers with license |
| logo | logo[i].result.value | looks for logo images in badges and text **(8)** |
| package_distribution | package_distribution[i].result.value | PyPI or latest PyPI version in badges **(9)** |
| related_documentation | related_documentation[i].result.value | Read the Docs project with a different name from the repository |
| run | run[i].result.value | headers with run, execute |
| readme_url | readme_url[i].result.value | raw.githubusercontent.com URL of the README **(10)** |
| related_papers | related_papers[i].result.value | looks for arXiv references in the whole text **(11)** |
| repository_status | repository_status[i].result.value | badges with Project status **(12)** |
| requirements | requirements[i].result.value | headers with requirement, prerequisite, dependency, dependent |
| support | support[i].result.value | headers with support, help, report |
| support_channels | support_channels[i].result.value | extract information of gitter, reddit and discord in badges and text **(13)** |
| usage | usage[i].result.value | headers with usage, example, implement, implementation, demo, tutorial, start, started |
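
The header-based rows of this table can be read as a keyword lookup. The sketch below is only an illustration of that idea under stated assumptions: `HEADER_KEYWORDS` copies four rows of the table, and `classify_header` with its prefix matching is a hypothetical helper, not SOMEF's actual implementation.

```python
import re

# Hypothetical subset of the table above (illustrative, not SOMEF's internal data).
HEADER_KEYWORDS = {
    "installation": ["installation", "install", "setup", "prepare", "preparation", "manual", "guide"],
    "usage": ["usage", "example", "implement", "implementation", "demo", "tutorial", "start", "started"],
    "requirements": ["requirement", "prerequisite", "dependency", "dependent"],
    "license": ["license"],
}


def classify_header(header):
    """Return every category whose keywords prefix a word of a markdown header."""
    text = header.lstrip("#").strip().lower()
    words = re.findall(r"[a-z]+", text)
    matches = []
    for category, keywords in HEADER_KEYWORDS.items():
        # Prefix matching (an assumption) lets "requirements" hit "requirement".
        if any(word.startswith(keyword) for word in words for keyword in keywords):
            matches.append(category)
    return matches


print(classify_header("## Installation guide"))
print(classify_header("# Getting started"))
```

Prefix matching is a deliberate simplification here; a header such as "Dependencies" would need stemming or an extra keyword to be caught.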


------

**(1)**
- Example:
```bib
@inproceedings{garijo2017widoco,
title={WIDOCO: a wizard for documenting ontologies},
author={Garijo, Daniel},
booktitle={International Semantic Web Conference},
pages={94--102},
year={2017},
organization={Springer, Cham},
doi = {10.1007/978-3-319-68204-4_9},
funding = {USNSF ICER-1541029, NIH 1R01GM117097-01},
url={http://dgarijo.com/papers/widoco-iswc2017.pdf}
}
```
- Result:
```
{
"result": {
"value": "@inproceedings{garijo2017widoco,\n url = {http://dgarijo.com/papers/widoco-iswc2017.pdf},\n funding = {USNSF ICER-1541029, NIH 1R01GM117097-01},\n doi = {10.1007/978-3-319-68204-4_9},\n organization = {Springer, Cham},\n year = {2017},\n pages = {94--102},\n booktitle = {International Semantic Web Conference},\n author = {Garijo, Daniel},\n title = {WIDOCO: a wizard for documenting ontologies},\n}",
"type": "Text_excerpt",
"format": "bibtex",
"doi": "10.1007/978-3-319-68204-4_9",
"title": "WIDOCO: a wizard for documenting ontologies",
"author": "Garijo, Daniel",
"url": "http://dgarijo.com/papers/widoco-iswc2017.pdf"
}
}
```


**(2)**
- Example if github:
```
f"https://github.com/{owner}/{repo_name}/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
```
- Example if gitlab:
```
f"https://{domain_gitlab}/{owner}/{repo_name}/-/tree/{urllib.parse.quote(repo_default_branch)}/{docs_path}"
```
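
The two templates above can be wrapped in one small function. This is a minimal sketch: the function name `build_docs_url` and the `gitlab_domain` parameter are assumptions made for illustration, while the URL shapes are taken directly from the templates.

```python
from urllib.parse import quote


def build_docs_url(owner, repo_name, default_branch, docs_path, gitlab_domain=None):
    """Build the documentation folder URL for a GitHub or GitLab repository."""
    # Branch names may contain characters that must be percent-encoded.
    branch = quote(default_branch)
    if gitlab_domain:
        # GitLab inserts "/-/" before "tree".
        return f"https://{gitlab_domain}/{owner}/{repo_name}/-/tree/{branch}/{docs_path}"
    return f"https://github.com/{owner}/{repo_name}/tree/{branch}/{docs_path}"


print(build_docs_url("dgarijo", "Widoco", "master", "docs"))
```

`quote` leaves `/` unescaped by default, so a branch such as `feature/x` keeps its slash while spaces become `%20`.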

**(3)**
- Example: `[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/user/repo/HEAD)`
- Result: `"value": "https://mybinder.org/v2/gh/user/repo/HEAD"`
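
One way to obtain this result is to match the markdown badge syntax `[![alt](image)](target)` and keep the targets that point to mybinder.org. The regular expression and the helper name `find_binder_links` below are illustrative assumptions, not the pattern SOMEF uses.

```python
import re

# Matches [![alt](image_url)](target_url); captures alt text, image URL and target URL.
BADGE_RE = re.compile(r"\[!\[([^\]]*)\]\(([^)\s]+)\)\]\(([^)\s]+)\)")


def find_binder_links(readme_text):
    """Return the link targets of badges that point to mybinder.org."""
    return [
        target
        for _alt, _image, target in BADGE_RE.findall(readme_text)
        if "mybinder.org" in target
    ]


readme = "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/user/repo/HEAD)"
print(find_binder_links(readme))
```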

**(4)**
- Example: `# WIzard for DOCumenting Ontologies (WIDOCO)`
- Result:
```
"full_title": [
{
"result": {
"type": "String",
"value": "WIzard for DOCumenting Ontologies (WIDOCO)"
},
"confidence": 1,
"technique": "regular_expression",
"source": "https://raw.githubusercontent.com/dgarijo/Widoco/master/README.md"
}
]
```
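
Since the result reports `regular_expression` as the technique, a title lookup could be sketched as follows. The pattern and the helper name `extract_full_title` are assumptions for illustration; the real expression may differ.

```python
import re


def extract_full_title(readme_text):
    """Return the text of the first level-1 markdown header, or None."""
    # "^#" followed by whitespace excludes "##" and deeper headers.
    match = re.search(r"^#[ \t]+(.+?)[ \t]*$", readme_text, flags=re.MULTILINE)
    return match.group(1) if match else None


print(extract_full_title("# WIzard for DOCumenting Ontologies (WIDOCO)\n\n## Install\n"))
```

This sketch does not skip `#` comment lines inside fenced code blocks, which a complete implementation would need to handle.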

**(5)**
- Example: `[![Project homepage](https://img.shields.io/badge/homepage-project-blue)](https://myproject.org)`
- Result: `"value": "https://myproject.org"`


**(6)**
- Example: `[![DOI](https://zenodo.org/badge/11427075.svg)](https://doi.org/10.5281/zenodo.11093793)`
- Result: `"value": "https://doi.org/10.5281/zenodo.11093793"`

**(7)**
- Example: `[![SWH](https://archive.softwareheritage.org/badge/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7/)](https://archive.softwareheritage.org/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7;origin=...)`
- Result: `"value": "https://archive.softwareheritage.org/swh:1:dir:40d462bbecefc3a9c3e810567d1f0d7606e0fae7"`


**(8)**
- Example: `![Logo](src/main/resources/logo/logo2.png)`
- Result: `"value": "https://raw.githubusercontent.com/dgarijo/Widoco/master/src/main/resources/logo/logo2.png"`
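
The example shows a relative image path being turned into an absolute raw URL. A minimal sketch of that resolution is given below; `find_logo_url`, its parameters and the rule "alt text or path contains `logo`" are assumptions chosen for illustration.

```python
import re
from urllib.parse import quote

# Matches ![alt](target) and captures the alt text and the image target.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def find_logo_url(readme_text, owner, repo_name, ref):
    """Return an absolute URL for the first image that looks like a logo."""
    for alt_text, target in IMAGE_RE.findall(readme_text):
        if "logo" not in alt_text.lower() and "logo" not in target.lower():
            continue
        if target.startswith(("http://", "https://")):
            return target
        # Relative paths are resolved against the raw content host.
        relative = target[2:] if target.startswith("./") else target
        return f"https://raw.githubusercontent.com/{owner}/{repo_name}/{ref}/{quote(relative)}"
    return None


print(find_logo_url("![Logo](src/main/resources/logo/logo2.png)", "dgarijo", "Widoco", "master"))
```

Because badges also embed images, a fuller version would have to exclude badge images whose file name happens to contain `logo`.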

**(9)**
- Example: `[![PyPI](https://badge.fury.io/py/somef.svg)](https://badge.fury.io/py/somef) `
- Result: `"value": "https://pypi.org/project/somef"`
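
The mapping from a `badge.fury.io` badge to a PyPI project page can be sketched as below. The pattern and the helper name `pypi_project_urls` are illustrative assumptions; package names containing dots are not handled by this simplified expression.

```python
import re

# Captures the package name that follows badge.fury.io/py/ (stops before ".svg").
PYPI_BADGE_RE = re.compile(r"badge\.fury\.io/py/([A-Za-z0-9_-]+)")


def pypi_project_urls(readme_text):
    """Return one PyPI project URL per distinct package found in badges."""
    names = []
    for name in PYPI_BADGE_RE.findall(readme_text):
        if name not in names:
            names.append(name)
    return [f"https://pypi.org/project/{name}" for name in names]


badge = "[![PyPI](https://badge.fury.io/py/somef.svg)](https://badge.fury.io/py/somef)"
print(pypi_project_urls(badge))
```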


**(10)**
- Example:
```
f"https://raw.githubusercontent.com/{owner}/{repo_name}/{repo_ref}/{urllib.parse.quote(partial)}"
```


**(11)**
- Example:
```
[Yulun Zhang](http://yulunzhang.com/), [Yapeng Tian](http://yapengtian.org/), [Yu Kong](http://www1.ece.neu.edu/~yukong/), [Bineng Zhong](https://scholar.google.de/citations?user=hvRBydsAAAAJ&hl=en), and [Yun Fu](http://www1.ece.neu.edu/~yunfu/), "Residual Dense Network for Image Super-Resolution", CVPR 2018 (spotlight), [[arXiv]](https://arxiv.org/abs/1802.08797)
```
- Result: `"value": "https://arxiv.org/abs/1802.08797"`

**(12)**
- Example:
```
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
```
- Result:
```
"value": "https://www.repostatus.org/#active",
"description": "Active \u2013 The project has reached a stable, usable state and is being actively developed."
```
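
The value and description in this result both come from the same badge: the link target and the alt text after `Project Status:`. A hedged sketch of that split follows; the pattern and the name `extract_repository_status` are assumptions, not SOMEF code.

```python
import re

# Alt text after "Project Status:" becomes the description; the link target is the value.
REPOSTATUS_RE = re.compile(
    r"\[!\[Project Status:\s*([^\]]+)\]\([^)\s]+\)\]\((https?://www\.repostatus\.org/#[a-z-]+)\)"
)


def extract_repository_status(readme_text):
    """Return the repostatus.org URL and its description, or None."""
    match = REPOSTATUS_RE.search(readme_text)
    if match is None:
        return None
    return {"value": match.group(2), "description": match.group(1).strip()}


badge = (
    "[![Project Status: Active \u2013 The project has reached a stable, usable state "
    "and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)]"
    "(https://www.repostatus.org/#active)"
)
print(extract_repository_status(badge))
```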

**(13)**
- Example:
```
[![Gitter chat](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/myproject/community)
[Reddit](https://www.reddit.com/r/myproject)
[Discord](https://discord.com/invite/xyz789)
```
- Result:
```
"value": "https://gitter.im/myproject/community"
....
"value": "https://www.reddit.com/r/myproject"
.....
"value": "https://discord.com/invite/xyz789"
```
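
Since these links may appear in badges or in plain text, a URL-level search is one simple way to collect them. The host list and the helper name `find_support_channels` below are assumptions for illustration (`discord.gg` is included as a common short form, not because the source mentions it).

```python
import re

# Hosts taken from the example above, plus discord.gg as an assumed variant.
CHANNEL_RE = re.compile(
    r"https?://(?:www\.)?(?:gitter\.im|reddit\.com/r|discord\.com/invite|discord\.gg)/[\w./-]+"
)


def find_support_channels(readme_text):
    """Return distinct Gitter, Reddit and Discord links in order of appearance."""
    urls = []
    for match in CHANNEL_RE.finditer(readme_text):
        url = match.group(0).rstrip(".")
        if url not in urls:
            urls.append(url)
    return urls


readme = """
[![Gitter chat](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/myproject/community)
[Reddit](https://www.reddit.com/r/myproject)
[Discord](https://discord.com/invite/xyz789)
"""
print(find_support_channels(readme))
```

The badge image hosted on `badges.gitter.im` is intentionally not matched, so only the chat room link is reported.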



What is this file? Shouldn't it be requirements.txt?

File renamed without changes.
33 changes: 33 additions & 0 deletions docs/supported_languages.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@

SOMEF recognizes the programming languages used in a software repository by inspecting
well-known configuration files, dependency descriptors and executable artifacts.
To learn more about the extraction details for a given file type, follow its link.



| Language | Supported Files |
|-----------|----------------------------|
| Haskell | [`*.cabal`](./cabal.md) |
| Java | [`pom.xml`](./pom.md) |
| JavaScript | [`package.json`](./packagejson.md), [`bower.json`](./bower.md) |
| Julia | [`Project.toml`](./julia.md) |
| PHP | [`composer.json`](./composer.md) |
| Python | [`setup.py`](./setuppy.md), [`pyproject.toml`](./pyprojecttoml.md), [`requirements.txt`](./requirementstxt.md) |
| R | [`DESCRIPTION`](./description.md) |
| Ruby | [`*.gemspec`](./gemspec.md) |
| Rust | [`Cargo.toml`](./cargo.md) |
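
The table can be read as a mapping from file name patterns to languages. The sketch below illustrates that reading only; `LANGUAGE_FILES` mirrors the table, while `detect_languages` and the use of `fnmatchcase` are assumptions rather than SOMEF's detection code.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Mirrors the table above; the dictionary itself is a hypothetical structure.
LANGUAGE_FILES = {
    "Haskell": ["*.cabal"],
    "Java": ["pom.xml"],
    "JavaScript": ["package.json", "bower.json"],
    "Julia": ["Project.toml"],
    "PHP": ["composer.json"],
    "Python": ["setup.py", "pyproject.toml", "requirements.txt"],
    "R": ["DESCRIPTION"],
    "Ruby": ["*.gemspec"],
    "Rust": ["Cargo.toml"],
}


def detect_languages(paths):
    """Return the sorted languages whose descriptor files appear in the given paths."""
    found = set()
    for path in paths:
        name = PurePosixPath(path).name  # match on the file name only
        for language, patterns in LANGUAGE_FILES.items():
            if any(fnmatchcase(name, pattern) for pattern in patterns):
                found.add(language)
    return sorted(found)


print(detect_languages(["Cargo.toml", "docs/pom.xml", "src/widget.gemspec", "README.md"]))
```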

---

SOMEF also detects the following files to recognize build instructions, workflows or executable examples:


| Language | Supported Files | Software metadata category |
|-----------|------------------------------------|-----------------------------|
| Docker | `Dockerfile`, `docker-compose.yml` | has_build_file |
| Jupyter Notebook | `*.ipynb` | executable_example |
| Ontologies | `*.ttl`, `*.owl`, `*.nt`, `*.xml`, `*.jsonld` | ontologies |
| Shell | `*.sh` | has_script_file |
| YAML | `*.yml`, `*.yaml` | continuous_integration, workflows |


77 changes: 1 addition & 76 deletions docs/supported_metadata_files.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,82 +2,6 @@

This project supports extracting metadata from specific types of files commonly used to declare authorship and contribution in open source repositories.

## Supported author files

The following filenames are recognized and processed automatically:

* `AUTHORS`
* `AUTHORS.md`
* `AUTHORS.txt`

These files are expected to be located at the root of the repository. Filenames are matched case-insensitively.

## Purpose and Format

These files typically contain a list of individuals and/or organizations that have contributed to the project. While there is no universal standard for formatting, a widely referenced convention is Google's guidance:

🔗 [Google Open Source: Authors Files Protocol](https://opensource.google/documentation/reference/releasing/authors/)

The content may be structured as:

* Simple plain text, with one contributor per line.
* Markdown-formatted text (`.md` files).
* Lines including contributor names, emails (e.g., `Name <email>`), and sometimes affiliations.

### Examples of Valid Entries

```text
Jane Doe <jane@example.com>
John Smith
Acme Corporation <acme@mail.com>
Google Inc.
```

### Examples of Invalid Entries

```text
JetBrains <>
Microsoft
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung
scrawl - Top contributor
Tom
```
## What Is Read vs. Discarded

When processing these files, the parser will:

**Include** lines that:

* Contain person names, optionally with emails (`Name <email>`).
* Clearly refer to organizations (e.g., "Google LLC", "OpenAI Inc.").

**Discard** lines that:

* Are headers, decorative separators, or markdown formatting (`#`, `*`, `=`, etc.).
* Contain only URLs or links.
* Are single words with no email and no organizational keyword (e.g., `JetBrains <>`).
* Are markdown or structured noise (`---`, `{}`, etc.).
* Contain more than four words and are not recognized as organizations, to avoid capturing generic or descriptive sentences (e.g., `This line is not an author`).
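
The include and discard rules above can be sketched as a single predicate. This is a simplified illustration, not the parser itself: `keep_author_line`, the suffix tuple and the stripping of ` - annotation` tails are assumptions made so that the valid and invalid examples in this section behave as described.

```python
import re

ORG_SUFFIXES = ("inc.", "llc", "ltd.", "corporation")  # assumed keyword list
EMAIL_RE = re.compile(r"<[^<>\s]+@[^<>\s]+>")


def keep_author_line(line):
    """Return True if a line from an AUTHORS file looks like a person or organization."""
    text = line.strip()
    # Headers, separators and structured noise such as "---" or "{}".
    if not text or text[0] in "#=" or set(text) <= set("-*_={} "):
        return False
    text = text.lstrip("-* ").strip()  # drop list markers
    if re.fullmatch(r"https?://\S+", text):
        return False  # URL-only line
    has_email = EMAIL_RE.search(text) is not None
    name_part = re.sub(r"<[^<>]*>", "", text)
    name_part = re.split(r"\s+-\s+", name_part)[0]  # drop " - annotation" (assumption)
    words = name_part.split()
    if any(word.lower() in ORG_SUFFIXES for word in words):
        return True  # organization keyword present
    if len(words) <= 1 and not has_email:
        return False  # single word, no email
    return len(words) <= 4  # longer lines are treated as sentences


for sample in ["Jane Doe <jane@example.com>", "Google Inc.", "JetBrains <>", "Tom"]:
    print(sample, keep_author_line(sample))
```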

### Special Cases

* Entries with only a first name and an email are accepted, but the parser must not assign an empty `last_name`.
* Lines starting with `-` or `*` are considered lists, but only parsed if the content matches expected author patterns.
* Blocks enclosed in `{}` are stripped before parsing.
* Any line matching known organization suffixes (`Inc.`, `LLC`, `Ltd.`, `Corporation`) is treated as an organization, even if no email is present.
* Some organization names (e.g., Open Source Initiative) may be mistakenly treated as person names if they do not contain a company designator or email. To improve detection, it is recommended to use names such as Open Source Initiative Inc.
* When a line carries descriptive annotations, only the meaningful part (typically the name) is extracted.
For example, the line:
Tom Smith (Tom) - Project leader 2010-2018
will be interpreted as:
{
"type": "Person",
"name": "Tom Smith",
"value": "Tom Smith",
"given_name": "Tom",
"last_name": "Smith"
}
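
The special cases above can be illustrated with a small name parser. This is a sketch under assumptions: `parse_person` is a hypothetical helper, and it drops emails, `{}` blocks, parenthesized nicknames and ` - annotation` tails before splitting the name; the email is discarded here only to keep the example short.

```python
import re


def parse_person(line):
    """Turn an AUTHORS line into a Person record without empty name fields."""
    text = re.sub(r"\{[^{}]*\}", "", line)   # strip {...} blocks
    text = re.sub(r"<[^<>]*>", "", text)     # strip <email>
    text = re.split(r"\s+-\s+", text)[0]     # strip " - annotation"
    text = re.sub(r"\([^()]*\)", "", text)   # strip (nickname)
    name = " ".join(text.split())
    parts = name.split()
    person = {"type": "Person", "name": name, "value": name}
    if parts:
        person["given_name"] = parts[0]
    if len(parts) > 1:
        # last_name is only set when there is something to put in it.
        person["last_name"] = " ".join(parts[1:])
    return person


print(parse_person("Tom Smith (Tom) - Project leader 2010-2018"))
```

For an entry such as `Tom <tom@example.com>`, the record carries `given_name` but no `last_name` key, in line with the first special case.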


## Supported Metadata Files in SOMEF

Expand All @@ -90,6 +14,7 @@ SOMEF can extract metadata from a wide range of files commonly found in software
| `bower.json` | JavaScript (Bower) | Package descriptor used for configuring packages that can be used as a dependency for Bower-managed front-end projects. | <div align="center">[🔍](./bower.md)</div>| [📄](https://github.com/bower/spec/blob/master/json.md)| |[Example](https://github.com/juanjemdIos/somef/blob/master/src/somef/test/test_data/repositories/js-template/bower.json) |
| `package.json` | JavaScript / Node.js | Defines metadata, scripts, and dependencies for Node.js projects | <div align="center">[🔍](./packagejson.md)</div>| [📄](https://docs.npmjs.com/cli/v10/configuring-npm/package-json)| 10.9.4|[Example](https://github.com/npm/cli/blob/latest/package.json) |
| `codemeta.json` | JSON-LD | Metadata file for research software using JSON-LD vocabulary | <div align="center">[🔍](./codemetajson.md)</div> | [📄](https://github.com/codemeta/codemeta/blob/master/crosswalk.csv)| [v3.0](https://w3id.org/codemeta/3.0)|[Example](https://github.com/codemeta/codemeta/blob/master/codemeta.json) |
| `README.md` | Markdown | Main documentation file of the repository | <div align="center">[🔍](./readmefile.md)</div>| | |[Example](https://github.com/KnowledgeCaptureAndDiscovery/somef/blob/master/README.md) |
| `composer.json` | PHP | Manifest file that serves as the package descriptor in PHP projects. | <div align="center">[🔍](./composer.md)</div>| [📄](https://getcomposer.org/doc/04-schema.md)| [2.8.12](https://getcomposer.org/changelog/2.8.12)|[Example](https://github.com/composer/composer/blob/main/composer.json) |
| `juliaProject.toml` | Julia | Defines the package metadata and dependencies for Julia projects, used by the Pkg package manager.| <div align="center">[🔍](./julia.md)</div>| [📄](https://docs.julialang.org/en/v1/)| |[Example](https://github.com/JuliaLang/TOML.jl/blob/master/Project.toml) |
| `pyproject.toml` | Python | Modern Python project configuration file used by tools like Poetry and Flit | <div align="center">[🔍](./pyprojecttoml.md)</div>| [📄](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/)| |[Example](https://github.com/KnowledgeCaptureAndDiscovery/somef/blob/master/pyproject.toml) |
34 changes: 16 additions & 18 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,25 +4,23 @@ nav:
- Install: install.md
- Usage: usage.md
- Output: output.md
- Supported formats:
- Authors: author.md
- Bower: bower.md
- Cabal: cabal.md
- Cargo: cargo.md
- Codemeta: codemetajson.md
- Composer: composer.md
- Description: description.md
- Gemspec: gemspec.md
- Julia Projects: julia.md
- Package JSON: packagejson.md
- Pom: pom.md
- Pyproject: pyprojecttoml.md
- Requirements: requirementstxt.md
- Setup: setuppy.md

- Supported metadata files: supported_metadata_files.md
- Supported languages: supported_languages.md
# - Authors: author.md
# - Bower: bower.md
# - Cabal: cabal.md
# - Cargo: cargo.md
# - Codemeta: codemetajson.md
# - Composer: composer.md
# - Description: description.md
# - Gemspec: gemspec.md
# - Julia Projects: julia.md
# - Package JSON: packagejson.md
# - Pom: pom.md
# - Pyproject: pyprojecttoml.md
# - Requirements: requirementstxt.md
# - Setup: setuppy.md

- Contributing: contributing.md
- Changelog: changelog.md
theme:
name: material
