Open
23 commits
4ef60a5
feat: add support for mac ARM
lfoppiano Feb 23, 2026
906161f
chore: update .gitignore
lfoppiano Feb 23, 2026
dcf6e28
chore: update libraries built with libc++ on github actions
lfoppiano Feb 24, 2026
411a38f
feat: support both mangled methods names for libc++ and libstdc++
lfoppiano Feb 25, 2026
0b28291
fix: update path
lfoppiano Feb 25, 2026
c15728a
feat: first version of an automated build
lfoppiano Feb 25, 2026
81ef036
feat: fix the GitHub runner names
lfoppiano Feb 25, 2026
eb7e83c
feat: use runners that actually exist
lfoppiano Feb 25, 2026
4dc4573
fix folder names
lfoppiano Feb 25, 2026
f84802a
fix: update binaries
lfoppiano Feb 25, 2026
1e29dc8
feat: allow switching libraries
lfoppiano Feb 25, 2026
b5b1a0d
feat: ignore tests based on the library loaded
lfoppiano Feb 26, 2026
c4be80a
feat: add external libraries
lfoppiano Feb 26, 2026
50fa187
fix: package name in ubuntu it has a different number
lfoppiano Feb 26, 2026
182c69e
fix: remove binaries in the repo, use maven-antrun-plugin to build them
lfoppiano Feb 26, 2026
2c0ecc0
feat: use cmake instead of a script
lfoppiano Feb 26, 2026
a0ce0d8
fix: remove warnings when building CDL2
lfoppiano Feb 26, 2026
af29918
fix: load the full library as it was designed
lfoppiano Feb 26, 2026
0fac883
refactor: library loading
lfoppiano Feb 26, 2026
bf9015a
chore: cleanup
lfoppiano Feb 26, 2026
0a69e1e
fix: remove macOS full variant from test platforms and clean up pom.x…
lfoppiano Feb 26, 2026
65f2a06
Merge branch 'master' into feature/support-arm-mac
lfoppiano Feb 26, 2026
0539d7f
docs: update readme.md
lfoppiano Feb 26, 2026
59 changes: 59 additions & 0 deletions .github/workflows/test-platforms.yml
@@ -0,0 +1,59 @@
name: Build and test on multiple platforms

on:
  push:
  pull_request:
    branches: [main, master, develop]

jobs:
  test:
    strategy:
      fail-fast: false
      matrix:
        include:
          - os: ubuntu-24.04
            variant: system
          - os: ubuntu-24.04-arm
            variant: system
          - os: ubuntu-24.04
            variant: standard
          - os: ubuntu-24.04-arm
            variant: standard
          - os: macos-15-intel
            variant: standard
          - os: macos-latest
            variant: standard
          - os: ubuntu-24.04
            variant: full
          - os: ubuntu-24.04-arm
            variant: full

    runs-on: ${{ matrix.os }}

    steps:
      - uses: actions/checkout@v4

      - name: Install libffi (Linux)
        if: runner.os == 'Linux'
        run: sudo apt-get update && sudo apt-get install -y libffi-dev

      - name: Install libffi (macOS)
        if: runner.os == 'macOS'
        run: brew install libffi

      - name: Install system CLD2 library (Linux system profile)
        if: matrix.variant == 'system'
        run: sudo apt-get install -y libcld2-0 libcld2-dev

      - name: Install build tools (Linux standard/full profiles)
        if: runner.os == 'Linux' && matrix.variant != 'system'
        run: sudo apt-get install -y build-essential

      - name: Set up JDK
        uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '17'

      - name: Build and test
        run: mvn clean verify -P${{ matrix.variant }} -q
5 changes: 5 additions & 0 deletions .gitignore
@@ -8,3 +8,8 @@ hs_err_pid*
.classpath
.settings/

.idea

# Native libraries (use build-native profile to build)
src/main/resources/linux-*/
src/main/resources/darwin-*/
78 changes: 52 additions & 26 deletions README.md
@@ -5,27 +5,62 @@ The [Compact Language Detector 2](https://github.com/CLD2Owners/cld2) is a nativ

## Installation

### Native Library
This project supports four build profiles:

First, the library libcld2.so (or a .dll on Windows) needs to be installed.
| Profile | Description | Platforms |
| ------------- | ----------------------------------------------------------------------- | ----------------|
| *(default)* | No native library bundled. Requires system library or `-Djava.library.path` | Any |
| `system` | Use system-installed libcld2 (Debian package) | Linux (Debian) |
| `standard` | Clone and build CLD2 from source, bundle into JAR | Linux, macOS |
| `full` | Build from source with full language support (160+) | Linux |

- on Debian-based systems the easiest way is to install the package [libcld-0](https://packages.debian.org/stretch/libcld2-0):
### System Library (Linux/Debian)

Install the native library via apt:
```
apt-get install libcld2-0 libcld2-dev
```
- to compile the CLD2 library from source:

Then build with the `system` profile:
```
mvn clean verify -Psystem
```

### Build from Source

For Linux or macOS, use the `standard` profile to clone and build CLD2 from source:
```
mvn clean verify -Pstandard
```

This clones [lfoppiano/CLD2](https://github.com/lfoppiano/CLD2) and builds `libcld2`, then bundles it into the JAR.

**Prerequisites:**
- Linux: `build-essential`, `git`
- macOS: Xcode Command Line Tools (includes git, clang++)

### Full Language Support (160+ languages)

The `full` profile is **Linux only**. It builds both `libcld2` and `libcld2_full` from source and uses `LD_PRELOAD` to load the full language tables during testing:

```
git clone https://github.com/CLD2Owners/cld2.git
cd cld2/internal/
export CFLAGS="-Wno-narrowing -O3"
./compile_and_test_all.sh
mvn clean verify -Pfull
```
If you only want the libraries, `./compile_libs.sh` is sufficient. You may use different compiler flags, the flag `-Wno-narrowing` is required for compilers which follow the C++11 standard.

The `libcld2_full` library contains only the classifier tables for 160+ languages; it is not a standalone library. At runtime, set `LD_PRELOAD=libcld2_full.so` to override the standard tables in `libcld2`. For Hadoop Map-Reduce jobs, pass `-Dmapreduce.reduce.env=LD_PRELOAD=libcld2_full.so`.

**Why Linux only?** The macOS equivalent (`DYLD_INSERT_LIBRARIES`) does not work because System Integrity Protection (SIP) strips all `DYLD_*` environment variables from child processes, including the JVM forked by Maven Surefire.
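
To make the runtime use of `LD_PRELOAD` concrete, here is a minimal sketch of starting a child process with the variable set from Java. The dynamic linker reads `LD_PRELOAD` when a process starts, so it has to be in the environment before the JVM that loads `libcld2` is launched. `PreloadLauncher`, `withFullTables` and the command line are hypothetical names used for illustration only, not part of this project, and the library path is a placeholder for wherever your `libcld2_full.so` lives.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of this project. */
final class PreloadLauncher {

    /** True when the given os.name value denotes Linux, the only platform where the full tables are supported. */
    static boolean isLinux(String osName) {
        return osName.toLowerCase(Locale.ROOT).contains("linux");
    }

    /**
     * Prepares a child process. On Linux, LD_PRELOAD is set to the given
     * library path so the full tables override the standard ones; on other
     * platforms the environment is left untouched.
     */
    static ProcessBuilder withFullTables(List<String> command, String fullLibPath, String osName) {
        ProcessBuilder pb = new ProcessBuilder(command);
        if (isLinux(osName)) {
            pb.environment().put("LD_PRELOAD", fullLibPath);
        }
        return pb.inheritIO();
    }

    public static void main(String[] args) {
        // Placeholder command and library path; replace both with your own.
        ProcessBuilder pb = withFullTables(
                List.of("java", "-version"),
                "libcld2_full.so",
                System.getProperty("os.name"));
        System.out.println("LD_PRELOAD for child: " + pb.environment().get("LD_PRELOAD"));
        // pb.start() would launch the child with this environment.
    }
}
```

The OS name is passed in as a parameter rather than read inside the helper so the Linux-only behaviour can be checked on any machine.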

#### Using the CLD2 Full Version (160+ languages)
### Using Without Maven Profiles

Both the Debian package and the source build provide two native libraries: `libcld2.so` and `libcld2_full.so`. The former supports 80+, the latter 160+ languages. However, the `libcld2_full.so` from the Debian package isn't a complete shared library - it only contains the tables used by the classifier. To use the larger tables for 160+ language instead of those for 80+ languages, you must use the [LD_PRELOAD trick](https://stackoverflow.com/questions/426230/what-is-the-ld-preload-trick) and set the environment variable `LD_PRELOAD=libcld2_full.so` (on Linux). In case, the language detector is used in Hadoop Map-Reduce jobs, this can be achieved by setting the Hadoop configuration property `mapreduce.reduce.env`, e.g., by passing `-Dmapreduce.reduce.env=LD_PRELOAD=libcld2_full.so` as command-line argument.
If not using a profile, you must provide the native library yourself:

1. **Install system library** (see above), then:
```
mvn clean verify -Djava.library.path=/usr/lib/x86_64-linux-gnu
```

2. **Or use JNA's classpath loading**: Place `libcld2.so`/`libcld2.dylib` on the classpath and JNA will find it.
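
For classpath loading, the `standard` and `full` profiles in this change copy the library into a platform-specific directory (`darwin-aarch64`, `darwin-x86-64`, `linux-aarch64` or `linux-x86-64`), which appears to follow JNA's per-platform resource prefix naming. The sketch below derives that directory name from `os.name` and `os.arch`. `NativePrefix` and its methods are hypothetical helpers written for this example and do not call JNA. Unlike the Ant conditions in `pom.xml`, which treat every non-`aarch64` architecture as `x86-64`, this sketch rejects architectures it does not recognise.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; approximates the directory names used by this build. */
final class NativePrefix {

    /** Maps os.name / os.arch values to one of the four directory names used by the build. */
    static String prefixFor(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.contains("mac") || os.contains("darwin")) {
            osPart = "darwin";
        } else if (os.contains("linux")) {
            osPart = "linux";
        } else {
            throw new IllegalArgumentException("Unsupported OS: " + osName);
        }

        String archPart;
        if (arch.equals("aarch64") || arch.equals("arm64")) {
            archPart = "aarch64";
        } else if (arch.equals("amd64") || arch.equals("x86_64")) {
            archPart = "x86-64";
        } else {
            throw new IllegalArgumentException("Unsupported architecture: " + osArch);
        }
        return osPart + "-" + archPart;
    }

    /** Classpath location where the bundled library would be expected for the given platform. */
    static String resourcePathFor(String osName, String osArch) {
        String prefix = prefixFor(osName, osArch);
        String fileName = prefix.startsWith("darwin") ? "libcld2.dylib" : "libcld2.so";
        return prefix + "/" + fileName;
    }

    public static void main(String[] args) {
        System.out.println(resourcePathFor(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```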


### Java Bindings
@@ -43,29 +78,20 @@ and can then be used as dependency
</dependency>
```

To link the Java code with the native libraries, you need to make sure that Java can find the share object:
To link the Java code with the native libraries when using the default build (without profiles), you need to make sure that Java can find the shared object:
- either install the native library on a standard library path (already done when the Debian package is used)
- add the directory where your libcld2.so is installed to the environment variable `LD_LIBRARY_PATH`
- use the Java option `-Djava.library.path=...`
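
When the library is still not found, it can help to check which of these locations actually contains it. The following is a small diagnostic sketch, not part of the bindings: `LibraryPathCheck` and `findInPathList` are hypothetical names, and the check only looks for a file with the platform's default library file name (via `System.mapLibraryName`), so versioned names such as `libcld2.so.0` are not matched.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical diagnostic helper for illustration; not part of this project. */
final class LibraryPathCheck {

    /** Returns the first file named fileName found in a path-separator-delimited list of directories. */
    static Optional<Path> findInPathList(String pathList, String fileName) {
        if (pathList == null || pathList.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathList.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "libcld2.so" on Linux, "libcld2.dylib" on macOS
        String fileName = System.mapLibraryName("cld2");
        System.out.println("java.library.path: "
                + findInPathList(System.getProperty("java.library.path"), fileName));
        System.out.println("LD_LIBRARY_PATH:   "
                + findInPathList(System.getenv("LD_LIBRARY_PATH"), fileName));
    }
}
```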

#### Java Native Access (JNA) and libffi

The CLD2 native functions are accessed via the [Java Native Access (JNA)](https://github.com/java-native-access/jna) which uses the [Foreign Function Interface Library (libffi)](https://sourceware.org/libffi/). JNA is a project dependency but the libffi needs to be present on your system. If not install it, e.g.
```
apt-get install libffi6
```

#### Potential Issues on Other Platforms (Non-Linux)

So far, the bindings have only been tested on Linux.
The CLD2 native functions are accessed via [Java Native Access (JNA)](https://github.com/java-native-access/jna), which uses the [Foreign Function Interface Library (libffi)](https://sourceware.org/libffi/). JNA is a project dependency, but libffi needs to be present on your system:
- Linux (Debian/Ubuntu): `apt-get install libffi-dev`
- macOS: `brew install libffi`

One potential issue for ports to other platforms is the [mangling of C++ function names](https://en.wikipedia.org/wiki/Name_mangling). Function names called in the native library are registered in [Cld2Library](../blob/master/src/main/java/org/commoncrawl/langdetect/cld2/Cld2Library.java) and [Cld2](../blob/master/src/main/java/org/commoncrawl/langdetect/cld2/Cld2.java) using the mangled names, e.g., `_ZN4CLD224ExtDetectLanguageSummaryEPKcibPKNS_8CLDHintsEiPNS_8LanguageEPiPdPSt6vectorINS_11ResultChunkESaISA_EES7_Pb`. The mangling may work differently on a different platform or when another C++-compiler is used.
#### Platform Support

To adopt the Java bindings, you first need to get the mangled names from the shared object. On Linux this could be done by calling
```
% nm -D .../libcld2.so.0.0.197
```
The mangled function names in the two Java classes need to be replaced by the ones exposed by your native library. Please also see the notes in [Cld2](../blob/master/src/main/java/org/commoncrawl/langdetect/cld2/Cld2.java) regarding the creation of the bindings.
The bindings have been tested on Linux (x86-64, ARM64) and macOS (Intel, Apple Silicon).


## History
177 changes: 177 additions & 0 deletions pom.xml
@@ -23,6 +23,13 @@
</properties>

<build>
<extensions>
<extension>
<groupId>kr.motd.maven</groupId>
<artifactId>os-maven-plugin</artifactId>
<version>1.7.0</version>
</extension>
</extensions>
<testSourceDirectory>${basedir}/src/test/java</testSourceDirectory>
<testResources>
<testResource>
@@ -77,6 +84,176 @@
</plugins>
</build>

<profiles>
<profile>
<id>system</id>
</profile>
<profile>
<id>standard</id>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-antrun-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<id>build-cld2</id>
<phase>generate-resources</phase>
<goals>
<goal>run</goal>
</goals>
<configuration>
<target>
<exec executable="git" failonerror="true" dir="${project.build.directory}">
<arg line="clone --depth 1 -b feature/mac-arm-support https://github.com/lfoppiano/CLD2.git"/>
</exec>
<exec executable="cmake" failonerror="true" dir="${project.build.directory}/CLD2">
<arg value="-S."/>
<arg value="-Bbuild"/>
<arg value="-DCMAKE_CXX_FLAGS=-w"/>
</exec>
<exec executable="cmake" failonerror="true" dir="${project.build.directory}/CLD2">
<arg value="--build"/>
<arg value="build"/>
</exec>
<condition property="native.prefix" value="darwin-aarch64">
<and>
<os family="mac"/>
<os arch="aarch64"/>
</and>
</condition>
<condition property="native.prefix" value="darwin-x86-64">
<and>
<os family="mac"/>
<not>
<os arch="aarch64"/>
</not>
</and>
</condition>
<condition property="native.prefix" value="linux-aarch64">
<and>
<os family="unix"/>
<not>
<os family="mac"/>
</not>
<os arch="aarch64"/>
</and>
</condition>
<condition property="native.prefix" value="linux-x86-64">
<and>
<os family="unix"/>
<not>
<os family="mac"/>
</not>
<not>
<os arch="aarch64"/>
</not>
</and>
</condition>
<mkdir dir="${project.build.outputDirectory}/${native.prefix}"/>
<copy todir="${project.build.outputDirectory}/${native.prefix}">
<fileset dir="${project.build.directory}/CLD2/build">
<include name="libcld2.*"/>
</fileset>
</copy>
</target>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
<profile>
<id>full</id>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-antrun-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<id>build-cld2</id>
<phase>generate-resources</phase>
<goals>
<goal>run</goal>
</goals>
<configuration>
<target>
<exec executable="git" failonerror="true" dir="${project.build.directory}">
<arg line="clone --depth 1 -b feature/mac-arm-support https://github.com/lfoppiano/CLD2.git"/>
</exec>
<exec executable="cmake" failonerror="true" dir="${project.build.directory}/CLD2">
<arg value="-S."/>
<arg value="-Bbuild"/>
<arg value="-DCMAKE_CXX_FLAGS=-w"/>
</exec>
<exec executable="cmake" failonerror="true" dir="${project.build.directory}/CLD2">
<arg value="--build"/>
<arg value="build"/>
</exec>
<condition property="native.prefix" value="darwin-aarch64">
<and>
<os family="mac"/>
<os arch="aarch64"/>
</and>
</condition>
<condition property="native.prefix" value="darwin-x86-64">
<and>
<os family="mac"/>
<not>
<os arch="aarch64"/>
</not>
</and>
</condition>
<condition property="native.prefix" value="linux-aarch64">
<and>
<os family="unix"/>
<not>
<os family="mac"/>
</not>
<os arch="aarch64"/>
</and>
</condition>
<condition property="native.prefix" value="linux-x86-64">
<and>
<os family="unix"/>
<not>
<os family="mac"/>
</not>
<not>
<os arch="aarch64"/>
</not>
</and>
</condition>
<mkdir dir="${project.build.outputDirectory}/${native.prefix}"/>
<copy todir="${project.build.outputDirectory}/${native.prefix}">
<fileset dir="${project.build.directory}/CLD2/build">
<include name="libcld2.*"/>
<include name="libcld2_full.*"/>
</fileset>
</copy>
</target>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>3.5.4</version>
<configuration>
<environmentVariables>
<LD_PRELOAD>${project.build.directory}/classes/${os.detected.classifier}/libcld2_full.so</LD_PRELOAD>
</environmentVariables>
</configuration>
</plugin>
</plugins>
</build>
</profile>
</profiles>


<dependencies>

12 changes: 9 additions & 3 deletions src/main/java/org/commoncrawl/langdetect/cld2/CLDHints.java
@@ -57,10 +57,16 @@ public class CLDHints extends Structure {
    public int encoding_hint = Encoding.UNKNOWN_ENCODING.value();

    /** ITALIAN boosts it */
    public int language_hint = Language.UNKNOWN_LANGUAGE.value();
    public int language_hint;

    protected static CLDHints NO_HINTS = new CLDHints(null, "",
            Encoding.UNKNOWN_ENCODING.value(), Language.UNKNOWN_LANGUAGE.value());
    private static CLDHints noHints;

    public static CLDHints getNoHints() {
        if (noHints == null) {
            noHints = new CLDHints(null, "", Encoding.UNKNOWN_ENCODING.value(), Language.UNKNOWN_LANGUAGE.value());
        }
        return noHints;
    }

    private static final Pattern DOTPATTERN = Pattern.compile("\\.");
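
The new `getNoHints()` accessor defers construction of the shared instance until first use, but the null check is not synchronized, so two threads calling it at the same time could each construct an instance. If that matters, one possible alternative is the initialization-on-demand holder idiom, sketched below. `HintsSketch` is a hypothetical stand-in class used only to keep the example self-contained; the real `CLDHints` extends JNA's `Structure` and takes constructor arguments.

```java
/** Hypothetical stand-in for CLDHints, used only to demonstrate the idiom. */
final class HintsSketch {

    /** Counts constructions; present for demonstration only. */
    static int constructed = 0;

    private HintsSketch() {
        constructed++;
    }

    /**
     * The nested class is initialized by the JVM only when getNoHints() first
     * touches it, and class initialization is thread-safe, so exactly one
     * instance is created without explicit locking.
     */
    private static final class Holder {
        static final HintsSketch NO_HINTS = new HintsSketch();
    }

    static HintsSketch getNoHints() {
        return Holder.NO_HINTS;
    }
}
```

This keeps the lazy behaviour of the accessor while removing the unsynchronized check.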
