A comprehensive project for analyzing Turkish word roots using Wikipedia articles, TDK dictionary data, and large language models (Gemma/Gemini). The project aims to identify and validate Turkish word roots through various methods.
This project combines multiple approaches to analyze Turkish word roots:
- Wikipedia corpus analysis
- Turkish dictionary (TDK) integration
- Large Language Model assistance
- Suffix stripping algorithms
- wikipedia_kok.ipynb:
  - Loads the Turkish Wikipedia dataset (~535K articles)
  - Implements the suffix stripping algorithm
  - Processes text in parallel for efficiency
  - Generates root candidates from the Wikipedia corpus
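As a sketch of the word-extraction step, the snippet below pulls lowercase Turkish word tokens out of article text and processes articles concurrently. The regular expression and function names are illustrative, not the notebook's actual code, and `ThreadPoolExecutor` stands in for the joblib-based multi-process parallelism the notebook uses.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Lowercase letters of the Turkish alphabet.
TURKISH_WORD = re.compile(r"[abcçdefgğhıijklmnoöprsştuüvyz]+")

def extract_words(text):
    """Pull lowercase Turkish word tokens out of one article."""
    # str.lower() is not fully Turkish-aware ('I' -> 'i', not 'ı'),
    # which is acceptable for a rough candidate pass.
    return TURKISH_WORD.findall(text.lower())

def extract_corpus_words(articles, workers=4):
    """Tokenize many articles concurrently and flatten the result.
    (The notebook uses joblib for true multi-process parallelism.)"""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_article = pool.map(extract_words, articles)
    return [w for words in per_article for w in words]

words = extract_corpus_words(["Türkçe sondan eklemeli bir dildir."])
```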
- gemma_kokleri_sec.ipynb:
  - Uses the Gemma 2 (27B) model for root validation
  - Implements a comprehensive Turkish suffix list (249 suffixes)
  - Processes words in chunks for efficient analysis
  - Filters out non-Turkish words and proper nouns
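A minimal version of the longest-suffix-first stripping this step relies on might look as follows; the function and parameter names are hypothetical, and the real notebook works with the full 249-suffix list.

```python
def strip_to_root(word, suffixes, known_roots, min_len=2):
    """Greedily strip suffixes (longest first) until a known root is
    reached; return None if no known root can be recovered."""
    # Try longer suffixes before shorter ones so '-imiz' beats '-iz'.
    ordered = sorted(suffixes, key=len, reverse=True)
    current = word
    while current not in known_roots:
        for suffix in ordered:
            stripped = current[:-len(suffix)]
            if current.endswith(suffix) and len(stripped) >= min_len:
                current = stripped
                break  # restart from the longest suffix again
        else:
            return None  # no suffix matched, no known root reached
    return current
```

For example, `strip_to_root("evlerden", {"ler", "den"}, {"ev"})` peels off `-den` and then `-ler` to reach the root `ev`.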
- gemini_kokleri_duzelt.ipynb:
  - Uses Google's Gemini model for root verification
  - Processes and validates root candidates
  - Combines results with the existing root dictionary
  - Saves validated roots in JSON format
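The shape of this verification step can be sketched with the `google-generativeai` SDK. The prompt wording, the `word: EVET/HAYIR` reply format (Turkish for yes/no), and the model name are assumptions for illustration, not the notebook's actual choices.

```python
def parse_validation(reply):
    """Parse a hypothetical 'word: EVET/HAYIR' reply format
    (EVET = yes, the word is a root; HAYIR = no)."""
    validated = []
    for line in reply.strip().splitlines():
        if ":" not in line:
            continue  # skip any chatter the model adds around the list
        word, verdict = (part.strip() for part in line.split(":", 1))
        if verdict.upper().startswith("EVET"):
            validated.append(word)
    return validated

def validate_with_gemini(candidates, model_name="gemini-1.5-flash"):
    """Guarded API call; needs `pip install google-generativeai` and a
    prior genai.configure(api_key=...)."""
    import google.generativeai as genai
    model = genai.GenerativeModel(model_name)
    # Prompt asks (in Turkish) which of the words are Turkish roots,
    # one 'word: EVET' or 'word: HAYIR' verdict per line.
    prompt = (
        "Aşağıdaki kelimelerden hangileri Türkçe kök? Her satıra "
        "'kelime: EVET' ya da 'kelime: HAYIR' yaz.\n" + "\n".join(candidates)
    )
    return parse_validation(model.generate_content(prompt).text)
```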
- TDK dictionary database (sozluk/):
  - Provides word meanings and etymology
  - Contains 8 related tables, including:
    - madde (words)
    - anlam (meanings)
    - ornek (examples)
    - ozellik (properties)
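The tables can be queried with Python's built-in `sqlite3` module. The column names below (`madde_id`, `madde`, `anlam`) are assumptions for this sketch; the real TDK schema may differ.

```python
import sqlite3

# Build a tiny in-memory stand-in for the TDK database.
# Column names are assumptions, not the actual TDK schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE madde (madde_id INTEGER PRIMARY KEY, madde TEXT);
    CREATE TABLE anlam (anlam_id INTEGER PRIMARY KEY,
                        madde_id INTEGER REFERENCES madde(madde_id),
                        anlam TEXT);
    INSERT INTO madde VALUES (1, 'ev');
    INSERT INTO anlam VALUES (1, 1, 'yapı, konut');  -- 'building, dwelling'
""")

def lookup(conn, word):
    """Return all recorded meanings of a headword."""
    rows = conn.execute(
        "SELECT a.anlam FROM madde m JOIN anlam a USING (madde_id) "
        "WHERE m.madde = ?", (word,)
    )
    return [anlam for (anlam,) in rows]

meanings = lookup(conn, "ev")
```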
Due to GitHub's file size limitations, large files have been split into smaller chunks:
The TDK dictionary data (sozluk/gts.json, ~104 MB) is split into three parts:
sozluk/gts.json.part_aa
sozluk/gts.json.part_ab
sozluk/gts.json.part_ac
To merge these files into the original:
# On Unix-like systems (Linux/MacOS):
cat sozluk/gts.json.part_* > sozluk/gts.json
# On Windows (PowerShell):
Get-Content sozluk/gts.json.part_* -Raw | Set-Content sozluk/gts.json -NoNewline
# On Windows (Command Prompt):
copy /b sozluk\gts.json.part_* sozluk\gts.json
For contributors who need to split large files:
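If you prefer a single cross-platform route, the same merge can be done in a few lines of Python; `merge_parts` is a helper written for this sketch, not part of the repository. Binary mode keeps the result byte-identical regardless of platform and line endings.

```python
from pathlib import Path

def merge_parts(prefix, out_path):
    """Concatenate <prefix>.part_* files, sorted by name, into out_path."""
    prefix = Path(prefix)
    parts = sorted(prefix.parent.glob(prefix.name + ".part_*"))
    with open(out_path, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())

# merge_parts("sozluk/gts.json", "sozluk/gts.json")
```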
# On Unix-like systems:
split -b 50M large_file.json large_file.json.part_
# On Windows (PowerShell):
$file = [IO.File]::ReadAllBytes("large_file.json")
$size = 50MB
$parts = [Math]::Ceiling($file.Length / $size)
for ($i = 0; $i -lt $parts; $i++) {
    $start = $i * $size
    $end = [Math]::Min($start + $size - 1, $file.Length - 1)
    # Zero-pad the index so part_* globbing keeps the parts in order
    $name = "large_file.json.part_{0:d2}" -f $i
    [IO.File]::WriteAllBytes($name, [byte[]]$file[$start..$end])
}
- kokler.txt: Base dictionary of known Turkish roots
- all_roots.json: Combined validated roots
- sorted_remains_words_freq.json: Frequency analysis of remaining words
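As an illustration of the frequency file (with made-up placeholder words, not the project's actual data), it can be produced from the unvalidated leftovers like this:

```python
import json
from collections import Counter

# Hypothetical leftovers after validation -- placeholder data only.
remaining = ["gidecekler", "gidecekler", "koşarak"]

# Sort by descending count, as the name sorted_remains_words_freq
# suggests, and keep that order in the JSON object.
freq = dict(sorted(Counter(remaining).items(), key=lambda kv: -kv[1]))
with open("sorted_remains_words_freq.json", "w", encoding="utf-8") as f:
    json.dump(freq, f, ensure_ascii=False, indent=2)
```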
Contains 249 Turkish suffixes including:
- Case endings
- Possessive markers
- Verb tenses
- Derivational suffixes
- Text extraction from Wikipedia
- Suffix stripping
- Root candidate generation
- LLM validation
- Dictionary verification
- Frequency analysis
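The six steps above can be chained into one toy pipeline. Every helper here is a drastic simplification written for this sketch (naive tokenization, validation injected as a callable); the real notebooks implement each stage far more carefully.

```python
from collections import Counter

def run_pipeline(articles, suffixes, known_roots, validate):
    """End-to-end toy version of the six pipeline steps."""
    # 1. Text extraction (naive whitespace split for the sketch).
    words = [w.strip(".,").lower() for text in articles for w in text.split()]
    # 2-3. Suffix stripping to generate root candidates.
    ordered = sorted(suffixes, key=len, reverse=True)
    stripped_words = []
    for word in words:
        changed = True
        while changed:  # keep stripping until no suffix applies
            changed = False
            for suffix in ordered:
                if word.endswith(suffix) and len(word) > len(suffix) + 1:
                    word = word[: -len(suffix)]
                    changed = True
        stripped_words.append(word)
    # 4-5. Validation: known dictionary roots plus an injected
    #      LLM/dictionary check (a callable, so the sketch stays offline).
    validated = {w for w in set(stripped_words)
                 if w in known_roots or validate(w)}
    # 6. Frequency analysis of whatever was not validated.
    remains = Counter(w for w in stripped_words if w not in validated)
    return validated, remains

validated, remains = run_pipeline(
    ["Evlerden kitaplar."],
    suffixes={"ler", "lar", "den"},
    known_roots={"ev", "kitap"},
    validate=lambda w: False,
)
```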
- Initial known roots: 10,470
- Total discovered forms: ~4 million
- Validated unique roots: 21,756
- Processing time: ~21 minutes with parallel processing
- Python 3.9+
- Required packages:
- pandas
- datasets
- joblib
- sqlite3 (bundled with Python's standard library)
- ollama (for Gemma)
- google-generativeai (for Gemini)
- Clone the repository:
git clone https://github.com/yourusername/turkce_kokler.git
- Merge the dictionary data:
# On Unix-like systems:
cat sozluk/gts.json.part_* > sozluk/gts.json
# On Windows (PowerShell):
Get-Content sozluk/gts.json.part_* -Raw | Set-Content sozluk/gts.json -NoNewline
- Install Python dependencies:
pip install -r requirements.txt
- Make sure the dictionary file is properly merged (see Installation step 2)
- Run the notebooks in order:
- wikipedia_kok.ipynb
- gemma_kokleri_sec.ipynb
- gemini_kokleri_duzelt.ipynb
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to your fork
- Create a Pull Request
Note: When adding large files (>50MB), please split them into smaller chunks:
split -b 50M large_file.json large_file.json.part_
[Add your license information here]
[Add author information here]
- TDK for the Turkish dictionary database
- Wikimedia for the Turkish Wikipedia dataset
- Google and Ollama for the LLM models