00-what

Secrets detection.

Data prep

We want to prepare training data in which every line is labeled as either plain or secret.

For secret data, we can use the included keygen tool to generate a large number of random keys, which can be treated as secret/password-like data.

Then we can use common shell capabilities to prepare the data and make it conform to the FastText input format, where each line starts with a __label__ prefix (we use FastText as the example for training the secret detection model). For example:

% node keygen/index.js  # by default, generates 1 million keys for each of 10 strengths
% cat keygen/data/*.txt | sed 's/^/__label__secret /' > data/secrets.txt

For plain data, we can use https://github.com/dwyl/english-words as a basis and label it the same way:

% cat english-words/words.txt | sed 's/^/__label__plain /' > data/plain.txt
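
Before merging, it can be worth checking that every line carries one of the two expected labels and seeing how many examples each class has. The following is a minimal sketch, not part of the repository: count_labels is a hypothetical helper name, and it assumes the label is the first whitespace-separated field of each line, as produced by the sed commands above.

```shell
# Hypothetical helper: print "<label> <count>" for every distinct first field
# found in the given files, sorted by label.
count_labels() {
  awk '{ counts[$1]++ } END { for (label in counts) print label, counts[label] }' "$@" | sort
}
```

Calling count_labels data/secrets.txt data/plain.txt should list only __label__plain and __label__secret; any other first field points to a line that was not labeled correctly.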

Then we merge, shuffle, and split the data into training (80%) and validation (20%) sets.
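
The merge, shuffle, and split step could be sketched in shell as follows. This is one possible approach under stated assumptions, not the repository's own script: split_dataset and the output file names all.txt, train.txt, and valid.txt are hypothetical, and shuf is assumed to be available (it ships with GNU coreutils and may need to be installed separately on macOS).

```shell
# Hypothetical helper: merge two labeled files, shuffle the lines,
# and write an 80/20 train/validation split into an output directory.
# Usage: split_dataset <secrets_file> <plain_file> <out_dir>
split_dataset() {
  sd_secrets="$1"
  sd_plain="$2"
  sd_out="$3"
  mkdir -p "$sd_out"

  # Merge both classes and shuffle so the labels are interleaved.
  cat "$sd_secrets" "$sd_plain" | shuf > "$sd_out/all.txt"

  # The first 80% of the shuffled lines (rounded down) become the training set.
  sd_total=$(wc -l < "$sd_out/all.txt")
  sd_train=$(( sd_total * 80 / 100 ))
  head -n "$sd_train" "$sd_out/all.txt" > "$sd_out/train.txt"

  # The remaining lines become the validation set.
  tail -n +"$(( sd_train + 1 ))" "$sd_out/all.txt" > "$sd_out/valid.txt"
}
```

With the files prepared above, this would be called as split_dataset data/secrets.txt data/plain.txt data. Because shuf reads the whole input into memory, very large inputs may call for a different shuffling strategy.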
