143 changes: 24 additions & 119 deletions README.md
@@ -1,141 +1,46 @@
# MAILSIEVE

MAILSIEVE is a command-line tool for discovering publicly listed business email addresses from domains, with an emphasis on **rate-limiting, resumability, and evidence logging**.
## Purpose

It is designed for research, compliance checks, and operational workflows where **polite crawling and auditability** matter.
External operational tooling for intake, triage, and routing (support ops).

---
## Status

## Features
- **Stability**: Experimental
- **SemVer**: Not guaranteed until v1.0.0
- **Security**: See **Security** section below

- Domain-based email discovery
- Safe resume via `processed.txt`
- Polite rate-limiting and concurrency controls
- CSV output (append-only)
- Evidence logging (GDPR-trimmed)
- Parallel execution with controlled fan-out
## Scope

---
- What this repo is responsible for
- What it explicitly does **not** do

## Installation

Requires **Node.js ≥ 18**.
## Quickstart

```bash
git clone https://github.com/midiakiasat/MAILSIEVE.git
# clone
git clone https://github.com/Verifrax/MAILSIEVE.git
cd MAILSIEVE
npm install
chmod +x batch-run.sh
````

---

## Basic Usage

### 1. Prepare input domains

Create a file named `domains.txt`:

```txt
example.com
https://anotherdomain.it
www.somedomain.org
```

One domain per line.
URLs are normalized automatically.
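
The README does not spell out the normalization rules, so the following is only a minimal sketch of what such a step might look like. The function name `normalizeDomain` is hypothetical, and the choice to strip a leading `www.` is an assumption rather than documented MAILSIEVE behavior.

```javascript
// Hypothetical helper: MAILSIEVE's actual normalization rules are not documented here.
function normalizeDomain(line) {
  const trimmed = line.trim();
  if (trimmed === '') return null; // ignore blank lines

  // Add a scheme when one is missing so the WHATWG URL parser accepts bare hosts.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed);
  const withScheme = hasScheme ? trimmed : `http://${trimmed}`;

  let host;
  try {
    host = new URL(withScheme).hostname;
  } catch {
    return null; // not parseable as a URL
  }

  // Assumption: a leading "www." is dropped so that equivalent entries collapse.
  return host.toLowerCase().replace(/^www\./, '');
}

const sample = ['example.com', 'https://anotherdomain.it', 'www.somedomain.org'];
console.log(sample.map(normalizeDomain));
// → [ 'example.com', 'anotherdomain.it', 'somedomain.org' ]
```

Returning `null` for unparseable lines lets a caller skip bad input instead of aborting the whole batch.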

---

### 2. Run the batch processor

```bash
./batch-run.sh
```

MAILSIEVE will:

* skip domains already listed in `processed.txt`
* append results to `results.csv`
* log trimmed evidence to `logs/evidence.jsonl`
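
The skip-and-append behavior above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: the helper names `pendingDomains` and `recordResult`, the two-column CSV layout, and the placeholder email are all hypothetical. Evidence logging is left out because the README does not describe the record format.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Returns the domains that still need processing, given the text of both files.
function pendingDomains(domainsText, processedText) {
  const toLines = (text) =>
    text.split(/\r?\n/).map((line) => line.trim()).filter(Boolean);
  const done = new Set(toLines(processedText));
  return toLines(domainsText).filter((domain) => !done.has(domain));
}

// Append-only bookkeeping: write the result row first, then mark the domain done.
// Assumption: results.csv holds "domain,email" rows; the real columns may differ.
function recordResult(dir, domain, email) {
  fs.appendFileSync(path.join(dir, 'results.csv'), `${domain},${email}\n`);
  fs.appendFileSync(path.join(dir, 'processed.txt'), `${domain}\n`);
}

// Demo in a temporary directory so no real run files are touched.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'mailsieve-demo-'));
fs.writeFileSync(path.join(dir, 'processed.txt'), 'example.com\n');

const processedText = fs.readFileSync(path.join(dir, 'processed.txt'), 'utf8');
const pending = pendingDomains('example.com\nanotherdomain.it\n', processedText);

for (const domain of pending) {
  recordResult(dir, domain, `info@${domain}`); // placeholder email, not a real finding
}
console.log(pending); // → [ 'anotherdomain.it' ]
```

Writing the result before updating `processed.txt` means an interrupted run may repeat a domain, but it should not silently drop one.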

---

### 3. Check progress

```bash
wc -l domains.txt processed.txt results.csv
tail -n 5 results.csv
# install (adjust if needed)
# (placeholder) npm install / pnpm install / go test ./... / etc.
```

---

## Configuration (Environment Variables)

You can tune behavior without editing code:

### Concurrency

```bash
POLITE_CONCURRENCY=3 ./batch-run.sh
```

### Slower / safer crawling

```bash
RATE_MS=1500 TIMEOUT_MS=20000 POLITE_CONCURRENCY=1 ./batch-run.sh
```

### Verbose output

```bash
QUIET_ENV=0 ./batch-run.sh
```
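
For orientation, a script could read these variables along the lines of the sketch below. The variable names come from the examples above; the fallback values in `ASSUMED_DEFAULTS`, the `readConfig` helper, and the reading of `QUIET_ENV=0` as "verbose on" are assumptions, since the README does not state the real defaults.

```javascript
// Hypothetical fallbacks: the defaults MAILSIEVE actually uses are not stated in the README.
const ASSUMED_DEFAULTS = { POLITE_CONCURRENCY: 2, RATE_MS: 1000, TIMEOUT_MS: 15000 };

function readConfig(env = process.env) {
  const config = {};
  for (const [name, fallback] of Object.entries(ASSUMED_DEFAULTS)) {
    const parsed = Number.parseInt(env[name] ?? '', 10);
    // Fall back when the variable is unset, non-numeric, or not positive.
    config[name] = Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
  }
  // Assumption: output is quiet unless QUIET_ENV is exactly "0".
  config.quiet = env.QUIET_ENV !== '0';
  return config;
}

console.log(readConfig({ RATE_MS: '1500', TIMEOUT_MS: '20000', POLITE_CONCURRENCY: '1' }));
// → { POLITE_CONCURRENCY: 1, RATE_MS: 1500, TIMEOUT_MS: 20000, quiet: true }
```

Validating each value and falling back silently keeps a typo in one variable from stopping a long batch; a stricter tool might prefer to exit with an error instead.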

---

## Reset a Run

To start fresh:

```bash
rm -f processed.txt results.csv
rm -rf .cache/http
rm -f logs/evidence.jsonl
```

Then rerun:

```bash
./batch-run.sh
```

---

## Output Files

| File | Purpose |
| --------------------- | ------------------------- |
| `results.csv` | Discovered emails |
| `processed.txt` | Domains already processed |
| `logs/evidence.jsonl` | Minimal evidence trail |

---

## Legal & Ethical Use
## Repository layout

MAILSIEVE **only processes publicly available information**.
- `/` Root sources
- `/.github/` Issue + PR templates
- `/docs/` Documentation (if present)

You are responsible for ensuring that your usage complies with:
## Security

* local laws and regulations
* website terms of service
* data protection frameworks (e.g. GDPR)
- Report vulnerabilities privately: **security@verifrax.org**
- Do **not** open public issues for sensitive findings

This tool is provided **as-is**, without warranty.
## Contributing

---
See `CONTRIBUTING.md`.

## License

See [`LICENSE`](./LICENSE).
MIT. See `LICENSE`.