Data Connectors

Playwright-based data connectors for DataConnect. Each connector exports a user's data from a web platform using browser automation — credentials never leave the device.

Connectors

Platform    Company    Runtime      Scopes
ChatGPT     OpenAI     playwright   conversations, memories
Instagram   Meta       playwright   profile, posts, liked_posts
LinkedIn    LinkedIn   playwright   profile, experience, education, skills
Spotify     Spotify    playwright   savedTracks, playlists

Repository structure

connectors/
├── registry.json                  # Central registry (checksums, versions)
├── types/
│   └── connector.d.ts             # TypeScript type definitions
├── schemas/                       # JSON schemas for exported data
│   ├── chatgpt.conversations.json
│   └── ...
├── openai/
│   ├── chatgpt-playwright.js      # Connector script
│   └── chatgpt-playwright.json    # Metadata
├── linkedin/
│   ├── linkedin-playwright.js
│   └── linkedin-playwright.json
├── meta/
│   ├── instagram-playwright.js
│   └── instagram-playwright.json
└── spotify/
    ├── spotify-playwright.js
    └── spotify-playwright.json

Each connector consists of two files inside a <company>/ directory:

  • <name>-playwright.js — the connector script (plain JS, runs inside the Playwright runner sidecar)
  • <name>-playwright.json — metadata (display name, login URL, selectors, scopes)

How connectors work

Connectors run in a sandboxed Playwright browser managed by the DataConnect app. The runner provides a page API object (not raw Playwright). The browser starts headless; connectors call page.showBrowser() when login is needed and page.goHeadless() after.

Two-phase architecture

Phase 1 — Login (visible browser)

  1. Navigate to the platform's login page (headless)
  2. Check if the user is already logged in via persistent session
  3. If not, show the browser so the user can log in manually
  4. Extract auth tokens/cookies once logged in

Phase 2 — Data collection (headless)

  1. Switch to headless mode (browser disappears)
  2. Fetch data via API calls, network capture, or DOM scraping
  3. Report structured progress to the UI
  4. Return the collected data with an export summary
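
Condensed into page API calls, the two phases look roughly like the sketch below. The URL and LOGGED_IN_SELECTOR are placeholders; the full template appears under "Building a new connector" below.

// Sketch only: replace the URL and LOGGED_IN_SELECTOR with platform-specific values
(async () => {
  // Phase 1: rely on the persistent session first, show the browser only if needed
  await page.goto('https://platform.example.com/login');
  const loggedIn = await page.evaluate(`!!document.querySelector('LOGGED_IN_SELECTOR')`);
  if (!loggedIn) {
    await page.showBrowser('https://platform.example.com/login');
    await page.promptUser(
      'Please log in, then click "Done".',
      async () => page.evaluate(`!!document.querySelector('LOGGED_IN_SELECTOR')`),
      2000
    );
  }

  // Phase 2: go invisible, collect data, hand the result to the host
  await page.goHeadless();
  // Assumes the runner resolves promises returned from evaluate, as Playwright does
  const profile = await page.evaluate(`fetch('/api/me').then(r => r.json())`);
  await page.setData('result', {
    profile,
    exportSummary: { count: 1, label: 'profile' },
  });
})();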

Data extraction patterns

Pattern                                      When to use                                           Example connector
API fetch via page.evaluate()                Platform has REST/JSON APIs                           openai/chatgpt-playwright.js
Network capture via page.captureNetwork()    Platform uses GraphQL/XHR that fires on navigation    meta/instagram-playwright.js
DOM scraping via page.evaluate()             No API available; data only in rendered HTML          linkedin/linkedin-playwright.js
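
As a rough sketch of the network-capture pattern: the URL pattern, key, and the shape of the captured response below are illustrative assumptions, not taken from the Instagram connector.

// Register the capture before navigating, so the matching request is recorded
await page.captureNetwork({ urlPattern: '/api/v1/feed', key: 'feed' });
await page.goto('https://platform.example.com/feed');
await page.sleep(3000); // give the page's XHRs time to fire

// The exact shape of the captured response is runner-specific; treat this as a placeholder
const captured = await page.getCapturedResponse('feed'); // null if nothing matched
if (captured) {
  // ...parse the captured JSON and push it into your items array...
}
await page.clearNetworkCaptures();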

Building a new connector

1. Create the metadata file

Create connectors/<company>/<name>-playwright.json:

{
  "id": "<name>-playwright",
  "version": "1.0.0",
  "name": "Platform Name",
  "company": "Company",
  "description": "Exports your ... using Playwright browser automation.",
  "connectURL": "https://platform.com/login",
  "connectSelector": "css-selector-for-logged-in-state",
  "exportFrequency": "daily",
  "runtime": "playwright",
  "vectorize_config": { "documents": "field_name" }
}
  • runtime must be "playwright"
  • connectURL is where the browser navigates initially
  • connectSelector detects whether the user is logged in (e.g. an element only visible post-login)

2. Create the connector script

Create connectors/<company>/<name>-playwright.js:

// State management
const state = { isComplete: false };

// ─── Login check ──────────────────────────────────────
const checkLoginStatus = async () => {
  try {
    return await page.evaluate(`
      (() => {
        const hasLoggedInEl = !!document.querySelector('LOGGED_IN_SELECTOR');
        const hasLoginForm = !!document.querySelector('LOGIN_FORM_SELECTOR');
        return hasLoggedInEl && !hasLoginForm;
      })()
    `);
  } catch { return false; }
};

// ─── Main flow ────────────────────────────────────────
(async () => {
  // Phase 1: Login
  await page.setData('status', 'Checking login status...');
  await page.sleep(2000);

  if (!(await checkLoginStatus())) {
    await page.showBrowser('https://platform.com/login');
    await page.setData('status', 'Please log in...');
    await page.promptUser(
      'Please log in. Click "Done" when ready.',
      async () => await checkLoginStatus(),
      2000
    );
  }

  // Phase 2: Headless data collection
  await page.goHeadless();

  await page.setProgress({
    phase: { step: 1, total: 2, label: 'Fetching profile' },
    message: 'Loading profile data...',
  });

  // ... fetch your data here ...
  const items = [];

  // Build result (exportSummary is required)
  const result = {
    items,
    exportSummary: {
      count: items.length,
      label: items.length === 1 ? 'item' : 'items',
    },
    timestamp: new Date().toISOString(),
    version: '1.0.0-playwright',
    platform: 'platform-name',
  };

  state.isComplete = true;
  await page.setData('result', result);
})();

3. Add a data schema (optional)

Create connectors/schemas/<platform>.<scope>.json to describe the exported data format:

{
  "name": "Platform Items",
  "version": "1.0.0",
  "scope": "platform.items",
  "dialect": "json",
  "description": "Description of the exported data",
  "schema": {
    "type": "object",
    "properties": {
      "items": {
        "type": "array",
        "items": {
          "properties": {
            "id": { "type": "string" },
            "title": { "type": "string" }
          },
          "required": ["id", "title"]
        }
      }
    },
    "required": ["items"]
  }
}

4. Update the registry

Add your connector to registry.json. Generate checksums with:

shasum -a 256 <company>/<name>-playwright.js | awk '{print "sha256:" $1}'
shasum -a 256 <company>/<name>-playwright.json | awk '{print "sha256:" $1}'

Then add an entry to the connectors array:

{
  "id": "<name>-playwright",
  "company": "<company>",
  "version": "1.0.0",
  "name": "Platform Name",
  "description": "...",
  "files": {
    "script": "<company>/<name>-playwright.js",
    "metadata": "<company>/<name>-playwright.json"
  },
  "checksums": {
    "script": "sha256:<hash>",
    "metadata": "sha256:<hash>"
  }
}

Page API reference

The page object is available as a global in connector scripts:

Method                                               Description
page.evaluate(jsString)                              Run JS in the browser context and return the result
page.goto(url)                                       Navigate to a URL
page.sleep(ms)                                       Wait for the given number of milliseconds
page.setData(key, value)                             Send data to the host ('status', 'error', 'result')
page.setProgress({phase, message, count})            Report structured progress to the UI
page.showBrowser(url?)                               Switch to headed mode (visible browser)
page.goHeadless()                                    Switch to headless mode (invisible)
page.promptUser(msg, checkFn, interval)              Show a prompt and poll checkFn until it returns truthy
page.captureNetwork({urlPattern, bodyPattern, key})  Register a network capture
page.getCapturedResponse(key)                        Get a captured response, or null if none
page.clearNetworkCaptures()                          Clear all network captures
page.closeBrowser()                                  Close the browser but keep the process alive for HTTP work

Progress reporting

await page.setProgress({
  phase: { step: 1, total: 3, label: 'Fetching memories' },
  message: 'Downloaded 50 of 200 items...',
  count: 50,
});
  • phase.step / phase.total — drives the step indicator ("Step 1 of 3")
  • phase.label — short label for the current phase
  • message — human-readable progress text
  • count — numeric count for progress tracking

Testing locally

Prerequisites

  • DataConnect cloned and able to run (npm run tauri:dev)

Setup

  1. Clone this repo alongside DataConnect:
git clone https://github.com/vana-com/data-connectors.git
  2. Point DataConnect to your local connectors during development:
# From the DataConnect repo
CONNECTORS_PATH=../data-connectors npm run tauri:dev

The CONNECTORS_PATH environment variable tells the fetch script to skip downloading and use your local directory instead.

  3. After editing connector files, sync them to the app's runtime directory:
# From the DataConnect repo
node scripts/sync-connectors-dev.js

This copies your connector files to ~/.dataconnect/connectors/ where the running app reads them. The app checks this directory first, so your local edits take effect without rebuilding.

Iteration loop

  1. Edit your connector script
  2. Run node scripts/sync-connectors-dev.js (from the DataConnect repo)
  3. Click the connector in the app to test
  4. Check logs in ~/Library/Logs/DataConnect/ (macOS) for debugging

Contributing

Adding a new connector

  1. Fork this repo
  2. Create a branch: git checkout -b feat/<platform>-connector
  3. Add your files in connectors/<company>/:
    • <name>-playwright.js — connector script
    • <name>-playwright.json — metadata
    • schemas/<platform>.<scope>.json — data schema (optional but encouraged)
  4. Test locally using the instructions above
  5. Update registry.json with your connector entry and checksums
  6. Open a pull request

Modifying an existing connector

  1. Fork and branch
  2. Make your changes to the connector script and/or metadata
  3. Test locally
  4. Update the version in the metadata JSON
  5. Regenerate checksums and update registry.json
  6. Open a pull request

Guidelines

  • Credentials stay on-device. Connectors run in a local browser. Never send tokens or passwords to external servers.
  • Use page.setProgress() to report progress. Users should see what's happening during long exports.
  • Include exportSummary in the result. The UI uses it to display what was collected.
  • Handle errors gracefully. Use page.setData('error', message) and provide clear error messages.
  • Prefer API fetch over DOM scraping when the platform has usable APIs. APIs are more stable than DOM structure.
  • Avoid relying on CSS class names — many platforms obfuscate them. Use structural selectors, heading text, and content heuristics instead.
  • Rate-limit API calls. Add page.sleep() between requests to avoid triggering rate limits.
  • Test pagination edge cases — empty results, single page, large datasets.
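
As an illustration of the last few points, a paginated fetch loop might look like the sketch below; the endpoint, cursor field, page size, and delay are placeholders, not a real platform API.

// Illustrative pagination loop: endpoint, cursor field, and page size are made up
try {
  const items = [];
  let cursor = null;
  do {
    const url = '/api/items?limit=50' + (cursor ? '&cursor=' + cursor : '');
    // Assumes the runner resolves promises returned from evaluate, as Playwright does
    const batch = await page.evaluate(`fetch('${url}').then(r => r.json())`);
    items.push(...(batch.items || []));
    cursor = batch.nextCursor || null;

    await page.setProgress({
      phase: { step: 2, total: 2, label: 'Fetching items' },
      message: `Downloaded ${items.length} items...`,
      count: items.length,
    });
    await page.sleep(1000); // back off between requests to respect rate limits
  } while (cursor);
} catch (err) {
  await page.setData('error', `Export failed: ${err.message}`);
}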

Registry checksums

The registry uses SHA-256 checksums to verify file integrity during OTA updates. Always regenerate checksums when modifying connector files:

shasum -a 256 <company>/<name>-playwright.js | awk '{print "sha256:" $1}'
shasum -a 256 <company>/<name>-playwright.json | awk '{print "sha256:" $1}'

How the registry works

DataConnect fetches registry.json from this repo on app startup and during npm postinstall. For each connector listed:

  1. Check if local files exist with matching checksums
  2. If not, download from baseUrl/<file_path> (this repo's raw GitHub URL)
  3. Verify SHA-256 checksums match
  4. Write to local connectors/ directory

This enables OTA connector updates without requiring a full app release.
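
For illustration, the verification step amounts to something like the Node sketch below. This is not DataConnect's actual fetch script; it assumes a registry entry shaped like the example above.

// Illustrative checksum check using Node's built-in crypto and fs modules
const crypto = require('crypto');
const fs = require('fs');

function verifyChecksum(filePath, expected) {
  // expected looks like "sha256:<hash>", as stored in registry.json
  const hash = crypto.createHash('sha256').update(fs.readFileSync(filePath)).digest('hex');
  return `sha256:${hash}` === expected;
}

// e.g. verifyChecksum('openai/chatgpt-playwright.js', entry.checksums.script)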
