Skip to content

betoalien/pardox-php

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PardoX for PHP — High-Performance DataFrame Engine

Packagist License: MIT PHP 8.1+ Powered By Rust

The Speed of Rust. The Simplicity of PHP.

PardoX is a high-performance DataFrame engine for PHP. A single Rust core handles all computation — CSV parsing, arithmetic, database I/O, sorting, GroupBy, Window functions, and more — exposed to PHP through the native FFI extension. No Python. No Node. No middleware.

v0.3.2 is now available. PRDX streaming to PostgreSQL (150M rows validated), GroupBy, String & Date ops, Window functions, Lazy pipeline, SQL over DataFrames, Encryption, Data Contracts, Time Travel, Arrow Flight, Linear Algebra, REST Connector, and 29 total gap features — all from PHP.


⚡ Why PardoX for PHP?

Capability How
Zero-copy ingestion Multi-threaded Rust CSV parser — no PHP string processing
SIMD arithmetic AVX2 / NEON — 5x–20x faster than PHP loops
Native database I/O Rust drivers for PostgreSQL, MySQL, SQL Server, MongoDB — no PHP extensions needed
PRDX Streaming Stream 150M-row files to PostgreSQL at ~145k rows/s with O(block) RAM
GPU sort WebGPU Bitonic sort with transparent CPU fallback
GroupBy + Window Aggregations, rolling, rank, lag/lead — pure Rust
No dependencies Only requires PHP's built-in ext-ffi and ext-json
Cross-platform Linux x64 · Windows x64 · macOS Intel · macOS Apple Silicon

📦 Installation

composer require betoalien/pardox-php

Requirements:

  • PHP 8.1 or higher
  • ext-ffi enabled in php.ini
  • ext-json (standard, usually enabled by default)

Enable the FFI extension:

; php.ini
extension=ffi
ffi.enable=true

🚀 Quick Start

<?php
require_once 'vendor/autoload.php';

use PardoX\DataFrame;
use PardoX\IO;

// Load 100,000 rows — parallel Rust CSV parser
$df = DataFrame::read_csv('./sales_data.csv');
echo "Loaded " . number_format($df->shape[0]) . " rows\n";

// SIMD-accelerated arithmetic
$meanDiscount = $df->mean('discount');
echo sprintf("Mean discount: %.4f\n", $meanDiscount);

// Value frequency table
$stateCounts = $df->value_counts('state');
arsort($stateCounts);
foreach (array_slice($stateCounts, 0, 3, true) as $state => $count) {
    echo "  $state: $count\n";
}

// GroupBy aggregation
$grouped = $df->group_by('state', ['revenue' => 'sum', 'quantity' => 'mean']);

// Write to MySQL
$MYSQL = 'mysql://user:password@localhost:3306/mydb';
IO::executeMysql($MYSQL, 'CREATE TABLE IF NOT EXISTS sales (price DOUBLE, quantity DOUBLE)');
$rows = $df->to_mysql($MYSQL, 'sales', 'append');
echo "Written $rows rows to MySQL\n";

🗄️ What's New in v0.3.2

PRDX Streaming to PostgreSQL

Stream a .prdx file directly to PostgreSQL — without loading the file into RAM. O(block) memory regardless of file size.

use PardoX\IO;

$rows = IO::write_sql_prdx(
    '/data/ventas_150m.prdx',
    'postgresql://user:pass@localhost:5432/mydb',
    'ventas',           // table must already exist
    'append',
    [],
    1000000             // rows per COPY batch
);
echo "Streamed " . number_format($rows) . " rows\n";
// Validated: 150,000,000 rows / 3.8 GB in ~1,032s at ~145k rows/s
Approach RAM used
read_prdx() then to_sql() Entire file (3.8 GB for 150M rows)
IO::write_sql_prdx() O(one block) — typically < 200 MB

GroupBy Aggregation

// Single aggregation
$grouped = $df->group_by('category', ['revenue' => 'sum']);

// Multiple aggregations
$grouped = $df->group_by('state', [
    'revenue'  => 'sum',
    'price'    => 'mean',
    'quantity' => 'count',
]);

String & Date Operations

// String ops
$df->str_upper('name');
$df->str_lower('email');
$df->str_trim('description');
$df->str_contains('tag', 'php');
$df->str_replace('status', 'old', 'new');

// Date ops
$df->date_extract('created_at', 'year');   // → 'result_year'
$df->date_format('created_at', '%Y-%m');
$df->date_diff('end_date', 'start_date');
$df->date_add('created_at', 30, 'day');

Window Functions

$df->row_number('price');
$df->rank('revenue', 'dense');
$df->lag('price', 1);
$df->lead('price', 1);
$df->rolling_mean('price', 7);   // 7-period moving average

Lazy Pipeline

use PardoX\IO;

// Scan without loading — filter and collect on demand
$result = IO::lazy_scan_csv('./large_file.csv')
    ->select(['id', 'price', 'state'])
    ->filter('price', '>', 100.0)
    ->limit(10000)
    ->collect();

echo $result->shape[0] . " rows\n";

SQL over DataFrames

// Register a DataFrame and run SQL against it
$df = DataFrame::read_csv('./orders.csv');
$result = $df->sql('SELECT state, SUM(revenue) as total FROM df GROUP BY state');

Encryption

use PardoX\IO;

// Write encrypted PRDX
IO::write_prdx_encrypted('./secure.prdx', $df, 'my-secret-key');

// Read back
$df = IO::read_prdx_encrypted('./secure.prdx', 'my-secret-key');

Data Contracts

$contract = json_encode([
    'source'  => 'orders',
    'columns' => [
        'price'    => ['min' => 0.0, 'max' => 10000.0],
        'status'   => ['allowed_values' => ['active', 'pending', 'closed']],
        'customer' => ['nullable' => false],
    ]
]);

// Returns a new DataFrame with only conforming rows
$clean = $df->validate_contract($contract);
$violations = $df->contract_violation_count();
echo "$violations rows quarantined\n";

🗄️ Database I/O

1. Relational Conqueror — PostgreSQL, MySQL, SQL Server, MongoDB

use PardoX\DataFrame;
use PardoX\IO;

// ── PostgreSQL ────────────────────────────────────────────────
$PG = 'postgresql://user:pass@localhost:5432/mydb';

$df = IO::read_sql($PG, 'SELECT * FROM orders WHERE status = \'active\'');
IO::executeSql($PG, 'CREATE TABLE orders_archive (id BIGINT, amount FLOAT, region TEXT)');

// COPY FROM STDIN auto-activated for > 10,000 rows
$rows = $df->to_sql($PG, 'orders_archive', 'append');

// Upsert — INSERT … ON CONFLICT DO UPDATE
$rows = $df->to_sql($PG, 'orders_archive', 'upsert', ['id']);

// ── MySQL ─────────────────────────────────────────────────────
$MY = 'mysql://user:pass@localhost:3306/mydb';

$df = IO::read_mysql($MY, 'SELECT * FROM products WHERE active = 1');
$rows = $df->to_mysql($MY, 'products_bak', 'append');
$rows = $df->to_mysql($MY, 'products_bak', 'upsert', ['id']);

// ── SQL Server ────────────────────────────────────────────────
$MS = 'Server=localhost,1433;Database=mydb;UID=sa;PWD=MyPwd;TrustServerCertificate=Yes';

$df = IO::read_sqlserver($MS, 'SELECT TOP 5000 * FROM dbo.transactions');
$rows = $df->to_sqlserver($MS, 'dbo.transactions_bak', 'upsert', ['id']); // MERGE INTO

// ── MongoDB ───────────────────────────────────────────────────
$MG = 'mongodb://admin:pass@localhost:27017';

$df = IO::read_mongodb($MG, 'mydb.orders');
$rows = $df->to_mongodb($MG, 'mydb.orders_archive', 'append');

Write modes:

Database append replace upsert
PostgreSQL INSERT (COPY for >10k rows) ON CONFLICT DO UPDATE
MySQL INSERT 1k/stmt (LOAD DATA for >10k) REPLACE INTO ON DUPLICATE KEY UPDATE
SQL Server INSERT 500/stmt INSERT 500/stmt MERGE INTO
MongoDB insert_many 10k/batch drop + insert_many

Note on SQL Server passwords: A known issue in the tiberius v0.12 Rust driver causes authentication failure when ! is present in the password. Use only [A-Za-z0-9_\-@#$]. Fix planned for v0.4.0.


📋 Full API Overview

DataFrame — Factory Methods

$df = DataFrame::read_csv('./file.csv');
$df = DataFrame::read_csv('./file.csv', ['price' => 'Float64']);  // with schema
$df = IO::read_sql($pgConn, 'SELECT * FROM orders');
$df = IO::read_mysql($myConn, 'SELECT * FROM products');
$df = IO::read_sqlserver($msConn, 'SELECT TOP 100 * FROM dbo.t');
$df = IO::read_mongodb($mgConn, 'mydb.collection');

DataFrame — Properties

$df->shape      // [rows, cols]
$df->columns    // ['col1', 'col2', ...]
$df->dtypes     // ['col1' => 'Float64', ...]

DataFrame — Inspection

$df->show(10);              // ASCII table (stdout)
$df->head(5);               // → DataFrame
$df->tail(5);               // → DataFrame
$df->iloc(0, 100);          // → DataFrame (rows 0–99)

DataFrame — Arithmetic & Transform

$df->cast('quantity', 'Float64');
$df->fillna(0.0);
$df->round(2);
$df->mul('price', 'quantity');       // → DataFrame with 'result_mul'
$df->sub('revenue', 'cost');         // → DataFrame with 'result_sub'
$df->add('amount', 'tax');           // → DataFrame with 'result_add'
$df->std('column');                  // float
$df->min_max_scale('col');           // → DataFrame with 'result_minmax'
$df->sort_values('col', true);       // → DataFrame (ascending)
$df->sort_values('col', false, true);// → DataFrame (descending, GPU)

DataFrame — GroupBy & Aggregation

$df->group_by('category', ['revenue' => 'sum', 'price' => 'mean']);
$df->group_by('state', ['quantity' => 'count', 'revenue' => 'max']);

DataFrame — Window Functions

$df->row_number('price');
$df->rank('revenue', 'dense');
$df->lag('price', 1);
$df->lead('price', 1);
$df->rolling_mean('price', 7);

DataFrame — String Operations

$df->str_upper('col');
$df->str_lower('col');
$df->str_trim('col');
$df->str_len('col');
$df->str_contains('col', 'pattern');
$df->str_replace('col', 'old', 'new');

DataFrame — Series Operators

$total   = $df['price'] * $df['quantity'];    // → Series
$net     = $df['revenue'] - $df['discount'];  // → Series
$df['total'] = $df['price'] * $df['quantity']; // assign new column

Series — Aggregations

$df['col']->sum();     // float
$df['col']->mean();    // float
$df['col']->min();     // float
$df['col']->max();     // float
$df['col']->std();     // float
$df['col']->count();   // int

Observer

$df->value_counts('col');   // ['value' => count, ...]
$df->unique('col');         // ['val1', 'val2', ...]
$df->to_dict();             // [['col' => val, ...], ...]
$df->to_json();             // string
$df->tolist();              // [[val, val, ...], ...]

Write

$df->to_prdx('./out.prdx');
$df->to_csv('./out.csv');
$df->to_sql($pgConn, 'table', 'append', ['id']);
$df->to_mysql($myConn, 'table', 'upsert', ['id']);
$df->to_sqlserver($msConn, 'dbo.table', 'append');
$df->to_mongodb($mgConn, 'db.collection', 'append');
IO::write_sql_prdx('file.prdx', $pgConn, 'table', 'append', [], 1000000);

IO — Static Helpers

IO::executeSql($pgConn, 'CREATE TABLE ...');
IO::executeMysql($myConn, 'DROP TABLE IF EXISTS ...');
IO::executeSqlserver($msConn, 'TRUNCATE TABLE dbo.staging');
IO::executeMongodb($mgConn, 'mydb', '{"drop": "col"}');

📊 Benchmarks

Operation PHP PDO / loops PardoX v0.3.2 Speedup
Read CSV (1 GB) ~12s ~0.8s ~15x
Column multiply (1M rows) ~0.8s ~0.02s ~40x
PostgreSQL write 50k rows ~20s (PDO execute) ~0.6s (COPY) ~33x
MySQL write 50k rows ~25s (PDO execute) ~3s (batch INSERT) ~8x
PRDX → PostgreSQL 150M rows N/A ~1,032s 145k rows/s

🗺️ Roadmap

Version Status Highlights
v0.1 ✅ Released CSV, arithmetic, aggregations, .prdx format
v0.3.1 ✅ Released Databases (PG/MySQL/MSSQL/MongoDB), Observer, Math, GPU sort
v0.3.2 ✅ Released PRDX Streaming, GroupBy, Window, String/Date, Lazy, Encryption, Data Contracts, Time Travel, Arrow Flight, Linear Algebra, REST Connector — 29 features
v0.4.0 🔜 Planned SQL Server ! password fix, structured error codes, Apache Parquet, Kafka, S3

🌐 Platform Support

OS Architecture Status
Linux x86_64 ✅ Stable
Windows x86_64 ✅ Stable
macOS ARM64 (M1/M2/M3) ✅ Stable
macOS x86_64 (Intel) ✅ Stable

📘 Documentation

Full Documentation →


📄 License

MIT License — free for commercial and personal use.


by Alberto Cardenas
www.albertocardenas.com · www.pardox.io

About

PHP SDK for pardoX, the high-performance DataFrame engine for modern ETL and analytics. A Rust core powers SDKs in Python, Node.js, and PHP, with native database I/O and an ultra-fast binary format.

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project

 

Packages

 
 
 

Contributors