02 Dec 10:07

0.3.0 Latest

ScaleDP 0.3.0

What's Changed

🚀 Features

Added TextEmbeddings transformer for computing embeddings using SentenceTransformers
Added BaseTextSplitter and TextSplitter for semantic text splitting
Added pandas UDF support for TextSplitter
Added support for TextChunks as input to TextEmbeddings
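
The idea of splitting text into chunks before embedding can be pictured with a plain-Python sketch. This is not ScaleDP's TextSplitter and it is not a semantic splitter: it swaps in a simple fixed-size word window with overlap, only to show the shape of the chunks that a later embedding step would consume. The names `split_text`, `chunk_size`, and `overlap` are hypothetical and do not come from the ScaleDP API.

```python
from typing import List


def split_text(text: str, chunk_size: int = 5, overlap: int = 1) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. Hypothetical helper,
    not part of ScaleDP.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_text("a b c d e f g", chunk_size=3, overlap=1):
        print(chunk)
```

Each returned string plays the role of one text chunk; in a real pipeline the chunks would be passed on to an embedding model instead of being printed.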

📚 Documentation

Added TextEmbeddings and TextSplitter docs

📘 Jupyter Notebooks

TextSplitterAndEmbeddings.ipynb - Read PDF documents, split text into chunks, and compute embeddings at scale

19 Nov 06:48

0.2.6

ScaleDP 0.2.6

What's Changed

🚀 Features

Enabled GPU support in YoloOnnxDetector

🔄 Updates

Changed default values for detectors

📘 Jupyter Notebooks

YoloOnnxDetectorBenchamrks.ipynb - Benchmarking YOLO models with different parameter configurations on CPU and GPU

📝 Blog Posts

Benchmarking YOLO Models on Spark Using ScaleDP

10 Nov 05:24

0.2.5

ScaleDP 0.2.5

What's Changed

🚀 Features

Added 'returnEmpty' param to ImageCropBoxes to avoid exceptions when no boxes are found
Added labels param to the YoloOnnxDetector
Improved label display in ImageDrawBoxes
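
The behavior that a flag like 'returnEmpty' describes can be sketched in plain Python. This is an illustration under assumptions, not the ImageCropBoxes implementation: `crop_boxes`, `return_empty`, and the `(x, y, width, height)` box layout are hypothetical, and the image is modeled as a list of pixel rows.

```python
from typing import List, Tuple

Image = List[List[int]]           # rows of pixel values
Box = Tuple[int, int, int, int]   # x, y, width, height (assumed layout)


def crop_boxes(image: Image, boxes: List[Box], return_empty: bool = False) -> List[Image]:
    """Crop every box out of the image.

    With no boxes, either return an empty list (return_empty=True)
    or raise. Hypothetical helper, not part of ScaleDP.
    """
    if not boxes:
        if return_empty:
            return []
        raise ValueError("no boxes found")
    # Slice rows by y/height, then columns by x/width, for each box.
    return [[row[x:x + w] for row in image[y:y + h]] for x, y, w, h in boxes]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(crop_boxes(image, [(1, 0, 2, 2)]))
    print(crop_boxes(image, [], return_empty=True))
```

Returning an empty result lets a batch job continue past pages with no detections, while the raising default keeps a missing detection visible.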

🧰 Maintenance

Updated versions of dependencies (Pandas, Numpy, OpenCV)

🐛 Bug Fixes

Fixed color scheme conversion in YoloOnnxDetector
Fixed show utils on Google Colab
Fixed DataFrame imports

📝 Blog Posts

Running YOLO Models on Spark Using ScaleDP

03 Nov 06:38

0.2.4

ScaleDP 0.2.4

What's Changed

🚀 Features

Added FaceDetector transformer
Added SignatureDetector transformer
Added PdfAssembler transformer for assembling PDFs
Updated ImageCropBoxes to support multiple boxes
Added LineOrientation detector model to the TesseractRecognizer
Added support for subfields in Show Utils
Added padding option to YoloOnnxDetector

🐛 Bug Fixes

Fixed borders in Show Utils

19 Mar 10:53

0.2.2 Pre-release

ScaleDP 0.2.2

What's Changed

Integrated with Spark PDF DataSource
Added object detection by @mykolamelnykml in #47
Updated LLM extractors by @mykolamelnykml in #48
Improved LLM extractors and other workflows by @mykolamelnykml in #50
Added LLM OCR by @mykolamelnykml in #52
Added LLMNer by @mykolamelnykml in #54
Added Black and Ruff runs to actions by @mykolamelnykml in #55
Updated VisualLLMExtractor by @mykolamelnykml in #58
Updated tutorials by @mykolamelnykml in #60
Improved VisualLLMExtractor by @mykolamelnykml in #62

Related posts:

Structured Data Extraction from PDFs with AI

Full Changelog: 0.1.0rc10...0.2.2

Contributors

mykolamelnykml

18 Nov 18:11

0.1.0rc10 Pre-release

What's Changed

Added EasyOcr in https://github.com/StabRise/spark-pdf/pull/39
Added DocTR in https://github.com/StabRise/spark-pdf/pull/44
Added Surya Ocr in https://github.com/StabRise/spark-pdf/pull/38
Added support for the PyTesseract lib as a binding to Tesseract in StabRise/spark-pdf#4
Added line width param to ImageDrawBoxes in StabRise/spark-pdf#5
Added textSize param to ImageDrawBoxes in StabRise/spark-pdf#7
Added list of displayed data to the imagedrawregions in StabRise/spark-pdf#9
Initialized Sphinx docs in StabRise/spark-pdf#20
Improved test coverage in StabRise/spark-pdf#22
Added TextToDocument transformer in StabRise/spark-pdf#27
Refactoring in StabRise/spark-pdf#30
Changed Ner transformer to work with raw text in https://github.com/StabRise/spark-pdf/pull/31
Added Dockerfile in https://github.com/StabRise/spark-pdf/pull/37, StabRise/spark-pdf#16

Full Changelog: https://github.com/StabRise/spark-pdf/commits/0.1.0rc10
