Draft
Commits
48 commits
2ab5557
AI draft for protocol buffer support
thirtiseven Dec 19, 2025
d261358
AI draft for Hadoop sequence file reader
thirtiseven Dec 22, 2025
cd31fad
Revert "AI draft for protocol buffer support"
thirtiseven Dec 23, 2025
1278c66
clean up
thirtiseven Dec 23, 2025
10933b0
update tests
thirtiseven Dec 23, 2025
e965b01
address comment
thirtiseven Dec 25, 2025
f89f8c1
address comment
thirtiseven Dec 25, 2025
02c0752
address comment
thirtiseven Dec 25, 2025
f3bcf9d
address comment
thirtiseven Dec 25, 2025
562672a
copyrights
thirtiseven Jan 4, 2026
2e10fbd
refactor
thirtiseven Jan 5, 2026
572c0da
address comments
thirtiseven Jan 6, 2026
9b1162e
address comments
thirtiseven Jan 6, 2026
f95910f
address comments
thirtiseven Jan 6, 2026
481bfbe
multi-thread reader
thirtiseven Jan 7, 2026
bd526c5
delete perf test
thirtiseven Jan 7, 2026
cf33cf4
address commens
thirtiseven Jan 8, 2026
af43f3e
address comments
thirtiseven Jan 8, 2026
288152a
remove COALESCING reader
thirtiseven Jan 8, 2026
ea91eab
fix
thirtiseven Jan 8, 2026
6a23c2e
fix
thirtiseven Jan 8, 2026
9847f16
make sequence file isSplitable to false due to data diff
thirtiseven Jan 9, 2026
c6b98fb
Merge branch 'seq_file_reader' of https://github.com/thirtiseven/spar…
thirtiseven Jan 9, 2026
70ad202
fix merge seqreader
thirtiseven Jan 9, 2026
f9f4a8c
use gpu reader
thirtiseven Jan 13, 2026
e6322bc
fix a bug
thirtiseven Jan 13, 2026
94f31ea
performance optimization
thirtiseven Jan 16, 2026
4139c00
fix
thirtiseven Jan 16, 2026
1b2fbe9
Revert "fix"
thirtiseven Jan 20, 2026
310ccbc
Revert "performance optimization"
thirtiseven Jan 20, 2026
81ccdfa
Revert "fix a bug"
thirtiseven Jan 20, 2026
28d0405
Revert "use gpu reader"
thirtiseven Jan 20, 2026
dcf6af0
fix OOM bug
thirtiseven Jan 20, 2026
87f5a72
performance optimzation
thirtiseven Jan 20, 2026
bcfcbc7
integration tests
thirtiseven Jan 21, 2026
7cc02cf
splitable true by default
thirtiseven Jan 21, 2026
143fc3d
logical rule
thirtiseven Jan 23, 2026
dc0bbfc
save a memory copy
thirtiseven Jan 26, 2026
9619bc0
fix perfile config
thirtiseven Jan 27, 2026
d052441
GPU combine to produce larger batch
thirtiseven Jan 28, 2026
98ee00f
support glob style path
thirtiseven Feb 2, 2026
e4fef5a
a bug fix, RDD conversion refinement
thirtiseven Feb 3, 2026
0f8f8ca
support compress
thirtiseven Feb 4, 2026
c60c978
huge upmerge from dev branch
thirtiseven Feb 25, 2026
18015d6
refactor
thirtiseven Feb 26, 2026
1b297b9
Merge remote-tracking branch 'origin/main' into seq_file_reader
thirtiseven Mar 12, 2026
3fe2cd6
fix scala 2.13 build
thirtiseven Mar 12, 2026
ed7fa84
verify and refactor
thirtiseven Mar 12, 2026
4 changes: 4 additions & 0 deletions docs/additional-functionality/advanced_configs.md
@@ -123,6 +123,9 @@ Name | Description | Default Value | Applicable at
<a name="sql.format.parquet.reader.type"></a>spark.rapids.sql.format.parquet.reader.type|Sets the Parquet reader type. We support different types that are optimized for different environments. The original Spark style reader can be selected by setting this to PERFILE which individually reads and copies files to the GPU. Loading many small files individually has high overhead, and using either COALESCING or MULTITHREADED is recommended instead. The COALESCING reader is good when using a local file system where the executors are on the same nodes or close to the nodes the data is being read on. This reader coalesces all the files assigned to a task into a single host buffer before sending it down to the GPU. It copies blocks from a single file into a host buffer in separate threads in parallel, see spark.rapids.sql.multiThreadedRead.numThreads. MULTITHREADED is good for cloud environments where you are reading from a blobstore that is totally separate and likely has a higher I/O read cost. Many times the cloud environments also get better throughput when you have multiple readers in parallel. This reader uses multiple threads to read each file in parallel and each file is sent to the GPU separately. This allows the CPU to keep reading while GPU is also doing work. See spark.rapids.sql.multiThreadedRead.numThreads and spark.rapids.sql.format.parquet.multiThreadedRead.maxNumFilesParallel to control the number of threads and amount of memory used. By default this is set to AUTO so we select the reader we think is best. This will either be the COALESCING or the MULTITHREADED based on whether we think the file is in the cloud. See spark.rapids.cloudSchemes.|AUTO|Runtime
<a name="sql.format.parquet.write.enabled"></a>spark.rapids.sql.format.parquet.write.enabled|When set to false disables parquet output acceleration|true|Runtime
<a name="sql.format.parquet.writer.int96.enabled"></a>spark.rapids.sql.format.parquet.writer.int96.enabled|When set to false, disables accelerated parquet write if the spark.sql.parquet.outputTimestampType is set to INT96|true|Runtime
<a name="sql.format.sequencefile.multiThreadedRead.maxNumFilesParallel"></a>spark.rapids.sql.format.sequencefile.multiThreadedRead.maxNumFilesParallel|A limit on the maximum number of files per task processed in parallel on the CPU side before the data is sent to the GPU. This affects the amount of host memory used when reading the files in parallel. Used with the MULTITHREADED reader; see spark.rapids.sql.format.sequencefile.reader.type.|2147483647|Runtime
<a name="sql.format.sequencefile.rddScan.physicalReplace.enabled"></a>spark.rapids.sql.format.sequencefile.rddScan.physicalReplace.enabled|Enable physical-plan replacement for SequenceFile RDD scans (RDDScanExec) when the lineage can be safely identified as a simple SequenceFile scan with BinaryType key/value output. Unsupported or risky cases automatically remain on CPU.|true|Runtime
<a name="sql.format.sequencefile.reader.type"></a>spark.rapids.sql.format.sequencefile.reader.type|Sets the SequenceFile reader type. Since SequenceFile decoding happens on the CPU (using Hadoop's SequenceFile.Reader), COALESCING mode is not supported and will throw an exception. MULTITHREADED uses multiple threads to read files in parallel, utilizing multiple CPU cores for I/O and decoding. MULTITHREADED is recommended when reading many files, as it allows the CPU to keep reading while the GPU is doing work. See spark.rapids.sql.multiThreadedRead.numThreads and spark.rapids.sql.format.sequencefile.multiThreadedRead.maxNumFilesParallel to control the number of threads and the amount of memory used. AUTO is kept for compatibility, but MULTITHREADED is the default for SequenceFile.|MULTITHREADED|Runtime
<a name="sql.formatNumberFloat.enabled"></a>spark.rapids.sql.formatNumberFloat.enabled|format_number with floating point types on the GPU returns results that have a different precision than the default results of Spark.|true|Runtime
<a name="sql.hasExtendedYearValues"></a>spark.rapids.sql.hasExtendedYearValues|Spark 3.2.0+ extended parsing of years in dates and timestamps to support the full range of possible values. Prior to this it was limited to a positive 4 digit year. The Accelerator does not support the extended range yet. This config indicates if your data includes this extended range or not, or if you don't care about getting the correct values on values with the extended range.|true|Runtime
<a name="sql.hashOptimizeSort.enabled"></a>spark.rapids.sql.hashOptimizeSort.enabled|Whether sorts should be inserted after some hashed operations to improve output ordering. This can improve output file sizes when saving to columnar formats.|false|Runtime
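The new SequenceFile configs above can be set on a live session. A minimal sketch, assuming a `SparkSession` named `spark` with the RAPIDS Accelerator plugin loaded; the numeric values are illustrative, not tuning recommendations:

```scala
// Illustrative tuning of the SequenceFile reader configs added in this PR.
// Requires a SparkSession ("spark") running with the RAPIDS Accelerator plugin.
spark.conf.set("spark.rapids.sql.format.sequencefile.reader.type", "MULTITHREADED")
// Shared thread pool size for multithreaded reads (pre-existing config).
spark.conf.set("spark.rapids.sql.multiThreadedRead.numThreads", "20")
// Cap host memory use by limiting how many files one task reads in parallel.
spark.conf.set(
  "spark.rapids.sql.format.sequencefile.multiThreadedRead.maxNumFilesParallel", "64")
// Opt out of the SequenceFile RDD-scan replacement rule if needed.
spark.conf.set(
  "spark.rapids.sql.format.sequencefile.rddScan.physicalReplace.enabled", "false")
```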
@@ -473,6 +476,7 @@ Name | Description | Default Value | Notes
<a name="sql.exec.ProjectExec"></a>spark.rapids.sql.exec.ProjectExec|The backend for most select, withColumn and dropColumn statements|true|None|
<a name="sql.exec.RangeExec"></a>spark.rapids.sql.exec.RangeExec|The backend for range operator|true|None|
<a name="sql.exec.SampleExec"></a>spark.rapids.sql.exec.SampleExec|The backend for the sample operator|true|None|
<a name="sql.exec.SerializeFromObjectExec"></a>spark.rapids.sql.exec.SerializeFromObjectExec|Serialize object rows to binary columns for SequenceFile RDD scans|true|None|
<a name="sql.exec.SortExec"></a>spark.rapids.sql.exec.SortExec|The backend for the sort operator|true|None|
<a name="sql.exec.SubqueryBroadcastExec"></a>spark.rapids.sql.exec.SubqueryBroadcastExec|Plan to collect and transform the broadcast key values|true|None|
<a name="sql.exec.TakeOrderedAndProjectExec"></a>spark.rapids.sql.exec.TakeOrderedAndProjectExec|Take the first limit elements as defined by the sortOrder, and do projection if needed|true|None|
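For context on the `rddScan.physicalReplace` config, here is a sketch of the kind of lineage it targets: an RDD-based SequenceFile read whose result is a DataFrame with BinaryType key/value columns. The path and column names are hypothetical, and this requires a Spark runtime; it is not a specific API from this PR:

```scala
import org.apache.hadoop.io.BytesWritable

// Hypothetical example: read binary key/value pairs from a SequenceFile as an
// RDD, then convert to a DataFrame with two BinaryType columns. A simple
// lineage like this is what the SequenceFile RDD-scan replacement rule can
// recognize and move to the GPU; unsupported or risky cases stay on the CPU.
val rdd = spark.sparkContext
  .sequenceFile("/path/to/data.seq", classOf[BytesWritable], classOf[BytesWritable])
  .map { case (k, v) => (k.copyBytes(), v.copyBytes()) }
val df = spark.createDataFrame(rdd).toDF("key", "value") // both BinaryType
```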
186 changes: 106 additions & 80 deletions docs/supported_ops.md
@@ -421,6 +421,32 @@ Accelerator supports are described below.
<td>S</td>
</tr>
<tr>
<td rowspan="1">SerializeFromObjectExec</td>
<td rowspan="1">Serialize object rows to binary columns for SequenceFile RDD scans</td>
<td rowspan="1">None</td>
<td>Input/Output</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>S</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td rowspan="1">SortExec</td>
<td rowspan="1">The backend for the sort operator</td>
<td rowspan="1">None</td>
@@ -499,32 +525,6 @@ Accelerator supports are described below.
<td><b>NS</b></td>
</tr>
<tr>
<td rowspan="1">UnionExec</td>
<td rowspan="1">The backend for the union operator</td>
<td rowspan="1">None</td>
<td>Input/Output</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>unionByName will not optionally impute nulls for missing struct fields when the column is a struct and there are non-overlapping fields;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<th>Executor</th>
<th>Description</th>
<th>Notes</th>
@@ -551,6 +551,32 @@ Accelerator supports are described below.
<th>YEARMONTH</th>
</tr>
<tr>
<td rowspan="1">UnionExec</td>
<td rowspan="1">The backend for the union operator</td>
<td rowspan="1">None</td>
<td>Input/Output</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>unionByName will not optionally impute nulls for missing struct fields when the column is a struct and there are non-overlapping fields;<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
</tr>
<tr>
<td rowspan="1">AQEShuffleReadExec</td>
<td rowspan="1">A wrapper of shuffle query stage</td>
<td rowspan="1">None</td>
@@ -915,6 +941,32 @@ Accelerator supports are described below.
<td>S</td>
</tr>
<tr>
<th>Executor</th>
<th>Description</th>
<th>Notes</th>
<th>Param(s)</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
<th>DAYTIME</th>
<th>YEARMONTH</th>
</tr>
<tr>
<td rowspan="4">BroadcastHashJoinExec</td>
<td rowspan="4">Implementation of join using broadcast data</td>
<td rowspan="4">None</td>
@@ -1010,32 +1062,6 @@ Accelerator supports are described below.
<td><b>NS</b></td>
</tr>
<tr>
<th>Executor</th>
<th>Description</th>
<th>Notes</th>
<th>Param(s)</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
<th>DAYTIME</th>
<th>YEARMONTH</th>
</tr>
<tr>
<td rowspan="2">BroadcastNestedLoopJoinExec</td>
<td rowspan="2">Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported</td>
<td rowspan="2">None</td>
@@ -1301,6 +1327,32 @@ Accelerator supports are described below.
<td><b>NS</b></td>
</tr>
<tr>
<th>Executor</th>
<th>Description</th>
<th>Notes</th>
<th>Param(s)</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
<th>DAYTIME</th>
<th>YEARMONTH</th>
</tr>
<tr>
<td rowspan="1">AggregateInPandasExec</td>
<td rowspan="1">The backend for an Aggregation Pandas UDF. This accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled.</td>
<td rowspan="1">None</td>
@@ -1405,32 +1457,6 @@ Accelerator supports are described below.
<td><b>NS</b></td>
</tr>
<tr>
<th>Executor</th>
<th>Description</th>
<th>Notes</th>
<th>Param(s)</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
<th>DAYTIME</th>
<th>YEARMONTH</th>
</tr>
<tr>
<td rowspan="1">MapInPandasExec</td>
<td rowspan="1">The backend for Map Pandas Iterator UDF. Accelerates the data transfer between the Java process and the Python process. It also supports scheduling GPU resources for the Python process when enabled.</td>
<td rowspan="1">None</td>
@@ -8267,7 +8293,7 @@ are limited.
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
<td> </td>
@@ -8290,7 +8316,7 @@ are limited.
<td> </td>
<td> </td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT, DAYTIME, YEARMONTH</em></td>
<td> </td>
<td> </td>
<td> </td>