Commit 9fd57b5

julienledem, alamb, and mrcnc authored
Add a proposal process (#513)
* start with a simple process and an example

---------

Signed-off-by: Julien Le Dem <julien@apache.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Micah Kornfield
Co-authored-by: Marc Cenac <547446+mrcnc@users.noreply.github.com>
1 parent c3f7be7 commit 9fd57b5

3 files changed

+166
-0
lines changed

proposals/1_BASE64_ENCODING.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
1+
<!--
2+
- Licensed to the Apache Software Foundation (ASF) under one
3+
- or more contributor license agreements. See the NOTICE file
4+
- distributed with this work for additional information
5+
- regarding copyright ownership. The ASF licenses this file
6+
- to you under the Apache License, Version 2.0 (the
7+
- "License"); you may not use this file except in compliance
8+
- with the License. You may obtain a copy of the License at
9+
-
10+
- http://www.apache.org/licenses/LICENSE-2.0
11+
-
12+
- Unless required by applicable law or agreed to in writing,
13+
- software distributed under the License is distributed on an
14+
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
- KIND, either express or implied. See the License for the
16+
- specific language governing permissions and limitations
17+
- under the License.
18+
-->
19+
---
20+
Author: Julien Le Dem
21+
Created: 2025-Aug-7
22+
Name: add BASE64 compression
23+
Issue: https://github.com/apache/parquet-format/issues/NNN
24+
Status: ARCHIVED
25+
Reason: Did not compress
26+
---
27+
28+
# Proposal
29+
30+
**NOTE**: This is an example proposal for use as a template
31+
32+
## Description
33+
Add Base64 to compression algorithms.
34+
This is not backwards compatible, because it adds a new compression algorithm.
35+
36+
## Spec
37+
38+
See [BASE64 spec].
39+
40+
## Evaluation
41+
42+
After trying it out in the Java implementation, file size doubled on average.
43+
See prototype [here](github.com/julienledem/mypoc)
44+
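
The archived outcome in this example proposal is easy to sanity-check: Base64 is a binary-to-text encoding that maps every 3 input bytes to 4 output characters, so it grows data rather than shrinking it. The following is a minimal sketch, not the prototype linked above; the function name `base64_expansion_ratio` is hypothetical. It only measures the Base64 step itself (about a one-third increase) and does not attempt to reproduce the doubling reported in the Evaluation section.

```python
import base64
import os


def base64_expansion_ratio(payload: bytes) -> float:
    """Return encoded_size / original_size for standard Base64 (hypothetical helper)."""
    if not payload:
        raise ValueError("payload must be non-empty")
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload)


if __name__ == "__main__":
    # 3000 random bytes stand in for a column chunk; any payload would do.
    sample = os.urandom(3000)
    print(f"size ratio after Base64: {base64_expansion_ratio(sample):.3f}")
```

A ratio above 1.0 means the "compressed" output is larger than the input, which is the kind of finding the Evaluation section is meant to record.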

proposals/README.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
<!--
2+
- Licensed to the Apache Software Foundation (ASF) under one
3+
- or more contributor license agreements. See the NOTICE file
4+
- distributed with this work for additional information
5+
- regarding copyright ownership. The ASF licenses this file
6+
- to you under the Apache License, Version 2.0 (the
7+
- "License"); you may not use this file except in compliance
8+
- with the License. You may obtain a copy of the License at
9+
-
10+
- http://www.apache.org/licenses/LICENSE-2.0
11+
-
12+
- Unless required by applicable law or agreed to in writing,
13+
- software distributed under the License is distributed on an
14+
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
- KIND, either express or implied. See the License for the
16+
- specific language governing permissions and limitations
17+
- under the License.
18+
-->
19+
# Proposals
20+
21+
This proposal process is intended for additions to the Parquet spec that have a significant impact. The goal is to facilitate those projects and help them get contributed to Parquet.
22+
Examples include changes that are not forward compatible, such as adding a new encoding or a new compression algorithm (older implementations cannot read new files). [See the guidelines for more details](https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#general-guidelinespreferences-on-additions).
23+
The process gives better visibility to projects that require coordination across several implementations.
24+
Bug fixes, code-only features, or minor changes to the spec that can be ignored by older implementations can simply be filed as a GitHub issue.
25+
26+
## Proposal lifecycle
27+
28+
Discuss -> Draft/POC -> Implementation -> Approval
29+
30+
### Discuss
31+
Start a [DISCUSS] thread on the mailing list (dev@parquet.apache.org) with your idea. At this point, the community can discuss whether the impact of the proposal requires a document here or whether a GitHub issue is enough.
32+
Once you have a better idea of the general consensus on the proposal, open a GitHub issue using the [proposal template](proposals/_PROPOSAL_TEMPLATE.md).
33+
Attaching a Google Doc to collect feedback and collaborate with the community usually works well early on.
34+
35+
*Transition:* Once you feel you have received enough feedback, or you need to start the POC to better answer the questions you are getting, you can move to the next step. Anybody is free to start a POC at any time. We just recommend getting feedback before you spend a significant amount of your time.
36+
37+
### Draft/POC
38+
Once you feel the discussion has stabilized and you are ready to start a POC, open a PR to add the proposal to the table in the [Active Proposals](#active-proposals) section below. Link all relevant discussion documents in the body of the corresponding GitHub issue. If using Google Docs, make sure the docs are owned by an account that will keep them publicly available (prefer a personal account over a work account). Alternatively, you can create a new Markdown file in the proposals folder to give more visibility to the work in progress (see [the example](1_BASE64_ENCODING.md)).
39+
The proposal document can evolve over the course of the POC, in particular by adding more links to findings and performance evaluations. Collaboration is encouraged. More validation of the POC increases the chances of success.
40+
41+
Example: [https://github.com/apache/parquet-format/pull/221]
42+
43+
Make sure you consider the [requirements document](https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0#heading=h.v4emiipkghrx) to ensure the success of the POC. (Note: this doc would become a markdown page in the repo)
44+
45+
*Transition:* There is enough clarity on the spec for the new feature and we have identified the 2 initial reference implementations for verification.
46+
47+
### Implementation
48+
Once we have reached enough consensus on the formalized spec change and validated it through the POC, we should have a clear idea of whether we want to pursue the implementation across the ecosystem.
49+
At this stage we should finalize a formal spec contribution to parquet-format. The implementation is considered finished only once it meets the contribution guidelines.
50+
See [CONTRIBUTING guidelines](https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format).
51+
52+
*Transition:* A PMC vote will formalize that we have concluded the implementation and are ready to release.
53+
54+
### Approval
55+
Once the implementation phase is finished, we can include the contribution in the next release. Congrats!
56+
57+
## Active Proposals
58+
59+
| ID | Description | Status |
60+
|-----|--------------|---------|
61+
| [github issue] | adding this new encoding | POC |
62+
| [github issue] | add Variant type | Implementation |
63+
64+
(Those are examples to be removed as we start using this)
65+
66+
## Implemented
67+
| ID | Description | Status | release it was added |
68+
|-----|--------------|---------|-----------------------|
69+
| [github issue] | encryption | Completed | x.y.z |
70+
71+
(Those are examples to be removed as we start using this)
72+
73+
## Archived
74+
75+
| ID | Description | Status | reason for archiving |
76+
|-----|--------------|---------|-----------------------|
77+
| [github issue] | [adding base64 compression](1_BASE64_ENCODING.md) | Archived | POC showed that compression ratio was not practical |
78+
79+
(Those are examples to be removed as we start using this)
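
The lifecycle described in this README (Discuss -> Draft/POC -> Implementation -> Approval) can be read as a small state machine. The sketch below is illustrative only: the `Stage` enum and `can_transition` function are hypothetical names, and the rule that a proposal may be archived from any stage before Approval is an assumption inferred from the Archived table, not something the README states.

```python
from enum import Enum


class Stage(Enum):
    DISCUSS = "Discuss"
    DRAFT_POC = "Draft/POC"
    IMPLEMENTATION = "Implementation"
    APPROVAL = "Approval"
    ARCHIVED = "Archived"


# Forward order taken from the "Proposal lifecycle" line in the README.
_ORDER = [Stage.DISCUSS, Stage.DRAFT_POC, Stage.IMPLEMENTATION, Stage.APPROVAL]


def can_transition(current: Stage, target: Stage) -> bool:
    """Return True if moving from `current` to `target` follows the lifecycle."""
    if target is Stage.ARCHIVED:
        # Assumption: archiving is allowed from any stage before Approval.
        return current in _ORDER[:-1]
    if current in (Stage.ARCHIVED, Stage.APPROVAL):
        # Terminal stages in this sketch.
        return False
    # Otherwise only a single step forward is allowed.
    return _ORDER.index(target) == _ORDER.index(current) + 1


if __name__ == "__main__":
    print(can_transition(Stage.DISCUSS, Stage.DRAFT_POC))
    print(can_transition(Stage.DISCUSS, Stage.IMPLEMENTATION))
```

The transition conditions themselves (enough feedback, two reference implementations identified, a PMC vote) are human decisions and are deliberately not modeled here.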

proposals/_PROPOSAL_TEMPLATE.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
<!--
2+
- Licensed to the Apache Software Foundation (ASF) under one
3+
- or more contributor license agreements. See the NOTICE file
4+
- distributed with this work for additional information
5+
- regarding copyright ownership. The ASF licenses this file
6+
- to you under the Apache License, Version 2.0 (the
7+
- "License"); you may not use this file except in compliance
8+
- with the License. You may obtain a copy of the License at
9+
-
10+
- http://www.apache.org/licenses/LICENSE-2.0
11+
-
12+
- Unless required by applicable law or agreed to in writing,
13+
- software distributed under the License is distributed on an
14+
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
- KIND, either express or implied. See the License for the
16+
- specific language governing permissions and limitations
17+
- under the License.
18+
-->
19+
# Proposal
20+
21+
---
22+
Author: ~your name~
23+
Created: ~date~
24+
Name: *short sentence describing the proposal*
25+
Issue: https://github.com/apache/parquet-format/issues/NNN
26+
Status: DRAFT|IMPLEMENTATION|COMPLETED
27+
---
28+
29+
## Description
30+
*Short description of the proposal. Is it a new encoding? Is it backwards compatible (old readers will just ignore it)? Is it additional metadata?*
31+
32+
## Rationale
33+
Describe why this is a feature that will improve the parquet format and what alternatives currently exist for the use case (e.g. must use a different format, or "must build additional infrastructure to avoid re-parsing footer on each query", or "must use a general purpose compression algorithm to achieve the same space, thus slowing down query performance")
34+
35+
## Spec
36+
37+
At the proposal stage you don't need a fully fleshed out spec yet.
38+
Please add any link to relevant documentation, papers, etc.
39+
At the implementation stage, all the details will need to be clarified.
40+
41+
## Evaluation
42+
What datasets is it tested on, and what are the success criteria?
43+
Please add any link to the relevant codebase.
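
The `---` delimited header in this template is regular enough to check mechanically. The following is a minimal sketch of such a check, assuming the header keeps the `Key: value` layout shown above. The function names `parse_header` and `header_problems` are hypothetical, and no such tooling is part of this commit. The status set includes `ARCHIVED` because the example proposal uses it, even though the template lists only `DRAFT|IMPLEMENTATION|COMPLETED`.

```python
REQUIRED_FIELDS = ("Author", "Created", "Name", "Issue", "Status")
# DRAFT, IMPLEMENTATION, COMPLETED come from the template; ARCHIVED from the example proposal.
KNOWN_STATUSES = {"DRAFT", "IMPLEMENTATION", "COMPLETED", "ARCHIVED"}


def parse_header(markdown: str) -> dict:
    """Extract 'Key: value' pairs from the first block delimited by '---' lines."""
    lines = markdown.splitlines()
    markers = [i for i, line in enumerate(lines) if line.strip() == "---"]
    if len(markers) < 2:
        raise ValueError("no '---' delimited header found")
    fields = {}
    for line in lines[markers[0] + 1 : markers[1]]:
        # Split on the first colon only, so URLs in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def header_problems(fields: dict) -> list:
    """Return a list of human-readable problems; an empty list means the header looks fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    status = fields.get("Status")
    if status and status not in KNOWN_STATUSES:
        problems.append(f"unknown status: {status}")
    return problems


if __name__ == "__main__":
    sample = (
        "# Proposal\n\n---\n"
        "Author: Jane Doe\n"
        "Created: 2025-01-01\n"
        "Name: demo proposal\n"
        "Issue: https://github.com/apache/parquet-format/issues/NNN\n"
        "Status: DRAFT\n"
        "---\n"
    )
    print(header_problems(parse_header(sample)))
```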
