feat: Support customizing S3 endpoints #1913
Conversation
I took a quick look -- pooling the STS clients seems like a reasonable thing to do, and this approach to doing so looks pretty correct to me. However, it's not totally clear to me why this is needed to implement #32. Looking at #389, this part of the code wasn't touched at all. Could we instead start more top-down with S3-compatible storage support and figure out whether these changes are actually necessary?
Why are we copying such generic code from another project?
Because it is convenient as opposed to writing it from scratch.
I'm curious what will happen if anyone changes this class later. Should we keep CODE_COPIED_TO_POLARIS forever?
To reduce the maintenance burden, I'd suggest rewriting it to avoid copying from another project.
OSS is meant to be reused. Rewriting to avoid attribution is against the grain of OSS concepts.
Re: forever: I'd guess it's very likely, as I do not personally see a reason for rewriting this code in any substantial way in the foreseeable future.
@eric-maynard @flyrain We have a ton of code taken from other projects already in the code base. Most of that was already part of the initial contribution to Apache. A lot of that was also "generic" and literally copies of rather trivial code. So I seriously do not understand what you're trying to say here. The goal should really not be to reinvent "all the wheels".
What’s an example of similarly generic code that’s been copied?
It's always good practice to reduce code dependencies. In this case, we have to add additional information to the LICENSE. If rewriting isn't a big task, which I believe it isn't, we might avoid extra maintenance burden. That's also just a suggestion; I'm NOT gonna block this PR over it.
License checks for CODE_COPIED_TO_POLARIS marks are automated - there's minimal cognitive effort involved in maintaining these license references (plus they are fixed and do not need to change over time).
There is, as mentioned, a lot of other projects' code in Polaris (OpenAPI generator files, the web site, code copied from Iceberg, build scripts, etc.).
There is no ongoing maintenance burden, but there is a huge plus in taking code that's been tested and known to work in production for years. Newly developed code is way more work.
Is this really related to S3-compatible storage support?
Stats are not directly related to supporting S3-compatible storage. However, the STS client pool is. Given that we have a pool in the system, it is good to expose metrics about it for observability.
If we are using the AWS SDK, it has metrics built in: https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/metrics-list.html. Is the current mechanism more comprehensive than that?
These stats are about the cache of STS clients we keep in Polaris code.
I see, we are just plotting a gauge for the max clients. Wouldn't it be better to have stats for the HTTP connection pool? For example, say I have 2 entries in the cache and one entry hogs 49 of my connections while the other holds only 1; I would want to know which one hogs the 49 connections.
Stats for the HTTP client would be nice to have for sure, but let's defer it to another PR. I do not have that code handy ATM and I would not want to bloat the scope of this PR even more :)
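For context, here's a minimal sketch of how such a pool gauge can be wired up with Micrometer and Caffeine; the metric names are illustrative, not the PR's actual names:

```java
import com.github.benmanes.caffeine.cache.Cache;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

class StsPoolMetrics {
  static void register(MeterRegistry registry, Cache<?, ?> stsClients, int maxSize) {
    // Current number of cached STS clients (an estimate, by Caffeine's design).
    Gauge.builder("polaris.sts.clients.count", stsClients, Cache::estimatedSize)
        .register(registry);
    // Configured upper bound of the pool, exposed as a constant gauge.
    Gauge.builder("polaris.sts.clients.max", () -> maxSize).register(registry);
  }
}
```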
This is the only usage of SdkHttpClient outside of QuarkusProducers itself, I wonder if we can simplify things by just injecting the StsClientBuilder
StsClientsPool is injected where it is needed. SdkHttpClient is injected into StsClientsPool. Are you suggesting to hide the SdkHttpClient from CDI? If we did that, it would not be possible to close the singleton SdkHttpClient using CDI callbacks.
If the suggestion is to make SdkHttpClient an inner (owned) field of StsClientsPool and close it together with the pool, IMHO that introduces unnecessary coupling, because the pool only needs the interface (namely SdkHttpClient), while QuarkusProducers chooses a specific implementation of that interface.
Yes, that is the suggestion. They are always coupled together, so encapsulating the client inside the pool simplifies usage of the pool.
Currently the pool and the HTTP client are not coupled. The pool needs a client. Yes, the client can be reused by other parts of Polaris (if it becomes necessary in the future).
The pool is easily injectable in the current state of the PR. Use sites need not be aware of how the client is constructed.
I'd like to keep the current approach of having distinct CDI beans for the HTTP client and the STS client pool.
Also, as @singhpk234 commented in another thread, we're going to support multiple HTTP client implementations (later), selectable at runtime (by config).
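A minimal sketch of this shape, assuming a Quarkus-style CDI producer (the class name `HttpClientProducer` is hypothetical; in the PR the producer lives in QuarkusProducers):

```java
import jakarta.enterprise.inject.Disposes;
import jakarta.enterprise.inject.Produces;
import jakarta.inject.Singleton;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;

class HttpClientProducer {
  // One shared "heavy" HTTP client bean; the pooled STS clients stay thin.
  @Produces
  @Singleton
  SdkHttpClient sdkHttpClient() {
    return ApacheHttpClient.builder().build();
  }

  // CDI invokes this disposer at container shutdown, which is exactly why the
  // client is exposed as a distinct bean instead of being hidden in the pool.
  void closeHttpClient(@Disposes SdkHttpClient client) {
    client.close();
  }
}
```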
singhpk234 left a comment:
My comment doesn't show up when linking here.
I can do that, but the diff is going to be larger ;)
@singhpk234: Sorry, I do not see your comments (even via the above links). Could you re-submit them?
Sure, I resubmitted!
I think I messed up last time when submitting, thank you for your patience ~
If the HTTP connection pool is shared, shouldn't we restrict the max connections per route? Let's say my pool size is 50 and route 1 is hit with 50 concurrent requests; the pool now only holds connections for route 1, and if route 2 comes along it starves. I am not sure how to do it here; HttpClient does this via PoolingConnectionManager.
Is the AWS SDK not able to deal with this situation on its side?
Did you observe this kind of starvation in practice?
This is practically possible. I think since we don't specify it, the HTTP client's default kicks in. TBH I haven't hit it in practice, since AWS EMR Spark jobs generally have one endpoint to connect to, but with a long-running service like Polaris I think the possibility increases. WDYT?
@snazy WDYT?
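For reference, a sketch of the knob that does exist on the SDK's Apache-based client: the total pool size can be capped via `maxConnections`, but this builder does not expose a per-route limit (the value 50 is illustrative):

```java
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;

class SharedHttpClientFactory {
  static SdkHttpClient create() {
    // Caps the shared pool at 50 connections in total; whether one route can
    // hog all of them is exactly the starvation concern raised above.
    return ApacheHttpClient.builder().maxConnections(50).build();
  }
}
```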
Why only ApacheHttpClient and not the URLConnection client?
I can make it configurable at runtime if you prefer. Let's do that as a follow-up PR, though.
ApacheHttpClient is more capable in general, AFAIK.
Iceberg supports both, hence this is coming from there. It's good to have an alternative, especially considering how they behave. For example, the Apache HTTP connection will always read the entire stream on close, which results in much more data being read than needed. The URL connection client does not behave this way. For more details: apache/iceberg#7262
Certainly. I do not mind supporting both, but let's do that in a follow-up PR to avoid bloating the scope of this PR (the main idea in this PR is not the ApacheHttpClient client, but supporting multiple endpoints).
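A sketch of what that follow-up could look like, picking the client implementation from a config value (the client types come from the AWS SDK; the config key and method are hypothetical):

```java
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient;

class HttpClientSelector {
  // "type" would come from runtime config, e.g. a hypothetical
  // polaris.storage.http-client.type property.
  static SdkHttpClient forType(String type) {
    return switch (type) {
      case "url-connection" -> UrlConnectionHttpClient.builder().build();
      default -> ApacheHttpClient.builder().build();
    };
  }
}
```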
Marking this PR as "draft" to add end-to-end support for MinIO. Please feel free to comment anyway.

Prerequisite PR: #1955
Force-pushed 8363e38 to de9c323
Thanks for doing this. I think it's the right direction. We could also make the CLI support the new fields, which could be done as a follow-up.

This is not the final PR for non-AWS S3 :) We should certainly add CLI support, but probably in a follow-up PR.
No functional change. Introduce a dedicated interface for `StsClient` suppliers and implement it using a pool of cached clients. All clients are "thin" and share the same `SdkHttpClient`. The latter is closed when the server shuts down. This is a step towards supporting non-AWS S3 storage (apache#1530). For this reason the STS endpoint is present in the new interfaces, but is not used yet.
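A condensed, illustrative sketch of that shape (the real `StsClientsPool` also wires in config and metrics, as the diff excerpt below shows; region and credentials wiring is omitted):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import software.amazon.awssdk.http.SdkHttpClient;
import software.amazon.awssdk.services.sts.StsClient;

class StsClientsPoolSketch {
  // Key type modeled after the PR's StsDestination; the endpoint is carried
  // in the interface but not used yet, per the commit message above.
  record StsDestination(java.net.URI endpoint) {}

  private final LoadingCache<StsDestination, StsClient> clients;

  StsClientsPoolSketch(int maxSize, SdkHttpClient sdkHttpClient) {
    this.clients = Caffeine.newBuilder()
        .maximumSize(maxSize)
        // Each cached client is "thin": it reuses the single shared HTTP client.
        .build(dest -> StsClient.builder().httpClient(sdkHttpClient).build());
  }

  StsClient stsClient(StsDestination destination) {
    return clients.get(destination);
  }
}
```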
```java
private final Function<StsDestination, StsClient> clientBuilder;

public StsClientsPool(
    S3AccessConfig effectiveSts, SdkHttpClient sdkHttpClient, MeterRegistry meterRegistry) {
```
Can we pass effectiveSts as an integer instead of passing the whole S3AccessConfig object?
Certainly - updated.
* chore(deps): update dependency mypy to >=1.17, <=1.17.0 (apache#2114)
* Spark 3.5.6 and Iceberg 1.9.1 (apache#1960), plus cleanup
* Add `pathStyleAccess` to AwsStorageConfigInfo (apache#2012). This change allows configuring the "path-style" access mode in S3 clients (both in Polaris Servers and Iceberg REST Catalog API clients). It is applicable both to AWS storage and to non-AWS S3-compatible storage (apache#1530).
* Add TestFileIOFactory helper (apache#2105), adding a FileIOFactory.wrapExisting helper
* fix(deps): update dependency gradle.plugin.org.jetbrains.gradle.plugin.idea-ext:gradle-idea-ext to v1.2 (apache#2125)
* fix(deps): update dependency boto3 to v1.39.7 (apache#2124)
* Abstract polaris-runtime-service tests for all persistence implementations (apache#2106). The NoSQL persistence implementation has to run the Iceberg table & view catalog tests plus the Polaris-specific tests as well; reusing existing tests avoids a lot of code duplication. This change moves the actual tests to `Abstract*` classes and refactors the existing tests to extend those. The NoSQL persistence work extends the same `Abstract*` classes but runs with different Quarkus test profiles.
* Add IMPLICIT authentication support to the CLI (apache#2121). PRs apache#1925 and apache#1912 were merged around the same time; this PR connects the two changes and enables the CLI to accept the IMPLICIT authentication type. Since Hadoop federated catalogs rely purely on IMPLICIT authentication, the CLI parsing test has been updated to reflect the same.
* feat(helm): Add support for external authentication (apache#2104)
* fix(deps): update dependency org.apache.iceberg:iceberg-bom to v1.9.2 (apache#2126)
* fix(deps): update quarkus platform and group to v3.24.4 (apache#2128)
* fix(deps): update dependency boto3 to v1.39.8 (apache#2129)
* fix(deps): update dependency io.smallrye.config:smallrye-config-core to v3.13.3 (apache#2130)
* Add newIcebergCatalog helper (apache#2134). Creation of `IcebergCatalog` instances was quite redundant, as tests mostly use the same parameters; also removes an unused field in two other tests.
* Add server and client support for the new generic table `baseLocation` field (apache#2122)
* Use Makefile to simplify setup and commands (apache#2027), with targets for minikube state management, podman support, the Spark plugin build, and a version target; README.md updated for Makefile usage and its relation to the project.
* Package polaris client as python package (apache#2049); change owner to spark when copying files from local into the Dockerfile
* CI: Address failure from accessing GH API (apache#2132). CI sometimes fails with `Unable to process url: https://api.github.com/repos/apache/polaris/contributors?per_page=1000` in `generatePomFileForMavenPublication`, which fetches the list of contributors to be published in the "root" POM. Unauthorized GH API requests have an hourly(?) limit of 60 requests per source IP; authorized requests have a much higher rate limit, and a GitHub token is available in every CI run. This change adds the `Authorization` header for the failing GH API request to leverage the higher rate limit and let CI not fail (that often).
* fix(deps): update dependency com.nimbusds:nimbus-jose-jwt to v10.4 (apache#2139)
* fix(deps): update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.2.0 (apache#2142)
* fix(deps): update dependency software.amazon.awssdk:bom to v2.32.4 (apache#2146)
* fix(deps): update dependency org.xerial.snappy:snappy-java to v1.1.10.8 (apache#2138)
* fix(deps): update dependency org.junit:junit-bom to v5.13.4 (apache#2147)
* fix(deps): update dependency boto3 to v1.39.9 (apache#2137)
* fix(deps): update dependency com.fasterxml.jackson:jackson-bom to v2.19.2 (apache#2136)
* Python client: add support for endpoint, sts-endpoint, path-style-access (apache#2127). This change adds support for endpoint, sts-endpoint, and path-style-access to the Polaris Python client. Amends apache#1913 and apache#2012.
* Remove PolarisEntityManager.getCredentialCache (apache#2133). `PolarisEntityManager` itself is not using the `StorageCredentialCache` but just hands it out via `getCredentialCache`; the only caller of `getCredentialCache` is `FileIOUtil.refreshAccessConfig`, which in turn is only called by `DefaultFileIOFactory` and `IcebergCatalog`. Note that a follow-up will likely be able to remove `PolarisEntityManager` usage completely from `IcebergCatalog`. Additional cleanups: use `StorageCredentialCache` injection in tests (invalidating all entries on test start) and remove the unused `UserSecretsManagerFactory` from `PolarisCallContextCatalogFactory`.
* chore(deps): update registry.access.redhat.com/ubi9/openjdk-21-runtime docker tag to v1.22-1.1752676419 (apache#2150)
* fix(deps): update dependency com.diffplug.spotless:spotless-plugin-gradle to v7.2.1 (apache#2152)
* fix(deps): update dependency boto3 to v1.39.10 (apache#2151)
* chore: fix class reference in the javadoc of TableLikeEntity (apache#2157)
* fix(deps): update dependency commons-codec:commons-codec to v1.19.0 (apache#2160)
* fix(deps): update dependency boto3 to v1.39.11 (apache#2159)
* Last merged commit 395459f

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Yong Zheng <yongzheng0809@gmail.com>
Co-authored-by: Dmitri Bourlatchkov <dmitri.bourlatchkov@gmail.com>
Co-authored-by: Christopher Lambert <xn137@gmx.de>
Co-authored-by: Pooja Nilangekar <poojan@umd.edu>
Co-authored-by: Alexandre Dutra <adutra@apache.org>
Co-authored-by: Yun Zou <yunzou.colostate@gmail.com>
This change enables using non-AWS S3 implementations (#1530) when `SKIP_CREDENTIAL_SUBSCOPING_INDIRECTION` is set to `false`.

* Adds `endpoint` to AWS `AwsStorageConfigInfo` (API change)
* Pooled STS clients share the same `SdkHttpClient`. The latter is closed when the server shuts down.

Existing catalogs are not affected by this change.
Dev ML discussion: https://lists.apache.org/thread/72psyjb40fb9l73sxld2qcr69l4tf4cw
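For illustration, this is roughly how a custom endpoint reaches the SDK clients under the hood via `endpointOverride` (the URL is a placeholder, e.g. for a MinIO deployment, not a value from the PR):

```java
import java.net.URI;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sts.StsClient;

class CustomEndpointExample {
  public static void main(String[] args) {
    // endpointOverride is the generic AWS SDK mechanism that customizable
    // S3/STS endpoint support builds on; the URI below is purely illustrative.
    try (StsClient sts = StsClient.builder()
        .region(Region.US_EAST_1)
        .endpointOverride(URI.create("http://s3.example.internal:9000"))
        .build()) {
      System.out.println(sts.serviceName());
    }
  }
}
```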