Added force virtual addressing configuration for S3, Alibaba OSS protocol to use PyArrowFileIO#1392
Conversation
| client_kwargs["session_name"] = session_name | ||
|
|
||
| if force_virtual_addressing := self.properties.get(S3_FORCE_VIRTUAL_ADDRESSING): | ||
| client_kwargs["force_virtual_addressing"] = property_as_bool(self.properties, force_virtual_addressing, False) |
There was a problem hiding this comment.
force_virtual_addressing docs
https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html
pyiceberg/io/pyarrow.py
Outdated
|
|
||
| def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem: | ||
| if scheme in {"s3", "s3a", "s3n"}: | ||
| if scheme in {"s3", "s3a", "s3n", "oss", "r2"}: |
There was a problem hiding this comment.
what is oss? I've never heard of it. And does S3FileSystem support both oss and r2?
There was a problem hiding this comment.
oss is protocol for Alibaba Cloud Object Storage Service, and it's compatible with S3 API as long as the URL is in virtual address style.

From my quick test it looks like the class does support OSS, but not sure yet about R2.
There was a problem hiding this comment.
Plugging out R2. It uses account ID to create the URL, which is quite different from how S3 natively set the virtual hosted style.
There was a problem hiding this comment.
thanks for testing it out! do you mind editing the PR description and mention this PR only supports oss
mkdocs/docs/configuration.md
Outdated
| | s3.region | us-west-2 | Sets the region of the bucket | | ||
| | s3.proxy-uri | <http://my.proxy.com:8080> | Configure the proxy server to be used by the FileIO. | | ||
| | s3.connect-timeout | 60.0 | Configure socket connection timeout, in seconds. | | ||
| | s3.force-virtual-addressing | False | Configure the style of requests. Set `False` to use path-style request and `True` for virtual-hosted-style request. | |
There was a problem hiding this comment.
nit: perhaps just copy/paste S3FileSystem docs
https://arrow.apache.org/docs/python/generated/pyarrow.fs.S3FileSystem.html
force_virtual_addressing bool, default False
Whether to use virtual addressing of buckets. If true, then virtual addressing is always enabled. If false, then virtual addressing is only enabled if endpoint_override is empty. This can be used for non-AWS backends that only support virtual hosted-style access.
|
reopened #21, and added a few comments on this PR. Thanks for the contribution! |
|
|
||
| def _initialize_fs(self, scheme: str, netloc: Optional[str] = None) -> FileSystem: | ||
| if scheme in {"s3", "s3a", "s3n"}: | ||
| if scheme in {"s3", "s3a", "s3n", "oss"}: |
There was a problem hiding this comment.
lets also update the docs to mention oss
https://py.iceberg.apache.org/configuration/#fileio
There was a problem hiding this comment.
should we also add oss to fsspec?
iceberg-python/pyiceberg/io/fsspec.py
Lines 245 to 255 in 68e17af
There was a problem hiding this comment.
looks like theres https://github.com/fsspec/ossfs
There was a problem hiding this comment.
looks like theres https://github.com/fsspec/ossfs
The project hasn't been maintained for almost a year now, and when I tried it on DuckDB it returns an unknown dependency error (?). I'd say it's probably safer to use s3fs than ossfs (not that I've tried it, though). I'll try making a new pull request for that later.
There was a problem hiding this comment.
makes sense, thanks for looking into it! lets keep the scope just pyarrow for now
|
looks like theres a linter issue, you can run |
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! thanks for the contribution!
|
I've tried to run |
|
@helmiazizm that's a intermittent issue with the CI, I've retrigger the run |
Fokko
left a comment
There was a problem hiding this comment.
Looks good, thanks for updating the docs as well @helmiazizm 🙌
* Added force virtual addressing configuration for S3. Also added oss and r2 protocol. * Rewrote force virtual addressing as written in PyArrow documentation * Added the missing r2 key value in schema_to_file_io * Removed R2 protocol for now * Linter fix * Updated documentation for OSS support * Another linter fix
Added the option to enforce virtual hosted style request for S3 PyArrow IO as it already has the support by PyArrow. Also added
oss://protocol into the list of S3 PyArrow scheme. This was addressed in #21 before.