feat: add partition field/partition spec#54
Conversation
gty404
commented
Mar 30, 2025
- Implement the API for partition spec/field.
- add pure virtual classes for transform
src/iceberg/partition_spec.h
Outdated
| [[nodiscard]] int32_t spec_id() const; | ||
| /// \brief Get a view of the partition fields. | ||
| [[nodiscard]] virtual std::span<const PartitionField> fields() const; | ||
| /// \brief Get a field by field ID. |
There was a problem hiding this comment.
Should we clarify whether these are source/transform field IDs?
There was a problem hiding this comment.
Perhaps we don't need these GetFieldByXXX at this moment?
There was a problem hiding this comment.
Perhaps we don't need these
GetFieldByXXXat this moment?
+1
There was a problem hiding this comment.
Should we clarify whether these are source/transform field IDs?
I will change it to transform_fields, and the PartitionField class will define source_field_id/transform_field_id
There was a problem hiding this comment.
Perhaps we don't need these
GetFieldByXXXat this moment?
Okay, I'll delete this part of the interface first
src/iceberg/partition_field.h
Outdated
| /// \brief Get the transform type. | ||
| [[nodiscard]] TransformType transform_type() const; |
There was a problem hiding this comment.
Java stores the full transform, not the type; do we need to do the same?
There was a problem hiding this comment.
I think so. Some transform types required additional parameters like bucket number or truncate length.
There was a problem hiding this comment.
I will change it to TransformFunction
src/iceberg/transform.h
Outdated
| #include <cstdint> | ||
| #include <memory> | ||
|
|
||
| #include "iceberg/arrow_c_data_internal.h" |
There was a problem hiding this comment.
Do not include any internal header in the public header file. To use Arrow C data, we can include arrow_c_data.h.
src/iceberg/transform.h
Outdated
| /// Always produces `null` | ||
| kVoid, | ||
| /// Used to represent some customized transform that can't be recognized or supported now. | ||
| kUnknown, |
There was a problem hiding this comment.
What about putting kUnknown to the beginning?
src/iceberg/transform.h
Outdated
| public: | ||
| explicit TransformFunction(TransformType type); | ||
| /// \brief Transform an input array to a new array | ||
| [[nodiscard]] virtual expected<std::unique_ptr<ArrowArray>, Error> Transform(const ArrowArray& inputArrowArray) = 0; |
There was a problem hiding this comment.
| [[nodiscard]] virtual expected<std::unique_ptr<ArrowArray>, Error> Transform(const ArrowArray& inputArrowArray) = 0; | |
| virtual expected<ArrowArray, Error> Transform(const ArrowArray& data) = 0; |
expected already has nodiscard attribute in its definition, we don't need it here.
For ArrowArray, usually we don't use std::unique_ptr since we have to explicitly call its release function.
src/iceberg/transform.h
Outdated
| [[nodiscard]] virtual TransformType transform_type() const; | ||
| [[nodiscard]] std::string ToString() const override; |
There was a problem hiding this comment.
| [[nodiscard]] virtual TransformType transform_type() const; | |
| [[nodiscard]] std::string ToString() const override; | |
| virtual TransformType transform_type() const; | |
| std::string ToString() const override; |
I don't think we need to add nodiscard to these functions.
src/iceberg/partition_spec.h
Outdated
| [[nodiscard]] int32_t spec_id() const; | ||
| /// \brief Get a view of the partition fields. | ||
| [[nodiscard]] virtual std::span<const PartitionField> fields() const; | ||
| /// \brief Get a field by field ID. |
There was a problem hiding this comment.
Perhaps we don't need these GetFieldByXXX at this moment?
src/iceberg/partition_spec.h
Outdated
| class ICEBERG_EXPORT PartitionSpec { | ||
| public: | ||
| virtual ~PartitionSpec() = default; | ||
| PartitionSpec(int32_t spec_id, std::vector<PartitionField> fields); |
There was a problem hiding this comment.
IMO, we need an Schema in the constructor to denote to the partition schema.
test/transform_test.cc
Outdated
| { | ||
| EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity)); | ||
| EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket)); | ||
| EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate)); | ||
| EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear)); | ||
| EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth)); | ||
| EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay)); | ||
| EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour)); | ||
| EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid)); | ||
| EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown)); | ||
| } |
There was a problem hiding this comment.
| { | |
| EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity)); | |
| EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket)); | |
| EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate)); | |
| EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear)); | |
| EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth)); | |
| EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay)); | |
| EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour)); | |
| EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid)); | |
| EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown)); | |
| } | |
| EXPECT_EQ("Identity", iceberg::ToString(iceberg::TransformType::kIdentity)); | |
| EXPECT_EQ("Bucket", iceberg::ToString(iceberg::TransformType::kBucket)); | |
| EXPECT_EQ("Truncate", iceberg::ToString(iceberg::TransformType::kTruncate)); | |
| EXPECT_EQ("Year", iceberg::ToString(iceberg::TransformType::kYear)); | |
| EXPECT_EQ("Month", iceberg::ToString(iceberg::TransformType::kMonth)); | |
| EXPECT_EQ("Day", iceberg::ToString(iceberg::TransformType::kDay)); | |
| EXPECT_EQ("Hour", iceberg::ToString(iceberg::TransformType::kHour)); | |
| EXPECT_EQ("Void", iceberg::ToString(iceberg::TransformType::kVoid)); | |
| EXPECT_EQ("Unknown", iceberg::ToString(iceberg::TransformType::kUnknown)); |
There was a problem hiding this comment.
Add the iceberg namespace so we don't have to add iceberg:: everywhere.
test/transform_test.cc
Outdated
| EXPECT_EQ("Unknown", std::format("{}", transform)); | ||
|
|
||
| auto [_, arrow_array] = | ||
| iceberg::internal::CreateExampleArrowSchemaAndArrayByNanoarrow(); |
There was a problem hiding this comment.
I will remove CreateExampleArrowSchemaAndArrayByNanoarrow some day. Perhaps add a TODO here?
There was a problem hiding this comment.
Okay, I'll build a simple arrow array to test
src/iceberg/exception.h
Outdated
| explicit IcebergError(const std::string& what) : std::runtime_error(what) {} | ||
| }; | ||
|
|
||
| #define ICEBERG_CHECK_FMT(condition, ...) \ |
There was a problem hiding this comment.
| #define ICEBERG_CHECK_FMT(condition, ...) \ | |
| #define ICEBERG_CHECK(condition, ...) \ |
src/iceberg/partition_field.h
Outdated
| /// \param[in] field_id The partition field ID. | ||
| /// \param[in] name The partition field name. | ||
| /// \param[in] transformType The transform type. | ||
| PartitionField(int32_t source_id, int32_t field_id, std::string name, TransformType transformType); |
There was a problem hiding this comment.
source_id->source_field_id? keep consistent with impl
There was a problem hiding this comment.
I think it is ok to be consistent with the spec: https://iceberg.apache.org/spec/#partition-specs
| V1 | V2 | V3 | Field | JSON representation | Example |
|----------|----------|----------|------------------|---------------------|--------------|
| required | required | omitted | **`source-id`** | `JSON int` | 1 |
| | | required | **`source-ids`** | `JSON list of ints` | `[1,2]` |
| | required | required | **`field-id`** | `JSON int` | 1000 |
| required | required | required | **`name`** | `JSON string` | `id_bucket` |
| required | required | required | **`transform`** | `JSON string` | `bucket[16]` |
There was a problem hiding this comment.
iceberg-java used Transform, iceberg-rust used Transform, which is an enum type, and TransformFunction serves as the implementation interface for Transform. Which one does iceberg-cpp tend to use? @wgtmac @yingcai-cy
58e0b5e to
be03f0a
Compare
|
@lidavidm @zhjwpku @yingcai-cy Do you want to take a look? |
