refactor(dataset): Redirect multipart upload through File Service #4136
Merged
chenlica merged 22 commits into apache:main on Jan 5, 2026
xuang7 reviewed on Dec 21, 2025
aicam requested changes on Dec 22, 2025
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala (outdated)
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala (outdated)
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala (outdated)
xuang7 reviewed on Jan 2, 2026
xuang7 (Contributor) left a comment:
Thanks for the PR! I have tested it, and the main functionality works across a range of file sizes. There is one issue when uploading the same file: both uploads were canceled due to the previous approach on the frontend. I left a few comments.
...flow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala
common/workflow-core/src/main/scala/org/apache/texera/service/util/S3StorageClient.scala
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala (outdated)
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala (outdated)
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala (outdated)
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala (outdated)
file-service/src/main/scala/org/apache/texera/service/resource/DatasetResource.scala (outdated)
.../app/dashboard/component/user/user-dataset/user-dataset-explorer/dataset-detail.component.ts (outdated)
…ttps://github.com/carloea2/texera into refactor/multipart_upload_through_dataset_resource
Contributor: @carloea2 Please resolve those finished conversations.
…ttps://github.com/carloea2/texera into refactor/multipart_upload_through_dataset_resource
Author (Contributor): @chenlica can we run the workflows to see if they pass?
aicam approved these changes on Jan 4, 2026
chenlica reviewed on Jan 4, 2026
chenlica (Contributor) left a comment:
I left comments. Please check.
...flow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala
.../app/dashboard/component/user/user-dataset/user-dataset-explorer/dataset-detail.component.ts (outdated)
file-service/src/test/scala/org/apache/texera/service/MockLakeFS.scala (outdated)
chenlica reviewed on Jan 4, 2026
...flow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala
chenlica reviewed on Jan 4, 2026
file-service/src/test/scala/org/apache/texera/service/MockLakeFS.scala (outdated)
What changes were proposed in this PR?
DB / schema
- Add `dataset_upload_session` to track multipart upload sessions, including:
  - `(uid, did, file_path)` as the primary key
  - `upload_id` (UNIQUE), `physical_address`
  - `num_parts_requested` to enforce expected part count
- Add `dataset_upload_session_part` to track per-part completion for a multipart upload:
  - `(upload_id, part_number)` as the primary key
  - `etag` (TEXT NOT NULL DEFAULT '') to persist per-part ETags for finalize
  - `CHECK (part_number > 0)` for sanity
  - `FOREIGN KEY (upload_id) REFERENCES dataset_upload_session(upload_id) ON DELETE CASCADE`

Backend (`DatasetResource`)
Multipart upload API (server-side streaming to S3, LakeFS manages multipart state):
- `POST /dataset/multipart-upload?type=init`
  - […] `num_parts_requested`.
  - Pre-creates rows in `dataset_upload_session_part` for part numbers `1..num_parts_requested` with `etag = ''` (enables deterministic per-part locking and simple completeness checks).
  - Rejects a second session for the same `(uid, did, file_path)` (409 Conflict). Race is handled via PK/duplicate handling + best-effort LakeFS abort for the losing initializer.
- `POST /dataset/multipart-upload/part?filePath=...&partNumber=...`
  - Requires `Content-Length` for streaming uploads.
  - Enforces `partNumber <= num_parts_requested`.
  - Locks the `(upload_id, part_number)` row using `SELECT … FOR UPDATE NOWAIT` to prevent concurrent uploads of the same part.
  - Persists the ETag in `dataset_upload_session_part.etag` (upsert/overwrite for retries).
- `POST /dataset/multipart-upload?type=finish`
  - Locks the session row using `SELECT … FOR UPDATE NOWAIT` to prevent concurrent finalize/abort.
  - Validates completeness using DB state:
    - […] `num_parts_requested` rows for the `upload_id`.
  - Fetches `(part_number, etag)` ordered by `part_number` from DB and completes multipart upload via LakeFS.
  - Deletes the DB session row; part rows are cleaned up via `ON DELETE CASCADE`.
  - NOWAIT lock contention is handled (mapped to "already being finalized/aborted", 409).
- `POST /dataset/multipart-upload?type=abort`
  - […] `SELECT … FOR UPDATE NOWAIT`.
  - […] `finish`.

Access control and dataset permissions remain enforced on all endpoints.

Frontend service (`dataset.service.ts`)
- `multipartUpload(...)` updated to reflect the server flow and return values (ETag persistence is server-side; frontend does not need to track ETags).

Frontend component (`dataset-detail.component.ts`)
- […] `type=abort` to clean up the upload session.

Any related issues, documentation, discussions?
Closes #4110
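
For readers following the session bookkeeping in the backend description, here is a rough in-memory model of the state transitions. It is only a sketch, not the code in `DatasetResource`: the class and method names, the `HttpError` type, and the 400/404 statuses (plus the 409 for an incomplete finish) are assumptions made for illustration. Only the duplicate-session 409, the `etag = ''` placeholders, the `partNumber <= num_parts_requested` check, and the ordered `(part_number, etag)` list come from the description. Row locking (`SELECT … FOR UPDATE NOWAIT`) and the LakeFS calls are left out.

```typescript
// Hypothetical error type carrying an HTTP-like status code.
class HttpError extends Error {
  constructor(public readonly status: number, message: string) {
    super(message);
  }
}

interface UploadSession {
  uploadId: string;
  numPartsRequested: number;
  // part_number -> etag; '' means "not uploaded yet", mirroring DEFAULT ''.
  parts: Map<number, string>;
}

class UploadSessionStore {
  // The string key stands in for the (uid, did, file_path) primary key.
  private sessions = new Map<string, UploadSession>();

  private static key(uid: number, did: number, filePath: string): string {
    return `${uid}/${did}/${filePath}`;
  }

  private require(uid: number, did: number, filePath: string): UploadSession {
    const session = this.sessions.get(UploadSessionStore.key(uid, did, filePath));
    if (!session) throw new HttpError(404, "no upload session"); // status is an assumption
    return session;
  }

  // type=init: reject duplicates, then pre-create one placeholder per part.
  init(uid: number, did: number, filePath: string, uploadId: string, numPartsRequested: number): void {
    const k = UploadSessionStore.key(uid, did, filePath);
    if (this.sessions.has(k)) throw new HttpError(409, "upload session already exists");
    if (!Number.isInteger(numPartsRequested) || numPartsRequested < 1) {
      throw new HttpError(400, "num_parts_requested must be a positive integer"); // assumption
    }
    const parts = new Map<number, string>();
    for (let n = 1; n <= numPartsRequested; n++) parts.set(n, "");
    this.sessions.set(k, { uploadId, numPartsRequested, parts });
  }

  // part upload: bounds-check the part number, then overwrite the ETag (retries win).
  recordPart(uid: number, did: number, filePath: string, partNumber: number, etag: string): void {
    const session = this.require(uid, did, filePath);
    if (!Number.isInteger(partNumber) || partNumber < 1 || partNumber > session.numPartsRequested) {
      throw new HttpError(400, "partNumber out of range"); // status is an assumption
    }
    session.parts.set(partNumber, etag);
  }

  // type=finish: every part needs a non-empty ETag; return them ordered by part number.
  finish(uid: number, did: number, filePath: string): Array<{ partNumber: number; etag: string }> {
    const session = this.require(uid, did, filePath);
    const ordered = Array.from(session.parts.entries())
      .sort((a, b) => a[0] - b[0])
      .map(([partNumber, etag]) => ({ partNumber, etag }));
    if (ordered.some((p) => p.etag === "")) {
      throw new HttpError(409, "upload is incomplete"); // status is an assumption
    }
    // Dropping the session also drops its parts, like ON DELETE CASCADE.
    this.sessions.delete(UploadSessionStore.key(uid, did, filePath));
    return ordered;
  }

  // type=abort: drop the session and its parts.
  abort(uid: number, did: number, filePath: string): void {
    this.require(uid, did, filePath);
    this.sessions.delete(UploadSessionStore.key(uid, did, filePath));
  }
}
```

Pre-creating the placeholder entries means a completeness check only has to look for empty ETags, which is the property the description attributes to the `etag = ''` rows.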
How was this PR tested?
Unit tests added/updated (multipart upload spec):
Manual testing via the dataset detail page (single and multiple uploads), verified:
Was this PR authored or co-authored using generative AI tooling?
GPT partial use.
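
As an illustration of the client side of the init / part / finish / abort flow in the PR description, the sketch below splits a byte buffer into numbered parts and drives a transport through the four steps. It is a minimal sketch under stated assumptions, not the code in `dataset.service.ts`: `MultipartTransport`, `planParts`, and `uploadInParts` are hypothetical names, the transport hides the actual HTTP calls and their parameters, rejecting empty input is an assumption, and aborting on any failure is one possible policy rather than the component's confirmed behavior.

```typescript
// Hypothetical transport: one method per endpoint named in the PR description.
interface MultipartTransport {
  init(filePath: string, numParts: number): Promise<void>;
  uploadPart(filePath: string, partNumber: number, chunk: Uint8Array): Promise<void>;
  finish(filePath: string): Promise<void>;
  abort(filePath: string): Promise<void>;
}

interface PartRange {
  partNumber: number; // 1-based, matching CHECK (part_number > 0)
  start: number; // inclusive byte offset
  end: number; // exclusive byte offset
}

// Split totalBytes into consecutive ranges of at most partSize bytes.
function planParts(totalBytes: number, partSize: number): PartRange[] {
  if (partSize <= 0) throw new RangeError("partSize must be positive");
  const parts: PartRange[] = [];
  for (let start = 0, n = 1; start < totalBytes; start += partSize, n++) {
    parts.push({ partNumber: n, start, end: Math.min(start + partSize, totalBytes) });
  }
  return parts;
}

// Drive init -> parts -> finish; on any failure, try to abort and rethrow.
async function uploadInParts(
  transport: MultipartTransport,
  filePath: string,
  data: Uint8Array,
  partSize: number
): Promise<number> {
  if (data.length === 0) throw new RangeError("empty input"); // assumption: no zero-part sessions
  const parts = planParts(data.length, partSize);
  await transport.init(filePath, parts.length);
  try {
    // Parts are sent sequentially here; no ETags are tracked on the client.
    for (const p of parts) {
      await transport.uploadPart(filePath, p.partNumber, data.subarray(p.start, p.end));
    }
    await transport.finish(filePath);
  } catch (err) {
    // Best-effort cleanup so the server-side session does not linger.
    await transport.abort(filePath).catch(() => undefined);
    throw err;
  }
  return parts.length;
}
```

Because ETags are persisted server-side, the client only needs the part numbers, which keeps the retry logic for a single part independent of the others.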