refactor(dataset): Redirect multipart upload through File Service #4130
carloea2 wants to merge 9 commits into apache:main
Conversation
Why I prefer backing multipart uploads with a DB: I am proposing to store multipart upload state in a DB instead of keeping everything fully stateless, for several reasons:
@carloea2 Please schedule a meeting to discuss the details, then report the results here.
```scala
val raw =
  s"$Version|${payload.uploadId}|${payload.did}|${payload.uid}|$filePathB64|$physicalB64"
```
I think there are structured ways to use encryption instead of manually concatenating fields with `|`.
The concatenation and use of `|` is just for convenience. Would you prefer a JSON approach? The encryption will still operate on the raw characters of the JSON.
I have changed it to JSON.
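For concreteness, a minimal sketch of what a JSON-shaped token payload could look like before encryption. The field names are taken from the diff, but everything else (record name, version field, Base64-encoding of path fields) is an assumption; a real implementation would serialize with a JSON library rather than build the string by hand as done here to stay dependency-free.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Main {
    // Hypothetical payload mirroring the fields seen in the diff.
    record UploadTokenPayload(String uploadId, int did, int uid,
                              String filePath, String physicalAddress) {
        // A production version would use a JSON library (e.g. Jackson);
        // the string is assembled manually here only for self-containment.
        String toJson() {
            return String.format(
                "{\"version\":1,\"uploadId\":\"%s\",\"did\":%d,\"uid\":%d,"
                + "\"filePath\":\"%s\",\"physicalAddress\":\"%s\"}",
                uploadId, did, uid,
                Base64.getEncoder().encodeToString(filePath.getBytes(StandardCharsets.UTF_8)),
                Base64.getEncoder().encodeToString(physicalAddress.getBytes(StandardCharsets.UTF_8)));
        }
    }

    public static void main(String[] args) {
        var p = new UploadTokenPayload("u-123", 7, 42, "data/a.csv", "s3://bucket/obj");
        System.out.println(p.toJson());
    }
}
```

Unlike `|`-delimited concatenation, this keeps field boundaries unambiguous even when a value contains the delimiter character.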
```scala
 * - malformed structure
 * - unsupported version
 */
def decode(token: String): UploadTokenPayload = {
```
Same here: the decode function does too much manual work; I believe a library should already provide high-level functions for this.
Even with JSON there will be some manual work to do; how would you modularize this?
The new version uses JSON.
```scala
val cipherText = cipher.doFinal(plain.getBytes(StandardCharsets.UTF_8))

val combined = new Array[Byte](iv.length + cipherText.length)
```
Add comments explaining which part of the algorithm each line performs and why it is needed.
Got it, thank you.
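To make the review request concrete, here is a commented sketch of the IV-prepending pattern the diff uses, written in Java against the same `javax.crypto` API. The key size, 12-byte IV, and AES-GCM transform are assumptions for illustration, not the PR's actual parameters:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class Main {
    public static void main(String[] args) throws Exception {
        // 1. Generate an AES key (in the PR this would be a server-side secret).
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        SecretKey key = kg.generateKey();

        // 2. A fresh random IV per encryption; GCM commonly uses 12 bytes.
        //    The IV is not secret, but it must never be reused with the same key.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        // 3. Encrypt the plaintext payload (GCM appends its auth tag to the output).
        String plain = "{\"uploadId\":\"u-1\",\"did\":42}";
        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] cipherText = enc.doFinal(plain.getBytes(StandardCharsets.UTF_8));

        // 4. Prepend the IV so the decryptor can recover it from the token itself.
        byte[] combined = new byte[iv.length + cipherText.length];
        System.arraycopy(iv, 0, combined, 0, iv.length);
        System.arraycopy(cipherText, 0, combined, iv.length, cipherText.length);

        // 5. Decrypt: split `combined` back into IV and ciphertext.
        byte[] iv2 = Arrays.copyOfRange(combined, 0, 12);
        byte[] ct2 = Arrays.copyOfRange(combined, 12, combined.length);
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv2));
        String out = new String(dec.doFinal(ct2), StandardCharsets.UTF_8);
        System.out.println(out.equals(plain) ? "roundtrip-ok" : "roundtrip-failed");
    }
}
```

With GCM, tampering with either the IV or the ciphertext makes `doFinal` throw, which is what lets the server trust the decoded token fields.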
```scala
final case class MultipartUploadInfo(key: String, uploadId: String)

/** Minimal info about a completed part in an upload. */
final case class PartInfo(partNumber: Int, eTag: String)
```
Are these case classes used as type definitions?
Yes, I only keep the info that is needed and discard the rest.
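The same "keep only what the flow needs" pattern can be sketched with Java records; this is a hypothetical mirror of the Scala case classes above, not code from the PR:

```java
public class Main {
    // Minimal DTOs: keep only the fields the upload flow needs and
    // discard the rest of the storage-layer response.
    record MultipartUploadInfo(String key, String uploadId) {}
    record PartInfo(int partNumber, String eTag) {}

    public static void main(String[] args) {
        var part = new PartInfo(1, "\"abc123\"");
        System.out.println(part.partNumber() + " " + part.eTag());
    }
}
```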
Thank you for your comments, Ali. I will take a look.
Following our discussion, we will go back to a DB approach: #4136
What changes were proposed in this PR?
Backend (DatasetResource)

Partially new multipart upload API:

- POST /dataset/multipart-upload?type=init → creates a LakeFS multipart upload session, encodes uploadId|did|uid|filePathB64|physicalAddress into a server-side-encrypted uploadToken, and returns it.
- POST /dataset/multipart-upload/part?token=...&partNumber=... → streams a single part.
- POST /dataset/multipart-upload?type=finish|abort → completes or aborts the LakeFS multipart upload.
- Keeps existing access control and dataset permissions enforced on all changed endpoints.
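A client sequences the three phases as init → part(s) → finish (or abort on failure). A small sketch of building the endpoint for each phase; the paths and query parameters come from the list above, while the helper name and its signature are assumptions:

```java
public class Main {
    // Hypothetical helper mapping each phase of the multipart upload API
    // described above to its endpoint. "init", "finish", and "abort" share
    // one path and differ only in the `type` query parameter; "part" carries
    // the encrypted uploadToken plus a part number.
    static String endpoint(String phase, String token, Integer partNumber) {
        switch (phase) {
            case "init":
            case "finish":
            case "abort":
                return "/dataset/multipart-upload?type=" + phase;
            case "part":
                return "/dataset/multipart-upload/part?token=" + token
                     + "&partNumber=" + partNumber;
            default:
                throw new IllegalArgumentException("unknown phase: " + phase);
        }
    }

    public static void main(String[] args) {
        // init -> upload N parts -> finish (or abort on failure)
        System.out.println(endpoint("init", null, null));
        System.out.println(endpoint("part", "tok", 1));
        System.out.println(endpoint("finish", null, null));
    }
}
```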
Frontend service (dataset.service.ts)

Main changes in multipartUpload(...):

- Calls init to get the encrypted uploadToken.
- Streams parts to /multipart-upload/part with concurrency.

Frontend component (dataset-detail.component.ts)

- Uses the uploadToken to cancel/abort.

Any related issues, documentation, discussions?
Closes #4110
How was this PR tested?
Manually uploaded large files via the dataset detail page (single and multiple) and checked: the upload status is COMPLETED, the LakeFS object is present, dataset version creation works, and use of the files in a workflow is correct.

Was this PR authored or co-authored using generative AI tooling?
Partial use of GPT.