GH-25025: [C++] Move non core compute kernels into separate shared library #46261

Merged: raulcd merged 53 commits into apache:main from raulcd:GH-25025-2 on Jun 13, 2025

@raulcd
Member

@raulcd raulcd commented Apr 29, 2025

Rationale for this change

Arrow is quite a heavy dependency, and some users don't need every tool we bundle with libarrow. Moving Arrow Compute into its own shared library lets users choose an installation that better suits their needs, with a smaller memory footprint if necessary.
It might also help some users add new kernels to an existing Arrow installation without recompiling it.

What changes are included in this PR?

  • Move the non-core Arrow Compute kernel functions to a new ArrowCompute shared library (libarrow_compute.so).
  • Create a new API to initialize Arrow Compute, which registers the existing kernels in the FunctionRegistry.
  • Update the Python/R/CGLib bindings to register the compute kernels automatically and transparently to the user.
  • Update the Linux packaging to provide the new arrow-compute library.
  • Update the documentation with the new requirement to call arrow::compute::Initialize().

Are these changes tested?

Yes, on all CI jobs.

Are there any user-facing changes?

Yes. The Arrow compute functions will be provided in a separate library. Anyone using Arrow Compute directly from C++ must call arrow::compute::Initialize() so that the functions and kernels are registered.

This PR includes breaking changes to public APIs.
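
For readers unfamiliar with the registration step described above, here is a minimal, self-contained sketch of the idea: kernels live in a registry that stays empty until an explicit initialization call populates it. Everything in the `toy` namespace is hypothetical and only illustrates the pattern. It is not Arrow's implementation; the real entry point named in this PR is arrow::compute::Initialize().

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

namespace toy {  // hypothetical names, for illustration only

// Maps a function name to a callable "kernel".
class FunctionRegistry {
 public:
  using Kernel = std::function<int(int, int)>;

  // Returns false if a kernel with this name is already registered.
  bool Add(const std::string& name, Kernel kernel) {
    return kernels_.emplace(name, std::move(kernel)).second;
  }

  // Calls the named kernel and stores the result in *out.
  // Returns false when the kernel has not been registered.
  bool Call(const std::string& name, int a, int b, int* out) const {
    auto it = kernels_.find(name);
    if (it == kernels_.end()) return false;
    *out = it->second(a, b);
    return true;
  }

 private:
  std::map<std::string, Kernel> kernels_;
};

// Process-wide registry shared by all callers.
inline FunctionRegistry& GetRegistry() {
  static FunctionRegistry registry;
  return registry;
}

// Registers the kernels once; repeated calls are harmless no-ops.
inline void Initialize() {
  static const bool done = [] {
    bool ok = GetRegistry().Add("add", [](int a, int b) { return a + b; });
    ok = GetRegistry().Add("multiply", [](int a, int b) { return a * b; }) && ok;
    assert(ok && "kernel names must be unique");
    return ok;
  }();
  (void)done;
}

}  // namespace toy
```

In this sketch, a lookup such as `toy::GetRegistry().Call("add", 2, 3, &out)` returns false until `toy::Initialize()` has run, which mirrors why C++ users now need the explicit initialization call.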

@raulcd
Member Author

raulcd commented Apr 29, 2025

@github-actions crossbow submit -g cpp

@raulcd force-pushed the GH-25025-2 branch 2 times, most recently from e8fc17f to f21df32, on April 30, 2025 08:10
@raulcd
Member Author

raulcd commented Apr 30, 2025

It seems I am down to one job failure, on the Windows R release job (CGLib Ruby is failing due to an existing issue):
I am unsure why I am getting missing symbols in dataset and acero when building statically. I've tried several changes to our *.pc.in and configure.win files to fix linking to the new arrow_compute, but I am unsure where exactly the problem is coming from.
@kou @nealrichardson @assignUser any idea what I can try?

$ g++ -shared -s -static-libgcc -o arrow.dll tmp.def RTasks.o altrep.o array.o array_to_vector.o arraydata.o arrowExports.o bridge.o buffer.o chunkedarray.o compression.o compute-exec.o compute.o config.o csv.o dataset.o datatype.o expression.o extension-impl.o feather.o field.o filesystem.o io.o json.o memorypool.o message.o parquet.o r_to_arrow.o recordbatch.o recordbatchreader.o recordbatchwriter.o safe-call-into-r-impl.o scalar.o schema.o symbols.o table.o threadpool.o type_infer.o -L../windows/arrow-20.0.0.9000/lib-14.2.0/x64 -L../windows/arrow-20.0.0.9000/lib/x64-ucrt -larrow_dataset -larrow_acero -lparquet -larrow_compute -larrow -larrow_bundled_dependencies -lutf8proc -lsnappy -lz -lzstd -llz4 -lbz2 -lbrotlienc -lbrotlidec -lbrotlicommon -lole32 -lbcrypt -lpsapi -lcrypto -lcrypt32 -lre2 -luserenv -lversion -lws2_32 -lbcrypt -lwininet -lwinhttp -lsecur32 -lshlwapi -lncrypt -lcurl -lnormaliz -lssh2 -lgdi32 -lssl -lcrypto -lcrypt32 -lwldap32 -lz -lws2_32 -lnghttp2 -ldbghelp -LC:/rtools45/x86_64-w64-mingw32
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_dataset.a(partition.cc.obj):(.text+0x7d9a): undefined reference to `__imp__ZN5arrow7compute7Grouper4MakeERKSt6vectorINS_10TypeHolderESaIS3_EEPNS0_11ExecContextE'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_dataset.a(partition.cc.obj):(.text+0x7f16): undefined reference to `__imp__ZN5arrow7compute7Grouper13MakeGroupingsERKNS_12NumericArrayINS_10UInt32TypeEEEjPNS0_11ExecContextE'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text+0x1f2e): undefined reference to `__imp__ZN5arrow4util15TempVectorStack5allocEjPPhPi'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text+0x1fb5): undefined reference to `__imp__ZN5arrow7compute9Hashing329HashBatchERKNS0_9ExecBatchEPjRSt6vectorINS0_14KeyColumnArrayESaIS7_EExPNS_4util15TempVectorStackExx'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text+0x20b5): undefined reference to `__imp__ZN5arrow4util15TempVectorStack7releaseEij'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text+0x2256): undefined reference to `__imp__ZN5arrow4util15TempVectorStack7releaseEij'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text+0x7c05): undefined reference to `__imp__ZN5arrow4util15TempVectorStack4InitEPNS_10MemoryPoolEx'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text+0x7c6f): undefined reference to `__imp__ZN5arrow4util15TempVectorStackD1Ev'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text$_ZN5arrow5acero12HashJoinNodeD1Ev[_ZN5arrow5acero12HashJoinNodeD1Ev]+0x2f): undefined reference to `__imp__ZN5arrow4util15TempVectorStackD1Ev'
C:\rtools45\x86_64-w64-mingw32.static.posix\bin/ld.exe: ../windows/arrow-20.0.0.9000/lib/x64-ucrt/libarrow_acero.a(hash_join_node.cc.obj):(.text$_ZN5arrow5acero26BloomFilterPushdownContext17FilterSingleBatchEyPNS_7compute9ExecBatchE[_ZN5arrow5acero26BloomFilterPushdownContext17FilterSingleBatchEyPNS_7compute9ExecBatchE]+0x64f): undefined reference to `__imp__ZN5arrow7compute9Hashing329HashBatchERKNS0_9ExecBatchEPjRSt6vectorINS0_14KeyColumnArrayESaIS7_EExPNS_4util15TempVectorStackExx'

There are still other improvements to make, but I wanted a clean CI run on this PR before moving on to the cleanup and improvement stage.

@kou
Member

kou commented May 1, 2025

Acero and Dataset static libraries weren't built with -DARROW_COMPUTE_STATIC:

https://github.com/apache/arrow/actions/runs/14751558013/job/41410004318?pr=46261#step:7:14525

 [ 92%] Building CXX object src/arrow/acero/CMakeFiles/arrow_acero_static.dir/groupby_aggregate_node.cc.obj
cd /D/a/arrow/arrow/src/build-x86_64-cpp/src/arrow/acero && /C/rtools40/ucrt64/bin/ccache.exe /C/rtools40/ucrt64/bin/c++.exe -DARROW_ACERO_EXPORTING -DARROW_ACERO_STATIC -DARROW_FLIGHT_SQL_STATIC -DARROW_FLIGHT_STATIC -DARROW_HAVE_RUNTIME_AVX2 -DARROW_HAVE_RUNTIME_BMI2 -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_STATIC -DARROW_WITH_TIMING_TESTS -DBOOST_ALL_NO_LIB -DUTF8PROC_STATIC -D_CRT_SECURE_NO_WARNINGS @CMakeFiles/arrow_acero_static.dir/includes_CXX.rsp -Wno-noexcept-type -DCURL_STATICLIB -fdiagnostics-color=always  -Wa,-mbig-obj -Wall -fno-semantic-interposition -mxsave -msse4.2 -DUTF8PROC_STATIC -O3 -DNDEBUG -O2 -ftree-vectorize  -std=c++17 -MD -MT src/arrow/acero/CMakeFiles/arrow_acero_static.dir/groupby_aggregate_node.cc.obj -MF CMakeFiles/arrow_acero_static.dir/groupby_aggregate_node.cc.obj.d -o CMakeFiles/arrow_acero_static.dir/groupby_aggregate_node.cc.obj -c /D/a/arrow/arrow/cpp/src/arrow/acero/groupby_aggregate_node.cc

Could you add the following

if(ARROW_BUILD_STATIC AND WIN32)
  target_compile_definitions(arrow_compute_static PUBLIC ARROW_COMPUTE_STATIC)
endif()

like we did for Flight?

if(ARROW_BUILD_STATIC AND WIN32)
  target_compile_definitions(arrow_flight_static PUBLIC ARROW_FLIGHT_STATIC)
endif()
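
A note on why the missing define produces exactly these errors: the `__imp_` prefix in the undefined references is how the toolchain names a symbol that was declared `__declspec(dllimport)`, so objects compiled without the static define look for a DLL import stub that a static archive never provides. The sketch below shows the usual export-macro pattern behind this. The DEMO_* names are hypothetical, and this is not a copy of Arrow's visibility header.

```cpp
// Hypothetical DEMO_* macros illustrating the common export-header pattern.
// Defined here so the sketch is self-contained; a build system would normally
// pass it as -DDEMO_STATIC (e.g. via target_compile_definitions(... PUBLIC ...)).
#define DEMO_STATIC

#if defined(_WIN32) || defined(__CYGWIN__)
#  if defined(DEMO_STATIC)
#    define DEMO_EXPORT  // static archive: plain symbols, no import stubs
#  elif defined(DEMO_EXPORTING)
#    define DEMO_EXPORT __declspec(dllexport)  // building the DLL itself
#  else
#    define DEMO_EXPORT __declspec(dllimport)  // consumer: references __imp_ symbols
#  endif
#else
#  define DEMO_EXPORT __attribute__((visibility("default")))
#endif

#if defined(DEMO_STATIC)
constexpr bool kDemoStaticBuild = true;
#else
constexpr bool kDemoStaticBuild = false;
#endif

// With DEMO_STATIC defined, this declaration carries no dllimport attribute,
// so a consumer links directly against the definition below.
DEMO_EXPORT int DemoAnswer();

int DemoAnswer() { return 42; }
```

Marking the definition PUBLIC in CMake matters because it propagates to every target that links against the static library, which is how dependents such as Acero and Dataset would pick it up.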

@github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on May 1, 2025
@github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on May 2, 2025
@raulcd
Member Author

raulcd commented May 2, 2025

@github-actions crossbow submit -g nightly-tests

@raulcd
Member Author

raulcd commented May 5, 2025

@github-actions crossbow submit -g nightly-tests

@github-actions bot added the "awaiting changes" and "awaiting change review" labels and removed the "awaiting change review" and "awaiting changes" labels on May 5, 2025
* Revert manual edit of arrowExports.cpp

* Re-do R initialization change
@raulcd
Member Author

raulcd commented Jun 6, 2025

@github-actions crossbow submit -g cpp

@github-actions

github-actions bot commented Jun 6, 2025

Revision: 9001ec2

Submitted crossbow builds: ursacomputing/crossbow @ actions-74be4cb1e9

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

Member

@pitrou pitrou left a comment

LGTM in general, I posted two small suggestions.

amoeba and others added 2 commits June 9, 2025 08:49
Co-authored-by: Rossi Sun <zanmato1984@gmail.com>
Contributor

@zanmato1984 zanmato1984 left a comment

+1.

Excellent work @raulcd !

Member

@rok rok left a comment

I was only able to do a quick pass. Looks great, love the ARROW_COMPUTE_EXPORT macro.

Member

@jonkeane jonkeane left a comment

Thank you for all of this, it's been a long journey!

@raulcd
Member Author

raulcd commented Jun 11, 2025

Thanks everyone for taking the time to review. The PR has already been approved by 4 committers, so if no more concerns are raised I plan to merge in a couple of days.

@raulcd
Member Author

raulcd commented Jun 13, 2025

Thanks everyone, I am merging!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit 6a5db61.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

9 participants