Fix DataFusion EventDate handling: cast UInt16 to DATE #803
KARTIK64-rgb wants to merge 3 commits into ClickHouse:main
Force-pushed from cb33c5f to f5fa7cb
OPTIONS ('binary_as_string' 'true');

CREATE VIEW hits AS
I think we need to apply the same thing to the partitioned version in https://github.com/ClickHouse/ClickBench/tree/main/datafusion-partitioned as well
Sure, I will update datafusion-partitioned/create.sql.
Thanks for the guidance.
Thanks @KARTIK64-rgb - this looks good
This is also related to the discussion here, several years ago
STORED AS PARQUET
LOCATION 'hits.parquet'
OPTIONS ('binary_as_string' 'true');
I checked out this PR, then ran cd datafusion and ./benchmark and it printed this (after successfully compiling datafusion):
[...]
2026-03-01 11:06:31 (95.4 MB/s) - ‘hits.parquet’ saved [14779976446/14779976446]
Run benchmarks
[0.015, 0.003, 0.003],
[0.015, 0.003, 0.003],
[0.014, 0.003, 0.003],
[0.014, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.011, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.015, 0.003, 0.003],
[0.011, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.013, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.013, 0.003, 0.003],
[0.011, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.013, 0.003, 0.003],
[0.014, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.012, 0.003, 0.003],
[0.012, 0.003, 0.004],
[0.013, 0.004, 0.003],
[0.012, 0.003, 0.003],
[0.013, 0.003, 0.003],
[0.013, 0.003, 0.003],
[0.011, 0.003, 0.003],
[0.012, 0.003, ^C
The printed values are obviously unrealistically low.
To double-check, I ran datafusion-cli manually:
arrow-datafusion/target/release/./datafusion-cli -f create.sql /tmp/query.sql
(query.sql contained one of the ClickBench queries.) This ran for > 30 seconds and printed reasonable output. It looks like something is wrong with the way run.sh parses datafusion-cli's output. I can't tell whether the problem was introduced with this PR or already existed before. Anyway, would you like to check / fix this? Thanks!
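One way to catch this kind of output early is a small sanity check over the printed timing rows. The following is only a sketch: the helper names and the 0.05 second threshold are assumptions made for illustration and are not part of ClickBench or run.sh.

```python
def parse_timing_rows(output: str) -> list[list[float]]:
    """Extract rows such as '[0.015, 0.003, 0.003],' from benchmark output."""
    rows = []
    for line in output.splitlines():
        text = line.strip().rstrip(",")
        if not (text.startswith("[") and text.endswith("]")):
            continue  # skip download logs, truncated rows and other noise
        try:
            rows.append([float(cell) for cell in text[1:-1].split(",")])
        except ValueError:
            continue  # bracketed but not numeric, e.g. an elision marker
    return rows


def suspicious_rows(rows: list[list[float]], min_cold_seconds: float = 0.05) -> list[int]:
    """Return indexes of rows whose first (cold) run is below the threshold.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return [i for i, row in enumerate(rows) if row and row[0] < min_cold_seconds]
```

If nearly every row is flagged, the timings are probably not measuring query execution and the parsing step deserves a closer look.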
Thanks for the update, I will check what the issue is.
Summary
ClickBench encodes EventDate as a UInt16 representing days since 1970-01-01. The DataFusion runner was treating it as a raw integer, causing ClickBench queries 36–42 to return 0 rows due to broken date range predicates.
Changes
Updated the DataFusion runner scripts under datafusion/ to cast EventDate from UInt16 to a proper SQL DATE.
Correctness fix, no meaningful performance impact.
Without this change, queries Q37-Q43 (numbered from 1; these are queries 36–42 when numbered from 0), which filter on EventDate using date literals (e.g., >= '2013-07-01'), return incorrect results (0 rows) because EventDate is stored as INT32 in the parquet file and DataFusion compares it against string literals. The new view casts EventDate to a proper DATE type.
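To make the encoding concrete, here is a minimal Python sketch of the day-count representation described above. It is not code from this PR, and the helper names are hypothetical; it only illustrates why a raw day count has to be decoded into a date before a predicate such as >= '2013-07-01' means what the query intends.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def event_date_from_days(days: int) -> date:
    """Decode a UInt16 day count since 1970-01-01 into a calendar date."""
    if not 0 <= days <= 0xFFFF:
        raise ValueError(f"not a UInt16 day count: {days}")
    return EPOCH + timedelta(days=days)


def days_from_date(d: date) -> int:
    """Encode a calendar date as the number of days since 1970-01-01."""
    return (d - EPOCH).days


def on_or_after(raw_days: int, lower: date) -> bool:
    """Apply a date lower bound to a raw value by decoding it first."""
    return event_date_from_days(raw_days) >= lower


# The literal '2013-07-01' corresponds to the raw stored value 15887.
assert days_from_date(date(2013, 7, 1)) == 15887
assert on_or_after(15887, date(2013, 7, 1))
```

The view in this PR does the equivalent decoding once, in SQL, so the benchmark queries can keep their date literals unchanged.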
References
hits view: apache/datafusion#19881