Panic in HashJoin with dictionary-encoded column in multi-column join key #20437

@erratic-pattern

Describe the bug

When executing a hash join with multiple join keys where one column is dictionary-encoded with fewer unique values than rows, DataFusion panics with:

InvalidArgumentError("Incorrect array length for StructArray field \"c1\", expected N got M")

To Reproduce

-- Small table with dictionary-encoded region (2 rows, 1 unique value)
CREATE TABLE small AS
SELECT id, arrow_cast(region, 'Dictionary(Int32, Utf8)') as region
FROM (VALUES (1, 'west'), (2, 'west')) AS t(id, region);

CREATE TABLE large AS
SELECT id, region, value
FROM (VALUES (1, 'west', 100), (2, 'west', 200), (3, 'east', 300)) AS t(id, region, value);

-- Multi-column join triggers panic
SELECT s.id, s.region, l.value
FROM small s
JOIN large l ON s.id = l.id AND s.region = l.region;

Expected behavior

Query returns 2 rows:

+----+--------+-------+
| id | region | value |
+----+--------+-------+
| 1  | west   | 100   |
| 2  | west   | 200   |
+----+--------+-------+

Actual behavior

Panic:

thread 'main' panicked at arrow-array/src/array/struct_array.rs:91:46:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect array length for StructArray field \"c1\", expected 3 got 2")

Root cause

The bug is in flatten_dictionary_array, introduced by #18393:

fn flatten_dictionary_array(array: &ArrayRef) -> ArrayRef {
    downcast_dictionary_array! {
        array => {
            flatten_dictionary_array(array.values())
        }
        _ => Arc::clone(array)
    }
}

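The function calls array.values(), which returns the dictionary's values array (one entry per distinct value), not the logical array with one value per row. When the dictionary has fewer distinct values than rows, the flattened column is shorter than the column it replaces.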
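To make the length difference concrete, here is a minimal sketch using a toy model of a dictionary-encoded column. `ToyDictionary`, `decode`, and `small_region` are hypothetical names invented for illustration; this is not the arrow-rs `DictionaryArray` API, only a std-only stand-in that assumes the usual keys-plus-values layout.

```rust
/// Toy stand-in for a dictionary-encoded string column (NOT the arrow-rs type).
/// `keys` has one entry per row; `values` has one entry per distinct value.
struct ToyDictionary {
    keys: Vec<usize>,
    values: Vec<String>,
}

impl ToyDictionary {
    /// Logical row count: one key per row.
    fn len(&self) -> usize {
        self.keys.len()
    }

    /// The distinct values only, which is what the issue says `values()` returns.
    fn values(&self) -> &[String] {
        &self.values
    }

    /// Decodes to one value per row by looking up each key in `values`.
    fn decode(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

/// Models the `region` column of the `small` table: two rows, one distinct value.
fn small_region() -> ToyDictionary {
    ToyDictionary {
        keys: vec![0, 0],
        values: vec!["west".to_string()],
    }
}
```

In this model `small_region().len()` is 2 while `small_region().values().len()` is 1, so anything that substitutes the values for the column loses a row. `decode()` shows the row-preserving alternative: resolve each key against the values so the result keeps the original row count. Whether the actual fix should decode, cast, or take a different route is not something the issue states.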

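When the multi-column join keys are combined into a StructArray, every field must have the same length. StructArray::try_new_with_length() detects the mismatch and returns the error that is then unwrapped into the panic: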

if a.len() != len {
    return Err(ArrowError::InvalidArgumentError(format!(
        "Incorrect array length for StructArray field {:?}, expected {} got {}",
        f.name(), len, a.len()
    )));
}
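The effect of that check can be sketched without arrow at all. The function below is a hypothetical, simplified model of the per-field validation quoted above: it takes field names and lengths rather than real arrays. The field name `c0` is an assumption chosen to sit alongside the `c1` that appears in the error message.

```rust
/// Toy version of the per-field length check (NOT the arrow-rs implementation).
/// `fields` pairs each field name with that column's length.
fn check_field_lengths(len: usize, fields: &[(&str, usize)]) -> Result<(), String> {
    for (name, field_len) in fields {
        if *field_len != len {
            // Same message format as the snippet quoted in the issue.
            return Err(format!(
                "Incorrect array length for StructArray field {:?}, expected {} got {}",
                name, len, field_len
            ));
        }
    }
    Ok(())
}
```

Called as `check_field_lengths(2, &[("c0", 2), ("c1", 1)])`, where the second field stands for a two-row dictionary column reduced to its single distinct value, the function returns an error naming `c1`. With both lengths equal to 2 it returns `Ok(())`.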

Labels: bug, regression
