Repo for the paper "MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation" (NAACL 2025).
We introduce a benchmark for memory-augmented chatbots, proposing a two-stage memory-recall task and a multi-recall paradigm grounded in cognitive science and psychological theory.
The dialogue and memory data for testing are in the `data` directory.
- `relevant-id` in `data/en/MADial-Bench-en-memory.json` and `data/zh/MADial-Bench-zh-memory.json` gives the indices of the related memories in the memory archive.
- `test-turn` in `data/en/MADial-Bench-en-dialogue.json` and `data/zh/MADial-Bench-zh-dialogue.json` marks the turns to be predicted.
The embeddings of the dialogues and memories for the memory recall task are in the `embeddings` directory.
The inference results from LLMs on the memory recognition and response generation tasks are in the `output` directory.
Evaluation results are in the `results` directory, with annotation files in the `annotation` directory. The criteria and guidelines are in the `annotation` directory as well.
If you have read the paper and are comfortable with coding, you can already make use of the benchmark:
Memory recall:
- First download the embedding models and save them in `pretrained_models`.
- Run `Embeddings.py` to generate embeddings for the dialogues and memories.
- Run `embeddings_top_20_new.py` to get the top-20 candidates.
- Run `embedding_scores_new.py` to calculate the scores for the retrieval metrics.

Response generation and evaluation:
- For the English version, run `make_setting_candidates.py` to generate dialogues for settings 2 and 3.
- First download the open-source LLMs and save them in `pretrained_models`. If you use an API, skip this step.
- Edit `infer_setting1/2/3_en/ch.py` to load your LLM, then run the inference script.
- Copy the output file path into `evaluate.py` and run it for automatic evaluation. Automatic evaluation is not reliable, so we recommend human evaluation instead.
- Human evaluation: the criteria and guidelines are in the `annotation` directory. You are welcome to try LLM-as-judge, but we found that LLMs (as of 2024-10-25) are unable to do such a careful job.
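The core of the top-20 retrieval step can be sketched as ranking memory embeddings by cosine similarity against a dialogue embedding. This is only a minimal illustration; `embeddings_top_20_new.py` handles the real embedding files and may differ in details:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, memory_embs, k=20):
    # Score every memory, then keep the k most similar (best first).
    scored = sorted(
        ((cosine(query_emb, m), i) for i, m in enumerate(memory_embs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

# Tiny example with 3-dimensional embeddings and k=2.
query = [1.0, 0.0, 0.0]
memories = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
candidates = top_k(query, memories, k=2)
```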
To build an annotation file from the inference results of the different LLMs:
- Run `make_annotation_candidates.py` to group the dialogues and responses together.
- Run `prepare_anno.py` to sample a certain number of dialogues from the test set and write them to an Excel file.
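The grouping-and-sampling steps above can be sketched as follows. The record fields and sample size here are illustrative assumptions; `prepare_anno.py`'s actual input format and Excel export are not reproduced:

```python
import random

# Hypothetical response records from two models over ten dialogues.
responses = [
    {"dialogue_id": d, "model": m, "response": f"{m} reply for dialogue {d}"}
    for d in range(10)
    for m in ("model_a", "model_b")
]

# Group all model responses for the same dialogue together.
grouped = {}
for r in responses:
    grouped.setdefault(r["dialogue_id"], []).append(r)

# Sample a fixed number of dialogues for the annotation sheet.
random.seed(0)
sampled_ids = random.sample(sorted(grouped), k=5)
annotation_batch = {d: grouped[d] for d in sampled_ids}
```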
The code needs to be rewritten if you want a one-click start. Sorry for the inconvenience; I will tidy it up as soon as possible.
Please feel free to ask any questions and report issues.
```bibtex
@inproceedings{he-etal-2025-madial,
title = "{MAD}ial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation",
author = "He, Junqing and
Zhu, Liang and
Wang, Rui and
Wang, Xi and
Haffari, Gholamreza and
Zhang, Jiaxing",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.499/",
pages = "9902--9921",
ISBN = "979-8-89176-189-6"
}
```

```bibtex
@misc{he2024madialbenchrealworldevaluationmemoryaugmented,
title={MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation},
author={Junqing He and Liang Zhu and Rui Wang and Xi Wang and Reza Haffari and Jiaxing Zhang},
year={2024},
eprint={2409.15240},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15240},
}
```