Repo for the paper "MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation" (NAACL 2025).
We introduce a benchmark for memory-augmented chatbots, proposing a two-stage memory-recall task and a multi-recall paradigm grounded in cognitive science and psychological theory.
The dialogue and memory data for testing are in the `data` directory.
- `relevant-id` in `data/en/MADial-Bench-en-memory.json` and `data/zh/MADial-Bench-zh-memory.json` gives the indices of the related memories in the memory archive.
- `test-turn` in `data/en/MADial-Bench-en-dialogue.json` and `data/zh/MADial-Bench-zh-dialogue.json` marks the turns to be predicted.
The embeddings of the dialogues and memories for the memory recall task are in the `embeddings` directory.
The inference results from LLMs on the memory recognition and response generation tasks are in the `output` directory.
Evaluation results are in the `results` directory, with annotation files in the `annotation` directory. The criteria and guidelines are in the `annotation` directory as well.
If you have read the paper and are comfortable with coding, you can already make use of the benchmark:
Memory recall:
- First download the embedding models and save them in `pretrained_models`.
- Run `Embeddings.py` to generate embeddings for the dialogues and memories.
- Run `embeddings_top_20_new.py` to get the top-20 candidates.
- Run `embedding_scores_new.py` to calculate the scores for the retrieval metrics.

Response generation and evaluation:
- For the English version, run `make_setting_candidates.py` to generate dialogues for settings 2 and 3.
- First download the open-source LLMs and save them in `pretrained_models`. If you use an API, skip this step.
- Edit `infer_setting1/2/3_en/ch.py` to load your LLM, then run the inference script.
- Copy the output file path into `evaluate.py` and run it for automatic evaluation. Automatic evaluation is not reliable, so we recommend human evaluation instead.
- Human evaluation: the criteria and guidelines are in the `annotation` directory. You are welcome to try LLM-as-judge, but we found that LLMs (as of 2024-10-25) are unable to do such a careful job.
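The core of the top-20 retrieval step can be sketched as ranking memory embeddings by cosine similarity against a dialogue embedding. This is only a minimal illustration; `embeddings_top_20_new.py` handles the real embedding files and may differ in details:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, memory_embs, k=20):
    # Score every memory, then keep the k most similar (best first).
    scored = sorted(
        ((cosine(query_emb, m), i) for i, m in enumerate(memory_embs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

# Tiny example with 3-dimensional embeddings and k=2.
query = [1.0, 0.0, 0.0]
memories = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
candidates = top_k(query, memories, k=2)
```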
To build an annotation file from the inference results of the different LLMs:
- Run `make_annotation_candidates.py` to group the dialogues and responses together.
- Run `prepare_anno.py` to sample a certain number of dialogues from the test set and write them to an Excel file.
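The grouping-and-sampling steps above can be sketched as follows. The record fields and sample size here are illustrative assumptions; `prepare_anno.py`'s actual input format and Excel export are not reproduced:

```python
import random

# Hypothetical response records from two models over ten dialogues.
responses = [
    {"dialogue_id": d, "model": m, "response": f"{m} reply for dialogue {d}"}
    for d in range(10)
    for m in ("model_a", "model_b")
]

# Group all model responses for the same dialogue together.
grouped = {}
for r in responses:
    grouped.setdefault(r["dialogue_id"], []).append(r)

# Sample a fixed number of dialogues for the annotation sheet.
random.seed(0)
sampled_ids = random.sample(sorted(grouped), k=5)
annotation_batch = {d: grouped[d] for d in sampled_ids}
```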
The code needs to be rewritten if you want a one-click start. Sorry for the inconvenience; I will tidy it up as soon as possible.
Please feel free to ask any questions and report issues.
```bibtex
@inproceedings{he-etal-2025-madial,
title = "{MAD}ial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation",
author = "He, Junqing and
Zhu, Liang and
Wang, Rui and
Wang, Xi and
Haffari, Gholamreza and
Zhang, Jiaxing",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.499/",
pages = "9902--9921",
ISBN = "979-8-89176-189-6"
}
```

```bibtex
@misc{he2024madialbenchrealworldevaluationmemoryaugmented,
title={MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation},
author={Junqing He and Liang Zhu and Rui Wang and Xi Wang and Reza Haffari and Jiaxing Zhang},
year={2024},
eprint={2409.15240},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15240},
}
```