expect json response from the LLM by davidsbailey · Pull Request #53 · code-dot-org/aiproxy

davidsbailey · 2024-02-15T22:33:32Z

Description

Adds option to handle a json response from the LLM instead of a TSV response. This improves accuracy and reliability for some LLMs including GPT 4 Turbo.

The code changes in this PR enable running both the 'json' and 'reason' scenarios in the LLM comparison spreadsheet, for GPT 4 classic and turbo models.

Code changes

add response type param to params.json, which can be either 'tsv' (default) or 'json'.
use response type to determine which filename suffix to read example LLM responses from
add parse_json_response method
use response type to decide whether to parse the LLM response as json
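The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the name `parse_json_response` and the 'tsv'/'json' values come from the description, while `example_response_path`, `parse_tsv_response`, `parse_response`, the suffix mapping, and the field names in the sample data are hypothetical. The first-`[`-to-last-`]` slicing is an assumption based on the "look for last ] when parsing json" commit.

```python
import json

# Hypothetical mapping from response type to example-file suffix.
SUFFIX_BY_RESPONSE_TYPE = {"tsv": ".tsv", "json": ".json"}


def example_response_path(examples_dir, student_id, response_type="tsv"):
    """Build the path of an example LLM response for the given response type."""
    suffix = SUFFIX_BY_RESPONSE_TYPE[response_type]
    return f"{examples_dir}/{student_id}{suffix}"


def parse_json_response(text):
    """Parse the JSON array embedded in an LLM reply.

    Slices from the first '[' to the last ']' so that any prose the model
    adds around the array is ignored (assumed behavior, see lead-in).
    """
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")
    return json.loads(text[start:end + 1])


def parse_tsv_response(text):
    """Parse a tab-separated reply whose first line is a header row."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


def parse_response(text, response_type="tsv"):
    """Dispatch on response type; 'tsv' stays the default."""
    if response_type == "json":
        return parse_json_response(text)
    return parse_tsv_response(text)
```

With this sketch, `parse_response(reply, "json")` tolerates a reply such as `Here you go: [...] Let me know!`, though it would still fail if the surrounding prose itself contained square brackets.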

S3 updates

alongside this PR, the following data has been uploaded to s3://cdo-ai/teaching_assistant/experiments/:

ai-rubrics-json: copy of ai-rubrics-pilot-baseline, with:
- system prompts updated to request JSON instead of TSV
- L14 and L18 updated to provide JSON instead of TSV example responses
ai-rubrics-json-reason: copy of ai-rubrics-json, with:
- system prompt modified to request Reason before Grade, to improve chain-of-thought reasoning
- L14 and L18 examples regenerated using GPT 4 classic and then hand-tuned for correctness (see below)
ai-rubrics-json-gpt-4-turbo: copy of ai-rubrics-json, with gpt-4-0125-preview model
ai-rubrics-json-reason-gpt-4-turbo: copy of ai-rubrics-json-reason, with gpt-4-0125-preview model
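For illustration, converting a TSV example response into a JSON one, optionally placing Reason before Grade, might look like the sketch below. The helper `tsv_example_to_json` and the column names are hypothetical; the actual example files in the S3 experiments may use different fields, and the reason-first examples described above were regenerated with the LLM and hand-tuned rather than mechanically converted.

```python
import csv
import io
import json


def tsv_example_to_json(tsv_text, field_order=None):
    """Convert a TSV example response (header row first) to a JSON array string.

    field_order optionally reorders the keys of each row, e.g. to put a
    Reason column ahead of a Grade column.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        keys = field_order or list(row.keys())
        rows.append({key: row[key] for key in keys})
    return json.dumps(rows, indent=2)


# Hypothetical column names, for illustration only.
example_tsv = "Key Concept\tGrade\tReason\nLoops\tConvincing Evidence\tUses a for loop\n"
print(tsv_example_to_json(example_tsv, ["Key Concept", "Reason", "Grade"]))
```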

to help keep costs down and options open, I've included the output/report-exact-match.html files as well as the cached_responses directory. this will allow you, for example, to regenerate a pass-fail version of any report using rubric tester's -c option without incurring further LLM costs.

regenerating examples

as part of configuring ai-rubrics-json-reason, I used rubric tester to regenerate the examples (see steps added to the readme in this PR). Since I was regenerating existing examples rather than creating them from scratch, I chose the labels for the actual_labels.csv file in the temp dataset based on the examples/*.tsv files I was replacing, rather than having to determine those values from scratch.

rubric tester commands

The commands run to produce the results in LLM comparison are:

python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json
python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json-gpt-4-turbo
python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json-reason
python ./lib/assessment/rubric_tester.py --experiment_name ai-rubrics-json-reason-gpt-4-turbo

before repeating these commands, please note that each GPT 4 classic run costs about $12 and each GPT 4 Turbo run costs about $5, so the above 4 commands cost about $35 to run (!!). as noted above, report outputs and cached responses have been uploaded to S3 to help avoid the cost of redundant test runs. you can also shrink the cost of your own test runs with the -s and --lesson-names rubric tester params.
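The cost estimate works out as in the small sketch below. The per-run figures are the approximate ones quoted above; the assignment of the two experiments without the "-gpt-4-turbo" suffix to GPT 4 classic is an assumption based on the S3 notes.

```python
# Approximate per-run costs in USD, as quoted in this PR description.
COST_PER_RUN_USD = {"gpt-4-classic": 12, "gpt-4-turbo": 5}

# Assumed model for each experiment (see lead-in).
EXPERIMENT_MODELS = {
    "ai-rubrics-json": "gpt-4-classic",
    "ai-rubrics-json-gpt-4-turbo": "gpt-4-turbo",
    "ai-rubrics-json-reason": "gpt-4-classic",
    "ai-rubrics-json-reason-gpt-4-turbo": "gpt-4-turbo",
}


def estimate_total_cost(experiment_models, cost_per_run):
    """Sum the approximate cost of one run of each experiment."""
    return sum(cost_per_run[model] for model in experiment_models.values())


total = estimate_total_cost(EXPERIMENT_MODELS, COST_PER_RUN_USD)
print(f"approximate total: ${total}")  # 2 * 12 + 2 * 5 = 34, i.e. "about $35"
```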

Testing story

updated existing unit tests
new unit test for reading json examples
new unit test for parsing json response

wilkie · 2024-02-26T17:57:43Z

tests/unit/assessment/test_rubric_tester.py


+    def test_should_open_example_js_and_json_files(self, mocker, code_generator, rubric, examples):
+        examples_set = examples(rubric)
+        print(examples_set)


I think this is a stray print.

thank you! removed here and above

wilkie

Wow this is great stuff! I like how you've added the response type as another option... JSON will be so much better to work with overall.

I think I found one stray print, but other than that it looks great.

davidsbailey · 2024-02-26T18:25:04Z

just wanted to add, thank you @wilkie for all the great test coverage in here! as a python newbie, not having to figure out how to test my code changes from scratch has been a huge help 😁

davidsbailey force-pushed the handle-json-response branch from a004f07 to 50d6754 February 16, 2024 00:27

davidsbailey added 9 commits February 15, 2024 16:33

try parse_json_response before parse_non_json_response

2680502

add response-type param, and use it to select example rubric filename suffix

d46f7a4

use response_type to choose method for parsing AI response

47fc9a7

fix get_params processing

a14416c

look for last ] when parsing json

71355a6

fix unit tests

c753c2e

add json example response

52946a8

fix non-json scenarios

35400aa

add readme notes on creating experiments and generating example LLM responses

0a07f6f

davidsbailey force-pushed the handle-json-response branch from 50d6754 to 0a07f6f February 16, 2024 00:33

davidsbailey added 3 commits February 15, 2024 16:56

add note to readme about using cached responses

3a8779a

add more notes about running rubric tester cheaply

aff5ad5

try harder to fix missing response-type param

38bbbc9

davidsbailey mentioned this pull request Feb 16, 2024

add support for llama2 and claude via bedrock #54

Merged

Base automatically changed from rename-tsv-response-data to main February 16, 2024 17:35

davidsbailey added 6 commits February 16, 2024 12:23

test get_examples can read json

152d0be

fix test class names

bed8bb2

refactor openai_gpt_response fixture

cca202a

add support for json output type in test fixture

b9af39c

extract delimiter logic into gen_tabular_response

3832641

test json response type for get_response_data_if_valid

e7fc4ec

davidsbailey marked this pull request as ready for review February 17, 2024 00:18

davidsbailey requested review from a team and wilkie February 17, 2024 00:18

davidsbailey added 2 commits February 16, 2024 16:25

remove stale comment

d0f25a6

Merge branch 'main' into handle-json-response

008cd82

wilkie reviewed Feb 26, 2024

wilkie approved these changes Feb 26, 2024

remove stray prints from rubric tests

8803886

replace missing EOF newline

23710dd

davidsbailey merged commit c011928 into main Feb 26, 2024

davidsbailey deleted the handle-json-response branch February 26, 2024 18:13

davidsbailey merged 22 commits into main from handle-json-response
