wilkie
reviewed
Feb 26, 2024
    def test_should_open_example_js_and_json_files(self, mocker, code_generator, rubric, examples):
        examples_set = examples(rubric)
        print(examples_set)
Contributor
I think this is a stray print.
Member
Author
thank you! removed here and above
wilkie
approved these changes
Feb 26, 2024
Contributor
wilkie
left a comment
Wow this is great stuff! I like how you've added the response type as another option... JSON will be so much better to work with overall.
I think I found one stray print, but other than that it looks great.
Member
Author
just wanted to add, thank you @wilkie for all the great test coverage in here! as a python newbie, not having to figure out how to test my code changes from scratch has been a huge help 😁
Description
Adds an option to handle a JSON response from the LLM instead of a TSV response. This improves accuracy and reliability for some LLMs, including GPT 4 Turbo.
The code changes in this PR enable running both the 'json' and 'reason' scenarios in the LLM comparison spreadsheet, for GPT 4 classic and turbo models.
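As a rough illustration of the difference between the two response formats, here is a minimal sketch of parsing either one into the same list of rows. This is not the code from this PR: the function names `parse_tsv_rows`, `parse_json_rows`, and `parse_response`, the column names in `COLUMNS`, and the tolerance for extra prose around the JSON array are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical column names; the real rubric columns are not shown in this PR text.
COLUMNS = ["Key Concept", "Observations", "Label", "Reason"]


def parse_tsv_rows(text):
    """Parse a tab-separated response (header row first) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text.strip()), delimiter="\t")
    return [dict(row) for row in reader]


def parse_json_rows(text):
    """Parse a JSON array of row objects, ignoring any prose before or after it."""
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in response")
    rows = json.loads(text[start:end + 1])
    for row in rows:
        missing = [c for c in COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
    return rows


def parse_response(text, response_type="tsv"):
    """Dispatch on the response type: 'tsv' (default) or 'json'."""
    if response_type == "tsv":
        return parse_tsv_rows(text)
    if response_type == "json":
        return parse_json_rows(text)
    raise ValueError(f"unknown response type: {response_type}")
```

One reason JSON can be easier to work with is that field boundaries are explicit, so a stray tab or newline inside a free-text field does not shift the columns the way it can in TSV.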
Code changes
- 'tsv' (default) or 'json'.
- parse_json_response method
S3 updates
alongside this PR, the following data has been uploaded to s3://cdo-ai/teaching_assistant/experiments/:
- ai-rubrics-json: copy of ai-rubrics-pilot-baseline, with:
- ai-rubrics-json-reason: copy of ai-rubrics-json, with:
- ai-rubrics-json-gpt-4-turbo: copy of ai-rubrics-json, with gpt-4-0125-preview model
- ai-rubrics-json-reason-gpt-4-turbo: copy of ai-rubrics-json-reason, with gpt-4-0125-preview model
to help keep costs down and options open, I've included the output/report-exact-match.html files as well as the cached_responses directory. this will allow you, for example, to regenerate a pass-fail version of any report using rubric tester's -c option without incurring further LLM costs.
regenerating examples
as part of configuring ai-rubrics-json-reason, I used rubric tester to regenerate the examples (see steps added to the readme in this PR). Since I was regenerating existing examples rather than creating them from scratch, I chose the labels for the actual_labels.csv file in the temp dataset based on the examples/*.tsv files I was replacing, rather than having to determine those values from scratch.
rubric tester commands
The commands run to produce the results in LLM comparison are:
before repeating these commands, please note that each GPT 4 classic run costs about $12 and each GPT 4 Turbo run costs about $5, so the above 4 commands cost about $35 to run (!!). see previous note about how report outputs and cached responses have been uploaded to S3 to hopefully avoid costs of redundant test runs, or shrink the cost of your test runs with -s and --lesson-names rubric tester params.
Testing story