evaluate_external_ratings.Rd
This function compares external typicality ratings (e.g., generated by a new LLM) against the validation dataset included in baserater. The validation set contains average typicality ratings collected from 50 Prolific participants on a subset of 100 group–adjective pairs, as described in the accompanying paper.
The input ratings are merged with this reference set, and the function then (see the sketch below):
- Computes a correlation (stats::cor.test()) between the external ratings and the human average;
- Compares it to one or more built-in model baselines (default: GPT-4 and LLaMA 3.3);
- Prints a summary of all correlation coefficients and flags whether the external model outperforms each baseline;
- Returns a tidy result invisibly.
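Conceptually, the comparison resembles the following sketch. The names validation and mean_human_rating stand in for baserater's internal validation object and its human-average column; they are illustrative assumptions, not the package's actual internals.

# Illustrative sketch only: 'validation' and 'mean_human_rating' are
# assumed names, not baserater's actual internals.
compare_to_humans <- function(df, validation,
                              method = "pearson",
                              baselines = c("mean_gpt4_rating", "mean_llama3_rating")) {
  # Keep only the validation items present in the external ratings
  merged <- merge(validation, df, by = c("group", "adjective"))

  # Correlate the external ratings and each baseline with the human average
  cols <- c(external = "rating", setNames(baselines, baselines))
  rows <- lapply(names(cols), function(m) {
    ct <- stats::cor.test(merged[[cols[[m]]]], merged$mean_human_rating,
                          method = method)
    data.frame(model = m, r = unname(ct$estimate), p = ct$p.value)
  })
  do.call(rbind, rows)
}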
evaluate_external_ratings(
  df,
  method = "pearson",
  baselines = c("mean_gpt4_rating", "mean_llama3_rating")
)
df: A data frame with columns adjective, group, and rating. Must contain typicality scores for all 100 validation items used in the original study.
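For illustration only, an input with the expected shape might look like this (the group–adjective pairs and the rating scale are invented):

# Hypothetical input format; values are made up
external_ratings <- data.frame(
  group     = c("librarians", "firefighters"),
  adjective = c("quiet", "brave"),
  rating    = c(78, 85)
)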
method: The correlation method to use in stats::cor.test(). Must be one of "pearson" (default), "spearman", or "kendall".
baselines: Character vector of column names in the validation set to compare against (default: c("mean_gpt4_rating", "mean_llama3_rating")).
Value: A tibble (invisibly) with one row per model (the external model and each baseline), with columns model, r, and p giving the correlation coefficient and p-value.
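A usage sketch, assuming my_model_ratings is a hypothetical data frame of the form described above that covers all 100 validation items:

# Compare an external model's ratings against the human averages
res <- evaluate_external_ratings(
  my_model_ratings,
  method = "spearman"
)
res  # one row per model, with columns model, r, and p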