evaluate_external_ratings.Rd

This function compares external typicality ratings (e.g., generated by a new LLM) against the validation dataset included in 'baserater'. The validation set contains average typicality ratings collected from 50 Prolific participants on a subset of 100 group–adjective pairs, as described in the accompanying paper.
The input ratings are merged with this reference set, and then the function (see the sketch below):

- Computes a correlation (stats::cor.test()) between the external ratings and the human average;
- Compares it to one or more built-in model baselines (default: GPT-4 and LLaMA 3.3);
- Prints a clear summary of all correlation coefficients and flags whether the external model outperforms each baseline;
- Returns a tidy result invisibly.
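Conceptually, the comparison reduces to a join with the validation set followed by stats::cor.test() calls. The sketch below is illustrative only: the object name `validation` and its human-average column `mean_human_rating` are assumptions, not the package's actual internals.

# Minimal sketch, assuming a validation table `validation` with columns
# adjective, group, mean_human_rating, and the baseline columns
# (object and human-average column names are hypothetical):
library(dplyr)

merged <- inner_join(df, validation, by = c("adjective", "group"))

# Correlation of the external ratings with the human average
ext_test <- cor.test(merged$rating, merged$mean_human_rating, method = "pearson")

# Same test for one built-in baseline, e.g. GPT-4
gpt4_test <- cor.test(merged$mean_gpt4_rating, merged$mean_human_rating,
                      method = "pearson")

# The external model "outperforms" the baseline if its coefficient is higher
unname(ext_test$estimate) > unname(gpt4_test$estimate)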
Usage

evaluate_external_ratings(
  df,
  method = "pearson",
  baselines = c("mean_gpt4_rating", "mean_llama3_rating"),
  verbose = TRUE
)

Arguments

df
A data frame with columns adjective, group, and rating. Must contain
typicality scores for all 100 validation items used in the original study.

method
The correlation method to use in stats::cor.test(). Must be one of:
"pearson" (default), "spearman", or "kendall".

baselines
Character vector of column names in the validation set to compare against
(default: c("mean_gpt4_rating", "mean_llama3_rating")).

verbose
Logical. If TRUE (default), prints a summary of the correlations
and baseline comparisons. Set to FALSE to suppress console output.
Value

A tibble (invisibly) with one row per model (external and each baseline),
and columns model, r, and p for the correlation coefficient and p-value.
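A hedged usage sketch follows; the input data frame is illustrative and truncated, since a real call needs ratings for all 100 validation items.

# Illustrative input: one external rating per validation pair
# (only two of the 100 required rows shown; values are made up)
my_ratings <- data.frame(
  adjective = c("ambitious", "reserved"),
  group     = c("lawyers", "librarians"),
  rating    = c(78, 85)
)

res <- evaluate_external_ratings(
  my_ratings,
  method    = "spearman",
  baselines = "mean_gpt4_rating",
  verbose   = FALSE
)

# One row per model (external and each baseline), with columns model, r, and p
res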