This function compares external typicality ratings (e.g., generated by a new LLM) against the validation dataset included in baserater. The validation set contains average typicality ratings collected from 50 Prolific participants on a subset of 100 group–adjective pairs, as described in the accompanying paper.

The input ratings are merged with this reference set, and the function then:

  1. Computes a correlation (cor.test) between the external ratings and the human average;

  2. Compares this correlation with the correlations obtained from one or more built-in model baselines (default: GPT-4 and LLaMA 3.3);

  3. Prints a clear summary of all correlation coefficients and flags whether the external model outperforms each baseline;

  4. Returns a tidy result invisibly. A sketch of how these steps fit together follows the usage signature below.

evaluate_external_ratings(
  df,
  method = "pearson",
  baselines = c("mean_gpt4_rating", "mean_llama3_rating")
)
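
A minimal sketch of how these steps might fit together internally, not the packaged implementation. The reference data frame name (validation) and its human-average column (mean_human_rating) are assumptions for illustration; the actual objects in baserater may be named differently.

evaluate_sketch <- function(df, validation,
                            method = "pearson",
                            baselines = c("mean_gpt4_rating", "mean_llama3_rating")) {
  # Merge the external ratings with the reference set
  merged <- merge(df, validation, by = c("group", "adjective"))

  # Steps 1-2: correlate the external ratings and each baseline with the human average
  cols <- c(external = "rating", stats::setNames(baselines, baselines))
  res <- do.call(rbind, lapply(names(cols), function(m) {
    ct <- stats::cor.test(merged[[cols[[m]]]], merged$mean_human_rating, method = method)
    data.frame(model = m, r = unname(ct$estimate), p = ct$p.value)
  }))

  # Step 3: print the summary and flag whether the external model outperforms each baseline
  print(res)
  ext_r <- res$r[res$model == "external"]
  for (b in baselines) {
    cat(sprintf("Outperforms %s: %s\n", b, ext_r > res$r[res$model == b]))
  }

  # Step 4: return the tidy result invisibly
  invisible(res)
}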

Arguments

df

A data frame with columns adjective, group, and rating. Must contain typicality scores for all 100 validation items used in the original study.

method

The correlation method to use in stats::cor.test(). Must be one of: "pearson" (default), "spearman", or "kendall".

baselines

Character vector of column names in the validation set to compare against (default: c("mean_gpt4_rating", "mean_llama3_rating")).
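
For example, assuming df holds your model's scores for the validation items, a rank-based comparison against only the GPT-4 baseline would be:

evaluate_external_ratings(df, method = "spearman", baselines = "mean_gpt4_rating")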

Value

A tibble, returned invisibly, with one row per model (the external model plus each baseline) and columns model, r, and p giving the correlation coefficient and its p-value.

Examples

if (FALSE) { # \dontrun{
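# `ratings` is assumed here to refer to the validation data frame bundled with
# baserater; substitute the actual object holding the 100 validation items.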
new_scores <- tibble::tibble(
  group = ratings$group,
  adjective = ratings$adjective,
  rating = runif(100)  # Replace with model predictions
)
evaluate_external_ratings(new_scores)
} # }
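
Because the result is returned invisibly, assign it if you want to inspect the correlation table directly (a usage sketch; the column names follow the Value section above):

res <- evaluate_external_ratings(new_scores)
res    # tibble with columns model, r, and p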