This function compares external typicality ratings (e.g., generated by a new LLM) against the validation dataset included in baserater. The validation set contains average typicality ratings collected from 50 Prolific participants on a subset of 100 group–adjective pairs, as described in the accompanying paper.

The input ratings are merged with this reference set, and the function then:

  1. Computes a correlation (cor.test) between the external ratings and the human average;

  2. Compares this correlation with the correlations obtained from one or more built-in model baselines (default: GPT-4 and LLaMA 3.3);

  3. Prints a clear summary of all correlation coefficients and flags whether the external model outperforms each baseline;

  4. Returns a tidy result invisibly. A sketch of how these steps fit together follows the usage signature below.

evaluate_external_ratings(
  df,
  method = "pearson",
  baselines = c("mean_gpt4_rating", "mean_llama3_rating")
)
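
A minimal sketch of how these steps might fit together internally, not the packaged implementation. The reference data frame name (validation) and its human-average column (mean_human_rating) are assumptions for illustration; the actual objects in baserater may be named differently.

evaluate_sketch <- function(df, validation,
                            method = "pearson",
                            baselines = c("mean_gpt4_rating", "mean_llama3_rating")) {
  # Merge the external ratings with the reference set
  merged <- merge(df, validation, by = c("group", "adjective"))

  # Steps 1-2: correlate the external ratings and each baseline with the human average
  cols <- c(external = "rating", stats::setNames(baselines, baselines))
  res <- do.call(rbind, lapply(names(cols), function(m) {
    ct <- stats::cor.test(merged[[cols[[m]]]], merged$mean_human_rating, method = method)
    data.frame(model = m, r = unname(ct$estimate), p = ct$p.value)
  }))

  # Step 3: print the summary and flag whether the external model outperforms each baseline
  print(res)
  ext_r <- res$r[res$model == "external"]
  for (b in baselines) {
    cat(sprintf("Outperforms %s: %s\n", b, ext_r > res$r[res$model == b]))
  }

  # Step 4: return the tidy result invisibly
  invisible(res)
}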

Arguments

df

A data frame with columns adjective, group, and rating. Must contain typicality scores for all 100 validation items used in the original study.

method

The correlation method to use in stats::cor.test(). Must be one of: "pearson" (default), "spearman", or "kendall".

baselines

Character vector of column names in the validation set to compare against (default: c("mean_gpt4_rating", "mean_llama3_rating")).
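
For example, assuming df holds your model's scores for the validation items, a rank-based comparison against only the GPT-4 baseline would be:

evaluate_external_ratings(df, method = "spearman", baselines = "mean_gpt4_rating")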

Value

A tibble, returned invisibly, with one row per model (the external model plus each baseline) and columns model, r, and p giving the correlation coefficient and its p-value.

Examples

if (FALSE) { # \dontrun{
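# `ratings` is assumed here to refer to the validation data frame bundled with
# baserater; substitute the actual object holding the 100 validation items.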
new_scores <- tibble::tibble(
  group = ratings$group,
  adjective = ratings$adjective,
  rating = runif(100)  # Replace with model predictions
)
evaluate_external_ratings(new_scores)
} # }
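
Because the result is returned invisibly, assign it if you want to inspect the correlation table directly (a usage sketch; the column names follow the Value section above):

res <- evaluate_external_ratings(new_scores)
res    # tibble with columns model, r, and p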