202511060909 Status: idea Tags: Datascience, LLM

Cosine Similarity

Cosine simularity is a method to compare how simular 2 non-zero vectors are.

<?php
function cosineSimilarity($vector1, $vector2)
{
    $dotProduct = 0.0;
    $magnitude1 = 0.0;
    $magnitude2 = 0.0;
 
    foreach ($vector1 as $key => $value) {
        $dotProduct += $value * $vector2[$key];
        $magnitude1 += pow($value, 2);
        $magnitude2 += pow($vector2[$key], 2);
    }
 
    $magnitude1 = sqrt($magnitude1);
    $magnitude2 = sqrt($magnitude2);
 
    $epsilon = 1e-8; // A small epsilon value to avoid division by very small numbers
 
    if (abs($magnitude1 * $magnitude2) < $epsilon) {
        return 0;
    } else {
        return $dotProduct / ($magnitude1 * $magnitude2);
    }
}

This can be used to compare how close-by 2 strings of text are with a LLM.


References

I first used this when trying to compare chatbot asked questions to FAQ questions, I didn’t really know what it did, jsut that it worked.

But now after this datascience lesson I realized that it is just comparing dataset columns (that had used dimensionality reduction)