
Posted

I can use the OpenAI API to get arrays containing vector embeddings for a word/phrase using this: https://platform.openai.com/docs/guides/embeddings

But what's the process for comparing two such vector arrays using something like this: https://en.wikipedia.org/wiki/Cosine_similarity

In python, there's a library for this: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

Anything similar in AutoIt? Thanks!
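(For reference, the math behind the linked sklearn function is just dot(A, B) / (|A| * |B|), per the Wikipedia definition above. Here's a minimal pure-Python sketch of that formula — no sklearn needed; the function name is my own, not sklearn's:)

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # ~0.9746
```

Any language with loops and a square root can do the same, which is why it ports straightforwardly to AutoIt.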

Posted

Did I get this right? Just working off of the Wikipedia definition.

; Cosine similarity = dot(A, B) / (|A| * |B|)
; (no #includes needed: UBound, Sqrt and MsgBox are all built in)

Local $embedding1[3] = [1.0, 2.0, 3.0]
Local $embedding2[3] = [4.0, 5.0, 6.0]

; Dot product
Local $dotProduct = 0.0
For $i = 0 To UBound($embedding1) - 1
    $dotProduct += $embedding1[$i] * $embedding2[$i]
Next

; Euclidean norm (magnitude) of each vector
Local $magnitude1 = 0.0
For $i = 0 To UBound($embedding1) - 1
    $magnitude1 += $embedding1[$i] ^ 2
Next
$magnitude1 = Sqrt($magnitude1)

Local $magnitude2 = 0.0
For $i = 0 To UBound($embedding2) - 1
    $magnitude2 += $embedding2[$i] ^ 2
Next
$magnitude2 = Sqrt($magnitude2)

Local $cosineSimilarity = $dotProduct / ($magnitude1 * $magnitude2)

MsgBox(0, "", "Cosine similarity: " & $cosineSimilarity)

 

  • Solution
Posted

looks okay, but you should really look into E4A's DotProduct (section: Multiplication) and GetNorm (section: Reduction) functions.

Posted
23 minutes ago, RTFC said:

looks okay, but you should really look into E4A's DotProduct (section: Multiplication) and GetNorm (section: Reduction) functions.

I remember you recommending this library some time back, and I downloaded it but it looked so daunting (I don't have a CS background) I backed off immediately :)
Okay I'll give it another go :)

Posted (edited)

How is this daunting?:D

#include "C:\AutoIt\Eigen\Eigen4AutoIt.au3" ; NB adjust path to wherever you put it

Local $embedding1[3] = [1.0, 2.0, 3.0]
Local $embedding2[3] = [4.0, 5.0, 6.0]

_Eigen_StartUp()

Local $vec1 = _Eigen_CreateMatrix_FromArray($embedding1)
Local $vec2 = _Eigen_CreateMatrix_FromArray($embedding2)

MsgBox(0, "", "Cosine similarity: " & _
    _Eigen_DotProduct($vec1, $vec2) / (_Eigen_GetNorm($vec1) * _Eigen_GetNorm($vec2)))

_Eigen_CleanUp()

(I don't have a CS background either.)

Edited by RTFC
typo
  • 1 month later...
Posted

Update: as of version 5.4 (released: 29 May 2023), E4A supports direct retrieval of the angle between two vectors with the function _Eigen_GetVectorAngle ( $vecA, $vecB, $returnRadians = False ). A zero-degree angle signifies parallel vectors (aligned and pointing in the exact same direction), a 90-degree angle signifies perpendicular ones, and a 180-degree angle signifies anti-parallel vectors (aligned, but pointing in opposite directions).
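(The angle and the cosine similarity are interchangeable via cos/arccos. A quick Python sketch of the conversion, using the cosine-similarity value that the example vectors [1,2,3] and [4,5,6] produce:)

```python
import math

# Cosine similarity of the example vectors [1, 2, 3] and [4, 5, 6]
cos_sim = 0.9746318461970762

# The angle between them in degrees (what a degrees-mode vector-angle function would report)
angle_deg = math.degrees(math.acos(cos_sim))
print(angle_deg)  # ~12.93 degrees

# And back again: cosine similarity is just the cosine of that angle
print(math.cos(math.radians(angle_deg)))  # ~0.9746
```

So if a downstream formula expects the similarity score rather than the angle, one extra cosine gets you back.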

#include "C:\AutoIt\Eigen\Eigen4AutoIt.au3" ; NB adjust path to wherever you put it

Local $embedding1[3] = [1.0, 2.0, 3.0]
Local $embedding2[3] = [4.0, 5.0, 6.0]

_Eigen_StartUp()

Local $vec1 = _Eigen_CreateMatrix_FromArray($embedding1)
Local $vec2 = _Eigen_CreateMatrix_FromArray($embedding2)

; Note: this returns the angle in degrees (by default), not the cosine similarity itself;
; the cosine similarity is the cosine of this angle.
MsgBox(0, "", "Vector angle (degrees): " & _Eigen_GetVectorAngle($vec1, $vec2))

_Eigen_CleanUp()

 

Posted

Never jumped on the Python bandwagon myself either. From what I've read in various Stack Overflow threads, you should be able to get significantly better performance by replacing NumPy with raw Eigen/C++, even without GPU/CUDA/MPI refactoring.

If you're serious about setting up ML in this way, I can probably help you. Because many of Eigen's speed optimisations are obtained at compile-time (e.g. lazy evaluation, smart loop unrolling, and matrix operation-specific stuff), if you were to present a snippet of E4A code (say, a UDF that applies a number of E4A functions to some input matrices), I could duplicate/optimise/rewrite that and present you with a single pre-compiled E4A DllCall. I first suggested this when I started the E4A thread many years ago, but so far nobody has taken me up on it. Up to you, of course. If you're worried about your intellectual property, you can PM me instead. In any case, hope it helps.

Posted
1 hour ago, RTFC said:

so far nobody has taken me up on this

Would love to :) but nothing in my workflow (so far) has warranted anything extremely complex. At most, I'm using SBERT embeddings + a Milvus vector DB and doing some vector comparisons, indexing a corpus, and some n-gram extractions with YAKE.
