Jason Thorsness

7 Jun 2024

OpenAI Serving Up Nonsense

The magic of vector search comes from the model — how closely do distance metrics between embeddings align with the searcher’s own evaluation? An unsophisticated model won’t produce the “James Bond Car returns Aston Martin” sort of matches that prove vector search goes beyond traditional full-text techniques.

OpenAI offers two great embedding models in their current generation: text-embedding-3-small and text-embedding-3-large. These often produce amazing results, including the example above. However, they have some weird behaviors, arguably bugs, with certain words like waitress and guns.

Pasteboards. Lots of pasteboards. — Neo

My takeaway from the experiment that revealed these anomalies is that embeddings are good enough to use now, but the models are still immature. Customers should plan for the expense and time of regenerating embeddings as newer models are released, just as they would plan to upgrade a valuable but unstable software application.

An Experiment

I wanted to better understand the differences between text-embedding-3-small and text-embedding-3-large, especially because both models can be used to produce embeddings of the same length (see Native support for shortening embeddings on this page). I took the words from subtlex-us and generated embeddings for each individual word using both models.
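As a rough illustration of how vectors from the two models can end up the same length, here is a minimal sketch that shortens an embedding by truncating it and rescaling it back to unit length. It assumes the shortening is done client-side; the `shorten` helper and the random toy vectors are hypothetical stand-ins, and asking the API for a smaller size directly is the other route.

```python
import numpy as np

def shorten(embedding, dims):
    """Keep the first `dims` values and rescale to unit length (hypothetical helper)."""
    cut = np.asarray(embedding, dtype=np.float64)[:dims]
    norm = np.linalg.norm(cut)
    # Guard against an all-zero prefix, which cannot be normalized.
    return cut if norm == 0 else cut / norm

# Toy stand-ins for a longer and a shorter model output (not real embeddings).
rng = np.random.default_rng(0)
large_like = rng.normal(size=3072)
small_like = rng.normal(size=1536)

a = shorten(large_like, 256)
b = shorten(small_like, 256)
print(a.shape, b.shape)
print(np.linalg.norm(a), np.linalg.norm(b))
```

Rescaling after truncation matters because a dot product only behaves like a similarity score when both vectors have the same length.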

Do the models agree?

For each word I ran an experiment:

  1. Find the closest 500 matches (using Faiss IndexFlatL2, which performs an exact, brute-force L2 search) from text-embedding-3-large.
  2. Find the closest 3 matches from text-embedding-3-small.
  3. If any of the 3 matches from text-embedding-3-small are not present in the 500 matches from text-embedding-3-large, flag that word as “suspicious.”
  4. Review the suspicious words manually.
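The steps above can be sketched roughly as follows. This is not the Faiss code from the experiment: it swaps in a plain NumPy brute-force L2 search so the example stays self-contained, and the function names, the four toy words, and their 2-D vectors are all hypothetical.

```python
import numpy as np

def nearest(vectors, query_idx, k):
    """Exact nearest neighbours by squared L2 distance (brute force)."""
    diffs = vectors - vectors[query_idx]
    dists = np.einsum("ij,ij->i", diffs, diffs)
    return np.argsort(dists, kind="stable")[:k]

def suspicious_words(words, small_vecs, large_vecs, k_small=3, k_large=500):
    """Flag words whose top small-model matches are missing from the large-model matches."""
    flagged = []
    for i, word in enumerate(words):
        large_hits = set(nearest(large_vecs, i, k_large).tolist())
        small_hits = nearest(small_vecs, i, k_small).tolist()
        if any(j not in large_hits for j in small_hits):
            flagged.append(word)
    return flagged

# Toy data: hand-made 2-D vectors, not real embeddings.
words = ["cat", "dog", "car", "bus"]
small_vecs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
large_vecs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.3, 0.0]])

print(suspicious_words(words, small_vecs, large_vecs, k_small=2, k_large=2))
```

As in the tables that follow, each word counts as its own closest match, so the neighbour counts include the query word itself.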

Nothing Too Suspicious At First

The flagged words were often due to inconsistencies around words with different meanings but similar spelling. For example, text-embedding-3-small’s results make embers the best match for member, but embers is not within the first 500 results for text-embedding-3-large.

text-embedding-3-small    text-embedding-3-large
member                    member
embers                    members
members                   membership
membership                participant
participant               membered

Sometimes text-embedding-3-large made connections that text-embedding-3-small did not, like just with justice. Its results also include jest, though, which is more of a similar-spelling match:

text-embedding-3-small    text-embedding-3-large
just                      just
only                      justest
somewhat                  justly
lightly                   justice
remaining                 justo
although                  justify
some                      justness
liked                     jest
really                    equal

For well, text-embedding-3-small matches farewell and text-embedding-3-large does not. The large model also speaks good French, as shown by the word bien:

text-embedding-3-small    text-embedding-3-large
well                      well
farewell                  nicely
okay                      weli
ok                        weel
somewhat                  fine
ably                      better
really                    bien
ell                       good

Highly Suspicious

The embeddings for some words are inexplicable, at least for me. For text-embedding-3-large, one such word is waitress. For text-embedding-3-small, guns has the same problem.

Here’s what each model serves up for waitress. I don’t think any of the text-embedding-3-large results makes sense.

text-embedding-3-small    text-embedding-3-large
waitress                  waitress
waiter                    rame
waitresses                strate
waiters                   ploy
waitperson                ers
headwaiter                arsed
restauranteur             chooser
headwaiters               dater
barmaid                   usher
waitstaff                 tings
bartender                 creen
stewardess                tempts
hostess                   helpers
receptionist              onus
cashier                   response
bartended                 entrant
hostesses                 register
attendant                 resents
manageress                process

The nonsensical ordering comes from the distance metric — the dot product of text-embedding-3-large’s embedding for waitress is only 0.48 against waiter but 0.58 against rats.
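The search ranks by L2 distance while the numbers quoted here are dot products. For unit-length vectors the two give the same ordering, because the squared L2 distance equals 2 minus twice the dot product. Below is a small check of that relationship, assuming unit-length embeddings; the random vectors are toy data, not real embeddings.

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy unit-length vectors standing in for embeddings (hypothetical data).
vecs = rng.normal(size=(50, 16))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
query = vecs[0]

dots = vecs @ query
sq_l2 = np.sum((vecs - query) ** 2, axis=1)

# For unit vectors: squared L2 distance = 2 - 2 * dot product.
print(np.allclose(sq_l2, 2 - 2 * dots))
# Ascending distance and descending dot product give the same ranking.
print(np.array_equal(np.argsort(sq_l2), np.argsort(-dots)))
```

So a lower dot product against waiter than against rats means waiter also sits farther away in L2 terms, which is why it falls down the list.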

For text-embedding-3-small, guns strongly matches some compound words like pasteboards, matchboxes, and dreamboats, but not weapons.

text-embedding-3-small    text-embedding-3-large
guns                      guns
gun                       gun
ants                      weapons
pasteboards               firearms
inks                      weapon
matchboxes                pistols
ammo                      handguns
dreamboats                rifles
lasses                    popguns
tees                      gunners
erasures                  shotguns
hammer                    arms
vehicles                  fusils
bags                      pistol

I haven’t drilled into why these embeddings are so different from human expectations, but given how casually I found these words, I suspect there are more with this problem.

Wait - Is This Right?

These results were so surprising to me that I verified them by generating the embeddings multiple times in case something had corrupted them on the first attempt. My worry over making a mistake initially seemed right: when I compared newly generated embeddings the values were not identical to the ones I had saved earlier. At first I wondered if I had screwed up some serialization and this entire finding was a mistake. But then as I repeated the test many times, I found that the embedding values for the same text are actually not stable.

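import numpy as np
import openai  # reads the API key from the OPENAI_API_KEY environment variable

def get_embedding(text, model):
    # Request a single embedding for `text` from the given model.
    response = openai.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Fetch the same embedding ten times, then compare each result against the first.
embedding_guns_small_new = [get_embedding("guns", model="text-embedding-3-small") for _ in range(10)]

for i in range(10):
    print(np.dot(embedding_guns_small_new[0], embedding_guns_small_new[i]))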

Sometimes the output is the same, but sometimes it is different. So embeddings can’t be compared byte-for-byte. And by the thread above and my own checks in Python and NodeJS, this discrepancy comes from the service, not the client handling of floating-point numbers.

0.9999999277543438
0.9999999277543438
0.9999999277543438
0.9999989712268592
0.9999999277543438
0.9999989653030437
0.9999989653030437
0.9999989653030437
0.9999999277543438
0.9999989653030437
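Since the values drift slightly between calls, a comparison needs a tolerance instead of exact equality. Here is a minimal sketch, assuming a cosine-similarity threshold is an acceptable test; the `same_embedding` helper, its 0.9999 threshold, and the toy vectors are hypothetical choices, not values from the experiment.

```python
import numpy as np

def same_embedding(a, b, min_cosine=0.9999):
    """Treat two embeddings as equal when their cosine similarity clears a threshold."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return bool(cosine >= min_cosine)

# Toy vectors (hypothetical): a base vector, a slightly perturbed copy, and an unrelated one.
rng = np.random.default_rng(1)
base = rng.normal(size=1536)
jittered = base + rng.normal(scale=1e-4, size=1536)
other = rng.normal(size=1536)

print(same_embedding(base, jittered))
print(same_embedding(base, other))
```

The right threshold depends on how much drift the service actually shows, so it is worth measuring before picking one.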

To verify the match results, I retrieved fresh embeddings for guns, muskets, and pasteboards with text-embedding-3-small, and waitress, waiter, and rats with text-embedding-3-large. The behavior is consistent:

def similarity(a, b, model):
    # Dot product of freshly fetched embeddings for two words under one model.
    return np.dot(get_embedding(a, model=model), get_embedding(b, model=model))

small = "text-embedding-3-small"
large = "text-embedding-3-large"

# Single quotes inside the f-strings keep this valid on Python versions before 3.12.
print(f"guns vs. pasteboards small: {similarity('guns', 'pasteboards', small)}")
print(f"guns vs. muskets small:     {similarity('guns', 'muskets', small)}")
print()
print(f"guns vs. pasteboards large: {similarity('guns', 'pasteboards', large)}")
print(f"guns vs. muskets large:     {similarity('guns', 'muskets', large)}")
print()
print(f"waitress vs. waiter large:  {similarity('waitress', 'waiter', large)}")
print(f"waitress vs. rats large:    {similarity('waitress', 'rats', large)}")
print()
print(f"waitress vs. waiter small:  {similarity('waitress', 'waiter', small)}")
print(f"waitress vs. rats small:    {similarity('waitress', 'rats', small)}")

This is just a straight dot product, so the higher the result, the closer the match. The text-embedding-3-small model’s weirdness with guns and text-embedding-3-large’s weirdness with waitress are both evident.

guns vs. pasteboards small: 0.6474366503911342
guns vs. muskets small:     0.47351160151253646

guns vs. pasteboards large: 0.23913953704300578
guns vs. muskets large:     0.5322755486400199

waitress vs. waiter large:  0.4853430603463844
waitress vs. rats large:    0.5838521794318046

waitress vs. waiter small:  0.8872824468235675
waitress vs. rats small:    0.3148280237056705

Conclusion

Embeddings are already powerful, but it’s clear they have room for improvement! I am excited to see how OpenAI’s models and others evolve over time and will make sure that for my own applications, not only can embeddings be generated at initial load, but also refreshed over time. Thanks for reading to the very end! For more content like this, please follow me on X.
