OpenAI Serving Up Nonsense

Vector Search Is New and Weird

The magic of vector search comes from the model — how closely do distance metrics between embeddings align with the searcher’s own evaluation? An unsophisticated model won’t produce the “James Bond Car returns Aston Martin” sort of matches that prove vector search goes beyond traditional full-text techniques.

OpenAI offers two great embedding models in their current generation: text-embedding-3-small and text-embedding-3-large. These often produce amazing results, including the example above. However, they have some weird behaviors, arguably bugs, with certain words like waitress and guns.

Pasteboards. Lots of pasteboards. — Neo

My takeaway from the experiment that revealed these anomalies is that embeddings are good enough to use now, but the models are still immature. Customers should plan for the expense and time of regenerating embeddings as newer models are released, just as they would plan to upgrade a valuable but unstable software application.

An Experiment

I wanted to better understand the differences between text-embedding-3-small and text-embedding-3-large, especially because both models can be used to produce embeddings of the same length (see Native support for shortening embeddings on this page). I took the words from subtlex-us and generated embeddings for each individual word using both models.
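
For context on how both models can yield vectors of the same length: the embeddings API accepts a dimensions parameter for the text-embedding-3 models, and OpenAI's documentation describes truncating a full-length vector and re-normalizing it as the manual alternative. The sketch below shows that manual route with NumPy. The shorten_embedding name and the toy vector are illustrative only, and nothing here is meant to imply which route was used in the experiment.

```python
import numpy as np

def shorten_embedding(vector, dim: int) -> np.ndarray:
    """Keep the first `dim` values and rescale the result to unit length."""
    cut = np.asarray(vector, dtype=float)[:dim]
    norm = np.linalg.norm(cut)
    # Avoid dividing by zero for an all-zero prefix (hypothetical edge case).
    return cut if norm == 0 else cut / norm

# Toy 3-dimensional "embedding" shortened to 2 dimensions.
print(shorten_embedding([3.0, 4.0, 12.0], 2))
```

Re-normalizing matters because dot products between shortened vectors are only comparable as similarities when every vector has length 1.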

Do the Models Agree?

For each word I ran an experiment:

1. Find the closest 500 matches from text-embedding-3-large, using Faiss IndexFlatL2 (an exact, brute-force L2 search).
2. Find the closest 3 matches from text-embedding-3-small.
3. If any of the 3 matches from text-embedding-3-small is not present in the 500 matches from text-embedding-3-large, flag that word as “suspicious.”
4. Review the suspicious words manually.
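
The steps above can be sketched roughly as follows. This is an illustration rather than the code used for the experiment: it swaps in a brute-force NumPy search for Faiss IndexFlatL2 (both are exact), it assumes a word is excluded from its own match list, and the names nearest and find_suspicious are hypothetical.

```python
import numpy as np

def nearest(embeddings: np.ndarray, query_idx: int, k: int) -> np.ndarray:
    """Indices of the k rows closest (exact L2) to row query_idx, excluding itself."""
    dists = np.linalg.norm(embeddings - embeddings[query_idx], axis=1)
    dists[query_idx] = np.inf  # assumption: a word never counts as its own match
    return np.argsort(dists)[:k]

def find_suspicious(small: np.ndarray, large: np.ndarray,
                    k_small: int = 3, k_large: int = 500) -> list[int]:
    """Flag rows whose top small-model matches fall outside the large model's top matches."""
    flagged = []
    for i in range(len(small)):
        allowed = set(nearest(large, i, k_large).tolist())
        top_small = nearest(small, i, k_small).tolist()
        if any(j not in allowed for j in top_small):
            flagged.append(i)
    return flagged
```

Here small and large are arrays with one row per word, in the same word order, so index i refers to the same word in both. For a vocabulary of this size a real index such as Faiss would be much faster than this loop.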

Nothing Too Suspicious At First

Words were often flagged because the models disagree about words with similar spellings but different meanings. For example, text-embedding-3-small ranks embers as the best match for member, but embers is not within the first 500 results for text-embedding-3-large.

text-embedding-3-small	text-embedding-3-large
member	member
embers	members
members	membership
membership	participant
participant	membered
…	…

Sometimes text-embedding-3-large made connections that text-embedding-3-small did not, such as just with justice (although it also includes jest, which is more of a spelling match):

text-embedding-3-small	text-embedding-3-large
just	just
only	justest
somewhat	justly
lightly	justice
remaining	justo
although	justify
some	justness
liked	jest
really	equal
…	…

For well, text-embedding-3-small matches farewell and text-embedding-3-large doesn’t. The large model also speaks good French (“le grand modèle parle bien français”), as shown by the word bien:

text-embedding-3-small	text-embedding-3-large
well	well
farewell	nicely
okay	weli
ok	weel
somewhat	fine
ably	better
really	bien
ell	good
…	…

Highly Suspicious

The embeddings for some words are inexplicable, at least for me. For text-embedding-3-large, one such word is waitress. For text-embedding-3-small, guns has the same problem.

Here’s what each model serves up for waitress. I don’t think any of the text-embedding-3-large results make sense.

text-embedding-3-small	text-embedding-3-large
waitress	waitress
waiter	rame
waitresses	strate
waiters	ploy
waitperson	ers
headwaiter	arsed
restauranteur	chooser
headwaiters	dater
barmaid	usher
waitstaff	tings
bartender	creen
stewardess	tempts
hostess	helpers
receptionist	onus
cashier	response
bartended	entrant
hostesses	register
attendant	resents
manageress	process
…	…

The nonsensical ordering reflects the embedding distances themselves: the dot product of text-embedding-3-large’s embedding for waitress is only 0.48 against waiter but 0.58 against rats.
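
Because the search used L2 distance while the numbers quoted here are dot products, it helps to see how the two relate. For unit-length vectors, squared L2 distance equals 2 minus twice the dot product, so a higher dot product always means a smaller distance. The snippet below is a small check of that identity. It assumes the embeddings are normalized to length 1, which OpenAI documents and which the self-similarity of roughly 1.0 measured later in this post is consistent with. The helper name is made up for illustration.

```python
import numpy as np

def squared_l2_from_dot(dot: float) -> float:
    """Squared L2 distance between two unit vectors with the given dot product."""
    return 2.0 - 2.0 * dot

# Check the identity on two random unit vectors.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.sum((a - b) ** 2), squared_l2_from_dot(np.dot(a, b)))

# Dot products reported for text-embedding-3-large:
print(squared_l2_from_dot(0.48))  # waitress vs. waiter
print(squared_l2_from_dot(0.58))  # waitress vs. rats
```

The two printed values are roughly 1.04 and 0.84, so under L2 search rats really does sit closer to waitress than waiter does.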

For text-embedding-3-small, guns strongly matches some compound words like pasteboards, matchboxes, and dreamboats, but not weapons.

text-embedding-3-small	text-embedding-3-large
guns	guns
gun	gun
ants	weapons
pasteboards	firearms
inks	weapon
matchboxes	pistols
ammo	handguns
dreamboats	rifles
lasses	popguns
tees	gunners
erasures	shotguns
hammer	arms
vehicles	fusils
bags	pistol
…	…

I haven’t drilled into why these embeddings are so different from human expectations, but given how casually I found these words, I suspect there are more with this problem.

Wait — Is This Right?

These results were so surprising to me that I verified them by generating the embeddings multiple times in case something had corrupted them on the first attempt. My worry over making a mistake initially seemed right: when I compared newly generated embeddings the values were not identical to the ones I had saved earlier. At first I wondered if I had screwed up some serialization and this entire finding was a mistake. But then as I repeated the test many times, I found that the embedding values for the same text are actually not stable.

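import numpy as np
import openai  # reads the API key from the OPENAI_API_KEY environment variable

def get_embedding(text, model):
    # Request a single embedding and return it as a list of floats.
    response = openai.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Request the same embedding 10 times.
embedding_guns_small_new = [
    get_embedding("guns", model="text-embedding-3-small") for _ in range(10)
]

# Compare every attempt against the first one.
for i in range(10):
    print(np.dot(embedding_guns_small_new[0], embedding_guns_small_new[i]))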

Sometimes the output is the same, but sometimes it is different, so embeddings can’t be compared byte-for-byte. Based on the thread above and my own checks in Python and Node.js, this discrepancy comes from the service, not from client-side handling of floating-point numbers.

0.9999999277543438
0.9999999277543438
0.9999999277543438
0.9999989712268592
0.9999999277543438
0.9999989653030437
0.9999989653030437
0.9999989653030437
0.9999999277543438
0.9999989653030437
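
Given this jitter, one way to check whether two embeddings represent the same text is to compare them with a tolerance rather than for exact equality. A minimal sketch follows. The function name and the 0.9999 threshold are assumptions, chosen to sit below the values printed above, and are not a documented guarantee.

```python
import numpy as np

def same_embedding(a, b, min_similarity: float = 0.9999) -> bool:
    """Treat two embeddings as equal if their cosine similarity clears a threshold."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return bool(cosine >= min_similarity)

# Toy vectors: a tiny perturbation still counts as the same embedding.
print(same_embedding([1.0, 0.0], [1.0, 1e-4]))
print(same_embedding([1.0, 0.0], [0.0, 1.0]))
```

A check like this is useful when validating a cache or a serialization round trip, where exact equality would report false mismatches.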

To verify the match results, I retrieved fresh embeddings for guns, muskets, and pasteboards, and for waitress, waiter, and rats, with both models. The behavior is consistent:

def similarity(a, b, model):
    # Dot product of freshly generated embeddings for two words.
    return np.dot(get_embedding(a, model=model), get_embedding(b, model=model))

small, large = "text-embedding-3-small", "text-embedding-3-large"

print(f"guns vs. pasteboards small: {similarity('guns', 'pasteboards', small)}")
print(f"guns vs. muskets small:     {similarity('guns', 'muskets', small)}")
print()
print(f"guns vs. pasteboards large: {similarity('guns', 'pasteboards', large)}")
print(f"guns vs. muskets large:     {similarity('guns', 'muskets', large)}")
print()
print(f"waitress vs. waiter large:  {similarity('waitress', 'waiter', large)}")
print(f"waitress vs. rats large:    {similarity('waitress', 'rats', large)}")
print()
print(f"waitress vs. waiter small:  {similarity('waitress', 'waiter', small)}")
print(f"waitress vs. rats small:    {similarity('waitress', 'rats', small)}")

This is just a straight dot product, so the higher the result, the closer the match. The text-embedding-3-small model’s weirdness with guns and text-embedding-3-large’s weirdness with waitress are both evident.

guns vs. pasteboards small: 0.6474366503911342
guns vs. muskets small:     0.47351160151253646

guns vs. pasteboards large: 0.23913953704300578
guns vs. muskets large:     0.5322755486400199

waitress vs. waiter large:  0.4853430603463844
waitress vs. rats large:    0.5838521794318046

waitress vs. waiter small:  0.8872824468235675
waitress vs. rats small:    0.3148280237056705

Conclusion

Embeddings are already powerful, but it’s clear they have room for improvement! I am excited to see how OpenAI’s models and others evolve over time, and I will make sure that my own applications can not only generate embeddings at initial load but also refresh them over time. Thanks for reading to the very end! For more content like this, please follow me on X.