Infopost | 2025.06.14

Poseidon with trident AI art
The only graphics in this post are histograms so here's a mildly-relevant AI Poseidon.

When we last talked I had done a simple embeddings implementation for webpage matching and web graph entry. I had to slim the 4,000,000-keyword vocabulary down to a reasonable size for use on a desktop; I accomplished this by using a list of keywords purloined from the Outer Web trigram corpus.

This rough implementation left me considering ways to squeeze some more words for semantic matching out of the source data without blowing up the heap or cluttering it with rare terms like 'unomig' (United Nations Observer Mission in Georgia) and 'imathia' (part of Macedonia). I was also curious about moving from vector size 100 to 300 to get more data fidelity. A few ideas came to mind:
Since the first two items are lossy, I needed some sort of giggle test to ensure my code worked as desired and didn't overcook the compression. Embedding math (discussed last time) seemed like a reasonable test so I settled on a few variations on the canonical example, "king + woman = queen". Or maybe it's "king - man = queen". Something like that.

My baseline code from a couple weeks ago used an embedding vector size of 100 for each term in the 41k-word Outer Web vocabulary (note, I'll refer to this 41k vocabulary throughout this post). Here are a few sample results with the top five matches listed from left to right:

king + woman: queen   monarch mistress lover    throne
king - man:   monarch prince  lion     throne   warrior
queen + man:  king    woman   princess mistress monarch
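
For reference, here is a minimal sketch of how a query like "king + woman" could be scored. It assumes cosine similarity and that the input words are left out of the results; neither detail is spelled out above, and the toy 3-value vectors are invented purely for illustration.

```python
import math

def add(a, b):
    """Component-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def subtract(a, b):
    """Component-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query, embeddings, exclude=(), top=5):
    """Return the `top` words whose vectors are most similar to `query`."""
    scored = [(cosine(query, vec), word)
              for word, vec in embeddings.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top]]

# Toy vectors, made up for this example (real ones have 100 or 300 values).
toy = {
    "king":    [0.9, 0.8, 0.1],
    "queen":   [0.9, 0.1, 0.8],
    "man":     [0.1, 0.9, 0.0],
    "woman":   [0.1, 0.0, 0.9],
    "bicycle": [-0.7, 0.1, 0.1],
}

result = nearest(add(toy["king"], toy["woman"]), toy, exclude=("king", "woman"))
print(result)
```

With these toy values 'queen' ranks first and 'bicycle' last, which is the whole point of the giggle test.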

Using the same dictionary I generated a ~100mb embeddings file with the size-300 vectors. The results:

king + woman: queen princess monarch  mistress bride
king - man:   kings woman    monarch  queen    gentleman
queen + man:  king  woman    princess maid     monarch

Not the same! So for a more nuanced application, moving to the bigger vector size would probably be worthwhile. But this required success in one or more of the compression strategies.

Truncation

An embedding looks like:

            0      1     2          99
macaroni [0.044 -3.026 1.209 ... -0.180]    # "..." is 96 more numbers

The larger values (positive or negative), 1.209 and -3.026, are strong signal along those dimensions; the values closer to zero (0.044 and -0.180) are weak signal. If you toss all of the embedding values for a word in a histogram, you get something Gaussian-like:

Embedding histogram reluctance
An example histogram of the values in an embedding.
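
A rough sketch of how such a histogram could be binned, using only the standard library. The bin width of 0.25 is an arbitrary choice for this example, and apart from the four 'macaroni' values shown above the sample numbers are made up.

```python
import math
from collections import Counter

def histogram(values, bin_width=0.25):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {i * bin_width: counts[i] for i in sorted(counts)}

# First four values are from the 'macaroni' example; the rest are invented.
sample = [0.044, -3.026, 1.209, -0.180, 0.10, -0.05, 0.21, -0.30, 0.02, 0.40]
hist = histogram(sample)

# Quick text rendering: one '#' per value in the bin.
for edge, n in hist.items():
    print(f"{edge:+.2f} | {'#' * n}")
```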

Most of the values congregate near 0.0/weak signal which, to busy individuals such as myself, are equivalent to noise and therefore an excellent candidate for truncation. So what if instead of the vector above I have:

macaroni [0.000 -3.026 1.209 ...  0.000]

The embedding is still claiming "I'm super [2] and very, very not [1]" without having much detail about feeling so-so about [0] and [99].

Histogram top and bottom end truncation
Zeroizing values < 0.2 and setting a max to 0.5. The need for a max comes up in a minute.

If the vector is represented this way, I can choose to not store the 0 values just as long as I remember the position of the non-zero values.

macaroni [1]: -3.026 [2]: 1.209

Ignoring all of the numbers in the ellipsis, I've shrunk the 'macaroni' vector by half, then added two position index values. In the worst case, this is 8 bytes saved and 4 bytes added.
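
The zeroize-then-drop idea can be sketched in a few lines. This is an illustration rather than my actual code: the function names are hypothetical and the 0.2 threshold just matches the histogram caption above.

```python
def to_sparse(vector, threshold):
    """Keep only (index, value) pairs whose magnitude reaches the threshold."""
    return [(i, v) for i, v in enumerate(vector) if abs(v) >= threshold]

def to_dense(pairs, size):
    """Rebuild a full-length vector, filling dropped positions with 0.0."""
    dense = [0.0] * size
    for i, v in pairs:
        dense[i] = v
    return dense

# The four 'macaroni' values shown above, without the 96 in the ellipsis.
macaroni = [0.044, -3.026, 1.209, -0.180]

sparse = to_sparse(macaroni, threshold=0.2)
restored = to_dense(sparse, len(macaroni))
print(sparse)    # index/value pairs for positions 1 and 2
print(restored)  # weak values come back as 0.0
```

The round trip is lossy by design: the weak values at [0] and [3] come back as zeros.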

Okay but can small values actually be truncated?

In my initial implementation I addressed the converse of this question by seeing if high value data indicated semantic importance. It did, in general, but gave me a lot of obscure words. So it wasn't great for choosing words to keep, but it might still be good for choosing values associated with words to ignore.

To get some idea of this, I created a histogram of embedding values for unmeaningful words like 'and' and 'was'. They hovered between -0.3 and 0.3:

Embedding histogram and
'and'

Embedding histogram was
'was'

Meanwhile, 'catalytic', 'intergovernmental', and 'reluctance' showed a wider histogram:

Embedding histogram catalytic
'catalytic'

Embedding histogram intergovernmental
'intergovernmental'

Embedding histogram reluctance
'reluctance'

This was convincing enough to proceed with an experiment in zeroizing and truncating.

Words that can't be truncated easily

Using my Outer Web dataset, I listed all of the words where at least 60 of their 100 embedding values exceeded an arbitrary threshold of ±0.35. This truncation thing wouldn't work if most words have very few truncatable vector components. Thankfully, from the 41k only about 200 required more than 60 vector values:

54mm oflag monoamine triassic whitish tonumber palomar verband spay
households motile hollers eigenstates welterweight statistique strikeouts
hispanic srgb forelimbs aaaaaa pounder dormers riemann 35px ssrn
medallists polytope animalia ffff00 geosynchronous webgl markku countywide
tiltrotor dihedral atman coulomb terns 13px latino neowise longlisted
participação finned campagnes scholarpedia mirrorless civilwar isfahan
tengah targetable ffffff prelate lanarkshire median 19mm oboes annaeus
colspan solidworks megabit cellpadding iacr cornus godine parsecs 80px
savez infinitive tomatoes aland pesquisa diverses gazetteer webkit
isnotempty templated mixolydian ssse3 oficial shailesh strasberg quarteto
flycatchers electroweak livros bgcolor dimms flyweight rete honeycombs
makeup evergrande megabits ingenieros floodgap microsd inductance gruyter
florets lightgreen catkins archimedean giang hypotension norepinephrine
antarctic aldrig liên mathworld blancpain wildcards haleakala _main
fermionic subpages 46mm quarks viernes iucn condensates 23mm izquierda
infielder scalar winklevoss name1 name2 valign bookable mollusk 28px
tostring aggregator ochrony pointier 48khz decadal ferodo yalsa proposers
teknologi nikkor lgpl gluons darkgray homomorphic phylogenetics saguaros
90px aquatics significand reuptake honkaku brembo compactflash sepals
slalom volum decoction crüe females orbitofrontal escultura anos sbac
mötley rowspan waisted eigenvalue selatan volcanology battlefleet blazon
penstemon sdhc lemmon census args herpes bioavailability 800mhz serviço
clásica transclusion amstrad direito nonacademic exoplanets quizás boneh
islander nucleocapsid breasted pentax
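
A check like the one that produced this list could look something like the sketch below. The function names and the two toy vectors are hypothetical; the 0.35 threshold and the 60-value cutoff are the ones mentioned above.

```python
def strong_count(vector, threshold=0.35):
    """Number of components whose magnitude exceeds the threshold."""
    return sum(1 for v in vector if abs(v) > threshold)

def hard_to_truncate(embeddings, threshold=0.35, min_strong=60):
    """Words with at least `min_strong` components above the threshold."""
    return sorted(word for word, vec in embeddings.items()
                  if strong_count(vec, threshold) >= min_strong)

# Two invented 100-value vectors: one mostly strong, one mostly weak.
toy = {
    "busyword":  [0.5] * 70 + [0.1] * 30,
    "quietword": [-0.5] * 10 + [0.1] * 90,
}

words = hard_to_truncate(toy)
print(words)
```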

Quantization

Quantization involves going from a floating point value to an 8-bit, 16-bit, or 32-bit whole number. With a two-byte index field that would support any reasonable vector size, it made sense to pack the index with a short holding the corresponding value.

Scaling a float to a short requires a min and max. My min was already the arbitrarily-chosen zeroize threshold. Based on an observed 2.5 maximum value, having a max in the 1.5-2.0 range seemed reasonable. So:

macaroni [0.044 -3.026 1.209 ... -0.180] # Original
macaroni [0.000 -3.026 1.209 ...  0.000] # With zeroized
macaroni [1]: -3.026 [2]: 1.209          # Truncated zeros
macaroni [1]: -2.5   [2]: 1.209          # With a max threshold

The [1]: -2.5 then becomes a tidy 32-bit value that is something like:

 index  value
 0001   fe01

And so:

macaroni: 0001fe01 0002804a ... # Truncated and quantized
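
Here is one way the clamp, scale, and pack steps might be written. It is a simplified sketch under stated assumptions: values are scaled linearly from [-max, max] onto a signed 16-bit range (ignoring the min), and the byte order is big-endian. It is not expected to reproduce the exact hex values in the example above.

```python
import struct

INT16_MAX = 32767

def quantize(value, max_value):
    """Clamp to [-max_value, max_value] and scale to a signed 16-bit int."""
    clamped = max(-max_value, min(max_value, value))
    return round(clamped / max_value * INT16_MAX)

def dequantize(q, max_value):
    """Approximate inverse of quantize()."""
    return q / INT16_MAX * max_value

def pack_entry(index, value, max_value):
    """Pack an unsigned 16-bit index and a signed 16-bit value into 4 bytes."""
    return struct.pack(">Hh", index, quantize(value, max_value))

def unpack_entry(blob, max_value):
    """Unpack 4 bytes back into (index, approximate float value)."""
    index, q = struct.unpack(">Hh", blob)
    return index, dequantize(q, max_value)

entry = pack_entry(1, -3.026, max_value=2.5)   # -3.026 is clamped to -2.5
print(entry.hex())
print(unpack_entry(entry, max_value=2.5))
print(unpack_entry(pack_entry(2, 1.209, max_value=2.5), max_value=2.5))
```

Each stored component costs four bytes, and values inside the range come back within a small rounding error.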

Performance

I ran the binary storage, truncation, and quantization with the Outer Web data. With a 0.35 threshold and 1.3 max value, my existing 30mb embeddings file shrank to 6.5mb. My target file size was < 50mb so this result meant I could move to larger vectors and/or a larger dictionary. Unless, of course, the compression made the results suck.

If you'll recall, my pre-compression embedding math examples were this:

king + woman: queen   monarch mistress lover  throne
king - man:   monarch prince  lion     throne warrior
queen + man:  king    woman   princess mistress monarch

Compression/truncation with a 0.35 threshold and 1.3 max changed some of the results/ordering but the semantics seem to be in the right ballpark.

king + woman: monarch queen   throne    grandchild mistress
king - man:   monarch bastard throne    grandchild woman
queen + man:  woman   king    nursemaid monarch    princesses

I tweaked the variables to be 0.25 and 1.9:

king + woman: monarch throne  queen    warrior grandchild
king - man:   warrior monarch imposter lover   curse
queen + man:  king    woman   bride    monarch princess

It's tough to make a confident determination about the goodness of the results, but they certainly weren't prohibitively bad. 'Queen + man' didn't equal 'bicycle'.

Expanding

With the memory savings from compression and truncation, I looked at increasing vector dimensions and vocabulary individually.

First I created a compressed embedding file that only excluded Outer Web stopwords, as opposed to the 41k dictionary that was based on search terms and a whitelist. With vector size 100, 0.28 threshold, and 2.1 max the 3.25gb raw file shrank to 287mb with 1.8 million words. That's good but uses a larger heap footprint than I'd like, particularly since many of the indexed words are rather obscure.

The other expansion direction was to move to a vector size of 300. Using the Outer Web whitelist, 0.28, and 2.1, the 41k words fit in a very reasonable 16mb.

This meant I could move to 300-value vectors and increase my dictionary size, so long as I could come up with a list of words between 41k and 1.8M.

A new word list

It made sense to use the Outer Web data to generate a list of keywords for the new embeddings table. First, I wasn't super successful in using the vector data to sort good from bad. Second, since the embeddings would be used to characterize the blogosphere, using the blogosphere vocabulary would minimize waste.

Since I needed a keyword list bigger than my search index, I wrote a function to step through every page in the Outer Web corpus and create a list of words that appear in at least four distinct domains. This, intersected with the embedding source list, gave me around 117k keywords and a 43.7mb compressed embedding file.
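
The domain-count filter could be sketched like this. The (domain, words) page shape and the function name are assumptions made for the example, and the toy data uses a cutoff of two domains rather than four to stay small.

```python
from collections import defaultdict

def build_keyword_list(pages, embedding_vocab, min_domains=4):
    """Words seen on at least `min_domains` distinct domains that also
    exist in the embedding source vocabulary."""
    domains_by_word = defaultdict(set)
    for domain, words in pages:
        for word in words:
            domains_by_word[word.lower()].add(domain)
    frequent = {w for w, d in domains_by_word.items() if len(d) >= min_domains}
    return sorted(frequent & set(embedding_vocab))

# Invented pages: (domain, list of words on the page).
pages = [
    ("a.example", ["king", "pharaoh"]),
    ("a.example", ["King"]),            # same domain, counted once
    ("b.example", ["king", "zzyzx"]),
    ("c.example", ["zzyzx"]),
]
vocab = {"king", "pharaoh"}

keywords = build_keyword_list(pages, vocab, min_domains=2)
print(keywords)
```

Here 'pharaoh' drops out for appearing on only one domain and 'zzyzx' drops out for having no embedding.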

The math looks good and has some new words:

king + woman: monarch queen   kings   warrior princess
king - man:   kings   warrior monarch queen   jester
queen + man:  king    maid    monarch beggar  majesty

The expanded vocabulary was the difference between knowing what ancient Egyptian kings were called and not knowing:

#  41k words, uncompressed, 100 dimensions:
king + egypt: persia  farouk   morocco   monarch        cyprus    

#  41k words, uncompressed, 300 dimensions:
king + egypt: persia  egyptian kings     queen          morocco   

#  41k words, compressed,   100 dimensions:
king + egypt: monarch farouk   hashemite nebuchadnezzar syria     

# 117k words, compressed,   300 dimensions:
king + egypt: pharaoh kings    retenu    persia         reign     

The new set of embeddings also added 'antoinette' to my vocabulary ('queen + guillotine'). Unfortunately for my lead image, 'ocean + king' = 'atlantic', 'kings', 'queen', 'oceans', and 'tsunami'. 'Poseidon' did not appear.




Review | 2025.06.07

Horizon Forbidden West overridden thunderjaw

Having finished Persona 3 Reload I decided to jump back in to Horizon Forbidden West before moving to my substantial Steam backlog. But first, since Lego Star Wars didn't work out (due to blasterplay), I needed a new game to play with Dani. I downloaded a bunch of things from my PS+ list and started with...

Hot Wheels (Unleashed 2: Turbocharged)

Hot Wheels Unleashed 2 t-rex dinosaur track

We started with the wordy racing title Hot Wheels Unleashed 2: Turbocharged. I was hopeful HWU2 would be like Mario Kart and have robust driver assist for the little one. There was an option for it, though I'm not sure if it applied to story mode. We quickly had Daddy driving and Dani reading track notes. Still, at this age doing track selection, car selection, and unlockables is plenty of fun.

Hot Wheels Unleashed 2 cinematic tentacles Hot Wheels Unleashed 2 Audi Sport Quattro Hot Wheels Unleashed 2 Car-de-Asada
Hot Wheels Unleashed 2 dinosaur museum Hot Wheels Unleashed 2 octopus boss Audi

It's an enjoyable arcade racer - I wouldn't play it on my own but it's a good shared experience. HWU2's story mode has normal races, time trials, last car standing, and boss fights (drive fast, get powerups). You can unlock and upgrade cars from the Hot Wheels lineup that have different handling characteristics. Dani's favorite triceratops car turns like a brick while the Audi Quattro glides through corners but only vibes with Daddy's aesthetic.

Sackboy

Sackboy villain Vex

The next game we tried is the one we're going with. Sackboy: A Big Adventure is a 3D platformer descendant of the Mario and Donkey Kong games. Ordinarily, I shy away from these games because they feel bland, yet invariably I enjoy the heck out of them. Sackboy starts with some very safe areas that Dani had no trouble with. Quickly, however, the levels offered enemies, cliffs, disappearing platforms, etc. It's easy enough for Daddy which, I think, is important to keeping the little one's attention but until Dani has better controller authority she will have to stick to friendlier areas and overworld navigation.

Sackboy dialogue plot Sackboy Zom Zom shop mountaineer Sackboy platform mountain level Sackboy train

Sackboy has a ton of unlockable cosmetics that Dani gets a kick out of. There are also a variety of level types like a single moving platform, sliding, and the catchy music levels.

Somewhere a great rune has broken

Elden Ring SOTE sunset view

Me and J are still working through Shadow of the Erdtree.

Elden Ring SOTE Manus Metyr cathedral
We were disappointed to see that the Manus Metyr area is solo-only. Perhaps that's just because it's a frantic gallop followed by a talky cathedral and From wanted to save us from summoning for that.

Elden Ring SOTE dragon communion altar priestess

Elden Ring SOTE Bayle the Dread
You can't really make it to the top of the Jagged Peak without popping in to Bayle's arena to get wrecked.

Like with the main game, my primary strategy has been to wander through the map and keep track of bosses I (we) had to skip. This isn't bad, but it results in some neglected NPC quests that suddenly become relevant when you hit a difficult boss. Messmer, Putrescent Knight, Bayle - they allow ally summons if you're in that ally's good graces. So I worked backward from those battles to find out what I had to do to bring some meat shields to the encounter. And since dead-ending quests is pretty easy to do in Elden Ring, I represented the research as a dependency graph:

Elden Ring Shadow of the Erdtree to-do list dependency graph
My to-do list that excludes the endgame. The starting point for this graph is having broken the great rune and killed the Dancing Lion and Golden Hippopotamus.

I've avoided looking directly at endgame spoilers but it seems that every questgiver comes back later in a big fight or final gauntlet. So that's a pretty good reason to help them collect 100 rocks or fight a megabear or whatever. Not Moore though, I guess I killed the wrong bug and so he invaded me and got thwacked.

Elden Ring SOTE Bonny Village jars Elden Ring SOTE crossroads view
Elden Ring SOTE dragon communion altar Elden Ring SOTE jagged peak drake Elden Ring SOTE furnace golem cerulean coast
Elden Ring SOTE jagged peak view Elden Ring SOTE Jolan dialogue Elden Ring SOTE view moon Elden Ring SOTE horn finger ruins Rhia

The Forbidden West

Horizon Forbidden West stormbird canyon

I put HFW down a few years back because it's a substantial game and I was needed elsewhere. Happily, I've loaded it up and am once again exploring postapocalyptic California and a few inconsequential states to its east. Let's get my meme question out of the way:

Is Horizon Forbidden West good for a four-year-old?

Horizon Forbidden West Yosemite Valley Half Dome

Despite the game's rating, about 15% of HFW is totally okay for a kid. Most of HFW is shooting arrows at people and dinosaurs with glowing red eyes, but it's quite easy to explore the vast wilderness while avoiding combat. Part of this owes to the invisible geofences in which each dinosaur resides. We even got to see the virtual version of our last Thanksgiving trip. Also entertaining to a kid: climbing, collecting crafting resources, and limited amounts of menus/settlements.

Difficulty and progression

Horizon Forbidden West sunwing stealth

I think I'm playing on hard. I'm not sure, I chose the setting three years ago. Still, when I fired the game up I was surprised by how long it took to take down even midsize enemies. Possibly related - I seem way overleveled for my point in the story. I haven't been grinding to get ahead of the difficulty curve, rather it seems as if side activities and normal machine fights provide a steady stream of xp that makes you overleveled (on paper).

Equipment and upgrades

Horizon Forbidden West upgrade job list Utaru Winterweave

Gear upgrades are not cheap; you need plenty of money and some uncommon machine parts. So it made sense to bypass the purps and aim for legendary gear that would carry through to the final fight. Turns out, these aren't easy to get either. Some weapons are rewards for finishing the collectibles/challenges scattered throughout the map (rebel outposts, ruins, races, etc.). A few are available from vendors. Most are acquired by completing arena challenges. So while I snagged the dreadwing parts needed for the legendary infiltrator armor and killed all the rebel outpost leaders, I'm not sure I have the skill to take down a tideripper in two minutes.

Horizon Forbidden West stormbird acid arrow

With the Utaru Winterweave outfit at the top of my list, I had to fight the stormbird and dreadwing in an unmarked mountain lake a few times. It wasn't as farmy of an experience as Borderlands 2 or some of the clickfest games, but the climb made it a bit laborious. Pretty though.

thumbnail Horizon Forbidden West unmarked dreadwing stormbird site thumbnail Horizon Forbidden West stormbird thumbnail Horizon Forbidden West stormbird moon backlight thumbnail Horizon Forbidden West stormbird unmarked site
thumbnail Horizon Forbidden West stormbird shock charge thumbnail Horizon Forbidden West stormbird night glide

Pet velociraptor

Horizon Forbidden West clawstrider target dummy

Being able to reprogram a herding dino in both Horizon games was an awesome spin on the common RPG mechanic of having a horse or whatever. Fast travel makes them only useful in unexplored areas and even then they're of limited use because they're no good offroad. But the mount mechanic takes a turn for the awesome when you unlock the ability to reprogram clawstriders (robo-raptors).

Horizon Forbidden West clawstrider mount on bridge Horizon Forbidden West clawstrider mount desert
Horizon Forbidden West clawstrider mount Horizon Forbidden West clawstrider mount Horizon Forbidden West desert tremortusk

Clawstrider mounts are a bit more nimble than chargers and bristlebacks but their awesomeness comes not from being rideable. Set in aggressive mode, Aloy's clawstrider buddy is like a stealthy, trusty hunting dog that punches way above its weight in a fight.

Horizon Forbidden West clawstrider fighting dreadwing
Raptor bae couldn't solo a dreadwing but he drew a lot of aggro and survived the fight.

They did a what?

Horizon Forbidden West water level puzzle

Yep. A water temple. It's not as bad as Lake Hylia but that's no excuse.

Cauldrons

Horizon Forbidden West cauldron slitherfang battle

I finished the cauldrons but have yet to collect the resources needed to override everything. I hope there's a spot I can turn an overridden thunderjaw loose on something else inside its geofence.

Horizon Forbidden West cauldron apex tideripper Horizon Forbidden West cauldron hanging from machinery Horizon Forbidden West cauldron assembly machine Horizon Forbidden West cauldron apex tideripper

Gallery

Horizon Forbidden West cannon settlement defense Horizon Forbidden West hunters fighting charger Horizon Forbidden West clawstrider shock arrow Horizon Forbidden West clawstrider mount stream
Horizon Forbidden West clawstriders Horizon Forbidden West bristleback mount night desert Horizon Forbidden West diving plane wing
Horizon Forbidden West rebel outpost focus target tracking Horizon Forbidden West settlement defense cannon Horizon Forbidden West mountaintop view Horizon Forbidden West shieldwing glide
Horizon Forbidden West mountaintop view Sierra Nevada mountains Horizon Forbidden West cauldron Horizon Forbidden West cave diving burrower
Horizon Forbidden West swimming in current Horizon Forbidden West tallneck assembler cauldron Horizon Forbidden West overridden thunderjaw disc thrower Horizon Forbidden West thunderjaw sneaking sunset
Horizon Forbidden West tideripper beach combat Horizon Forbidden West tideripper Horizon Forbidden West tideripper focus view
Horizon Forbidden West tremortusk night Horizon Forbidden West capitol hologram Horizon Forbidden West settlement warrior Horizon Forbidden West marshal recruit warriors


THAT SENTENCE HAD TOO MANY SYLLABLES! APOLOGIZE!

From Mr. Torgue in Borderlands 2 (pic unrelated)




