Infopost | 2023.09.27

[Image: Spiderman pointing meme, seven spideys]

Continuing the small/indie/old-web saga, I wrote some code in the vague direction of my feature request.

Important: this post contains some disgusting xml that I found while processing rss feeds. It's intended for humor only and not meant to be read.
RSS

To review, rss is an xml file with formatted post information. E.g. my latest entry looks like this:

<item>
   <title>Jets</title>
   <link>https://www.chrisritchie.org/kilroy/archive/2023/09/jets.html</link>
   <description>Some telephoto shots of the Blue Angels doing neat tricks.</description>
   <category>Gallerypost</category>
   <pubDate>Sun, 24 Sep 2023 00:00:00 -0700</pubDate>
   <guid>https://www.chrisritchie.org/kilroy/archive/2023/09/jets.html</guid>
</item>

Since rss is succinct, well-defined, and covers the content from many pages, it's a great starting point for an indieweb explorer. I decided to start with rss 2.0 and ignore rss 1.0 and the Atom 'feed' format for the moment.
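For reference, the three flavors can usually be told apart by the root element: rss 2.0 uses <rss version="2.0">, rss 1.0 uses <rdf:RDF>, and the 'feed' format uses <feed>. Here is a minimal sketch of that check using only the JDK's XML parser. FeedFlavor and detect are hypothetical names for illustration, not the code behind this post.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper (not the post's code): classify a feed by its root element.
final class FeedFlavor {
    /** Returns "rss2", "rss1", "atom", or "other". */
    static String detect(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so <rdf:RDF> reports the local name "RDF"
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String name = root.getLocalName();
            if ("rss".equals(name)) {
                // rss 0.9x also uses an <rss> root, so check the version attribute.
                return root.getAttribute("version").startsWith("2.") ? "rss2" : "other";
            }
            if ("RDF".equals(name)) return "rss1";
            if ("feed".equals(name)) return "atom";
            return "other";
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as xml", e);
        }
    }
}
```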

Things started okay:

<channel>
   <title>The Zozzled Cocktail</title>
   <atom:link href="https://zozzledcocktail.wordpress.com/feed/" rel=
   "self" type="application/rss+xml" />
   <link>https://zozzledcocktail.wordpress.com</link>
   <description>A drinker's guide to historic
   cocktailing</description>
   <lastBuildDate>Sun, 25 Jun 2023 18:46:53 +0000</lastBuildDate>
   <language>en</language>
   <sy:updatePeriod>
   hourly   </sy:updatePeriod>
   <sy:updateFrequency>
   1   </sy:updateFrequency>
   <generator>http://wordpress.com/</generator>

Uh oh, "generator = wordpress". Considering how unreadable generated html is, this doesn't bode well for rss. On the other hand, how ugly can you make a simple xml format?

<cloud domain='zozzledcocktail.wordpress.com' port='80' path='/?rsscloud=
notify' registerProcedure='' protocol='http-post' />
<atom:link rel="search" type="application/opensearchdescription+xml" href=
"https://zozzledcocktail.wordpress.com/osd.xml" title="The Zozzled
Cocktail" />
<atom:link rel='hub' href='https://zozzledcocktail.wordpress.com/?
pushpress=hub'/>
<item>
<title>Mojito</title>
<link>https://zozzledcocktail.wordpress.com/2023/06/25/mojito/</link>
<comments>https://zozzledcocktail.wordpress.com/2023/06/25/mojito/
#respond</comments>
<dc:creator><![CDATA[Robert H.]]></dc:creator>
<pubDate>Sun, 25 Jun 2023 18:46:53 +0000</pubDate>
<category><![CDATA[Uncategorized]]></category>
<category><![CDATA[lime juice]]></category>
<category><![CDATA[mint]]></category>
<guid isPermaLink="false">http://zozzledcocktail.wordpress.com/?p=
9352</guid>
<description><![CDATA[The venerable Mojito. We’ve all had one. Most
are horrible; made with cheap ingredients and less than thoughtful
preparation. It’s a deceptively simple concoction and thus, like the
Martini, requires quality ingredients and… <a class="read-more" href=
"https://zozzledcocktail.wordpress.com/2023/06/25/mojito/">Continue
reading <span class="meta-nav">→</span></a>]]></description>
<content:encoded><![CDATA[

Oh....... kay. Well luckily I just need link, title, and description for the moment and Jsoup can clean things up:

Before:
<![CDATA[The Maslow CNC router is popular because it is large, <b>open-source and cheap</b>. It is uniquely well-suited in the CNC space for making furniture on a b...

After:
The Maslow CNC router is popular because it is large, open-source and cheap. It is uniquely well-suited in the CNC space for making furniture on a budget. Th...
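A rough sketch of that extract-and-clean step follows. To keep it runnable on a bare JDK, it parses with the built-in DOM parser and strips markup with a crude regex. The regex is a stand-in for Jsoup, not what this post's code does; with Jsoup on the classpath, Jsoup.parse(html).text() would handle the cleanup properly, including entity decoding. RssItems, Item, and stripTags are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: pull title, link, and description out of each <item>.
final class RssItems {
    static final class Item {
        final String title, link, description;
        Item(String title, String link, String description) {
            this.title = title;
            this.link = link;
            this.description = description;
        }
    }

    static List<Item> parse(String rssXml) {
        try {
            // Not namespace-aware, so prefixed tags like atom:link never match "link".
            NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)))
                    .getElementsByTagName("item");
            List<Item> items = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element item = (Element) nodes.item(i);
                items.add(new Item(text(item, "title"), text(item, "link"),
                        stripTags(text(item, "description"))));
            }
            return items;
        } catch (Exception e) {
            throw new IllegalArgumentException("not parseable as xml", e);
        }
    }

    // Text of the first matching child tag; CDATA wrappers are dropped by the parser.
    private static String text(Element parent, String tag) {
        NodeList found = parent.getElementsByTagName(tag);
        return found.getLength() == 0 ? "" : found.item(0).getTextContent().trim();
    }

    // Crude stand-in for Jsoup: drop anything tag-shaped, collapse whitespace.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").replaceAll("\\s+", " ").trim();
    }
}
```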
Finding what I don't want to find

[Image: web page feed publishing]

Web languages really aren't my thing, so I tried a quick-and-dirty score for each post in a feed to steer away from that stuff.

double score = info.getSimilarity("css", "javascript", "ruby", "seo",
"nodejs");

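The internals of getSimilarity aren't shown here, so the following is only a guess at the general shape of a keyword score: the fraction of a post's tokens that appear in the keyword list. KeywordScore and similarity are hypothetical names, and the tokenizer (lowercase, split on non-alphanumerics) is an assumption.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the post's getSimilarity implementation.
final class KeywordScore {
    /** Fraction of the text's tokens found in the keyword list; 0.0 for empty text. */
    static double similarity(String text, String... keywords) {
        Set<String> wanted = new HashSet<>();
        for (String keyword : keywords) {
            wanted.add(keyword.toLowerCase(Locale.ROOT));
        }
        int total = 0;
        int hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue; // split can yield a leading empty string
            total++;
            if (wanted.contains(token)) hits++;
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }
}
```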
Sure enough, it pointed me away from web languages and some inauthentic-sounding posts:
My web neighbors

[Image: Marginalia crawling web post blog]

The goal of the exercise was Tinder for posts, not web sites, but I'm going to need more than the rss description for that. So I tried another experiment (not knowing if any of the feeds would have any overlap).
  1. I put all 2500ish tags from this site into a set.
  2. I put tokenized title and description data from an rss feed into another set.
  3. I took the intersection and normalized that size by the number of tokens in the rss feed.
Simply put, I asked each rss feed how much of it overlapped with topics from this site. A subset of the results:

Latest post Match pct Matches
spritesmods 12% doom | run | sonic | tree | christmas | pinball | zombies | table
skinnylatte 7% bicycle | shooting | raw | union | sunset | information | post | shelf | ride | live | coffee | college | camping | oakland | 50mm | boat | thanksgiving | law | christmas | lake | led | cheese | skills | java | night | photography | housing | kitchen | camp | wave | light | organization | travel | history | square | trip | beer | grass | food | easter | train | parts | party | dog | weekend | summer | camera | singapore | photos | equipment | film | switch | nikon | about | band | fish | california | ocean | learning | stocks | work | books | writing | event | culture | car
theoverspill 10% mgm | ai | google | chatgpt | email | watch | zero | view | post | france | privacy | apple | digital | reading | ski | twitter | work | books | light | phone | library | camera | pirates | chatbot | tennis | nissan | bots | insulation | amazon | switch | offsets | flickr
marginalia 5% investment | police | information | blogs | list | live | hats | block | intel | law | design | discussion | night | windows | twitter | options | light | computer | lexicon | history | meat | record | solo | simulation | messages | dragons | austin | topics | film | tactics | structure | about | fire | internet | preparation | work | books | writing | nice | apocalypse | space | html | quote | google | computers | pandemic | ethics | post | fails | movie | ikea | algorithms | markup | tool | gear | construction | ship | skills | english | art | run | engineering | camp | crop | book | news | race | discourse | study | modern | web | organization | bots | market | ai | drive | coding | search | seo | jobs | food | bed | startup | france | programming | parts | party | plot | digital | bathroom | table | break | pants | office | posts | expansion | code | dive | pc | paint | reddit | github | email | zero | learning | reading | technology | vaccine | software | flag | feature | site | magic | owl | said | arguments
caleb 12% about | proposal | google | bbq | programming | maps | list | book | ride | society | motogp | animals | dog | austin | writing | tire | hiking | pipeline | posts | photos | track | park | wheel | cota
synesthesia 4% investment | sunset | information | blogs | neural | hats | block | christmas | flickr | stats | watch | disaster | gallery | pipeline | island | water | toyota | training | government | stream | usa | logs | band | 17 | 18 | allies | 40 | wood | maps | apple | lost | work | europe | opinion | portfolio | html | clouds | covid-19 | computers | spring | canada | fails | horizon | driving | careers | strategy | skills | waterfall | microsoft | features | memorial | news | race | war | fan | chocolate | study | web | statistics | politics | ai | ar | virus | adobe | food | cv | interview | dc | startup | ea | ge | digital | diving | portal | photos | legacy | storage | dive | pc | seat | commentary | email | tv | yahoo | preview | reading | tools | podcasts | sound | translate | security | bugs | sets | gpt | alienation | retrospective | focus | wiring | covid | discussion | scanning | voting | video | wave | indieweb | tennis | beam | bear | history | square | meme | action | barbecue | topics | workflow | beta | puzzle | quotes | quoted | meta | spike | memes | nice | excel | culture | ubuntu | replacement | repair | advice | linkedin | pandemic | post | shelf | united | college | burnout | construction | debate | blogger | bender | bonds | captain | drive | jobs | programming...
jacquesmattheij 4% investment | hosting | characters | zombie | police | information | society | hats | gaming | law | floor | maintenance | led | mountain | night | watch | disaster | twitter | primary | computer | door | airbnb | water | electronics | record | shoes | articles | division | electrical | dog | trading | government | stream | map | julian | lists | backpack | about | crash | stick | 17 | 18 | wires | 40 | wood | maps | internet | apple | movies | lost | work | books | writing | ear | europe | facebook | opinion | portfolio | html | clouds | covid-19 | egg | music | google | computers | mechanics | ethics | spring | http | ride | fails | aliens | driven | tool | driving | careers | strategy | skills | java | microsoft | research | features | furniture | engineering | news | election | race | war | risk | taxes | fan | study | modern | web | nvidia | board | survival | market | politics | electric | ukraine | virus | coding | search | food | stairs | startup | image | elephant | plot | digital | library | photos | amazon | code | legacy | storage | rescue | pc | github | email | tv | support | privacy | learning | spreadsheets | reading | software...
andreacorinti 3% valheim | corona | forza | acid | post | fantasy | live | russia | covid | gear | gazebo | futurama | lightning | chatgpt | bolt | playstation | watchmen | news | twitter | fan | web | island | park | nostalgia | prime | ai | virus | deck | cyberpunk | chrome | solo | 2020 | meme | action | mario | summer | casino | seattle | film | netflix | monaco | troubleshooting | tv | internet | dragon | steam
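The three steps of the overlap experiment (site tag set, feed token set, normalized intersection) can be sketched as below. TagOverlap and its methods are hypothetical names, the tokenizer is an assumption, and "number of tokens" is taken to mean distinct tokens since they live in a set.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the tag-overlap experiment, not the post's code.
final class TagOverlap {
    /** Lowercased distinct tokens; hyphens are kept so tags like covid-19 survive. */
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9-]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    /** Feed tokens that are also site tags, sorted for stable output. */
    static Set<String> matches(Set<String> siteTags, Set<String> feedTokens) {
        Set<String> common = new TreeSet<>(feedTokens);
        common.retainAll(siteTags);
        return common;
    }

    /** Intersection size divided by the number of distinct feed tokens. */
    static double matchFraction(Set<String> siteTags, Set<String> feedTokens) {
        if (feedTokens.isEmpty()) return 0.0;
        return (double) matches(siteTags, feedTokens).size() / feedTokens.size();
    }
}
```

Multiplying matchFraction by 100 gives a figure comparable to the "Match pct" column, and joining matches with " | " gives something like the "Matches" column.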

A few omitted datapoints and takeaways:
Moment of zen

From someone's feed:

<figure class="highlight"><pre><code class="
language-c" data-lang="c"><span class="n"
>interface</span> <span class="n">
IGroupBoostManager</span> <span class="o">:</
span> <span class="n">IDispatch</span> <span
class="p">{</span>
<span class="p">[</span><span class="n"
>id</span><span class="p">(</span><span
class="mh">0x00000001</span><span class="p"
>),</span> <span class="n">propget</span>
<span class="p">,</span> <span class="n"
>helpstring</span><span class="p">(</span>
<span class="s">"Gets the Property of the GroupBoost
Enabled State - see enum GroupBoostState"</span><span class=
"p">)]</span>
<span class="n">HRESULT</span> <span class="
n">GroupBoostEnabledState</span><span class="p"
>([</span><span class="n">out</span><
span class="p">,</span> <span class="n">
retval</span><span class="p">]</span> <span
class="kt">long</span><span class="o">
*</span> <span class="n">pVal</span><span
class="p">);</span>
<span class="p">[</span><span class="n"
>id</span><span class="p">(</span><span
class="mh">0x00000001</span><span class="p"
>),</span> <span class="n">propput</span>
<span class="p">,</span> <span class="n"
>helpstring</span><span class="p">(</span>
<span class="s">"Gets the Property of the GroupBoost
Enabled State - see enum GroupBoostState"</span><span class=
"p">)]</span>
<span class="n">HRESULT</span> <span class="
n">GroupBoostEnabledState</span><span class="p"
>([</span><span class="n">in</span><span
class="p">]</span> <span class="kt">
long</span> <span class="n">pVal</span><
span class="p">);</span>
<span class="p">[</span><span class="n"
>id</span><span class="p">(</span><span
class="mh">0x00000002</span><span class="p"
>),</span> <span class="n">helpstring</span>
<span class="p">(</span><span class="s"
>"Clears all User Generated Boosted Groups"</span><
span class="p">)]</span>



Comments
Me

Hmmm, yes. I was expecting an HN-associated techblog list to have clean, minimalist RSS. Imagine my reaction when my parser showed me ^^^. I probably should have been more explicit.


I didn't really understand your post, too much xml



