Continuing the small/indie/old-web saga, I wrote some code in the vague direction of my feature request.
Important: this post contains some disgusting xml that I found while processing rss feeds. It's intended for humor only and not meant to be read.
RSS
To review, rss is an xml file with formatted post information. E.g. my latest entry looks like this:
<item>
<title>Jets</title>
<link>https://www.chrisritchie.org/kilroy/archive/2023/09/jets.html</link>
<description>Some telephoto shots of the Blue Angels doing neat tricks.</description>
<category>Gallerypost</category>
<pubDate>Sun, 24 Sep 2023 00:00:00 -0700</pubDate>
<guid>https://www.chrisritchie.org/kilroy/archive/2023/09/jets.html</guid>
</item>
Since rss is succinct, well-defined, and covers the content from many pages, it's a great starting point for an indieweb explorer. I decided to start with rss 2.0 and ignore 1.0 and the 'feed' (Atom) format for the moment.
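For illustration, here is a minimal sketch of pulling those fields out of an rss 2.0 document using only the JDK's built-in DOM parser. This is not the code behind this post; the class name `RssItems`, the `parse` method, and the map keys are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not from the original post): reads title, link, and
// description from every <item> element of an rss 2.0 document.
class RssItems {

    static List<Map<String, String>> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds are untrusted input, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            List<Map<String, String>> posts = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                Map<String, String> post = new LinkedHashMap<>();
                post.put("title", text(item, "title"));
                post.put("link", text(item, "link"));
                post.put("description", text(item, "description"));
                posts.add(post);
            }
            return posts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    // Text of the first descendant element with this tag name, or "" if absent.
    // getTextContent() also unwraps CDATA sections.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

The parser is left in its default non-namespace-aware mode, so a lookup for `link` matches only elements literally named `link` and skips prefixed ones such as `atom:link`.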
Things started okay:
<channel>
<title>The Zozzled Cocktail</title>
<atom:link href="https://zozzledcocktail.wordpress.com/feed/" rel="self" type="application/rss+xml" />
<link>https://zozzledcocktail.wordpress.com</link>
<description>A drinker's guide to historic cocktailing</description>
<lastBuildDate>Sun, 25 Jun 2023 18:46:53 +0000</lastBuildDate>
<language>en</language>
<sy:updatePeriod>
hourly </sy:updatePeriod>
<sy:updateFrequency>
1 </sy:updateFrequency>
<generator>http://wordpress.com/</generator>
Uh oh, "generator = wordpress". Considering how unreadable generated html is, this doesn't bode well for rss. On the other hand, how ugly can you make a simple xml format?
<cloud domain='zozzledcocktail.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<atom:link rel="search" type="application/opensearchdescription+xml" href="https://zozzledcocktail.wordpress.com/osd.xml" title="The Zozzled Cocktail" />
<atom:link rel='hub' href='https://zozzledcocktail.wordpress.com/?pushpress=hub'/>
<item>
<title>Mojito</title>
<link>https://zozzledcocktail.wordpress.com/2023/06/25/mojito/</link>
<comments>https://zozzledcocktail.wordpress.com/2023/06/25/mojito/#respond</comments>
<dc:creator><![CDATA[Robert H.]]></dc:creator>
<pubDate>Sun, 25 Jun 2023 18:46:53 +0000</pubDate>
<category><![CDATA[Uncategorized]]></category>
<category><![CDATA[lime juice]]></category>
<category><![CDATA[mint]]></category>
<guid isPermaLink="false">http://zozzledcocktail.wordpress.com/?p=9352</guid>
<description><![CDATA[The venerable Mojito. We’ve all had one. Most are horrible; made with cheap ingredients and less than thoughtful preparation. It’s a deceptively simple concoction and thus, like the Martini, requires quality ingredients and… <a class="read-more" href="https://zozzledcocktail.wordpress.com/2023/06/25/mojito/">Continue reading <span class="meta-nav">→</span></a>]]></description>
<content:encoded><![CDATA[
Oh....... kay. Well luckily I just need link, title, and description for the moment and Jsoup can clean things up:
Before | After
<![CDATA[The Maslow CNC router is popular because it is large, <b>open-source and cheap</b>. It is uniquely well-suited in the CNC space for making furniture on a b... | The Maslow CNC router is popular because it is large, open-source and cheap. It is uniquely well-suited in the CNC space for making furniture on a budget. Th...
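To give a feel for what that cleanup involves, here is a crude stdlib-only stand-in. It is not Jsoup and not the code used for this post: Jsoup actually parses the markup, while this sketch just strips CDATA markers and anything that looks like a tag. The class name `DescriptionCleaner` and its `clean` method are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the Jsoup cleanup step. It does not decode
// entities and will mangle text containing bare '<' or '>' characters;
// a real HTML parser such as Jsoup handles those cases.
class DescriptionCleaner {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String clean(String raw) {
        // Drop CDATA markers first; otherwise the tag pattern would swallow
        // everything from "<![CDATA[" up to the next '>'.
        String text = raw.replace("<![CDATA[", "").replace("]]>", "");
        // Remove tags outright so "<b>cheap</b>." does not gain stray spaces.
        text = TAG.matcher(text).replaceAll("");
        // Collapse runs of whitespace left behind by the markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Removing tags without inserting a space keeps inline markup tidy, at the cost of gluing together words that were separated only by block-level tags.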
Finding what I don't want to find
Web languages really aren't my thing, so I tried a quick-and-dirty score for each post in a feed to steer clear of them.
double score = info.getSimilarity("css", "javascript", "ruby", "seo", "nodejs");
Sure enough, it pointed me away from web languages and some inauthentic-sounding posts:
- "Invidious is an alternative front-end to YouTube. It greatly diminishes the amount of JavaScript required to watch content."
- "Nitter is an alternative front-end to Twitter, which uses no JavaScript at all to render the posts and comments. It also supports RSS feeds for user profiles."
- "Omgur is a JavaScript free alternative front-end to Imgur. This project does not include a “front page”, only pages which show actual uploaded content are implemented."
- "Teddit as an alternative front-end to Reddit, without the need for any JavaScript to operate."
My web neighbors
The goal of the exercise was Tinder for posts, not web sites, but I'm going to need more than the rss description for that. So I tried another experiment (not knowing if any of the feeds would have any overlap).
- I put all 2500ish tags from this site into a set.
- I put tokenized title and description data from an rss feed into another set.
- I took the intersection and normalized that size by the number of tokens in the rss feed.
Simply, I asked the rss feed how much of it overlapped topics from this site. A subset of the results:
Latest post | Match pct | Matches
spritesmods | 12% | doom, run, sonic, tree, christmas, pinball, zombies, table
skinnylatte | 7% | bicycle, shooting, raw, union, sunset, information, post, shelf, ride, live, coffee, college, camping, oakland, 50mm, boat, thanksgiving, law, christmas, lake, led, cheese, skills, java, night, photography, housing, kitchen, camp, wave, light, organization, travel, history, square, trip, beer, grass, food, easter, train, parts, party, dog, weekend, summer, camera, singapore, photos, equipment, film, switch, nikon, about, band, fish, california, ocean, learning, stocks, work, books, writing, event, culture, car
theoverspill | 10% | mgm, ai, google, chatgpt, email, watch, zero, view, post, france, privacy, apple, digital, reading, ski, twitter, work, books, light, phone, library, camera, pirates, chatbot, tennis, nissan, bots, insulation, amazon, switch, offsets, flickr
marginalia | 5% | investment, police, information, blogs, list, live, hats, block, intel, law, design, discussion, night, windows, twitter, options, light, computer, lexicon, history, meat, record, solo, simulation, messages, dragons, austin, topics, film, tactics, structure, about, fire, internet, preparation, work, books, writing, nice, apocalypse, space, html, quote, google, computers, pandemic, ethics, post, fails, movie, ikea, algorithms, markup, tool, gear, construction, ship, skills, english, art, run, engineering, camp, crop, book, news, race, discourse, study, modern, web, organization, bots, market, ai, drive, coding, search, seo, jobs, food, bed, startup, france, programming, parts, party, plot, digital, bathroom, table, break, pants, office, posts, expansion, code, dive, pc, paint, reddit, github, email, zero, learning, reading, technology, vaccine, software, flag, feature, site, magic, owl, said, arguments
caleb | 12% | about, proposal, google, bbq, programming, maps, list, book, ride, society, motogp, animals, dog, austin, writing, tire, hiking, pipeline, posts, photos, track, park, wheel, cota
synesthesia | 4% | investment, sunset, information, blogs, neural, hats, block, christmas, flickr, stats, watch, disaster, gallery, pipeline, island, water, toyota, training, government, stream, usa, logs, band, 17, 18, allies, 40, wood, maps, apple, lost, work, europe, opinion, portfolio, html, clouds, covid-19, computers, spring, canada, fails, horizon, driving, careers, strategy, skills, waterfall, microsoft, features, memorial, news, race, war, fan, chocolate, study, web, statistics, politics, ai, ar, virus, adobe, food, cv, interview, dc, startup, ea, ge, digital, diving, portal, photos, legacy, storage, dive, pc, seat, commentary, email, tv, yahoo, preview, reading, tools, podcasts, sound, translate, security, bugs, sets, gpt, alienation, retrospective, focus, wiring, covid, discussion, scanning, voting, video, wave, indieweb, tennis, beam, bear, history, square, meme, action, barbecue, topics, workflow, beta, puzzle, quotes, quoted, meta, spike, memes, nice, excel, culture, ubuntu, replacement, repair, advice, linkedin, pandemic, post, shelf, united, college, burnout, construction, debate, blogger, bender, bonds, captain, drive, jobs, programming...
jacquesmattheij | 4% | investment, hosting, characters, zombie, police, information, society, hats, gaming, law, floor, maintenance, led, mountain, night, watch, disaster, twitter, primary, computer, door, airbnb, water, electronics, record, shoes, articles, division, electrical, dog, trading, government, stream, map, julian, lists, backpack, about, crash, stick, 17, 18, wires, 40, wood, maps, internet, apple, movies, lost, work, books, writing, ear, europe, facebook, opinion, portfolio, html, clouds, covid-19, egg, music, google, computers, mechanics, ethics, spring, http, ride, fails, aliens, driven, tool, driving, careers, strategy, skills, java, microsoft, research, features, furniture, engineering, news, election, race, war, risk, taxes, fan, study, modern, web, nvidia, board, survival, market, politics, electric, ukraine, virus, coding, search, food, stairs, startup, image, elephant, plot, digital, library, photos, amazon, code, legacy, storage, rescue, pc, github, email, tv, support, privacy, learning, spreadsheets, reading, software...
andreacorinti | 3% | valheim, corona, forza, acid, post, fantasy, live, russia, covid, gear, gazebo, futurama, lightning, chatgpt, bolt, playstation, watchmen, news, twitter, fan, web, island, park, nostalgia, prime, ai, virus, deck, cyberpunk, chrome, solo, 2020, meme, action, mario, summer, casino, seattle, film, netflix, monaco, troubleshooting, tv, internet, dragon, steam
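The three-step overlap measure behind these numbers can be sketched roughly as follows. This is an illustration rather than the code I ran: the class name `TagOverlap`, its method names, and especially the tokenizer (lowercase, split on anything that is not a letter or digit) are assumptions.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the site-tag vs. feed-token overlap experiment.
class TagOverlap {

    // Assumed tokenizer: lowercase, then split on runs of characters that
    // are neither letters nor digits. Duplicates collapse because it is a set.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Feed tokens that also appear among the site's tags, sorted for display.
    static Set<String> matches(Set<String> siteTags, Set<String> feedTokens) {
        Set<String> shared = new TreeSet<>(feedTokens);
        shared.retainAll(siteTags);
        return shared;
    }

    // Size of the intersection normalized by the number of feed tokens.
    static double matchFraction(Set<String> siteTags, Set<String> feedTokens) {
        if (feedTokens.isEmpty()) {
            return 0.0;
        }
        return (double) matches(siteTags, feedTokens).size() / feedTokens.size();
    }
}
```

Because the denominator is the feed's own token count, a feed with only a handful of tokens can reach a high fraction from very few shared words.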
A few omitted datapoints and takeaways:
- Since I normalized by the number of rss feed tokens, feeds with few tokens got artificially high match rates from very few actual matches.
- Similarly, feeds with verbose descriptions got low hit rates.
- I found peer sites with similar content/format; talking Netflix and Playstation, posting excerpts of news items and discussing, even a motorcyclist software engineer who plays soccer and got box passes to MotoGP CotA. It's not surprising that these sites exist, but I haven't seen them from the blogroulette services.
- My list of site tags is pretty generic and so there were a lot of generic matches.
Moment of zen
From someone's feed:
<figure class="highlight"><pre><code class="
language-c" data-lang="c"><span class="n"
>interface</span> <span class="n">
IGroupBoostManager</span> <span class="o">:</
span> <span class="n">IDispatch</span> <span
class="p">{</span>
<span class="p">[</span><span class="n"
>id</span><span class="p">(</span><span
class="mh">0x00000001</span><span class="p"
>),</span> <span class="n">propget</span>
<span class="p">,</span> <span class="n"
>helpstring</span><span class="p">(</span>
<span class="s">"Gets the Property of the GroupBoost
Enabled State - see enum GroupBoostState"</span><span class=
"p">)]</span>
<span class="n">HRESULT</span> <span class="
n">GroupBoostEnabledState</span><span class="p"
>([</span><span class="n">out</span><
span class="p">,</span> <span class="n">
retval</span><span class="p">]</span> <span
class="kt">long</span><span class="o">
*</span> <span class="n">pVal</span><span
class="p">);</span>
<span class="p">[</span><span class="n"
>id</span><span class="p">(</span><span
class="mh">0x00000001</span><span class="p"
>),</span> <span class="n">propput</span>
<span class="p">,</span> <span class="n"
>helpstring</span><span class="p">(</span>
<span class="s">"Gets the Property of the GroupBoost
Enabled State - see enum GroupBoostState"</span><span class=
"p">)]</span>
<span class="n">HRESULT</span> <span class="
n">GroupBoostEnabledState</span><span class="p"
>([</span><span class="n">in</span><span
class="p">]</span> <span class="kt">
long</span> <span class="n">pVal</span><
span class="p">);</span>
<span class="p">[</span><span class="n"
>id</span><span class="p">(</span><span
class="mh">0x00000002</span><span class="p"
>),</span> <span class="n">helpstring</span>
<span class="p">(</span><span class="s"
>"Clears all User Generated Boosted Groups"</span><
span class="p">)]</span>