Infopost | 2023.03.04

Hacker toddler

Topics:
Deindexed

I'm an indieweb wannabe and that means three things: I'm all about good content, I hate SEO, and I secretly need visibility and validation. Well, that doesn't exactly describe my sentiment either, but it's nice to see my top search result for the last month was this Stable Diffusion setup issue:

importerror: cannot import name 'vectorquantizer2' from 'taming.modules.vqvae.quantize'

GitHub bug threads aren't the easiest reading, so it made sense to summarize the fix for myself and anyone else trying to do the same thing. So it was frustrating when a bunch of people searched for it, Google said, "I have an answer," and proceeded to link them to my blog landing page rather than the specific post.

Griftlands lick the goo party

The #2 and #3 slots had to do with deckbuilders:

griftlands lick the goo
across the obelisk sylvie build

It's never not a rabbit hole

While scrutinizing my Search Console sitch I also found this:

Google search console unindexed
Unindexed pages in gray, indexed in green.

Other than tag culls and refactors, my site steadily increases in size. But in December my footprint on the Googles took a sharp downturn.

What the heck is "Duplicate without user-selected canonical"?

Google search console duplicate user canonical

Answer, tldr: web sites duplicate a lot of content, so the internet longbeards decided site developers should choose one golden url for each distinct chunk of data. That way, web crawlers can ignore the noncanonical stuff. Here's the tag:

link rel="canonical" href="https://www.chrisritchie.org/kilroy/archive/
2023/03/highlights.html"

My initial implementation just did "https://chrisritchie..." without the "www.". That caused some problems, including Google deciding the www. one was canonical but not indexing it. And in the realm of navigating the Search Console UI, this helped:

The "Duplicate, Google chose different canonical than user" status doesn't show which page Google chose instead. But you can find out which page Google decided to index using the URL Inspection tool.

Bringing my page headers up to 2011 standards

While I was on that faq, I added a few more meta tags such that my header looks like this:

meta charset="UTF-8"
meta name="description" content="Three short clips from recent PUBG
matches."
meta name="keywords" content="lolbaters, battlegrounds, video, fps battle
royale, battle royale, pubg, video games"
link rel="canonical" href="https://www.chrisritchie.org/kilroy/archive/
2023/03/highlights.html"
meta name="viewport" content="width=1030, initial-scale=1.0"
link rel="stylesheet" type="text/css" href="../../../style.css"
link rel="shortcut icon" href="../../../res/favicon.ico"style type="text/
css" media="screen"/style
One more:

Google search console mobile usability viewport
The beautiful thing about variables is that they can have different values rather than just one acceptable assignment. Yeah, mobile sites are trash and need to die, so I'm leaving this at exactly what my page was designed for.

Did it work?

Google search console validation failure

The neat thing is, you can do a batch re-run on the errors Google finds. The not-so-neat thing is that Search Console just abandons the effort if it finds a few pages with issues. Not pouring resources into validating something that wasn't fixed makes sense, except when the chunk of urls to revalidate includes a bunch of moved or removed content. I assume there's a way to blacklist that stuff, but it'd be nicer if it just skipped the 404s and moved them to a different error category (where the lone "Validate Fix" button would be accompanied by a trash button). Considering how much churn web sites have, I'm sure there's a way to do it; I just haven't looked too hard.

Google search console indexed pages

I guess in December Google changed something and began rejecting my pages left and right. Since then it's been a slow climb back.

Slow climb

Google search console discovered not indexed

Around the time that I started labeling my canonical pages, Search Console showed a big spike in Discovered/Unindexed pages for the canonical urls. I guess the canonical pages weren't reindexed in the validation phase; they're just sitting in the deep backlog.
SEO

Gagachicken: Two SEO experts walk into a bar, bars, nightclub, pubs, tavern, beer, alcohol, drinks, alcoholic beverages, bars in my area, places to drink.

I hadn't seen the meta head tags before, though I certainly never looked. I also don't even try checking out the html for professional/SEOed web sites; they're not meant to be read. But since I was already looking at Search Console and easy-win meta tags, I popped over to /r/seo to see what people do.

Homer Simpson bush meme

The sub felt like one of the many also-ran meme stock forums (AMC, Blackberry, Hertz, Nokia, Clover). Everyone was focused on scraping out a few more clicks to make those monetization tendies. The posts and top comments had Secrets to Business Success conference vibes - "my company turns nobodies into tycoons, here's a taste, PM me to find out more!" This guy was actually pretty alright in the comments but his title gets the point across:

malchik23: SEO is easy. The EXACT process we use to scale our clients' SEO from 0 to 200k monthly traffic and beyond

Despite the sub being business-oriented, there were a few interesting technical tidbits for someone familiar with the technology but not the standards.

Futurama future stock 80s business guy

Posts on /r/seo had the basics like "have a sitemap and robots file". Then there was "don't have 30MB images and use that neat new webp format". The more involved site audit techniques advised scrutinizing words and phrasings to closely align with target searches. Yuck.

And finally, linkbacks. The /r/seo contributors talked about basically having an 80s business guy rolodex of sites that'll link to their content without getting both sites blacklisted. And that's part of why Google search sucks.

Rob: The web was a mistake, let's go back to the internet.

Eternal September and such.
Development

Back burner programming projects meme

Since internetting is a hobby and I'm not a web developer, I've slowly adopted normal web site stuff. For example:

I did a favicon

Crying of Lot 49 WASTE symbol Pynchon

Landing pages/index pages

https://chrisritchie.org/posts.html -> 
   https://chrisritchie.org/kilroy/all/index.html

The post index page shows title, description, and tags from all 600ish posts here. Since pure text walls aren't fun, the page is seasoned with a sampling of lead images that changes every time I generate the page.
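For flavor, here's a sketch of how that kind of sampling could work (hypothetical class and method names, not the actual generator code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class LeadImageSampler {
   /**
    * Shuffle the candidate lead images and take a handful. A fresh shuffle
    * on every build is what makes the index page change each time it's
    * generated.
    */
   public static List<String> sample(final List<String> leadImages, final int count) {
      final List<String> shuffled = new ArrayList<>(leadImages);
      Collections.shuffle(shuffled);
      return shuffled.subList(0, Math.min(count, shuffled.size()));
   }
}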

https://chrisritchie.org/kilroy/tags/index.html

I brought tags over from Blogger, but never properly indexed or aggregated them. It was easy enough to give /tags an index.html showing a randomized dump of all the tags.

https://chrisritchie.org/games.html -> 
   https://chrisritchie.org/kilroy/games/index.html

In December I talked about moving video games from being represented as tags to being their own data type. Each game has a page and there's now an index page with all of them. The game title still appears in the tag section of each post, but the link takes you to the game page.
A rabbit hole of graphics and performance and primitives


My graphics editor needed a second design pass. Even on my l33t gamer monitor, a lot of downscaling is required to show four images on one screen with associated toolbars. What's more, I wanted to take a second hack at showing and correcting image histograms (see below).

My graphics code isn't very fast. It doesn't need to be, and it's written in Java, so there's always going to be a generous performance ceiling. But for the histogram correction operation, I was seeing fifteen-second run times to touch every pixel in a modest-size image (like 800x600).

Lessons previously learned

A long while ago I had the idea that I could make a more OO BufferedImage by creating a 2d array of RGBPixel objects. VoluminousUtilityClass.rgb2HSV(int argb) is nice, but pixel.hsv() is nicer. Alas there was an insurmountable performance hit from having millions of objects in lieu of millions of primitives. I ripped out the RGBPixel class from my BufferedImage 2.0, but kept it as a convenience type. Sometimes argb ints become RGBPixels for a series of operations before returning to their primitive roots.

Side note: I never liked the RGBPixel name (with potential peers HSVPixel, LuminancePixel, CIEPixel, etc.). It's effectively Color, but originally needed the semantics of being a datapoint in an image. The name stuck since I couldn't find anything better.
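To illustrate the pattern (a sketch with guessed method names, not the real class): wrap the argb int, do a few readable operations, then drop back to primitives.

/**
 * Sketch of the convenience-type pattern, not the real RGBPixel API.
 * An argb int gets wrapped for a few readable operations (hsv(),
 * luminance(), and friends would live here), then unwrapped before
 * returning to the bulk pixel data.
 */
public class RGBPixel {
   private final int argb;

   public RGBPixel(final int argb) {
      this.argb = argb;
   }

   public int red()   { return (argb >> 16) & 0xFF; }
   public int green() { return (argb >> 8) & 0xFF; }
   public int blue()  { return argb & 0xFF; }

   /** Approximate luminance, 0-255. */
   public int luminance() {
      return (int) (0.299 * red() + 0.587 * green() + 0.114 * blue());
   }

   /** Back to the primitive world when the convenience is over. */
   public int toArgb() {
      return argb;
   }
}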

Signed bytes

Playstation PS4 Armed Hacker 4K theme cyber anonymous

So this histogram operation exposed performance issues in my graphics library and I intended to find them. With the RGBPixel lesson in mind, I immediately looked at my Byte object.

"Byte object? Whaaa?" Java uses signed bytes. It's not inherently terrible, but the lack of an unsigned alternative creates problems for graphics code that might, say, want to represent the value 200. You also can't "typedef short byte", though doing so would mean all of the client code should check bounds.

Being an OO/embedded/IDA guy, I created a proper Byte class. 0-255, no need to check if your int value is 256 because the type does that. A lot of the math that you have to do with A+R+G+B filed neatly under the Byte type, e.g.

public static Byte average(Collection<Byte> values).
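A minimal sketch of the idea (named UByte here only to dodge the java.lang.Byte collision; the real class is literally called Byte and has more to it than this):

import java.util.Collection;

/**
 * Sketch of an unsigned-byte value type: construction enforces 0-255 so
 * client code never has to range-check.
 */
public final class UByte {
   private final int value;

   public UByte(final int value) {
      if (value < 0 || value > 255) {
         throw new IllegalArgumentException("Out of byte range: " + value);
      }
      this.value = value;
   }

   public int intValue() {
      return value;
   }

   /** The kind of channel math that files neatly under the type. */
   public static UByte average(final Collection<UByte> values) {
      if (values.isEmpty()) {
         throw new IllegalArgumentException("Nothing to average.");
      }
      int sum = 0;
      for (final UByte b : values) {
         sum += b.intValue();
      }
      return new UByte(sum / values.size());
   }
}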

Surely the use of Byte in many of my graphics operations (after they are unpacked from the argb int, for example) was the performance issue.

I should have never learned

Sisyphus AMD cpu

I ripped out a lot of code, replacing all zillion Byte instances with short and the agonizing range checking code around it. But nothing got noticeably faster. Here was the code:

canvas.parallelStream().forEach(
   c -> adjustIntensity(canvas, 
                        c, 
                        black_threshold, 
                        histogram.getMedian().getMin(), 
                        white_threshold));

It, in parallel, iterates over every pixel in the image and bumps the brightness based on the histogram bounds and median. Swapping the Bytes out for shorts didn't make any difference. What the heck?

histogram.getMedian().getMin(), 

This was re-computing the image histogram in each call. My brain quietly expected the getMedian() return value to get put in a register and handed to each adjustIntensity() call; instead the expression lives inside the lambda, so it was re-evaluated for every pixel, by every worker thread, despite the getMedian() call being very thread-unsafe.
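The fix is just to evaluate it once, outside the lambda. Something along these lines (a sketch; medianMin is my name and I'm assuming the statistic is an int):

// Compute the histogram statistic once, single-threaded, before fanning out.
final int medianMin = histogram.getMedian().getMin();

canvas.parallelStream().forEach(
   c -> adjustIntensity(canvas,
                        c,
                        black_threshold,
                        medianMin,
                        white_threshold));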

parallelStream() and exceptions

Java's parallelStream() is a really convenient way to do distributed processing since you can call it on any Collection but it doesn't natively handle exceptions. The kinds of things you use parallel processing for are often complex enough to run into exceptional situations. Likewise, the kinds of things you use parallel processing for are often numerous enough that you don't want each thread to report the same issue n times.

Heretofore I was dumb and lazy and implemented a bunch of "...Unchecked()" sister functions that caught the exception and, say, spit out a bright green pixel so I wouldn't miss that there was an issue. It still meant exceptions disappeared, and the code became really cluttered. But it's easy to add another no-throw function and it's hard to do things the right way.

It really isn't hard to do things the right way. I saw some suggestions on StackOverflow, but decided to do this:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/**
 * Provides an interface into parallel streams with exception handling.
 * @param <T>
 * 
 * Usage:
 *    Parallelizable<Type> lambda = t -> operation(t);
 *    Parallelizable.parallelize(collection_of_type, lambda);
 */
public interface Parallelizable<T> {
   void run(T value) throws Exception;

   /**
    * Calls a parallel stream on the given collection.  On throw,
    * rethrows just the last exception.
    * @param <T>
    * @param collection
    * @param lambda
    * @throws Exception
    * 
    * Usage:
    *    Parallelizable<Type> lambda = t -> operation(t);
    *    Parallelizable.parallelize(collection_of_type, lambda);
    */
   public static <T> void parallelize(final Collection<T> collection,
                                      final Parallelizable<T> lambda)
                                      throws Exception {
      // The list is just a hack to have a final variable that I can modify.
      final List<Exception> list = new ArrayList<Exception>(1);
      collection.parallelStream().forEach(t -> {
         try {
            lambda.run(t);
         }
         catch (Exception e) {
            list.clear();
            list.add(e);
         }
      });
      if (list.size() > 0) {
         System.err.println("One or more of the parallel operations failed, rethrowing last exception.");
         throw list.get(0);
      }
   }
}

There are probably even more generic ways to do this but all of my use cases so far are on Lists or Sets. Why disappoint Future Chris by doing it 100% correct this time? This prints the last exception (and could be modified to aggregate a handful) and replaces the parallelStream().forEach() lines with two more succinct operations.
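With that in place, the earlier canvas loop reads something like this (a sketch; "Pixel" stands in for whatever canvas actually holds, and the caller now has to declare or handle the rethrown Exception):

final int medianMin = histogram.getMedian().getMin();

final Parallelizable<Pixel> adjust = c -> adjustIntensity(canvas,
                                                          c,
                                                          black_threshold,
                                                          medianMin,
                                                          white_threshold);
Parallelizable.parallelize(canvas, adjust);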
One final Google complaint

Youtube 1080 is now an advanced setting

At what point did 1080 become an advanced setting?



Related - internal

Some posts from this site with similar content.

Post
2023.09.27

Enwebbed

Parsing RSS feeds to find peers.
Post
2023.12.30

Feature complete

My static site generator can now recommend external blog/smallweb posts with similar subject matter.
Post
2023.11.20

Subsurface

Links pages, webrings, and search.

Related - external

Risky click advisory: these links are produced algorithmically from a crawl of the subsurface web (and some select mainstream web). I haven't personally looked at them or checked them for quality, decency, or sanity. None of these links are promoted, sponsored, or affiliated with this site. For more information, see this post.

blog.slaks.net

Writing the endraw tag in Jekyll code blocks - SLaks.Blog

404ed
aleksandrhovhannisyan.com

Getting Started with Jekyll and GitHub Pages | Aleksandr Hovhannisyan

Jekyll is a static site generator that makes it easy for you to create a website. Learn how to get started with Jekyll and GitHub Pages.
codeyarns.com

Code Yarns - Ashwin Nanjappa's Tech Blog


Created 2024.04 from an index of 185,804 pages.