Infopost | 2019.06.02


A short story from graphics to parallelism to lambdas to MapReduce to Hadoop to Spliterator to parallelStream().

My graphics library has been growing slowly, but the building blocks are there for slightly interesting things. Solved things, but not always so customizable when bought off the shelf. Elegant designs have won out for the most part, and that's brought creeping overhead. So you get to the thumbnail algorithm that takes seconds to complete. For my baseline image, it was 3.61 seconds (we'll come back to this at the very end). That's not bad for hitting go on a blog post (which chugs on ftp anyway, so who cares about some slow graphics processing) or a background batch operation, but it's not great for the realtime case.

Of course in the modern era you want to farm graphics processing out to the GPU. Beyond the difficulty of connecting the CUDA layer (in Windows), there are some conceptual concerns that kept me from driving down this path. For one, there's overhead in moving data back and forth for incremental graphics operations. And while GPUs are superb, the gains are limited versus a many-core, high-clock-rate CPU when you're talking about general purpose code (with branching) on a single image. There's absolutely a use case for shipping stuff off to my still-awesomeish graphics card, but I don't think photo processing is it - with the side comment that the JVM does use the GPU for the actual rendering.

Having some experience with distributed processing, I know that applying a pixel transform to 800x600 = 480,000 pixels is a large and very parallelizable task. For example, converting RGB to luminance is the same operation, with zero state or dependencies, executed 480,000 times. It's also a prime counterexample to my conclusions from the paragraph above, but let's suspend disbelief, because this post is about Java parallelization, not about throwing my hands up at Windows/CUDA, installing Ubuntu, and refactoring everything to fit a graphics API.

The old school approach to the luminance conversion would be:

convert(int pixels[][])
   foreach (pixel)
      // Shift each channel down to 0-255, then apply the standard luminance
      // weights (they sum to 1.0, so there's no divide at the end).
      value = ((pixel >> 16) & 0xff) * 0.2126 +
              ((pixel >> 8) & 0xff) * 0.7152 +
              (pixel & 0xff) * 0.0722;
      pixel = value << 16 | value << 8 | value;

Simple, sequential.

The functional programming approach would be to map the function over the data. Your lambda function is just:

convert(int pixel)
      // Same unpack-weight-repack as above, minus the loop.
      value = ((pixel >> 16) & 0xff) * 0.2126 +
              ((pixel >> 8) & 0xff) * 0.7152 +
              (pixel & 0xff) * 0.0722;
      return value << 16 | value << 8 | value;
...
convert->map(pixels[][]) // or however your language's syntax prefers

... and the means by which it is applied to the data is left unspecified.
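To make that concrete in Java, the lambda half might look like this (a sketch: IntUnaryOperator is the standard functional interface for int-to-int, but the variable names are mine):

import java.util.function.IntUnaryOperator;

// The luminance conversion as a lambda: packed RGB in, packed gray out.
IntUnaryOperator convert = pixel -> {
    int value = (int) (((pixel >> 16) & 0xff) * 0.2126
                     + ((pixel >> 8) & 0xff) * 0.7152
                     + (pixel & 0xff) * 0.0722);
    return value << 16 | value << 8 | value;
};

Nothing about the lambda itself says whether it runs serially or in parallel; that's the point.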

The unspecified means of application left me wondering if there was a way to leverage this for parallelism. In Java. I quickly found that 1.8 had introduced the Spliterator interface. The silly portmanteau name (plus the context by which I got to it) led me to believe this might be the way to iterate over elements in a distributed way. It was even more encouraging that Collection had implemented the interface, so Lists and Sets could be natively split. The docs and StackOverflows had examples like:

Spliterator it = list.spliterator();
it.forEachRemaining(lambdaFunction);

This would behave just like an iterator, with the bonus of contextual properties: a List's spliterator reports that it's ORDERED, a HashSet's does not.
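You can see those characteristics directly (a minimal sketch; the collections are just for illustration):

import java.util.*;

// List spliterators report ORDERED; HashSet spliterators don't.
Spliterator<String> fromList = Arrays.asList("A", "B", "C").spliterator();
Spliterator<String> fromSet = new HashSet<>(Arrays.asList("A", "B", "C")).spliterator();

System.out.println(fromList.hasCharacteristics(Spliterator.ORDERED)); // true
System.out.println(fromSet.hasCharacteristics(Spliterator.ORDERED)); // false

Okay, nice, but still serial processing. I was hopeful when I read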
"Java example to split the elements to two groups and iterate independently." But the example was:

ArrayList<String> list = new ArrayList<>();
         
list.add("A");
list.add("B");
list.add("C");
list.add("D");
list.add("E");
list.add("F");
 
Spliterator<String> spliterator1 = list.spliterator();
Spliterator<String> spliterator2 = spliterator1.trySplit();
 
spliterator1.forEachRemaining(System.out::println);
 
System.out.println("========");
 
spliterator2.forEachRemaining(System.out::println);

Oh, independently/sequentially, not independently/concurrently. trySplit() just hands roughly half the elements to a second Spliterator, and each half is then drained serially on the calling thread. Damn. What I was learning is well-articulated here:

Spliterator itself does not provide the parallel programming behaviour - the reader might be interested in experimenting with a Collection type holding a vastly larger number of objects and then look to implement a parallel processing solution to exploit the use of the Spliterator. A suggestion is that such a solution may incorporate the fork/join framework and/or the Stream API.

Source.

Spliterator doesn't do much unless you add the code to kick off threads/processes that leverage its real contribution: chopping up the work nicely and with thread safety. So this is a solution to the luminance problem, but it's a bad one. To get parallelism, I would need to explicitly split my data into a specific number of subsets and then kick off a thread for each (sketched below). This is laborious, verbose, and full of overhead. I want to parallelize, but more than that I want to map the function, that is, semantically convey that the data is processed independently.
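To make the pain concrete, here's roughly what the hand-rolled version looks like (a sketch only: it assumes the pixels were boxed into a List<Integer>, overhead right there, and reuses the convert lambda from earlier):

// Two-way parallelism by hand: split once, run each half on its own thread.
Spliterator<Integer> left = pixels.spliterator();
Spliterator<Integer> right = left.trySplit(); // takes roughly half the elements

Thread worker = new Thread(() -> right.forEachRemaining(p -> convert.applyAsInt(p)));
worker.start();
left.forEachRemaining(p -> convert.applyAsInt(p)); // this thread does the rest
worker.join(); // throws InterruptedException, one more thing to handle

And that's just two-way. N-way means managing N spliterators and N threads (or wiring up the fork/join framework, per the quote), and this version still discards the converted values - a real one needs somewhere to write results. More plumbing still.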

This led to a brief digression into MapReduce - the concept of mapping a function over the data and then lassoing the results back together (the reduce). That led to Hadoop, an Apache platform for farming out large MapReduce and similar distributed jobs. Neat, but overkill for this.

Things were looking grim, but then I at last checked up on another type that had popped up in some of my queries: Stream. You know, that vague supertype your grandpappy used for IO and de/serialization. Turns out this isn't it. InputStream and OutputStream and their many descendants are actually just subclasses of Object. Stream was introduced in 1.8.

The tutorial page had this MapReduce example:

double average = roster
    .parallelStream()
    .filter(p -> p.getGender() == Person.Sex.MALE)
    .mapToInt(Person::getAge)
    .average()
    .getAsDouble();

Okay, so they're pulling a parallel stream from the roster Collection and performing a bunch of steps on it. That actually looked good. The code is straightforward, and the documentation seemed to indicate it'd be a parallel effort with width determined by the platform (see the aside below). It was worth a try, so I decided to see if I could feed my thumbnailer code into Stream. Pardon my switching of example horse mid-stream, so to speak, but this one had that nice 3.61-second benchmark.
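Aside: the width in question is the common fork/join pool's, and you can ask the JVM what you'll get (standard API; the property below is read once at startup):

import java.util.concurrent.ForkJoinPool;

// Parallel streams run on the common fork/join pool. Its default parallelism
// is the available processor count minus one, since the calling thread also
// participates in the work.
System.out.println(Runtime.getRuntime().availableProcessors());
System.out.println(ForkJoinPool.commonPool().getParallelism());

// Overridable at JVM launch, e.g.:
// -Djava.util.concurrent.ForkJoinPool.common.parallelism=4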

The baseline code just took a user-specified number of areas and did a bunch of computations on them. The area size would be inversely proportional to the number of areas, so we're talking generally the same amount of work regardless of the client code.

foreach (area)
   area.computeInterest(); // Does a lot of pixel math.

I fumbled a bit with how to back into the area.computeInterest() call:

areas.parallelStream()
   .mapToInt(a -> a.computeInterest());

This and similar formulations gave me errors that seemed to understand how badly I was abusing the interface. I calmed down and read a bit more: map() and friends are lazy intermediate operations that don't run until a terminal operation pulls on the stream, and what I wanted was a side effect anyway. Oh, there's a forEach:

areas.parallelStream()
   .forEach(a -> a.computeInterest());

Boom, 1.10 seconds, down from the 3.61-second baseline. And tighter code that more deeply conveys the independence of each operation.
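For completeness, the mapToInt() attempt wasn't far off either; it just needed a terminal operation to fire (a sketch, assuming computeInterest() returns an int score, which is my guess at its signature):

// Same parallel map, with a terminal sum() to actually trigger execution.
int totalInterest = areas.parallelStream()
    .mapToInt(a -> a.computeInterest())
    .sum();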



