2019.06.02

Java programming coding Duke parallelism

A short story from graphics to parallelism to lambda to mapreduce to hadoop to Spliterator to ParallelStream.

As my graphics library has been growing slowly, but the building blocks are there for slightly interesting things. Solved things, but not always so customizable when bought off the shelf. Elegant designs have won out for the most part, and that's brought creeping overhead. So you get to the thumbnail algorithm that would take seconds to complete. For my baseline image, it was 3.61 seconds (we'll come back to this at the very end). That's not bad for hitting go on a blog post (which chugs on ftp so who cares about some slow graphics processing) or a background batch operation, but not great for the realtime case.

Of course in the modern era you want to farm graphics processing out to the GPU. In addition to the difficulty of connecting the CUDA layer (in Windows), there are some conceptual concerns that kept me from driving down this path. For one, there's overhead moving data for incremental graphics operations. And while GPUs are superb, there are limited gains vs an x-core high clock rate CPU when you're talking about general purpose code (with branching) on a single image. There's absolutely a use case for shipping stuff off to my still-awesomeish graphics card, but I don't think photo processing is it - with the side comment that the GPU is used by the JVM for the actual rendering.

Having some experience with distributed processing, I know that applying a pixel transform to 800x600 = 480,000 pixels is a large and very parallelizable task. For example converting RGB to luminance is the same operation, with zero state or dependencies, executed 480,000 times. It's also a prime counter example to my conclusions from the above paragraph, but let's suspend disbelief because this is about Java parallelization, not throwing my hands up at Windows/CUDA and installing Ubuntu then refactoring code into something that works with a graphics API.

The old school approach to the luminance conversion would be:
convert(int pixels[][])
   foreach(pixel)
      value = (pixel & 0x00ff0000 * 0.2126 + pixel & 0x0000ff00 * 0.7152 + pixel && 0x000000ff * 0.0722) / 3;
      pixel = value | value << 8 | value << 16;
Simple, sequential.

The functional programming approach would be to map the function to the data. Your lambda function is just:
convert(int pixel)
      value = (pixel & 0x00ff0000 * 0.2126 + pixel & 0x0000ff00 * 0.7152 + pixel && 0x000000ff * 0.0722) / 3;
      pixel = value | value << 8 | value << 16;     
...
convert->map(pixels[][]) // or however your language's syntax prefers
... and the means by which it is applied to the data is left unspecified.

The unspecified means of application left me wondering if there was a way to leverage this for parallelism. In Java. I quickly found that 1.8 had introduced the Spliterator interface. The silly portmanteau name (plus the context by which I got to it) led me to believe this might be the way to iterate over elements in an distributed way. It was even more encouraging that Collection had implemented the interface, so Lists and Sets could be natively split. The docs and StackOverflows had examples like:
Spliterator it = list.spliterator();
it.forEachRemaining(lambdaFunction);
This would behave just like an iterator with bonus of contextual properties like Lists are sequential and Sets are not. Okay, nice, but still serial processing. I was hopeful when I read
"Java example to split the elements to two groups and iterate independently." But the example was:
ArrayList list = new ArrayList<>();
         
list.add("A");
list.add("B");
list.add("C");
list.add("D");
list.add("E");
list.add("F");
 
Spliterator spliterator1 = list.spliterator();
Spliterator spliterator2 = spliterator1.trySplit();
 
spliterator1.forEachRemaining(System.out::println);
 
System.out.println("========");
 
spliterator2.forEachRemaining(System.out::println);
Oh, independently/sequentially, not independently/concurrently. Damn. What I was learning is well-articulated here:

Spliterator itself does not provide the parallel programming behaviour the reader might be interested in experimenting with a Collection type holding a vastly larger number of objects and then look to implement a parallel processing solution to exploit the use of the Spliterator. A suggestion is that such a solution may incorporate the fork/join framework and/or the Stream API.

Source.

Spliterator doesn't do much unless you want to add the code to kick off threads/processes to leverage its real contribution: chopping up the work nicely and with thread safety. And so this is a solution to the luminance problem, but it's a bad one. To get parallelism, I would need to explicitly split my data into a specific number of subsets and then kick off threads for each. This is both laborious, verbose, and full of overhead. I want to parallelize, but more than that I want to map the function, that is, semantically convey that the data is processed independently.

This led to a brief digression into MapReduce - the concept of mapping a function and then performing a lassoing of results. That led to Hadoop; an Apache platform for farming out large MapReduce and similar distributed processes. Neat, but overkill for this.

Things were looking grim, but then I at last checked up on another type that had popped up in some of my queries. Stream. You know, that vague supertype that your grandpappy used for IO and de/serialization Turns out this isn't it. InputStream and OutputStream and their many descendents actually are just subclasses of Object. Stream was introduced in 1.8.

The tutorial page had this MapReduce example:
double average = roster
    .parallelStream()
    .filter(p -> p.getGender() == Person.Sex.MALE)
    .mapToInt(Person::getAge)
    .average()
    .getAsDouble();
Okay so they're pulling a parallel stream from the roster Collection and performing a bunch of steps on it. That actually looked good.
The code is straightforward and the documentation seemed to indicate it'd be a parallel effort with width determined by the platform. It was worth a try, so I decided to see if I could feed my thumbnailer code into Stream. Pardon my switching of example horse mid-stream, so to speak, but this one had that nice 3.61-second benchmark.

The baseline code just took a user-specified number of areas and did a bunch of computations on them. The area size would be inversely proportional to the number of areas, so we're talking generally the same amount of work regardless of the client code.
foreach (area)
   area.computeInterest(); // Does a lot of pixel math.
I fumbled a bit with how to back into the area.computeInterest() call:
areas.parallelStream()
   .mapToInt(a -> a.computeInterest());
This and similar formulations gave me errors that seemed to understand how badly I was abusing their interface. I calmed down and read a bit more, oh there's a foreach:
areas.parallelStream()
   .foreach(a -> a.computeInterest())
Boom, 1.10 seconds. And tighter code that more deeply conveys the independence of each operation.