I spent the last two days in a very interesting discussion group about visualization challenges for ALMA. ALMA is arguably the first observatory where the data products will routinely lie in “big data” territory — that is, the Gigabyte-Terabyte range where data sets can’t easily be analyzed on a single machine. We’ve created observational datasets this large before, but they have arguably been niche products that only a few researchers use in their entirety (large swaths of the entire 2MASS or Sloan surveys, for example). Many, many people who use ALMA data will have to contend with data sizes >> RAM. The community needs to come up with solutions for people to work with these data products.
The big theme at this discussion group was moving visualization and analysis to the cloud, where more numerous and powerful computers crunch through mammoth files, and astronomers interact with this resource through some kind of web service. We spent a lot of time looking at a nice data viewer and infrastructure developed in Canada that is great for browsing through 100GB (and larger) image cubes. Yet I find myself uneasy about this move to the cloud. I seemed to be in the minority within the group, as most others embraced or accepted this methodology as the inevitable future of data interaction in astronomy (I may or may not have been called a dinosaur — admittedly, I was being a bit obnoxious about my point!).
I get that cloud computing is unavoidable at some level — most astronomers do not have nearly enough computational resources or knowledge to tackle Terabyte image cubes, and we will need to rely on a centralized infrastructure for our big data needs. Centralized resources are also great for community science, where lots of people need to work on the same data. But in an attempt to defend (or at least define) my dinosaur attitudes, here are the issues that I think astronomy cloud computing needs to address:
Scope of access: How often and to what extent will an observer have access to cloud resources? Will she be able to visualize data whenever she wants? Will she be able to run arbitrary computation? How much of a lag will there be between requests and results? Many of us are used to a tight feedback cycle when visualizing, analyzing and interpreting data. Is it a priority to preserve this workflow? Is that technologically and financially feasible?
Style of access: How many ways will we be able to interact with data? What restrictions will be placed on the computation and visualizations we undertake? Will we be able to download smaller sections of the data product for exploration offline? Will the API for all of this come in a convenient form (Python library, RESTful URL, SQL) or some more awkward solution (custom VO protocol, cluttered web form)? What will the balance be between GUI and programmatic access? How well will each be designed and supported (personally, I can tolerate a poor GUI much more than a bad programming library)?
Bottlenecks for single machines: Underlying all of this is the assumption that it is impossible to work with ALMA data on local machines. I think this is overhyped in some respects. Storing even a Terabyte of data is trivial (1 TB hard drives are $100, compared to $2000 per year to store 1 TB on Amazon’s cloud, to say nothing of computation). While churning through all of this data is certainly a many-hour task with a single disk, many operations relevant for visualization, exploration, and simple analysis are trivial (extracting profiles, slices, and postage stamps from a properly indexed data cube is very cheap, and gives you a lot of power to understand data and develop analysis plans). Should we really fully abandon this workflow that almost all astronomers currently use? Is it worth developing new software to help interact with local data more easily?
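To make the "cheap local extraction" point concrete, here is a minimal sketch in Python with NumPy, assuming the cube sits on disk as a raw float32 array ordered (channel, y, x). The file name, shape, and layout are all hypothetical stand-ins, and real ALMA products (FITS, CASA images) would need a reader for their own format. The idea is only that a memory-mapped cube lets you pull out a profile, a slice, or a postage stamp without reading the whole file.

```python
import os
import tempfile

import numpy as np

# Hypothetical cube layout: (channel, y, x), stored as raw float32 on disk.
shape = (16, 64, 64)
path = os.path.join(tempfile.mkdtemp(), "cube.dat")

# Write a small stand-in cube; a real one would already be on disk.
cube_out = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
cube_out[:] = np.arange(np.prod(shape), dtype=np.float32).reshape(shape)
cube_out.flush()
del cube_out

# Re-open read-only. Data is only read from disk when a region is indexed.
cube = np.memmap(path, dtype=np.float32, mode="r", shape=shape)

spectrum = np.array(cube[:, 10, 20])      # profile along the spectral axis at one pixel
channel_map = np.array(cube[3])           # one full channel (a slice)
stamp = np.array(cube[3, 8:24, 8:24])     # postage stamp cut from that channel

print(spectrum.shape, channel_map.shape, stamp.shape)
```

With this layout a channel map is one contiguous read, while a spectrum touches one value per channel plane, so how the cube is ordered and indexed on disk decides which of these operations is the cheap one.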
By no means are these issues insurmountable, and I was probably sweating the details too much for the high-level discussion at the meeting. But the details do matter, and the astronomical community has a mixed track record with creating interfaces to remote data products (new visualization clients are getting pretty good, but services for analysis or data retrieval are still pretty cumbersome). My reaction to most of these clumsy products has been to avoid them, because it has been possible to fetch and analyze the data myself. Once we lose that ability, we will all become very dependent on external services. At that point, the details of remote data interfaces may become the new bottleneck for discovery.
RAWRRRR (dinosaur noises)
What is the difference between these two images?
Obviously, a lot. The first image is a map of the rho-Ophiuchus molecular cloud — one of the nearest sites of star formation. The second is, apparently, random noise.
There is one important way in which these images are similar, however. Here is the histogram of the pixel brightnesses in each image:
The two images have the same distribution of pixel values. In fact, the “noise” image is simply a scrambled version of the first image. They contain identical pixels, arranged in different order.
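The scrambling trick is easy to check numerically. The sketch below uses Python with NumPy and a synthetic gradient image as a stand-in for the real map (the data, image size, and bin count are all assumptions made for illustration). It shuffles the pixels and compares the two histograms on identical bins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a real map: a smooth gradient plus a little noise (hypothetical data).
image = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
image = image + 0.01 * rng.standard_normal((64, 64))

# Shuffle the flattened pixels, then restore the original shape.
scrambled = rng.permutation(image.ravel()).reshape(image.shape)

# Use the same bin edges for both images so the counts are directly comparable.
bins = np.linspace(image.min(), image.max(), 33)
hist_original, _ = np.histogram(image, bins=bins)
hist_scrambled, _ = np.histogram(scrambled, bins=bins)

print(np.array_equal(hist_original, hist_scrambled))  # True: same pixels, same counts
```

Because the scrambled image contains exactly the same values, the match holds for any choice of bins, not just this one.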
Who cares? Well, this illustrates a common limitation to using a histogram to characterize data. It turns out that most maps of molecular clouds have similar histograms — that probably says something interesting about the physical processes that determine cloud structure. However, as the images above show, similar histograms can hide a lot of interesting differences between two data sets.
Histograms contain no information about the arrangement of pixels in an image — that’s why I could scramble the pixels in rho-Oph and preserve the histogram exactly. But there are other ways to rearrange those pixels. How about this, for example?
Again, the histogram of this image is identical to the first two (download the data yourself if you don’t believe me!). The strategy for transforming an image while preserving the histogram turns out to be pretty simple:
1) Find an image you want to match (in the case above, I used this)
2) If necessary, crop/resize the image to match the dimensions of the original image.
3) Find the location of the faintest pixel in the target image.
4) Replace this pixel with the faintest pixel in the original image.
5) Repeat for the second faintest, and so on, until you have replaced all the pixels.
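The steps above can be sketched in a few lines of Python with NumPy. This is not the code behind the images in this post, just one way to do the same rank matching. The function name `rearrange_to_match` is made up for illustration, and the sketch assumes step 2 (cropping or resizing) has already been done.

```python
import numpy as np


def rearrange_to_match(original, target):
    """Rearrange the pixels of `original` to follow the brightness ordering of `target`."""
    if original.shape != target.shape:
        raise ValueError("crop or resize the target to the original's shape first (step 2)")
    # Locations in the target, ordered from faintest to brightest (steps 3 and 5).
    order = np.argsort(target.ravel(), kind="stable")
    result = np.empty(original.size, dtype=original.dtype)
    # The k-th faintest target location receives the k-th faintest original pixel (step 4).
    result[order] = np.sort(original.ravel())
    return result.reshape(target.shape)


# Tiny made-up example: the brightest target pixel is at the top right,
# so the brightest original value (9) ends up there.
original = np.array([[5, 1], [3, 9]])
target = np.array([[0.2, 0.8], [0.1, 0.5]])
print(rearrange_to_match(original, target))
# [[3 9]
#  [1 5]]
```

Sorting both images once replaces the pixel-by-pixel loop, and ties in the target are broken by position, which is an arbitrary choice.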
I put together a Processing applet that demonstrates this for a bunch of different images. You can find it here. This applet also shows you how the pixels in either image correspond to each other — hover your mouse over a pixel in one image to see the location of that pixel in the other.
You can even do this with color images by modifying step 4. Instead of simply substituting pixels in that step, alter the brightness to match the original image, while preserving the color. This will create a histogram of brightnesses that matches the first image, with colors that match the second.
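Here is a rough sketch of that color variant in Python with NumPy. It assumes brightness means the plain average of the R, G, and B channels, which is one of several reasonable definitions and not necessarily the one the applet uses. The function name `recolor_to_match` is hypothetical.

```python
import numpy as np


def recolor_to_match(original, target_rgb):
    """Give `target_rgb` the brightness histogram of `original`, keeping each pixel's color."""
    if original.shape != target_rgb.shape[:2]:
        raise ValueError("crop or resize the target to the original's shape first")
    # Assumed brightness definition: the mean of the R, G, B channels.
    brightness = target_rgb.mean(axis=2)
    # Rank the target pixels and hand out the sorted original values in that order.
    order = np.argsort(brightness.ravel(), kind="stable")
    new_brightness = np.empty(original.size, dtype=float)
    new_brightness[order] = np.sort(original.ravel())
    new_brightness = new_brightness.reshape(original.shape)
    # Scale each pixel's RGB so that its brightness becomes the assigned value.
    safe = np.where(brightness > 0, brightness, 1.0)
    result = target_rgb * (new_brightness / safe)[..., None]
    # Black pixels carry no color to preserve; make them gray at the new brightness.
    black = brightness == 0
    result[black] = new_brightness[black][:, None]
    return result


# Made-up inputs: a grayscale "original" and a random color "target".
rng = np.random.default_rng(1)
original = rng.random((4, 5))
target_rgb = rng.random((4, 5, 3))
recolored = recolor_to_match(original, target_rgb)
print(np.allclose(np.sort(recolored.mean(axis=2).ravel()), np.sort(original.ravel())))
```

Scaling keeps the ratio between channels fixed, but strongly saturated pixels can land outside the displayable range. Clipping them for display would perturb the histogram slightly.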
There isn’t anything too profound going on here (although it always surprises me how well this works). But it does highlight the limitations of histograms in rather stark fashion. It’s interesting that maps of star-forming regions all possess similar histograms, but this does not rule out the possibility that these regions have interesting structural differences between them. Complementary techniques are needed to tease out those differences.
In the meantime, enjoy this picture of the Bieber-ized rho-Ophiuchus (any guesses as to what the histogram looks like?).