The Aspects of Astronomy in the Cloud That Scare Me

I spent the last two days in a very interesting discussion group about visualization challenges for ALMA. ALMA is arguably the first observatory where the data products will routinely lie in “big data” territory (that is, the Gigabyte-Terabyte range where data sets can’t easily be analyzed on a single machine). We’ve created observational datasets this large before, but they have arguably been niche products that only a few researchers use in their entirety (large swaths of the entire 2MASS or Sloan surveys, for example). Many, many people who use ALMA data will have to contend with data sizes >> RAM. The community needs to come up with solutions for people to work with these data products.

The big theme at this discussion group was moving visualization and analysis to the cloud, where more numerous and powerful computers crunch through mammoth files, and astronomers interact with this resource through some kind of web service. We spent a lot of time looking at a nice data viewer and infrastructure developed in Canada that is great for browsing through 100 GB (and larger) image cubes. Yet I find myself uneasy about this move to the cloud. I seemed to be in the minority within the group, as most others embraced or accepted this methodology as the inevitable future of data interaction in astronomy (I may or may not have been called a dinosaur; admittedly, I was being a bit obnoxious about my point!).

I get that cloud computing is unavoidable at some level — most astronomers do not have nearly enough computational resources or knowledge to tackle Terabyte image cubes, and we will need to rely on a centralized infrastructure for our big data needs. Centralized resources are also great for community science, where lots of people need to work on the same data. But in an attempt to defend (or at least define) my dinosaur attitudes, here are the issues that I think astronomy cloud computing needs to address:

Scope of access: How often and to what extent will an observer have access to cloud resources? Will she be able to visualize data whenever she wants? Will she be able to run arbitrary computation? How much of a lag will there be between requests and results? Many of us are used to a tight feedback cycle when visualizing, analyzing and interpreting data. Is it a priority to preserve this workflow? Is that technologically and financially feasible?

Style of access: How many ways will we be able to interact with data? What restrictions will be placed on the computation and visualizations we undertake? Will we be able to download smaller sections of the data product for exploration offline? Will the programmatic interface come in a convenient form (Python library, RESTful URL, SQL) or some more awkward solution (custom VO protocol, cluttered web form)? What will the balance be between GUI and programmatic access? How well will each be designed and supported (personally, I can tolerate a poor GUI much more than a bad programming library)?

Bottlenecks for single machines: Underlying all of this is the assumption that it is impossible to work with ALMA data on local machines. I think this is overhyped in some respects. Storing even a Terabyte of data is trivial (1 TB hard drives are $100, compared to $2000 per year to store 1 TB on Amazon’s cloud, to say nothing of computation). While churning through all of this data is certainly a many-hour task with a single disk, many operations relevant for visualization, exploration, and simple analysis are trivial (extracting profiles, slices, and postage stamps on a properly indexed data cube is very cheap, and gives you a lot of power to understand data and develop analysis plans). Should we really fully abandon this workflow that almost all astronomers currently use? Is it worth developing new software to help interact with local data more easily?
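
To make the "cheap operations" point a bit more concrete, here is a minimal sketch of pulling a profile, a slice, and a postage stamp out of a cube that lives on disk rather than in RAM. It leans on several assumptions that are mine for illustration only: the cube is a raw binary float32 file laid out as (channel, y, x), and the function names are hypothetical. Real ALMA products are FITS or CASA images and would need a proper reader, and how cheap each access pattern is depends on how the file is ordered on disk.

```python
import os
import tempfile

import numpy as np


def open_cube(path, shape, dtype=np.float32):
    """Memory-map a raw binary cube; no pixel data is read until it is sliced."""
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)


def spectrum(cube, y, x):
    """Profile along the channel axis at a single sky pixel."""
    return np.asarray(cube[:, y, x])


def channel_map(cube, channel):
    """One full 2D slice (contiguous on disk for this assumed layout)."""
    return np.asarray(cube[channel])


def postage_stamp(cube, channel, y, x, half_width):
    """Small cutout around (y, x) in one channel, clipped at the cube edges."""
    y0, x0 = max(y - half_width, 0), max(x - half_width, 0)
    return np.asarray(cube[channel, y0:y + half_width + 1, x0:x + half_width + 1])


def write_demo_cube(path, shape, dtype=np.float32):
    """Write a tiny synthetic cube so the sketch can run without real data."""
    data = np.arange(np.prod(shape), dtype=dtype).reshape(shape)
    data.tofile(path)
    return data


if __name__ == "__main__":
    shape = (8, 16, 16)  # (channel, y, x); a real cube would be far larger
    path = os.path.join(tempfile.mkdtemp(), "demo_cube.dat")
    write_demo_cube(path, shape)

    cube = open_cube(path, shape)
    print(spectrum(cube, 3, 5).shape)             # one value per channel
    print(channel_map(cube, 2).shape)             # full image for one channel
    print(postage_stamp(cube, 4, 8, 8, 2).shape)  # small cutout
```

Each call only touches the bytes it needs, so memory use stays small however large the file is. With this layout a channel map is one contiguous read, while a spectrum costs one small read per channel, which is the kind of trade-off that "properly indexed" has to settle.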

By no means are these issues insurmountable, and I was probably sweating the details too much for the high-level discussion at the meeting. But the details do matter, and the astronomical community has had a mixed track record with creating interfaces to remote data products (new visualization clients are getting pretty good, but services for analysis or data retrieval are still pretty cumbersome). My reaction to most of these clumsy products has been to avoid them, because it has been possible to fetch and analyze the data myself. Once we lose that ability, we will all become very dependent on external services. At that point, the details of remote data interfaces may become the new bottleneck for discovery.

RAWRRRR (dinosaur noises)