Bio-IT World Magazine
October 2006 Issue, Cover Story by Kevin Davies
Marshall's IT Plan for Janelia Farm
Oct. 16, 2006 | Driving north from Washington Dulles Airport towards the Potomac River, it's easy to miss Janelia Farm. The only road sign faces the opposite direction, belatedly guiding lost taxi drivers retracing their route in search of the campus. Outside a makeshift hut in the middle of a construction site, the security guard waves a visitor's taxi down a long, winding dirt road appropriately named Helix Drive. Around a corner, however, the scene changes dramatically.
In 2000, the Howard Hughes Medical Institute (HHMI), the nation's largest medical institute, paid $53 million to acquire the Janelia Farm estate, an historical riverfront property. HHMI has since expanded and transformed the near-700-acre estate into a spectacular Rafael Viñoly-designed research campus, which officially opens this month. The first of some two dozen scientific group leaders are already settling in - neuroscientist Karel Svoboda from Cold Spring Harbor, bioinformaticist Sean Eddy from Washington University in St. Louis, and former Celera informatics chief Gene Myers from UC Berkeley.
HHMI's goal is simply to set up outstanding investigators with the environment and resources to push back the boundaries of life sciences. Of critical importance in that quest - especially given a scientific ensemble that is expected to drive the data-intensive fields of imaging, visualization, and bioinformatics - is the IT infrastructure. And so to design and build a data center for the ages, HHMI turned to the best - Marshall R. Peterson.
One would expect him to call Janelia Farm the most exciting project he has ever participated in. But coming from the man who oversaw the impressive IT infrastructure to assemble the human genome at Celera Genomics six years ago, that is particularly noteworthy. "I firmly believe this will rapidly be the premier research facility in the life sciences on the planet," says Peterson. "My marching orders are to give these people what they need to get the job done."
Asked to describe his career highlights, Peterson quietly mumbles something about being an aeronautical engineer and his time in Sweden building SAP. Nary a mention of his helicopter sorties in Vietnam or six Purple Hearts. It was another Vietnam vet, J. Craig Venter, who recruited Peterson to Celera, where as Vice President for Infrastructure Technology, he designed the Compaq Alpha network that assembled both the fruit fly and human genomes.
The fruit fly project was a curious collaboration between Venter, who had few friends in the academic community, and Berkeley geneticist Gerry Rubin. After Peterson left Celera to join the Venter Institute, Rubin hired Peterson to consult on the Janelia Farm project, and subsequently hired him (and several of his ex-Celera IT colleagues) in December 2005 (see "Rubin's Risky Business").
Tour of Duty
It feels like a five-minute walk from Peterson's office, overlooking the duck pond, to the majestic 5,000-square-foot data center - and that's just half of the space at Peterson's disposal. He says it's a trip he shouldn't have to take too often if he's doing his job.
HHMI staff claim the data center network capacity is bigger than ESPN's, with the core matching Google and Yahoo as the fastest single-site networks. Indeed, with the exception of certain three-letter [US Government] agencies, Janelia Farm might be the biggest 10-Gb aggregation in the world. Visitors are unanimously impressed.
"We want to look professional - part of this is marketing too," Peterson admits. "As much as we like to think anyone who comes here wants to work here, there's lots of competition. They want to see that whatever they want, we're going to give them. I have a reputation for customer service - it's not going to stop."
The data center is completely fiber and boasts a multi 10-Gb network. "That's a constant question," says Peterson. "Am I going to get the data to my desktop fast? If I can't, then I'm going to start having people buying their own supercomputers and sliding it under their desk. I don't want that - it's not cost effective, and you can't manage it." He adds: "We're going to have very high-resolution graphics, and people are going to see it very fast. Just one set of microscopes will be generating 500 GB data/day. 24x7x365."
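To put that microscope figure in perspective, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not anything supplied by HHMI: it assumes decimal units (1 GB = 10^9 bytes) and a perfectly even flow of data around the clock, and the function names are invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mbit_per_s(gb_per_day: float) -> float:
    """Average network rate, in megabits per second, for a daily data volume.

    Assumes decimal units (1 GB = 1e9 bytes) and an even flow over 24 hours.
    """
    bits_per_day = gb_per_day * 1e9 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6


def yearly_volume_tb(gb_per_day: float, days: int = 365) -> float:
    """Total volume in terabytes accumulated over a number of days."""
    return gb_per_day * days / 1000


# Figure quoted in the article: one set of microscopes, 500 GB per day.
rate = sustained_rate_mbit_per_s(500)
yearly = yearly_volume_tb(500)

print(f"Average rate: {rate:.1f} Mbit/s")
print(f"Volume per year: {yearly:.1f} TB")
```

On those assumptions, one set of microscopes averages roughly 46 Mbit/s, a small fraction of a 10-Gb link, yet piles up about 182.5 TB over a year. Real instruments tend to produce data in bursts, so peak rates would be higher than this average.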
Although he says that "networking changes a lot," Peterson relied extensively on his Celera and Venter Institute experience in designing the data center, as well as feedback from Myers, Eddy, and other group leaders. Janelia Farm has as much storage and "a lot more horsepower" in 2,000 square feet as Celera had in five times the space, for a lot less money.
Four Vendors
Selecting the vendors involved extensive rounds of competition, with a view, says Peterson, to "obviously trying to get the best price and the best technology." Peterson stipulated the key requirements, notably 10 Gb in the core and stringent security, given the plethora of lab instruments, administrative systems, and visiting scientists. He challenged the vendors: "How would you design a network that is high performance, scalable, and by the way, if we don't pick you for the whole thing, make sure that it's modular so that we can select you for the core, someone else for distribution, wireless, and so on?"
The allure of being selected for HHMI's new campus meant Peterson benefited from some "extremely aggressive pricing." Says Peterson: "This is an incredible place to have technology. What a showcase!" He might even let vendors bring potential customers in for a tour. But he points out another attraction: "My team has a history of being extraordinarily successful."
Ultimately Peterson went with a four-vendor solution:
- Force10 for the core and distribution
- Foundry for the edge
- Juniper for security firewalls
- Meru for wireless
A hallmark of the design - "signature Peterson data center" - is what's under the floor. "Nothing," says Peterson. "Stuff under the floor is tough to troubleshoot, blocks airflow. Power's overhead - the only thing under the floor is air."
The power supply runs to 200 Watts/square foot - not as high as Peterson wanted, "but for budget purposes that's what it is." Peterson says that could be doubled "without bringing down the data center," but he hopes that won't be necessary.
With some 1,200 64-bit Intel Xeon processors in all, cooling was a major concern. Peterson explains: "We ended up going with Dell and Xeons, which are hot, but we did a calculation: given the price we got with them and given the increased power requirements, it still came in price effective. Having said that, we're very interested in the new generation of Intels and obviously AMD." The data center uses 142 tons of air conditioning.
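The power and cooling figures can be loosely cross-checked against each other. The sketch below is illustrative only: it uses the standard conversion of one ton of refrigeration to 12,000 BTU/h (about 3.517 kW), and it assumes, purely for the sake of the example, that about 2,500 square feet of floor is populated. That area is an assumption made here, not a number Peterson gives in this form.

```python
KW_PER_TON = 3.517  # one ton of refrigeration = 12,000 BTU/h, about 3.517 kW


def cooling_capacity_kw(tons: float) -> float:
    """Heat removal capacity in kilowatts for a given tonnage of air conditioning."""
    return tons * KW_PER_TON


def electrical_load_kw(watts_per_sq_ft: float, area_sq_ft: float) -> float:
    """Electrical load in kilowatts for a power density spread over a floor area."""
    return watts_per_sq_ft * area_sq_ft / 1000


# Figures quoted in the article: 142 tons of cooling, 200 W per square foot.
cooling_kw = cooling_capacity_kw(142)

# ASSUMPTION: 2,500 sq ft of populated floor, chosen only for illustration.
load_kw = electrical_load_kw(200, 2500)

print(f"Cooling capacity: {cooling_kw:.0f} kW")
print(f"Electrical load:  {load_kw:.0f} kW")
```

Under that assumed floor area, the two numbers land close together at roughly 500 kW each, which is what one would expect, since nearly all the electricity fed to servers ends up as heat that the air conditioning has to remove.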
Everything in the data center is designed to be ripped out and replaced if needed. "The idea is to design infrastructure that is cost effective and easy to replace. We try to be open source - everything is Linux-based, low stress. It helps hugely with the maintenance."
Storage Demands
Peterson selected three tiers and 150 TB of spinning disk storage from EMC. "We started small... seriously!" Peterson smiles. Tier 1 is 30 TB of SAN. Tier 2 is 70 TB of NAS. Tier 3 - the archive - consists of more NAS on disk plus tape. Peterson wants to expand tier 3. "We have capability of over 1 PB of tape," says Peterson. "I can grow to multi petabytes without adding another cabinet." He opens one of a long row of EMC cabinets to show rows of vacant racks.
"We have lots of empty space," says Peterson. "I can start adding storage incrementally. I want to match my demand curve with the cost curve." He wants to make it easy for his "customers" to move data back and forth. "We'll get reports on what they're using and there are budget issues, but we hope to encourage them to make effective use of storage."
"In many respects this is like Celera - we don't know what we're going to do, and we're not sure how we're going to do it."
The IT staff is holding extensive meetings with the incoming group leaders. "But remember, a lot of what they want to do is stuff that has never been done before," says Peterson. "We're going to do things that are risky. I don't want to buy a bunch of stuff and find out that it's wrong. So we talked with the vendors about long-term levels, and working with them - it's more of an engineering relationship than a vendor relationship."
As for the scientists' desktop preferences, Peterson is agnostic. "They can be anything they want," he says. "We give them Linux on the desktop, Mac, Windows... if you want X, we give them X. Our goal is to try to say, 'Don't tell us what you want, tell us what you want to do.'"
Peterson enthuses about Sun Grid Engine, which runs the compute cluster: "It's open source, we love it. Lots of people have experience with it. The idea is to develop a shared facility. It's hard to go to the COO and request a couple of million dollars worth of processors when you're only using 30 percent of what you've got. Using Sun Grid Engine, it's a shared facility. Stuff goes in a queue, maybe one time you get 10 processors, the next time 1,000 processors." And when there's a fault in a node, Sun Grid Engine re-routes jobs and pages the IT team.
Visualize This
Peterson has barely filled half of the 5,000-square-foot data center, but he has as much space again available when he needs it. "I've allocated a lot of money for visualization. I might take part of that other site and turn it into a cave."
Before long, Peterson hopes that Janelia Farm scientists will be virtually tracking neurons around 3-D representations of the fruit fly brain. Apple would be a logical partner in such an endeavor, but there are many others. "This is Phase 0," says Peterson. "One of Michael Dell's big things is video. They're really excited about working with us."
Peterson is clearly relishing his newfound freedom at HHMI. "We really are pure research," he says. "Thanks to [HHMI's] superb investment group, we can focus on enabling research, giving people the tools they need, and not dotting i's and crossing t's." A $14 billion endowment certainly buys a lot of freedom.
For now, Peterson says that file systems are his biggest concern. "EMC has a very interesting parallel file system. Panasas has a parallel file system, very fast. I've been testing this stuff for four years. Because of the demands of visualization, this is stuff I want to look at."
"The reason HHMI built this is because we want to give people resources they didn't have. In a lot of cases, they've never been exposed to someone coming saying, how many TB storage do you need?" He says he aims to give "Whatever they want within reason - and reason here is a capital R."
Ultimately, the question facing Peterson and colleagues will be: "How do we store and annotate this image data? Imagine 'flying' through an image that's 500,000 by 400,000 pixels, tracing a neuron, trying to see where it fires. The complexity we're facing is mind-boggling. In many respects, it makes sequencing and assembling the human genome look trivial!"
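A rough sense of scale for a single image of that size can be worked out in Python. The sketch is hedged on two assumptions that are not from Peterson: one byte per pixel (8-bit grayscale) with no compression, and a hypothetical 1,024-pixel square tile, included only to show how a viewer might break such an image into manageable pieces.

```python
import math

WIDTH_PX = 500_000    # image dimensions quoted in the article
HEIGHT_PX = 400_000
BYTES_PER_PIXEL = 1   # ASSUMPTION: 8-bit grayscale, uncompressed
TILE_PX = 1024        # ASSUMPTION: hypothetical square tile edge, in pixels


def raw_size_gb(width: int, height: int, bytes_per_pixel: int) -> float:
    """Uncompressed image size in decimal gigabytes."""
    return width * height * bytes_per_pixel / 1e9


def tile_count(width: int, height: int, tile: int) -> int:
    """Number of square tiles needed to cover the image, rounding up at the edges."""
    return math.ceil(width / tile) * math.ceil(height / tile)


size_gb = raw_size_gb(WIDTH_PX, HEIGHT_PX, BYTES_PER_PIXEL)
tiles = tile_count(WIDTH_PX, HEIGHT_PX, TILE_PX)

print(f"Pixels: {WIDTH_PX * HEIGHT_PX:,}")
print(f"Raw size: {size_gb:.0f} GB")
print(f"Tiles of {TILE_PX} x {TILE_PX}: {tiles:,}")
```

On these assumptions a single image holds 200 billion pixels and occupies about 200 GB before compression, spread over some 191,000 tiles. Higher bit depths or multiple color channels would multiply that figure accordingly.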