A Northeastern University undergraduate is leading the development of a new program that will make it possible for certain supercomputers to save their data midway through a computation, preventing the loss of progress due to a computer crash or bug that would otherwise require the machine to be restarted from the beginning.

“Computers are like a car engine: the more complicated they are, the more likely they are to break,” said Greg Kerr, a sophomore computer science major.
Kerr said that his protocol applies to high-performance machines known as InfiniBand supercomputers.

Next month, he will present his research at Recon, a computer science conference held annually in Montreal, Canada. He has been selected to give an hour-long talk on the first day of the conference, an honor for an undergraduate, said Gene Cooperman, a professor in the College of Computer and Information Science, where Kerr is a research assistant.

“If you give your talk on the first day, it means everyone who is there for the conference knows who you are and can talk about your work in the later days,” said Kerr. “It shows that the organizers believe this work is very important and will generate a lot of interest among the attendees.”

InfiniBand is a relatively new computer networking technology that has made high-performance computing cheaper and more accessible since it was developed and released in the early 2000s. Because the InfiniBand network is scalable, it can be used on systems ranging from small computer clusters to some of the world’s largest and most advanced supercomputers.

“This is the networking technology behind some of the world’s largest computers, and yet the number of people who understand the internals of the InfiniBand technology is very small, largely because it is relatively new,” said Cooperman, who urged Kerr to reach out to some of the top InfiniBand experts in the world as he began developing his new software.

No one has been able to restart an InfiniBand process midstream. This new work would allow scientists to more efficiently complete massive calculations on expensive computers in high demand.

This summer, Cooperman and several of his doctoral students are working at Oak Ridge National Laboratory, where some of the nation’s most advanced supercomputers are located, and Kerr believes his work will soon be ready to be applied to those computations.

“I think we’re close,” Kerr said. “We’ve got the main points proven and now we need the summer to iron everything out and work out the bugs.”