Designing Cloud and Big Data Platforms for Data-intensive Scientific Applications

Abstract

The exponential growth of data calls for new methods of data management, analysis, and access. Advances in distributed and parallel computing have the potential to dramatically improve the ability of domain-specific software to process large amounts of data across distributed hardware. Despite these advances, many scientific computing domains still rely on conventional algorithms and data processing methods that fail to fully exploit the benefits of these distributed approaches.

The project will address two key research questions: How should today's cloud and big data platforms evolve to best address the needs of data-intensive scientific applications, and how should these applications evolve to fully exploit the benefits of these platforms?

To answer the first question, we will incorporate mechanisms tailored to data-intensive scientific applications into OpenStack and evaluate their efficacy through experiments. We will run Hadoop on OpenStack and then run MapReduce versions of scientific applications on this platform. We will examine several questions, such as finding the elasticity mechanisms best suited to such applications, designing distributed storage methods that decouple computation and storage, and developing better monitoring strategies to understand application behavior for both analytics and resource management.

To answer the second question, we will introduce three data-intensive scientific domains: life science, smart grids, and weather prediction. Genomics and sequencing applications are typical representatives of life science applications. Smart grid applications can "mine" smart meter data for the "top-k" energy consumers in an hour. "Now-casting" can use weather data to develop short-term personalized weather forecasts.
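
As an illustration of the smart grid example, the "top-k" query can be phrased in a MapReduce style: a map step keys each reading by hour and meter, a reduce step sums consumption per key, and a final selection keeps the k largest totals for the hour of interest. The sketch below runs in a single Python process rather than on Hadoop. The record layout `(meter_id, hour, kwh)`, the function names, and the sample readings are all assumptions made for illustration and are not taken from the project.

```python
import heapq
from collections import defaultdict


def map_reading(record):
    """Map step: emit ((hour, meter_id), kwh) for one reading.

    The (meter_id, hour, kwh) record layout is a hypothetical format.
    """
    meter_id, hour, kwh = record
    return (hour, meter_id), kwh


def reduce_by_key(pairs):
    """Reduce step: sum the kWh values that share the same (hour, meter_id) key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return totals


def top_k_consumers(records, hour, k):
    """Return the k (meter_id, total_kwh) pairs with the highest usage in `hour`."""
    totals = reduce_by_key(map_reading(r) for r in records)
    in_hour = ((meter, kwh) for (h, meter), kwh in totals.items() if h == hour)
    # heapq.nlargest avoids sorting every meter when k is small.
    return heapq.nlargest(k, in_hour, key=lambda item: item[1])


# Made-up sample readings, for illustration only.
readings = [
    ("meter-a", 14, 1.5),
    ("meter-b", 14, 3.0),
    ("meter-a", 14, 2.0),
    ("meter-c", 14, 0.5),
    ("meter-b", 15, 9.0),
]
result = top_k_consumers(readings, hour=14, k=2)
print(result)
```

On a real cluster the map and reduce functions would run as distributed tasks and the final selection would typically be a second, small job; the structure of the computation would stay the same.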