DATA INTENSIVE RESEARCH – BIGGER, FASTER, CHEAPER
Data is being generated in ever larger volumes and at ever faster rates in biomedical, environmental, physical science and engineering, finance and social science research. Besides computational needs, researchers therefore also need to plan for their storage requirements. There are three key requirements to consider: capacity, performance and manageability. Each of these in turn has cost implications. Let’s look at how we can exploit commodity hardware and open-source software, together with some practical ideas, to address the requirements and cost challenges.
Bigger and Cheaper
Three types of storage are required for most research projects – for data processing, long-term data storage and backup.
For small datasets (< 500 GB), the storage capacity provided through the central shared HPC services is probably sufficient for data processing, and researchers can use their desktops, laptops and portable storage for long-term storage and backup. For large datasets (> 500 GB), researchers will have to pay for the additional capacity required for data processing. If enterprise storage is used for data processing, long-term data storage and backup alike, the cost is expected to be more than $100/TB/month.
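As a rough illustration of the arithmetic, the sketch below estimates a lower bound on the enterprise storage bill using the $100/TB/month figure quoted above. The function name, the 2 TB dataset size and the 12-month duration are hypothetical values chosen for the example, not figures from Computer Centre.

```python
def storage_cost(capacity_tb, rate_per_tb_month, months):
    """Return the total cost of holding capacity_tb for the given number of months."""
    if capacity_tb < 0 or rate_per_tb_month < 0 or months < 0:
        raise ValueError("inputs must be non-negative")
    return capacity_tb * rate_per_tb_month * months


# Lower bound quoted in the text; the actual rate is expected to be higher.
ENTERPRISE_RATE = 100.0  # $/TB/month

# Hypothetical project: 2 TB for processing, 2 TB long-term copy, 2 TB backup.
total_tb = 2 + 2 + 2
enterprise_floor = storage_cost(total_tb, ENTERPRISE_RATE, months=12)
print(f"Enterprise storage for one year: at least ${enterprise_floor:,.0f}")
```

Because the quoted rate is a floor, the result should be read as "at least this much" rather than as a quotation.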
To enable more affordable large-scale deployment, we need to explore the use of open-source software and commodity hardware to support high-performance data processing, long-term data storage and backup for research. The Utility Storage Service that Computer Centre provides along these lines costs less than half as much as the enterprise storage system.
Faster and Cheaper
Parallel File System or NFS for data processing?
A Parallel File System offers better performance through parallel I/O, but it is probably overkill for a small cluster of a few nodes. Today we provide more affordable NFS-mounted storage of up to 50 TB capacity for subscription by researchers. This storage can be mounted on the Condominium server nodes they subscribe to or on their own servers. An off-the-shelf Parallel File System will probably cost a few times more to acquire. We are exploring a more cost-effective implementation of a Parallel File System using low-cost commodity storage to support large-scale data processing.
Putting Servers closer to Storage
Physics tells us that, at a constant speed of travel, the time taken is proportional to the distance covered. Some high-speed, low-latency network technologies, such as InfiniBand, also have distance limits that require both the server and storage systems to be hosted within the same data centre. It is therefore always better to place server and storage systems close to each other when higher I/O performance is required.
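To put numbers on the distance argument, the sketch below computes the round-trip propagation delay for a few link lengths. The signal speed of 2.0e8 m/s (roughly two-thirds of the speed of light, typical of optical fibre) and the three distances are assumptions for illustration. The model also ignores switch, protocol and storage latency, which add to the totals in practice.

```python
# Assumed signal speed: about two-thirds of the speed of light in vacuum.
SIGNAL_SPEED_M_PER_S = 2.0e8


def round_trip_delay_us(distance_m, speed=SIGNAL_SPEED_M_PER_S):
    """Round-trip propagation delay, in microseconds, over distance_m metres."""
    return 2 * distance_m / speed * 1e6


# Hypothetical placements of a server relative to its storage.
placements = [("same row of racks", 20), ("same data centre", 100), ("across campus", 5_000)]

for label, distance_m in placements:
    delay = round_trip_delay_us(distance_m)
    print(f"{label:>18}: {delay:7.2f} us per round trip")
```

The delay per round trip is small, but an I/O-heavy job may make millions of round trips, so the difference between the placements accumulates.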
File Transfer Protocols
Assuming you have the necessary network bandwidth, your choice of file transfer protocol and tool can affect your transfer speed. For example, UDP-based tools (such as the Tsunami UDP server) can provide greater throughput than TCP-based protocols (e.g. FTP and HTTP). The GridFTP protocol, which enables multiple simultaneous transfer streams, can offer even greater performance. However, if the available network is congested, a higher-throughput protocol may not help.
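One common reason multiple streams help is that a single TCP stream cannot move more than one window of data per round trip. The sketch below is a simplified model of that limit, not a description of how GridFTP or Tsunami UDP work internally. The 64 KiB window, 50 ms round-trip time and 1000 Mbit/s of available bandwidth are hypothetical values.

```python
def tcp_stream_limit_mbps(window_bytes, rtt_s):
    """Upper bound on one TCP stream: one window per round trip, in Mbit/s."""
    return window_bytes * 8 / rtt_s / 1e6


def aggregate_mbps(streams, window_bytes, rtt_s, available_mbps):
    """Parallel streams add up until the available bandwidth caps them."""
    return min(streams * tcp_stream_limit_mbps(window_bytes, rtt_s), available_mbps)


WINDOW = 64 * 1024   # bytes, hypothetical TCP window
RTT = 0.05           # seconds, hypothetical long-distance round trip
AVAILABLE = 1000.0   # Mbit/s actually free on the path, hypothetical

for streams in (1, 8, 32, 200):
    rate = aggregate_mbps(streams, WINDOW, RTT, AVAILABLE)
    print(f"{streams:>3} stream(s): {rate:8.2f} Mbit/s")
```

Lowering `AVAILABLE` to model a congested network shows the same effect as the text describes: once the path itself is the bottleneck, adding streams no longer raises throughput.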
Bring Application to Data
Instead of transferring a large dataset to the application for processing, it is more cost-effective (it takes less time) to bring the application to the data. This is the concept adopted by the Hadoop software framework for large-scale data analysis. We plan to explore Hadoop on low-cost commodity hardware for research applications in the near future.
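The idea can be sketched in a few lines of plain Python. This is a toy, single-process imitation of the map and reduce steps and does not use the Hadoop API. The node names and records are made up for the example. Each block is processed where it is stored, and only the small per-node summaries are combined.

```python
from collections import Counter
from functools import reduce

# Hypothetical data blocks, each imagined to live on a different storage node.
blocks = {
    "node-a": ["ATG", "CCG", "ATG"],
    "node-b": ["CCG", "TTA"],
    "node-c": ["ATG", "TTA", "TTA"],
}


def count_local(records):
    """Map step: runs next to the block and returns a small summary of it."""
    return Counter(records)


def merge(left, right):
    """Reduce step: combines two per-node summaries into one."""
    return left + right


# Only the summaries, not the raw records, would need to cross the network.
partial_counts = [count_local(records) for records in blocks.values()]
totals = reduce(merge, partial_counts, Counter())
print(dict(totals))
```

In a real deployment the blocks would be far larger than the function shipped to them, which is where the saving in transfer time comes from.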
Affordable Large Data Management
If your research project involves a large amount of data or a large number of files that need to be stored, retrieved and shared among your research team, a data management system such as iRODS will probably come in handy. The iRODS data management service is currently available for trial by NUS researchers.
Cost Effective Data Intensive Research Support
Resources and services currently offered by Computer Centre include:
• Shared Parallel File System for data processing
• Low-cost storage for small cluster data processing
• Low-cost storage for long-term storage and backup
• iRODS for data management
Future Development:
• Low-cost Parallel File System
• Hadoop data processing system
• Fast file transfer system
Please contact us at ccehpc@nus.edu.sg if you are interested in any of the above services.