Success Case

Researching Cellular Aging Mechanisms at Rey Juan Carlos University

by GIGABYTE
The speed, the precision, and the vast amount of data called for a computing cluster. Researchers Sergio Muñoz and Luis Bote at Rey Juan Carlos University worked with SIE and GIGABYTE to create a cluster composed of four kinds of nodes: GPU, compute, storage, and head.
Sergio Muñoz discusses the research being done on the Talos cluster. 
Talos Tackles Aging
In early 2023, Rey Juan Carlos University completed the installation of its "Talos" cluster, led by researchers Sergio Muñoz and Luis Bote. Talos, named after the bronze automaton of Greek mythology, sometimes described as the first non-organic artificial intelligence, provides significant computational power for the team's research on cellular aging mechanisms.
Learn More:  
《Read the Story on SIE's website: Leaders in Biomedical Research
《Glossary: Computing Cluster
《Glossary: Heterogeneous Computing
Sergio Muñoz, Ph.D. (URJC) and SIE's Raúl Díaz
The Research Team and Institution
Sergio Muñoz, who has a Ph.D. in machine learning and is a biomedical engineering professor at Rey Juan Carlos University in Spain, collaborates with the BigMed+ professors and researchers on designing AI and machine learning algorithms. Rey Juan Carlos University (URJC) is a highly acclaimed research university. With 46,000 students and five research groups spanning 31 fields of the arts, sciences, and literature, the university boasts a vibrant academic environment.

In their research, algorithms are vital not only for providing solutions but also for comprehending the underlying data, since understanding the data enables the algorithms to respond effectively to questions. In this field, black boxes, which cannot explain their answers, are unwelcome. And while humans excel at certain perceptual tasks, they struggle to extract hidden insights from vast amounts of data. Hence, AI and machine learning are used to process this information and uncover the concealed patterns that answer the questions being posed.
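
As a minimal illustration of the difference, consider this short Python sketch (a generic example with synthetic "biomarker" data; it is our own illustration, not the team's actual pipeline). Rather than treating the trained model as a black box, permutation importance reveals which inputs actually drive its predictions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for biomedical data: 5 hypothetical "biomarkers",
# of which only features 0 and 2 actually determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Permutation importance: shuffle one feature at a time and measure how much
# accuracy drops. Large drops mark the features the model truly relies on.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")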

Health, particularly biomedical engineering, is a central focus of great significance in their research. Designing artificial intelligence algorithms, especially in the domain of machine learning, requires a horizontally scaling architecture, as well as a solution to overcome the barriers of limited storage and infrastructure capacity so that algorithms can scale out and execute efficiently.

Given that the research group specializes in designing spatio-temporal simulations, the GPUs had to perform well in double-precision calculations. Their explainable AI work also emphasizes deep learning techniques and generative models, which call for cutting-edge NVIDIA A100 Tensor Core GPUs built on the NVIDIA Ampere architecture.
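
Why double precision matters for long-running simulations can be shown in a few lines of Python (a generic numerical illustration, not the group's simulation code). Accumulating the value 0.1 a million times should give exactly 100,000; single precision drifts visibly, while double precision stays on target:

import numpy as np

# Accumulate 0.1 one million times; the exact answer is 100000.
total32 = np.float32(0.0)
total64 = np.float64(0.0)
for _ in range(1_000_000):
    total32 += np.float32(0.1)   # single precision: rounding error compounds
    total64 += np.float64(0.1)   # double precision: error stays negligible

print("float32:", total32)   # noticeably off from 100000
print("float64:", total64)   # ~100000.0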

Overall, the research group's needs come down to three:
• A significant number of CPU cores to enable parallel computing and run their machine learning models.
• Latest-generation GPUs with strong double-precision performance for explainable AI and simulation.
• Ample storage, especially for biomedical applications that involve researchers across the globe.

Research Objective: Cellular Aging and Reprogramming
The research aims to understand the natural process of aging at the cellular and molecular levels, both forward, from young to old, and in reverse through cell reprogramming. It encompasses various areas, including cardiology and the study of hereditary cardiac diseases.

Collaboration with esteemed research groups at the University of Murcia and the Virgen de la Arrixaca University Clinical Hospital has been significant, drawing on their extensive collection of heart tissues and blood samples.
《SIE success story: University of Barcelona
《SIE success story: Climate Change
COVID: An Opportunity for Improvement
When COVID-19 emerged, this research group focused on understanding why older individuals and those with preexisting cardiopathy were more severely affected. This study led to participation in a REACT-EU project. Collaborating with prestigious research centers such as CNB-CSIC, CEMBio, Parque Científico de Madrid, and MIT, the research team explored the connection between cardiopathy, aging, and COVID-19. The team also developed a preclinical animal model to study cytokine storm syndrome, creating a versatile platform for detecting targets and designing treatments not only for COVID-19 but also for future pandemics and diseases.

With comprehensive single-cell multiomics data from humans and animals, they established a powerful computing center in collaboration with SIE, leveraging SIE's expertise in HPC and in GIGABYTE platforms.

Future Focus
This group is focused not only on pursuing new knowledge, but also on transferring it. The supercomputing center benefits collaborators and society by sharing knowledge with partner universities; for instance, its rapid data processing aids companies interested in machine learning.

Their future research will focus on two areas: the study of partial or transitory cellular reprogramming for enhanced quality of life, and oncology.
Talos Cluster at URJC
Technical Breakdown
GIGABYTE servers, integrated by SIE, give the researchers substantial computing power. To manage the cluster, they can count on GIGABYTE Server Management (GSM), a proprietary multiple-server remote management software platform provided free of charge by GIGABYTE.

The cluster comprises:
• Four GIGABYTE G492-ZD2 GPU nodes
• Two GIGABYTE R182-Z91 compute nodes
• One GIGABYTE S451-3R1 storage node
• One GIGABYTE R182-Z91 head node

GPU Nodes:
The G492-ZD2 is a server purpose-built for the best possible performance in GPU-centric workloads. It uses a dual-chamber design in a 4U chassis, with the top 1U dedicated to the CPU platform and the bottom 3U dedicated to the GPUs, while still supporting up to 10 low-profile NICs. This design offers the best air cooling possible, so the system can sustain peak performance without compromise.

Each GPU node has two AMD EPYC 7282 processors for a combined 32 CPU cores and 160 PCIe 4.0 lanes. The heavy lifting and parallel processing come from the NVIDIA HGX™ A100 platform: each GPU node has eight NVIDIA A100 SXM4 GPUs. Across the four GPU nodes, this gives the cluster impressive computing power: 221,184 CUDA cores and 13,824 Tensor Cores, with a theoretical FP64 Tensor Core performance just north of 600 TFLOPS. Connectivity is optimized for direct GPU-to-GPU data movement: the A100 GPUs within each node are interconnected through NVIDIA® NVLink™, providing 600 GB/s of throughput between GPUs.
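
For the curious, the arithmetic behind those aggregate figures checks out against NVIDIA's published per-A100 specifications:

# Aggregate GPU resources across the four G492-ZD2 nodes, using NVIDIA's
# published per-A100 specifications.
CUDA_CORES_PER_A100 = 6912
TENSOR_CORES_PER_A100 = 432
FP64_TENSOR_TFLOPS_PER_A100 = 19.5

gpus = 4 * 8  # four GPU nodes, eight A100s each

print("CUDA cores:        ", gpus * CUDA_CORES_PER_A100)          # 221184
print("Tensor Cores:      ", gpus * TENSOR_CORES_PER_A100)        # 13824
print("FP64 Tensor TFLOPS:", gpus * FP64_TENSOR_TFLOPS_PER_A100)  # 624.0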

Compute Nodes:
The R182-Z91 is used for its compact dual-socket design supporting up to 128 CPU cores from the AMD EPYC 7003 series processors. For storage, it has 8 x 2.5" SATA/SAS drive bays and 2 x U.2 NVMe drive bays; the eight bays hold SATA SSDs, and the OS lives on an NVMe PCIe 4.0 drive. There is still room for two low-profile expansion slots, typically used for NICs.

Each compute node has two AMD EPYC 7763 processors at 2.45GHz, each with 64 cores (128 threads) and 256MB of L3 cache. As a dual-socket server with 8 memory channels per socket, the system is outfitted with 1024GB of DDR4 memory. A RAID controller manages the eight fast SATA SSDs, which provide rapid data access while reducing maintenance costs and energy consumption. All in all, each compute node delivers a substantial level of performance.
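
A quick back-of-the-envelope check of that configuration (assuming one DIMM per memory channel, which is our assumption rather than a stated detail):

# Per-node totals for the R182-Z91 compute nodes. One DIMM per channel
# (1 DPC) is our assumption, not a detail stated in the configuration.
sockets = 2
cores = sockets * 64        # two EPYC 7763 CPUs, 64 cores each
threads = cores * 2         # SMT doubles the thread count
channels = sockets * 8      # 8 DDR4 memory channels per socket

print("cores:", cores, "threads:", threads)        # 128 cores, 256 threads
print("GB per DIMM at 1 DPC:", 1024 / channels)    # 64.0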

Head and Storage Nodes:
The single head node acts as the gateway and management server for this cluster. Again, the R182-Z91 was chosen, but this time without the need for a core-dense CPU configuration; instead, two low-power AMD EPYC 7252 processors (120W TDP) were selected. The server was also chosen for its future scalability: only half the memory DIMM slots are populated, allowing memory capacity to be doubled later.

As with any storage server, the focus is on storage capacity rather than computing performance. The S451-3R1 supports up to 36 x 3.5" SAS/SATA drives as well as 6 x 2.5" hybrid NVMe/SATA/SAS drive bays. In this cluster, the system has two Intel Xeon Silver 4210R processors for a total of twenty CPU cores, more than enough for a storage node, and at a low 100W CPU TDP. Again, a RAID controller is used, presenting the 36 x 18TB HDDs as two RAID 6 arrays in a single high-performance volume, for a usable capacity of 576TB.
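
The usable capacity follows directly from the RAID 6 layout, since each RAID 6 array reserves two drives' worth of parity:

# Usable capacity of the S451-3R1 storage node: 36 x 18TB HDDs arranged
# as two RAID 6 arrays, each reserving two drives' worth of parity.
drives, drive_tb = 36, 18
arrays, parity_per_array = 2, 2

raw_tb = drives * drive_tb                                    # 648 TB raw
usable_tb = (drives - arrays * parity_per_array) * drive_tb   # 576 TB usable
print(f"raw: {raw_tb} TB, usable: {usable_tb} TB")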

The servers also communicate over the NVIDIA Quantum InfiniBand® networking platform, through dual redundant ports on NVIDIA® ConnectX®-6 cards.

SIE has configured all the GIGABYTE systems in this cluster around its HPC LadonOS 8 ecosystem, an open-source stack based on CentOS that lets the researchers use the cluster without paying for proprietary software, greatly reducing the total cost of ownership. The main tools are:
a) Rocky Linux 8.7, the chosen operating system, because it is highly stable and offers hardening through iptables.
b) SLURM for job scheduling and workload management, the same tool used by leading Spanish clusters such as MareNostrum and Hyperion (a minimal job-submission sketch follows this list).
c) A Docker container system, which allows each application and its libraries to be isolated without virtualizing the entire machine.
d) Checkmk, a management console that administers devices through IPMI and monitors them through SNMP.
e) EasyBuild, a software build and installation framework that makes it possible to manage scientific software on HPC systems efficiently.
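
To give a feel for how a researcher interacts with SLURM on a cluster like this, here is a minimal job-submission sketch in Python. The partition name, resource requests, and training script are hypothetical examples, not details of the Talos configuration:

import subprocess
import tempfile

# Minimal sketch of submitting a GPU job to SLURM from Python. The partition
# name, resource requests, and training script are hypothetical examples,
# not the Talos cluster's actual configuration.
batch_script = """#!/bin/bash
#SBATCH --job-name=aging-model
#SBATCH --partition=gpu          # hypothetical partition name
#SBATCH --gres=gpu:a100:2        # request two A100 GPUs on one node
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00

srun python train_model.py       # hypothetical training script
"""

with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(batch_script)
    script_path = f.name

# sbatch replies with the new job ID, e.g. "Submitted batch job 1234".
result = subprocess.run(["sbatch", script_path], capture_output=True, text=True)
print(result.stdout or result.stderr)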

True Spectacle
Standing back and reflecting on the possibilities of this system is really something. It meets and exceeds the expectations that URJC's Sergio Muñoz and Luis Bote had, and we look forward to more discoveries coming out of their research to better the human condition.