Activity report: 2016
One of the major challenges of the coming years, in preparing for the transition to systems with millions of computing cores, remains optimizing the interaction between the application layers and the machine layers. This requires working on several fronts: on the one hand, developing sophisticated tools to analyze what happens at the level of the computing core and of the communication network; on the other hand, working on Data Science or HPC applications in order to remove scalability bottlenecks. This expertise, at the crossroads of tool development and guidance for rewriting codes to take advantage of new architectures, is at the heart of the ECR collaboration. Members of the ECR laboratory presented their work at the major events of 2016: the Teratec 2016 Forum, with its own exhibition stand; the ISC 16 conference in Frankfurt, notably on the Intel stand; and SuperComputing 2016 in Salt Lake City.
MAQAO (www.maqao.org) is a modular performance analysis tool for HPC applications. It provides very precise diagnostics
of the various performance problems (vectorization, cache, parallelism) within an application and gives the user
a synthesis to help them select the most profitable optimizations.
In 2016, work on MAQAO began with the enhancement of its behavior analysis capabilities. At the level of basic analysis tools,
two modules were added: Microbench, which quickly calibrates the performance of a memory hierarchy and thus helps to understand
its behavior, and UFS, which simulates the internal behavior of the core in order to analyze the impact of
memory access latency and bandwidth on performance.
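To make concrete the kind of question such an analysis addresses, here is a minimal sketch of a first-order cost model in which a memory transfer costs a fixed latency plus a bandwidth-limited streaming time. This is a generic textbook-style model, not the method implemented in UFS or Microbench; the function names and all numeric values are illustrative assumptions.

```python
def transfer_time_ns(n_bytes: int, latency_ns: float, bandwidth_gb_s: float) -> float:
    """First-order cost of one memory transfer: latency + size / bandwidth.

    1 GB/s (decimal) equals 1 byte per nanosecond, so bytes / (GB/s)
    is directly a duration in nanoseconds.
    """
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_ns + n_bytes / bandwidth_gb_s


# Made-up parameters for two levels of a memory hierarchy (assumptions,
# not measurements of any real machine).
LEVELS = {
    "cache": {"latency_ns": 4.0, "bandwidth_gb_s": 100.0},
    "dram": {"latency_ns": 80.0, "bandwidth_gb_s": 20.0},
}


def compare(n_bytes: int) -> dict:
    """Return the modeled transfer time of n_bytes for each hierarchy level."""
    return {name: transfer_time_ns(n_bytes, **params) for name, params in LEVELS.items()}


if __name__ == "__main__":
    for size in (64, 4096, 1 << 20):
        print(size, compare(size))
```

In this model, small transfers are dominated by the latency term and large transfers by the bandwidth term, which is why both parameters need to be examined separately.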
At the level of diagnosis synthesis, MAQAO was enriched with the ONE VIEW module, which orchestrates the execution of the other MAQAO modules (time and value profiling with LProf and VProf, decremental analysis with DECAN, static analysis with CQA) depending on the desired depth of investigation, and aggregates the results into a summary report in Excel format. Finally, MAQAO's major components were updated to support the Knights Landing architecture and tested on partner applications.
In addition, the MAQAO CQA and Microbench modules are being integrated into the Intel® Advisor tool to provide predictive analysis of the performance impact of the different cache levels available on the target machine. The MAQAO team at UVSQ has been an active partner of the VI-HPS community since its creation in 2011, alongside the developers of other tools such as TAU, ScoreP, Scalasca and Vampir. In 2016, the MAQAO team participated in two major VI-HPS seminars, in Garching and at Lawrence Livermore, giving dozens of HPC code developers the chance to be trained on MAQAO and to practice with it during dedicated sessions. Attendees gave excellent feedback on the relevance and quality of the diagnostics provided by the tool.
In addition to MAQAO, in 2016 we finalized CERE (Codelet Extractor and REplayer, github.com/benchmark-subsetting/cere), which automatically decomposes a program into a set of elementary computation kernels called codelets. CERE allows each codelet to be replayed in isolation (without re-running the entire application) for different input datasets and with varying parameters (compilation options, target architecture, degree of parallelism, etc.). Codelet decomposition accelerates optimization and performance measurement in HPC or embedded applications.
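The capture-and-replay idea can be illustrated with a toy Python analogy. This is only a sketch of the concept under simplifying assumptions: it snapshots the arguments of a function and re-runs that function alone, whereas the actual CERE tool works on compiled programs. The names `Codelet`, `scaled_sum` and `application` are hypothetical and do not come from CERE.

```python
import copy
import time


class Codelet:
    """Toy stand-in for a codelet: a function plus a snapshot of its inputs."""

    def __init__(self, func):
        self.func = func
        self.snapshot = None  # (args, kwargs) captured at the first call

    def __call__(self, *args, **kwargs):
        # Capture a deep copy of the inputs the first time the codelet runs
        # inside the full application.
        if self.snapshot is None:
            self.snapshot = copy.deepcopy((args, kwargs))
        return self.func(*args, **kwargs)

    def replay(self, **overrides):
        """Re-run the codelet alone from the captured inputs.

        Keyword overrides stand in for the parameters one may want to vary.
        Returns (result, elapsed_seconds).
        """
        if self.snapshot is None:
            raise RuntimeError("codelet was never executed, nothing to replay")
        args, kwargs = copy.deepcopy(self.snapshot)
        kwargs.update(overrides)
        start = time.perf_counter()
        result = self.func(*args, **kwargs)
        return result, time.perf_counter() - start


@Codelet
def scaled_sum(values, scale=1.0):
    """A small computation kernel embedded in a larger program."""
    return sum(v * scale for v in values)


def application():
    """Stand-in for the full application that calls the kernel once."""
    data = list(range(1000))
    return scaled_sum(data, scale=2.0)


if __name__ == "__main__":
    application()                       # one full run captures the inputs
    print(scaled_sum.replay())          # replay in isolation, same inputs
    print(scaled_sum.replay(scale=3.0)) # replay with a varied parameter
```

Once the inputs are captured, the kernel can be timed repeatedly under different settings without paying for the rest of the application, which is the source of the speed-up described above.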
Programming and Execution Models
The optimization of runtime systems in a high-performance computing context is critical because they provide the link between hardware and applications. The MPC runtime (http://mpc.hpcframework.com), originally developed at CEA, was built with the objective of facilitating the development and optimization of parallel applications on clusters of multi-core and many-core machines. MPC provides unified programming models together with implementations of MPI and OpenMP, which are commonly used by parallel applications. MPC also offers an HPC ecosystem comprising a multithreaded, NUMA-aware memory allocator, user-level thread support (debugger, extended GCC compiler, ...), compiler extensions for data sharing, and more. Finally, MPC integrates with newer programming models such as Intel TBB.
The work carried out on MPC gives rise to industrial collaborations, among others with Intel, which integrates MPC-specific extensions in its compiler (automatic privatization), and with Allinea DDT, which natively supports MPC user threads. MPC has been ported to the latest generation of Intel processors (Haswell, KNL) and supports the InfiniBand (EDR, ...) and Portals (BXI) network architectures. The MPC team at CEA has been an active partner of the MPI community for 2 years. In 2016, the team participated in two plenary sessions of the MPI Forum (the MPI standardization committee) and is the focal point for thread-based MPI aspects.
The work of analyzing and optimizing HPC applications from industrial or academic partners has been carried out alongside, and often hand in hand with, the development of tools. The application portfolio covers combustion, turbulence, materials and astrophysics. Based on a strongly collaborative approach between the code developers and the laboratory, the aim is to pool expertise in order to optimize performance and to support the modernization of codes so that they can tackle the challenges of massive parallelism.
As an active member of the AbInit community, CEA initiated fundamental work on this code with the Intel team, mainly to explore the possibility of developing an abstraction layer that makes it easier for scientists to target specific architectures as efficiently as possible.
POLARIS®
Developments made in 2016 on the POLARIS® code made it possible to carry out the first simulation (over 24 hours) of a virus capsid (the assembly of proteins encapsulating the genome of a virus) in aqueous solution. This simulation was made possible by a coarse-grained approach for modeling the solvent, sophisticated methods to take into account polarization phenomena in inter-atomic interactions, and a Fast Multipole Method (FMM) to take full advantage of parallel computing architectures. The simulated system comprises 0.5 million atoms (capsid) and 2 million coarse-grained particles for the solvent (water). The FMM was parametrized in high-precision mode (force error less than 1%) and the atomic equations of motion were solved at the femtosecond scale. The simulation was performed on 1792 cores of the CCRT COBALT machine. The simulation "speed" was on the order of 0.75 ns per day. The POLARIS® code is the only code to perform high-precision microscopic simulations (especially regarding the treatment of inter-atomic interactions) of molecular systems with several million atoms at "speeds" on the order of a nanosecond per day. On this last point ("speed"), many improvements are still possible, making it realistic to envisage "speeds" of several nanoseconds per day. Finally, its hybrid parallelization scheme (OpenMP / MPI) makes it particularly suitable for the new generation of "manycore" systems.
The capsid of the Panicum mosaic virus (about 0.5 million atoms) in aqueous solution (2 million coarse-grained solvent particles).
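The throughput figures quoted for this simulation can be put in perspective with a short back-of-the-envelope calculation. The sketch below uses only the numbers given in the text (0.75 ns per day, 1792 cores, 0.5 million atoms plus 2 million solvent particles); the report only says the equations were solved "at the femtosecond scale", so the 1 fs time step is an assumption, and the function names are illustrative.

```python
FS_PER_NS = 1_000_000      # femtoseconds in one nanosecond
SECONDS_PER_DAY = 86_400


def steps_per_day(ns_per_day: float, timestep_fs: float) -> float:
    """Number of integration steps completed per day of wall-clock time."""
    return ns_per_day * FS_PER_NS / timestep_fs


def seconds_per_step(ns_per_day: float, timestep_fs: float) -> float:
    """Average wall-clock time spent on one integration step."""
    return SECONDS_PER_DAY / steps_per_day(ns_per_day, timestep_fs)


def particles_per_core(n_particles: int, n_cores: int) -> float:
    """Average number of particles handled by each core."""
    return n_particles / n_cores


if __name__ == "__main__":
    NS_PER_DAY = 0.75                 # from the report
    TIMESTEP_FS = 1.0                 # ASSUMPTION: "femtosecond scale" taken as 1 fs
    N_PARTICLES = 500_000 + 2_000_000 # capsid atoms + coarse-grained solvent particles
    N_CORES = 1792                    # from the report

    print("steps per day:     ", steps_per_day(NS_PER_DAY, TIMESTEP_FS))
    print("seconds per step:  ", seconds_per_step(NS_PER_DAY, TIMESTEP_FS))
    print("particles per core:", particles_per_core(N_PARTICLES, N_CORES))
```

Under the 1 fs assumption, 0.75 ns per day corresponds to 750,000 time steps per day, or roughly 0.115 s of wall-clock time per step, with about 1,395 particles per core on average.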
European Projects
Since 2016, CEA and Intel have been working on a joint project to demonstrate the use of Data Analysis and Machine Learning for fault detection and preventive maintenance in HPC data centers.
Participation in European Framework Programs:
- Intel and UVSQ were partners of the EXA2CT project, which ended in 2016. www.exa2ct.eu
- Since 2015, Intel has been a member of the READEX consortium led by TU Dresden. The purpose of the READEX project is to improve the energy efficiency of HPC applications through dynamic autotuning, enabling users to automatically exploit the dynamic behavior of their applications by adjusting the system to the actual resource requirements. www.readex.eu