Senior Systems Engineer (HPC – High Performance Computing)

Location: Silver Spring, MD
Date Posted: 09-14-2017 #9921535
 
  • Rebuild/upgrade the existing HPC xCAT server. Provide planning and procedural guidance for the following subtasks:
      • Rebuild/upgrade the existing xCAT server from RHEL/CentOS 6.3 to RHEL/CentOS 7.2.
      • Upgrade xCAT from version 2.8 to version 2.12, or the most current version at time of service.
 
  • Deploy backup cluster capability. Provide planning and procedural guidance for the following subtasks:
      • Review the RHEL operating system level and configuration, and direct installation of RHEL 7.2 if required.
      • Review the GPFS file system software installation, and direct an upgrade to 4.2.1 if required.
      • Review filesystem configurations and direct settings as required for best practices.
      • Direct configuration of multi-cluster mounts if required.
      • Verify GPFS installation and connectivity between the current production GPFS cluster and the backup GPFS cluster, and direct tuning as required.
 
  • Transfer data from the current GSS to the backup GPFS cluster, and back after rebuilding. Provide efficient and soundly engineered methods to execute the backup process. Subtasks include:
      • Provide guidance and scripting to facilitate data transfer.
      • Validate GSS backup and restore on the rebuilt GSS.
 
  • Rebuild CDRH HPC Production GSS Storage. Provide planning, configuration, and procedural guidance to reinstall the storage system. Subtasks include:
      • Plan storage deployment based on best engineering practices for a high performance computing environment.
      • Obtain compliant GSS update packages for the upgrade.
      • Direct installation of GSS firmware and software at supported OS levels, per the best-practice recipe provided by IBM.
      • Guide remediation of hardware issues as they arise.
 
  • Guide deployment of networking and xCAT for a new rack of SuperMicro compute nodes, and guide integration of the new rack. Subtasks include:
      • Configure the xCAT service node to support the SuperMicro compute nodes in the new rack.
      • Install and integrate additional nodes as necessary.
      • Review Ethernet networking and assist with switch configurations for the rack uplinks.
 
Deliverables
1. Rebuild the HPC xCAT server to CentOS 7.2 and xCAT 2.12.
2. Deploy backup cluster capability, with an up-to-date RHEL/CentOS 7.2 operating system and GPFS file system software version 4.2.1. Document filesystem configurations and settings.
3. Provide backup scripts and associated guidance for performing the backup.
4. Provide the GSS rebuild report, including final configuration, software packages and release levels, and documentation of hardware/firmware issues that arose and their resolution.
5. Debug deployment of the SuperMicro compute node rack and its networking to the GSS storage. Document the correct xCAT service node configuration and networking via the Mellanox core HPC Ethernet switch.
 