Principal GPU / AI System Engineer

at Oracle in Springfield, Illinois, United States

Job Description

Oracle hardware platform development engineering is seeking a highly driven GPU Platform System Engineer at the Principal Engineer level. The GPU System Engineer will work within development engineering with a small team of talented engineers who lead the development and day-to-day engineering efforts for Oracle’s rapidly growing and successful Cloud AI platforms. You will participate in platform definition, platform development oversight, in-house development, design reviews, system integration, performance testing, and characterization. You will interact closely with third-party GPU IC suppliers and partners as well as internal hardware and software development teams to help drive Oracle’s AI Cloud platform solution space. You will be a critical part of the team developing Oracle’s growing Cloud AI solutions. The team you will be joining has delivered the first and second generations of Oracle Cloud dedicated compute and AI platforms, and is working to build out the next generation of Cloud and Enterprise systems with record-breaking performance, security, and world-class quality, using the latest and greatest merchant silicon and technologies.

Department Description

The Hardware Engineering Organization, within the larger Oracle Hardware Development Organization (OHD), defines and develops the next generation of Oracle hardware platforms and solutions, upon which all of Oracle’s Cloud AI, Compute, and Storage platforms are built. These systems use leading-edge technology to deliver record-breaking performance, simplified management, security, self-monitoring, and diagnosis, as well as cost-saving efficiencies. The engineering organization defines, develops, implements, and owns the hardware designs, security lifecycle, and product from concept through development, integration, introduction to production, and last-level support in deployment.

Position Overview

Our Design Engineering organization is looking for a highly driven, capable, and dedicated Principal Engineer to join the team developing the next generation AI platform for Cloud. Our products feed solutions into the growing and successful Oracle Cloud for Compute, AI, and Storage. In this role, you will help drive definition, development, integration, debug, characterization, and tuning of existing and new Cloud AI platforms. You will participate in evaluating AI platform architectures and assist with scaling and optimization. You will support other Oracle Engineering teams as they develop visibility into the operational health of the solution, and guide these teams on the AI platforms. You will apply your solid expertise in hardware and system engineering, your familiarity with firmware, and your knowledge of GPU and AI platforms and tools in a system engineering role that guides engineering developers. Your role will include evaluating merchant silicon for our next generation AI platforms, along with the support tools needed to run these platforms effectively. You will grow to advise and guide Oracle hardware and software engineering teams, Oracle remote hardware support, and Oracle cloud teams on the use, debug, and in-service monitoring of our next generation AI platforms.

Career Level – IC4

Responsibilities

You will be responsible for, but not limited to:

+ Review and assessment of third-party merchant silicon.

+ Evaluation of system architecture and proposed implementation path analysis.

+ You will participate in platform definition and analysis.

+ Provide platform development oversight for partners.

+ Work with in-house engineering functional experts on design and reviews.

+ Support and guide system integration, performance testing and characterization.

+ Support development program managers on technical assessments & planning.

+ You will interact closely with third-party GPU IC suppliers and partners as well as internal hardware and software development teams, quality assurance, cloud orchestration, hardware and software security experts, and Oracle manufacturing teams.

+ You will document and specify design intent and design details where appropriate in collaboration with the appropriate engineering teams.

+ Participate in hardware platform security evaluations.

+ Guide internal Oracle partner teams on the support needed to scale, monitor, and successfully deploy our products to the Cloud.

+ You will assist Oracle Cloud and Support teams in root-causing potential hardware or software bugs through firsthand lab replication and debug, remote debug, and calls with the appropriate teams supporting our deployed products.

+ Work with Oracle manufacturing teams to ensure that Oracle hardware is secure, robustly evaluated, performing at peak capability, and well qualified for deployment to our Cloud customers.

What This Role Looks Like

+ Work directly with hardware design and development teams on architecture, implementation, development, deployment, and troubleshooting of AI hardware platforms. Collaboration is also expected with the wider Oracle engineering and operations functional groups as well as our external partners.

+ Develop, implement, own, and run the day-to-day execution of AI platform development, both internally and in partnership with third-party design teams. This includes reviews of design plans, schematics, and board layout, test feature definition and guidance for subsystem test, and system validation plans. Oversee system integration, system test, and qualification; define software diagnostics features; and use third-party as well as approved open-source AI platform qualification and test tools. Add to a roster of system characterization and performance testing capabilities, and support definition of in-service system monitoring and error reporting needs.

+ Collaborate closely with hardware developers, system architects, system engineers, platform firmware developers, partners, AI chip / GPU suppliers, and storage, networking, and compute experts on product development, and then with Manufacturing and external suppliers across the new product introduction process through to production. You will also serve as the last level of engineering technical support when trained cloud and support teams require guidance and help in resolving complex deployed product issues.

Required Qualifications

+ Hands-on technical experience with market-leading GPUs (or alternate AI platforms) from the hardware and platform development, test, and characterization perspectives.

+ Solid knowledge of AI / GPU platform architectures and their capabilities.

+ A strong understanding of, and experience running, firmware and system diagnostics tools using BMC firmware, UEFI/BIOS, and Linux tools. Skilled in scripting to customize tests.

+ Solid working experience with GPU supplier test code as well as open-source AI test / characterization tools.

+ Experience with the architecture, design, and implementation of modern server platforms consisting of multiple architectures and vendors, including x86 and ARM server architectures.

+ Experience with hardware development at the system, board, and FPGA level.

+ Required experience with board-level tools and the ability to review hierarchical schematics, advanced multilayer board layouts, cross-board interconnect, and end-to-end connectivity analysis.

+ Strong communication skills and the ability to clearly communicate complex technical issues across engineering disciplines, as well as to articulate issues clearly and succinctly for executives.

+ Demonstrated experience debugging and root-causing complex issues that may have a mix of hardware and software causes.

+ Experience with early-stage bring-up and power-on, platform firmware debugging, and debugging of prototype GPU and CPU complexes and memory complexes.

+ An ability to isolate a problem to the source and the required creativity & expertise to devise timely and robust solutions.

+ Experienc


Job Posting: JC262564584

Posted On: Jul 13, 2024

Updated On: Jul 31, 2024
