Senior MLOps Engineer
THE AI ACCELERATOR
Most diseases are still poorly understood at a biological level. Despite decades of research, the causal mechanisms driving many conditions remain unclear, limiting our ability to identify the right targets, design the right interventions and bring the right medicines to patients.
The AI Accelerator exists to change that. Based in London and sitting within Computational Innovation (@computationalinnovation), a global organisation spanning computational biology, human genetics, data excellence and AI, the Accelerator’s mission is to build production-quality AI capabilities that deepen our understanding of disease biology and increase the probability of success.
We do this by applying neural-based methods across the biomedical data landscape to integrate heterogeneous, multimodal data sources, infer biological relationships and embed causal thinking into what we build. The goal is not just to predict but to explain and understand why disease occurs.
It could be electronic health records and medical imaging to support patient segmentation. It could be ‘omics data to identify novel therapeutic targets. It could be predicting transcriptional change for a given disease-causing variant. It could be simulating the effect of modulating a target of interest.
A core component of the AI Accelerator is AI Enablement, which provides the support framework to make our ambitions a technical reality. It could be provisioning integrated, multimodal biomedical data for model training and inference. It could be managing the lifecycle of models provided by AI Systems. It could be working with IT to ensure the right infrastructure and tooling are in place. AI Enablement ensures that the model builders can focus on the technology and that Computational Innovation’s downstream users can leverage Accelerator capabilities for real portfolio impact.
THE POSITION
We are looking for a Senior MLOps Engineer to join AI Enablement and play a central role in ensuring that the AI Accelerator’s models move from development to production reliably and keep performing. This is a hands-on operational role with real stakes. The models you deploy and manage will be used to make decisions about which indications to pursue, in which patient population and against which target. When your systems work well, science moves faster and portfolio decision-making gets better.
You will take full operational ownership of shipped models, managing deployment, monitoring, retraining and lifecycle end-to-end. You will make sure that the IT-provisioned experiment tracking and model registry systems are used effectively, that training and fine-tuning runs are consistently and correctly logged and that model artefacts are registered with full provenance from data through to prediction. You will work closely with ML engineers at model handover, reviewing documentation and signing off before accepting operational ownership.
This role is for someone who takes pride in operational excellence and who understands that the AI Accelerator models can only realise their impact on the portfolio if they are deployed and performing reliably in production.
Key Responsibilities
- Ensure experiment tracking and model registry systems are used effectively across the AI Accelerator with consistent and correct logging of training and fine-tuning runs and model artefacts registered with full provenance
- Configure, run and troubleshoot distributed training and fine-tuning jobs, ensuring efficient use of available compute and resolving job-level failures
- Participate in structured model handovers with ML engineers, reviewing and signing off documentation before accepting full operational ownership of shipped models
- Deploy, monitor and manage model serving endpoints, making technical decisions about serving configurations to meet performance requirements of downstream users
- Take full operational ownership of models in production, managing monitoring, retraining and lifecycle end-to-end
- Uphold MLOps standards and practices across the AI Accelerator, contributing to their evolution based on operational experience and keeping teams current with relevant advances in MLOps tooling
Required Qualifications
- MSc in Machine Learning, Computer Science, Software Engineering or a related technical field, or equivalent industry experience; PhD preferred
- Solid hands-on experience operating ML training and serving workflows in production environments
- Experience with distributed training frameworks such as PyTorch Distributed, DeepSpeed, FSDP or Ray Train
- Experience operating experiment tracking systems and model registry systems such as MLflow, Weights and Biases or equivalent
- Familiarity with CI/CD tooling for ML workflows, e.g. cloud-native pipeline services, GitHub Actions or equivalent
- Solid understanding of cloud infrastructure for ML (compute, storage, networking), sufficient to specify requirements clearly and diagnose infrastructure-related issues
- Awareness of large model training characteristics including memory footprint, compute scaling and parallelisation strategies
- Familiarity with infrastructure-as-code tooling such as Terraform or cloud-native equivalents
- Experience working closely with research and ML engineering teams as a platform operator
- Experience operating ML infrastructure for large foundation model training
- Familiarity with biomedical AI workloads, such as training foundation models on large-scale multimodal data
Second-round interviews will take place 28th July – 6th August.
This is a hybrid role with approximately 3 days a week in the office.
WHY THIS IS A GREAT PLACE TO WORK
Boehringer Ingelheim has been recognised as a Top Employer in the UK, demonstrating our commitment to building an exceptional workplace through strong people practices and supportive HR policies.
To learn more about why BI is a great place to work, visit:
https://www.boehringer-ingelheim.co.uk/careers/uk-careers/why-great-place-work