Strategic Context
Secondmind's competitive advantage was its team of around 20 elite machine learning researchers developing cutting-edge AI/ML solutions. However, the company's core R&D capability - its primary source of intellectual property - was being artificially gated by inadequate internal tooling. The research infrastructure had become the strategic bottleneck.
The Problem
Strategic Throttle on Innovation
The team of 20 elite, high-cost ML researchers was strategically throttled by unscalable, on-premises computing infrastructure, limiting the entire team to just one or two major experiments per day. This created a crippling "experiment queue", where a breakthrough idea on a Tuesday might not be testable until the weekend.
Catastrophic Failures
Experiment failures were devastating. A 12-hour job failing at the 11th hour represented a full day of lost progress and a non-trivial financial loss, burning the equivalent of a researcher's salary in pure waste each month.
Low-Value Work
Researchers were wasting a significant proportion of their time - roughly 20% - on low-value tasks like manually scheduling machine time and performing system maintenance, pulling them away from their core, high-impact work.
My Role & The Team
As Senior Product Manager, I identified this critical bottleneck and initiated the solution. I treated our internal ML researchers as my primary customers, working with:
- 20 ML researchers to understand their workflows and pain points
- The Engineering team to build the cloud-based platform
- Leadership to secure approval and budget for cloud infrastructure
- DevOps to ensure ongoing cost controls and monitoring
The Process & Key Decisions
1. Deep Discovery
I conducted in-depth interviews with the research team to diagnose the core pain points beyond the technical limitations. I focused on their workflows and the human cost of the frustration - not just the technical metrics, but how it felt to wait days for an experiment slot or lose a day's work to a hardware failure.
2. Define Guiding Policy
I defined a guiding policy for the solution centered on "unlimited, reliable access." This prioritised parallelisation (running multiple experiments simultaneously), automated recovery from failures, and robust cost controls to de-risk the move to the cloud.
3. Shift Success Metrics
Through discovery, I diagnosed that the primary constraint wasn't compute time but the high failure rate on long experiments. I ensured the system was designed for resilience with automated restarts, shifting the key metric from "hours per experiment" to "successful experiments per week."
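To make the resilience idea concrete, here is a minimal sketch of the checkpoint-and-resume pattern this design implied. The file path, epoch count, and `run_epoch` step are illustrative assumptions, not the platform's actual training code:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/state.json")  # hypothetical checkpoint location
TOTAL_EPOCHS = 100


def run_epoch(epoch: int, state: dict) -> dict:
    """Placeholder for one unit of training work (an assumption, not the real job)."""
    state["last_epoch"] = epoch
    return state


def load_state() -> dict:
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"last_epoch": -1}


def save_state(state: dict) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(state))


def main() -> None:
    state = load_state()
    # A restarted job picks up after the last completed epoch, so a failure
    # near the end costs minutes of recomputation rather than a full day.
    for epoch in range(state["last_epoch"] + 1, TOTAL_EPOCHS):
        state = run_epoch(epoch, state)
        save_state(state)


if __name__ == "__main__":
    main()
```

With recovery handled this way, the metric that matters is how many experiments complete successfully each week, not how many hours any single run takes.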
4. Address Leadership Concerns
I ensured the platform included proactive cost-control mechanisms to prevent unexpected expenditure - a critical concern for leadership and a key dependency for getting project approval. We built dashboards for cost monitoring and automatic shutdowns for idle resources.
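As an illustration of the kind of automatic shutdown this required, the sketch below stops cloud instances whose recent CPU usage looks idle. It assumes an AWS deployment and the boto3 SDK purely for the example, and the thresholds are placeholders; the actual platform's cost controls may have used different tooling:

```python
from datetime import datetime, timedelta

import boto3  # assumes an AWS-based setup; the cloud provider is not named in this case study

IDLE_CPU_THRESHOLD = 5.0   # average CPU % below which an instance counts as idle
IDLE_WINDOW_HOURS = 2      # how long it must stay idle before shutdown

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")


def idle_instances() -> list[str]:
    """Return running instances whose recent CPU usage looks idle."""
    idle = []
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=datetime.utcnow() - timedelta(hours=IDLE_WINDOW_HOURS),
                EndTime=datetime.utcnow(),
                Period=3600,
                Statistics=["Average"],
            )["Datapoints"]
            if stats and all(p["Average"] < IDLE_CPU_THRESHOLD for p in stats):
                idle.append(instance_id)
    return idle


if __name__ == "__main__":
    to_stop = idle_instances()
    if to_stop:
        # Stopping (not terminating) keeps disks intact while halting compute spend.
        ec2.stop_instances(InstanceIds=to_stop)
```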
5. Build for Self-Service
I worked with engineering to design a system allowing researchers to launch multiple containerized experiments simultaneously, on demand, with automated restart capabilities for failed jobs. The goal was complete self-service with zero manual intervention required.
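A simplified sketch of what self-service parallel launches look like in practice: each experiment configuration gets its own container, and a restart policy asks the runtime to relaunch failed jobs. The image name, config files, and use of plain Docker here are illustrative assumptions rather than the platform's real interface:

```python
import subprocess

# Illustrative values only: the image, configs, and restart policy are assumptions.
IMAGE = "registry.example.com/research/experiment:latest"
CONFIGS = ["configs/lr_0.001.yaml", "configs/lr_0.01.yaml", "configs/lr_0.1.yaml"]

for config in CONFIGS:
    # Each experiment runs detached in its own container; "--restart on-failure:3"
    # asks the Docker daemon to relaunch a crashed job up to three times,
    # mirroring the platform's automated-recovery goal.
    subprocess.run(
        [
            "docker", "run", "--detach",
            "--restart", "on-failure:3",
            IMAGE,
            "python", "train.py", "--config", config,
        ],
        check=True,
    )
```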
The Solution
We delivered a comprehensive cloud-based ML experimentation platform:
- Parallel experimentation - launch multiple experiments simultaneously on-demand
- Automated failure recovery - jobs automatically restart from checkpoints
- Containerized workflows - consistent, reproducible environments
- Cost monitoring & controls - dashboards and automatic shutdowns
- Self-service interface - researchers control their own experiments
- Elastic scaling - resources scale up and down based on demand
The Impact
- Increased potential research throughput by over 10x, allowing parallel experiments instead of queuing
- Eliminated tedious manual scheduling work and maintenance tasks, reclaiming approximately 20% of researchers' time for high-value innovation
- Eliminated the 24-hour experiment queue that had been strangling research velocity
- Automated failure recovery saved hundreds of hours of lost work from failed experiments
Strategic Transformation
The platform transformed the company's core R&D process from a high-risk, 24-hour bottleneck into a strategic asset. It acted as a force multiplier, removing technical constraints and allowing the team to focus exclusively on innovation - unlocking the full potential of the company's most valuable technical talent.
Key Learnings
Treat Internal Users as First-Class Customers
Building internal tooling requires the same customer discovery and empathy as external products. Understanding the human frustration and workflow impact was as important as the technical requirements. The researchers weren't just users - they were my customers.
Focus on the Right Metrics
The constraint wasn't compute time; it was reliability. By shifting focus to "successful experiments per week" instead of "hours per experiment", we prioritised automated recovery and built a much more valuable solution.
Internal Platforms are Force Multipliers
The highest-leverage investment isn't always in customer-facing features. Building platforms that unlock your team's productivity can multiply the impact of your most valuable talent - in this case, elite ML researchers.
Want to learn more?
Let's discuss how I build internal platforms that multiply team effectiveness.