Strategic Context
Secondmind's competitive advantage was its team of around 20 elite machine learning researchers developing cutting-edge AI/ML solutions. However, the company's core R&D capability - its primary source of intellectual property - was being artificially gated by inadequate internal tooling. The research infrastructure had become the strategic bottleneck.
The Problem
Strategic Throttle on Innovation
The team of 20 elite, high-cost ML researchers was strategically throttled by unscalable, on-premises computing infrastructure, limiting the entire team to just one or two major experiments per day. This created a crippling "experiment queue", where a breakthrough idea on a Tuesday might not be testable until the weekend.
Catastrophic Failures
Experiment failures were devastating. A 12-hour job failing at the 11th hour represented a full day of lost progress and a non-trivial financial loss, burning the equivalent of a researcher's salary in pure waste each month.
Low-Value Work
Researchers were wasting a significant proportion of their time - roughly 20% - on low-value tasks like manually scheduling machine time and performing system maintenance, pulling them away from their core, high-impact work.
My Role & The Team
As Senior Product Manager, I identified this critical bottleneck and initiated the solution. I treated our internal ML researchers as my primary customers, working with:
- 20 ML researchers to understand their workflows and pain points
- The Engineering team to build the cloud-based platform
- Leadership to secure approval and budget for cloud infrastructure
- DevOps to ensure ongoing cost controls and monitoring
The Process & Key Decisions
1. Deep Discovery
I conducted in-depth interviews with the research team to diagnose the core pain points beyond the technical limitations. I focused on their workflows and the human cost of the frustration - not just the technical metrics, but how it felt to wait days for an experiment slot or lose a day's work to a hardware failure.
2. Define Guiding Policy
I defined a guiding policy for the solution centered on "unlimited, reliable access." This prioritised parallelisation (running multiple experiments simultaneously), automated recovery from failures, and robust cost controls to de-risk the move to the cloud.
3. Shift Success Metrics
Through discovery, I diagnosed that the primary constraint wasn't compute time but the high failure rate on long experiments. I ensured the system was designed for resilience with automated restarts, shifting the key metric from "hours per experiment" to "successful experiments per week."
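To make the resilience idea concrete, here is a minimal sketch of the checkpoint-and-resume pattern this design implied. The file path, epoch count, and `run_epoch` step are illustrative assumptions, not the platform's actual training code:

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoints/state.json")  # hypothetical checkpoint location
TOTAL_EPOCHS = 100


def run_epoch(epoch: int, state: dict) -> dict:
    """Placeholder for one unit of training work (an assumption, not the real job)."""
    state["last_epoch"] = epoch
    return state


def load_state() -> dict:
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"last_epoch": -1}


def save_state(state: dict) -> None:
    CHECKPOINT.parent.mkdir(parents=True, exist_ok=True)
    CHECKPOINT.write_text(json.dumps(state))


def main() -> None:
    state = load_state()
    # A restarted job picks up after the last completed epoch, so a failure
    # near the end costs minutes of recomputation rather than a full day.
    for epoch in range(state["last_epoch"] + 1, TOTAL_EPOCHS):
        state = run_epoch(epoch, state)
        save_state(state)


if __name__ == "__main__":
    main()
```

With recovery handled this way, the metric that matters is how many experiments complete successfully each week, not how many hours any single run takes.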
4. Address Leadership Concerns
I ensured the platform included proactive cost-control mechanisms to prevent unexpected expenditure - a critical concern for leadership and a key dependency for getting project approval. We built dashboards for cost monitoring and automatic shutdowns for idle resources.
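As an illustration of the kind of automatic shutdown this required, the sketch below stops cloud instances whose recent CPU usage looks idle. It assumes an AWS deployment and the boto3 SDK purely for the example, and the thresholds are placeholders; the actual platform's cost controls may have used different tooling:

```python
from datetime import datetime, timedelta

import boto3  # assumes an AWS-based setup; the cloud provider is not named in this case study

IDLE_CPU_THRESHOLD = 5.0   # average CPU % below which an instance counts as idle
IDLE_WINDOW_HOURS = 2      # how long it must stay idle before shutdown

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")


def idle_instances() -> list[str]:
    """Return running instances whose recent CPU usage looks idle."""
    idle = []
    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=datetime.utcnow() - timedelta(hours=IDLE_WINDOW_HOURS),
                EndTime=datetime.utcnow(),
                Period=3600,
                Statistics=["Average"],
            )["Datapoints"]
            if stats and all(p["Average"] < IDLE_CPU_THRESHOLD for p in stats):
                idle.append(instance_id)
    return idle


if __name__ == "__main__":
    to_stop = idle_instances()
    if to_stop:
        # Stopping (not terminating) keeps disks intact while halting compute spend.
        ec2.stop_instances(InstanceIds=to_stop)
```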
5. Build for Self-Service
I worked with engineering to design a system allowing researchers to launch multiple containerized experiments simultaneously, on demand, with automated restart capabilities for failed jobs. The goal was complete self-service with zero manual intervention required.
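A simplified sketch of what self-service parallel launches look like in practice: each experiment configuration gets its own container, and a restart policy asks the runtime to relaunch failed jobs. The image name, config files, and use of plain Docker here are illustrative assumptions rather than the platform's real interface:

```python
import subprocess

# Illustrative values only: the image, configs, and restart policy are assumptions.
IMAGE = "registry.example.com/research/experiment:latest"
CONFIGS = ["configs/lr_0.001.yaml", "configs/lr_0.01.yaml", "configs/lr_0.1.yaml"]

for config in CONFIGS:
    # Each experiment runs detached in its own container; "--restart on-failure:3"
    # asks the Docker daemon to relaunch a crashed job up to three times,
    # mirroring the platform's automated-recovery goal.
    subprocess.run(
        [
            "docker", "run", "--detach",
            "--restart", "on-failure:3",
            IMAGE,
            "python", "train.py", "--config", config,
        ],
        check=True,
    )
```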
The Solution
We delivered a comprehensive cloud-based ML experimentation platform:
- Parallel experimentation - launch multiple experiments simultaneously on-demand
- Automated failure recovery - jobs automatically restart from checkpoints
- Containerized workflows - consistent, reproducible environments
- Cost monitoring & controls - dashboards and automatic shutdowns
- Self-service interface - researchers control their own experiments
- Elastic scaling - resources scale up and down based on demand
The Impact
- Increased potential research throughput by over 10x, allowing parallel experiments instead of queuing
- Eliminated tedious manual scheduling work and maintenance tasks, reclaiming approximately 20% of researchers' time for high-value innovation
- Eliminated the 24-hour experiment queue that had been strangling research velocity
- Automated failure recovery saved hundreds of hours of lost work from failed experiments
Strategic Transformation
The platform transformed the company's core R&D process from a high-risk, 24-hour bottleneck into a strategic asset. It acted as a force multiplier, removing technical constraints and allowing the team to focus exclusively on innovation - unlocking the full potential of the company's most valuable technical talent.
Key Learnings
Treat Internal Users as First-Class Customers
Building internal tooling requires the same customer discovery and empathy as external products. Understanding the human frustration and workflow impact was as important as the technical requirements. The researchers weren't just users - they were my customers.
Focus on the Right Metrics
The constraint wasn't compute time; it was reliability. By shifting focus to "successful experiments per week" instead of "hours per experiment", we prioritised automated recovery and built a much more valuable solution.
Internal Platforms are Force Multipliers
The highest-leverage investment isn't always in customer-facing features. Building platforms that unlock your team's productivity can multiply the impact of your most valuable talent - in this case, elite ML researchers.
Want to learn more?
Let's discuss how I build internal platforms that multiply team effectiveness.