The ML Shaping Guide
Shaping Machine Learning Projects That Actually Work
Have you ever launched an ML project with high hopes, only to watch it spiral into an endless research rabbit hole? Or perhaps you’ve seen teams deliver technically impressive models that somehow miss the mark on actual business needs? You’re not alone.
As managers of ML teams, we face a unique challenge: How do we structure machine learning work to balance scientific exploration with engineering pragmatism and business results?
Enter “Shape Up for ML” – an adaptation of Basecamp’s popular Shape Up methodology that’s been tailored to the specific needs of machine learning projects. Let me walk you through how this approach can transform the way your team works.
What is Shape Up, Anyway?
If you haven’t encountered it before, Shape Up is a project management methodology created by Ryan Singer at 37Signals. At its core, it’s about properly “shaping” work before handing it off to a team. As Singer puts it:
“Shaping isn’t writing. It’s not filling out a template or creating a document. It’s getting to that ‘a-ha’ moment together where the parts crystalize and we have something that will work.”
The traditional Shape Up approach uses concepts like:
- Appetite: How much time are we willing to spend? (Usually 6 weeks)
- Boundaries: Explicit decisions about what’s in and out of scope
- Breadboarding: A high-level sketch of system components
- Fat marker sketches: Simple visual representations of the solution
- Rabbit holes: Known areas that could consume unlimited time if not managed
It works beautifully for web development, but machine learning projects introduce a whole new level of complexity and uncertainty.
Why ML Projects Need a Different Approach
Machine learning isn’t just web development with fancy math. Here’s why traditional project management methods often fall flat with ML initiatives:
1. Data is the foundation, not an afterthought
In web development, you can often start building with placeholder data and replace it later. In ML, your data is your project. Without clearly defined data requirements, preparation plans, and an understanding of limitations, your project is built on sand.
Unlike web projects where features can be clearly specified upfront, ML projects depend on what patterns exist in your data – and you might not know what’s possible until you explore.
2. Uncertainty is the only certainty
When a web developer builds a button, there’s little doubt that the button will appear. When an ML engineer trains a model, there’s no guarantee it will reach the desired accuracy – or that the accuracy will translate to real-world performance.
This fundamental uncertainty means ML projects need explicit:
- Fallback approaches when primary methods don’t pan out
- Clear exit criteria to avoid endless optimization
- Minimum viable performance thresholds
- Experiment plans rather than rigid implementation roadmaps
3. Success metrics look different
Web projects often focus on user experience metrics like completion rates or engagement. ML projects have two parallel tracks of metrics:
- Technical metrics: Loss functions, precision/recall, F1 scores, etc.
- User-facing metrics: The actual impact on the product and business
Shaping ML projects requires defining both sets of metrics and establishing the relationship between them. What technical threshold translates to a meaningful user experience improvement?
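One lightweight way to force that conversation is to write the pairing down next to the thresholds before the bet starts. Here’s a minimal sketch of what I mean; the metric names, numbers, and claims below are purely hypothetical:

```python
# A sketch of pairing each technical threshold with the user-facing claim it
# supports. All names and numbers here are hypothetical placeholders.
SHIP_CRITERIA = {
    # metric name: (threshold, direction, user-facing claim it backs)
    "f1_score":      (0.75, "min", "alerts are trustworthy enough that the team acts on them"),
    "latency_p95_s": (3.00, "max", "results arrive fast enough to fit the existing workflow"),
}

def ship_decision(measured):
    """Return True only if every technical threshold backing a user-facing claim is met."""
    for name, (threshold, direction, _claim) in SHIP_CRITERIA.items():
        value = measured.get(name)
        if value is None:
            return False
        if direction == "min" and value < threshold:
            return False
        if direction == "max" and value > threshold:
            return False
    return True

# ship_decision({"f1_score": 0.78, "latency_p95_s": 2.4})  -> True
```

The point isn’t the code; it’s that the mapping from model numbers to business claims is written down before the bet starts, so the ship/no-ship call isn’t made on vibes.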
4. The research-engineering balance
ML projects exist on a spectrum from pure research to straightforward engineering implementation. A well-shaped ML project explicitly acknowledges where on this spectrum it falls and establishes guardrails to prevent research projects from becoming open-ended or engineering projects from ignoring necessary exploration.
ML Project Shaping in Action
Let’s look at three examples that contrast poorly-shaped versus well-shaped ML projects to illustrate these principles.
Example 1: Predictive Maintenance System
🔴 Poorly Shaped Version
Overall Goals:
- Build a system to predict equipment failures
- Should be highly accurate and reduce maintenance costs
Data:
- We’ll use sensor data from our machines
- Maintenance logs will be helpful for training
Modeling Approach:
- Use a deep learning model, possibly LSTM or Transformer
- Could also try traditional machine learning algorithms
Metrics:
- Model should accurately predict failures
- Aim for high precision and recall
Rabbit Holes & No-Go’s:
- Try to keep the project scope manageable
This vague, poorly-bounded project would likely lead to endless exploration, unclear priorities, and uncertain outcomes. Now let’s look at a well-shaped version:
🟢 Well-Shaped Version
Overall Goals:
- Create a system that predicts potential equipment failures 48 hours before they occur for our top 5 most critical machine types
- Target: 80% of critical failures predicted with fewer than 15% false alarms
- Must integrate with existing maintenance workflow systems
- Out of scope: Prescriptive maintenance recommendations, non-critical equipment
Data:
- 3 years of historical sensor data from 200 machines (already available in data warehouse)
- Maintenance logs for all documented failures and repairs (needs cleaning)
- Environmental data from factory monitoring systems
- Known limitation: Incomplete labeling of subtle degradation issues vs. outright failures
Modeling Approach:
- Primary: Time-series anomaly detection with transformer architecture for multivariate signals
- Backup approach: Random Forest classifier on engineered features if deep learning approach doesn’t meet performance targets
- Compute needs: Training on cloud instances, inference on edge devices near equipment
Metrics:
- Technical: F1 score > 0.75, mean detection lead time of at least 48 hours before failure
- User-facing: Maintenance team acknowledges alerts within 30 minutes, < 10% alert fatigue reported
- Minimum viable: Will ship with 3 machine types if performance targets met
Rabbit Holes & No-Go’s:
- NO attempt to predict exact time-to-failure in this bet (binary classification only)
- NO unsupervised approaches in this version
- Time sink warning: Feature engineering could consume unlimited time - strict 2-week limit
Technical Breadboarding (see the sketch after this list):
- Data pipeline pulls sensor data every 15 minutes
- Pre-processing module normalizes and segments data
- Model inference runs hourly on rolling window
- Alerts delivered to maintenance system via existing API
- Feedback loop captures false positives/negatives
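A minimal sketch of how that breadboard might hang together as an hourly job is below. Every callable and constant in it is a placeholder for the team’s own components, not a real API:

```python
# Illustrative skeleton of the hourly inference job from the breadboard above.
# The callables passed in (fetch_window, preprocess, predict, send_alert,
# track_feedback) stand in for the team's own components, not real APIs.
from datetime import timedelta

WINDOW = timedelta(hours=24)      # rolling window of 15-minute sensor readings
ALERT_THRESHOLD = 0.7             # hypothetical risk-score cut-off

def run_hourly_inference(machine_id, fetch_window, preprocess, predict,
                         send_alert, track_feedback):
    raw = fetch_window(machine_id, WINDOW)        # data pipeline pull
    features = preprocess(raw)                    # normalize and segment
    risk = predict(features)                      # model inference on the rolling window
    if risk >= ALERT_THRESHOLD:
        alert_id = send_alert(machine_id, risk)   # deliver via the existing maintenance API
        track_feedback(alert_id)                  # feedback loop for false positives/negatives
    return risk
```

Even at this level of detail, the shape makes it obvious where the integration work lives (the maintenance API) and that the model is just one box among several.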
Experiment Plan:
- Week 1: Data preparation and baseline models
- Week 2-3: Model development and selection
- Week 4: Integration with alerting system
- Week 5-6: Testing and performance tuning
- Exit criteria: If Week 3 models don’t reach 60% F1 score, pivot to simpler models
Notice how this example explicitly addresses data requirements, bounds the exploration, establishes clear metrics for both technical performance and business impact, and includes a concrete plan for experimentation.
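The week-3 exit criterion in the experiment plan is also worth encoding before the bet starts, so nobody has to relitigate it under deadline pressure. A minimal sketch, assuming binary failure labels and predictions on a held-out validation set (scikit-learn’s f1_score is the only real dependency):

```python
# Week-3 checkpoint from the experiment plan: pivot to simpler models if the
# deep learning approach hasn't reached an F1 of 0.60 on held-out data.
from sklearn.metrics import f1_score

PIVOT_F1 = 0.60   # agreed before the bet started
SHIP_F1 = 0.75    # the target for the end of the cycle

def week3_checkpoint(y_true, y_pred):
    """Return the pre-agreed decision for the week-3 review."""
    f1 = f1_score(y_true, y_pred)
    if f1 < PIVOT_F1:
        return f"pivot: F1={f1:.2f} < {PIVOT_F1}, switch to the Random Forest backup"
    if f1 < SHIP_F1:
        return f"continue: F1={f1:.2f}, keep tuning toward {SHIP_F1}"
    return f"on track: F1={f1:.2f} already meets the ship threshold"
```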
Example 2: Automated Document Processing
🔴 Poorly Shaped Version
Overall Goals:
- Create a system to extract information from business documents
- Should work for various document types
Data:
- We have lots of documents in our system
- Will need to label some for training
Modeling Approach:
- Use OCR and NLP techniques
- Maybe try some recent document AI approaches
Metrics:
- High accuracy is important
- System should be reasonably fast
Rabbit Holes & No-Go’s:
- Don’t get bogged down in too many document types
This approach leaves too many critical questions unanswered and would likely result in scope creep, unclear priorities, and potential project failure. Now for a well-shaped alternative:
🟢 Well-Shaped Version
Overall Goals:
- Build system to automatically extract key fields from invoices
- Target: 15 critical fields (invoice #, date, amount, line items, vendor details)
- Must achieve 95% extraction accuracy on standard invoices, 80% on non-standard
- Success: Reduce manual data entry time by 70%
Data:
- 10,000 processed invoices from our document management system (available)
- 2,000 manually labeled examples with field positions (in progress, will be complete before bet starts)
- Known limitation: Training data skews toward our top 20 vendors
Modeling Approach:
- Primary: Multi-stage pipeline with document layout analysis followed by field-specific extraction models
- LayoutLM pre-trained model for document understanding
- Custom NER models for field extraction
- Rule-based validation layer for consistency checks
Metrics:
- Technical: Field-level accuracy > 95% for standard formats, > 80% for non-standard
- Processing time < 3 seconds per document
- User-facing: Reduction in processing time from 4 minutes to < 1 minute per invoice
- Confidence scoring to route uncertain documents for human review
Rabbit Holes & No-Go’s:
- NO handwritten invoice support in this version
- NO attempting to extract custom/rare fields beyond the defined 15
- Exit criteria: If accuracy on non-standard invoices < 70% by week 4, reduce scope to standard invoices only
Technical Breadboarding:
- Document ingestion API accepts PDF/TIFF/JPG
- Preprocessing module for normalization and OCR
- Multi-model pipeline for extraction
- Validation layer checks for consistency
- Human-in-the-loop interface for low-confidence results
Experiment Plan:
- Week 1: Data preparation and baseline model training
- Week 2: Layout analysis model development
- Week 3: Field extraction model development
- Week 4-5: Integration and system building
- Week 6: Testing and threshold tuning
- Learning goal: Determine if one generic model or vendor-specific models perform better
This example shows how a document processing system requires clear decisions about which document types and fields are in scope, establishes concrete accuracy thresholds, and includes a fallback plan if the more ambitious goals can’t be met.
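The confidence-scoring and rule-based validation pieces of this pitch are worth a quick sketch too. The field names and thresholds below are hypothetical; the only assumption is that the extraction pipeline returns a value and a confidence per field:

```python
# Hypothetical sketch of the validation and routing layer: cheap consistency
# checks first, then a confidence threshold to decide on human review.
REVIEW_THRESHOLD = 0.90   # below this, the invoice goes to a person

def needs_human_review(fields):
    """fields maps a field name to {"value": ..., "confidence": float}."""
    # Rule-based consistency check: line items should sum to the invoice total.
    try:
        line_total = sum(item["amount"] for item in fields["line_items"]["value"])
        if abs(line_total - fields["total_amount"]["value"]) > 0.01:
            return True
    except (KeyError, TypeError):
        return True   # missing or malformed critical fields always get reviewed

    # Confidence routing: any low-confidence field sends the document for review.
    return any(f["confidence"] < REVIEW_THRESHOLD for f in fields.values())
```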
Example 3: Content Recommendation Engine
🔴 Poorly Shaped Version
Overall Goals:
- Build a recommendation system to improve user engagement
- Better than our current basic recommendations
Data:
- User interaction history
- Content metadata
Modeling Approach:
- Use collaborative filtering or neural networks
- Incorporate content features for better recommendations
Metrics:
- Increase click-through rates
- Improve user satisfaction
Rabbit Holes & No-Go’s:
- Focus on a manageable implementation timeline
This vague approach lacks specificity in almost every dimension and would likely lead to endless iterations without clear criteria for completion. Here’s how a well-shaped version looks:
🟢 Well-Shaped Version
Overall Goals:
- Build personalized recommendation system for our content library
- Target: 30% increase in average content consumption per user
- 6-week appetite focuses on algorithm development and testing, not full production deployment
- Out of scope: Real-time updates, cold start handling (will use defaults for new users)
Data:
- User interaction history (views, likes, time spent) - 18 months of data available
- Content metadata including tags, categories, and engagement metrics
- User profile information where available (optional)
- Known limitation: Sparse data for niche content categories
Modeling Approach:
- Primary: Two-tower neural network with separate encoders for users and content
- Collaborative filtering baseline as fallback approach
- Hybrid approach combining content-based and collaborative signals
- Test matrix factorization vs. deep learning approaches
Metrics:
- Technical: nDCG@10 > 0.65 (see the sketch after this list), diversity score > 0.4 (a measure of recommendation variety)
- A/B test metrics: CTR increase > 15%, average session duration increase > 20%
- Minimum viable: ship at a 15% CTR improvement, provided we also achieve lower latency than the current system
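The nDCG@10 target above is cheap to check offline during the bet. Here’s a hedged sketch, assuming binary relevance labels (1 if the user later engaged with the item, 0 otherwise):

```python
# Minimal nDCG@10 check against binary relevance labels. The labels and the
# ranking here come from the team's own offline evaluation set.
import numpy as np

def ndcg_at_k(ranked_relevance, k=10):
    """ranked_relevance: relevance labels for the full candidate list, in the
    order the model ranked them, so the ideal ordering can be computed."""
    rel = np.asarray(ranked_relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    top = rel[:k]
    dcg = float((top * discounts[:top.size]).sum())
    ideal = np.sort(rel)[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# ndcg_at_k([1, 0, 0, 1, 0], k=10): the two relevant items land first and
# fourth, which scores roughly 0.88.
```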
Rabbit Holes & No-Go’s:
- NO building full feature store in this bet (use existing data pipeline)
- NO attempting to incorporate real-time behavior in this version
- Time sink warning: Feature engineering could consume unlimited time - strict 1-week timebox
- Exit criteria: If two-tower approach doesn’t outperform baseline by 10% at week 3, pivot to enhanced collaborative filtering
Technical Breadboarding:
- Batch processing pipeline for user embeddings
- Content embedding generation from metadata and interaction patterns
- Similarity computation service
- Caching layer for recommendation sets
- A/B testing framework integration
Experiment Plan:
- Week 1: Data preparation and baseline model
- Week 2: User embedding model development
- Week 3: Content embedding and matching approach
- Week 4: Integration and initial A/B test setup
- Week 5-6: A/B testing and refinement
- Fallback: If neural approach doesn’t converge, use matrix factorization with content features
This example demonstrates how recommendation systems, often seen as requiring extensive research, can be bounded with clear success criteria, experimental timelines, and decisions about what aspects (like cold start problems) are explicitly out of scope for the initial bet.
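For anyone who hasn’t seen the two-tower pattern mentioned in the pitch, here’s a minimal PyTorch sketch. The layer sizes and embedding dimension are placeholders, not recommendations:

```python
# Minimal two-tower sketch: separate encoders map users and items into the same
# vector space, and a dot product scores how well a pair matches. All sizes are
# placeholders.
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, n_features, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so the dot product behaves like cosine similarity
        return F.normalize(self.net(x), dim=-1)

class TwoTowerModel(nn.Module):
    def __init__(self, user_features, item_features, embed_dim=64):
        super().__init__()
        self.user_tower = Tower(user_features, embed_dim)
        self.item_tower = Tower(item_features, embed_dim)

    def forward(self, user_x, item_x):
        u = self.user_tower(user_x)       # (batch, embed_dim)
        v = self.item_tower(item_x)       # (batch, embed_dim)
        return (u * v).sum(dim=-1)        # one matching score per user-item pair
```

In the breadboard above, the item tower’s embeddings are the ones you’d precompute in batch and cache; only the user embedding needs to be fresh at request time.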
Key Takeaways for ML Team Managers
The contrast between the poorly-shaped and well-shaped examples is stark. In the poorly-shaped versions, we see vague goals, undefined data requirements, ambiguous modeling approaches, and fuzzy success criteria. These projects are set up for scope creep, missed deadlines, and disappointing results.
The well-shaped versions, on the other hand, provide clear boundaries, explicit metrics, defined data requirements, and concrete fallback plans that acknowledge the inherent uncertainty in ML work.
So what can we learn from these examples? Here are the key principles of effective ML project shaping:
- Make data requirements explicit. Define exactly what data you have, what you need, and what steps are required to prepare it. Data is your foundation.
- Establish minimum viable performance. What’s the minimum technical performance that delivers business value? This is your ship/no-ship threshold.
- Identify explicit rabbit holes. Call out the areas where your team could spend unlimited time optimizing and set clear time boundaries.
- Plan for uncertainty. Always have a fallback approach when your primary method doesn’t achieve the desired results.
- Define exit criteria. At what point do you pivot from your primary approach? Make this decision explicit before the project starts.
- Set dual success metrics. Define both technical metrics (for the model) and user-facing metrics (for the business impact).
- Shape the system, not just the model. ML projects aren’t just about algorithms - they need data pipelines, monitoring, integration points, and more.
The magic of well-shaped ML work is that it provides structure without stifling exploration. It acknowledges the inherent uncertainty in machine learning while ensuring that this uncertainty doesn’t lead to open-ended research projects without business impact.
As Ryan Singer notes about Shape Up: “The shaped work itself looks like a mess. But the people who were there know exactly what it means.” Your goal is to create that shared understanding among the team, with enough clarity that the work can proceed with confidence within your appetite.
So the next time you’re planning an ML project, remember that it’s not just about the model - it’s about creating a framework that balances exploration and delivery, research and engineering, possibility and pragmatism.
That’s the art of shaping ML projects that actually work.
P.S. This post represents my current thinking on ML project shaping, but I want to be transparent that we’re still early in our journey with this methodology. As our team experiments with these approaches, I expect our processes and frameworks to evolve significantly. I’m sharing these thoughts not as definitive answers, but as a way to refine my own understanding and hopefully spark conversations with others facing similar challenges. The beauty of adapting methodologies like Shape Up to ML work is that it requires constant learning and adjustment. I’ll be updating these ideas as we learn more, encounter new obstacles, and discover what truly works for our team. If you have insights from your own experiences with ML project management, I’d be particularly grateful to hear them as we continue to develop this approach.