The ML Shaping Guide
Shaping Machine Learning Projects That Actually Work
Have you ever launched an ML project with high hopes, only to watch it spiral into an endless research rabbit hole? Or perhaps you’ve seen teams deliver technically impressive models that somehow miss the mark on actual business needs? You’re not alone.
As managers of ML teams, we face a unique challenge: How do we structure machine learning work to balance scientific exploration with engineering pragmatism and business results?
Enter “Shape Up for ML” – an adaptation of Basecamp’s popular Shape Up methodology that’s been tailored to the specific needs of machine learning projects. Let me walk you through how this approach can transform the way your team works.
What is Shape Up, Anyway?
If you haven’t encountered it before, Shape Up is a project management methodology created by Ryan Singer at 37Signals. At its core, it’s about properly “shaping” work before handing it off to a team. As Singer puts it:
“Shaping isn’t writing. It’s not filling out a template or creating a document. It’s getting to that ‘a-ha’ moment together where the parts crystalize and we have something that will work.”
The traditional Shape Up approach uses concepts like:
- Appetite: How much time are we willing to spend? (Usually 6 weeks)
- Boundaries: Explicit decisions about what’s in and out of scope
- Breadboarding: A high-level sketch of system components
- Fat marker sketches: Simple visual representations of the solution
- Rabbit holes: Known areas that could consume unlimited time if not managed
It works beautifully for web development, but machine learning projects introduce a whole new level of complexity and uncertainty.
Why ML Projects Need a Different Approach
Machine learning isn’t just web development with fancy math. Here’s why traditional project management methods often fall flat with ML initiatives:
1. Data is the foundation, not an afterthought
In web development, you can often start building with placeholder data and replace it later. In ML, your data is your project. Without clearly defined data requirements, preparation plans, and an understanding of limitations, your project is built on sand.
Unlike web projects where features can be clearly specified upfront, ML projects depend on what patterns exist in your data – and you might not know what’s possible until you explore.
2. Uncertainty is the only certainty
When a web developer builds a button, there’s little doubt that the button will appear. When an ML engineer trains a model, there’s no guarantee it will reach the desired accuracy – or that the accuracy will translate to real-world performance.
This fundamental uncertainty means ML projects need explicit:
- Fallback approaches when primary methods don’t pan out
- Clear exit criteria to avoid endless optimization
- Minimum viable performance thresholds
- Experiment plans rather than rigid implementation roadmaps
3. Success metrics look different
Web projects often focus on user experience metrics like completion rates or engagement. ML projects have two parallel tracks of metrics:
- Technical metrics: Loss functions, precision/recall, F1 scores, etc.
- User-facing metrics: The actual impact on the product and business
Shaping ML projects requires defining both sets of metrics and establishing the relationship between them. What technical threshold translates to a meaningful user experience improvement?
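One lightweight way to force that conversation is to write the pairing down next to the thresholds before the bet starts. Here’s a minimal sketch of what I mean; the metric names, numbers, and claims below are purely hypothetical:

```python
# A sketch of pairing each technical threshold with the user-facing claim it
# supports. All names and numbers here are hypothetical placeholders.
SHIP_CRITERIA = {
    # metric name: (threshold, direction, user-facing claim it backs)
    "f1_score":      (0.75, "min", "alerts are trustworthy enough that the team acts on them"),
    "latency_p95_s": (3.00, "max", "results arrive fast enough to fit the existing workflow"),
}

def ship_decision(measured):
    """Return True only if every technical threshold backing a user-facing claim is met."""
    for name, (threshold, direction, _claim) in SHIP_CRITERIA.items():
        value = measured.get(name)
        if value is None:
            return False
        if direction == "min" and value < threshold:
            return False
        if direction == "max" and value > threshold:
            return False
    return True

# ship_decision({"f1_score": 0.78, "latency_p95_s": 2.4})  -> True
```

The point isn’t the code; it’s that the mapping from model numbers to business claims is written down before the bet starts, so the ship/no-ship call isn’t made on vibes.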
4. The research-engineering balance
ML projects exist on a spectrum from pure research to straightforward engineering implementation. A well-shaped ML project explicitly acknowledges where on this spectrum it falls and establishes guardrails to prevent research projects from becoming open-ended or engineering projects from ignoring necessary exploration.
ML Project Shaping in Action
Let’s look at three examples that contrast poorly-shaped versus well-shaped ML projects to illustrate these principles.
Example 1: Predictive Maintenance System
🔴 Poorly Shaped Version
Overall Goals:
- Build a system to predict equipment failures
- Should be highly accurate and reduce maintenance costs
Data:
- We’ll use sensor data from our machines
- Maintenance logs will be helpful for training
Modeling Approach:
- Use a deep learning model, possibly LSTM or Transformer
- Could also try traditional machine learning algorithms
Metrics:
- Model should accurately predict failures
- Aim for high precision and recall
Rabbit Holes & No-Go’s:
- Try to keep the project scope manageable
This vague, poorly-bounded project would likely lead to endless exploration, unclear priorities, and uncertain outcomes. Now let’s look at a well-shaped version:
🟢 Well-Shaped Version
Overall Goals:
- Create a system that predicts potential equipment failures 48 hours before they occur for our top 5 most critical machine types
- Target: 80% of critical failures predicted with fewer than 15% false alarms
- Must integrate with existing maintenance workflow systems
- Out of scope: Prescriptive maintenance recommendations, non-critical equipment
Data:
- 3 years of historical sensor data from 200 machines (already available in data warehouse)
- Maintenance logs for all documented failures and repairs (needs cleaning)
- Environmental data from factory monitoring systems
- Known limitation: Incomplete labeling of subtle degradation issues vs. outright failures
Modeling Approach:
- Primary: Time-series anomaly detection with transformer architecture for multivariate signals
- Backup approach: Random Forest classifier on engineered features if deep learning approach doesn’t meet performance targets
- Compute needs: Training on cloud instances, inference on edge devices near equipment
Metrics:
- Technical: F1 score > 0.75, mean detection lead time of at least 48 hours before failure
- User-facing: Maintenance team acknowledges alerts within 30 minutes, < 10% alert fatigue reported
- Minimum viable: Will ship with 3 machine types if performance targets met
Rabbit Holes & No-Go’s:
- NO attempt to predict exact time-to-failure in this bet (binary classification only)
- NO unsupervised approaches in this version
- Time sink warning: Feature engineering could consume unlimited time - strict 2-week limit
Technical Breadboarding (see the sketch after this list):
- Data pipeline pulls sensor data every 15 minutes
- Pre-processing module normalizes and segments data
- Model inference runs hourly on rolling window
- Alerts delivered to maintenance system via existing API
- Feedback loop captures false positives/negatives
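A minimal sketch of how that breadboard might hang together as an hourly job is below. Every callable and constant in it is a placeholder for the team’s own components, not a real API:

```python
# Illustrative skeleton of the hourly inference job from the breadboard above.
# The callables passed in (fetch_window, preprocess, predict, send_alert,
# track_feedback) stand in for the team's own components, not real APIs.
from datetime import timedelta

WINDOW = timedelta(hours=24)      # rolling window of 15-minute sensor readings
ALERT_THRESHOLD = 0.7             # hypothetical risk-score cut-off

def run_hourly_inference(machine_id, fetch_window, preprocess, predict,
                         send_alert, track_feedback):
    raw = fetch_window(machine_id, WINDOW)        # data pipeline pull
    features = preprocess(raw)                    # normalize and segment
    risk = predict(features)                      # model inference on the rolling window
    if risk >= ALERT_THRESHOLD:
        alert_id = send_alert(machine_id, risk)   # deliver via the existing maintenance API
        track_feedback(alert_id)                  # feedback loop for false positives/negatives
    return risk
```

Even at this level of detail, the shape makes it obvious where the integration work lives (the maintenance API) and that the model is just one box among several.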
Experiment Plan:
- Week 1: Data preparation and baseline models
- Week 2-3: Model development and selection
- Week 4: Integration with alerting system
- Week 5-6: Testing and performance tuning
- Exit criteria: If Week 3 models don’t reach 60% F1 score, pivot to simpler models
Notice how this example explicitly addresses data requirements, bounds the exploration, establishes clear metrics for both technical performance and business impact, and includes a concrete plan for experimentation.
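The week-3 exit criterion in the experiment plan is also worth encoding before the bet starts, so nobody has to relitigate it under deadline pressure. A minimal sketch, assuming binary failure labels and predictions on a held-out validation set (scikit-learn’s f1_score is the only real dependency):

```python
# Week-3 checkpoint from the experiment plan: pivot to simpler models if the
# deep learning approach hasn't reached an F1 of 0.60 on held-out data.
from sklearn.metrics import f1_score

PIVOT_F1 = 0.60   # agreed before the bet started
SHIP_F1 = 0.75    # the target for the end of the cycle

def week3_checkpoint(y_true, y_pred):
    """Return the pre-agreed decision for the week-3 review."""
    f1 = f1_score(y_true, y_pred)
    if f1 < PIVOT_F1:
        return f"pivot: F1={f1:.2f} < {PIVOT_F1}, switch to the Random Forest backup"
    if f1 < SHIP_F1:
        return f"continue: F1={f1:.2f}, keep tuning toward {SHIP_F1}"
    return f"on track: F1={f1:.2f} already meets the ship threshold"
```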
Example 2: Automated Document Processing
🔴 Poorly Shaped Version
Overall Goals:
- Create a system to extract information from business documents
- Should work for various document types
Data:
- We have lots of documents in our system
- Will need to label some for training
Modeling Approach:
- Use OCR and NLP techniques
- Maybe try some recent document AI approaches
Metrics:
- High accuracy is important
- System should be reasonably fast
Rabbit Holes & No-Go’s:
- Don’t get bogged down in too many document types
This approach leaves too many critical questions unanswered and would likely result in scope creep, unclear priorities, and potential project failure. Now for a well-shaped alternative:
🟢 Well-Shaped Version
Overall Goals:
- Build system to automatically extract key fields from invoices
- Target: 15 critical fields (invoice #, date, amount, line items, vendor details)
- Must achieve 95% extraction accuracy on standard invoices, 80% on non-standard
- Success: Reduce manual data entry time by 70%
Data:
- 10,000 processed invoices from our document management system (available)
- 2,000 manually labeled examples with field positions (in progress, will be complete before bet starts)
- Known limitation: Training data skews toward our top 20 vendors
Modeling Approach:
- Primary: Multi-stage pipeline with document layout analysis followed by field-specific extraction models
- LayoutLM pre-trained model for document understanding
- Custom NER models for field extraction
- Rule-based validation layer for consistency checks
Metrics:
- Technical: Field-level accuracy > 95% for standard formats, > 80% for non-standard
- Processing time < 3 seconds per document
- User-facing: Reduction in processing time from 4 minutes to < 1 minute per invoice
- Confidence scoring to route uncertain documents for human review
Rabbit Holes & No-Go’s:
- NO handwritten invoice support in this version
- NO attempting to extract custom/rare fields beyond the defined 15
- Exit criteria: If accuracy on non-standard invoices < 70% by week 4, reduce scope to standard invoices only
Technical Breadboarding:
- Document ingestion API accepts PDF/TIFF/JPG
- Preprocessing module for normalization and OCR
- Multi-model pipeline for extraction
- Validation layer checks for consistency
- Human-in-the-loop interface for low-confidence results
Experiment Plan:
- Week 1: Data preparation and baseline model training
- Week 2: Layout analysis model development
- Week 3: Field extraction model development
- Week 4-5: Integration and system building
- Week 6: Testing and threshold tuning
- Learning goal: Determine if one generic model or vendor-specific models perform better
This example shows how a document processing system requires clear decisions about which document types and fields are in scope, establishes concrete accuracy thresholds, and includes a fallback plan if the more ambitious goals can’t be met.
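The confidence-scoring and rule-based validation pieces of this pitch are worth a quick sketch too. The field names and thresholds below are hypothetical; the only assumption is that the extraction pipeline returns a value and a confidence per field:

```python
# Hypothetical sketch of the validation and routing layer: cheap consistency
# checks first, then a confidence threshold to decide on human review.
REVIEW_THRESHOLD = 0.90   # below this, the invoice goes to a person

def needs_human_review(fields):
    """fields maps a field name to {"value": ..., "confidence": float}."""
    # Rule-based consistency check: line items should sum to the invoice total.
    try:
        line_total = sum(item["amount"] for item in fields["line_items"]["value"])
        if abs(line_total - fields["total_amount"]["value"]) > 0.01:
            return True
    except (KeyError, TypeError):
        return True   # missing or malformed critical fields always get reviewed

    # Confidence routing: any low-confidence field sends the document for review.
    return any(f["confidence"] < REVIEW_THRESHOLD for f in fields.values())
```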
Example 3: Content Recommendation Engine
🔴 Poorly Shaped Version
Overall Goals:
- Build a recommendation system to improve user engagement
- Better than our current basic recommendations
Data:
- User interaction history
- Content metadata
Modeling Approach:
- Use collaborative filtering or neural networks
- Incorporate content features for better recommendations
Metrics:
- Increase click-through rates
- Improve user satisfaction
Rabbit Holes & No-Go’s:
- Focus on a manageable implementation timeline
This vague approach lacks specificity in almost every dimension and would likely lead to endless iterations without clear criteria for completion. Here’s how a well-shaped version looks:
🟢 Well-Shaped Version
Overall Goals:
- Build personalized recommendation system for our content library
- Target: 30% increase in average content consumption per user
- 6-week appetite focuses on algorithm development and testing, not full production deployment
- Out of scope: Real-time updates, cold start handling (will use defaults for new users)
Data:
- User interaction history (views, likes, time spent) - 18 months of data available
- Content metadata including tags, categories, and engagement metrics
- User profile information where available (optional)
- Known limitation: Sparse data for niche content categories
Modeling Approach:
- Primary: Two-tower neural network with separate encoders for users and content
- Collaborative filtering baseline as fallback approach
- Hybrid approach combining content-based and collaborative signals
- Test matrix factorization vs. deep learning approaches
Metrics:
- Technical: nDCG@10 > 0.65 (see the sketch after this list), diversity score > 0.4 (a measure of recommendation variety)
- A/B test metrics: CTR increase > 15%, average session duration increase > 20%
- Minimum viable: ship at a 15% CTR improvement, provided we also achieve lower latency than the current system
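The nDCG@10 target above is cheap to check offline during the bet. Here’s a hedged sketch, assuming binary relevance labels (1 if the user later engaged with the item, 0 otherwise):

```python
# Minimal nDCG@10 check against binary relevance labels. The labels and the
# ranking here come from the team's own offline evaluation set.
import numpy as np

def ndcg_at_k(ranked_relevance, k=10):
    """ranked_relevance: relevance labels for the full candidate list, in the
    order the model ranked them, so the ideal ordering can be computed."""
    rel = np.asarray(ranked_relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    top = rel[:k]
    dcg = float((top * discounts[:top.size]).sum())
    ideal = np.sort(rel)[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# ndcg_at_k([1, 0, 0, 1, 0], k=10): the two relevant items land first and
# fourth, which scores roughly 0.88.
```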
Rabbit Holes & No-Go’s:
- NO building full feature store in this bet (use existing data pipeline)
- NO attempting to incorporate real-time behavior in this version
- Time sink warning: Feature engineering could consume unlimited time - strict 1-week timebox
- Exit criteria: If two-tower approach doesn’t outperform baseline by 10% at week 3, pivot to enhanced collaborative filtering
Technical Breadboarding:
- Batch processing pipeline for user embeddings
- Content embedding generation from metadata and interaction patterns
- Similarity computation service
- Caching layer for recommendation sets
- A/B testing framework integration
Experiment Plan:
- Week 1: Data preparation and baseline model
- Week 2: User embedding model development
- Week 3: Content embedding and matching approach
- Week 4: Integration and initial A/B test setup
- Week 5-6: A/B testing and refinement
- Fallback: If neural approach doesn’t converge, use matrix factorization with content features
This example demonstrates how recommendation systems, often seen as requiring extensive research, can be bounded with clear success criteria, experimental timelines, and decisions about what aspects (like cold start problems) are explicitly out of scope for the initial bet.
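For anyone who hasn’t seen the two-tower pattern mentioned in the pitch, here’s a minimal PyTorch sketch. The layer sizes and embedding dimension are placeholders, not recommendations:

```python
# Minimal two-tower sketch: separate encoders map users and items into the same
# vector space, and a dot product scores how well a pair matches. All sizes are
# placeholders.
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, n_features, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, x):
        # L2-normalize so the dot product behaves like cosine similarity
        return F.normalize(self.net(x), dim=-1)

class TwoTowerModel(nn.Module):
    def __init__(self, user_features, item_features, embed_dim=64):
        super().__init__()
        self.user_tower = Tower(user_features, embed_dim)
        self.item_tower = Tower(item_features, embed_dim)

    def forward(self, user_x, item_x):
        u = self.user_tower(user_x)       # (batch, embed_dim)
        v = self.item_tower(item_x)       # (batch, embed_dim)
        return (u * v).sum(dim=-1)        # one matching score per user-item pair
```

In the breadboard above, the item tower’s embeddings are the ones you’d precompute in batch and cache; only the user embedding needs to be fresh at request time.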
Key Takeaways for ML Team Managers
The contrast between the poorly-shaped and well-shaped examples is stark. In the poorly-shaped versions, we see vague goals, undefined data requirements, ambiguous modeling approaches, and fuzzy success criteria. These projects are set up for scope creep, missed deadlines, and disappointing results.
The well-shaped versions, on the other hand, provide clear boundaries, explicit metrics, defined data requirements, and concrete fallback plans that acknowledge the inherent uncertainty in ML work.
So what can we learn from these examples? Here are the key principles of effective ML project shaping:
- Make data requirements explicit. Define exactly what data you have, what you need, and what steps are required to prepare it. Data is your foundation.
- Establish minimum viable performance. What’s the minimum technical performance that delivers business value? This is your ship/no-ship threshold.
- Identify explicit rabbit holes. Call out the areas where your team could spend unlimited time optimizing and set clear time boundaries.
- Plan for uncertainty. Always have a fallback approach when your primary method doesn’t achieve the desired results.
- Define exit criteria. At what point do you pivot from your primary approach? Make this decision explicit before the project starts.
- Set dual success metrics. Define both technical metrics (for the model) and user-facing metrics (for the business impact).
- Shape the system, not just the model. ML projects aren’t just about algorithms - they need data pipelines, monitoring, integration points, and more.
The magic of well-shaped ML work is that it provides structure without stifling exploration. It acknowledges the inherent uncertainty in machine learning while ensuring that this uncertainty doesn’t lead to open-ended research projects without business impact.
As Ryan Singer notes about Shape Up: “The shaped work itself looks like a mess. But the people who were there know exactly what it means.” Your goal is to create that shared understanding among the team, with enough clarity that the work can proceed with confidence within your appetite.
So the next time you’re planning an ML project, remember that it’s not just about the model - it’s about creating a framework that balances exploration and delivery, research and engineering, possibility and pragmatism.
That’s the art of shaping ML projects that actually work.
P.S. This post represents my current thinking on ML project shaping, but I want to be transparent that we’re still early in our journey with this methodology. As our team experiments with these approaches, I expect our processes and frameworks to evolve significantly. I’m sharing these thoughts not as definitive answers, but as a way to refine my own understanding and hopefully spark conversations with others facing similar challenges. The beauty of adapting methodologies like Shape Up to ML work is that it requires constant learning and adjustment. I’ll be updating these ideas as we learn more, encounter new obstacles, and discover what truly works for our team. If you have insights from your own experiences with ML project management, I’d be particularly grateful to hear them as we continue to develop this approach.