Logo

Critical Review: Experimental Design Issues

Critical Review: Experimental Design Issues

Purpose

This document identifies critical issues found in the experimental designs that could lead to erroneous conclusions. As emphasized: we must distinguish between genuine model limitations and experimental design flaws.


Experiment 5: Conservation Laws Discovery

Issues Identified

  1. Output Range Problem

    • Issue: Output velocities range from -121 to +128 with std ≈ 35
    • Impact: Large output variance makes learning difficult
    • Question: Is the low R² (0.28) due to model limitations or output scale?
    • Action: Should we normalize outputs or use a different loss function?
  2. Model Capacity

    • Issue: Model uses StandardScaler on inputs but outputs are not scaled
    • Impact: Ridge regression may struggle with large output ranges
    • Question: Would scaling outputs improve performance?
    • Action: Test with output scaling
  3. Brightness Parameter

    • Issue: brightness=0.001 may be too small for this problem
    • Impact: Features may be too saturated or too weak
    • Question: Has brightness been tuned for this specific problem?
    • Action: Hyperparameter search for brightness

Validation Needed

  • Test with scaled outputs (StandardScaler on y)
  • Test different brightness values
  • Compare with baseline that also has scaled outputs
  • Verify if the problem is genuinely difficult or just poorly scaled

Experiment 6: Quantum Interference

Critical Bug Found and Fixed

  1. Normalization Bug (FIXED)

    • Issue: probability = probability / np.sum(probability) * len(probability) when probability has only 1 element always gives 1.0
    • Impact: All outputs were 1.0, model learned to always predict 1.0
    • Result: R² = 1.0 was artificial - model wasn't learning anything
    • Fix: Only normalize when len(probability) > 1
    • Status: ✅ Fixed
  2. Post-Fix Results

    • Darwinian R²: 0.0225 (very poor)
    • Quantum Chaos R²: -0.0088 (worse than random)
    • Interpretation: The problem is genuinely difficult, not an artifact

Remaining Issues

  1. Problem Difficulty

    • Issue: Both models fail completely after bug fix
    • Question: Is the problem too difficult, or is there a design flaw?
    • Possible Causes:
      • The relationship is highly non-linear and complex
      • Input features may not be in the right representation
      • The cosine relationship may require explicit feature engineering
  2. Input Representation

    • Issue: Raw parameters (λ, d, L, x) may not be optimal
    • Question: Should we use derived features (phase, path difference)?
    • Action: Test with explicit phase features vs. raw inputs
  3. Output Range

    • Issue: Probabilities are in [0, 1] but may need different scaling
    • Question: Should we use log-probabilities or other transformations?
    • Action: Test different output transformations

Validation Needed

  • Test with explicit phase features as inputs
  • Test with different output transformations
  • Verify the physics simulation is correct
  • Check if the problem is learnable with more data

General Issues Across Experiments

1. Hyperparameter Tuning

Issue: Brightness and other hyperparameters are fixed across experiments

  • Experiment 1: brightness=0.001 (works well)
  • Experiment 5: brightness=0.001 (poor performance)
  • Experiment 6: brightness=0.001 (poor performance)

Question: Should brightness be tuned per experiment?

Action:

  • Document that brightness is not tuned
  • Note this as a limitation
  • Consider hyperparameter search for future experiments

2. Output Scaling

Issue: Some experiments scale outputs, others don't

  • Experiment 1: No output scaling (works)
  • Experiment 5: No output scaling (fails)
  • Experiment 6: No output scaling (fails)

Question: Is output scaling necessary for some problems?

Action: Test output scaling in failing experiments

3. Model Capacity

Issue: All experiments use same architecture (4096 features, Ridge readout)

  • May be overkill for simple problems
  • May be insufficient for complex problems

Question: Should we adapt architecture to problem complexity?

Action: Document this as a limitation

4. Data Generation Validation

Issue: Need to verify data generation is correct

  • Physics simulators may have bugs
  • Normalization may be incorrect
  • Edge cases may not be handled

Action:

  • Add validation checks to all simulators
  • Verify physical correctness
  • Test edge cases

Recommendations

Immediate Actions

  1. Fix Experiment 6: ✅ Done (normalization bug)
  2. Re-run Experiment 6: ✅ Done (now shows genuine difficulty)
  3. Test Experiment 5 with output scaling: ⏳ Pending
  4. Document all limitations: ⏳ In progress

Future Experiments

  1. Always validate data generation:

    • Check output ranges and distributions
    • Verify physical correctness
    • Test edge cases
  2. Hyperparameter documentation:

    • Document why specific values were chosen
    • Note if they were tuned or fixed
    • Acknowledge limitations
  3. Baseline comparisons:

    • Ensure baselines have same advantages (scaling, etc.)
    • Fair comparison is essential
  4. Critical review process:

    • Always question: "Is this a model limitation or design flaw?"
    • Test alternative designs
    • Document negative results honestly

Conclusion

The critical review process revealed:

  1. One critical bug in Experiment 6 (normalization bug - fixed) ✅
  2. Experiment 5 validated - low performance is genuine model limitation ✅
  3. Experiment 7 optimized - found optimal hyperparameters, improved from R²=-4.3 to R²=0.44 ✅
  4. Genuine architectural limitations identified through deep validation ✅

Experiment 5 Validation Summary

  • Output scaling tested: No improvement (R² = 0.28)
  • Hyperparameters optimized: brightness=0.001 is best
  • Baseline comparison: Polynomial achieves R² = 0.99 (problem is learnable)
  • Data validated: Physics correct (conservation errors < 1e-12)
  • More data tested: No improvement with 1600 samples

Conclusion: Experiment 5's low R² = 0.28 is a genuine model limitation, not a design flaw. The chaos model struggles with division operations compared to multiplication.

Experiment 7 Validation Summary

  • Metropolis convergence improved: 10×N → 50×N steps (better thermalization)
  • Brightness optimized: 0.001 → 0.0001 (optimal for high-dim problem)
  • Baseline comparison: Linear achieves R² = 1.0 (problem is learnable)
  • Deep validation performed:
    • Small lattice (25 spins): R² = 0.94 ✅
    • Non-linear target (M²): R² = 0.98 ✅
    • High-dim linear: R² = 0.44 ⚠️
  • Data validated: Physics correct, phase transition visible

Conclusion: Experiment 7's R² = 0.44 (after optimization) is a genuine architectural limitation with high-dimensional linear targets. The model works well with low dimensionality or non-linear targets, confirming the limitation is specific to high-dim + linear combinations.

Experiment 6 Validation Summary

  • Normalization bug fixed: Was causing artificial R² = 1.0
  • Post-fix results: Both models fail (R² < 0.03)
  • Genuine difficulty: Problem is inherently hard

Conclusion: Experiment 6's failure is genuine problem difficulty, not a design flaw.

Key Principle: Always distinguish between:

  • Model limitations: The model genuinely cannot learn the problem (Exp 5, Exp 7 partial)
  • Design flaws: The experiment is set up incorrectly (Exp 6 normalization bug - fixed)
  • Problem difficulty: The problem is inherently hard (Exp 6)
  • Hyperparameter issues: Model can work but needs tuning (Exp 7 - brightness optimized)

Critical Lesson: Deep validation revealed that:

  • Experiment 7's initial failure (R² = -4.3) was due to suboptimal hyperparameters
  • After optimization (brightness=0.0001, better Metropolis), R² improved to 0.44
  • This shows the importance of thorough hyperparameter search before concluding model limitations

We must be honest about which is which, and always optimize before concluding.

© 2025 All rights reservedBuilt with DataHub Cloud

Built with LogoDataHub Cloud