
The rapid evolution of military drone technology, driven by advancements in computing, materials, and autonomous systems, has fundamentally transformed modern warfare and reconnaissance. These systems offer significant strategic advantages, including persistent surveillance, reduced risk to human pilots, and operation in hazardous environments. However, the increasing complexity and performance demands of new military drone platforms have led to a concomitant rise in their development costs. Accurate forecasting of these costs at the early stages of a program is a critical challenge for defense budget planning, resource allocation, and program feasibility studies. The inherent difficulty lies in the multitude of interrelated technical parameters that influence cost, coupled with the severe scarcity of relevant, high-fidelity historical data due to the classified nature of military drone projects. Traditional multivariate regression techniques often fail under these conditions because they require a large number of observations relative to predictors and are highly sensitive to multicollinearity—a common occurrence where technical parameters of a military drone (e.g., mass, speed, endurance) are correlated with each other. This paper explores and advocates for the application of Partial Least Squares Regression (PLSR) as a potent and reliable methodology for military drone development cost prediction, particularly suited for small-sample, high-dimensional, and collinear data environments.
The core challenge in military drone cost estimation is formulating a predictive model from a dataset that is typically characterized by: 1) a limited number of completed programs (n), 2) a relatively large set of potential cost-driving parameters (p), and 3) strong interdependencies among these parameters. When \( p \) is comparable to or even greater than \( n \), conventional Ordinary Least Squares (OLS) regression becomes unstable or even impossible to compute. Furthermore, neural network approaches like Backpropagation (BP) or Radial Basis Function (RBF) networks, while powerful, often function as “black boxes,” provide limited insight into the variable relationships, and can be prone to overfitting on small datasets without careful regularization. PLSR elegantly addresses these limitations. It is a bilinear factor model that projects both the predictor variables \( \mathbf{X} \) (e.g., technical parameters) and the response variable \( \mathbf{Y} \) (development cost) onto a new, smaller set of orthogonal components. These components, also called latent vectors, are constructed to maximize the covariance between \( \mathbf{X} \) and \( \mathbf{Y} \). Thus, PLSR combines features from principal component analysis (PCA) and multiple regression, providing a robust framework for analysis.
Theoretical Foundation of Partial Least Squares Regression (PLSR)
The PLSR algorithm seeks to find a set of latent components \( \mathbf{T} \) (scores) that model \( \mathbf{X} \) and simultaneously predict \( \mathbf{Y} \). The model can be represented as:
$$
\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{E}
$$
$$
\mathbf{Y} = \mathbf{T}\mathbf{Q}^T + \mathbf{F}
$$
where \( \mathbf{P} \) and \( \mathbf{Q} \) are the loadings for \( \mathbf{X} \) and \( \mathbf{Y} \), respectively, and \( \mathbf{E} \) and \( \mathbf{F} \) are the residual matrices. The primary goal is not to perfectly reconstruct \( \mathbf{X} \), but to find components from \( \mathbf{X} \) that are most predictive of \( \mathbf{Y} \).
Algorithmic Steps for PLSR Modeling
The standard algorithm for a single response variable \( y \) (univariate PLSR) proceeds iteratively as follows:
Step 1: Preprocessing. Center and scale both the predictor matrix \( \mathbf{X}_{n \times p} \) and the response vector \( \mathbf{y}_{n \times 1} \) to have zero mean and unit variance. This yields the standardized matrices \( \mathbf{E}_0 \) and \( \mathbf{F}_0 \).
Step 2: Iterative Extraction of Latent Components. For each component \( h = 1, 2, \ldots, m \) (a code sketch of this extraction loop follows the list):
- Compute the weight vector \( \mathbf{w}_h \): \( \mathbf{w}_h = \mathbf{E}_{h-1}^T \mathbf{F}_{h-1} / \| \mathbf{E}_{h-1}^T \mathbf{F}_{h-1} \| \). This weight vector is proportional to the covariance between the residuals of \( \mathbf{X} \) and \( \mathbf{Y} \).
- Calculate the score vector (latent component) \( \mathbf{t}_h \): \( \mathbf{t}_h = \mathbf{E}_{h-1} \mathbf{w}_h \).
- Compute the \( \mathbf{X} \)-loadings \( \mathbf{p}_h \): \( \mathbf{p}_h = \mathbf{E}_{h-1}^T \mathbf{t}_h / (\mathbf{t}_h^T \mathbf{t}_h) \).
- Compute the regression coefficient \( r_h \) for \( \mathbf{y} \) on \( \mathbf{t}_h \): \( r_h = \mathbf{F}_{h-1}^T \mathbf{t}_h / (\mathbf{t}_h^T \mathbf{t}_h) \).
- Update the residual matrices: \( \mathbf{E}_h = \mathbf{E}_{h-1} - \mathbf{t}_h \mathbf{p}_h^T \) and \( \mathbf{F}_h = \mathbf{F}_{h-1} - \mathbf{t}_h r_h \).
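To make these steps concrete, the following minimal Python sketch implements the extraction loop directly from the formulas above. The function and variable names are our own, not from any particular library, and the inputs are assumed to be already centered and scaled per Step 1:

```python
import numpy as np

def plsr_univariate(X, y, m):
    """Univariate PLSR via the iterative steps above (a sketch).

    X: (n, p) predictor matrix, y: (n,) response, both assumed already
    standardized (Step 1). Returns scores T, X-loadings P, weights W,
    and the per-component y-coefficients r for m components.
    """
    E = X.astype(float).copy()      # E_0: standardized predictors
    f = y.astype(float).copy()      # F_0: standardized response
    T, P, W, r = [], [], [], []
    for _ in range(m):
        w = E.T @ f
        w /= np.linalg.norm(w)      # w_h = E'F / ||E'F||
        t = E @ w                   # t_h = E_{h-1} w_h
        p_h = E.T @ t / (t @ t)     # p_h = E't_h / (t_h' t_h)
        r_h = f @ t / (t @ t)       # r_h = F't_h / (t_h' t_h)
        E -= np.outer(t, p_h)       # deflate X-residuals
        f -= r_h * t                # deflate y-residuals
        T.append(t); P.append(p_h); W.append(w); r.append(r_h)
    return (np.column_stack(T), np.column_stack(P),
            np.column_stack(W), np.array(r))
```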
Step 3: Determining the Optimal Number of Components (m). A critical step is to avoid overfitting. This is typically done via cross-validation, often using the metric \( Q^2 \). For component \( h \), \( Q_h^2 = 1 - \frac{\mathrm{PRESS}_h}{\mathrm{SS}_{h-1}} \), where \( \mathrm{PRESS}_h \) is the Prediction Error Sum of Squares from cross-validation using \( h \) components, and \( \mathrm{SS}_{h-1} \) is the residual sum of squares from the model with \( h-1 \) components. A common rule is to stop extracting components when \( Q_h^2 < 0.0975 \). The optimal \( m \) is the number that maximizes predictive ability.
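A leave-one-out sketch of this criterion, reusing plsr_univariate from above. For brevity the data are assumed standardized once outside the folds, and the helper pls_coefficients computes the Step 4 coefficients \( \mathbf{b}_{PLS} = \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1}\mathbf{r} \):

```python
def pls_coefficients(P, W, r):
    # b_PLS = W (P'W)^{-1} r: coefficients on the standardized variables (Step 4)
    return W @ np.linalg.solve(P.T @ W, r)

def q2_per_component(X, y, max_m):
    """Leave-one-out Q^2 for h = 1..max_m components (a sketch).

    PRESS_h sums squared leave-one-out errors of the h-component model;
    SS_{h-1} is the residual sum of squares of the full (h-1)-component
    fit, with SS_0 = y'y for standardized y.
    """
    n = len(y)
    q2 = []
    for h in range(1, max_m + 1):
        press = 0.0
        for i in range(n):
            mask = np.arange(n) != i           # hold out observation i
            T, P, W, r = plsr_univariate(X[mask], y[mask], h)
            b = pls_coefficients(P, W, r)
            press += (y[i] - X[i] @ b) ** 2
        if h == 1:
            ss_prev = float(y @ y)             # SS_0 for standardized y
        else:
            T, P, W, r = plsr_univariate(X, y, h - 1)
            resid = y - X @ pls_coefficients(P, W, r)
            ss_prev = float(resid @ resid)
        q2.append(1.0 - press / ss_prev)
    return q2                                  # stop once q2[h-1] < 0.0975
```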
Step 4: Formulating the Final Regression Model. After extracting \( m \) components, the regression of \( \mathbf{y} \) on the original (standardized) variables is derived from the model \( \mathbf{\hat{y}} = r_1 \mathbf{t}_1 + \cdots + r_m \mathbf{t}_m \). Since each \( \mathbf{t}_h = \mathbf{E}_0 \mathbf{w}_h^* \), where \( \mathbf{w}_h^* \) is a function of all previous weights and loadings, we can express the prediction in terms of \( \mathbf{X} \):
$$
\mathbf{\hat{y}} = \mathbf{E}_0 \sum_{h=1}^{m} r_h \mathbf{w}_h^* = \mathbf{E}_0 \mathbf{b}_{PLS}
$$
where \( \mathbf{b}_{PLS} \) is the vector of PLSR coefficients for the standardized variables. These coefficients are then transformed back to the original scale of the data.
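Continuing the sketch, the back-transformation to original units can be written as follows (again with our own helper names): the standardized coefficients are scaled by the ratio of standard deviations, and the intercept recovers the centering.

```python
def fit_plsr_original_scale(X_raw, y_raw, m):
    """Fit univariate PLSR and return (intercept, coefficients) in raw units.

    A sketch reusing plsr_univariate and pls_coefficients: the model is
    built on standardized data, then rescaled so the resulting equation
    applies directly to the unstandardized technical parameters.
    """
    x_mean, x_std = X_raw.mean(axis=0), X_raw.std(axis=0, ddof=1)
    y_mean, y_std = y_raw.mean(), y_raw.std(ddof=1)
    X = (X_raw - x_mean) / x_std          # Step 1: standardize
    y = (y_raw - y_mean) / y_std
    T, P, W, r = plsr_univariate(X, y, m)
    b_std = pls_coefficients(P, W, r)
    b_raw = b_std * y_std / x_std         # undo unit-variance scaling
    intercept = y_mean - x_mean @ b_raw   # undo centering
    return intercept, b_raw
```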
Diagnostic and Interpretive Tools in PLSR
PLSR offers a suite of tools for model diagnosis and interpretation, which is vital for understanding cost drivers for a military drone.
1. Variable Importance in Projection (VIP): This metric quantifies the contribution of each predictor variable \( x_j \) to the PLSR model. The VIP score for the \( j \)-th variable is calculated as:
$$
VIP_j = \sqrt{ \frac{p}{\sum_{h=1}^{m} SS_h} \sum_{h=1}^{m} SS_h (w_{hj}^*)^2 }
$$
where \( p \) is the number of predictors, \( SS_h \) is the sum of squares explained by component \( h \) in the \( \mathbf{y} \)-model, and \( w_{hj}^* \) is the normalized weight for variable \( j \) in component \( h \). Variables with a VIP score greater than 1 are generally considered significant contributors to the predictive model of the military drone’s cost.
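A direct transcription of this formula into the running sketch, taking \( SS_h = r_h^2 \, \mathbf{t}_h^T \mathbf{t}_h \) as the y-variance explained by component \( h \) (an assumption consistent with the univariate algorithm above):

```python
def vip_scores(T, W, r):
    """VIP score per predictor, per the formula above (a sketch).

    T: (n, m) scores, W: (p, m) weights, r: (m,) y-coefficients from
    plsr_univariate. SS_h is taken as r_h^2 * (t_h' t_h).
    """
    p = W.shape[0]
    ss = r ** 2 * np.sum(T ** 2, axis=0)       # SS_h per component
    w2 = (W / np.linalg.norm(W, axis=0)) ** 2  # normalized squared weights
    return np.sqrt(p * (w2 @ ss) / ss.sum())   # VIP_j > 1 flags key drivers
```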
2. Explained Variance: We can assess how much of the variation in \( \mathbf{X} \) and \( \mathbf{Y} \) is captured by each component.
- Variance explained in \( \mathbf{X} \) by component \( t_h \): \( R^2_X(t_h) = \frac{\|\mathbf{t}_h \mathbf{p}_h^T\|^2}{\|\mathbf{X}\|^2} \).
- Variance explained in \( \mathbf{Y} \) by component \( t_h \): \( R^2_Y(t_h) = \frac{r_h^2 \|\mathbf{t}_h\|^2}{\|\mathbf{Y}\|^2} \).
Cumulative values indicate the total explanatory power of the model.
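These quantities follow immediately from the stored scores and loadings; a sketch, using Frobenius norms and the standardized X and y as before:

```python
def explained_variance(X, y, T, P, r):
    """Per-component R^2_X and R^2_Y, following the formulas above (a sketch)."""
    m = T.shape[1]
    r2x = np.array([np.linalg.norm(np.outer(T[:, h], P[:, h])) ** 2
                    for h in range(m)]) / np.linalg.norm(X) ** 2
    r2y = np.array([r[h] ** 2 * (T[:, h] @ T[:, h])
                    for h in range(m)]) / (y @ y)
    return r2x, r2y        # cumulative power: r2x.cumsum(), r2y.cumsum()
```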
3. Outlier Detection (Hotelling’s T²): The \( T^2 \) statistic for observation \( i \) across \( m \) components helps identify outliers that exert undue influence on the model:
$$
T_i^2 = \sum_{h=1}^{m} \frac{t_{hi}^2}{s_{t_h}^2}
$$
where \( t_{hi} \) is the score for observation \( i \) on component \( h \), and \( s_{t_h}^2 \) is the variance of \( t_h \). Observations with excessively high \( T^2 \) values may be outliers specific to the military drone dataset and require investigation.
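The statistic reduces to a few lines on the score matrix; a sketch:

```python
def hotelling_t2(T):
    """Hotelling's T^2 per observation from the (n, m) score matrix (a sketch).

    Each squared score is divided by the variance of its component, as in
    the formula above; unusually large values flag candidate outliers.
    """
    s2 = T.var(axis=0, ddof=1)       # s_{t_h}^2 for each component
    return (T ** 2 / s2).sum(axis=1)
```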
Case Study: Military Drone Development Cost Prediction
To demonstrate the efficacy of PLSR, we apply it to a representative military drone development cost problem. The dataset comprises six historical military drone programs used for model fitting (Models A–F) and one hold-out program used for testing (Model K), each characterized by six key technical parameters hypothesized to drive non-recurring engineering (development) costs. The parameters are: Length \( L \) (m), Maximum Take-off Mass \( W \) (kg), Cruise Speed \( V \) (km/h), Flight Altitude \( H \) (km), Endurance \( T \) (h), and Payload Capacity \( N \) (kg). The response variable is Development Cost \( C \) (in billions of USD). This small, illustrative dataset is presented in Table 1.
| Military Drone Model | L (x1) [m] | W (x2) [kg] | V (x3) [km/h] | H (x4) [km] | T (x5) [h] | N (x6) [kg] | Cost C (y) [B$] |
|---|---|---|---|---|---|---|---|
| Model A | 13.50 | 11622 | 557 | 19.8 | 42 | 900 | 3.71 |
| Model B | 5.25 | 480 | 306 | 4.0 | 7 | 130 | 1.33 |
| Model C | 2.08 | 160 | 218 | 4.0 | 4 | 165 | 0.95 |
| Model D | 4.27 | 400 | 30 | 2.0 | 5 | 14.5 | 1.02 |
| Model E | 13.50 | 10395 | 648 | 20.4 | 46 | 905 | 4.19 |
| Model F | 4.60 | 3900 | 555 | 15.2 | 12 | 450 | 2.65 |
| Model K (Test) | 8.22 | 1020 | 139 | 7.3 | 40 | 204 | 2.07 |
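For readers who prefer an off-the-shelf implementation, Table 1 translates directly into arrays, and scikit-learn's PLSRegression (which standardizes internally when scale=True) can reproduce the workflow. Minor algorithmic conventions may differ from the hand-rolled sketch above, so fitted coefficients can vary slightly:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Table 1: columns are L [m], W [kg], V [km/h], H [km], T [h], N [kg].
X_train = np.array([
    [13.50, 11622, 557, 19.8, 42, 900.0],   # Model A
    [ 5.25,   480, 306,  4.0,  7, 130.0],   # Model B
    [ 2.08,   160, 218,  4.0,  4, 165.0],   # Model C
    [ 4.27,   400,  30,  2.0,  5,  14.5],   # Model D
    [13.50, 10395, 648, 20.4, 46, 905.0],   # Model E
    [ 4.60,  3900, 555, 15.2, 12, 450.0],   # Model F
])
y_train = np.array([3.71, 1.33, 0.95, 1.02, 4.19, 2.65])  # cost, B$
X_test = np.array([[8.22, 1020, 139, 7.3, 40, 204.0]])    # Model K (hold-out)

pls = PLSRegression(n_components=1, scale=True)
pls.fit(X_train, y_train)
print(pls.predict(X_test))   # point prediction for Model K, in B$
```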
The correlation matrix (Table 2) reveals the core challenge. The predictor variables for the military drone are highly intercorrelated, with many pairwise correlation coefficients exceeding 0.9. This severe multicollinearity renders standard multiple linear regression unreliable.
| | W (x2) | V (x3) | H (x4) | T (x5) | N (x6) | Cost (y) |
|---|---|---|---|---|---|---|
| L (x1) | 0.959 | 0.726 | 0.845 | 0.982 | 0.919 | 0.942 |
| W (x2) | | 0.822 | 0.945 | 0.981 | 0.984 | 0.966 |
| V (x3) | | | 0.942 | 0.788 | 0.869 | 0.913 |
| H (x4) | | | | 0.901 | 0.976 | 0.983 |
| T (x5) | | | | | 0.966 | 0.952 |
| N (x6) | | | | | | 0.980 |
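Table 2 can be checked in one line from the arrays above, by stacking the predictors and cost and reading the upper triangle of the correlation matrix:

```python
# Upper triangle of this matrix corresponds to Table 2.
corr = np.corrcoef(np.column_stack([X_train, y_train]), rowvar=False)
print(np.round(corr, 3))
```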
PLSR Model Construction and Results
We apply the PLSR algorithm to the training data (Models A-F). The first latent component \( t_1 \) is extracted. The cross-validation results indicate that one component is sufficient, as the \( Q^2 \) for a potential second component falls below the threshold. The first component explains a substantial portion of the variance in both \( \mathbf{X} \) and \( \mathbf{Y} \).
The VIP scores for the first component are calculated and presented in Table 3. All VIP scores are close to or above 1, confirming that all six technical parameters contribute meaningfully to explaining the development cost of the military drone. This is a more comprehensive insight than stepwise regression might provide.
| Metric | Value |
|---|---|
| Number of Components (m) | 1 |
| \( R^2_X \) (for t1) | 92.5% |
| \( R^2_Y \) (for t1) | 98.0% |
| Cross-validated \( Q^2 \) | > 0.0975 |
| Variable Importance in Projection (VIP) | |
| Length (x1) | 1.033 |
| Mass (x2) | 1.030 |
| Speed (x3) | 1.015 |
| Altitude (x4) | 1.000 |
| Endurance (x5) | 0.961 |
| Payload (x6) | 0.959 |
The final PLSR regression coefficients, transformed back to the original scale of the military drone data, yield the following cost estimation equation:
$$
\begin{aligned}
\hat{C} = & -0.5463 + 0.0469 \cdot L + 4.71 \times 10^{-5} \cdot W + 9.73 \times 10^{-4} \cdot V \\
& + 0.0296 \cdot H + 0.0126 \cdot T + 6.35 \times 10^{-4} \cdot N
\end{aligned}
$$
where \( \hat{C} \) is the predicted development cost in billions of USD.
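For completeness, the equation evaluates directly in code; a sketch using the coefficients exactly as printed above (rounding in the transcribed coefficients means the result may not match a full-precision fit):

```python
def predict_cost(L, W, V, H, T, N):
    """Evaluate the fitted PLSR cost equation above; returns cost in B$.

    Coefficients are transcribed from the final model as printed; a
    reimplementation should verify them against its own fit.
    """
    return (-0.5463 + 0.0469 * L + 4.71e-5 * W + 9.73e-4 * V
            + 0.0296 * H + 0.0126 * T + 6.35e-4 * N)

# Example: Model K's parameters from Table 1.
print(predict_cost(8.22, 1020, 139, 7.3, 40, 204))
```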
Model Validation and Comparative Analysis
To validate the model, we predict the cost for the hold-out test sample, Military Drone Model K. The prediction and its error are calculated. We compare the performance of the PLSR model against three alternative approaches commonly used or suggested for such problems: Stepwise Multiple Regression (SMR), a Backpropagation Neural Network (BPNN), and a Radial Basis Function Neural Network (RBFN). The comparative results are summarized in Table 4.
| Prediction Method | Predicted Cost [B$] | Absolute Error [%] | Key Characteristics |
|---|---|---|---|
| Partial Least Squares (PLSR) | 1.962 | 5.24% | Handles collinearity, small-n, interpretable, stable. |
| Stepwise Multiple Regression (SMR) | 1.718 | 17.00% | Unstable with collinearity, may exclude relevant military drone parameters. |
| Backpropagation Neural Network (BPNN) | 1.890 | 8.70% | Risk of overfitting, local minima, “black-box” nature. |
| Radial Basis Function Network (RBFN) | 1.960 | 5.30% | Good performance, but less interpretable than PLSR. |
| Actual Cost | 2.070 | – | – |
The results clearly demonstrate the superiority of the PLSR approach in this specific context. The SMR model, struggling with multicollinearity, produced an inferior model with higher prediction error. While the RBFN achieved accuracy similar to PLSR, it lacks the transparent, analytical framework of PLSR. The PLSR model not only provides the most accurate point prediction for this military drone but also offers invaluable diagnostic information (VIP scores, explained variance) that can guide engineers and cost analysts in understanding the primary cost drivers. For instance, the VIP scores suggest that length, mass, and speed are among the most influential parameters for this class of military drone, which aligns with intuitive engineering judgment about system complexity.
Conclusion and Implications
The development of a modern military drone is a complex engineering undertaking with significant financial implications. Accurate early-stage cost prediction is essential for prudent defense acquisition and portfolio management. The common hurdles of small sample sizes and highly correlated technical parameters render many conventional statistical tools inadequate. This analysis has shown that Partial Least Squares Regression (PLSR) is a particularly well-suited methodology for overcoming these hurdles in military drone cost estimation.
PLSR’s strength lies in its dual focus: it reduces the dimensionality of the predictor space by constructing latent components that maximize the explained covariance with the cost variable, thereby effectively managing multicollinearity. Its built-in cross-validation mechanism guards against overfitting, a critical concern with limited data. Furthermore, unlike neural networks, PLSR provides a wealth of interpretative outputs—such as VIP scores, component explained variance, and outlier diagnostics—that transform it from a mere prediction tool into an analytical framework for understanding cost structures.
The comparative case study confirms that PLSR can achieve prediction accuracy superior to stepwise regression and comparable to or better than advanced neural networks, while offering far greater insight. For program managers and cost estimators working on the next generation of military drone systems, incorporating PLSR into their analytical toolkit can lead to more reliable, justifiable, and informative cost forecasts. Future work could explore nonlinear extensions of PLSR or its integration with detailed technical risk assessments to further enhance the predictive modeling of military drone development expenditures.
