Cost Prediction of Military Drones Using Gram-Schmidt Regression

The development of military drone technology represents a significant financial undertaking for defense organizations worldwide. Accurately predicting the research and development (R&D) cost of a military drone during its conceptual and early design phases is crucial for budget planning, program justification, and feasibility studies. However, this task is notoriously challenging due to the scarcity of historical cost data, the high degree of technological complexity, and the inherent correlations between a drone’s performance parameters.

Traditional cost prediction methods often struggle in this environment. Multiple linear regression suffers from multicollinearity when explanatory variables (e.g., speed, range, payload) are interdependent. Techniques like Artificial Neural Networks (ANN) require larger datasets than are typically available for cutting-edge military drone programs and can be prone to overfitting. This paper explores the application of Gram-Schmidt regression, a robust orthogonalization technique, to build a parsimonious and accurate cost model for military drone R&D, effectively addressing the issues of small sample sizes and multicollinearity.

The core challenge in modeling military drone development cost lies in the data landscape. Programs are few, highly classified, and each system is often unique, leading to a very limited set of usable historical cost data points. Furthermore, the technical parameters that drive cost are not independent. For instance, a military drone designed for high altitude and long endurance will inherently have a larger wingspan and more powerful, expensive engines, creating strong correlations among variables like wingspan, mass, altitude, and endurance. Using such correlated variables directly in a regression model inflates variance and makes coefficient estimates unstable and difficult to interpret.

Gram-Schmidt regression elegantly overcomes this by transforming the original set of correlated predictor variables into a new set of orthogonal (uncorrelated) variables. The process sequentially selects the variable that explains the most remaining variance in the cost and removes its influence from the remaining candidates. This allows for the identification of the most significant cost drivers while automatically mitigating multicollinearity. The resulting model is based on a smaller, more fundamental set of orthogonal predictors, which is ideal for the small-sample context of military drone costing.

Mathematical Foundation of Gram-Schmidt Orthogonalization

Given a set of p predictor variables \( x_1, x_2, …, x_p \) (centered to have mean zero) and a response variable \( y \) (development cost), the Gram-Schmidt process constructs an orthogonal set \( z_1, z_2, …, z_p \).

The procedure is defined recursively:

$$
\begin{aligned}
z_1 &= x_1 \\
z_2 &= x_2 - \frac{x_2^T z_1}{z_1^T z_1} z_1 \\
z_3 &= x_3 - \frac{x_3^T z_1}{z_1^T z_1} z_1 - \frac{x_3^T z_2}{z_2^T z_2} z_2 \\
&\vdots \\
z_k &= x_k - \sum_{j=1}^{k-1} \frac{x_k^T z_j}{z_j^T z_j} z_j
\end{aligned}
$$

The coefficient \( r_{jk} = \frac{x_k^T z_j}{z_j^T z_j} \) represents the projection of \( x_k \) onto the orthogonal basis vector \( z_j \). The transformed vectors \( z_1, z_2, … \) are mutually orthogonal, meaning \( z_i^T z_j = 0 \) for \( i \neq j \).
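The recursion above can be sketched in a few lines of numpy. This is a minimal illustration on synthetic data (the function name and toy predictors are my own, not from the source study); it returns both the orthogonal columns \( z_j \) and the projection coefficients \( r_{jk} \):

```python
import numpy as np

def gram_schmidt(X):
    """Sequentially orthogonalize the columns of X (assumed centered).

    Returns Z with mutually orthogonal columns and the upper-triangular
    matrix of projection coefficients R[j, k] = x_k^T z_j / (z_j^T z_j).
    """
    n, p = X.shape
    Z = np.zeros_like(X, dtype=float)
    R = np.eye(p)
    for k in range(p):
        z = X[:, k].astype(float).copy()
        for j in range(k):
            r = (Z[:, j] @ X[:, k]) / (Z[:, j] @ Z[:, j])
            R[j, k] = r
            z -= r * Z[:, j]          # remove the component along z_j
        Z[:, k] = z
    return Z, R

# Correlated toy predictors (centered), standing in for drone parameters.
rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
X = np.column_stack([x1, x1 + 0.1 * rng.normal(size=20), rng.normal(size=20)])
X -= X.mean(axis=0)

Z, R = gram_schmidt(X)
# Columns of Z are pairwise orthogonal: z_i^T z_j = 0 for i != j.
print(np.allclose(Z.T @ Z, np.diag(np.diag(Z.T @ Z))))  # True
# X is recovered exactly as Z @ R.
print(np.allclose(X, Z @ R))                            # True
```

The second check previews the matrix identity used below: stacking the recursion column-by-column gives \( \mathbf{X} = \mathbf{Z}\mathbf{R} \).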

We can relate the original matrix \( \mathbf{X} = [x_1, x_2, …, x_p] \) to the orthogonal matrix \( \mathbf{Z} = [z_1, z_2, …, z_p] \) via an upper-triangular matrix \( \mathbf{R} \):

$$
\mathbf{Z} = \mathbf{X} \mathbf{R}^{-1}
$$

where

$$
\mathbf{R} =
\begin{bmatrix}
1 & r_{12} & r_{13} & \dots & r_{1p} \\
0 & 1 & r_{23} & \dots & r_{2p} \\
0 & 0 & 1 & \dots & r_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \dots & 1
\end{bmatrix}
$$

After orthogonalization, we regress the cost \( y \) on the orthogonal variables \( z_j \). Because the \( z_j \)’s are orthogonal, the regression coefficients \( \beta_j \) for the model \( y = \sum \beta_j z_j \) are stable and independently estimated. Finally, we can transform back to the original variables \( x_j \) using the relationship \( \mathbf{Z} = \mathbf{X} \mathbf{R}^{-1} \), yielding a final model of the form \( y = a_0 + \sum_{j=1}^{m} a_j x_j \), where \( m \leq p \) is the number of significant variables selected.
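The key payoff — independently estimated coefficients on \( z_j \), then a back-transform to the original variables — can be demonstrated on synthetic data. This sketch uses numpy's QR factorization as a numerically stable equivalent of Gram-Schmidt (rescaled so that \( \mathbf{R} \) has a unit diagonal, as above); the data and coefficient values are illustrative, not from the source study:

```python
import numpy as np

# Synthetic centered data with correlated predictors.
rng = np.random.default_rng(1)
n = 25
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.2 * rng.normal(size=n), rng.normal(size=n)])
X -= X.mean(axis=0)
y = X @ np.array([1.5, 0.8, -0.5]) + 0.05 * rng.normal(size=n)
y -= y.mean()

# QR gives X = Q Rq; rescale so R has ones on the diagonal, i.e. X = Z R.
Q, Rq = np.linalg.qr(X)
d = np.diag(Rq)
Z = Q * d
R = Rq / d[:, None]

# Because the z_j are orthogonal, each gamma_j is estimated independently:
# gamma_j = z_j^T y / (z_j^T z_j), with no cross-terms to untangle.
gamma = (Z.T @ y) / np.einsum('ij,ij->j', Z, Z)

# Back-transform: y = Z gamma = X R^{-1} gamma, so a = R^{-1} gamma.
a = np.linalg.solve(R, gamma)

# Both parameterizations produce identical fitted values.
print(np.allclose(Z @ gamma, X @ a))  # True
```

Note that with centered data the intercept of the orthogonal model is zero; the intercept of the final model in original units reappears when the variable means are restored.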

Gram-Schmidt Regression Modeling for Military Drone Cost

The step-by-step algorithm for building a military drone development cost prediction model is as follows:

Step 1: Data Preparation and Parameter Selection. Collect available historical data on military drone programs. Identify potential cost-driving parameters (e.g., physical dimensions, mass, performance metrics). Center all predictor variables and the cost response so they have zero mean.

Step 2: Initial Variable Selection.
For each centered predictor variable \( x_j \), temporarily treat it as \( z_{1j} \). Fit a simple linear regression: \( y = \beta_{0j} + \beta_{1j} z_{1j} \).
Calculate the t-statistic for each \( \beta_{1j} \).
Select the variable \( x_{(1)} \) with the t-statistic of greatest absolute value that exceeds the critical threshold \( t_{\alpha/2, n-2} \). This becomes our first orthogonal basis vector: \( z_1 = x_{(1)} \).

Step 3: Iterative Orthogonalization and Selection.
For the remaining variables \( x_j \), perform the Gram-Schmidt orthogonalization relative to all currently selected \( z \) vectors. For example, if \( z_1 \) is selected, create the adjusted variables:
$$ z_{2j} = x_j - \frac{x_j^T z_1}{z_1^T z_1} z_1 $$
Fit multiple regressions of \( y \) on \( z_1 \) and each \( z_{2j} \).
Examine the t-statistic for each new \( z_{2j} \). Select the variable \( x_{(2)} \) corresponding to the \( z_{2j} \) with the most significant t-statistic. This becomes \( z_2 \).
Continue this process. At stage \( k \), for each remaining \( x_j \), compute:
$$ z_{kj} = x_j - \sum_{i=1}^{k-1} \frac{x_j^T z_i}{z_i^T z_i} z_i $$
Fit the regression with all selected \( z_1, …, z_{k-1} \) and the candidate \( z_{kj} \). Select the next variable based on the t-statistic of \( z_{kj} \).

Step 4: Stopping Rule. The process terminates when none of the remaining orthogonalized candidate variables yield a t-statistic that is statistically significant at the chosen \( \alpha \) level.

Step 5: Model Formulation. Suppose \( m \) variables are selected, producing orthogonal vectors \( z_1, …, z_m \). The orthogonal cost model is:
$$ y = \gamma_0 + \gamma_1 z_1 + \gamma_2 z_2 + \dots + \gamma_m z_m $$
Using the inverse transformation \( \mathbf{Z} = \mathbf{X}_m \mathbf{R}_m^{-1} \), where \( \mathbf{X}_m \) contains only the selected original variables, we obtain the final interpretable cost prediction model in terms of the original military drone parameters:
$$ y = a_0 + a_1 x_{(1)} + a_2 x_{(2)} + \dots + a_m x_{(m)} $$
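Steps 1 through 5 can be condensed into a forward-selection loop. The following is a minimal numpy sketch of the procedure, not the source study's implementation: the t-test uses the residual variance of the fit that includes the candidate, and a fixed threshold `t_crit` stands in for the degrees-of-freedom-dependent critical value \( t_{\alpha/2} \):

```python
import numpy as np

def gs_forward_select(X, y, t_crit=2.0):
    """Forward variable selection with Gram-Schmidt orthogonalization.

    X and y are assumed centered. At each stage every remaining x_j is
    orthogonalized against the already-selected z's; its coefficient is
    z^T y / (z^T z), and selection stops when no candidate's t-statistic
    exceeds t_crit (Step 4, the stopping rule).
    """
    n, p = X.shape
    selected, Zsel = [], []
    resid = y.astype(float).copy()           # residual of y on selected z's
    while len(selected) < p:
        best = None
        for j in range(p):
            if j in selected:
                continue
            z = X[:, j].astype(float).copy()
            for zs in Zsel:                   # Gram-Schmidt step (Step 3)
                z -= (zs @ X[:, j]) / (zs @ zs) * zs
            zz = z @ z
            if zz < 1e-12:                    # candidate fully explained
                continue
            coef = (z @ resid) / zz
            new_resid = resid - coef * z
            df = n - len(selected) - 2        # selected + candidate + mean
            sigma2 = (new_resid @ new_resid) / df
            t = coef / np.sqrt(sigma2 / zz)
            if best is None or abs(t) > abs(best[0]):
                best = (t, j, z, coef)
        if best is None or abs(best[0]) < t_crit:
            break                             # Step 4: no significant candidate
        t, j, z, coef = best
        selected.append(j)
        Zsel.append(z)
        resid = resid - coef * z
    return selected

# Toy check: y is driven by column 0; column 1 is a noisy copy of it.
rng = np.random.default_rng(42)
n = 40
x0 = rng.normal(size=n)
X = np.column_stack([x0, x0 + 0.5 * rng.normal(size=n), rng.normal(size=n)])
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] + 0.01 * rng.normal(size=n)
y -= y.mean()
print(gs_forward_select(X, y)[0])  # 0 — the true driver is picked first
```

Because each candidate is orthogonal to the already-selected \( z \)'s, its coefficient in the joint fit equals its simple projection onto the current residual, which keeps every stage a set of cheap univariate computations.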

Case Study: Predicting Development Cost of a Military Drone

To demonstrate the method, we use a canonical dataset from the literature on military drone development. It comprises six development programs for training and one held out for testing. The predictor variables considered are:

  • \( x_1 \): Length (m)
  • \( x_2 \): Maximum Take-Off Weight (kg)
  • \( x_3 \): Cruise Speed (km/h)
  • \( x_4 \): Flight Altitude (km)
  • \( x_5 \): Endurance (h)
  • \( x_6 \): Payload (kg)

The response variable \( y \) is the development cost in billions of USD (adjusted to a common base year).

Data and Correlation Analysis
The centered training data is presented below. The high correlation between parameters is evident and problematic for standard regression.

| Drone | \( x_1 \) | \( x_2 \) | \( x_3 \) | \( x_4 \) | \( x_5 \) | \( x_6 \) | \( y \) |
|---|---|---|---|---|---|---|---|
| A | 6.15 | 7625.0 | 206.6 | 9.41 | 19.71 | 504.5 | 1.436 |
| B | -2.10 | -3517.0 | -44.4 | -6.39 | -15.29 | -265.5 | -0.944 |
| C | -5.27 | -3837.0 | -132.4 | -6.39 | -18.29 | -230.5 | -1.324 |
| D | -3.08 | -3597.0 | -320.4 | -8.39 | -17.29 | -381.0 | -1.254 |
| E | 6.15 | 6398.0 | 297.6 | 10.01 | 23.71 | 509.5 | 1.916 |
| F | -2.75 | -96.7 | 204.6 | 4.81 | -10.29 | 54.5 | 0.376 |

The correlation matrix confirms severe multicollinearity, with many pairwise correlations exceeding 0.9. For instance, \( x_4 \) (Altitude) is highly correlated with \( x_2 \), \( x_3 \), \( x_5 \), \( x_6 \), and \( y \).

| Var | \( x_1 \) | \( x_2 \) | \( x_3 \) | \( x_4 \) | \( x_5 \) | \( x_6 \) | \( y \) |
|---|---|---|---|---|---|---|---|
| \( x_1 \) | 1.000 | 0.959 | 0.726 | 0.845 | 0.982 | 0.919 | 0.915 |
| \( x_2 \) | | 1.000 | 0.822 | 0.945 | 0.981 | 0.984 | 0.966 |
| \( x_3 \) | | | 1.000 | 0.942 | 0.788 | 0.896 | 0.913 |
| \( x_4 \) | | | | 1.000 | 0.901 | 0.974 | 0.983 |
| \( x_5 \) | | | | | 1.000 | 0.966 | 0.952 |
| \( x_6 \) | | | | | | 1.000 | 0.980 |
| \( y \) | | | | | | | 1.000 |

Gram-Schmidt Variable Selection Process
We now apply the algorithm step-by-step. All variables are centered.

Iteration 1: Regress \( y \) on each individual \( x_j \). The t-statistics are:
$$ t(x_1)=4.52, \quad t(x_2)=7.45, \quad t(x_3)=4.48, \quad t(x_4)=10.63, \quad t(x_5)=6.23, \quad t(x_6)=9.86 $$
\( x_4 \) (Flight Altitude) has the largest absolute t-value (10.63). With \( n=6 \), \( t_{0.025, 4} = 2.776 \). It is significant. Therefore, select \( x_4 \). Set \( z_1 = x_4 \).
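For a simple regression on centered data, the slope t-statistic follows directly from the sample correlation via \( t = r\sqrt{n-2}/\sqrt{1-r^2} \). A quick sanity check against the correlation matrix above (small discrepancies reflect the three-decimal rounding of the reported correlations; the function name is my own):

```python
import math

def t_from_corr(r, n):
    """t-statistic of a simple-regression slope, from the correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

n = 6
# Reported correlations of x4 and x6 with y (see the matrix above).
print(round(t_from_corr(0.983, n), 2))  # ~10.71, vs. t(x4) = 10.63 reported
print(round(t_from_corr(0.980, n), 2))  # ~9.85,  vs. t(x6) = 9.86 reported
```

Both values comfortably exceed the critical threshold \( t_{0.025, 4} = 2.776 \), confirming the significance of \( x_4 \) at this stage.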

Iteration 2: Orthogonalize the remaining variables with respect to \( z_1 \):
$$ z_{2j} = x_j - \frac{x_j^T z_1}{z_1^T z_1} z_1, \quad \text{for } j=1,2,3,5,6 $$
We then regress \( y \) on \( z_1 \) and each \( z_{2j} \) separately. The t-statistics for the new orthogonal components are:
$$ t(z_{21})=2.82, \quad t(z_{22})=1.36, \quad t(z_{23})=-0.35, \quad t(z_{25})=2.59, \quad t(z_{26})=1.11 $$
The critical value is \( t_{0.025, 3} = 3.182 \). While \( z_{21} \) has the highest value (2.82), it does not exceed the threshold. However, in many practical applications (and in the source study), a less stringent selection criterion is used during the sequential process to retain meaningful parameters. Following this logic, \( z_{21} \) (derived from \( x_1 \), Length) is selected as it has the most explanatory power remaining. Set \( z_2 = z_{21} \).

Iteration 3: Orthogonalize the remaining variables (\( x_2, x_3, x_5, x_6 \)) with respect to both \( z_1 \) and \( z_2 \). The t-statistics for these new orthogonal components (\( z_{32}, z_{33}, z_{35}, z_{36} \)) are all insignificant (\( |t| < 2.447 \)). The process stops.

We have selected \( m=2 \) variables: \( x_4 \) (Flight Altitude) and \( x_1 \) (Length).

Final Model Formulation
The regression model using the orthogonal vectors is:
$$ y = 0.16394 \, z_1 + 0.083685 \, z_2 $$
where \( z_1 = x_4 \) and \( z_2 = x_1 - r_{12} z_1 \), with \( r_{12} = \frac{x_1^T z_1}{z_1^T z_1} \).

Using the inverse transformation \( \mathbf{Z} = \mathbf{X} \mathbf{R}^{-1} \), we convert this model back to the original variables. The final military drone development cost prediction model is:
$$ \hat{y} = 0.3709 + 0.08369 \, x_1 + 0.12247 \, x_4 $$

This model has an excellent fit to the training data with \( R^2 = 0.9907 \). The intercept \( a_0 = 0.3709 \) represents the baseline cost, while the coefficients indicate that development cost increases by approximately \$0.0837 billion for each additional meter of length and by \$0.1225 billion for each additional kilometer of operational altitude.
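The back-substitution can be checked by hand. With \( z_1 = x_4 \) and \( z_2 = x_1 - r_{12} z_1 \), the orthogonal model expands as

$$
y = \gamma_1 z_1 + \gamma_2 z_2 = \gamma_2 x_1 + (\gamma_1 - \gamma_2 r_{12}) x_4
$$

so \( a_1 = \gamma_2 = 0.083685 \approx 0.08369 \), matching the coefficient on \( x_1 \) above, while \( a_4 = \gamma_1 - \gamma_2 r_{12} = 0.12247 \) implies \( r_{12} \approx 0.496 \) (a value not stated in the source but recoverable from the coefficients); the intercept \( a_0 \) arises from restoring the variable means when moving from centered to original units.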

Training Fit and Test Prediction
The model’s predictions for the six training drones show high accuracy, with a maximum error of 8.9% and a minimum of 1.2%.

| Drone | Actual Cost (B\$) | Predicted Cost (B\$) | Error % |
|---|---|---|---|
| A | 3.71 | 3.925 | 5.8 |
| B | 1.33 | 1.300 | 2.2 |
| C | 0.95 | 1.035 | 8.9 |
| D | 1.02 | 0.973 | 4.6 |
| E | 4.19 | 3.999 | 4.6 |
| F | 2.65 | 2.617 | 1.2 |

Now, we test the model on a hold-out military drone (Drone K) with the following parameters: Length \( x_1 = 8.22 \) m, Altitude \( x_4 = 7.3 \) km. The predicted cost is:
$$ \hat{y}_K = 0.3709 + 0.08369(8.22) + 0.12247(7.3) \approx 1.9528 \text{ billion USD} $$
The actual reported development cost for this drone is 2.07 billion USD, resulting in a prediction error of approximately -5.7%.
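The hold-out arithmetic is easy to reproduce from the fitted coefficients. A minimal sketch of the final model as a function (the function and argument names are illustrative):

```python
def predict_cost(length_m, altitude_km):
    """Predicted development cost (billion USD) from the final fitted model."""
    return 0.3709 + 0.08369 * length_m + 0.12247 * altitude_km

# Hold-out Drone K: length 8.22 m, operating altitude 7.3 km.
y_hat = predict_cost(8.22, 7.3)
error_pct = 100 * (y_hat - 2.07) / 2.07
print(round(y_hat, 4))      # 1.9529
print(round(error_pct, 1))  # -5.7
```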

Comparative Analysis and Discussion

The performance of the Gram-Schmidt regression model is compared against other common techniques used for military drone cost prediction, as reported in related studies.

| Prediction Method | Predicted Cost (B\$) | Error % | Key Characteristics |
|---|---|---|---|
| Gram-Schmidt Regression (This Study) | 1.9528 | -5.7 | Uses only 2 key parameters; handles multicollinearity; ideal for small samples. |
| Partial Least Squares (PLS) | 1.9616 | -5.2 | Uses all 6 parameters; constructs latent factors. |
| Stepwise Multiple Regression (SMR) | 1.7181 | -17.0 | Often unstable with correlated variables; prone to selecting spurious models. |
| BP Neural Network | 1.8900 | -8.7 | Requires more data; risk of overfitting; "black-box" nature. |
| RBF Neural Network | 1.9600 | -5.3 | Similar accuracy to PLS but also uses all parameters and acts as a black box. |

The results are illuminating. The Gram-Schmidt model achieves prediction accuracy virtually identical to the more complex PLS and RBF Neural Network methods. Crucially, it does so by identifying and using only the two most fundamental parameters: Flight Altitude (\( x_4 \)) and Length (\( x_1 \)). This parsimony is a major advantage for a military drone cost analyst. In the early stages of a new program, engineers may have credible estimates for basic geometric and performance targets like size and operational ceiling long before detailed specifications for mass, payload, or exact endurance are finalized. The Gram-Schmidt model can provide a reliable cost forecast with this limited information.

Furthermore, the selected parameters make intuitive sense. Flight Altitude is a primary driver because it dictates pressurization requirements, engine performance, material choices for low-temperature operation, and the complexity of the sensor suite for long-range observation—all major cost elements in a military drone. Length is a strong proxy for the overall scale of the airframe, affecting manufacturing complexity, material volume, and structural design effort. The model effectively distills the cost of a complex military drone system down to these two orthogonal, comprehensible drivers.

The Gram-Schmidt procedure successfully discarded the other correlated variables (Weight, Speed, Endurance, Payload) not because they are unrelated to cost, but because their cost-driving information was largely redundant with that contained in Altitude and Length. This eliminates multicollinearity and leads to a stable, interpretable model. In contrast, Stepwise Regression performed poorly, likely due to the instability caused by high variable correlations. While neural networks can model complex relationships, their performance on tiny datasets is unreliable and they offer little insight into the causal factors driving cost.

Conclusion

Predicting the development cost of a military drone is a critical yet difficult task, characterized by small historical datasets and highly correlated technical parameters. This paper demonstrates that Gram-Schmidt regression is a particularly suitable and powerful methodology for this domain. By sequentially orthogonalizing predictor variables, it automatically addresses the problem of multicollinearity and extracts the most significant, non-redundant cost drivers from a set of candidate parameters.

The resulting model is both accurate and parsimonious. In our case study, a model based solely on the military drone’s length and operational flight altitude provided prediction accuracy on par with advanced multivariate techniques that required complete specification of six parameters. This efficiency is invaluable in practice, enabling cost estimators to generate reliable forecasts earlier in the design process when information is sparse. The model’s simplicity also enhances its credibility and utility for program managers and decision-makers.

Therefore, Gram-Schmidt regression should be considered a fundamental tool in the arsenal of methods for military drone life-cycle cost analysis. Its ability to provide robust, interpretable, and early-stage cost insights from limited data makes it a superior choice for navigating the financial uncertainties of advanced military drone development programs.
