Vision-Based Target Localization for Camera Drones Using Monocular Depth Estimation

Target localization from camera drones usually depends on prior knowledge of the environment or of the target's dimensions. This research introduces a monocular vision method that removes both dependencies. The approach combines deep-learning target detection, motion-based depth estimation, and coordinate transformation to localize a target without active ranging sensors.

The target detection framework utilizes an enhanced YOLOv5-Lite architecture. Key modifications include:

Replacement of Focus layers with Conv-BatchNorm-ReLU blocks
Integration of Squeeze-and-Excitation attention mechanisms
Optimized memory access in the FPN+PAN neck structure
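
Of the modifications listed above, the squeeze-and-excitation (SE) step is the easiest to illustrate in isolation. The following is a minimal pure-Python sketch of a generic SE block, not the authors' implementation; the function name `se_block`, the weight matrices, and the omission of biases are assumptions made for illustration.

```python
import math


def se_block(feature_map, w_reduce, w_expand):
    """Apply squeeze-and-excitation to a feature map.

    feature_map: list of C channels, each an H x W list of lists
    w_reduce:    (C/r) x C weights of the bottleneck layer (hypothetical values)
    w_expand:    C x (C/r) weights of the expansion layer (hypothetical values)
    Biases are omitted to keep the sketch short.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields per-channel gates in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: reweight every channel by its gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates


# Two 2x2 channels, reduced to one hidden unit and expanded back to two gates.
feature_map = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
scaled, gates = se_block(feature_map, [[0.5, 0.5]], [[1.0], [-1.0]])
print(gates)  # first channel is emphasized, second is suppressed
```

In a trained network the two weight matrices are learned, so the gates reflect which channels matter for the detection task.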

Performance comparisons show that the improved model achieves the highest accuracy, recall, and mAP of the four detectors, and is second only to YOLOv4-Tiny in speed:

Detection Model	Accuracy (%)	Recall (%)	mAP (%)	Speed (fps)
Improved YOLOv5-Lite	93.3	92.5	93.1	156.1
YOLOv5-Lite	91.2	90.3	90.5	140.5
YOLOv4-Tiny	88.4	87.4	85.9	165.2
Faster R-CNN	91.7	91.2	90.8	31.8

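Monocular ranging exploits the relationship between bounding-box dimensions and camera drone motion. For a target at camera-frame position (X, Y, Z_i) in frame i, with focal lengths f_x, f_y in pixels and principal point (c_x, c_y), the perspective projection is: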

$$ x = f_x \frac{X}{Z_i} + c_x, \quad y = f_y \frac{Y}{Z_i} + c_y $$

The target's width in the image, w_i (in pixels), relates to its physical width W through:

$$ w_i = f_x \frac{W}{Z_i} $$
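
The two relations above can be checked numerically. The sketch below is illustrative only: the intrinsics (800 px focal length, principal point at 640, 360) and the 2 m wide target at 20 m are hypothetical values, and the function names are not from the source.

```python
def project_point(X, Y, Z, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (metres) to pixel coordinates."""
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return fx * X / Z + cx, fy * Y / Z + cy


def image_width(W, Z, fx):
    """Apparent width in pixels of a target of physical width W at depth Z."""
    return fx * W / Z


# Hypothetical intrinsics and target, chosen only for illustration.
fx = fy = 800.0
cx, cy = 640.0, 360.0
left = project_point(-1.0, 0.0, 20.0, fx, fy, cx, cy)   # left edge of the target
right = project_point(1.0, 0.0, 20.0, fx, fy, cx, cy)   # right edge of the target
print(left, right)
print(right[0] - left[0], image_width(2.0, 20.0, fx))   # both give the pixel width
```

The principal point cancels when the two edge positions are subtracted, which is why the width relation contains only f_x, W, and the depth.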

For two frames i and j captured while the UAV moves along the optical axis, with \Delta Z_{ij} = Z_j - Z_i the change in depth between them:

$$ w_i Z_i = w_j (Z_i + \Delta Z_{ij}) = f_x W $$

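Substituting Z_i = Z_j - \Delta Z_{ij} into w_i Z_i = w_j Z_j and solving for the depth at frame j gives:

$$ Z_j = \frac{\Delta Z_{ij}}{1 - \frac{w_j}{w_i}} $$

While the UAV approaches the target, \Delta Z_{ij} is negative and w_j > w_i, so numerator and denominator are both negative and Z_j is positive.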
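
As a worked example, the two-frame constraint w_i Z_i = w_j (Z_i + \Delta Z_{ij}) can be solved for Z_j directly, taking \Delta Z_{ij} = Z_j - Z_i. The sketch below assumes that sign convention; the function name and the numbers are hypothetical.

```python
def depth_from_two_frames(w_i, w_j, delta_z_ij):
    """Depth Z_j at frame j from pixel widths w_i, w_j and delta_z_ij = Z_j - Z_i (metres).

    Follows from w_i * Z_i = w_j * Z_j with Z_i = Z_j - delta_z_ij.
    """
    if w_i == w_j:
        raise ValueError("no scale change between frames; depth is unobservable")
    return w_i * delta_z_ij / (w_i - w_j)


# Hypothetical numbers: the target grows from 80 px to 100 px while the
# drone closes 4 m along the optical axis (so delta_z_ij is negative).
z_j = depth_from_two_frames(80.0, 100.0, -4.0)
print(z_j)  # 16.0
```

The result is consistent with the width relation: 80 px at 20 m and 100 px at 16 m both give f_x W = 1600. Small width changes make the denominator close to zero, which is one reason a single frame pair is noisy at long range.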

Least-squares refinement stacks the width and height observations from n frames into one overdetermined system in the unknowns Z_j, f_x W, and f_y H:

$$ \begin{bmatrix}
w_1 & 1 & 0 \\
h_1 & 0 & 1 \\
w_2 & 1 & 0 \\
\vdots & \vdots & \vdots \\
h_n & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\hat{Z}_j \\
-\widehat{f_x W} \\
-\widehat{f_y H}
\end{bmatrix}
=
\begin{bmatrix}
w_1 \Delta Z_{1j} \\
h_1 \Delta Z_{1j} \\
w_2 \Delta Z_{2j} \\
\vdots \\
h_n \Delta Z_{nj}
\end{bmatrix} $$
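
A minimal sketch of this refinement follows, solving the normal equations with the standard library only. It assumes \Delta Z_{kj} = Z_j - Z_k for each frame k; the function names and the synthetic, noise-free data (f_x W = 1600, f_y H = 800) are hypothetical and serve only to check that the system recovers the depth.

```python
def solve_linear(M, v):
    """Solve M x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[r]] for r, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: observations lack scale change")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def refine_depth(widths, heights, delta_z):
    """Least-squares estimate of (Z_j, f_x*W, f_y*H).

    widths[k], heights[k]: bounding-box size in pixels at frame k
    delta_z[k]:            Z_j - Z_k in metres (negative while approaching)
    """
    rows, rhs = [], []
    for w, h, dz in zip(widths, heights, delta_z):
        rows.append([w, 1.0, 0.0])
        rhs.append(w * dz)
        rows.append([h, 0.0, 1.0])
        rhs.append(h * dz)
    # Normal equations: (A^T A) theta = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, rhs)) for i in range(3)]
    z_j, neg_fxw, neg_fyh = solve_linear(ata, aty)
    return z_j, -neg_fxw, -neg_fyh


# Synthetic, noise-free check with hypothetical values.
depths = [24.0, 20.0, 18.0, 16.0]            # Z_k per frame; the last frame is j
widths = [1600.0 / z for z in depths]
heights = [800.0 / z for z in depths]
delta_z = [depths[-1] - z for z in depths]   # Z_j - Z_k
z_j, fx_w, fy_h = refine_depth(widths, heights, delta_z)
print(round(z_j, 6), round(fx_w, 3), round(fy_h, 3))  # close to 16.0 1600.0 800.0
```

With noisy bounding boxes a QR- or SVD-based solver would be numerically safer than the normal equations; they are used here only to keep the sketch dependency-free.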

Coordinate transformation involves seven reference frames:

Pixel coordinates (u,v)
Image coordinates (x,y)
Camera frame (X_c,Y_c,Z_c)
Body frame (X_b,Y_b,Z_b)
Local NED frame (X_v,Y_v,Z_v)
ECEF frame (X_e,Y_e,Z_e)
Geodetic coordinates (B,L,H)

The complete transformation from camera to ECEF coordinates is:

$$ \begin{bmatrix} x^e_T \\ y^e_T \\ z^e_T \\ 1 \end{bmatrix} = \mathbf{A} \mathbf{B} \mathbf{C} \begin{bmatrix} x^c_T \\ y^c_T \\ z^c_T \\ 1 \end{bmatrix} $$

where the transformation matrices, applied right to left, account for:

Camera gimbal angles (matrix C)
UAV attitude (matrix B)
Geodetic position (matrix A)
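
The source does not give the entries of A, B, and C, so the following sketch fills them in under stated assumptions: camera axes are x right, y down, z forward; the body frame is x forward, y right, z down; gimbal and attitude angles use a ZYX rotation order; and the geodetic position is converted with WGS-84 constants. All function names are hypothetical.

```python
import math

WGS84_A = 6378137.0           # semi-major axis (m)
WGS84_E2 = 6.69437999014e-3   # first eccentricity squared


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def homogeneous(R, t=(0.0, 0.0, 0.0)):
    """Pack a 3x3 rotation and a translation into a 4x4 matrix."""
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def geodetic_to_ecef(lat, lon, h):
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    return ((n + h) * math.cos(lat) * math.cos(lon),
            (n + h) * math.cos(lat) * math.sin(lon),
            (n * (1.0 - WGS84_E2) + h) * math.sin(lat))


def matrix_c(pan, tilt):
    """Camera -> body from gimbal angles (assumed axis conventions)."""
    axes = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # camera z -> body x
    return homogeneous(matmul(matmul(rot_z(pan), rot_y(tilt)), axes))


def matrix_b(roll, pitch, yaw):
    """Body -> local NED from UAV attitude (ZYX Euler angles)."""
    return homogeneous(matmul(matmul(rot_z(yaw), rot_y(pitch)), rot_x(roll)))


def matrix_a(lat, lon, h):
    """Local NED -> ECEF at the UAV's geodetic position (radians, metres)."""
    sl, cl = math.sin(lat), math.cos(lat)
    so, co = math.sin(lon), math.cos(lon)
    R = [[-sl * co, -so, -cl * co],
         [-sl * so, co, -cl * so],
         [cl, 0.0, -sl]]
    return homogeneous(R, geodetic_to_ecef(lat, lon, h))


def camera_to_ecef(p_cam, A, B, C):
    M = matmul(matmul(A, B), C)
    col = [[p_cam[0]], [p_cam[1]], [p_cam[2]], [1.0]]
    return [row[0] for row in matmul(M, col)][:3]


# UAV at latitude 0, longitude 0, 100 m height, level and facing north;
# target 16 m ahead on the optical axis.
p = camera_to_ecef((0.0, 0.0, 16.0),
                   matrix_a(0.0, 0.0, 100.0), matrix_b(0.0, 0.0, 0.0), matrix_c(0.0, 0.0))
print(p)  # 16 m north of the UAV, i.e. along the ECEF z axis at this location
```

Any lever arm between the gimbal and the UAV's position reference would enter as a translation in C or B; it is taken as zero here.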

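Simulation experiments in ROS/Gazebo validated the approach. A camera UAV approached targets at 3 m/s while performing localization. Depth estimates converged as the UAV closed in, with the error of the refined estimate falling from 8.83 m at a true distance of 27.97 m to 0.09 m at 6.31 m: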

True Distance (m)	Measured (m)	Refined (m)	Refined Error (m)
27.97	32.71	36.80	8.83
17.52	17.07	18.50	0.98
11.62	9.41	11.83	0.21
6.31	6.33	6.22	0.09
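
The last column appears to be the absolute difference between the refined estimate and the true distance. The short sketch below reproduces it from the tabulated values and adds the corresponding relative error; the function name is hypothetical.

```python
def refined_errors(rows):
    """Return (absolute error in m, relative error in %) of the refined estimate per row."""
    out = []
    for true_d, _measured, refined in rows:
        abs_err = abs(refined - true_d)
        out.append((round(abs_err, 2), round(100.0 * abs_err / true_d, 1)))
    return out


# (true, measured, refined) distances in metres, copied from the table above
table = [(27.97, 32.71, 36.80), (17.52, 17.07, 18.50),
         (11.62, 9.41, 11.83), (6.31, 6.33, 6.22)]
for (true_d, _, _), (abs_err, rel_err) in zip(table, refined_errors(table)):
    print(f"{true_d:6.2f} m  abs={abs_err:.2f} m  rel={rel_err:.1f} %")
```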

Positioning accuracy improved as the camera drone closed the distance to the target, with errors in longitude (λ), latitude (φ), and height (H) of:

$$ \Delta \lambda < 0.00001^\circ, \quad \Delta \phi < 0.00001^\circ, \quad \Delta H < 0.2 \text{ m} $$
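
For a sense of scale, an angular bound can be converted to an approximate ground distance. The sketch below uses a spherical approximation with the WGS-84 semi-major axis; the function name and the 45 degree example latitude are assumptions, since the source does not state where the simulation was located.

```python
import math

WGS84_A = 6378137.0  # WGS-84 semi-major axis in metres


def angular_error_to_metres(delta_deg, latitude_deg=0.0, along="north"):
    """Approximate ground distance for a small angular error (spherical approximation)."""
    arc = math.radians(delta_deg) * WGS84_A
    if along == "east":
        # Meridians converge toward the poles, shrinking east-west distances.
        arc *= math.cos(math.radians(latitude_deg))
    return arc


north = angular_error_to_metres(0.00001)
east = angular_error_to_metres(0.00001, 45.0, along="east")
print(north)  # about 1.11 m north-south
print(east)   # about 0.79 m east-west at 45 degrees latitude
```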

Error analysis across 10 trials showed consistent performance:

Initial distance errors: 8-12% (20-30 m range)
Stabilized errors: 1-3% (within 15 m range)
Final positioning error: <0.5 m absolute

This monocular approach enables camera UAVs to conduct passive target localization without prior dimensional knowledge. The method demonstrates particular value for small UAVs operating in GPS-denied environments where active sensors are impractical. Future work will address dynamic targets and adverse weather conditions.