
Target localization from camera drones typically depends on prior knowledge of the environment or of the target's dimensions. This research introduces a monocular vision method that removes both dependencies. The approach combines deep-learning target detection, motion-based depth estimation, and coordinate transformation to position targets without active ranging sensors.
The target detection framework utilizes an enhanced YOLOv5-Lite architecture. Key modifications include:
- Replacement of Focus layers with Conv-BatchNorm-ReLU blocks
- Integration of Squeeze-and-Excitation attention mechanisms
- Optimized memory access in the FPN+PAN neck structure
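As a rough illustration of the second modification, the sketch below implements a generic Squeeze-and-Excitation step in plain Python. It is not the authors' layer: the weight matrices `w_reduce` and `w_expand` are hypothetical placeholders, biases are omitted, and where the block sits inside the network is not specified in the text.

```python
import math


def squeeze_excite(feature_map, w_reduce, w_expand):
    """Rescale feature_map[c][h][w] with per-channel attention gates.

    w_reduce is a (C/r) x C matrix and w_expand a C x (C/r) matrix.
    Both are hypothetical weights; biases are left out for brevity.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    # Excitation, step 2: expand back to C channels and apply a sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    return [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]


# Two 2x2 channels with made-up weights (reduction ratio r = 2).
features = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
rescaled = squeeze_excite(features, [[0.5, 0.5]], [[1.0], [-1.0]])
```

The gates lie in (0, 1), so the block can only attenuate channels relative to one another; it adds two small matrix products per feature map.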
In the comparison below, the improved model attains the highest accuracy, recall, and mAP of the four detectors, and only YOLOv4-Tiny runs faster (165.2 vs 156.1 fps):
| Detection Model | Accuracy (%) | Recall (%) | mAP (%) | Speed (fps) |
|---|---|---|---|---|
| Improved YOLOv5-Lite | 93.3 | 92.5 | 93.1 | 156.1 |
| YOLOv5-Lite | 91.2 | 90.3 | 90.5 | 140.5 |
| YOLOv4-Tiny | 88.4 | 87.4 | 85.9 | 165.2 |
| Faster R-CNN | 91.7 | 91.2 | 90.8 | 31.8 |
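Monocular ranging exploits the relationship between bounding box dimensions and camera drone motion. For a point $(X, Y, Z_i)$ in the camera frame at frame $i$, with focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$ in pixels, the perspective projection is: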
$$ x = f_x \frac{X}{Z_i} + c_x, \quad y = f_y \frac{Y}{Z_i} + c_y $$
The bounding-box width $w_i$ in image space relates to the physical target width $W$ through:
$$ w_i = f_x \frac{W}{Z_i} $$
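For consecutive frames $i$ and $j$ during UAV movement along the optical axis, with $Z_j = Z_i + \Delta Z_{ij}$ (so $\Delta Z_{ij}$ is negative while approaching):
$$ w_i Z_i = w_j (Z_i + \Delta Z_{ij}) = f_x W $$
Eliminating $Z_i$ and the unknown product $f_x W$, the depth estimation equation becomes:
$$ Z_j = \frac{\Delta Z_{ij}}{1 - \frac{w_j}{w_i}} $$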
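The two-frame estimate can be sketched as follows, solving the relation $w_i Z_i = w_j (Z_i + \Delta Z_{ij})$ directly for the depth at frame $j$. The sketch assumes that $\Delta Z_{ij} = Z_j - Z_i$ is a signed quantity, negative while the camera approaches the target. The function name and the numbers in the usage lines are hypothetical.

```python
def depth_from_two_frames(w_i, w_j, delta_z_ij):
    """Depth Z_j at frame j from box widths (pixels) in frames i and j.

    delta_z_ij = Z_j - Z_i is the signed change in depth, so it is
    negative when the camera moves toward the target (assumed convention).
    """
    if w_i == w_j:
        raise ValueError("box width did not change; depth is unobservable")
    # From w_i * Z_i = w_j * Z_j with Z_i = Z_j - delta_z_ij.
    return w_i * delta_z_ij / (w_i - w_j)


# Hypothetical numbers: f_x = 800 px, W = 2 m, camera moves from 20 m to 17 m.
f_x, W = 800.0, 2.0
w_i = f_x * W / 20.0  # 80 px
w_j = f_x * W / 17.0  # about 94.1 px
z_j = depth_from_two_frames(w_i, w_j, -3.0)  # recovers 17 m up to rounding
```

Neither `f_x` nor `W` is passed to the function; they are used here only to synthesize consistent box widths, which is what lets the method run without prior target dimensions.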
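Least squares refinement combines the width and height observations from $n$ frames, using the analogous height relation $h_i = f_y H / Z_i$ and treating $Z_j$, $f_x W$, and $f_y H$ as unknowns: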
$$ \begin{bmatrix}
w_1 & 1 & 0 \\
h_1 & 0 & 1 \\
w_2 & 1 & 0 \\
\vdots & \vdots & \vdots \\
h_n & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\hat{Z}_j \\
-\widehat{f_x W} \\
-\widehat{f_y H}
\end{bmatrix}
=
\begin{bmatrix}
w_1 \Delta Z_{1j} \\
h_1 \Delta Z_{1j} \\
w_2 \Delta Z_{2j} \\
\vdots \\
h_n \Delta Z_{nj}
\end{bmatrix} $$
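A minimal sketch of this refinement, assuming the unknown vector is $[Z_j, -f_x W, -f_y H]$ and that $\Delta Z_{ij} = Z_j - Z_i$ is signed. It solves the normal equations with a small hand-written Gaussian elimination so that it needs no external libraries; the function names and test data are hypothetical, and a production system would more likely use a linear algebra library with a QR or SVD solver.

```python
def solve_3x3(m, v):
    """Solve m @ x = v by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]  # augmented matrix
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system; frames must differ in depth")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x


def refine_depth(widths, heights, delta_z):
    """Least-squares estimate of [Z_j, -f_x*W, -f_y*H].

    widths[i], heights[i]: box size in pixels at frame i.
    delta_z[i] = Z_j - Z_i: signed depth change from frame i to frame j.
    """
    rows, rhs = [], []
    for w, h, dz in zip(widths, heights, delta_z):
        rows.append([w, 1.0, 0.0])  # w_i * Z_j - f_x*W = w_i * dZ_ij
        rhs.append(w * dz)
        rows.append([h, 0.0, 1.0])  # h_i * Z_j - f_y*H = h_i * dZ_ij
        rhs.append(h * dz)
    # Normal equations: (A^T A) x = A^T y.
    ata = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    aty = [sum(r[p] * y for r, y in zip(rows, rhs)) for p in range(3)]
    return solve_3x3(ata, aty)


# Hypothetical noise-free data: f_x = f_y = 800 px, W = 2 m, H = 1 m.
depths = [20.0, 18.0, 16.0]  # the last entry is frame j
widths = [800.0 * 2.0 / z for z in depths]
heights = [800.0 * 1.0 / z for z in depths]
delta_z = [depths[-1] - z for z in depths]
z_j, neg_fx_w, neg_fy_h = refine_depth(widths, heights, delta_z)
```

With noise-free inputs the solver returns the true depth of frame $j$; with noisy bounding boxes, the additional rows average out individual detection errors.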
Coordinate transformation involves seven reference frames:
- Pixel coordinates (u,v)
- Image coordinates (x,y)
- Camera frame (Xc,Yc,Zc)
- Body frame (Xb,Yb,Zb)
- Local NED frame (Xv,Yv,Zv)
- ECEF frame (Xe,Ye,Ze)
- Geodetic coordinates (B,L,H): latitude, longitude, height
The complete transformation from camera to ECEF coordinates is:
$$ \begin{bmatrix} x^e_T \\ y^e_T \\ z^e_T \\ 1 \end{bmatrix} = \mathbf{A} \mathbf{B} \mathbf{C} \begin{bmatrix} x^c_T \\ y^c_T \\ z^c_T \\ 1 \end{bmatrix} $$
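where the matrices are applied right to left and account for:
- Camera gimbal angles (matrix C, camera frame to body frame)
- UAV attitude (matrix B, body frame to local NED frame)
- Geodetic position (matrix A, local NED frame to ECEF frame)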
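The chain can be sketched with 4x4 homogeneous matrices as below. Several simplifications are assumptions of this sketch, not statements from the method: gimbal and attitude angles use a yaw-pitch-roll (Z-Y-X) convention, the camera axes coincide with the body axes at zero gimbal angles, lever arms between camera, gimbal, and body are zero, and the ellipsoid is WGS-84. All function names are hypothetical.

```python
import math

WGS84_A = 6378137.0                   # semi-major axis, metres
WGS84_F = 1.0 / 298.257223563         # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)  # first eccentricity squared


def matmul(p, q):
    """Plain nested-list matrix product."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def rot_zyx(yaw, pitch, roll):
    """Rotation Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def homogeneous(rot, trans):
    """4x4 transform from a 3x3 rotation and a 3-vector translation."""
    return [rot[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def geodetic_to_ecef(lat, lon, h):
    """Latitude and longitude in radians, height in metres, to ECEF metres."""
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    return [
        (n + h) * math.cos(lat) * math.cos(lon),
        (n + h) * math.cos(lat) * math.sin(lon),
        (n * (1.0 - WGS84_E2) + h) * math.sin(lat),
    ]


def ned_to_ecef_rotation(lat, lon):
    """Columns are the north, east, and down unit vectors expressed in ECEF."""
    sl, cl = math.sin(lat), math.cos(lat)
    so, co = math.sin(lon), math.cos(lon)
    return [
        [-sl * co, -so, -cl * co],
        [-sl * so, co, -cl * so],
        [cl, 0.0, -sl],
    ]


def camera_to_ecef(p_cam, gimbal, attitude, lat, lon, h):
    """Apply A @ B @ C to a camera-frame point; angle tuples are (yaw, pitch, roll)."""
    zero = [0.0, 0.0, 0.0]  # assumption: no lever arms
    c = homogeneous(rot_zyx(*gimbal), zero)    # camera -> body
    b = homogeneous(rot_zyx(*attitude), zero)  # body -> local NED
    a = homogeneous(ned_to_ecef_rotation(lat, lon),
                    geodetic_to_ecef(lat, lon, h))  # local NED -> ECEF
    abc = matmul(a, matmul(b, c))
    column = [[value] for value in p_cam] + [[1.0]]
    return [row[0] for row in matmul(abc, column)][:3]


# Hypothetical check: UAV 100 m above the equator at zero longitude, level
# attitude, target 10 m along the local down axis.
target_ecef = camera_to_ecef([0.0, 0.0, 10.0], (0.0, 0.0, 0.0),
                             (0.0, 0.0, 0.0), 0.0, 0.0, 100.0)
```

A real system would also need the fixed axis permutation between the optical frame (Z along the optical axis) and the gimbal frame, plus a final ECEF-to-geodetic conversion to report latitude, longitude, and height.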
Simulation experiments within ROS/Gazebo validated the approach. A camera UAV approached targets at 3 m/s while performing localization. The error of the refined depth estimate fell from 8.83 m at a true distance of 27.97 m to 0.09 m at 6.31 m:
| True Distance (m) | Measured (m) | Refined (m) | Absolute Error of Refined (m) |
|---|---|---|---|
| 27.97 | 32.71 | 36.80 | 8.83 |
| 17.52 | 17.07 | 18.50 | 0.98 |
| 11.62 | 9.41 | 11.83 | 0.21 |
| 6.31 | 6.33 | 6.22 | 0.09 |
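Positioning accuracy improved as the camera drone closed the distance to the target, with the errors in longitude ($\lambda$), latitude ($\phi$), and height ($H$) bounded by: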
$$ \Delta \lambda < 0.00001^\circ, \quad \Delta \phi < 0.00001^\circ, \quad \Delta H < 0.2 \text{ m} $$
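To put the angular bounds on the same scale as the height bound, the sketch below converts small latitude and longitude errors to approximate ground distances. It assumes a spherical Earth with a mean radius of 6371 km. The latitude argument is a free parameter, since the simulation site's latitude is not given.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation (assumption)


def angular_error_to_metres(d_lat_deg, d_lon_deg, lat_deg):
    """Approximate north and east offsets in metres for small angular errors."""
    north = math.radians(d_lat_deg) * EARTH_RADIUS_M
    # Meridians converge toward the poles, so east-west distance shrinks with cos(lat).
    east = math.radians(d_lon_deg) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))
    return north, east


# 0.00001 degrees in each direction, evaluated at the equator.
north_m, east_m = angular_error_to_metres(1e-5, 1e-5, 0.0)
```

Under this approximation, 0.00001° of latitude is roughly 1.1 m, so the stated angular limits act as an upper bound on horizontal error at about the metre level.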
Error analysis across 10 trials showed consistent performance:
- Initial distance errors: 8-12% at 20-30 m range
- Stabilized errors: 1-3% within 15 m range
- Final positioning error: <0.5 m absolute
This monocular approach enables camera UAVs to conduct passive target localization without prior dimensional knowledge. The method demonstrates particular value for small UAVs operating in GPS-denied environments where active sensors are impractical. Future work will address dynamic targets and adverse weather conditions.
