TreeConT: A Vertical Structure-Optimized Network for Tree Species Classification from UAV Point Clouds

Accurate tree species classification at the individual tree level is fundamental for fine-scale forest inventory, biodiversity conservation, and monitoring of rare or protected species. Unmanned aerial vehicle (UAV) laser scanning (ULS) provides high-density three-dimensional point clouds capable of capturing detailed forest vertical structures. However, individual-tree point clouds acquired by UAVs often exhibit severe vertical density heterogeneity: the canopy region is densely sampled due to multiple laser returns from foliage and branches, while the stem and lower crown areas are sparsely sampled because of occlusion and limited penetration. This density imbalance poses a significant challenge for deep learning models that rely on conventional point cloud sampling strategies, such as farthest point sampling (FPS). When FPS is repeatedly applied during multi-stage down-sampling within a network, the sampling bias towards high-density canopy points tends to amplify across stages, thereby weakening structurally critical but low-density regions such as stems. Such structural information is essential for discriminating species with similar crown appearances. To address this issue, we propose TreeConT, a Transformer-style network designed to preserve vertical structural cues in ULS individual-tree point clouds. The network incorporates two complementary components: a vertical-structure-aware progressive sampling strategy called Tree-FPS, and an enhanced input representation that fuses geometric descriptors with LiDAR intensity. Extensive experiments on a challenging 11-class dataset derived from ULS data across multiple phenological phases show that TreeConT achieves an overall accuracy of 93.12%, a Macro-F1 of 91.66%, and a Kappa of 0.9207, outperforming PointNet++, DGCNN, PointTransformer, PointNeXt, PointMLP, Mamba3D, and its baseline PointConT.

Introduction

Tree species identification at the individual tree level provides critical support for understanding forest ecosystem structure, dynamics, and functions. It also facilitates forest resource management, rare species monitoring, and ecological balance maintenance. UAV laser scanning (ULS) has emerged as a powerful tool for acquiring high-resolution three-dimensional point clouds that capture fine-scale morphological features of trees. ULS data can characterize canopy distribution, branch architecture, and vertical stratification with high spatial detail, making it particularly suitable for tree-level species classification in mixed forests.

Existing methods for tree species classification using ULS point clouds can be broadly categorized into two paradigms: feature-engineering-driven approaches and deep-learning-driven approaches. The former relies on handcrafted structural parameters or spectral indices derived from point clouds, often combined with multisource data such as hyperspectral imagery, and then input into traditional classifiers like random forests. While such methods offer interpretability, they are limited in capturing complex three-dimensional geometric patterns inherent in tree morphology. In contrast, deep learning methods automatically learn hierarchical spatial representations directly from the raw point cloud, bypassing the need for manual feature design. Pioneering works such as PointNet revolutionized point cloud processing by directly operating on unordered point sets, but they lack the ability to model local geometric relationships. Subsequent architectures like PointNet++, DGCNN, and PointTransformer introduced local neighborhood aggregation and self-attention mechanisms to improve feature extraction. Recently, the PointConT network proposed a context-based Transformer with Inception feature aggregation, achieving state-of-the-art performance on standard benchmarks.

Despite these advances, several challenges remain when applying deep learning to ULS individual-tree point clouds. First, the vertical density heterogeneity (high point density in the canopy and low density in the stem) leads to biased sampling during the progressive down-sampling stages inherent in hierarchical networks. Traditional farthest point sampling (FPS) favors dense regions, causing the stem and the low-density mid-lower canopy to be underrepresented in the learned features. This structural weakness hinders the discrimination of species that differ mainly in trunk shape or branch arrangement rather than canopy texture. Second, the input representation typically consists only of three-dimensional coordinates (x, y, z), which does not exploit other available information, such as LiDAR intensity and geometric descriptors that are sensitive to tree-specific material and structure.

To overcome these limitations, we propose a novel network, TreeConT, which is specifically optimized for the vertical structure of individual trees in UAV point clouds. The key innovations are:

A vertical-structure-aware progressive sampling strategy termed Tree-FPS (Tree Vertical Structure Farthest Point Sampling). Tree-FPS performs height-wise stratification and allocates sampling quotas using an inverse-density principle, followed by local FPS within each stratum. This ensures that sparse stem and mid-lower canopy regions remain sufficiently represented throughout the multi-stage feature abstraction, preserving the vertical skeleton of the tree.
An enhanced input representation that fuses xyz coordinates with LiDAR intensity and three geometric descriptors (pointness, linearity, and sphericity). These complementary features capture both shape and material properties, boosting the discriminative power of the network.

We evaluate TreeConT on the SYSSIFOSS ULS dataset, which includes 12 one-hectare forest plots in Germany acquired during multiple phenological phases (August–September 2019, December 2020, March–April 2021). After preprocessing (denoising, ground filtering, individual tree segmentation), we construct an 11-class dataset called TreeNetXplorer containing 18,749 individual trees. Experiments demonstrate that TreeConT achieves outstanding performance: overall accuracy (OA) of 93.12%, Macro-F1 of 91.66%, and Kappa of 0.9207, surpassing all compared baselines. Controlled studies confirm that replacing FPS with Tree-FPS alone reduces confusion for several hard classes, and feature fusion yields significant improvements.

Methodology

Baseline Architecture: PointConT

PointConT is a Transformer-style network designed for point cloud classification. Its architecture consists of five stacked Inception Feature Aggregator stages. The initial input point cloud $ \mathbf{P} \in \mathbb{R}^{N \times 3} $ is first projected into a C-dimensional feature space via overlap patch embedding. Each Inception stage reduces the number of points by half while doubling the feature dimension. The final global feature vector is obtained by global max pooling, followed by a linear classifier. The core innovation lies in the Inception Feature Aggregation module, which processes features in two parallel branches: one for high-frequency local details using EdgeConv and max pooling, and another for low-frequency global context using average pooling and context-based attention. The outputs of the two branches are concatenated and refined by MLPs. The Context-Based Attention mechanism groups points into clusters in feature space using a binary clustering algorithm and performs self-attention within each cluster, thereby capturing long-range dependencies efficiently.
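To make the halving and doubling rule concrete, the following sketch lists the per-stage tensor shapes it implies. The embedding width `embed_dim=32` is a hypothetical value chosen only for illustration (the text does not state C), and the sketch assumes the patch embedding keeps all N input points.

```python
def stage_schedule(n_points=4096, embed_dim=32, n_stages=5):
    """Return (num_points, feature_dim) after the embedding and after each stage.

    embed_dim is a placeholder; the value of C is not given in the text.
    """
    shapes = [(n_points, embed_dim)]
    for _ in range(n_stages):
        n, c = shapes[-1]
        # Each stage halves the point count and doubles the feature width.
        shapes.append((n // 2, c * 2))
    return shapes


for stage, (n, c) in enumerate(stage_schedule()):
    print(f"stage {stage}: {n} points x {c} channels")
```

Under these assumptions, five stages take 4,096 input points down to 128 points before global max pooling.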

Proposed TreeConT Network

Building upon PointConT, we introduce two key enhancements to address the vertical density heterogeneity of ULS individual-tree point clouds. Figure 1 illustrates the overall architecture. The modifications are:

Vertical-Structure-Aware Sampling (Tree-FPS): In each stage, we replace the standard FPS with Tree-FPS during progressive down-sampling.
Enhanced Input Representation: We augment the original xyz coordinates with LiDAR intensity and three geometric descriptors derived from PCA of local neighborhoods: pointness (DA1), linearity (DA2), and sphericity (DA3).

The network retains the Inception Feature Aggregator and Context-Based Transformer modules from PointConT for feature extraction and global modeling. The final classification head outputs predictions for the target tree species.

Tree-FPS: Vertical-Stratified Sampling

Standard FPS tends to select points from dense canopy regions, leading to loss of structural information from stems. Tree-FPS mitigates this by explicitly considering vertical density variation. The procedure is as follows:

Given a point cloud $ \mathcal{P} = \{p_m = (x_m, y_m, z_m)\}_{m=1}^{n} $, we compute the height range $[z_{\min}, z_{\max}]$ and divide it into $ k $ equal-height intervals (strata) of thickness $ \Delta h = (z_{\max} - z_{\min})/k $. We set $ k = 50 $ as default after balancing structural resolution and sampling stability.
For each stratum $ i $ (i = 1, 2, …, k), we count the number of points $ n_i $. The denser strata contain more points; sparser strata contain fewer.
We assign a sampling weight $ w_i $ to each stratum inversely proportional to its point count, using a stabilized formula:

$$ w_i = \lambda \cdot \frac{1}{(n_i + \epsilon)^\alpha} + (1 - \lambda) $$

where $ \epsilon $ is a small constant to avoid division by zero, $ \alpha $ controls the decay rate of the inverse relationship, and $ \lambda \in [0,1] $ mixes the inverse-density term with a uniform baseline to prevent extreme weighting. Based on empirical tuning, we use $ \alpha = 1 $, $ \lambda = 0.8 $, and $ \epsilon = 10^{-6} $.

The weights are normalized to produce a proportion of the total desired number of sample points $ N $ (which is 4096 in our experiments). The initial sampling quota for stratum $ i $ is:

$$ N_i = \left\lceil N \cdot \frac{w_i}{\sum_{j=1}^{k} w_j} \right\rceil $$

with the constraint $ N_i \le n_i $. We then perform local FPS within each stratum to select $ N_i $ points.

The sampled points from all strata are collected into set $ \mathcal{N}^{(1)} $. If the total number of sampled points $ |\mathcal{N}^{(1)}| < N $, we iteratively reapply the process on the remaining points: for each stratum, recompute $ n_i' $ (number of leftover points), recompute weights, allocate remaining quota $ R = N - |\mathcal{N}^{(1)}| $, and generate additional sets $ \mathcal{N}^{(2)}, \dots $ until the desired total is reached or no more points remain.

The final sampled point set is $ \mathcal{N} = \bigcup_{l=1}^{L} \mathcal{N}^{(l)} $. This inverse-density allocation ensures that less dense but structurally important regions (e.g., stems) receive a higher sampling proportion compared to FPS, thus preserving the vertical structure.
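The procedure above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function names `fps` and `tree_fps` are hypothetical, empty strata are given zero weight before normalization, strata are visited from highest to lowest weight, and each stratum's quota is truncated once the overall budget is exhausted (the ceiling in the quota formula can otherwise overshoot $ N $).

```python
import numpy as np


def fps(points, m, rng):
    """Plain farthest point sampling: return m distinct indices into `points`."""
    n = len(points)
    if m >= n:
        return np.arange(n)
    chosen = np.empty(m, dtype=int)
    chosen[0] = rng.integers(n)
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    dist[chosen[0]] = -1.0  # mark as taken so it is never re-selected
    for t in range(1, m):
        chosen[t] = int(np.argmax(dist))
        d = np.linalg.norm(points - points[chosen[t]], axis=1)
        dist = np.minimum(dist, d)
        dist[chosen[t]] = -1.0
    return chosen


def tree_fps(points, n_samples, k=50, alpha=1.0, lam=0.8, eps=1e-6, seed=0):
    """Height-stratified, inverse-density sampling; returns indices into `points`."""
    rng = np.random.default_rng(seed)
    z = points[:, 2]
    dh = (z.max() - z.min()) / k
    if dh > 0:
        # Stratum index per point; the topmost point is clipped into stratum k-1.
        strata = np.minimum(((z - z.min()) / dh).astype(int), k - 1)
    else:
        strata = np.zeros(len(points), dtype=int)
    remaining = np.ones(len(points), dtype=bool)
    need = min(n_samples, len(points))
    selected = []
    while need > 0:  # each pass is one round N^(1), N^(2), ...
        counts = np.bincount(strata[remaining], minlength=k)
        # Assumption: empty strata get zero weight instead of lam / eps**alpha.
        w = np.where(counts > 0, lam / (counts + eps) ** alpha + (1 - lam), 0.0)
        quota = np.minimum(np.ceil(need * w / w.sum()).astype(int), counts)
        for i in np.argsort(-w):  # assumption: sparsest strata are served first
            take = min(int(quota[i]), need)
            if take == 0:
                continue
            idx = np.flatnonzero(remaining & (strata == i))
            pick = idx[fps(points[idx], take, rng)]  # local FPS within the stratum
            selected.append(pick)
            remaining[pick] = False
            need -= take
    return np.concatenate(selected) if selected else np.empty(0, dtype=int)
```

On a synthetic tree with a dense canopy and a sparse stem, calling `tree_fps(points, 512)` keeps nearly all stem points, so the stem's share of the sample is much larger than its share of the input.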

Geometric Descriptors and Intensity Fusion

For each point, we compute local geometric features using its neighborhood defined by a fixed radius or k-nearest neighbors. The covariance matrix of the neighborhood points is:

$$ \mathbf{C} = \frac{1}{k} \sum_{i=1}^{k} (\mathbf{p}_i - \bar{\mathbf{p}})(\mathbf{p}_i - \bar{\mathbf{p}})^\top $$

where $ \bar{\mathbf{p}} $ is the centroid. Eigenvalues $ \lambda_1 \ge \lambda_2 \ge \lambda_3 $ of $ \mathbf{C} $ are computed. The three geometric descriptors are defined as:

Pointness (DA1): $ \frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3} $ – indicates how point-like the local distribution is.
Linearity (DA2): $ \frac{\lambda_1 - \lambda_2}{\lambda_1} $ – high linearity corresponds to elongated structures like branches or stem.
Sphericity (DA3): $ \frac{\lambda_3}{\lambda_1} $ – high sphericity indicates a local isotropic distribution, typical of foliage clusters.

These features, combined with the LiDAR intensity value, provide complementary information to the raw coordinates. Intensity can reflect differences in material properties (bark, leaf type) and water content, while geometric descriptors encode shape characteristics. We therefore construct the input feature vector for each point as $ [x, y, z, \text{intensity}, DA1, DA2, DA3] $.
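As an illustration of how the seven-channel input could be assembled, the sketch below computes the three descriptors from the eigenvalues of the local covariance matrix using the formulas above. It is a simplified version under assumptions: neighbours are found by brute force within a fixed radius, points with fewer than three neighbours receive zero descriptors, and the function names are hypothetical.

```python
import numpy as np


def geometric_descriptors(points, radius=0.3):
    """Return an (n, 3) array [DA1, DA2, DA3] per point from local covariance eigenvalues."""
    out = np.zeros((len(points), 3))
    for i, p in enumerate(points):
        # Brute-force radius search; a KD-tree would be preferable for large clouds.
        nbrs = points[np.linalg.norm(points - p, axis=1) <= radius]
        if len(nbrs) < 3:
            continue  # assumption: too few neighbours, keep zeros
        centred = nbrs - nbrs.mean(axis=0)
        cov = centred.T @ centred / len(nbrs)
        # eigvalsh returns ascending eigenvalues; reverse so lam[0] >= lam[1] >= lam[2].
        lam = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
        if lam[0] <= 0:
            continue
        out[i] = [lam[0] / lam.sum(),            # DA1
                  (lam[0] - lam[1]) / lam[0],    # DA2 (linearity)
                  lam[2] / lam[0]]               # DA3 (sphericity)
    return out


def build_input(points, intensity, radius=0.3):
    """Stack [x, y, z, intensity, DA1, DA2, DA3] into an (n, 7) feature array."""
    return np.hstack([points, intensity[:, None], geometric_descriptors(points, radius)])
```

For points lying on a straight vertical segment, such as an idealized stem, this yields a linearity close to 1 and a sphericity close to 0.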

Evaluation Metrics

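We employ standard classification metrics: Overall Accuracy (OA), Macro-Precision (MP), Macro-Recall (MR), Macro-F1 score, and Cohen’s Kappa coefficient. They are computed from the confusion matrix, where $ C $ is the number of classes and $ TP_c $, $ FP_c $, and $ FN_c $ are the true positives, false positives, and false negatives of class $ c $:

$ \text{OA} = \frac{\sum_{c=1}^{C} TP_c}{\text{Total Samples}} $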
$ \text{MP} = \frac{1}{C} \sum_{c=1}^C \frac{TP_c}{TP_c + FP_c} $
$ \text{MR} = \frac{1}{C} \sum_{c=1}^C \frac{TP_c}{TP_c + FN_c} $
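$ \text{Macro-F1} = \frac{1}{C} \sum_{c=1}^C \frac{2 \cdot P_c \cdot R_c}{P_c + R_c} $, where $ P_c = \frac{TP_c}{TP_c + FP_c} $ and $ R_c = \frac{TP_c}{TP_c + FN_c} $ are the per-class precision and recall.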
$ \text{Kappa} = \frac{\text{OA} - P_e}{1 - P_e} $, where $ P_e $ is the expected accuracy by chance.
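For reference, these metrics can be computed from a confusion matrix as in the sketch below. Two conventions are assumed here rather than taken from the text: rows hold true classes and columns hold predictions, and Macro-F1 is the unweighted mean of per-class F1 scores. The function name is hypothetical.

```python
import numpy as np


def classification_metrics(cm):
    """Compute OA, MP, MR, Macro-F1 and Cohen's Kappa from a square confusion matrix.

    Assumes cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    pred_totals = cm.sum(axis=0)  # TP_c + FP_c
    true_totals = cm.sum(axis=1)  # TP_c + FN_c
    # Classes with no predictions (or no samples) get precision (or recall) 0.
    prec = np.divide(tp, pred_totals, out=np.zeros_like(tp), where=pred_totals > 0)
    rec = np.divide(tp, true_totals, out=np.zeros_like(tp), where=true_totals > 0)
    f1 = np.divide(2 * prec * rec, prec + rec, out=np.zeros_like(tp), where=(prec + rec) > 0)
    oa = tp.sum() / total
    # Chance agreement: sum over classes of (row marginal * column marginal).
    pe = (pred_totals * true_totals).sum() / total ** 2
    return {
        "OA": oa,
        "MP": prec.mean(),
        "MR": rec.mean(),
        "Macro-F1": f1.mean(),
        "Kappa": (oa - pe) / (1 - pe),
    }


metrics = classification_metrics([[8, 2], [1, 9]])
print({name: round(value, 4) for name, value in metrics.items()})
```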

Dataset and Experimental Setup

TreeNetXplorer Dataset

The original data source is the SYSSIFOSS ULS dataset, collected over 12 one-hectare forest plots in Baden-Württemberg, Germany. The ULS sensor was a RIEGL miniVUX-1UAV mounted on a DJI Matrice 600 Pro UAV. Flights were conducted in August–September 2019 (leaf-on), December 2020, and March–April 2021 (leaf-off). The average point density across plots ranged from 797 to 1,554 pts/m². After denoising, ground filtering using cloth simulation filtering (CSF), and individual tree segmentation, we extracted single-tree point clouds. Only trees with high-quality segmentation and reliable species labels from field inventory were retained. We selected 11 dominant species: Acer pseudoplatanus (AcePse), Carpinus betulus (CarBet), Fagus sylvatica (FagSyl), Juglans regia (JugReg), Larix decidua (LarDec), Picea abies (PicAbi), Pinus sylvestris (PinSyl), Prunus avium (PruAvi), Pseudotsuga menziesii (PseMen), Quercus petraea (QuePet), and Quercus rubra (QueRub). The final TreeNetXplorer dataset contains 18,749 trees.

Table 1 shows the sample distribution across training, validation, and test sets. To keep class proportions comparable across the three subsets, we performed stratified random splitting per species (approximately 60% training, 20% validation, 20% test). Each tree point cloud was sampled to a fixed size of 4,096 points using standard FPS (for all models except those explicitly testing Tree-FPS). Data augmentation included random rotation around the Z-axis and random scaling between 0.9 and 1.1.

Table 1: Number of samples per species in the TreeNetXplorer dataset.
Species	Training	Validation	Test
AcePse	500	150	200
CarBet	775	250	275
FagSyl	1,550	512	537
JugReg	1,175	375	425
LarDec	400	100	100
PicAbi	1,575	525	525
PinSyl	400	125	150
PruAvi	550	150	250
PseMen	1,725	600	625
QuePet	1,200	400	425
QueRub	1,325	450	425
Total	11,175	3,637	3,937
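The splitting and augmentation steps described above can be sketched as follows. This is an illustration under assumptions: the scaling is taken to be isotropic and applied to the xyz coordinates only, the split fractions are rounded per class, and the function names are hypothetical.

```python
import numpy as np


def augment(points, rng, scale_range=(0.9, 1.1)):
    """Rotate xyz about the Z-axis by a random angle, then scale isotropically."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return (points @ rot.T) * rng.uniform(*scale_range)


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split indices into train/val/test so each class follows `fractions`."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to the test set
    return np.array(train), np.array(val), np.array(test)
```

Rotation about the Z-axis leaves the height coordinate untouched, so the vertical profile that Tree-FPS relies on changes only through the scale factor.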

Implementation Details

All experiments were conducted using PyTorch on an NVIDIA RTX 3090 GPU with 24 GB memory. For TreeConT and all baseline models, we used a batch size of 64, trained for 300 epochs, with the AdamW optimizer and an initial learning rate of $ 1 \times 10^{-4} $. Learning rate decay was applied using a cosine annealing schedule. The input point size was fixed at 4,096 for all models. For TreeConT, the Tree-FPS parameters were set as: $ k = 50 $, $ \alpha = 1 $, $ \lambda = 0.8 $, $ \epsilon = 10^{-6} $. The geometric neighbor radius was set to 0.3 m based on the typical point spacing.
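The learning rate schedule implied by these settings can be written in closed form. The sketch below uses the standard cosine annealing expression; the floor `min_lr=0.0`, the per-epoch (rather than per-step) update, and keeping the last partial batch are assumptions, since the text does not specify them.

```python
import math


def cosine_lr(epoch, total_epochs=300, base_lr=1e-4, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (min_lr is an assumed floor)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term


# Iterations per epoch for 11,175 training trees at batch size 64,
# assuming the final partial batch is kept.
steps_per_epoch = math.ceil(11175 / 64)

for epoch in (0, 75, 150, 225, 300):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.2e}")
```

With these values the rate starts at $ 1 \times 10^{-4} $, passes through half of that at epoch 150, and decays to the assumed floor at epoch 300.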

Results

Overall Classification Performance

We compared TreeConT against seven state-of-the-art point cloud classification methods: PointNet++, DGCNN, PointTransformer, PointNeXt, PointMLP, Mamba3D, and the baseline PointConT. The baselines take xyz coordinates as input, while the TreeConT row corresponds to the full feature set (xyz + DA1–DA3 + intensity), whose metrics are identical to the last row of Table 3; the contribution of each feature type is examined separately in the ablation study. Table 2 reports the evaluation metrics.

Table 2: Classification accuracy metrics for all models on the TreeNetXplorer test set.
Model	OA (%)	MP (%)	MR (%)	Macro-F1 (%)	Kappa
PointNet++	66.75	63.06	52.77	52.11	0.6080
DGCNN	78.98	77.73	66.66	66.52	0.7555
PointTransformer	86.78	85.66	82.32	83.46	0.8475
PointNeXt	84.29	82.91	76.60	78.55	0.8183
PointMLP	89.56	86.27	83.88	84.22	0.8791
Mamba3D	89.67	89.12	88.60	87.91	0.8809
PointConT	91.31	91.02	90.95	90.56	0.9000
TreeConT (proposed)	93.12	92.45	91.03	91.66	0.9207

TreeConT achieves the highest scores across all metrics, with OA of 93.12%, Macro-F1 of 91.66%, and Kappa of 0.9207. It outperforms the strong baseline PointConT (OA 91.31%) by 1.81 percentage points and Mamba3D (OA 89.67%) by 3.45 percentage points. We attribute the improvements to both the Tree-FPS sampling strategy and the enhanced input features.

Ablation Studies

Effect of Tree-FPS Sampling

To isolate the effect of the sampling strategy, we compared PointConT (with FPS) and TreeConT (with Tree-FPS) while restricting the input of both to xyz coordinates. Replacing FPS with Tree-FPS alone improves OA from 78.85% to 80.24%. The full metric table for this comparison is omitted for space; the controlled comparison is instead drawn from the confusion matrices. These show that Tree-FPS reduces confusion for hard classes such as CarBet (recall increases from 53.82% to 66.18%) and improves overall classification consistency.

We further visualize the sampling distributions for three example trees in Figure 2 (conceptual description). At low sampling budgets (e.g., 128 points), FPS heavily samples the canopy while losing stems almost entirely. Tree-FPS maintains stem continuity and produces a more balanced representation across the vertical profile.

Effect of Feature Fusion

We conducted four controlled experiments on the TreeConT architecture to assess the contribution of each feature type:

Experiment 1: xyz only (baseline).
Experiment 2: xyz + intensity.
Experiment 3: xyz + geometric descriptors (DA1–DA3).
Experiment 4: xyz + DA1–DA3 + intensity (full set).

Results are shown in Table 3.

Table 3: Ablation study of input features using TreeConT architecture (with Tree-FPS).
Input Features	OA (%)	MP (%)	MR (%)	Macro-F1 (%)	Kappa
xyz only	78.85	81.45	67.44	69.56	0.7513
xyz + intensity	86.87	88.30	78.49	78.49	0.8470
xyz + DA1–DA3	84.29	86.31	76.40	78.18	0.8166
xyz + DA1–DA3 + intensity	93.12	92.45	91.03	91.66	0.9207

Adding intensity alone improves OA by 8.02 percentage points (from 78.85% to 86.87%), while geometric descriptors alone improve OA by 5.44 percentage points (to 84.29%). The full combination adds a further 6.25 percentage points over the best single-addition case (from 86.87% to 93.12%), for a total gain of 14.27 percentage points over the xyz-only baseline. This demonstrates that intensity and geometric features provide complementary discriminative information, and that their fusion is essential for optimal performance.

Class-wise Analysis

Figure 3 (conceptual) compares confusion matrices of four representative models: PointMLP, Mamba3D, PointConT, and TreeConT (with full features). TreeConT shows significantly fewer off-diagonal confusions, especially for challenging pairs such as CarBet versus FagSyl, and QuePet versus QueRub. The recall for CarBet improves from about 54% (PointConT with xyz only) to over 76% in the full TreeConT. This validates that the vertical structure optimization is particularly beneficial for species with similar canopy appearance but distinct trunk or branch patterns.

Discussion

Advantages of TreeConT

The empirical results demonstrate that TreeConT effectively addresses the vertical density heterogeneity inherent in ULS point clouds. The Tree-FPS sampling mechanism ensures that low-density but structurally informative regions (stems, lower branches) are not neglected during multi-scale feature abstraction. This is critical because many tree species differ in trunk morphology (e.g., smooth versus fissured bark, straight versus sinuous stem), and such differences are captured primarily in the point patterns of the lower part of the tree. Standard FPS, by over-sampling the canopy, washes out these discriminative cues.

The integration of geometric descriptors and intensity further boosts performance. Intensity, in particular, adds a dimension related to material properties (e.g., bark reflectance) that can vary among species even when geometric shapes are similar. The geometric descriptors encode local shape characteristics that are robust to point density variations. Their combination yields a richer representation that enables the model to separate classes that are otherwise hard to distinguish.

Limitations and Trade-offs

While TreeConT outperforms existing methods, it has certain limitations. The Tree-FPS strategy involves hyperparameters (k, α, λ) that require tuning. Although we found a robust default setting, performance may vary across datasets with different vertical density profiles. Additionally, the inverse-density weighting compresses canopy regions, potentially losing fine details such as small twigs or leaf clusters that might be useful for distinguishing species with very similar crown textures. In situations where canopy features are the primary discriminators, a moderate degree of canopy preservation might be preferable; future work could explore dynamic weight adaptation based on local feature entropy.

Another limitation pertains to the reliance on LiDAR intensity. Intensity values are not radiometrically calibrated and can vary with acquisition parameters (flight altitude, scan angle, sensor gain). In our dataset, intensity provided consistent discrimination because the data were collected under similar conditions within the same project. When applying TreeConT to data from different UAV platforms or regions, intensity normalization or domain adaptation techniques may be necessary to maintain robustness. Alternatively, one could omit intensity and rely solely on geometric features, although performance would be lower.

Comparison with Other Sampling Strategies

Tree-FPS differs from other density-aware sampling techniques such as voxel grid downsampling or random sampling with density balancing. Voxel grid sampling treats each voxel equally, which can destroy thin structures (stems) by averaging points within a cell. Random sampling with inverse-density weighting can be computationally expensive due to rank ordering. Tree-FPS operates on top of a height stratification and performs local FPS within each stratum, which is both efficient and preserves spatial continuity. To the best of our knowledge, this is the first study to explicitly design a point cloud sampling strategy tailored to the vertical structure of individual trees in UAV laser scanning data.

Future Work

We plan to extend TreeConT in several directions. First, we will test its generalization across different forest types (tropical, boreal) and different UAV LiDAR sensors (Riegl, Livox, Velodyne) to validate robustness. Second, we aim to incorporate learnable weighting parameters in Tree-FPS that adapt to the point cloud density distribution of each tree, removing the need for fixed hyperparameters. Third, we will investigate the integration of multi-spectral or hyperspectral information from UAV imagery to complement LiDAR data, further improving classification accuracy for species with subtle morphological differences. Finally, we will explore the use of self-supervised pre-training to reduce the dependency on large labeled datasets.

Conclusion

We have presented TreeConT, a novel deep learning network for individual tree species classification using UAV laser scanning point clouds. The network introduces two key innovations: a vertical-structure-aware progressive sampling strategy (Tree-FPS) that preserves stem and lower canopy information through inverse-density height stratification, and an enhanced input representation combining coordinates, LiDAR intensity, and geometric descriptors (pointness, linearity, sphericity). Experiments on a challenging 11-class dataset derived from multi-temporal ULS data show that TreeConT achieves state-of-the-art results with an overall accuracy of 93.12%, Macro-F1 of 91.66%, and Kappa of 0.9207, outperforming strong baselines including PointConT, Mamba3D, and PointMLP. Ablation studies indicate that both Tree-FPS and the feature fusion contribute to the improvement. TreeConT provides a practical and effective solution for fine-scale forest inventory and biodiversity monitoring using UAVs, especially in mixed forests with pronounced vertical density heterogeneity.