EverWatch benchmark: training and evaluation data for detection and species classification of Everglades wading birds from airborne imagery (Q8062)

Dataset published at Zenodo repository.

    Statements

Training and evaluation data to support the development of wading bird species detection models in the Everglades.

Dataset contents

Training data: 5128 images with 50491 annotations
Evaluation data: 197 images with 4113 annotations

Dataset structure

All data is combined in everwatch_benchmark.zip. This archive contains image crops (stored as PNG files) and their associated annotations (stored in train.csv and test.csv). The annotation CSV files contain the following columns (a minimal loading sketch follows the label list below):

image_path - basename of the image
label - species label
xmin, xmax, ymin, ymax - coordinates of the ground-truth bounding box on the image

Labels

'Great Egret'
'Great Blue Heron'
'White Ibis'
'Roseate Spoonbill'
'Wood Stork'
'Snowy Egret'
'Anhinga'
'Unknown' (test only)
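A minimal sketch of reading the annotation tables with pandas, assuming everwatch_benchmark.zip has been extracted into the working directory; the printout calls are illustrative and not part of the dataset's tooling:

    # Load the EverWatch annotation tables; assumes the archive has been
    # extracted into the current working directory.
    import pandas as pd

    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    # Each row is one bird: image basename, species label, and the
    # ground-truth box corners (xmin, xmax, ymin, ymax) in pixels.
    print(train.columns.tolist())

    # Per-species counts, e.g. to check class balance before training.
    print(train["label"].value_counts())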
Annotation methods

To provide data for training and testing computer vision models, we used a pragmatic combination of labels from a variety of different approaches. Seven species were identified when labeling: White Ibis (Eudocimus albus), Great Egret (Ardea alba), Great Blue Heron (Ardea herodias), Snowy Egret (Egretta thula), Wood Stork (Mycteria americana), Roseate Spoonbill (Platalea ajaja), and Anhinga (Anhinga anhinga). Anhinga have not traditionally been included in airborne counts because they are difficult to see from the air, but they were included in our labeling as a nuisance class because they were occasionally detected by the models and misclassified as Great Blue Herons.

Crop level annotations

Our primary approach labeled birds in 3145 1500x1500 pixel crops from surveys conducted in 2020 and 2021. Crops were automatically extracted from the orthomosaics for each survey. Labeling was conducted using a private Zooniverse project, and every bird in each crop was annotated as a point location (the center of the bird) with a species label. The point-based labels were converted into bounding boxes for model training and evaluation by centering a bounding box on each point, using a 50x50 pixel box for larger species (Great Egret, Great Blue Heron, Wood Stork, Roseate Spoonbill, Anhinga) and a 36x36 pixel box for smaller species (White Ibis, Snowy Egret); a sketch of this conversion appears after this description.

We combined these point-based data with explicit bounding box-based labels created using a combination of QGIS and LabelStudio. We selected imagery for this labeling to help address small sample sizes for some species in the initial training datasets. Imagery was selected using a combination of expert field knowledge of where less common species were nesting and automated detection of likely rare species. For automated detection, we ran a preliminary model based on the crop data over large amounts of survey imagery and identified crops that the model thought included less common species; these crops were then provided to annotators for labeling (a selection sketch also appears after this description). This type of active learning, in which an existing model is used to identify images for labeling, is increasingly recognized as a valuable approach for rapidly producing improved machine learning models (Kellenberger et al. 2019, 2021; Norouzzadeh et al. 2021).

In combination, these crop-based annotations included 2948 of the 1500x1500 pixel crops, containing 41173 birds, which were used for model training. The remaining 197 crops, containing 4113 birds, were reserved for model evaluation. These annotations are indicated by 8-digit numeric file names.

Colony level annotations

We combined the crop-based labels with full colony counts from 2022, conducted as part of reporting maximum counts. This labeling was done in Photoshop by marking every bird in an entire colony as a point location with a species label, but, due to the scale of the labeling, the point locations are not placed carefully on the center of the bird. To convert these point locations into the bounding boxes needed for model training and evaluation, we used the bird detector module in the DeepForest Python package (Weinstein et al. 2022) to predict the locations of birds within the image. We then associated the species label from the Photoshop points with the overlapping bounding box prediction; for the minority of points that did not overlap a predicted bird, we drew a 1 m buffer around the point (an association sketch appears after this description). We used data from 13 full colony counts, totaling 9318 birds, which were cropped into 1983 1500x1500 pixel crops. The full colony counts did not overlap with the existing training and testing data from the 1500x1500 pixel crops because they were from different years. These annotations are indicated by variable-length file names that start with a site name, followed by a date, followed by a number.

Difficult to distinguish species

Model training for most object detection algorithms requires that all birds in an image are labeled. Therefore, for all labeling approaches, the annotator's best estimate of the species label was included even in cases where the species identity could not be conclusively determined from the imagery. The most common difficulty for annotators is distinguishing between White Ibis and Snowy Egrets: both species have white feathers, are of similar size, and appear similar in airborne imagery. As a result, annotators often use information beyond what is visibly apparent from individual birds to distinguish between these species, including what is known from the field about which species are present in the associated colonies and patterns in how the species aggregate in space.

Cleaning of evaluation dataset

Test data for evaluating the performance of the model was based exclusively on the crop-based labels. To ensure that the test data was as accurate as possible, all labels in the test set were checked and adjusted if necessary by the lead of the labeling effort (Lindsey Garner). This included checking the species identifications, adjusting any bounding boxes that did not properly contain the associated bird, and adding any labels that were missing entirely. To make it clear when a bird could not be confidently identified to species by a human, the labels for birds in the test set that could not be identified by the labeling lead were changed to indicate that the species was unknown (n = 94 birds out of 4113 test labels).
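The point-to-box conversion described under "Crop level annotations" can be sketched as follows. The function name, the edge clamping, and the crop-size default are our assumptions for illustration, not part of the published tooling:

    # Center a fixed-size box on a point label, with the size chosen by
    # species: 36x36 pixels for the smaller species, 50x50 for the rest.
    SMALL_SPECIES = {"White Ibis", "Snowy Egret"}

    def point_to_box(x, y, label, crop_size=1500):
        size = 36 if label in SMALL_SPECIES else 50
        half = size / 2
        # Clamp to the crop bounds so boxes near an edge stay valid
        # (an assumption; the dataset's exact edge handling is not stated).
        xmin = max(0, x - half)
        ymin = max(0, y - half)
        xmax = min(crop_size, x + half)
        ymax = min(crop_size, y + half)
        # Return corners in the same order as the annotation CSV columns.
        return xmin, xmax, ymin, ymax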
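The automated selection of imagery for rare-species labeling can be sketched as below. This is a stand-in, not the authors' actual pipeline: the checkpoint path and the rare-species set are hypothetical placeholders, and we simply use the DeepForest package's standard prediction call (Weinstein et al. 2022):

    # Flag crops where a preliminary model predicts a less common species,
    # so they can be routed to annotators for labeling.
    from pathlib import Path
    from deepforest import main

    RARE = {"Roseate Spoonbill", "Wood Stork", "Anhinga"}  # illustrative

    # Hypothetical checkpoint; the preliminary model is not distributed here.
    model = main.deepforest.load_from_checkpoint("preliminary_model.ckpt")

    to_label = []
    for crop_path in Path("survey_crops").glob("*.png"):
        # predict_image returns a DataFrame of detections (or None if empty).
        boxes = model.predict_image(path=str(crop_path))
        if boxes is not None and boxes["label"].isin(RARE).any():
            to_label.append(crop_path)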
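The colony-count association step can be sketched as below. We treat "overlap" as the labeled point falling inside a predicted box, and the ground sampling distance used to convert the 1 m buffer into pixels is a placeholder; both are assumptions on our part:

    import pandas as pd

    METERS_PER_PIXEL = 0.02  # placeholder GSD; the true value depends on the survey

    def associate(points, boxes):
        # points: columns x, y, label (Photoshop point annotations).
        # boxes: columns xmin, xmax, ymin, ymax (DeepForest predictions).
        rows = []
        for p in points.itertuples():
            hit = boxes[(boxes.xmin <= p.x) & (p.x <= boxes.xmax)
                        & (boxes.ymin <= p.y) & (p.y <= boxes.ymax)]
            if len(hit):
                # Transfer the species label to the containing prediction.
                b = hit.iloc[0]
                rows.append((b.xmin, b.xmax, b.ymin, b.ymax, p.label))
            else:
                # Fallback: a 1 m buffer box centered on the point.
                r = 1.0 / METERS_PER_PIXEL
                rows.append((p.x - r, p.x + r, p.y - r, p.y + r, p.label))
        return pd.DataFrame(rows, columns=["xmin", "xmax", "ymin", "ymax", "label"])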
    Publication date: 13 May 2024
    Version: v0.1.0
