Commit b2bcf69e authored by Nathan Tarr

Accidental overwritten with text from feature-branch

parent 34311299
# The Wildlife Wrangler
The abundance of wildlife occurrence datasets that are currently accessible can be valuable for efforts such as species distribution modeling and range delineation. However, the task of downloading and filtering occurrence records is often complex due to errors and uncertainties that are present in datasets (Tessarolo et al. 2017). This repository provides a high-level framework for acquiring and filtering occurrence data that are freely available through the Global Biodiversity Information Facility ([GBIF](https://gbif.org)) API and eBird Basic Dataset ([EBD](https://ebird.org/science/use-ebird-data/)).
The abundance of wildlife occurrence datasets that are currently accessible can be valuable for efforts such as species distribution modeling and range delineation. However, the task of downloading and filtering occurrence records is often complex due to errors and uncertainties that are present in datasets. This repository provides a high-level framework for acquiring and filtering occurrence data that are freely available through the Global Biodiversity Information Facility ([GBIF](https://gbif.org)) API and eBird Basic Dataset ([EBD](https://ebird.org/science/use-ebird-data/)). The wildlife-wrangler was designed with wildlife occurrence data in mind, and it accounts for numerous issues and challenges related to the application of occurrence records to species distribution modeling. Features that support transparency and build confidence for analyses and evaluations of species distributions are described in the User's Guide and a [published abstract](https://biss.pensoft.net/article/93823/).
In the wildlife-wrangler framework, records are requested from occurrence datasets and filtered according to species- and query-specific parameters. Filtered occurrence records are saved in a database. Options are available to store the details of taxa concepts and filter parameters as JSON files for reuse and reference. Additionally, Jupyter Notebook documents are created that describe the filtered data sets for the sake of documentation and filter set refinement.
## USGS Software Release Information
* A newer version of this software may be available. See https://code.usgs.gov/sas/bioscience/wildlife-wrangler/-/releases to view all releases.
......@@ -19,40 +22,6 @@ Development : Nathan Tarr (nmtarr@ncsu.edu, [orcid:0000-0003-2925-8948](https://
Contributors : Alexa McKerrow (amckerrow@usgs.gov, [orcid:0000-0002-8312-2905](https://orcid.org/0000-0002-8312-2905)),
Matthew Rubino (mjrubino@ncsu.edu, [orcid:0000-0003-0651-3053](https://orcid.org/0000-0003-0651-3053)), Jill Adelstein, Curtis Belyea (cbelyea@usgs.gov, [orcid:0000-0001-5141-579X](https://orcid.org/0000-0001-5141-579X)).
## Framework
Records are requested from occurrence datasets and filtered according to species- and query-specific parameters. Filtered occurrence records are saved in a database. The details of taxa concepts and filter parameters can be stored as JSON files for reuse and reference. Additionally, Jupyter Notebook documents are created that describe the filtered datasets for the sake of documentation and filter set refinement.
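The storage pattern described above can be sketched roughly as follows. This is a minimal illustration, not the wildlife-wrangler's actual code: the parameter names ("years_range", "months_range", and so on) come from this README, but the value formats, the `save_filter_set`, `load_filter_set`, and `store_records` helpers, and the `occurrence_records` table layout are all assumptions made for the example.

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical filter set. Key names follow the README; the value formats
# (e.g. "2000,2020") and the "name" key are assumptions for illustration.
filter_set = {
    "name": "example_filter_set",
    "years_range": "2000,2020",
    "months_range": "4,8",
    "max_coordinate_uncertainty": 1000,
    "issues_omit": ["ZERO_COORDINATE"],
    "bases_omit": ["FOSSIL_SPECIMEN"],
}


def save_filter_set(params: dict, directory: Path) -> Path:
    """Write the parameters to <directory>/<name>.json and return the path."""
    path = directory / f"{params['name']}.json"
    path.write_text(json.dumps(params, indent=2, sort_keys=True))
    return path


def load_filter_set(path: Path) -> dict:
    """Read a previously saved filter set so it can be reused."""
    return json.loads(path.read_text())


def store_records(db_path: Path, records: list, filter_set_name: str) -> int:
    """Save records in SQLite, tagging each row with the filter set used.

    Returns the number of rows in the table after the insert.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS occurrence_records ("
            "record_id TEXT PRIMARY KEY, latitude REAL, longitude REAL, "
            "event_date TEXT, filter_set TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO occurrence_records VALUES (?, ?, ?, ?, ?)",
            [
                (r["record_id"], r["latitude"], r["longitude"],
                 r["event_date"], filter_set_name)
                for r in records
            ],
        )
        conn.commit()
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM occurrence_records"
        ).fetchone()
        return count
    finally:
        conn.close()
```

Recording the filter set name on every row is one simple way to keep each record traceable to the parameters that produced it; the real output database may link records and filter sets differently.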
## Valuable Features
This framework has certain features that support transparency and build confidence for analyses and evaluations of species distributions.
* __Automation__ -- The potential volume of data involved necessitates the processes be automated. Automation also reduces subjectivity in decision making, enables thorough documentation, and ensures repeatability. However, some aspects of data wrangling and filtering are unavoidably analog.
* __Open source components__ -- All processes are run from Python; the code is written in Python, R, and SQL and uses sqlite3, a module in the Python standard library.
* __Transparency__ -- Summaries of occurrence data and models using empirical data include subjectivity in the form of choices regarding parameters and rules for curating input data. This framework is meant to provide a way to document those choices.
* __Detailed parameterization__ -- Queries and filters can be parameterized on a per-species and per-event basis. Rules do not have to be applied generally across large numbers of species or evaluations, but parameters are saved in JSON files that can easily be reused ("taxon_json" and "filter_set_json").
* __Geospatial processing__ -- Some geospatial operations are performed, including re-projecting and buffering points. A helper function is provided that creates a shapefile of points or occurrence record footprints from an output database. Spatial filtering is supported: queries can be limited to within geometries ("taxon_polygon" and "query_polygon"), and spatial restrictions can be assigned to taxa's extents of occurrence. The user can also specify a continent and/or country within which to return records.
* __Source filtering__ -- Global Biodiversity Information Facility ([GBIF](https://gbif.org)) aggregates records from many datasets, collections, and institutions. The user can specify collections and institutions to omit with the "institutions_omit" and "collection_codes_omit" parameters.
* __Duplicate handling__ -- Queries commonly include duplicates based on the latitude, longitude, and date fields. The user can opt to keep or exclude duplicates. Subsequent versions will support more flexible duplicate testing capabilities. See the User's Guide for more details.
* __Data summarization__ -- The attributes of occurrence records are summarized in a Jupyter Notebook document.
* __Record weighting__ -- Output databases include fields for weighting records based on their attributes. The wrangler does not currently offer the capability to adjust weights, but the user can do so after the output is generated, either manually or with SQL. Weighting can be used to quantify data degradation, and thus improve models and other applications (Tessarolo et al. 2017).
* __Wildlife-centric design__ -- The framework addresses several filter criteria that are especially relevant to species occurrence records for studies of wildlife distributions.
* _Occurrence year_ -- Species' true distributions and taxonomic concepts can change over time, which warrants a way to select records from within user-defined time periods with the "years_range" parameter.
* _Occurrence month_ -- Relevant for investigations of migratory species and individual seasons. The "months_range" parameter can be used for this.
* _Locational uncertainty_ -- Data providers associate occurrence records with geographic locations through a process of georeferencing. However, the spatial accuracy, precision, and other details of georeferences can vary immensely among records (Chapman and Wieczorek 2020). There are varying degrees of uncertainty regarding the locations of individuals recorded and/or the areas sampled by observers ("locational uncertainty"). Within GBIF and eBird, georeferences are often limited in detail, so researchers must make assumptions or best guesses. The framework provides a means of documenting and explaining those choices and includes processes that interpret and approximate georeferences. It is also possible to filter on locational uncertainty with the "max_coordinate_uncertainty" parameter. See the user's guide for more information on this complex topic.
* _Detection distance_ -- The distance between an observer and an individual animal that is recorded can limit the scales of analyses that are possible. That is because detection distance adds to the locational uncertainty associated with records from other factors such as the precision of the Global Positioning System (GPS) reading and observer movement during transects and traveling counts. This is not a concern for all taxa because different taxa may be sampled with different methods. The uncertainty surrounding the given locations of individuals recorded can vary among methods, and, in some cases, be zero or negligible. For example, small mammals that are captured in traps can confidently be assigned to the trap's location, but loud-singing birds detected in an auditory survey could be hundreds of meters away from the observer, and thus, the coordinate associated with the record. Some researchers choose not to address this, while others structure analyses around it. The wildlife-wrangler allows for either strategy and anything in between, and the decision can be documented and justified. A default maximum detection distance can be specified for the taxa concepts with the "detection_distance_m" parameter. See section 2.3 in Chapman and Wieczorek (2020), Buckland et al. (2001), or any of a number of books and articles that cite Buckland et al. (2001) for more on this topic.
* _Dynamism in occurrence datasets_ -- Some datasets allow contributors to revise attributes of occurrence records after they are added. In addition, historic records may be added that change the set of records associated with a past time period. As a result, a query for past years that is run today may return different records than the same query run tomorrow, which is a challenge for provenance. The wildlife-wrangler addresses this by documenting taxon concepts and filter parameters as JSON objects ("taxon_json" and "filter_set_json") that can be saved, documented, and reused. Additionally, records included in the wildlife-wrangler output are linked to the filter sets used to acquire them.
* _Taxonomic issues_ -- Failure to account for taxonomic issues, such as species name changes, synonyms, homonyms, and taxon concept changes (with or without name changes), can create problems for studies of species' geographic distributions that use species occurrence records (Tessarolo et al. 2017). The potential consequences of these errors include commission errors, inflated omission rates, and missed opportunities for model validation. Careful specification of taxon concepts in the wildlife-wrangler can help avoid such errors.
* _Known issues_ -- Some records have known issues that limit their value. The framework enables users to exclude records based on issues identified by providers with the "issues_omit" parameter.
* _Basis of record and sampling protocols_ -- Datasets accessed through GBIF include a variety of types of records, such as preserved specimens and fossil records. Additionally, different sampling protocols may have been employed. The user can choose which types to filter out with "bases_omit" and "sampling_protocols_omit".
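
To make a few of the features above concrete, here is a rough sketch of how year, month, locational uncertainty, and basis-of-record filters might be applied, followed by duplicate removal on latitude, longitude, and date and a post hoc weight adjustment with SQL. It is illustrative only and does not reproduce the wrangler's implementation. The parameter names are taken from this README; the value formats, record field names, table layout, and every function name are hypothetical. Month ranges that wrap across the new year are not handled.

```python
import sqlite3
from datetime import date

# Hypothetical filter values; key names follow the README, formats are assumed.
FILTERS = {
    "years_range": "2000,2020",
    "months_range": "4,8",
    "max_coordinate_uncertainty": 1000,  # meters (assumed unit)
    "bases_omit": ["FOSSIL_SPECIMEN"],
}

# Hypothetical records; field names are illustrative, not the wrangler's schema.
RECORDS = [
    {"id": "r1", "lat": 35.1, "lon": -79.2, "event_date": "2015-05-02",
     "uncertainty_m": 30, "basis": "HUMAN_OBSERVATION"},
    {"id": "r2", "lat": 35.1, "lon": -79.2, "event_date": "2015-05-02",
     "uncertainty_m": 30, "basis": "HUMAN_OBSERVATION"},  # duplicate of r1
    {"id": "r3", "lat": 35.4, "lon": -79.9, "event_date": "1995-06-10",
     "uncertainty_m": 50, "basis": "HUMAN_OBSERVATION"},  # year out of range
    {"id": "r4", "lat": 35.7, "lon": -78.6, "event_date": "2010-07-04",
     "uncertainty_m": 5000, "basis": "HUMAN_OBSERVATION"},  # too uncertain
    {"id": "r5", "lat": 35.9, "lon": -78.1, "event_date": "2012-06-01",
     "uncertainty_m": 200, "basis": "FOSSIL_SPECIMEN"},  # omitted basis
    {"id": "r6", "lat": 36.0, "lon": -80.0, "event_date": "2018-04-15",
     "uncertainty_m": 800, "basis": "PRESERVED_SPECIMEN"},
]


def _bounds(text):
    """Parse an assumed 'low,high' string into a pair of integers."""
    low, high = (int(part) for part in text.split(","))
    return low, high


def passes(record, filters):
    """Return True if a record satisfies every filter in the set."""
    when = date.fromisoformat(record["event_date"])
    first_year, last_year = _bounds(filters["years_range"])
    first_month, last_month = _bounds(filters["months_range"])
    return (
        first_year <= when.year <= last_year
        and first_month <= when.month <= last_month
        and record["uncertainty_m"] <= filters["max_coordinate_uncertainty"]
        and record["basis"] not in filters["bases_omit"]
    )


def drop_duplicates(records):
    """Keep the first record for each (latitude, longitude, date) combination."""
    seen, kept = set(), []
    for record in records:
        key = (record["lat"], record["lon"], record["event_date"])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def filter_records(records, filters, keep_duplicates=False):
    """Apply the filters, then optionally remove duplicates."""
    kept = [r for r in records if passes(r, filters)]
    return kept if keep_duplicates else drop_duplicates(kept)


def downweight_uncertain(records, threshold_m, new_weight):
    """Load records into an in-memory table and lower some weights with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE occurrences "
        "(id TEXT PRIMARY KEY, uncertainty_m REAL, weight REAL DEFAULT 1.0)"
    )
    conn.executemany(
        "INSERT INTO occurrences (id, uncertainty_m) VALUES (?, ?)",
        [(r["id"], r["uncertainty_m"]) for r in records],
    )
    conn.execute(
        "UPDATE occurrences SET weight = ? WHERE uncertainty_m > ?",
        (new_weight, threshold_m),
    )
    weights = dict(conn.execute("SELECT id, weight FROM occurrences"))
    conn.close()
    return weights
```

With these sample values, `filter_records(RECORDS, FILTERS)` keeps only `r1` and `r6`, and `downweight_uncertain` then halves the weight of `r6` when called with a 500 m threshold and a new weight of 0.5. Whether to exclude uncertain records outright or keep them with a lower weight is the kind of choice the framework is meant to document.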
## Recent changes
* Support for pygbif 0.6.2 and changes to the GBIF occurrence download API.
......@@ -94,11 +63,4 @@ Unless otherwise noted, This project is in the public domain in the United State
Additionally, we waive copyright and related rights in the work worldwide through the CC0 1.0 Universal public domain dedication.
This software is preliminary or provisional and is subject to revision. It is being provided to meet the need for timely best science. The software has not received final approval by the U.S. Geological Survey (USGS). No warranty, expressed or implied, is made by the USGS or the U.S. Government as to the functionality of the software and related material nor shall the fact of release constitute any such warranty. The software is provided on the condition that neither the USGS nor the U.S. Government shall be held liable for any damages resulting from the authorized or unauthorized use of the software.
## References
Buckland, S.T., Anderson, D.R., Burnham, K.P., Laake, J.L., Borchers, D.L., and Thomas, L. 2001. Introduction to distance sampling: estimating abundance of biological populations. Oxford University Press, Oxford. 432 pp.
Chapman, A.D. and Wieczorek, J.R. 2020. Georeferencing Best Practices. Copenhagen: GBIF Secretariat. https://doi.org/10.15468/doc-gg7h-s853
Tessarolo, G., Ladle, R., Rangel, T., and Hortal, J. 2017. Temporal degradation of data limits biodiversity research. Ecology and Evolution 7:6863–6870. https://doi.org/10.1002/ece3.3259