Commit 3e0b26ae authored by Nathan Tarr

Removed some duplicated text.

### Where to Find Help
* The wrangler uses the [pygbif](https://pygbif.readthedocs.io/en/latest/) and [auk](https://cornelllabofornithology.github.io/auk/) packages, so their documentation can help explain some aspects. These are the GBIF fields currently used to answer key questions about records (a small pygbif sketch follows this list):
    * What? -- "record_id", "gbif_id", "individualCount", "identificationQualifier"
    * When? -- "eventDate", "retrievalDate"
    * Where? -- "coordinateUncertaintyInMeters", "decimalLatitude", "decimalLongitude", "footprintWKT", "geodeticDatum"
    * Who provided? -- "collectionCode", "institutionID", "datasetName"
    * How obtained? -- "basisOfRecord", "samplingProtocol", "establishmentMeans", "source"
    * Issues, notes, comments -- "issue" or "issues", "locality", "eventRemarks", "locationRemarks", "occurrenceRemarks"
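
The snippet below is a minimal, standalone sketch of the kind of pygbif query the wrangler wraps and of the fields listed above; it is not the wrangler's own code, and the species name is only an example.

```python
# A minimal sketch of querying GBIF with pygbif and inspecting a few of the
# fields listed above. This is illustrative only, not the wrangler's code.
from pygbif import occurrences, species

# Look up the GBIF taxon key for a name (the species shown is only an example).
key = species.name_backbone(name="Pituophis melanoleucus")["usageKey"]

# Pull a small sample of records and print some of the fields listed above.
results = occurrences.search(taxonKey=key, hasCoordinate=True, limit=5)
for rec in results["results"]:
    print(rec.get("eventDate"), rec.get("decimalLatitude"),
          rec.get("decimalLongitude"), rec.get("coordinateUncertaintyInMeters"),
          rec.get("basisOfRecord"), rec.get("issues"))
```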
## Important Issues and Features
### General Overview of the Workflow
The user begins by creating a copy of the *Query_Form_TEMPLATE.ipynb*, which they then customize by adding a taxon concept, the filter parameter sets to use, and paths to relevant directories. When the customized notebook document is run, code stored in *"wrangler_functions.py"* and in the notebook document itself retrieves records from the Global Biodiversity Information Facility ([GBIF](https://gbif.org)) and/or the eBird Basic Dataset ([eBird](https://ebird.org/science/use-ebird-data/)), filters out unsuitable records, and creates an output database where suitable records are stored along with documentation and summaries of record attributes before and after filtering. Additionally, various summaries of data attributes are performed within the notebook document. Thus, the primary results of running the code are 1) the notebook document with documentation and data summaries and 2) the output SQLite database containing suitable records.
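
As a rough sketch of how that output database can be inspected afterward, the file can be opened with Python's built-in sqlite3 module; the file name and the "occurrence_records" table name below are assumptions for illustration, not a documented schema.

```python
# A rough sketch of inspecting the wrangler's output database. The file name and
# the "occurrence_records" table name are assumptions and may differ in practice.
import sqlite3

connection = sqlite3.connect("C:/YourFolder/your_output.sqlite")
cursor = connection.cursor()

# List the tables that were created.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print([row[0] for row in cursor.fetchall()])

# Count the records that passed the filters (table name assumed).
cursor.execute("SELECT COUNT(*) FROM occurrence_records;")
print(cursor.fetchone()[0])
connection.close()
```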
### Key Components of the Framework
* __Filter sets__ -- saving filter sets (criteria) and taxon concepts as unique items (JSON files) makes it much easier to explore combinations of taxon definitions and filtering criteria. For example, if you want to use the same criteria for 20 species, you can call the same filter set each of the 20 times with just its code. This setup was chosen with the running of hundreds of queries over time in mind.
* __Query_Form.ipynb__ -- this is where you control/run the wrangler. The file is a combined form and report all in one. You can copy the notebook document, rename it, fill it out, and run it to perform queries/downloads.
* __wranglerconfig.py__ -- this is a .py file where you store some personal information that you wouldn't want saved in the notebook document: your email address and password for your GBIF account, which is needed in order to request downloads from GBIF. A file path to an eBird Basic Dataset is saved here as well.
* __wrangler_functions.py__ -- a Python module containing the core functions of the wrangler. DO NOT CHANGE THIS! Much of the necessary code is kept here to avoid having a thousand lines of code in the query notebook. You can call some helper functions from this module by importing it in IPython (e.g., "import wrangler_functions as wranglers"). That can be handy for using the "get_GBIF_code" and "generate_shapefile" functions (see the sketch after this list).
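
The sketch below shows what such a direct call might look like; "get_GBIF_code" and "generate_shapefile" are named above, but their exact signatures are not documented here, so the arguments shown (a species name, a database path, and an output path) are assumptions.

```python
# Sketch of calling helper functions directly from the module. The argument
# lists shown here are assumptions, not the documented signatures.
import wrangler_functions as wranglers

# Look up the GBIF taxonomic code for a species name (name is an example).
gbif_code = wranglers.get_GBIF_code("Pituophis melanoleucus")

# Export the records in an output database as a shapefile (paths are examples).
wranglers.generate_shapefile("C:/YourFolder/your_output.sqlite",
                             "C:/YourFolder/occurrences.shp")
```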
### Detailed Instructions
1. Copy "__Query_Form_TEMPLATE.ipynb__" to a location outside of the wrangler repo, say to your project directory. Rename the notebook document to whatever you like.
2. In conda, activate your wrangler environment. Open Jupyter Notebook from the conda shell and navigate to your renamed copy of the template.
3. Fill out the notebook document and run it. A JSON file of the taxon information and filter parameters ("filter set") will be saved when the notebook is run. Alternatively, you can specify existing JSON files to use. NOTE: Run time can range from a few seconds to several hours.
4. When you have completed a query/notebook document, you can export the notebook document as an HTML file and archive it for later reference. The HTML version can be exported within Jupyter Notebook or by adapting and running the following command, which is included in the last cell of the query notebook: `jupyter nbconvert --to html --TemplateExporter.exclude_input=True --output-dir="C:/YourFolder/" NOTEBOOK_NAME_HERE.ipynb`.
5. Open the output database in DB Browser or elsewhere and adjust the record weights as desired, and/or further clean the records by hand or in broad strokes with R, Python, or SQL scripts (a sketch of the SQL approach follows this list).
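
A sketch of the kind of bulk cleanup step 5 describes is shown below; the table name "occurrence_records" and the columns "weight" and "coordinateUncertaintyInMeters" reflect assumptions about the output schema rather than a documented layout.

```python
# A sketch of step 5: down-weighting records in bulk with SQL. The table and
# column names are assumptions about the output schema, not a documented layout.
import sqlite3

connection = sqlite3.connect("C:/YourFolder/your_output.sqlite")
with connection:
    connection.execute(
        """
        UPDATE occurrence_records
        SET weight = 0.5
        WHERE coordinateUncertaintyInMeters > 5000;
        """
    )
connection.close()
```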
### Tips
* Taxonomic issues -- Failure to account for taxonomic issues, such as species name changes, synonyms, homonyms, and taxon concept changes (with or without name changes), can create problems for studies of species' geographic distributions that use species occurrence records (Tessarolo et al. 2017). The potential consequences of these errors include commission errors, inflated omission rates, and missed opportunities for model validation. Careful specification of taxon concepts in the wildlife-wrangler can help avoid such errors.
* Known issues -- Some records have known issues that limit their value. The framework enables users to exclude records based on issues identified by providers with the "issues_omit" parameter.
* Basis of record and sampling protocols -- Data sets accessed through GBIF include a variety of types of records, such as preserved specimens and fossil records. Additionally, different sampling protocols may have been employed. The user can choose which types to filter out with "bases_omit" and "sampling_protocols_omit".
* Dynamism in occurrence data sets -- Some data sets enable data contributors to revise attributes of occurrence records after they are added. In addition, historic records may be added that change the set of records associated with a past time period. That is to say, a query of years past that is run today may return different results than the same query run tomorrow, which presents a challenge for provenance. The wildlife-wrangler handles this issue by documenting taxon concepts and filter parameters as JSON objects ("taxon_json" and "filter_set_json") that can be saved, documented, and reused. Additionally, records included in the wildlife-wrangler output are linked to the filter sets used to acquire them.
* Occurrence year -- Species' true distributions and taxonomic concepts can change over time, which warrants a way to select records from within user-defined time periods with the "years_range" parameter.
* Occurrence month -- Relevant for investigations of migratory species and individual seasons. The "months_range" parameter can be used for this (an example filter set combining several of these parameters is sketched after this list).
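
The filter parameters mentioned in the tips above can be combined into a single filter-set JSON file. The parameter names below come from this document; the values and the exact JSON layout the Query Form expects are assumptions for illustration.

```python
# A sketch of saving a filter set as JSON. The parameter names are those
# discussed above; the values and overall structure are illustrative only and
# may not match the exact layout the Query Form expects.
import json

filter_set = {
    "years_range": "2000,2020",
    "months_range": "3,6",
    "bases_omit": ["FOSSIL_SPECIMEN"],
    "sampling_protocols_omit": [],
    "issues_omit": ["ZERO_COORDINATE", "COORDINATE_INVALID"],
    "duplicate_coord_date_OK": False,
}

with open("C:/YourFolder/my_filter_set.json", "w") as f:
    json.dump(filter_set, f, indent=4)
```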
Chapman and Wieczorek (2020) identify several key components for georeferencing:
* Detection distance -- Many species, especially birds, are observed at considerable distances from the observer's recorded location. The maximum amount of uncertainty from this component can differ across species. Detection distance is not provided, so the user must rely on an estimate of the maximum detection distance for a study species. That value is stored in the "detection_distance_m" field.
* Coordinate uncertainty -- For the point-radius method, the appropriate buffer radius for a record is stored in the Darwin Core field named "coordinateUncertaintyInMeters". A well-georeferenced record would have a coordinate uncertainty value that encompasses all of the previously listed components. In reality, this field is often empty and the user is left to approximate or estimate a value. The user can opt to provide a standard, default coordinate uncertainty value to use when "coordinateUncertaintyInMeters" is empty or rely upon built-in processes to approximate an appropriate radius.
The following processes are relevant to incorporating and accounting for uncertainties in the locations of recorded individuals.
1. Decimal degree values are rounded to five or fewer digits because few GPS devices have precision that truly exceeds that (1.1 m).
2. During shapefile creation with spatial_output(), if "footprintWKT" is provided, the shape method is used and the record is mapped with the polygon that it delineates. Otherwise, the point-radius method is used.
3. When the point-radius method must be used and a "coordinateUncertaintyInMeters" value is provided, that value is used for the buffer radius. When no "coordinateUncertaintyInMeters" value is provided, a point-buffer radius is approximated by summing values from other fields and is stored in the "radius_m" field. Which fields are used to calculate the radius length depends upon the data source. The following table summarizes which fields are summed for each data source.
| Source | coordinateUncertaintyInMeters | GPS_accuracy_m | effort_distance_m | detection_distance_m |
| ------ | ----------------------------- | -------------- | ----------------- | -------------------- |
In summary, mapping occurrence records involves approximating non-point geometries.
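
A simplified sketch of the buffer-radius logic described above follows: "coordinateUncertaintyInMeters" is used when it is provided, and otherwise "radius_m" is approximated by summing the other distance fields. Exactly which fields are summed depends on the data source (see the table), so the flat sum here is an assumption for illustration.

```python
# A simplified sketch of the point-radius buffer logic: use
# "coordinateUncertaintyInMeters" when available, otherwise sum the other
# distance fields. Which fields are summed actually depends on the data source;
# summing all three here is illustrative only.
import pandas as pd

def approximate_radius(record: pd.Series) -> float:
    if pd.notnull(record.get("coordinateUncertaintyInMeters")):
        return float(record["coordinateUncertaintyInMeters"])
    components = ["GPS_accuracy_m", "effort_distance_m", "detection_distance_m"]
    return float(sum(record[f] for f in components if pd.notnull(record.get(f))))

records = pd.DataFrame({
    "coordinateUncertaintyInMeters": [250.0, None],
    "GPS_accuracy_m": [30, 30],
    "effort_distance_m": [0, 1000],
    "detection_distance_m": [100, 100],
})
records["radius_m"] = records.apply(approximate_radius, axis=1)
print(records["radius_m"])  # 250.0 for the first record, 1130.0 for the second
```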
### Removing Duplicate Records
Queries commonly include duplicate records, potentially of different types. Some types of duplicates are automatically removed, whereas the user can specify whether to drop others. First, the eBird Basic Dataset includes records from individual birders that participated in checklists as a group, which creates a type of duplicate. Currently, when querying the EBD with wildlife-wrangler, only one checklist from each group of checklists is retained. Checklists are dropped according to methods present in the [auk_unique()](https://rdrr.io/cran/auk/man/auk_unique.html) and read_ebd() functions from eBird's auk package. Second, duplication based upon the latitude, longitude, and date fields may occur. Such coordinate-date duplicates are more complicated. If the user chooses to exclude them by setting the "duplicate_coord_date_OK" parameter to "False", then a multi-step process is triggered to account for two major issues. One, the latitude and longitude values of a record may have different numbers of digits to the right of the decimal (e.g., latitude has eight decimals but longitude has six). Two, not all records have the same number of digits to the right of the decimal for latitude and longitude (e.g., one record may have two for latitude and longitude while another has 12). The process used is as follows:
1. The latitude and longitude values of each record are rounded to the precision of the less precise of the two in cases where they differ.
2. If duplicates occur after that step, then the record with the smallest buffer radius is kept, or the first record in sort order if the radii are the same.
3. Records are identified that are a duplicate of a record with higher precision (e.g. (10.123, -10.123) would be flagged as a duplicate of (10.1234, -10.1234)).
4. Duplicates are again removed.
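
The pandas sketch below walks through a simplified version of steps 1 and 2; it is not the wrangler's implementation, and the cross-precision flagging in steps 3 and 4 is only noted in a comment. Column names follow the Darwin Core fields used elsewhere in this document.

```python
# A simplified pandas sketch of coordinate-date duplicate handling; not the
# wrangler's implementation.
import pandas as pd

def decimals(value) -> int:
    """Count digits to the right of the decimal point in a coordinate string."""
    text = str(value)
    return len(text.split(".")[1]) if "." in text else 0

records = pd.DataFrame({
    "decimalLatitude":  ["10.1234", "10.123", "20.55"],
    "decimalLongitude": ["-10.123", "-10.123", "-20.55"],
    "eventDate": ["2020-05-01", "2020-05-01", "2020-05-01"],
    "radius_m": [100.0, 30.0, 500.0],
})

# Step 1: within each record, round latitude and longitude to the shorter precision.
precision = records.apply(lambda r: min(decimals(r["decimalLatitude"]),
                                        decimals(r["decimalLongitude"])), axis=1)
records["lat_r"] = [round(float(lat), p) for lat, p in zip(records["decimalLatitude"], precision)]
records["lon_r"] = [round(float(lon), p) for lon, p in zip(records["decimalLongitude"], precision)]

# Step 2: among coordinate-date duplicates, keep the record with the smallest
# radius (ties fall back to sort order). Steps 3 and 4 would additionally flag
# records that duplicate a higher-precision record, then drop them.
records = (records.sort_values("radius_m")
                  .drop_duplicates(subset=["lat_r", "lon_r", "eventDate"], keep="first"))
print(records[["lat_r", "lon_r", "eventDate", "radius_m"]])
```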
Removing coordinate-date duplicates would theoretically handle cases where the same record is entered twice into the data set. However, users may wish to remove duplicates based upon all fields or a different set of fields. That would be a third type of duplication, and support for that filtering will be added in a subsequent release.
### Valuable Features