Geocoding

Geocoding is the process of identifying a geographic location represented by a description of a place. Geocoding software attempts to match the description (e.g., an address) with geographic reference data. When the quality of the description or reference data is poor, the software may only identify a poor-quality match or no match whatsoever. Luckily, the geocoding software used by GeoMarker provides information on the quality of each geocode. After receiving data from GeoMarker, you should review this information to determine the quality of the geocoded data. The information on this page will help you interpret the geocoding quality data and identify and address any issues.

Geocoding Quality and Biases

Geocoding works better for some types of addresses than others. Addresses that clearly represent a location of interest and that are likely to have matching counterparts in reference feature datasets work best.

Types of addresses that commonly cause problems for geocoding include:

  • PO Boxes - Because PO Boxes are physically located in post offices, rather than at the actual residence location of an individual, geocoding to the location of a PO Box typically does not return meaningful results in terms of where an individual lives. In addition, many reference datasets do not include PO Boxes.
  • Rural addresses - Rural addresses do not necessarily refer to the physical location of a house.
  • Commercial addresses - Multiple commercial addresses may be contained within the same physical building.

Identifying and handling poorly geocoded addresses

GeoMarker output includes two fields that indicate geocoding quality. NAACCRGISQualityName indicates the type of reference feature to which the address was matched. MatchType indicates how exactly the address matched the reference feature. Examine these fields to determine which addresses did not geocode well. An address may require further attention if:

  • NAACCRGISQualityName is not AddressPoint, Parcel, or StreetSegmentInterpolation.
    A match to an address point, parcel, or street segment indicates that the input street number and street name were found in a reference dataset and that the geocode was based on that information. Other NAACCRGISQualityName values, such as AddressZIPCentroid or CityCentroid, indicate that the match was based not on the street name and number but only on a larger area in which the address is located. Such matches are likely to be less accurate.
  • MatchType is not Exact.
    MatchType values other than Exact indicate that the input address was similar to a feature in the reference dataset, but either some fields did not match or a string differed somewhat. Inexact matches can occur when there are errors in the input data, such as a typo, a missing or incorrect street directional (N/S/E/W), a missing or incorrect street suffix (e.g., St, Rd, Ave), or an address placed in the wrong city.
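
The two checks above can be sketched in a few lines of Python. This is only an illustration: the sample records, the "id" column, and the MatchType value "Relaxed" are hypothetical placeholders, so substitute the values that actually appear in your GeoMarker output.

```python
# NAACCRGISQualityName values that indicate a street-level match (from the list above).
STREET_LEVEL_QUALITIES = {"AddressPoint", "Parcel", "StreetSegmentInterpolation"}


def needs_review(record):
    """Return the reasons (possibly none) that a geocoded record may need attention."""
    reasons = []
    if record.get("NAACCRGISQualityName") not in STREET_LEVEL_QUALITIES:
        reasons.append("not matched at street level")
    if record.get("MatchType") != "Exact":
        reasons.append("inexact match")
    return reasons


# Hypothetical sample rows; real output contains more columns.
records = [
    {"id": "1", "NAACCRGISQualityName": "Parcel", "MatchType": "Exact"},
    {"id": "2", "NAACCRGISQualityName": "AddressZIPCentroid", "MatchType": "Exact"},
    # "Relaxed" is a placeholder for any MatchType value other than Exact.
    {"id": "3", "NAACCRGISQualityName": "StreetSegmentInterpolation", "MatchType": "Relaxed"},
]

# Map each flagged record id to the reasons it was flagged.
flagged = {r["id"]: needs_review(r) for r in records if needs_review(r)}
```

Here record 1 passes both checks, record 2 is flagged because it was matched only to a ZIP centroid, and record 3 is flagged because its match was not exact.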

If you identify addresses that did not geocode well, the preferred way of handling them is to look for errors in the input address that may have caused the problem. Looking up an address in an online mapping tool (e.g., Google Maps) can sometimes reveal issues. Referring to original source material (such as a survey form), if available, can also help. Local knowledge of the area in which the address is located is often useful.

You may wish to remove addresses that were poorly geocoded from further analysis. However, you should be aware that doing so can introduce selection bias because not all addresses are equally likely to geocode poorly.
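
One way to gauge this risk before dropping records is to compare the share of poorly geocoded addresses across groups in your data. The sketch below assumes each record already carries a boolean "poorly_geocoded" flag and a grouping column. Both field names and the sample data are hypothetical.

```python
from collections import defaultdict


def poor_geocode_rate_by_group(records, group_field):
    """Return the fraction of poorly geocoded records within each group."""
    totals = defaultdict(int)
    poor = defaultdict(int)
    for record in records:
        group = record[group_field]
        totals[group] += 1
        if record["poorly_geocoded"]:
            poor[group] += 1
    return {group: poor[group] / totals[group] for group in totals}


# Hypothetical data: "setting" stands in for any grouping variable of interest.
sample = [
    {"setting": "rural", "poorly_geocoded": True},
    {"setting": "rural", "poorly_geocoded": False},
    {"setting": "urban", "poorly_geocoded": False},
    {"setting": "urban", "poorly_geocoded": False},
]
rates = poor_geocode_rate_by_group(sample, "setting")
```

If the rates differ substantially between groups, removing poorly geocoded addresses will change the composition of the remaining sample.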

Feature Matching

Feature matching is at the heart of geocoding. In this process, the geocoder searches for a match to an input address in one or more reference datasets. Reference datasets provide the link between addresses and locations, since they contain features with known geographic locations described in terms of addresses.

Reference Feature Datasets

The Texas A&M Geocoder uses several reference feature datasets. Most addresses will be matched to a feature in one of the following reference datasets:

Finding Inexact Matches

The geocoder uses three techniques to search for near matches if no exact match to an input address is found in a reference feature dataset: attribute relaxation, Soundex matching, and substring matching. If any of these techniques is used, it is indicated in the MatchType field in the output.

Attribute relaxation involves leaving out elements of the parsed input address. This allows features in the reference dataset that match the remaining elements of the input address but have a difference in the omitted element(s) to be considered matches. Elements are omitted through a series of queries, one at a time and in combination. Elements subject to relaxation, from first to last omitted, are:

  • Street Predirectional
  • Street Postdirectional
  • Street Suffix
  • City
  • ZIP
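
The relaxation process can be illustrated with a minimal Python sketch. It is not the geocoder's actual implementation: the field names, the exact-equality comparison, and the order in which combinations are tried (single elements first, then larger sets) are all assumptions made for illustration.

```python
from itertools import combinations

# Relaxable elements in the order listed above (field names are hypothetical).
RELAXABLE = ["predirectional", "postdirectional", "suffix", "city", "zip"]


def relaxed_queries(address):
    """Yield (omitted, query) pairs: one element omitted at a time, then combinations."""
    for size in range(1, len(RELAXABLE) + 1):
        for omitted in combinations(RELAXABLE, size):
            query = {k: v for k, v in address.items() if k not in omitted}
            yield omitted, query


def find_relaxed_match(address, reference):
    """Return (feature, omitted) for the first feature matching a relaxed query."""
    for omitted, query in relaxed_queries(address):
        for feature in reference:
            if all(feature.get(k) == v for k, v in query.items()):
                return feature, omitted
    return None, ()


# Hypothetical input with an extra predirectional that the reference feature lacks.
address = {"number": "100", "predirectional": "N", "name": "MAIN",
           "suffix": "ST", "postdirectional": "", "city": "SPRINGFIELD", "zip": "00000"}
reference = [{"number": "100", "predirectional": "", "name": "MAIN",
              "suffix": "ST", "postdirectional": "", "city": "SPRINGFIELD", "zip": "00000"}]

feature, omitted = find_relaxed_match(address, reference)
```

In this example the first relaxed query, which omits only the predirectional, already matches the reference feature, so omitted is ("predirectional",).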

Soundex matching is based on a simplified representation of the sounds included in a word, and can help to find matches when there are misspellings in the input or reference attributes.
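
To show how a Soundex code is derived, here is a sketch of the common American Soundex algorithm in Python. The geocoder may use a different variant, so treat this as an illustration of the idea rather than a description of its exact behavior.

```python
# Consonant groups and their Soundex digits; vowels, H, W, and Y have no digit.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code for a word ("" if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]


# A misspelled street name receives the same code as the correct spelling.
same_code = soundex("Washington") == soundex("Washingtin")
```

Both "Washington" and "Washingtin" encode to W252, which is how a misspelled input can still be paired with the correct reference feature.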

Substring matching allows a match between an address and a reference feature to be made if a string in the address matches part of the corresponding string in the reference feature.
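
A minimal Python sketch of this check follows. The case-insensitive comparison and whitespace trimming are assumptions for illustration, and the function name is hypothetical.

```python
def substring_match(input_value, reference_value):
    """Return True if the input string appears within the reference string."""
    needle = input_value.strip().upper()
    haystack = reference_value.strip().upper()
    # An empty input is not treated as a match, since it is contained in every string.
    return bool(needle) and needle in haystack


# The input street name is part of the longer reference street name.
partial = substring_match("Martin Luther", "Martin Luther King")
```

Under these assumptions, the input "Martin Luther" matches the reference name "Martin Luther King", while an input such as "King Jr" does not because it is not contained in the reference string.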