Give Me Some Latitude… and Longitude

https://www.percona.com/blog/wp-content/uploads/2022/09/Geo-locations-in-MySQL.pngGeo locations in MySQL

Geo locations in MySQLGeo locations are a cornerstone of modern applications. Whether you’re a food delivery business or a family photographer, knowing the closest “something” to you or your clients can be a great feature.

In our ‘Scaling and Optimization’ training class for MySQL, one of the things we discuss is column types. The spatial types are only mentioned in passing, as less than 0.5% of MySQL users know of their existence (that’s a wild guess, with no factual basis). In this post, we briefly discuss the POINT type and how it can be used to calculate distances to the closest public park.

Import the data

To start off, we need a few tables and some data. The first table will hold the mapping between the zip code and its associated latitude/longitude. GeoNames has this data under the Creative Commons v3 license, available here for most countries. The data files are CSV and the readme.txt explains the various columns. We are only interested in a few of the columns.

CREATE TABLE usazips (
 id int unsigned NOT NULL AUTO_INCREMENT,
 zipCode int unsigned NOT NULL COMMENT 'All USA postal codes are integers',
 state varchar(20) NOT NULL,
 placeName varchar(200) NOT NULL,
 placeDesc varchar(100) DEFAULT NULL,
 latLong point NOT NULL /*!80003 SRID 4326 */,
 PRIMARY KEY (id));

There are a couple of things about this schema that are not typical in MySQL.

Firstly, the latLong column type is POINT which can store an X and Y coordinate. By default, these coordinates could be on any plane. They could represent latitude and longitude, but they could also be centimeters up and down on the surface of your desk, or yards left and right of the biggest tree at your local park. How do you know which? You need a Spatial Reference System. Thankfully, MySQL comes preloaded with about 5,100 of these systems (See INFORMATION_SCHEMA.ST_SPATIAL_REFERENCE_SYSTEMS). Each SRS has an associated Spatial Reference ID (SRID).

The SQL comment on the POINT column above ties the data in this column with a specific SRID, 4326. We can see in the I_S table noted earlier, this SRID maps to the ‘World Geodetic System 1984’ (aka WGS84) which is the same SRS used in the GeoNames dataset. By having our column and the dataset aligned, we won’t need to do any geo-transformation later on.

The second thing to note in the schema is the use of a SPATIAL index on the latLong column. A SPATIAL index in MySQL is created using R-Trees which are geospatial-specific data structures.

Let’s load this data into the table. There are 12 fields in the CSV but we only care about six of them. The LOAD DATA command allows you to associate each CSV field, positionally, with either the column name in the table or a user variable. The first field in our CSV is the country code, which we don’t care about, so we assign that field to the @dummy variable. The second field is the zip code, and we want that to go directly into the table-column so we specify the column name. We do the same for the third, fifth, sixth, and ninth fields.

The 10th and 11th CSV fields are the latitude and longitude. We need to convert those two VARCHAR fields into a POINT. We can apply some SQL transformations using the SET command which can reference user variables assigned earlier. These two fields are assigned to @lat and @lon. The remaining fields all go into @dummy since they are not used.

LOAD DATA INFILE '/var/lib/mysql-files/US.txt'
INTO TABLE usazips
FIELDS TERMINATED BY '\t' (@dummy, zipCode, placeName, @dummy, state, placeDesc, @dummy, @dummy, @dummy, @lat, @lon, @dummy)
SET id = NULL, latLong = ST_PointFromText(CONCAT('POINT(', @lat, ' ', @lon, ')'), 4326);

Unfortunately, the POINT() function in MySQL always returns an SRID of 0. Since we specified our column to be a specific SRID, a direct import will fail.

mysql> LOAD ... SET latLong = POINT(@lat, @lon);
ERROR 3643 (HY000): The SRID of the geometry does not match the SRID of the column 'latLong'.
The SRID of the geometry is 0, but the SRID of the column is 4326. Consider changing the SRID
of the geometry or the SRID property of the column.

Instead, we must “go the long route” by creating a string representation of a POINT object in the Well-Known-Text format, which MySQL can parse into a Point column type with an associated SRID.

mysql> LOAD ... SET latLong = ST_PointFromText(CONCAT('POINT(', @lat, ' ', @lon, ')'), 4326);

Yes, this was very confusing to me as well:

“POINT(123 56)” <— WKT Format (a string)
POINT(123, 56) <— MySQL Column Type (function that returns data)

Here’s a quick verification of our data with an additional column showing the lat-long binary data being converted back into WKT format.

mysql> SELECT *, ST_AsText(latLong) FROM usazips WHERE zipCode = 76455;
+-------+---------+-------+-----------+-----------+------------------------------------------------------+-------------------------+
| id    | zipCode | state | placeName | placeDesc | latLong                                              | ST_AsText(latLong)      |
+-------+---------+-------+-----------+-----------+------------------------------------------------------+-------------------------+
| 34292 |   76455 | TX    | Gustine   | Comanche  | 0xE6100000010100000068B3EA73B59958C0B84082E2C7D83F40 | POINT(31.8468 -98.4017) |
+-------+---------+-------+-----------+-----------+------------------------------------------------------+-------------------------+
1 row in set (0.04 sec)

More data to load

Now that we have all this zip code + latitude and longitude data, we next need the locations of where we want to find our distances. This could be a list of grocery stores or coffee shops. In this example, we will use a list of public parks in San Antonio, TX. Thankfully, San Antonio has all of this data openly available. I downloaded the ‘Park Boundaries’ dataset in GeoJSON format since the CSV did not contain any latitude/longitude coordinates.

CREATE TABLE parks (
 parkId int unsigned NOT NULL PRIMARY KEY,
 parkName varchar(100) NOT NULL,
 parkLocation point NOT NULL /*!80003 SRID 4326 */);

The GeoJSON data is one giant JSON string with all of the needed data nested in various JSON objects and arrays. Instead of writing some Python script to ETL the data, I decided to try out the JSON import feature of the MySQL Shell and first import the JSON directly into a temporary table.

$ mysqlsh appUser@127.0.0.1/world --import Park_Boundaries.geojson tempParks
Importing from file "Park_Boundaries.geojson" to collection `world`.`tempParks` in MySQL Server at 127.0.0.1:33060
..1..1
Processed 3.29 MB in 1 document in 0.3478 sec (1.00 document/s)
Total successfully imported documents 1 (1.00 document/s)

The MySQL Shell utility created a new table called ‘tempParks’, in the ‘world’ database with the following schema. One row was inserted, which was the entire dataset as one JSON object.

CREATE TABLE `tempParks` (
 `doc` json DEFAULT NULL,
 `_id` varbinary(32) GENERATED ALWAYS AS (json_unquote(json_extract(`doc`,_utf8mb4'$._id'))) STORED NOT NULL,
 `_json_schema` json GENERATED ALWAYS AS (_utf8mb4'{"type":"object"}') VIRTUAL,
 PRIMARY KEY (`_id`),
 CONSTRAINT `$val_strict_75A2A431C77036365C11677C92B55F4B307FB335` CHECK (json_schema_valid(`_json_schema`,`doc`)) /*!80016 NOT ENFORCED */
);

From here, I extracted the needed information from within the JSON and inserted it into the final table, shown above. This SQL uses the new JSON_TABLE function in MySQL 8 to create a result set from nested JSON data (strangely, there were a couple of duplicate parks in the dataset, which is why you see the ON DUPLICATE modifier). Note: that this dataset does longitude at index 0, and latitude at index 1.

INSERT INTO parks
  SELECT k.parkID, k.parkName, ST_PointFromText(CONCAT('POINT(', parkLat, ' ', parkLon, ')'), 4326)
  FROM tempParks, JSON_TABLE(
    tempParks.doc,
    "$.features[*]" COLUMNS (
      parkID int PATH "$.properties.ParkID",
      parkName varchar(150) PATH "$.properties.ParkName",
      parkLat varchar(20) PATH "$.geometry.coordinates[0][0][0][1]",
      parkLon varchar(20) PATH "$.geometry.coordinates[0][0][0][0]"
    )
  ) AS k ON DUPLICATE KEY UPDATE parkId = k.parkID;

Query OK, 391 rows affected (0.11 sec)
Records: 392  Duplicates: 0  Warnings: 0

What’s the closest park?

Now that all the information is finally loaded, let’s find the nearest park to where you might live, based on your zip code.

First, get the reference location (as MySQL POINT data) for the zip code in question:

mysql> SELECT latLong INTO @zipLocation FROM usazips WHERE zipCode = 78218;

Then, calculate the distance between our reference zip code and all of the park locations. Show the five closest (not including any school parks):

mysql> SELECT p.parkName, ST_AsText(p.parkLocation) AS location, ST_Distance_Sphere(@zipLocation, p.parkLocation) AS metersAway
FROM parks p WHERE parkName NOT LIKE '%School%' ORDER BY metersAway LIMIT 5;
+-----------------------+----------------------------------------------+--------------------+
| parkName              | location                                     | metersAway         |
+-----------------------+----------------------------------------------+--------------------+
| Robert L B Tobin Park | POINT(29.501796110000043 -98.42111988)       |   1817.72881969296 |
| Wilshire Terrace Park | POINT(29.486397887000063 -98.41776879999996) | 1830.8541086553364 |
| James Park            | POINT(29.48168424800008 -98.41725954999998)  |  2171.268012559491 |
| Perrin Homestead Park | POINT(29.524477369000067 -98.41258342199997) |  3198.082796589456 |
| Tobin Library Park    | POINT(29.510899863000077 -98.43343147299998) |  3314.044809806559 |
+-----------------------+----------------------------------------------+--------------------+
5 rows in set (0.01 sec)

The closest park is 1,817.7 meters away! Time for a picnic!

Conclusion

Geo locations in MySQL are not difficult. The hardest part was just finding the data and ETL’ing into some tables. MySQL’s native column types and support for the OpenGIS standards make working with the data quite easy.

Several things to note:

1) The reference location for your zip code may be several hundred or a thousand meters away. Because of this, that park that is right behind your house might not show up as the actual closest because another park is closer to the reference.

2) No SPATIAL indexes were added to the tables as they would be of no use because we must execute the distance function against all parks (ie: all rows) to find the closest park, hence the ‘metersAway’ query will always perform a full table scan.

3) All USA postal codes are integers. You can expand this to your country by simply changing the column type.

Planet MySQL