Saturday, August 29, 2020

Using k-Means Clustering to Identify Locations to Visit that are Similar to the Houston Heights

 

Introduction

My wife and I retired 5 years ago. We got rid of 99% of our stuff, sold our house, and bought a fifth wheel trailer to travel the United States and Canada with our two pugs, Pancho and Lefty. Over the past four years, we have visited over one hundred locations, generally spending between four and seven days at each stop. While it has been an amazing ride, we are starting to think about where we might want to settle down. By settle down, we mean continuing to live in the fifth wheel but staying in one location for one to three months.

Earlier this year, I began taking classes on Coursera to qualify for the IBM Data Science Professional Certificate. The certification program consists of nine courses that cover a variety of data science topics, including open source tools and libraries, methodologies, Python, databases and SQL, data visualization, data analysis, and machine learning. The final course is a capstone project in which the student applies the skills developed in the previous eight courses. For my capstone project, I decided to identify locations Deb and I had not visited but where we might want to spend at least a month.

Before we retired, we lived in a neighborhood called the Houston Heights in Houston, Texas. We loved the area because of its convenient location (less than 5 miles from downtown), the friendly neighbors, and the neighborhood’s eclectic and quirky shops and restaurants.

For the project, I elected to find areas in the United States with RV parks that are like the Houston Heights neighborhood we lived in. The criteria we defined are that an area must have RV parks less than 60 miles from a Costco Warehouse and must have venues similar to those of our old neighborhood within a 5-mile radius. We keep all our prescriptions with the Costco pharmacy and like to visit Costco at least once a month to buy bulk supplies. The 5-mile radius allows for a short drive to local venues.

Data

I relied on three primary data sources to identify RV parks within 60 miles of a Costco Warehouse and located within the contiguous 48 United States in areas like the Houston Heights.

·         Costco_USA_Canada.csv, a list of Costco Warehouses located in the United States and Canada. I found this dataset on POI Factory.

·         GoodSam.csv, a list of campgrounds in the United States and Canada that offer discounts to members of Good Sam Club. I found this dataset on POI Factory.

·         Demographic data from the Census Bureau's American Community Survey 5-Year Data (2009-2018) (“ACS”).

The datasets for the locations of RV Parks and Costco Warehouses included Zip Codes. However, the Census dataset used ZCTAs as its key. To ensure all three datasets were based on the same key, I used the US Zip Codes Database provided by Simple Maps to cross-reference the ZCTAs selected in the demographic analysis to the related Zip Codes, and modified the Census Bureau dataset to present Zip Codes instead of ZCTAs.

To identify which locations are most similar to the Houston Heights, I compared the 5 most popular venues within a five-mile radius of the center of each zip code. Data for each zip code was provided by Foursquare, a social location service that allows users to explore the world around them.

Methodology

Good Sam RV Parks Dataset

My first step was to remove from the Good Sam RV Parks dataset any RV Parks outside of the 48 contiguous United States and any entries that did not have location data, such as latitude, longitude, city, and state. The GoodSam.csv dataset consisted of four columns: “Latitude”, “Longitude”, “Description”, and “Address”. To make the dataset usable for my analysis, I split the Description column and the Address column. I divided the Description column into two columns named “Park Name” and “Location”, then eliminated the Location column because its data was partially redundant with data derived from the Address column. I divided the Address column into five columns titled “Address”, “City”, “State”, “Zip”, and “Phone Number”. Upon completion of my data preparation, the Good Sam dataset included 2,216 unique RV Parks located in the 48 contiguous United States.

Costco Warehouse Dataset

I removed from the Costco Warehouse dataset all Costco Warehouses located outside of the 48 contiguous United States and any entries that did not have location data, such as latitude, longitude, city, and state. Like the GoodSam.csv dataset, the Costco_USA_Canada.csv dataset consisted of four columns: “Latitude”, “Longitude”, “Description”, and “Address”. To make the dataset usable for my analysis, I split the Description column and the Address column. I divided the Description column into two columns named “Park Name” and “Location”, then eliminated the Location column because its data was partially redundant with data derived from the Address column. I divided the Address column into five columns titled “Address”, “City”, “State”, “Zip”, and “Phone Number”. Upon completion of my data preparation, the Costco dataset included 543 Costco Warehouses located in the 48 contiguous United States.

Find RV Parks within 60 miles of a Costco Warehouse

After cleaning and reformatting the Good Sam RV Park dataset and the Costco Warehouse dataset, I wanted to identify the distance from each RV Park to the closest Costco Warehouse. To do so, I wrote a routine that calculated, for each RV Park, the geodesic distance between it and each Costco Warehouse, saving the shortest distance found. The analysis found 1,473 RV Parks in 984 cities and 1,146 unique zip codes that are within 60 miles of a Costco.
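The nearest-warehouse search can be sketched as follows. This is an illustration rather than the code used for the project: it swaps the geodesic calculation for the simpler haversine (great-circle) formula, which treats the Earth as a sphere, and the function names and coordinates are made up for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_costco_miles(park, costcos):
    """Shortest distance from one park to any warehouse; each is a (lat, lon) tuple."""
    return min(haversine_miles(park[0], park[1], c[0], c[1]) for c in costcos)


def parks_within(parks, costcos, max_miles=60.0):
    """Keep the parks whose closest warehouse is no more than max_miles away."""
    return [p for p in parks if nearest_costco_miles(p, costcos) <= max_miles]


# Invented coordinates for illustration only, not rows from either dataset.
parks = [(29.80, -95.40), (31.00, -100.00)]
costcos = [(29.73, -95.44), (32.90, -96.80)]
nearby = parks_within(parks, costcos)
```

Comparing every park with every warehouse is a brute-force search, which is workable at this scale (a few thousand parks and a few hundred warehouses).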


Demographic Data

My wife and I discussed the factors most important to us in identifying areas where we would like to spend more time. We reviewed the variables reported in the Census Bureau's American Community Survey to select the specific demographic characteristics we wanted to consider. The ACS is an ongoing annual survey covering a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population. We decided to analyze data provided in the ACS 5-Year Data “Data Profiles”. Data Profiles is the smallest dataset in the ACS and includes over 1,000 variables covering a broad range of social, economic, housing, and demographic information presented as population counts and percentages.

The locations in the United States that I considered are Zip Code Tabulation Areas (“ZCTAs”). Per the Census Bureau, ZCTAs "are generalized areal representations of United States Postal Service (USPS) ZIP Code service areas." ZCTA is the smallest geographical area for which the Census Bureau provides demographic data. As such, I believe ZCTAs and Zip Codes best represent neighborhoods within given locations.

The four demographic variables we chose to consider were the estimated median age of the ZCTA population (DP05_0018E), the estimated percentage of the ZCTA population over 25 with a Bachelor's degree or higher (DP02_0067PE), the estimated median household income for each ZCTA (DP03_0062E), and the estimated median value of owner-occupied residences (DP04_0089E). We chose estimated median age because, although we are retired, we wanted to be in an area filled with a range of ages like the Houston Heights. We selected population with a Bachelor’s degree or higher because we both have graduate degrees and like being around people with whom we can discuss issues and new ideas. We chose estimated median household income and median home value to represent housing affordability.

Using the Census API, I retrieved the four variables for every ZCTA. For my next step, I cleaned the Census data by dropping all rows where one or more values were less than zero or blank. The table below presents a summary of basic statistical details of the Census data.
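The cleaning rule described above (drop any row with a blank or negative value) could look roughly like this. It is a sketch, not the project's code: it assumes each row has already been loaded into a dictionary keyed by the four variable codes, with values still stored as strings, and the sample rows are invented. Large negative numbers are used here to stand in for the kind of placeholder that marks an unavailable estimate.

```python
VARIABLES = ["DP05_0018E", "DP02_0067PE", "DP03_0062E", "DP04_0089E"]


def clean_rows(rows, variables=VARIABLES):
    """Keep only rows in which every variable parses as a number >= 0."""
    cleaned = []
    for row in rows:
        try:
            values = {v: float(row[v]) for v in variables}
        except (KeyError, TypeError, ValueError):
            continue  # blank, missing, or non-numeric value
        if all(x >= 0 for x in values.values()):
            cleaned.append({**row, **values})  # store the parsed numbers
    return cleaned


# Invented rows for illustration; "zcta" is a hypothetical key name.
rows = [
    {"zcta": "00001", "DP05_0018E": "35.2", "DP02_0067PE": "41.0",
     "DP03_0062E": "72000", "DP04_0089E": "310000"},
    {"zcta": "00002", "DP05_0018E": "40.1", "DP02_0067PE": "-666666666",
     "DP03_0062E": "55000", "DP04_0089E": "180000"},
    {"zcta": "00003", "DP05_0018E": "", "DP02_0067PE": "22.5",
     "DP03_0062E": "48000", "DP04_0089E": "150000"},
]
cleaned = clean_rows(rows)
```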


The chart below presents a box plot of the four demographic criteria considered.


The table below presents the demographic data for the ZCTA in which our Houston Heights neighborhood was located.


For three demographic criteria (Percentage of population with a bachelor’s degree or higher, Median Household Income, and Median Home Value), the Houston Heights is an outlier, while its estimated Median Age is in the bottom quartile.

As can be seen in the table below, Median Household Income, Median Home Value, and % Bachelor’s Degree or higher are highly correlated.


To filter the Census demographic data, I eliminated ZCTAs (a) which were outliers[1] of the Median Age, (b) where the percentage of the population over 25 with a Bachelor's degree or higher was not in the fourth quartile, (c) where Median Household Income was in the first quartile, and (d) where the Median Home Value was greater than the Median Home Value in ZCTA 77008 or in the first quartile. The table below presents the summary statistics of the filtered Census demographic dataset.
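The four screening rules can be expressed compactly. The sketch below is not the project's code and makes several assumptions: the field names are hypothetical, quartiles are computed by linear interpolation (`method="inclusive"`), values exactly on a quartile boundary are handled as shown in the comparisons, and the reference home value is a placeholder rather than the actual figure for ZCTA 77008.

```python
from statistics import quantiles


def quartiles(values):
    """Return (Q1, Q3), interpolating linearly between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def filter_zctas(rows, reference_home_value):
    """Apply rules (a) through (d) and return the rows that survive."""
    age_q1, age_q3 = quartiles([r["median_age"] for r in rows])
    iqr = age_q3 - age_q1
    age_lo, age_hi = age_q1 - 1.5 * iqr, age_q3 + 1.5 * iqr
    _, edu_q3 = quartiles([r["pct_bachelors"] for r in rows])
    inc_q1, _ = quartiles([r["median_income"] for r in rows])
    val_q1, _ = quartiles([r["median_home_value"] for r in rows])
    return [
        r for r in rows
        if age_lo <= r["median_age"] <= age_hi                          # (a) not an age outlier
        and r["pct_bachelors"] >= edu_q3                                # (b) fourth quartile
        and r["median_income"] > inc_q1                                 # (c) above first quartile
        and val_q1 < r["median_home_value"] <= reference_home_value     # (d)
    ]


# Invented values for illustration only.
keys = ("zcta", "median_age", "pct_bachelors", "median_income", "median_home_value")
samples = [
    ("Z1", 30, 70, 80000, 300000),
    ("Z2", 32, 60, 90000, 500000),
    ("Z3", 34, 10, 50000, 200000),
    ("Z4", 36, 20, 60000, 250000),
    ("Z5", 38, 30, 70000, 350000),
    ("Z6", 40, 40, 40000, 150000),
    ("Z7", 42, 50, 30000, 100000),
    ("Z8", 90, 80, 100000, 400000),
]
rows = [dict(zip(keys, s)) for s in samples]
kept = filter_zctas(rows, reference_home_value=450000)  # placeholder reference value
```

In this toy data only Z1 survives: Z8 passes the education rule but is an age outlier, and every other row falls below the third quartile for education.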


As discussed above, the dataset of RV Parks within 60 miles of a Costco Warehouse uses Zip Codes as a reference, while the dataset for Census data uses ZCTAs as a reference. I used an inner merge of the filtered Census demographic dataset and the Simple Maps US Zip Codes dataset based on ZCTA, then dropped the ZCTA column of the merged dataset to yield a Census demographic dataset with Zip Codes instead of ZCTA, and the latitude and longitude of each Zip Code.
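An inner merge keeps only the keys present in both datasets. The following sketch shows the idea without any libraries; the key names ("zcta", "zip", "lat", "lng") and the sample rows are assumptions for illustration, not the actual column names or data of the Census or Simple Maps datasets.

```python
def zctas_to_zip_codes(census_rows, zip_rows):
    """Inner-join census rows to zip rows on ZCTA, then drop the ZCTA key."""
    by_zcta = {}
    for z in zip_rows:
        by_zcta.setdefault(z["zcta"], []).append(z)

    merged = []
    for c in census_rows:
        # A ZCTA with no matching zip code contributes no rows (inner join).
        for z in by_zcta.get(c["zcta"], []):
            row = {**c, "zip": z["zip"], "lat": z["lat"], "lng": z["lng"]}
            del row["zcta"]
            merged.append(row)
    return merged


# Invented rows for illustration only.
census_rows = [
    {"zcta": "11111", "median_age": 36.0},
    {"zcta": "22222", "median_age": 41.5},
]
zip_rows = [
    {"zip": "11111", "zcta": "11111", "lat": 30.0, "lng": -95.0},
    {"zip": "11112", "zcta": "11111", "lat": 30.1, "lng": -95.1},
    {"zip": "33333", "zcta": "33333", "lat": 40.0, "lng": -80.0},
]
merged = zctas_to_zip_codes(census_rows, zip_rows)
```

Note that one ZCTA can map to more than one Zip Code, in which case its demographic values are repeated for each Zip Code.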

Combine Filtered Demographic Data and the Combined RV Parks and Costco Dataset

Next, I inner merged the filtered Census demographic dataset (now keyed by Zip Code) with the dataset of RV Parks within 60 miles of a Costco Warehouse, which yielded a dataset (the “Final Dataset”) of 198 unique zip codes in 186 cities. The map below presents those locations.


Download Venue Data for Selected Zip Codes from Foursquare

To identify the locations most similar to the Houston Heights, I compared the most popular venues within a five-mile radius of the center of each zip code in the Final Dataset. I collected this data for each zip code using the Foursquare API. From Foursquare, I requested the 100 most popular venues within a five-mile radius of each Zip Code. I then removed any Zip Code that had fewer than 50 venues. For each remaining Zip Code, I determined the 5 most popular venues. The screening found 371 unique venues over 110 unique Zip Codes with 50 venues or more. The table below presents the information for Zip Code 77008, the zip code for the Houston Heights neighborhood in which we lived.
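The screening step can be illustrated with a short sketch. It assumes the venue data has already been downloaded and reduced to a list of venue category names per Zip Code, and it interprets "most popular venues" as the most frequent categories, which is an assumption on my part. The zip codes and categories below are invented.

```python
from collections import Counter


def top_categories(venues_by_zip, min_venues=50, top_n=5):
    """Drop zip codes with too few venues; return the top_n categories for the rest."""
    result = {}
    for zip_code, categories in venues_by_zip.items():
        if len(categories) < min_venues:
            continue  # fewer than min_venues venues: excluded from the analysis
        counts = Counter(categories)
        result[zip_code] = [name for name, _ in counts.most_common(top_n)]
    return result


# Invented data for illustration only.
venues_by_zip = {
    "00001": (["Coffee Shop"] * 20 + ["Pizza Place"] * 12 + ["Bar"] * 8
              + ["Park"] * 5 + ["Gym"] * 3 + ["Bank"] * 2),
    "00002": ["Coffee Shop"] * 49,
}
top5 = top_categories(venues_by_zip)
```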


Identify Zip Codes Similar to the Houston Heights

To identify zip codes most similar to the Houston Heights, I used the k-means clustering method. The k-means algorithm chooses k centroids, then assigns every data point to the cluster of its nearest centroid, while keeping the distances between the points and their cluster's centroid as small as possible. It is one of the simplest unsupervised machine learning algorithms and is well suited to this project. For this analysis, I varied the number of clusters, k, until the cluster containing the Houston Heights covered 35 or fewer unique counties. The search identified 34 unique locations.
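For readers unfamiliar with k-means, here is a bare-bones sketch of the algorithm and of picking out the zip codes that land in the same cluster as a reference zip code. It is not the code used for this project (a library implementation such as scikit-learn's KMeans would normally be used), it starts from a deterministic farthest-first choice of centroids rather than a random one, and the labels and venue-frequency vectors are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label (0..k-1) per point."""
    # Farthest-first initialization keeps this sketch deterministic.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def same_cluster(names, labels, reference):
    """Names that share a cluster with the reference, excluding the reference."""
    target = labels[names.index(reference)]
    return [n for n, lab in zip(names, labels) if lab == target and n != reference]


# Invented venue-frequency vectors (coffee shops, bars, parks) for illustration.
names = ["ref_zip", "zip_a", "zip_b", "zip_c"]
vectors = [(0.50, 0.40, 0.10), (0.45, 0.45, 0.10), (0.10, 0.10, 0.80), (0.05, 0.15, 0.80)]
labels = k_means(vectors, k=2)
similar = same_cluster(names, labels, "ref_zip")
```

Repeating this for different values of k and counting the members of the reference cluster gives the kind of stopping rule described above.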

Results

The map below shows the final 34 locations identified by my analysis.


The table below is a list of the locations presented in the final map, sorted by state. The ID number corresponds to the number on the map.


Discussion

The k-means clustering identified 34 unique locations in 18 states. Several are locations we have visited and enjoyed, such as Tucson, Arizona; Tampa, Florida; Coeur D’Alene, Idaho; Swannanoa, North Carolina; Albuquerque, New Mexico; Portland, Oregon; Bend, Oregon; Nashville, Tennessee; Austin, Texas; Fredericksburg, Texas; North Salt Lake, Utah; and the area around Seattle, Washington (La Conner, Poulsbo and Bellingham).

The clustering also identified several areas that we had not visited and had not considered, such as Glenwood Springs, Colorado; the Denver, Colorado area (Colorado Springs, Estes Park, and Englewood); Ypsilanti, Michigan; Frankenmuth, Michigan; Reno, Nevada; Ulster County, New York; Dutchess County, New York; Ithaca, New York; Greenville, South Carolina; and Greenwood, Virginia. Since the clustering identified several places that we have enjoyed and have considered staying at for at least a month, we plan to include these places that are new to us in our travels to see how much we like them.

My sense is that the final results would differ if the number of venues used to cluster areas were increased to 10, if the maximum number of unique counties used to stop the k-means search had been reduced or increased, or if the demographic data had been included in the clustering process. Results might also have been different if I had been able to measure the drive time between the RV Parks and Costco Warehouses instead of the distance. It would be interesting to use a larger geographic area, such as congressional districts or Standard Metropolitan/Micropolitan Statistical Areas, to filter the demographic data and to identify the 100 most popular venues in order to find areas to visit.

I believe the analysis would have been faster if I had screened the demographic data first and then merged the resulting data with the RV Park data. I expect that such a step would have reduced the number of RV Parks for which I had to measure the distance to the closest Costco Warehouse.

We enjoy spending two to three months during the summer in Canada. I would like to prepare a similar analysis using Canadian demographic data, RV Parks, Costco Warehouses, and Foursquare data.

Conclusion

I used the k-means clustering technique to identify areas similar to the Houston Heights neighborhood where we lived before we retired. I used location data for RV Parks and Costco Warehouses in the United States along with demographic data for ZCTAs from the Census Bureau to identify 198 unique locations to consider. I then retrieved data for each location's 100 most popular venues. The k-means clustering method identified 34 unique areas with venues similar to those of the Houston Heights neighborhood. The clustering identified several locations that we have visited, have enjoyed, and have considered for a stay of at least one month in the future. As a result, I believe it is highly likely that we would enjoy the areas identified by the k-means clustering method that we have not yet visited.


[1] Outliers are points outside of the range from (1st quartile - 1.5 × IQR) to (3rd quartile + 1.5 × IQR), where IQR is the difference between the 3rd quartile and the 1st quartile.