Introduction
My wife and I retired 5 years ago. We got rid of 99% of our
stuff, sold our house, and bought a fifth wheel trailer to travel the United
States and Canada with our two pugs, Pancho and Lefty. Over the past four
years, we have visited over one hundred locations, generally spending between four
and seven days at each stop. While it has been an amazing ride, we are starting
to think about where we might want to settle down. By settle down, we mean
continuing to live in the fifth wheel but staying in one location for one to three
months.
Earlier this year, I began taking classes on Coursera to qualify for the IBM Data
Science Professional Certificate. The certification program consists of
nine courses which cover a variety of data science topics including: open
source tools and libraries, methodologies, Python, databases and SQL, data
visualization, data analysis, and machine learning. The final course is a
capstone project where the student applies the skills they developed in the
previous eight courses. For my capstone project, I decided to identify
locations that Deb and I had not visited but where we might want to spend at
least a month.
Before we retired, we lived in a neighborhood called the Houston Heights in
Houston, Texas. We loved the area because of its convenient location (less than
5 miles from downtown), the friendly neighbors, and the neighborhood’s eclectic
and quirky shops and restaurants.
For the project, I elected to find areas in the United States that have RV parks
and are like the Houston Heights neighborhood we lived in. We defined two
criteria: the RV park must be less than 60 miles from a Costco Warehouse, and the
area must have venues similar to those of our old neighborhood within a 5‑mile
radius. We keep all our prescriptions with the Costco pharmacy and like to visit
Costco at least once a month to buy bulk supplies. The 5‑mile radius allows for a
short drive to local venues.
Data
I relied on three primary data sources to identify RV parks
within 60 miles of a Costco Warehouse and located within the contiguous 48
United States in areas like the Houston Heights.
· Costco_USA_Canada.csv, a list of Costco Warehouses located in the United States and Canada. I found this dataset on POI Factory.
· GoodSam.csv, a list of campgrounds in the United States and Canada that offer discounts to members of Good Sam Club. I found this dataset on POI Factory.
· Demographic data from the Census Bureau's American Community Survey 5-Year Data (2009-2018) (“ACS”).
The datasets for the locations of RV Parks and Costco Warehouses included Zip
Codes. However, the Census dataset used ZCTAs as its key. To ensure all
three datasets were based on the same key, I used the US Zip Codes Database provided by
Simple Maps to cross-reference the ZCTAs
selected in the demographic analysis to the related Zip Codes, and modified the Census
Bureau dataset to present Zip Codes instead of ZCTAs.
To identify which locations have the most in common with the Houston Heights, I
compared the 5 most popular venues within a five-mile radius of the center of
each zip code. Venue data for each zip code was provided by Foursquare, a social location service that
allows users to explore the world around them.
Methodology
Good Sam RV Parks Dataset
My first step was to remove from the Good Sam RV Parks
dataset any RV Parks outside of the 48 contiguous United States and any entries
that did not have location data, such as latitude, longitude, city, and state.
The GoodSam.csv dataset consisted of four columns: “Latitude”, “Longitude”, “Description”,
and “Address”. To make the dataset usable for my analysis, I split the
Description column and the Address column. I divided the Description column
into two columns named “Park Name” and “Location”, then eliminated the Location
column as its data was partially redundant with data derived from the Address
column. I divided the Address column into five columns titled “Address”, “City”,
“State”, “Zip”, and “Phone Number”. Upon completion of my data
preparation, the Good Sam dataset included 2,216 unique RV Parks located in
the 48 contiguous United States.
Costco Warehouse Dataset
I removed from the Costco Warehouse Dataset all Costco
Warehouses located outside of the 48 contiguous United States and any entries
that did not have location data, such as latitude, longitude, city, and state.
Similar to the GoodSam.csv dataset, the Costco_USA_Canada.csv dataset consisted
of four columns: “Latitude”, “Longitude”, “Description”, and “Address”. To
make the dataset usable for my analysis, I split the Description column
and the Address column. I divided the Description column into two columns named
“Park Name” and “Location”, then eliminated the Location column as its data was
partially redundant with data derived from the Address column. I divided the
Address column into five columns titled “Address”, “City”, “State”,
“Zip”, and “Phone Number”. Upon completion of my data preparation,
the Costco dataset included 543 Costco Warehouses located in the 48 contiguous United
States.
Find RV Parks within 60 miles of a Costco Warehouse
After cleaning and reformatting the Good Sam RV Park dataset
and the Costco Warehouse dataset, I wanted to identify the distance from each
RV Park to the closest Costco Warehouse. To do so, I designed an algorithm
that calculated, for each RV Park, the geodesic distance between it and each
Costco Warehouse, keeping the shortest distance calculated. The analysis found 1,473
RV Parks in 984 cities and 1,146 unique zip codes that are within 60 miles of a
Costco.
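A minimal sketch of this nearest-warehouse search is shown below. It is illustrative rather than the code used for the project, and it swaps in the haversine (great-circle) formula as a standard-library approximation of the geodesic distance. The function names, the 3958.8-mile Earth radius, and the sample coordinates are assumptions made for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_costco_miles(park_lat, park_lon, warehouses):
    """Shortest distance from one RV park to any warehouse in the list."""
    return min(haversine_miles(park_lat, park_lon, w_lat, w_lon)
               for w_lat, w_lon in warehouses)


def parks_within(parks, warehouses, max_miles=60.0):
    """Keep (name, distance) for parks closer than max_miles to a warehouse."""
    kept = []
    for name, lat, lon in parks:
        distance = nearest_costco_miles(lat, lon, warehouses)
        if distance < max_miles:
            kept.append((name, distance))
    return kept


# Hypothetical coordinates, for illustration only.
warehouses = [(29.76, -95.37)]
parks = [("Nearby Park", 29.80, -95.40), ("Faraway Park", 35.00, -110.00)]
close_parks = parks_within(parks, warehouses)
```

Comparing every park against every warehouse is a brute-force search, which is manageable at this scale (roughly 2,216 parks by 543 warehouses).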
Demographic Data
My wife and I discussed the factors that we considered most
important in identifying areas where we would like to spend more time. We
reviewed the variables reported in the Census Bureau's American
Community Survey
to select the specific demographic characteristics we wanted to consider.
The ACS is an ongoing annual survey covering a broad range of topics about
social, economic, demographic, and housing characteristics of the U.S.
population. We decided to analyze data provided in the ACS 5-Year Data “Data
Profiles”. Data Profiles is the smallest dataset in the ACS and includes
over 1,000 variables covering a broad range of social, economic, housing, and
demographic information presented as population counts and percentages.
The locations in the United States that I considered are Zip
Code Tabulation Areas (“ZCTAs”). Per the Census Bureau, ZCTAs
"are generalized areal representations of United States Postal Service
(USPS) ZIP Code service areas." ZCTAs are among the smaller geographic areas for
which the Census Bureau provides demographic data. As such, I believe ZCTAs and
Zip Codes best represent neighborhoods within given locations.
The four demographic variables we chose to consider were the
estimated median age of the ZCTA population (DP05_0018E), estimated percentage
of the ZCTA population over 25 with a Bachelor's
degree or higher (DP02_0067PE), estimated median household income for each ZCTA
(DP03_0062E), and the estimated median value of owner-occupied residences (DP04_0089E).
We chose estimated median age because, although we are retired, we wanted to be
in an area with a range of ages, like the Houston Heights. We selected the
percentage of the population with a Bachelor’s degree or higher because we both have graduate
degrees and like being around people with whom we can discuss issues and new
ideas. We chose estimated median household income and median home value to
represent housing affordability.
Using the Census API, I retrieved
the four variables for every ZCTA. For my next step, I cleaned the Census data
by dropping all rows where one or more values were less than zero or blank. The
table below presents a summary of basic statistical details of the Census data.
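The cleaning rule described above (drop any row with a blank or negative value) can be sketched as follows. This is a hedged illustration, not the project code: it assumes each row is a dict of strings, since the Census API typically returns text and uses large negative sentinel values for unavailable estimates. The `zcta` key, the row labels, and the sample values are made up for the example.

```python
import math

# The four ACS Data Profile variables named in the text.
CENSUS_VARS = ["DP05_0018E", "DP02_0067PE", "DP03_0062E", "DP04_0089E"]


def clean_census_rows(rows):
    """Keep only rows where all four variables are present and non-negative."""
    cleaned = []
    for row in rows:
        parsed = {}
        for var in CENSUS_VARS:
            try:
                value = float(row.get(var))
            except (TypeError, ValueError):
                break  # blank, missing, or non-numeric value
            if math.isnan(value) or value < 0:
                break  # negative sentinel or NaN
            parsed[var] = value
        else:
            # The inner loop finished without a break: every value is usable.
            cleaned.append({"zcta": row["zcta"], **parsed})
    return cleaned


# Made-up rows, for illustration only.
rows = [
    {"zcta": "Z1", "DP05_0018E": "35.1", "DP02_0067PE": "60.0",
     "DP03_0062E": "100000", "DP04_0089E": "400000"},
    {"zcta": "Z2", "DP05_0018E": "41.0", "DP02_0067PE": "22.5",
     "DP03_0062E": "-666666666", "DP04_0089E": "150000"},
    {"zcta": "Z3", "DP05_0018E": "", "DP02_0067PE": "30.0",
     "DP03_0062E": "55000", "DP04_0089E": "210000"},
]
cleaned = clean_census_rows(rows)
```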
The chart below presents a box plot of the four demographic
criteria considered.
The table below presents the demographic data for the ZCTA
in which our Houston Heights neighborhood was located.
For three demographic criteria (Percentage of population with
a bachelor’s degree or higher, Median Household Income, and Median Home Value),
the Houston Heights is an outlier, while the estimated Median Age is in the
bottom quartile.
As can be seen in the table below, Median Household Income,
Median Home Value, and % Bachelor’s Degree or higher are highly correlated.
To filter the Census demographic data, I eliminated ZCTAs (a)
which were outliers[1] of the Median Age, (b) where the percentage of the
population over 25 with a Bachelor's degree or higher was not in the fourth
quartile, (c) where Median Household Income was in the first quartile, and
(d) where the Median Home Value was greater than the Median Home Value in
ZCTA 77008 or in the first quartile. The table below presents the
summary statistics of the filtered Census demographic dataset.
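The four filter rules can be sketched with the standard library as below. This is an illustration under stated assumptions, not the project code: quartiles use linear interpolation (`method="inclusive"`), values exactly on a quartile boundary are treated as belonging to the lower quartile, and the field names and sample rows are hypothetical.

```python
from statistics import quantiles


def quartile_bounds(values):
    """Return (Q1, Q3) using linear interpolation between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def filter_zctas(rows, reference_home_value):
    """Apply filters (a) through (d) described in the text."""
    age_q1, age_q3 = quartile_bounds([r["median_age"] for r in rows])
    iqr = age_q3 - age_q1
    age_low, age_high = age_q1 - 1.5 * iqr, age_q3 + 1.5 * iqr
    _, bach_q3 = quartile_bounds([r["pct_bachelors"] for r in rows])
    income_q1, _ = quartile_bounds([r["median_income"] for r in rows])
    home_q1, _ = quartile_bounds([r["median_home_value"] for r in rows])

    kept = []
    for r in rows:
        if not age_low <= r["median_age"] <= age_high:
            continue  # (a) median age is an outlier
        if r["pct_bachelors"] < bach_q3:
            continue  # (b) not in the fourth quartile
        if r["median_income"] <= income_q1:
            continue  # (c) income in the first quartile
        if (r["median_home_value"] > reference_home_value
                or r["median_home_value"] <= home_q1):
            continue  # (d) pricier than the reference, or first quartile
        kept.append(r)
    return kept


# Hypothetical rows: (label, age, % bachelor's, income, home value).
sample = [
    ("A", 30, 60, 90, 400), ("B", 90, 70, 110, 350), ("C", 32, 65, 100, 500),
    ("D", 34, 10, 30, 100), ("E", 36, 15, 40, 150), ("F", 38, 20, 50, 200),
    ("G", 40, 25, 60, 250), ("H", 42, 30, 70, 300), ("I", 44, 35, 80, 450),
]
rows = [dict(zip(("zcta", "median_age", "pct_bachelors",
                  "median_income", "median_home_value"), s)) for s in sample]
kept = filter_zctas(rows, reference_home_value=420)
```

In this made-up sample, row B is removed as an age outlier and row C because its home value exceeds the reference value, so only row A survives all four filters.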
As discussed above, the dataset of RV Parks within 60 miles
of a Costco Warehouse uses Zip Codes as a reference, while the dataset for
Census data uses ZCTAs as a reference. I used an inner merge of the filtered
Census demographic dataset and the Simple Maps US Zip Codes dataset based on
ZCTA, then dropped the ZCTA column of the merged dataset to yield a Census
demographic dataset with Zip Codes instead of ZCTA, and the latitude and
longitude of each Zip Code.
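An inner merge of this kind keeps only the rows whose ZCTA appears in both datasets. The sketch below shows the idea with plain dictionaries; it is not the project code, and the field names (`zcta`, `zip`, `lat`, `lng`) and sample values are assumptions for illustration.

```python
def zcta_to_zip(census_rows, crosswalk_rows):
    """Inner-join Census rows to a ZIP crosswalk on ZCTA, then drop ZCTA."""
    census_by_zcta = {row["zcta"]: row for row in census_rows}
    merged = []
    for xw in crosswalk_rows:
        census = census_by_zcta.get(xw["zcta"])
        if census is None:
            continue  # inner merge: skip ZIPs whose ZCTA was filtered out
        record = {k: v for k, v in census.items() if k != "zcta"}
        record.update(zip=xw["zip"], lat=xw["lat"], lng=xw["lng"])
        merged.append(record)
    return merged


# Made-up values, for illustration only.
census_rows = [{"zcta": "11111", "median_age": 35.0}]
crosswalk_rows = [
    {"zip": "11111", "zcta": "11111", "lat": 30.0, "lng": -95.0},
    {"zip": "11112", "zcta": "11111", "lat": 30.1, "lng": -95.1},
    {"zip": "22222", "zcta": "22222", "lat": 40.0, "lng": -80.0},
]
merged = zcta_to_zip(census_rows, crosswalk_rows)
```

Because several Zip Codes can map to one ZCTA, the merged result can contain more rows than the filtered Census dataset.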
Combine Filtered Demographic Data and the Combined RV Parks and Costco Dataset
Next, I inner merged the filtered Census demographic dataset
(now keyed by Zip Code) with the dataset of RV Parks within 60 miles of a Costco
Warehouse, which yielded a dataset (the “Final Dataset”) of 198
unique zip codes in 186 cities. The map below presents those locations.
Download Venue Data for Selected Zip Codes from Foursquare
To identify the locations with the most in common with the Houston
Heights, I compared the most popular venues within a five-mile radius of the
center of each zip code in the Final Dataset. I collected this data for each
zip code using the Foursquare API. From Foursquare, I requested the 100 most
popular venues within a five-mile radius of each Zip Code. I then removed any
Zip Code which had fewer than 50 venues. For each remaining Zip Code, I
determined the 5 most popular venues. The screening found 371 unique venues
over 110 unique Zip Codes with 50 venues or more. The table below presents the
information for Zip Code 77008, the zip code for the Houston Heights
neighborhood in which we lived.
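The screening step can be sketched as below. It assumes that "most popular venues" means the most frequent venue categories among the results returned for a Zip Code, which the text does not confirm. The function name, category names, and counts are hypothetical, and the Foursquare request itself is omitted.

```python
from collections import Counter


def top_categories(venues_by_zip, min_venues=50, top_n=5):
    """Drop Zip Codes with too few venues, then rank categories by frequency."""
    result = {}
    for zip_code, categories in venues_by_zip.items():
        if len(categories) < min_venues:
            continue  # fewer than 50 venues returned for this Zip Code
        counts = Counter(categories)
        result[zip_code] = [name for name, _ in counts.most_common(top_n)]
    return result


# Made-up venue categories, for illustration only.
venues_by_zip = {
    "11111": (["Coffee Shop"] * 20 + ["Park"] * 12 + ["Pizza Place"] * 8
              + ["Bar"] * 6 + ["Bakery"] * 3 + ["Gym"]),
    "22222": ["Coffee Shop"] * 49,
}
top5 = top_categories(venues_by_zip)
```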
Identify Zip Codes Similar to the Houston Heights
To identify zip codes most similar to the Houston Heights, I
used the k-means clustering method. The k-means
clustering algorithm identifies k centroids, then allocates
every data point to the nearest centroid, while keeping the distances within
each cluster as small as possible. It is one of the simplest unsupervised machine
learning algorithms and is well suited for this project. For this analysis, I
varied the number of clusters (k) until the Zip Codes with venues similar to the
Houston Heights fell in 35 or fewer unique counties. The
search identified 34 unique locations.
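The following is a minimal pure-Python sketch of the idea, not the implementation used for the project. It makes several assumptions: each Zip Code is represented by a numeric feature vector, k is increased from 2 upward, and the stopping rule counts Zip Codes in the reference cluster rather than counties. All names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Basic Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def cluster_mates(points, names, reference, k, seed=0):
    """Names that share a cluster with the reference Zip Code."""
    labels = kmeans(points, k, seed)
    ref_label = labels[names.index(reference)]
    return [n for n, lab in zip(names, labels)
            if lab == ref_label and n != reference]


def first_k_with_few_matches(points, names, reference, max_matches, seed=0):
    """Increase k until the reference cluster has at most max_matches mates."""
    for k in range(2, len(points) + 1):
        mates = cluster_mates(points, names, reference, k, seed)
        if len(mates) <= max_matches:
            return k, mates
    return len(points), []


# Toy data: two well-separated groups; "77008" stands in for the reference.
names = ["77008", "g1a", "g1b", "g2a", "g2b", "g2c"]
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
k, mates = first_k_with_few_matches(points, names, "77008", max_matches=2)
```

Raising k tends to split the data into smaller clusters, which is why it can be used to narrow the set of Zip Codes grouped with the reference neighborhood.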
Results
The map below shows the final
34 locations identified by my analysis.
The table below is a list of
the locations presented in the final map, sorted by state. The ID number corresponds
to the number on the map.
Discussion
The k-means clustering
identified 34 unique locations in 18 states. These include several locations we
have visited and enjoyed, such as Tucson, Arizona; Tampa, Florida; Coeur D’Alene,
Idaho; Swannanoa, North Carolina; Albuquerque, New Mexico; Portland, Oregon; Bend,
Oregon; Nashville, Tennessee; Austin, Texas; Fredericksburg, Texas; North Salt Lake,
Utah; and the area around Seattle, Washington (La Conner, Poulsbo, and Bellingham).
The clustering also identified several areas that we had not
visited and had not considered, such as Glenwood Springs, Colorado; the Denver,
Colorado area (Colorado Springs, Estes Park, and Englewood); Ypsilanti, Michigan;
Frankenmuth, Michigan; Reno, Nevada; Ulster County, New York; Dutchess County, New
York; Ithaca, New York; Greenville, South Carolina; and Greenwood, Virginia.
Since the clustering identified several places that we have enjoyed and have considered
staying at for at least a month, we plan to include the places that are new
to us in our travels to see how much we like them.
My sense is the final results would differ if the number of
venues used to cluster areas were increased to 10, if the maximum number of unique
counties used to stop the k-means search had been reduced or
increased, or if the demographic data had been included in the clustering process.
The results might also have been different if I had been able to measure the drive
time between the RV Parks and Costco Warehouses instead of the distance. It would
be interesting to use a larger geographic area, such as congressional districts or
Standard Metropolitan/Micropolitan Statistical Areas, to filter the demographic
data and to identify the 100 most popular venues in order to find areas to visit.
I believe the analysis would have been faster if I had
screened the demographic data first and then merged the resulting data with the RV
Park data. I expect that such a step would have reduced the number of RV Parks
for which I had to measure the distance to the closest Costco Warehouse.
We enjoy spending two to three months during the summer in
Canada. I would like to prepare a similar analysis using Canadian demographic
data, RV Parks, Costco Warehouses, and Foursquare data.
Conclusion
I used the k-means clustering technique to identify areas similar to the Houston Heights neighborhood where we lived before we retired. I used location data for RV Parks and Costco Warehouses in the United States along with demographic data for ZCTAs from the Census Bureau to identify 198 unique locations to consider. I then retrieved data for each location's 100 most popular venues. The k-means clustering method identified 34 unique areas with venues similar to those of the Houston Heights neighborhood. The clustering identified several locations that we have visited, have enjoyed, and have considered spending at least one month in at some point in the future. As a result, I believe it is highly likely that we would enjoy the areas identified by the k-means clustering method that we have not yet visited.
[1] Outliers are points outside of the range from (1st quartile - 1.5 × IQR) to (3rd quartile + 1.5 × IQR), where IQR is the difference between the 3rd quartile and the 1st quartile.