This project was the final assignment of the IBM Data Science Professional Certificate. It applies many of the tools and methods learned throughout the course, such as IBM Watson Studio, the Foursquare API and various Python libraries (e.g. pandas, folium, matplotlib, seaborn, numpy, geopy and scikit-learn). For a self-chosen fictional challenge around the idea of a “Battle of Neighborhoods”, I compared the neighborhoods of Queens (a borough of New York City) based on venue data fetched from Foursquare. The project was executed in a Jupyter notebook, which is available on my GitHub page.
1. Description of the problem and a discussion of the background
I was approached by a friend who wanted to open a new restaurant in Queens (a borough of New York City). He was undecided between a Chinese restaurant and an Indian restaurant and wanted my help to choose. In addition, he wanted my help to decide which neighborhood(s) he should or should not consider.
Queens is the largest of the five New York City boroughs by area and the second-largest by population, with c. 2.2 million residents.
Target Audience
In addition to my friend, there are others who would be interested in this project, e.g.:
- People who want to invest in or open a restaurant.
- Aspiring data scientists who want to learn certain techniques/libraries used in this project.
- Tourists who want to know in which district of Queens they will find certain types of restaurants.
Data
For this project I have used the following data:
- New York City data that contains boroughs and neighborhoods along with their latitudes and longitudes
- Data Source: https://cocl.us/new_york_dataset
- Description: This data set contains the required information. I will use this data set to explore various neighborhoods of Queens.
- Restaurant venues in the neighborhoods of Queens, New York City
- Data Source: Foursquare API
- Description: By using this API I will get all the venues in each Queens neighborhood. I will then filter these venues for restaurants only.
2. Methodology
In this project, I will leverage the Foursquare API to explore neighborhoods in Queens (New York City). I will use the explore function to get the most common restaurant categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. I will use the k-means clustering algorithm to complete this task. Finally, I will use the Folium library to visualize the neighborhoods in Queens and their emerging clusters.
Preparation – download and import all required libraries
Before we get the data and start exploring it, let’s download all the dependencies that we will need.
Download and Explore Dataset
New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.
Luckily, this dataset exists for free on the web.
Load and explore the data
Next, let’s load the data.
Let’s take a quick look at the data.
Notice how all the relevant data is in the features key, which is basically a list of the neighborhoods. So, let’s define a new variable that includes this data.
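A minimal sketch of this step, assuming the dataset has been saved locally as a JSON file with a top-level features list. The function name load_features and the tiny sample dictionary are illustrative stand-ins (with made-up coordinates), used so the snippet runs without a download.

```python
import json


def load_features(path):
    """Read a JSON file and return the list stored under its 'features' key."""
    with open(path) as f:
        data = json.load(f)
    return data["features"]


# Stand-in for the real file: same assumed shape, a single neighborhood.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "properties": {"borough": "Queens", "name": "Astoria"},
            "geometry": {"coordinates": [-73.9, 40.7]},  # illustrative values
        }
    ],
}

# The new variable holds only the list of neighborhoods.
neighborhoods_data = sample["features"]
print(len(neighborhoods_data), "neighborhood(s) loaded")
```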


Transform the data into a pandas dataframe
The next task is essentially transforming this data of nested Python dictionaries into a pandas dataframe. So let’s start by creating an empty dataframe.
Take a look at the empty dataframe to confirm that the columns are as intended.
Then let’s loop through the data and fill the dataframe one row at a time.
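The transformation could look roughly like the sketch below. Instead of appending to an empty dataframe row by row, it collects the rows in a list and builds the dataframe once, which gives the same result and also works on recent pandas versions. The nested key names (properties.borough, properties.name, geometry.coordinates in longitude/latitude order) are assumptions about the file layout.

```python
import pandas as pd

COLUMNS = ["Borough", "Neighborhood", "Latitude", "Longitude"]


def features_to_dataframe(features):
    """Flatten a list of nested feature dictionaries into one row each."""
    rows = []
    for feature in features:
        # Assumed GeoJSON ordering: longitude first, then latitude.
        lon, lat = feature["geometry"]["coordinates"][:2]
        rows.append(
            {
                "Borough": feature["properties"]["borough"],
                "Neighborhood": feature["properties"]["name"],
                "Latitude": lat,
                "Longitude": lon,
            }
        )
    return pd.DataFrame(rows, columns=COLUMNS)


# Illustrative input with made-up coordinates.
sample_features = [
    {"properties": {"borough": "Queens", "name": "Astoria"},
     "geometry": {"coordinates": [-73.9, 40.7]}},
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.8, 40.9]}},
]
neighborhoods = features_to_dataframe(sample_features)
print(neighborhoods)
```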
Quickly examine the resulting dataframe.
And make sure that the dataset has all 5 boroughs and 306 neighborhoods.
However, for analysis purposes, let’s slice the original dataframe and create a new dataframe of the Queens data.
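As a rough illustration of the check and the slice, the sketch below counts the unique boroughs and neighborhoods and then keeps only the Queens rows. The small dataframe is synthetic; on the full dataset the counts would be expected to be 5 and 306.

```python
import pandas as pd

# Synthetic stand-in for the full neighborhoods dataframe.
neighborhoods = pd.DataFrame(
    {
        "Borough": ["Queens", "Bronx", "Queens"],
        "Neighborhood": ["Astoria", "Wakefield", "Flushing"],
        "Latitude": [40.7, 40.9, 40.8],
        "Longitude": [-73.9, -73.8, -73.8],
    }
)

print(
    "The dataframe has {} boroughs and {} neighborhoods.".format(
        neighborhoods["Borough"].nunique(), neighborhoods.shape[0]
    )
)


def slice_borough(df, borough):
    """Keep only the rows of one borough and renumber the index from 0."""
    return df[df["Borough"] == borough].reset_index(drop=True)


queens_data = slice_borough(neighborhoods, "Queens")
print(queens_data)
```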
Use geopy library to get the latitude and longitude values of Queens.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent ny_explorer, as shown below.
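A hedged sketch of the geocoding step, using geopy's Nominatim geocoder with the ny_explorer user agent. The helper name get_coordinates and the address string "Queens, NY" are my own assumptions. The geopy import happens inside the function so the helper can also be exercised with a stand-in geocoder, since the real call needs network access.

```python
def get_coordinates(address, geocoder=None):
    """Return (latitude, longitude) for an address, or None if nothing is found."""
    if geocoder is None:
        # Imported lazily; requires the geopy package and network access.
        from geopy.geocoders import Nominatim

        geocoder = Nominatim(user_agent="ny_explorer")
    location = geocoder.geocode(address)
    if location is None:
        return None
    return location.latitude, location.longitude


# Real usage (performs a web request, so it is left commented out here):
# latitude, longitude = get_coordinates("Queens, NY")
```

Returning None for an unknown address avoids an attribute error further down, because geocode itself returns None when it finds no match.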


Define Foursquare Credentials and Version
I have redacted my credentials.
Explore Neighborhoods in Queens
Let’s create a function to repeat the same process for all the neighborhoods in Queens.

Now write the code to run the above function on each neighborhood and create a new dataframe called queens_venues.
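The sketch below shows one way such a function could be structured. It assumes the legacy Foursquare v2 venues/explore endpoint and the response layout I remember from it (response, groups, items, venue), so treat both as assumptions; that endpoint may no longer be available to new accounts. The radius and limit defaults, the placeholder credentials and the fetch parameter are illustrative choices. The fetch hook lets the example run against a canned payload instead of the live API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

import pandas as pd

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"  # legacy endpoint (assumed)
VENUE_COLUMNS = ["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",
                 "Venue", "Venue Latitude", "Venue Longitude", "Venue Category"]


def build_explore_url(lat, lng, client_id, client_secret, version, radius=500, limit=100):
    """Assemble the request URL for one neighborhood."""
    params = {"client_id": client_id, "client_secret": client_secret, "v": version,
              "ll": f"{lat},{lng}", "radius": radius, "limit": limit}
    return f"{EXPLORE_URL}?{urlencode(params)}"


def parse_explore_response(name, lat, lng, payload):
    """Turn one JSON payload into a list of flat venue rows."""
    items = payload["response"]["groups"][0]["items"]
    return [(name, lat, lng, v["venue"]["name"], v["venue"]["location"]["lat"],
             v["venue"]["location"]["lng"], v["venue"]["categories"][0]["name"])
            for v in items]


def get_nearby_venues(names, latitudes, longitudes, credentials, fetch=None):
    """Query every neighborhood and stack the results into one dataframe."""
    fetch = fetch or (lambda url: json.load(urlopen(url)))  # live HTTP call by default
    rows = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        payload = fetch(build_explore_url(lat, lng, **credentials))
        rows.extend(parse_explore_response(name, lat, lng, payload))
    return pd.DataFrame(rows, columns=VENUE_COLUMNS)


# Canned payload and placeholder credentials, so no request is sent here.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Spice House", "location": {"lat": 40.71, "lng": -73.91},
               "categories": [{"name": "Indian Restaurant"}]}}]}]}}
credentials = {"client_id": "YOUR_ID", "client_secret": "YOUR_SECRET", "version": "20180605"}
queens_venues = get_nearby_venues(["Astoria"], [40.7], [-73.9], credentials,
                                  fetch=lambda url: fake_payload)
print(queens_venues)
```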
Quickly examine the resulting dataframe.
Analyze Each Neighborhood
As you see, the above dataframe includes all types of venues (e.g. restaurants, shops and gyms). Let’s filter on restaurants and add “dummies” (0/1) for each restaurant type for each row.
Next, let’s group the rows by neighborhood and take the mean of the frequency of occurrence of each category.
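The filtering, one-hot encoding and grouping could be sketched as follows. Selecting restaurants by looking for the word "Restaurant" in the category name is an assumption about how the filter was done, and the venue rows are synthetic.

```python
import pandas as pd

# Synthetic stand-in for the venues dataframe.
queens_venues = pd.DataFrame(
    {
        "Neighborhood": ["Astoria", "Astoria", "Astoria", "Flushing", "Flushing"],
        "Venue Category": [
            "Indian Restaurant", "Gym", "Chinese Restaurant",
            "Chinese Restaurant", "Chinese Restaurant",
        ],
    }
)

# Keep only restaurant venues (assumed rule: category name contains "Restaurant").
restaurants = queens_venues[queens_venues["Venue Category"].str.contains("Restaurant")]

# One 0/1 column per restaurant type, plus the neighborhood of each row.
onehot = pd.get_dummies(restaurants["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", restaurants["Neighborhood"])

# The mean of the 0/1 columns is the share of each type within a neighborhood.
queens_grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(queens_grouped)
```

Because each row is a single venue, the mean of a dummy column equals the fraction of that neighborhood's restaurants belonging to the category.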
Let’s print each neighborhood along with the top 5 most common venues
Let’s put that into a pandas dataframe
First, let’s write a function to sort the venues in descending order.
Now let’s create the new dataframe and display the top 10 venues for each neighborhood.
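A possible shape for the sorting function and the resulting table is sketched below. The function and column names are illustrative, and the example uses the top 2 of 3 synthetic categories rather than the top 10 to stay small; the requested number must not exceed the number of category columns.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    """Return the category names of one row, most frequent first."""
    frequencies = row.iloc[1:].astype(float)  # skip the Neighborhood column
    return frequencies.sort_values(ascending=False).index[:num_top_venues].tolist()


def top_venues_frame(grouped, num_top_venues):
    """Build one row per neighborhood listing its most common categories."""
    columns = ["Neighborhood"] + [
        f"Most Common Venue #{i + 1}" for i in range(num_top_venues)
    ]
    rows = [
        [row["Neighborhood"], *return_most_common_venues(row, num_top_venues)]
        for _, row in grouped.iterrows()
    ]
    return pd.DataFrame(rows, columns=columns)


# Synthetic grouped frequencies.
queens_grouped = pd.DataFrame(
    {
        "Neighborhood": ["Astoria", "Flushing"],
        "Chinese Restaurant": [0.2, 0.7],
        "Indian Restaurant": [0.5, 0.1],
        "Thai Restaurant": [0.3, 0.2],
    }
)
neighborhoods_venues_sorted = top_venues_frame(queens_grouped, num_top_venues=2)
print(neighborhoods_venues_sorted)
```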
Cluster Neighborhoods
A fundamental step for any unsupervised clustering algorithm is to determine the optimal number of clusters into which the data may be grouped. The Elbow Method is one of the most popular methods to determine this optimal value of k. I’m using the k-means clustering technique from the scikit-learn library.
To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion. For the given data, we conclude that the optimal number of clusters is 7.
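To make the elbow idea concrete, here is a minimal sketch that fits k-means for a range of k values and records the inertia of each fit. The data is synthetic (three well-separated blobs), and the helper name inertia_by_k, the n_init value and the random seed are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(features, k_values, random_state=0):
    """Fit k-means once per k and return {k: inertia}."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=random_state)
        .fit(features)
        .inertia_
        for k in k_values
    }


# Synthetic data: three tight blobs, so the elbow should appear around k = 3.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(30, 2)) for c in centers])

inertias = inertia_by_k(points, range(1, 7))
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

Plotting these values against k (for example with matplotlib) gives the elbow curve; the k where the sharp drop flattens out is the candidate number of clusters.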
Run k-means to cluster the neighborhoods into 7 clusters.
Let’s create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
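A sketch of the clustering and merge steps under simplifying assumptions: the input is a tiny synthetic frequency table, the cluster count is passed in as a parameter (in the project it is the value chosen from the elbow plot), and the helper name cluster_neighborhoods is my own. The same merge could also join the top-venues table on the Neighborhood column.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_neighborhoods(grouped, n_clusters, random_state=0):
    """Cluster neighborhoods on their category frequencies and return the labels."""
    features = grouped.drop(columns="Neighborhood")
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    model.fit(features)
    labelled = grouped[["Neighborhood"]].copy()
    labelled.insert(0, "Cluster Labels", model.labels_)
    return labelled


# Synthetic frequencies: A and B are Chinese-heavy, C and D are Indian-heavy.
queens_grouped = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D"],
        "Chinese Restaurant": [1.0, 0.9, 0.0, 0.1],
        "Indian Restaurant": [0.0, 0.1, 1.0, 0.9],
    }
)
# Synthetic coordinates for the same neighborhoods.
queens_data = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D"],
        "Latitude": [40.70, 40.71, 40.72, 40.73],
        "Longitude": [-73.80, -73.81, -73.82, -73.83],
    }
)

labelled = cluster_neighborhoods(queens_grouped, n_clusters=2)
queens_merged = queens_data.merge(labelled, on="Neighborhood")
print(queens_merged)
```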
3. Result
We can start by visualizing the frequency of the 10 most frequently occurring restaurant types in Queens, using the seaborn/matplotlib packages.
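One way to produce such a chart is sketched below: count the venues per category, keep the top n, and draw a horizontal bar chart. The venue rows are synthetic and the function names are illustrative. The plotting libraries are imported inside the plotting function so the counting step can be used on its own.

```python
import pandas as pd


def top_categories(venues, n=10):
    """Count venues per category and keep the n most frequent ones."""
    return venues["Venue Category"].value_counts().head(n)


def plot_top_categories(counts):
    """Draw a horizontal bar chart of the category counts and return the figure."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(8, 4))
    sns.barplot(x=counts.values, y=counts.index.tolist(), ax=ax)
    ax.set_xlabel("Number of venues")
    ax.set_ylabel("Restaurant type")
    fig.tight_layout()
    return fig


# Synthetic restaurant rows.
restaurants = pd.DataFrame(
    {
        "Venue Category": [
            "Chinese Restaurant", "Chinese Restaurant", "Chinese Restaurant",
            "Thai Restaurant", "Thai Restaurant", "Indian Restaurant",
        ]
    }
)
counts = top_categories(restaurants, n=10)
print(counts)
# fig = plot_top_categories(counts)  # uncomment to render the chart
```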
As seen, Chinese restaurants are the most common restaurant type in Queens, about three times as frequent as Indian restaurants. Since fewer existing Indian restaurants suggests less direct competition, this indicates that my friend should probably open an Indian restaurant rather than a Chinese restaurant.
The table below shows the neighborhoods and their most common restaurants. Each neighborhood has now been assigned one of seven cluster labels, from 0 to 6.
Finally, let’s visualize the resulting clusters. We can now use the cluster labels to show the neighborhoods marked with a cluster-specific color on a map, again using folium:
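A rough sketch of how the cluster map could be drawn with folium's Map and CircleMarker. The color palette, marker size, zoom level and helper names are illustrative choices, and the column names (Neighborhood, Latitude, Longitude, Cluster Labels) are assumed to match the merged dataframe. folium is imported inside the function, so the color helper works even where folium is not installed.

```python
# One color per cluster label; seven entries for seven clusters (illustrative palette).
PALETTE = ["red", "blue", "green", "purple", "orange", "darkred", "cadetblue"]


def marker_color(label):
    """Map a cluster label to a color, wrapping around if there are more labels."""
    return PALETTE[int(label) % len(PALETTE)]


def build_cluster_map(merged, center, zoom_start=11):
    """Return a folium map with one colored circle per neighborhood."""
    import folium  # requires the folium package

    cluster_map = folium.Map(location=list(center), zoom_start=zoom_start)
    for _, row in merged.iterrows():
        color = marker_color(row["Cluster Labels"])
        folium.CircleMarker(
            location=[row["Latitude"], row["Longitude"]],
            radius=5,
            popup=f"{row['Neighborhood']} (cluster {row['Cluster Labels']})",
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7,
        ).add_to(cluster_map)
    return cluster_map


# Usage, given a merged dataframe and the Queens coordinates from the geocoder:
# build_cluster_map(queens_merged, center=(latitude, longitude))
```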
Examine Clusters
Now, we can examine each cluster and determine the discriminating restaurant types that distinguish each cluster.
Cluster 1
As seen in the table above, all these neighborhoods have ‘Indian Restaurant’ as the most common restaurant type. This indicates that my friend shouldn’t open the restaurant in these neighborhoods.
Cluster 2
Cluster 3
Cluster 4
Cluster 5
Cluster 6
Cluster 7
4. Discussion
According to the analysis, my friend should open an Indian restaurant rather than a Chinese restaurant. In addition, my friend should not open an Indian restaurant in the neighborhoods of Cluster 1 as these have ‘Indian Restaurant’ as the most common restaurant type.
5. Conclusion
To conclude, this project gave me a small glimpse of what a real-life data science project looks like. I used some frequently used Python libraries to fetch web data, handle JSON files, query the Foursquare API to explore the neighborhoods of Queens, and plot graphs.
This is just one example of the fantastic data science use cases we can realize by applying freely available technology.
Acknowledgement & sources
A number of publications have inspired this piece of work and helped me develop the skills to run this analysis and the difficult coding behind it. The courses of the IBM Data Science Professional Certificate played an important role here.