A Data Tour Through Los Angeles Neighborhoods

Summary

With close to 4 million people, Los Angeles is a cultural hodgepodge made up of far more diverse cultures and neighborhoods than Hollywood portrays on TV. Newcomers to LA will most likely have questions about which neighborhoods are affordable and safe to live in, and which ones are fun for a night out. This data-driven guide aims to answer those questions for young professionals who are interested in relocating to Los Angeles but are unsure which neighborhood to live in.

To accomplish this, I scraped data from several sources:

Rental price data from the American Community Survey
Historical crime data (2014-2018) from LAPD
Neighborhood venues data from Foursquare
List of LA neighborhoods from LA Times

I started off by performing basic exploratory data analysis to understand various characteristics of the neighborhoods. I then used K-means clustering to group the neighborhoods into 10 categories based on venue categories, city population, median rental prices, and safety ranking (derived from the average reported crimes per capita). The clustering results show areas in the City of Los Angeles with clearly defined boundaries. Not surprisingly, the high-income, high-safety-ranking areas lie near the coast in west LA, whereas the low-income, low-safety-ranking areas are located mostly in downtown, central, and southeast LA. My conclusion upon completing this machine learning project is that I recommend that young professionals relocate to west LA, in the areas north of LAX. Despite its above-average rent, the area offers safe neighborhoods with lots of good coffee shops and restaurants. Conversely, I would not recommend living in or around downtown LA: despite the below-average rent, crime rates there are high.

Introduction
1.1 Question of interest
1.2 Who may find this report interesting?
Data sources
2.1 LA Crime Data
2.2 Median rental prices
2.3 Location and venue data
Methodology
3.1 Exploratory data analysis: Location, housing, crime
3.2 K-means clustering
Results
Discussion
Conclusion

1. Introduction

Whether you’re a seasoned Angeleno, a recent transplant from somewhere else, or just someone contemplating a move to the City of Angels, searching for a young, fun, and affordable neighborhood in Los Angeles can be daunting. With 16 regions and 200+ neighborhoods in LA, how do you even begin to find out which neighborhood best fits your personality and lifestyle? If you are on a budget, how do you know which neighborhood offers the most affordable housing without compromising safety? There are many questions to ask, so to help you better understand the different neighborhoods and make an informed choice, in this final capstone project I will leverage the power of data to separate LA neighborhoods into different categories. In doing so, I hope to help you find your dream neighborhood in LA.

1.1 Question of interest

The main question I would like to ask is this: for young professionals starting off in the working world, which neighborhoods in LA are most suitable to live in? When I imagine a suitable neighborhood for young people, I can think of 3 main criteria that are important to consider:

Rental price: The rent must be affordable. If most of your hard-earned money goes to rent, then little disposable income will be left for all the other fun stuff. The general rule-of-thumb for rent is ~30% of take home pay per month.
Safety: Nobody wants to live somewhere that, although very affordable, has implicit curfews at night because the neighborhood isn’t safe. On the other hand, maybe it’s not necessary to live in an expensive gated community with 24-hour security while having to shell out most of your paycheck just so you feel safe.
Entertainment: What do young and restless minds do when they are not working? They are enjoying life! An ideal neighborhood should provide a vast array of entertainment options that are affordable and diverse.
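To make the rent rule of thumb concrete, here is a minimal sketch that turns monthly take-home pay into a rent budget. The function name `max_affordable_rent` and the example pay figure are hypothetical, chosen only for illustration; the 30% share is the rule of thumb quoted above.

```python
def max_affordable_rent(monthly_take_home, rent_share=0.30):
    """Return the largest monthly rent that stays within rent_share of take-home pay."""
    if monthly_take_home < 0:
        raise ValueError("monthly_take_home must be non-negative")
    return monthly_take_home * rent_share


# hypothetical example: $4,000 take-home pay per month
budget = max_affordable_rent(4000)
print("Rent budget: ${:,.0f} per month".format(budget))
```

A budget like this can later be compared against the median rent of each neighborhood to narrow down the candidates.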

As with anything in life, there is no 100% perfect choice; however, when we strike a balance between the things we think are important, the optimal choice usually emerges. In the case of finding an ideal neighborhood to live in, I believe safety, housing prices, and entertainment options are the three critical criteria that a good neighborhood should balance. With that in mind, let’s move on to the data requirements!

1.2 Who may find this report interesting?

This report will be interesting to those who currently live in Los Angeles and are looking to relocate to another neighborhood, or to prospective LA transplants who are deciding where to settle once they relocate.

2. Data sources

For this study, I used data from the following sources:

LA Crime Data from 2010 to Present (Source: LA Open Data)
American Community Survey median rental prices (Source: US Census Bureau)
Location and venue information (Source: Foursquare)

2.1 LA Crime Data

LA Open Data provides Crime Data from 2010 to Present in Los Angeles. Here’s the description:

This dataset reflects incidents of crime in the City of Los Angeles dating back to 2010. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.

The dataset contains the following features:

Index(['DR Number', 'Date Reported', 'Date Occurred', 'Time Occurred', 'Area ID', 'Area Name', 'Reporting District', 'Crime Code', 'Crime Code Description', 'MO Codes', 'Victim Age', 'Victim Sex', 'Victim Descent', 'Premise Code', 'Premise Description', 'Weapon Used Code', 'Weapon Description', 'Status Code', 'Status Description', 'Crime Code 1', 'Crime Code 2', 'Crime Code 3', 'Crime Code 4', 'Address', 'Cross Street', 'Location '], dtype='object')

See the LA City landing page for descriptions of the features in the dataset.
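The dataset description notes that missing locations are recorded as (0°, 0°). Left in place, such rows would pull any averaged coordinates toward the origin. The sketch below shows one way to parse the `'Location '` column (note the trailing space in the column name) and drop those placeholders. The helper name `parse_location` and the sample rows are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# toy rows in the same "(lat, lng)" string format as the 'Location ' column
sample = pd.DataFrame({
    'Area Name': ['Southeast', 'Southeast', 'Harbor'],
    'Location ': ['(33.9393, -118.2587)', '(0.0, 0.0)', '(33.7593, -118.2468)'],
})


def parse_location(df, col='Location '):
    """Split "(lat, lng)" strings into numeric columns and drop (0, 0) placeholders."""
    coords = df[col].str.strip('()').str.split(',', expand=True)
    out = df.assign(
        Latitude=pd.to_numeric(coords[0].str.strip()),
        Longitude=pd.to_numeric(coords[1].str.strip()),
    )
    # rows recorded at exactly (0, 0) are treated as missing locations
    missing = (out['Latitude'] == 0) & (out['Longitude'] == 0)
    return out.loc[~missing].drop(columns=[col])


clean = parse_location(sample)
print(clean)
```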

2.2 Median rental prices

I used the rental price data from USC’s Neighborhood Data for Social Change (NDSC) platform. The data description is below.

Gross rent provides information on the monthly housing cost expenses for individual renters. Gross rent is the contract rent plus the estimated average monthly cost of utilities (electricity, gas, and water and sewer) and fuels (oil, coal, kerosene, wood, etc.) if these are paid by the renter (or paid for the renter by someone else). Gross rent is intended to eliminate differentials that result from varying practices with respect to the inclusion of utilities and fuels as part of the rental payment. The estimated costs of water and sewer, and fuels are reported on a 12-month basis but are converted to monthly figures for the tabulations.

The data source is the American Community Survey (ACS). According to Wikipedia:

The American Community Survey (ACS) is an ongoing survey by the U.S. Census Bureau. It regularly gathers information previously contained only in the long form of the decennial census, such as ancestry, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics. These data are used by many public-sector, private-sector, and not-for-profit stakeholders to allocate funding, track shifting demographics, plan for emergencies, and learn about local communities. Sent to approximately 295,000 addresses monthly (or 3.5 million per year), it is the largest household survey that the Census Bureau administers.

2.3 Location and venue data

To get venue information for LA neighborhoods, I used the Foursquare API, which provides regular endpoints for fetching the following information:

Price
Like count
Rating
Category
Postal code

These fields can be used to qualitatively evaluate the characteristics of each neighborhood, and can serve as clustering features to group similar neighborhoods.

3. Methodology

3.1 Exploratory data analysis

3.1.1 Download LA crime dataset, neighborhood, and population data

# Download LA crime data
# !wget -O 'crime_data.csv' 'https://data.lacity.org/api/views/y8tr-7khq/rows.csv?accessType=DOWNLOAD' # uncomment if file hasn't been downloaded before
# print('LA crime data downloaded!')

# import required libraries
import pandas as pd
import requests
import numpy as np
import pprint
from pandas.io.json import json_normalize
# !conda install -q -c conda-forge geopy --yes # uncomment if library not installed
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
# !conda install -q -c conda-forge folium=0.5.0 --yes # uncomment if library not installed
import folium
# !pip install -q python-zillow # uncomment if library not installed
# import zillow
# !pip install -q shapely # uncomment if library not installed
from shapely.geometry import shape, Point
import json
from bs4 import BeautifulSoup
from matplotlib import cm
import matplotlib.colors

print('Libraries loaded')

Libraries loaded

# set up credentials
# FOURSQUARE API credentials
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30

# scrape the LA neighborhood list from the LA Times website
url = requests.get('http://maps.latimes.com/neighborhoods/population/total/neighborhood/list/').text
soup = BeautifulSoup(url, 'lxml')

# print(soup.prettify())

# extract the list of cities from LA
my_table = soup.find('table', {'class': 'datagrid'}) # returns a bs4 object
neighborhoods = []
populations = []

for row in my_table.find_all('a', href=True):
    neighborhoods.append(row.find(text=True))
    
for row in my_table.find_all('td', text=True):
    populations.append(row.find(text=True))
populations = populations[1::2] # skip rankings

neigh_pop_df = pd.DataFrame({'Neighborhood':neighborhoods, 'Population':populations})

# clean up neighborhood names and remove ones that don't return coordinates 
drop_neighborhoods = ['Unincorporated Santa Monica Mountains', 
                      'Castaic Canyons', 
                      'Tujunga Canyons',
                     'Unincorporated Santa Susana Mountains',
                     'Unincorporated Catalina Island']
neigh_pop_df = neigh_pop_df[~neigh_pop_df['Neighborhood'].isin(drop_neighborhoods)]
neigh_pop_df['Neighborhood'] = neigh_pop_df['Neighborhood'].str.split('/').str[0].str.split(',').str[0].str.split('e-L').str[0]
neigh_pop_df = neigh_pop_df.sort_values('Neighborhood', ascending=False)
neigh_pop_df['Population'] = pd.to_numeric(neigh_pop_df['Population'].str.replace(',', ''))

print(neigh_pop_df.shape)
neigh_pop_df.head()
# neigh_pop_df.shape
# neigh_pop_df.info()

(260, 2)

	Neighborhood	Population
46	Woodland Hills	59661
68	Winnetka	46943
225	Windsor Square	5642
61	Wilmington	51517
102	Willowbrook	34336

3.1.2 Location data from Foursquare

Next, fetch and clean LA venue information from Foursquare.

# define function to get venue information based on latitude and longitude
def getCityVenues(names, radius=1000, limit=100):
    venues_list = []
    c = 1
    for name in names:
        # get latitude and longitude of the neighborhood
        address = name + ', CA'
        print(str(c) + ': ' + address)
        geolocator = Nominatim(user_agent="la_explorer", timeout=3)
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude

        # make the API call
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude, 
            radius,
            limit)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # add venue name, location, category, and city
        venues_list.append([(
            name,
            latitude,
            longitude,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
        
        c += 1

    la_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    la_venues.columns = ['City',
                        'City latitude', 
                        'City longitude',
                        'Venue',
                        'Venue latitude',
                        'Venue longitude',
                        'Venue category']
    return la_venues

# get venue information for all neighborhoods
# venues_df = getCityVenues(names=neigh_pop_df['Neighborhood']) # takes a few minutes
# print('Venue information done')
# venues_df.to_pickle('venues.pkl')

# load from saved pickle
venues_df = pd.read_pickle('venues.pkl')
print('Venue information loaded')

# check the dataframe
print(venues_df.shape)
venues_df.head()

Venue information loaded
(8335, 7)

	City	City latitude	City longitude	Venue	Venue latitude	Venue longitude	Venue category
0	Winnetka	34.201262	-118.571087	Collector's Paradise	34.200163	-118.571371	Hobby Shop
1	Winnetka	34.201262	-118.571087	Starbucks	34.200803	-118.570514	Coffee Shop
2	Winnetka	34.201262	-118.571087	Johnny Pacific	34.208097	-118.570644	Latin American Restaurant
3	Winnetka	34.201262	-118.571087	Crazy Harry's Bar	34.208189	-118.571561	Bar
4	Winnetka	34.201262	-118.571087	Guido's Pizza & Pasta	34.202035	-118.571647	Pizza Place

# check the # of venues returned for each city
venues_df.groupby('City').count();

# clean up data
# one hot encoding
la_venue_onehot = pd.get_dummies(venues_df[['Venue category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
la_venue_onehot['City'] = venues_df['City'] 

# move neighborhood column to the first column
fixed_columns = ['City'] + [col for col in la_venue_onehot.columns if col != 'City']
la_venue_onehot = la_venue_onehot[fixed_columns]

# group venues by city and calculate frequency of occurrence
la_venue_onehot_grouped = la_venue_onehot.groupby('City').mean().reset_index()
la_venue_onehot_grouped

# fixed_columns
# la_venue_onehot.columns[:100]
print(la_venue_onehot_grouped.shape)
la_venue_onehot_grouped.head()

(232, 434)

	City	ATM	Airport	...	Wine Bar	Wings Joint
0	Acton	0.00000	0.00000	...	0.000000	0.000000
1	Adams-Normandie	0.02439	0.02439	...	0.000000	0.000000
2	Agoura Hills	0.00000	0.00000	...	0.000000	0.000000
3	Agua Dulce	0.00000	0.00000	...	0.000000	0.000000
4	Alhambra	0.00000	0.00000	...	0.014085	0.028169

5 rows × 434 columns

3.1.3 Rental price data

# uncomment if file hasn't been downloaded before
# !wget -q -O 'la_rental_prices.csv' 'https://usc.data.socrata.com/api/views/4a97-v5tx/rows.csv?accessType=DOWNLOAD'
# print('LA rental prices data downloaded!')

# Clean up rental price data
prices_df = pd.read_csv('la_rental_prices.csv')
prices_df.head()
# prices_df.drop(['Policy Area', 'Dataset', 'Variable', 'Tract', 'GEOID', 'Row ID', 'Date'], axis=1, inplace=True)
prices_df.drop(['Policy Area', 'Dataset', 'Variable', 'Tract', 'GEOID', 'Row ID', 'Date',
       'Neighborhoods_from_gw34_aa8i', 'Neighborhoods_from_5jp5_kw3x',
       'New Neighborhoods_from_s3dq_ztay', 'layer_0_from_mgnu_zr5s',
       'layer_0_from_yr99_sgmi', 'LA Neighborhoods_from_5ikr_59ci',
       'City Boundaries_from_22pk_kkuz', 'Los Angeles Incorporated Cities',
       'la_county_tracts_from_jczq_z5ui', 'DPW_CITY_BOUNDARIES_from_dmu2_xaer',
       ':@computed_region_yak3_5aqv', ':@computed_region_izm9_hpec',
       ':@computed_region_aysi_wk4g', ':@computed_region_an3c_5a9u'], axis=1, inplace=True)

# select data for 2016 (most recent)
prices_df = prices_df[prices_df['Year'] == 2016]

# split location into latitude and longitude
coordinates = prices_df['Location'].str.split(',', expand=True)
coordinates.columns = ['Latitude', 'Longitude']
coordinates['Latitude'] = pd.to_numeric(coordinates['Latitude'].str.strip('('))
coordinates['Longitude'] = pd.to_numeric(coordinates['Longitude'].str.strip(')'))
prices_df = pd.concat([prices_df, coordinates], axis=1)
prices_df.drop(['Location'], axis=1, inplace=True)

# # drop rows without rent 
prices_df.dropna(inplace=True)

# keep only entries per neighborhood with largest tract number (assumed to be most recent)
prices_df = prices_df.sort_values('Tract Number', ascending=False).drop_duplicates('Neighborhood')
prices_df.reset_index(inplace=True)
prices_df.drop('index', axis=1, inplace=True)

# rename columns
col_names = ['Year', 'Median Rent', 'Tract Number', 'Neighborhood', 'Latitude', 'Longitude']
prices_df.columns = col_names

# # final clean up
prices_df.drop(['Year', 'Tract Number'], axis=1, inplace=True)

# prices_df.shape
prices_df.head()

	Median Rent	Neighborhood	Latitude	Longitude
0	960.0	Wilmington	33.778611	-118.295234
1	719.0	Elysian Park	34.081932	-118.238222
2	975.0	Van Nuys	34.207747	-118.490984
3	1263.0	Angeles Crest	34.294753	-117.913563
4	2050.0	Tujunga Canyons	34.340966	-118.294994

Merge neigh_pop_df with prices_df

# combine neighborhood, population, median rent, and location
neigh_rent_df = pd.merge(left=neigh_pop_df, right=prices_df, left_on='Neighborhood', right_on='Neighborhood')
print(neigh_rent_df.shape)
neigh_rent_df.head()

(247, 5)

	Neighborhood	Population	Median Rent	Latitude	Longitude
0	Woodland Hills	59661	2669.0	34.146928	-118.591402
1	Winnetka	46943	1022.0	34.190469	-118.579700
2	Windsor Square	5642	1643.0	34.065326	-118.318465
3	Wilmington	51517	960.0	33.778611	-118.295234
4	Willowbrook	34336	1009.0	33.914153	-118.227834
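The inner merge above silently drops any neighborhood whose name does not appear in both tables (260 rows go in, 247 come out). A quick way to see which names failed to match is an outer merge with `indicator=True`. The sketch below uses a few rows copied from the tables above plus one invented row ('Mid-City' with a made-up population), so the frames `pop` and `rent` are stand-ins rather than the real data.

```python
import pandas as pd

# stand-in frames; 'Mid-City' and its population are invented for illustration
pop = pd.DataFrame({'Neighborhood': ['Winnetka', 'Wilmington', 'Mid-City'],
                    'Population': [46943, 51517, 10000]})
rent = pd.DataFrame({'Neighborhood': ['Winnetka', 'Wilmington', 'Angeles Crest'],
                     'Median Rent': [1022.0, 960.0, 1263.0]})

# an outer merge keeps every row; '_merge' records which side each row came from
audit = pd.merge(pop, rent, on='Neighborhood', how='outer', indicator=True)
unmatched = audit.loc[audit['_merge'] != 'both', ['Neighborhood', '_merge']]
print(unmatched)
```

Rows marked `left_only` or `right_only` are candidates for name clean-up before the real merge.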

3.1.4 LA Crime data

For this analysis, I use only the most recent five years (2014-2018) of the Crime Data from 2010 to Present dataset. According to the LA City landing page, the following features are useful for our analysis:

Area ID: The LAPD has 21 Community Police Stations referred to as Geographic Areas within the department. These Geographic Areas are sequentially numbered from 1-21.
Area Name: The 21 Geographic Areas or Patrol Divisions are also given a name designation that references a landmark or the surrounding community that it is responsible for.
Location: The location where the crime incident occurred. Actual address is omitted for confidentiality. XY coordinates reflect the nearest 100 block.

There are many other features, such as crimes committed, victim information, weapons, and premises, that would allow a much more in-depth analysis of past crimes in each neighborhood, but for a macro view of LA neighborhood safety these three features should suffice.

# read data
crime_df = pd.read_csv('crime_data.csv') # takes ~10 s to read

# regex to find rows from years 2014-2019
crime_df = crime_df[crime_df['Date Occurred'].str.contains('.*(201[4-9])', regex=True)]

# keep the columns we want
crime_df = crime_df.loc[:, ['Area ID', 'Area Name', 'Location ']].reset_index()

# separate Location into latitude and longitude
coordinates = crime_df['Location '].str.split(',', expand=True)
coordinates.columns = ['Latitude', 'Longitude']
coordinates['Latitude'] = pd.to_numeric(coordinates['Latitude'].str.strip('('))
coordinates['Longitude'] = pd.to_numeric(coordinates['Longitude'].str.strip(')'))
crime_df = pd.concat([crime_df, coordinates], axis=1)
crime_df.drop(['index', 'Location '], axis=1, inplace=True)

# group into neighborhoods
crime_df.tail()

	Area ID	Area Name	Latitude	Longitude
1149914	18	Southeast	33.9393	-118.2587
1149915	18	Southeast	33.9310	-118.2826
1149916	18	Southeast	33.9265	-118.2739
1149917	18	Southeast	33.9451	-118.2482
1149918	18	Southeast	33.9519	-118.2739

# count total reported crimes per police area, sorted from fewest to most
crime_counts = pd.DataFrame(crime_df['Area Name'].value_counts().sort_values()).reset_index()
crime_counts.columns = ['Area Name', 'Crime Counts']

# set average coordinates per area as location
neigh_coordinates = crime_df.groupby('Area Name')[['Latitude', 'Longitude']].mean().reset_index() # average coordinate of area
neigh_coordinates.columns = ['Area Name', 'Latitude', 'Longitude']

# combine both df together
safety_df = pd.merge(left=crime_counts, right=neigh_coordinates, left_on='Area Name', right_on='Area Name')
safety_df.head()

	Area Name	Crime Counts	Latitude	Longitude
0	Foothill	40819	34.247529	-118.374105
1	Hollenbeck	44238	34.042074	-118.154040
2	West Valley	47553	34.181970	-118.503372
3	Harbor	48141	33.759340	-118.246807
4	Wilshire	48261	34.056564	-118.334722

# Find the closest police area to each neighborhood by comparing Euclidean distances in latitude/longitude
closest_area = []
area_list = []
for lat_neigh, lng_neigh in zip(neigh_rent_df['Latitude'], neigh_rent_df['Longitude']):
    best_distance = 1e6
    for area, lat_area, lng_area in zip(safety_df['Area Name'], safety_df['Latitude'], safety_df['Longitude']):
        distance = np.sqrt((lat_neigh - lat_area)**2 + (lng_neigh - lng_area)**2)
        if distance < best_distance:
            best_distance = distance
            closest_area = area
    area_list.append(closest_area)
            
neigh_rent_df['Area Name'] = area_list
neigh_rent_df.head()

	Neighborhood	Population	Median Rent	Latitude	Longitude	Area Name
0	Woodland Hills	59661	2669.0	34.146928	-118.591402	Topanga
1	Winnetka	46943	1022.0	34.190469	-118.579700	Topanga
2	Windsor Square	5642	1643.0	34.065326	-118.318465	Wilshire
3	Wilmington	51517	960.0	33.778611	-118.295234	Harbor
4	Willowbrook	34336	1009.0	33.914153	-118.227834	Southeast
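The loop above measures distance directly in degrees, which treats a degree of longitude as equal to a degree of latitude. At LA's latitude a degree of longitude is noticeably shorter, so a great-circle (haversine) distance is a reasonable alternative. The sketch below is one possible version, not the method used in this analysis; the function names are hypothetical, and the coordinates are copied from the tables in this post.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_area(lat, lng, areas):
    """Return the key of the entry in areas (name -> (lat, lng)) closest to the point."""
    return min(areas, key=lambda name: haversine_km(lat, lng, *areas[name]))


# average coordinates of three police areas, taken from the tables above
areas = {
    'Topanga': (34.187927, -118.583495),
    'Harbor': (33.759340, -118.246807),
    'Wilshire': (34.056564, -118.334722),
}

# Woodland Hills coordinates from the rental table
print(nearest_area(34.146928, -118.591402, areas))
```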

3.1.5 Combine LA crime, venues, and rental price data

neigh_rent_crime_df = pd.merge(left=neigh_rent_df, 
                               right=safety_df, 
                               left_on='Area Name', 
                               right_on='Area Name',
                               suffixes=('_neigh', '_area'))
neigh_rent_crime_df.head()

	Neighborhood	Population	Median Rent	Latitude_neigh	Longitude_neigh	Area Name	Crime Counts	Latitude_area	Longitude_area
0	Woodland Hills	59661	2669.0	34.146928	-118.591402	Topanga	53413	34.187927	-118.583495
1	Winnetka	46943	1022.0	34.190469	-118.579700	Topanga	53413	34.187927	-118.583495
2	Westlake Village	8663	3034.0	34.158577	-118.801884	Topanga	53413	34.187927	-118.583495
3	West Hills	38814	2618.0	34.194387	-118.655217	Topanga	53413	34.187927	-118.583495
4	Topanga	7425	2926.0	34.059906	-118.610282	Topanga	53413	34.187927	-118.583495

# find crime rate per capita
area_population = neigh_rent_crime_df.groupby('Area Name')[['Population']].sum()
area_population.reset_index(inplace=True)
area_population.columns = ['Area Name', 'Area Population']

safety_df2 = safety_df.set_index('Area Name').join(area_population.set_index('Area Name'), lsuffix='_area', rsuffix='_neigh')
safety_df2['Crime Rate'] = safety_df2['Crime Counts']/safety_df2['Area Population']
safety_df2.reset_index(inplace=True)
safety_df2.sort_values('Crime Rate', ascending=True, inplace=True)
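# lowest crime rate gets the highest number; because of the "+ 1", rankings run from 2 (least safe) to n + 1 (safest)
safety_df2['Safety Ranking'] = list(np.arange(safety_df2.shape[0], 0, -1) + 1)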

neigh_rent_safety_df = pd.merge(left=neigh_rent_crime_df, right=safety_df2, on='Area Name')

# remove redundant columns
neigh_rent_safety_df.drop(['Latitude', 'Longitude', 'Crime Counts_y'], axis=1, inplace=True)

neigh_df_cleaned = neigh_rent_safety_df[['Neighborhood', 'Median Rent', 'Population', 'Safety Ranking']]

# normalize numerical columns
from sklearn import preprocessing
scaled_neigh_df = neigh_df_cleaned.copy()
scaler = preprocessing.MinMaxScaler()
cols = ['Median Rent', 'Population', 'Safety Ranking']
x_scaled = scaler.fit_transform(scaled_neigh_df[cols].values)
scaled_neigh_df.loc[:,cols] = x_scaled

print(scaled_neigh_df.shape)
scaled_neigh_df.head()

(247, 4)

	Neighborhood	Median Rent	Population	Safety Ranking
0	Woodland Hills	0.721047	0.128755	0.65
1	Winnetka	0.168177	0.101265	0.65
2	Westlake Village	0.843572	0.018522	0.65
3	West Hills	0.703927	0.083694	0.65
4	Topanga	0.807318	0.015846	0.65
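The safety ranking above is built by sorting on crime rate and assigning `np.arange(...) + 1`, which depends on the row order of the frame. An alternative sketch that numbers areas from 1 to n regardless of row order uses `Series.rank`. The area names, crime counts, and populations below are invented toy values, not the LAPD figures.

```python
import pandas as pd

# toy table: all values are invented for illustration
areas = pd.DataFrame({
    'Area Name': ['A', 'B', 'C', 'D'],
    'Crime Counts': [500, 300, 900, 100],
    'Area Population': [10000, 30000, 15000, 20000],
})

# reported crimes per resident
areas['Crime Rate'] = areas['Crime Counts'] / areas['Area Population']

# rank 1 = highest crime rate (least safe); rank n = lowest crime rate (safest)
areas['Safety Ranking'] = (
    areas['Crime Rate'].rank(ascending=False, method='first').astype(int)
)
print(areas.sort_values('Safety Ranking'))
```

As in the original code, a higher number means a safer area, so the column can be passed to the scaler unchanged.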

# finally, get the total combined df
df = pd.merge(scaled_neigh_df, la_venue_onehot_grouped, left_on='Neighborhood', right_on='City')

print(df.shape)
df.head()

(220, 438)

	Neighborhood_x	Median Rent	Population	Safety Ranking	City	ATM	...	Wine Shop	Women's Store
0	Winnetka	0.168177	0.101265	0.65	Winnetka	0.000000	...	0.000000	0.000
1	Westlake Village	0.843572	0.018522	0.65	Westlake Village	0.000000	...	0.000000	0.000
2	West Hills	0.703927	0.083694	0.65	West Hills	0.000000	...	0.000000	0.000
3	Topanga	0.807318	0.015846	0.65	Topanga	0.000000	...	0.000000	0.125
4	Tarzana	0.932528	0.076535	0.65	Tarzana	0.014493	...	0.014493	0.000

5 rows × 438 columns

# create map of Los Angeles using latitude and longitude values
la_lat = 34.0522
la_lng = -118.2437
map_la = folium.Map(location=[la_lat, la_lng], zoom_start=10)

# add markers to map
# show areas
for lat, lng, area in zip(safety_df['Latitude'], 
                          safety_df['Longitude'], 
                          safety_df['Area Name']):
    label = '{}'.format(area)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  
    
# show neighborhoods
for lat, lng, city, area in zip(neigh_rent_safety_df['Latitude_neigh'], 
                                neigh_rent_safety_df['Longitude_neigh'], 
                                neigh_rent_safety_df['Neighborhood'], 
                                neigh_rent_safety_df['Area Name']):
    label = '{}, {}'.format(city, area)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  
    
map_la

3.2 K-means clustering

from sklearn.cluster import KMeans

n_clusters = 10
df_cluster = df.drop(['Neighborhood_x', 'City'], axis=1)
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(df_cluster)

kmeans.labels_[:10]

array([2, 1, 1, 1, 1, 5, 1, 5, 5, 5], dtype=int32)
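The cluster count of 10 above is chosen by hand. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below illustrates the idea on synthetic data: the array `X` is an invented stand-in for `df_cluster`, so the result says nothing about the right k for the LA data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# synthetic stand-in for df_cluster: three well-separated blobs in 2-D
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([center + 0.3 * rng.standard_normal((40, 2)) for center in centers])

# fit K-means for several cluster counts and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

With the real features, swapping `X` for `df_cluster.values` would give a comparable table of scores for candidate values of `n_clusters`.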

4. Results

Let’s summarize all the findings and K-means results in one dataframe.

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # note: the caller already drops 'City', so this also skips the first venue category
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# create columns according to number of top venues
# note: every column uses the 'th' suffix (e.g. '1th'), which later cells rely on
num_top_venues = 5
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    columns.append('{}th Most Common Venue'.format(ind+1))
        
final_df = pd.DataFrame(columns=columns)
final_df['Neighborhood'] = df['Neighborhood_x']


# get top venues
c = 1
for neighborhood in final_df['Neighborhood']:
    c += 1
    final_df.iloc[np.where(final_df['Neighborhood'] == neighborhood)[0], 1:] = return_most_common_venues(
        la_venue_onehot_grouped.iloc[np.where(la_venue_onehot_grouped['City'] ==  neighborhood)[0][0], 1:], 
        num_top_venues)
    
# put in safety ranking, median rent, and population
final_df = final_df.merge(neigh_df_cleaned, on='Neighborhood')
final_df = final_df.merge(neigh_rent_df[['Neighborhood', 'Latitude', 'Longitude']], on='Neighborhood')
final_df['Cluster'] = kmeans.labels_ + 1


print(final_df.shape)
final_df.head()
# final_df.head()

(220, 12)

	Neighborhood	1th Most Common Venue	2th Most Common Venue	3th Most Common Venue	4th Most Common Venue	5th Most Common Venue	Median Rent	Population	Safety Ranking	Latitude	Longitude	Cluster
0	Winnetka	Video Store	Pizza Place	Bar	Ice Cream Shop	Hotel	1022.0	46943	15	34.190469	-118.579700	3
1	Westlake Village	Farm	Inn	Sandwich Place	Falafel Restaurant	Empanada Restaurant	3034.0	8663	15	34.158577	-118.801884	2
2	West Hills	Grocery Store	Pub	Gym	Shopping Mall	Sandwich Place	2618.0	38814	15	34.194387	-118.655217	2
3	Topanga	Women's Store	Gourmet Shop	Farmers Market	Bookstore	Furniture / Home Store	2926.0	7425	15	34.059906	-118.610282	2
4	Tarzana	Sushi Restaurant	Pizza Place	Coffee Shop	Fast Food Restaurant	Gym / Fitness Center	3299.0	35502	15	34.145344	-118.554520	2

We end up with 220 different neighborhoods, with 11 metrics to evaluate each neighborhood.
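The top-venue columns above are filled one neighborhood at a time. As a more compact alternative, the same kind of table can be sketched with `Series.nlargest` applied row by row. The frequency values below are invented, and `top_venues` is a hypothetical helper rather than part of the original notebook.

```python
import pandas as pd

# toy venue-frequency table indexed by neighborhood; the frequencies are invented
freq = pd.DataFrame({
    'City': ['Winnetka', 'Tarzana'],
    'Bar': [0.10, 0.05],
    'Coffee Shop': [0.05, 0.20],
    'Pizza Place': [0.15, 0.25],
    'Video Store': [0.20, 0.00],
}).set_index('City')


def top_venues(freq_df, n=3):
    """Return the n most frequent venue categories for each row of freq_df."""
    ranked = freq_df.apply(
        lambda row: row.nlargest(n).index.tolist(), axis=1, result_type='expand'
    )
    ranked.columns = ['Top {}'.format(i + 1) for i in range(n)]
    return ranked


top = top_venues(freq, n=3)
print(top)
```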

5. Discussion

First, let’s visualize the clustered neighborhoods.

# create map of Los Angeles using latitude and longitude values
colors = []
cmap = cm.get_cmap('tab10', n_clusters)    # PiYG
for i in range(cmap.N):
    rgb = cmap(i)[:3] # will return rgba, we take only first 3 so we get rgb
    colors.append(matplotlib.colors.rgb2hex(rgb))

la_lat = 34.0522
la_lng = -118.2437
map_la = folium.Map(location=[la_lat, la_lng], zoom_start=10)
# colors = ['#ff00ff', '#ff0000', '#00ff00', '#00ffff', '#0000ff']

# show neighborhoods
for lat, lng, neighborhood, cluster in zip(final_df['Latitude'], 
                                           final_df['Longitude'], 
                                           final_df['Neighborhood'],
                                           final_df['Cluster']):
    label = '{} ({})'.format(neighborhood, cluster)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colors[cluster-1],
        fill=True,
        fill_color=colors[cluster-1],
        fill_opacity=0.7,
        parse_html=False).add_to(map_la)  

map_la

At first glance, we can see that some clusters are well localized. Some quick observations:

Cluster 1 is mostly in the east and south LA areas
Cluster 2 is mostly near the coast or in mountainous areas such as Malibu, Palos Verdes, and Topanga
Cluster 3 is near LAX, south of the 10 freeway, surrounding the Inglewood area
Cluster 4 is all the way out near the Lancaster and Palmdale area
Cluster 5 is only in Silver Lake
Cluster 6 is mostly in west LA, mostly north of the 10 freeway and west of the 405
Cluster 7 is in east and south LA, covering a large swath of land
Cluster 8 is only in Long Beach
Cluster 9 is the area surrounding downtown LA, extending a bit south to Huntington Park
Cluster 10 is Downey and Mount Washington

Let’s see the characteristics of the 10 different clustered areas.

averages = final_df.groupby('Cluster')[['Median Rent', 'Safety Ranking', 'Population']].mean()
averages

	Median Rent	Safety Ranking	Population
Cluster
1	1150.144928	20.434783	31092.478261
2	3059.125000	17.125000	17502.375000
3	1328.813953	11.279070	36828.325581
4	1287.333333	18.666667	4388.000000
5	1166.000000	7.000000	30972.000000
6	1872.793103	15.965517	24550.275862
7	1550.410256	21.307692	58453.102564
8	2225.000000	21.000000	462731.000000
9	1096.176471	4.882353	42071.235294
10	1472.000000	19.500000	60025.500000

# plot average median rent vs average safety ranking
averages.plot(x='Median Rent', y='Safety Ranking', kind='scatter');

[Figure: scatter plot of average safety ranking versus average median rent, one point per cluster]

# plot average median rent vs average population
averages.plot(x='Median Rent', y='Population', kind='scatter');

[Figure: scatter plot of average population versus average median rent, one point per cluster]

# plot average safety ranking vs average population
averages.plot(x='Safety Ranking', y='Population', kind='scatter');

[Figure: scatter plot of average population versus average safety ranking, one point per cluster]

5.1 Rent prices and neighborhood safety

Neighborhood safety tends to increase with median rent, especially in clusters such as #2, which contains Bel-Air, San Marino, and Palos Verdes, and #8, which encompasses the vicinity of Long Beach (I’m surprised to find Long Beach with such high rental prices!). Still, some areas are both safe and very affordable, such as clusters #1, #4, #7, and #10. However, the safest neighborhoods may not always attract young professionals, since most of those neighborhoods are in east LA, in places such as Pomona, Palmdale, and the San Fernando Valley, which are far from the livelier districts with restaurants and bars. The kind of area that young professionals want to live in is probably a balance between those criteria.

Silver Lake, which is known to be home to lots of young professionals, offers a very cheap average median rent of $1166, but has a below-average safety ranking of only 7 (out of 21). So perhaps some people would prefer to live somewhere that is more expensive yet still pretty safe, such as cluster #6, in neighborhoods like West Hollywood, Santa Monica, Culver City, Palms, and Mid-Wilshire. Even though the median rent in those places is above average at ~$1870, their average safety ranking is 16, which is much higher than that of neighborhoods currently undergoing gentrification, such as Silver Lake.
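The trade-off between rent and safety described above can be sketched as a simple filter on the cluster averages. This is only an illustration: the numbers are copied (rounded) from the averages table earlier in this section, and the two thresholds, max_rent and min_safety, are hypothetical values picked for the example rather than anything used in the clustering itself.

```python
import pandas as pd

# Cluster averages copied (rounded) from the table printed earlier in this section.
averages = pd.DataFrame(
    {
        'Median Rent': [1150.14, 3059.13, 1328.81, 1287.33, 1166.00,
                        1872.79, 1550.41, 2225.00, 1096.18, 1472.00],
        'Safety Ranking': [20.43, 17.13, 11.28, 18.67, 7.00,
                           15.97, 21.31, 21.00, 4.88, 19.50],
    },
    index=pd.Index(range(1, 11), name='Cluster'),
)

# Hypothetical thresholds, chosen only for illustration.
max_rent = 2000
min_safety = 15

# Keep clusters that are both affordable enough and safe enough,
# then list them from cheapest to most expensive.
shortlist = averages[
    (averages['Median Rent'] <= max_rent)
    & (averages['Safety Ranking'] >= min_safety)
].sort_values('Median Rent')

print(shortlist)
```

With these particular thresholds the shortlist is clusters #1, #4, #10, #7, and #6, ordered from cheapest to most expensive. Tightening or loosening either threshold changes the list, so the cut-offs should reflect your own budget and comfort level.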

Median rent and safety rankings don’t correlate with population. In other words, no matter how large the city is, the rent prices and safety of the neighborhood remain pretty much the same. Interestingly, Koreatown is grouped in cluster 1, which has both low rent and a high safety ranking. I always thought of Koreatown as somewhere that is safe during the day but requires caution at night, but maybe the high volume of restaurants and activities there helps to reduce crime.
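One way to put a number on the relationship between these three variables, rather than reading it off the scatter plots, is to compute correlation coefficients on the cluster averages. The sketch below rebuilds the averages table from the rounded values printed earlier. It is an assumption on my part that ten cluster-level points are enough to say anything meaningful, and because Long Beach (cluster 8) has a far larger population than the rest, a rank-based (Spearman-style) correlation is shown next to the ordinary Pearson one.

```python
import pandas as pd

# Cluster averages copied (rounded) from the table printed earlier in this section.
averages = pd.DataFrame(
    {
        'Median Rent': [1150.14, 3059.13, 1328.81, 1287.33, 1166.00,
                        1872.79, 1550.41, 2225.00, 1096.18, 1472.00],
        'Safety Ranking': [20.43, 17.13, 11.28, 18.67, 7.00,
                           15.97, 21.31, 21.00, 4.88, 19.50],
        'Population': [31092.48, 17502.38, 36828.33, 4388.00, 30972.00,
                       24550.28, 58453.10, 462731.00, 42071.24, 60025.50],
    },
    index=pd.Index(range(1, 11), name='Cluster'),
)

# Pearson correlation on the raw averages.
pearson = averages.corr()

# Rank-based correlation: rank each column, then correlate the ranks.
# This is less sensitive to a single extreme value such as Long Beach.
rank_corr = averages.rank().corr()

print(pearson.round(2))
print(rank_corr.round(2))
```

Values near 0 in the Population row would support the reading above, while values closer to 1 or -1 would argue against it.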

Based on these clustering results, I would recommend that new LA transplants move somewhere in west LA (if you can afford it), where the medium to high rental prices are offset by safe neighborhoods and lots of coffee shops, Mexican restaurants, and pizza places. Despite the cheap rent, I would not recommend living in the areas surrounding downtown LA due to the high crime rates. Strangely, downtown LA itself does not seem to fall into any cluster.

Next, let’s dig a little deeper into the neighborhood venues.

5.2 Neighborhood venues

final_df.groupby('Cluster')[['1th Most Common Venue', 
                              '2th Most Common Venue', 
                              '3th Most Common Venue',
                              '4th Most Common Venue',
                              '5th Most Common Venue']].agg(lambda x: x.value_counts().index[0])

	1th Most Common Venue	2th Most Common Venue	3th Most Common Venue	4th Most Common Venue	5th Most Common Venue
Cluster
1	Mexican Restaurant	Fast Food Restaurant	Pizza Place	Grocery Store	Convenience Store
2	Sushi Restaurant	Pizza Place	Zoo	Fair	Empanada Restaurant
3	Pizza Place	Fast Food Restaurant	Park	Park	Park
4	Food	Zoo	Farm	Empanada Restaurant	English Restaurant
5	Lake	Zoo	Falafel Restaurant	Empanada Restaurant	English Restaurant
6	Coffee Shop	Convenience Store	Mexican Restaurant	Pizza Place	Coffee Shop
7	Mexican Restaurant	Fast Food Restaurant	Mexican Restaurant	Sandwich Place	Coffee Shop
8	Fishing Spot	American Restaurant	Zoo	Farm	English Restaurant
9	Coffee Shop	Fast Food Restaurant	Coffee Shop	American Restaurant	Café
10	Ski Chairlift	Mountain	Zoo	Falafel Restaurant	Empanada Restaurant

Among the very safe clusters (#1, #7, #8, and #10), Mexican restaurants are the most common venue in #1 and #7 (maybe that’s just true across LA in general). After checking cluster #10 (Downey and Mount Washington), I realized that there’s not much venue information there besides mountain and ski chairlift, so the venue results for that cluster are probably not very accurate.
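To make the table above easier to interpret, here is a toy sketch of what the aggregation does: for each rank column, it keeps the value that occurs most often among the neighborhoods in a cluster. The data frame and the helper name most_frequent are made up for illustration; only the groupby, agg, and value_counts pattern mirrors the code above.

```python
import pandas as pd

# Made-up data: three neighborhoods in cluster 1, a single one in cluster 2.
toy = pd.DataFrame({
    'Cluster': [1, 1, 1, 2],
    '1st Most Common Venue': ['Mexican Restaurant', 'Mexican Restaurant',
                              'Pizza Place', 'Fishing Spot'],
})


def most_frequent(values):
    # value_counts() sorts by frequency, so the first index entry
    # is the most frequent value in this group.
    return values.value_counts().index[0]


top_by_cluster = toy.groupby('Cluster')[['1st Most Common Venue']].agg(most_frequent)
print(top_by_cluster)
```

In the toy data, cluster 1 reports Mexican Restaurant because two of its three neighborhoods share it, while cluster 2 simply echoes its only neighborhood. The same effect likely explains rows like clusters #8 and #10 in the real table, where one or two neighborhoods decide the whole row.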

In the high rental price clusters, #2 and #8, sushi, English, and American restaurants make the top 5 most common venues, which is pretty indicative of more affluent communities. Both clusters also contain zoos, and cluster #8 has farms and fishing spots as well. These are indicative of less urban areas with lots of mountains and natural areas, which I can definitely see being the case in areas like Malibu, Topanga, and Palos Verdes.

Most of the clusters seem to contain pretty similar venues regardless of safety ranking. Most have coffee shops, Mexican restaurants, fast food restaurants, or pizza places. I’d interpret this as most of LA being saturated with generic strip malls that offer those kinds of venues. Therefore, to truly discover the uniqueness of each neighborhood, we probably have to dig deeper than the top 5 most common venues. This is interesting and true from my experience, because at first glance, LA really does seem like a giant strip mall, and to outsiders the different neighborhoods may appear to lack uniqueness. However, upon closer inspection you will notice that each neighborhood is actually quite different. Unfortunately, the location information provided by Foursquare doesn’t seem to reflect that.

# create columns according to number of top venues
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Neighborhood']
for ind in range(num_top_venues):
    # use 'st', 'nd', 'rd' for the first three ranks and 'th' for the rest
    suffix = indicators[ind] if ind < len(indicators) else 'th'
    columns.append('{}{} Most Common Venue'.format(ind + 1, suffix))

top_columns = columns
venues_df2 = pd.DataFrame(columns=columns)
venues_df2['Neighborhood'] = df['Neighborhood_x']


# get top venues: fill each neighborhood's row with its most common venue categories
for neighborhood in venues_df2['Neighborhood']:
    venues_df2.iloc[np.where(venues_df2['Neighborhood'] == neighborhood)[0], 1:] = return_most_common_venues(
        la_venue_onehot_grouped.iloc[np.where(la_venue_onehot_grouped['City'] == neighborhood)[0][0], 1:],
        num_top_venues)

    
# put in safety ranking, median rent, and population
venues_df2 = venues_df2.merge(neigh_df_cleaned, on='Neighborhood')
venues_df2 = venues_df2.merge(neigh_rent_df[['Neighborhood', 'Latitude', 'Longitude']], on='Neighborhood')
venues_df2['Cluster'] = kmeans.labels_ + 1

# print(venues_df2.shape)
# venues_df2.head()

venues_df2.groupby('Cluster')[top_columns[1:]].agg(lambda x: x.value_counts().index[0])

	1st Most Common Venue	2nd Most Common Venue	3rd Most Common Venue	4th Most Common Venue	5th Most Common Venue	6th Most Common Venue	7th Most Common Venue	8th Most Common Venue	9th Most Common Venue	10th Most Common Venue
Cluster
1	Mexican Restaurant	Fast Food Restaurant	Pizza Place	Grocery Store	Convenience Store	Coffee Shop	Café	Grocery Store	Sandwich Place	Sandwich Place
2	Sushi Restaurant	Pizza Place	Zoo	Fair	Empanada Restaurant	English Restaurant	Ethiopian Restaurant	Event Service	Event Space	Exhibit
3	Pizza Place	Fast Food Restaurant	Park	Park	Park	Ethiopian Restaurant	Bakery	Grocery Store	Grocery Store	Event Service
4	Food	Zoo	Farm	Empanada Restaurant	English Restaurant	Ethiopian Restaurant	Event Service	Event Space	Exhibit	Eye Doctor
5	Lake	Zoo	Falafel Restaurant	Empanada Restaurant	English Restaurant	Ethiopian Restaurant	Event Service	Event Space	Exhibit	Eye Doctor
6	Coffee Shop	Convenience Store	Mexican Restaurant	Pizza Place	Coffee Shop	Mexican Restaurant	Supermarket	Movie Theater	Sandwich Place	Grocery Store
7	Mexican Restaurant	Fast Food Restaurant	Mexican Restaurant	Sandwich Place	Coffee Shop	Burger Joint	Bar	Chinese Restaurant	Convenience Store	Coffee Shop
8	Fishing Spot	American Restaurant	Zoo	Farm	English Restaurant	Ethiopian Restaurant	Event Service	Event Space	Exhibit	Eye Doctor
9	Coffee Shop	Fast Food Restaurant	Coffee Shop	American Restaurant	Café	English Restaurant	Ethiopian Restaurant	Pizza Place	Event Space	Event Space
10	Ski Chairlift	Mountain	Zoo	Falafel Restaurant	Empanada Restaurant	Ethiopian Restaurant	Ethiopian Restaurant	Event Service	Exhibit	Exhibit

6. Conclusion

I hope that after reading this extensive report you have a good grasp of the different neighborhoods in LA and a good idea of where you would want to live if you ever move here. Beyond the sunshine, palm trees, and never-ending strip malls, LA is a big, big place that offers a wide diversity of neighborhoods and caters to people of all backgrounds and lifestyles. If you are a young professional, I hope this guide can help you decide which neighborhood you’d like to live in!