My fiancée and I are always on the lookout for restaurants with a good salad. We love going out to eat, but hate the calories and fat that typically come along with it. My goal in this post is to find new restaurants with good salads or salad bars in my neighborhood.

Since ranking nearby restaurants immediately brings Yelp to mind, and Yelp offers a REST API, it’s the natural choice for a data source here. To run the extract code below, you’ll need to register an app with the Yelp API. I’ll be using Python and pandas DataFrames to manipulate the data.
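
The snippets below also assume the app’s credentials are available as app_id and app_secret. A minimal way to supply them, assuming they’re exported as environment variables (the variable names here are just an example):

import os

# Hypothetical environment variable names -- substitute whatever your setup uses.
app_id = os.environ['YELP_APP_ID']
app_secret = os.environ['YELP_APP_SECRET']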

Extraction

Yelp’s API uses straightforward OAuth2 authentication, which is easy to implement with the requests library:

import requests

def yelp_login(client_id, client_secret):
    # Exchange the app's client credentials for a bearer access token.
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'grant_type': 'client_credentials'
    }

    resp = requests.post('https://api.yelp.com/oauth2/token', data=params)
    
    if resp.status_code == 200:
        obj = resp.json()
        if 'access_token' in obj:
            return obj['access_token']

    return None

The two API endpoints I’ll be using share the same pagination technique, so a generic request function isolates the pagination logic and keeps the code simpler:

import time

def yelp_request(endpoint, access_token, params, object_name):
    headers = {
        'Authorization': 'Bearer ' + access_token
    }

    # Page through results 50 at a time until a short page signals the end.
    params['limit'] = 50
    params['offset'] = 0
    keep_extracting = True

    while keep_extracting:
        resp = requests.get(yelp_api_base + endpoint, headers=headers, params=params)

        if resp.status_code == 200:
            payload = resp.json()

            # A page shorter than the limit means this is the last page.
            if len(payload[object_name]) < params['limit']:
                keep_extracting = False

            for b in payload[object_name]:
                yield b
                time.sleep(2)  # brief pause between records to stay well under rate limits

            params['offset'] += params['limit']
        else:
            print(resp.json())
            keep_extracting = False
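
Because yelp_request is a generator, callers only pull as many results as they actually consume. A quick sanity check, using the access_token and yelp_api_base set up in the final snippet below, might look like this:

from itertools import islice

# Peek at the first few results without walking every page of the search.
first_few = islice(
    yelp_request('businesses/search', access_token,
                 {'term': 'salad bar', 'location': '91307'}, 'businesses'),
    3)

for business in first_few:
    print(business['name'], business['rating'])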

The yelp_request function makes business and review extraction straightforward: I just need to pass the endpoint path and query parameters for each call.

def yelp_search_businesses(access_token, search_term, location, radius=40000, categories=None):
    params = {
        'term': search_term,
        'location': location,
        'radius': radius
    }
    
    if categories:
        params['categories'] = categories
    
    return yelp_request('businesses/search', access_token, params, 'businesses')
    

def yelp_reviews(access_token, business_id):
    return yelp_request('businesses/{id}/reviews'.format(id=business_id), access_token, {}, 'reviews')

Finally, putting it all together: extract businesses matching ‘salad bar’ within the search radius of my ZIP code and attach the available reviews for each:

import pandas

yelp_api_base = 'https://api.yelp.com/v3/'
access_token = yelp_login(app_id, app_secret)

places = []

for b in yelp_search_businesses(access_token, 'salad bar', '91307'):
    # Attach the handful of reviews Yelp exposes for each business.
    b['reviews'] = []
    for r in yelp_reviews(access_token, b['id']):
        b['reviews'].append(r)

    places.append(b)

df = pandas.DataFrame.from_records(places)
print(len(df))

1000

Hmm. Yelp seems to have cut me off after 1,000 records, despite reporting more than 8,000 total results for my search. That shouldn’t be a problem for this dataset: since the Yelp API sorts by adjusted rating, anything below the 1,000th rank isn’t likely to be very appetizing anyway.
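
The 8,000-plus figure comes from the total count the search endpoint reports alongside each page of results; a quick one-off request, reusing the token and base URL from above, is enough to check it:

# Single request just to read the total Yelp reports for this query.
resp = requests.get(yelp_api_base + 'businesses/search',
                    headers={'Authorization': 'Bearer ' + access_token},
                    params={'term': 'salad bar', 'location': '91307', 'limit': 1})
print(resp.json().get('total'))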

Additionally, the reviews endpoint returns only three reviews per business. I’m not sure which three are chosen, so they may not be a representative sample of a business’s overall reviews.

Analysis

Cities by Restaurant Rating

First, I’ll group by city to rank the cities in my area by average restaurant rating. The mask call blanks out results more than 10 kilometers (10,000 meters, the unit Yelp uses for distance) from my home, so they drop out of the averages.

cities = (df.mask(lambda x: x['distance'] > 10000)[['city', 'rating']]
            .groupby('city')
            .mean()
            .sort_values('rating', ascending=False))

cities.to_csv('data/the-perfect-salad-bar/cities_by_ranking.csv')
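
For what it’s worth, plain boolean indexing expresses the same 10-kilometer cutoff by dropping the far rows outright before grouping:

# Equivalent ranking using boolean indexing instead of mask.
cities_alt = (df[df['distance'] <= 10000][['city', 'rating']]
                .groupby('city')
                .mean()
                .sort_values('rating', ascending=False))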

Astute readers may notice that the city with the highest overall rating, Wodland Hills, does not exist. Disregarding that typo, we now know where the highest rated restaurants likely are.

One interesting note: West Hills, the city I live in, came up dead last in rating. This confirms my previous belief that there are few good salad places in West Hills.

Reviews Containing “Salad”

Next, I want to find restaurants where the reviews actually mention salads. It’s usually a good sign when reviewers like the salads enough to mention them.
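
One wrinkle: the reviews column still holds the raw list of review objects from the extract, so the substring filter below needs that text flattened into a single string per row. A minimal way to do it, assuming each review object carries its body in a text field:

# Collapse each row's list of review objects into one string of review text.
df['reviews'] = df['reviews'].apply(
    lambda reviews: ' '.join(r.get('text', '') for r in reviews))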

salads = (df[['name', 'reviews', 'rating', 'city', 'distance', 'price']]
            .where(df['reviews'].str.contains('salad'))
            .dropna()
            .sort_values('rating', ascending=False))

salads_by_occurrence = (salads
                        .groupby(['name'])['reviews']
                        .apply(lambda x: x[x.str.contains('salad')].count())
                        .sort_values(ascending=False))

salads_by_occurrence.to_csv('data/the-perfect-salad-bar/salad_reviews.csv')

The above chart plots restaurants by distance from my location and star rating. The heavily populated 4-star spot about 30 kilometers from West Hills is the trendy Westside LA neighborhood, home to plenty of excellent healthy restaurants.
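
For anyone reproducing the chart locally, a rough matplotlib version of the same scatter (distance in kilometers against star rating for the salad-mentioning restaurants) looks something like this:

import matplotlib.pyplot as plt

# Rough stand-in for the embedded chart: distance (km) vs. star rating.
plt.scatter(salads['distance'] / 1000, salads['rating'], alpha=0.5)
plt.xlabel('Distance from home (km)')
plt.ylabel('Yelp rating (stars)')
plt.title('Salad-mentioning restaurants by distance and rating')
plt.show()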

Final List

Finally, I’ll merge and sort the two DataFrames above to come up with a ranked list of restaurants to try in nearby cities. The merge leaves two rating columns: rating_x is the restaurant’s own rating and rating_y is its city’s average rating from earlier.

cities = pandas.read_csv('data/the-perfect-salad-bar/cities_by_ranking.csv')

merged = salads.merge(cities, on='city')

(merged.sort_values(['rating_x', 'price', 'rating_y'], ascending=[False, True, True])
       [['name', 'city', 'rating_x', 'price']]
       .head(10)
       .to_csv('data/the-perfect-salad-bar/sorted_salad_restaurants.csv', index=False))

Wrap up

The restaurants on the final list satisfy most of the criteria I defined going into this project: they’re relatively close by, they theoretically have good salads, and I’ve only eaten at one of them.

Overall, the Yelp API is easy to use and fast. Unfortunately, since Yelp is not in the business of giving its data away, the results are too limited to make this a comprehensive analysis. Some areas to come back to and reconsider:

  1. Investigate the cap that cut off the extract at 1,000 rows and whether it can be worked around.
  2. Call the reviews endpoint multiple times to see whether it returns different reviews each time, which would allow building a larger review set.
  3. Incorporate Yelp check-ins to rank restaurants by similarity to places I’ve already eaten.