I just got back from my honeymoon, my first cruise to Alaska on the Explorer of the Seas. Aside from having a fantastic time with my wife, I gained a new fascination with the breadth and scale of cruise travel. Our ship housed upwards of 2,000 passengers and was often one of two or three ships docked in a port. Multiply that by the 100+ ships that are currently running cruises, you can have 200,000 people on the ocean at a time visiting dozens of ports across the globe.

When we got back, I began looking into cruise tracking sites and found one site that offers historical ship location data for 200+ ships going back to 2003. Using that data, I hoped to compile a dataset of ships, cities and locations for analysis and publishing.

Extract

The scripts used to extract and clean the raw data are available on Github. They are quick hacks, but they got the job done.

Transformation

Once the raw data was loaded into BigQuery, I began transforming it. First step, I want to join the location stamps to port cities and find the closest port city to that location. I tried using a Cartesian join on the whole dataset:

SELECT callsign, created_at, loc.lat as lat, loc.lon as lon, wind_from, wind_knots,
       wind_gust, barometer, air_temp, dew_point, city.lat as city_lat, city.lon as city_lon,
  	   city.id as city_id
FROM `cruise_scraper.locations_clean` loc, 
     `cruise_scraper.port_cities` city

Since the raw dataset is rather large, about 4.7 million location stamps, BigQuery couldn’t quite handle it. I limited the query to a few months worth of data and that worked, so I ran the same query 47 times and finally ended up with 2.5 billion rows of Cartesian joined goodness.

The next step is to eliminate the rows where the location stamp is paired with any city other than the one closest to it:

SELECT callsign, created_at, lat, lon, city_id as nearest_port_city, distance_to_port
FROM (
  SELECT
    callsign, created_at, lat, lon, wind_from, wind_knots, wind_gust, barometer, air_temp,
    dew_point, city_lat as city_lat, city_lon as city_lon, city_id as city_id,
    cartesianDistance(lat, lon, city_lat, city_lon) as distance_to_port,
    row_number() over(partition by callsign, created_at order by cartesianDistance(lat, lon, city_lat, city_lon)) as rnk
  FROM `cruise_scraper.locations_city_join_*`
  GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14
)
WHERE rnk=1

The cartesianDistance function in the above query is a UDF. BigQuery accepts Javascript UDF’s, this one computes the distance in meters between two (lat,lon) pairs:

CREATE TEMPORARY FUNCTION cartesianDistance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  var R = 6378.137; // Radius of earth in KM
  var dLat = lat2 * Math.PI / 180 - lat1 * Math.PI / 180;
  var dLon = lon2 * Math.PI / 180 - lon1 * Math.PI / 180;
  var a = Math.sin(dLat/2) * Math.sin(dLat/2) +
  Math.cos(lat1 * Math.PI / 180) * Math.cos(lat2 * Math.PI / 180) *
  Math.sin(dLon/2) * Math.sin(dLon/2);
  var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a));
  var d = R * c;
  return d * 1000; // meters
""";

Using the nearest city location stamp data, I aggregated down to daily granularity:

SELECT 
  trim(ships_v.name) as ship_name,
  trim(loc.callsign) as callsign,
  trim(ships_v.major_cruise_line) as major_cruise_line,
  date, 
  cartesianDistance(x1, y1, x2, y2) as meters_traveled,
  x1 as lat_start,
  y1 as lon_start,
  x2 as lat_end,
  y2 as lon_end,
  case when stopped_minutes >= 180 then TRUE else FALSE end as in_port, -- 3 hours spent stationary
  stopped_minutes,
  num_periods,
  port_city as port_city_id,
  cities.city as port_city_name,
  cities.country_code as port_city_country
FROM (
  SELECT 
    callsign,
    date(created_at) as date,
    max(case when day_start=1 then lat else null end) as x1,
    max(case when day_start=1 then lon else null end) as y1,
    max(case when day_end=1 then lat else null end) as x2,
    max(case when day_end=1 then lon else null end) as y2,
    sum(case when abs(lat-last_lat) < 0.1 and abs(lon-last_lon) < 0.1 then timestamp_diff(created_at, case when last_created_at is null then timestamp(date(created_at)) else last_created_at end, MINUTE) else 0 end) as stopped_minutes,
    sum(timestamp_diff(created_at, last_created_at, MINUTE)) as total_minutes,
    max(case 
             when last_lat is null and last_lon is null then nearest_port_city
             when abs(lat-last_lat) < 0.1 and abs(lon-last_lon) < 0.1 then nearest_port_city 
             else null end) as port_city,
    count(*) as num_periods
  FROM (
    SELECT 
      callsign,
      created_at,
      lat,
      lon,
      nearest_port_city,
      distance_to_port,
      lag(lat) over(partition by callsign, date(created_at) order by created_at) as last_lat,
      lag(lon) over(partition by callsign, date(created_at) order by created_at) as last_lon,
      lag(created_at) over(partition by callsign, date(created_at) order by created_at) as last_created_at,
      row_number() over (partition by callsign, date(created_at) order by created_at) as day_start,
      row_number() over (partition by callsign, date(created_at) order by created_at desc) as day_end
    FROM cruise_scraper.locations_with_city
    -- WHERE created_at > timestamp('2017-06-01')
    GROUP BY 1,2,3,4,5,6
  )
  GROUP BY 1,2
) loc
LEFT JOIN `cruise_scraper.ships_v` ships_v ON LOWER(TRIM(loc.callsign))=LOWER(TRIM(ships_v.callsign))
LEFT JOIN `cruise_scraper.port_cities` cities ON port_city = cities.id

Analysis

Which cruise line is operating the most ships:

SELECT major_cruise_line, COUNT(*) 
FROM cruise_scraper.ships_v
GROUP BY 1
ORDER BY 2 DESC
major_cruise_line count
Other 149
Royal Caribbean International 25
Carnival Cruise Line 24
Norwegian Cruise Line 16
Holland America Line 15
Costa Crociere 15
Princess Cruises 14
Celebrity Cruises 11
Silversea Cruises 4
Disney Cruises 3

How many day/ship combinations are missing data?

SELECT 
  COUNT(*)
FROM(
  SELECT 
    ship_name, 
    date,
    lag(date) over(partition by callsign order by date) as last_date
  FROM `home-data-warehouse.cruise_scraper.daily_ship_location`
) 
WHERE date_diff(date, last_date, DAY) > 1
83,106

What are the most popular port cities?

SELECT
  port_city_name as city,
  port_city_country as country,
  COUNT(*) as count
FROM cruise_scraper.daily_ship_location
WHERE major_cruise_line <> 'Other' and in_port = True
GROUP BY 1,2
ORDER BY 3 DESC
city country count
Fort Lauderdale (Port Everglades) US 3705
Vancouver CA 3236
Cozumel MX 3074
San Diego US 2762
St. Maarten SX 2046

Data Issues

The raw data comes from an unmonitored data pipeline, there may be data missing. Additionally, it looks like ships do not report location details when they are outside of service areas. This leads to many days where there’s only one reported location.

Since my definition of an in-port day depends on a time difference between stationary locations, these days cannot be shown as in-port. This may be leading to under-reported port days.