I just got back from my honeymoon, my first cruise to Alaska on the Explorer of the Seas. Aside from having a fantastic time with my wife, I gained a new fascination with the breadth and scale of cruise travel. Our ship housed upwards of 2,000 passengers and was often one of two or three ships docked in a port. Multiply that by the 100+ ships that are currently running cruises, you can have 200,000 people on the ocean at a time visiting dozens of ports across the globe.
When we got back, I began looking into cruise tracking sites and found one site that offers historical ship location data for 200+ ships going back to 2003. Using that data, I hoped to compile a dataset of ships, cities and locations for analysis and publishing.
The scripts used to extract and clean the raw data are available on Github. They are quick hacks, but they got the job done.
Once the raw data was loaded into BigQuery, I began transforming it. First step, I want to join the location stamps to port cities and find the closest port city to that location. I tried using a Cartesian join on the whole dataset:
SELECT callsign, created_at, loc.lat as lat, loc.lon as lon, wind_from, wind_knots, wind_gust, barometer, air_temp, dew_point, city.lat as city_lat, city.lon as city_lon, city.id as city_id FROM `cruise_scraper.locations_clean` loc, `cruise_scraper.port_cities` city
Since the raw dataset is rather large, about 4.7 million location stamps, BigQuery couldn’t quite handle it. I limited the query to a few months worth of data and that worked, so I ran the same query 47 times and finally ended up with 2.5 billion rows of Cartesian joined goodness.
The next step is to eliminate the rows where the location stamp is paired with any city other than the one closest to it:
SELECT callsign, created_at, lat, lon, city_id as nearest_port_city, distance_to_port FROM ( SELECT callsign, created_at, lat, lon, wind_from, wind_knots, wind_gust, barometer, air_temp, dew_point, city_lat as city_lat, city_lon as city_lon, city_id as city_id, cartesianDistance(lat, lon, city_lat, city_lon) as distance_to_port, row_number() over(partition by callsign, created_at order by cartesianDistance(lat, lon, city_lat, city_lon)) as rnk FROM `cruise_scraper.locations_city_join_*` GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14 ) WHERE rnk=1
UDF’s, this one computes the distance in meters between two (lat,lon) pairs:
CREATE TEMPORARY FUNCTION cartesianDistance(lat1 FLOAT64, lon1 FLOAT64, lat2 FLOAT64, lon2 FLOAT64) RETURNS FLOAT64 LANGUAGE js AS """ var R = 6378.137; // Radius of earth in KM var dLat = lat2 * Math.PI / 180 - lat1 * Math.PI / 180; var dLon = lon2 * Math.PI / 180 - lon1 * Math.PI / 180; var a = Math.sin(dLat/2) * Math.sin(dLat/2) + Math.cos(lat1 * Math.PI / 180) * Math.cos(lat2 * Math.PI / 180) * Math.sin(dLon/2) * Math.sin(dLon/2); var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a)); var d = R * c; return d * 1000; // meters """;
Using the nearest city location stamp data, I aggregated down to daily granularity:
SELECT trim(ships_v.name) as ship_name, trim(loc.callsign) as callsign, trim(ships_v.major_cruise_line) as major_cruise_line, date, cartesianDistance(x1, y1, x2, y2) as meters_traveled, x1 as lat_start, y1 as lon_start, x2 as lat_end, y2 as lon_end, case when stopped_minutes >= 180 then TRUE else FALSE end as in_port, -- 3 hours spent stationary stopped_minutes, num_periods, port_city as port_city_id, cities.city as port_city_name, cities.country_code as port_city_country FROM ( SELECT callsign, date(created_at) as date, max(case when day_start=1 then lat else null end) as x1, max(case when day_start=1 then lon else null end) as y1, max(case when day_end=1 then lat else null end) as x2, max(case when day_end=1 then lon else null end) as y2, sum(case when abs(lat-last_lat) < 0.1 and abs(lon-last_lon) < 0.1 then timestamp_diff(created_at, case when last_created_at is null then timestamp(date(created_at)) else last_created_at end, MINUTE) else 0 end) as stopped_minutes, sum(timestamp_diff(created_at, last_created_at, MINUTE)) as total_minutes, max(case when last_lat is null and last_lon is null then nearest_port_city when abs(lat-last_lat) < 0.1 and abs(lon-last_lon) < 0.1 then nearest_port_city else null end) as port_city, count(*) as num_periods FROM ( SELECT callsign, created_at, lat, lon, nearest_port_city, distance_to_port, lag(lat) over(partition by callsign, date(created_at) order by created_at) as last_lat, lag(lon) over(partition by callsign, date(created_at) order by created_at) as last_lon, lag(created_at) over(partition by callsign, date(created_at) order by created_at) as last_created_at, row_number() over (partition by callsign, date(created_at) order by created_at) as day_start, row_number() over (partition by callsign, date(created_at) order by created_at desc) as day_end FROM cruise_scraper.locations_with_city -- WHERE created_at > timestamp('2017-06-01') GROUP BY 1,2,3,4,5,6 ) GROUP BY 1,2 ) loc LEFT JOIN `cruise_scraper.ships_v` ships_v ON LOWER(TRIM(loc.callsign))=LOWER(TRIM(ships_v.callsign)) LEFT JOIN `cruise_scraper.port_cities` cities ON port_city = cities.id
Which cruise line is operating the most ships:
SELECT major_cruise_line, COUNT(*) FROM cruise_scraper.ships_v GROUP BY 1 ORDER BY 2 DESC
|Royal Caribbean International||25|
|Carnival Cruise Line||24|
|Norwegian Cruise Line||16|
|Holland America Line||15|
How many day/ship combinations are missing data?
SELECT COUNT(*) FROM( SELECT ship_name, date, lag(date) over(partition by callsign order by date) as last_date FROM `home-data-warehouse.cruise_scraper.daily_ship_location` ) WHERE date_diff(date, last_date, DAY) > 1
What are the most popular port cities?
SELECT port_city_name as city, port_city_country as country, COUNT(*) as count FROM cruise_scraper.daily_ship_location WHERE major_cruise_line <> 'Other' and in_port = True GROUP BY 1,2 ORDER BY 3 DESC
|Fort Lauderdale (Port Everglades)||US||3705|
The raw data comes from an unmonitored data pipeline, there may be data missing. Additionally, it looks like ships do not report location details when they are outside of service areas. This leads to many days where there’s only one reported location.
Since my definition of an in-port day depends on a time difference between stationary locations, these days cannot be shown as in-port. This may be leading to under-reported port days.