This notebook example involves using a managed version of Trino (Starburst). It will work without Starburst provided you are able to import data into a Trino cluster connected to a lake. We will be using one month of Yellow Taxi data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and a zone look up file provided on the same page. Please download both files and register in either your Starburst or Trino cluster before proceeding.

import ibis
ibis.options.interactive = True

#from trino.auth import OAuth2Authentication

IMPORTANT!!!! Change your user, host, port, database, schema and roles to be relevant to your Starburst Galaxy setup. If you are using OAuth2, uncomment the keyword lines roles, and auth. Then comment PASSWORD to proceed. You can reference: https://ibis-project.org/backends/trino#connecting-to-starburst-managed-trino-instances for more information.

import os
con = ibis.trino.connect(
  user=os.environ['user'],
  host=os.environ['host'],
  password=os.environ['password'],
  port=443,
  database=os.environ['database'],
  schema=os.environ['schema'],
  #roles="accountadmin",
  #auth=OAuth2Authentication(),
  http_scheme="https"
)

Within Ibis con.list_tables() allows us to list all the tables.

con.list_tables()
['groupbyboroughtrips', 'taxizonenyc', 'zonelookup']

Ibis tables in trino can be referenced through the use of con.table. We’re going to create two reference ibis tables to Trino from our tables below:

nycjantrips = con.table("taxizonenyc")
zonelookup = con.table("zonelookup")

In Ibis we can check the schema of the tables we just referenced through .schema()

nycjantrips.schema()
ibis.Schema {
  vendorid               int64
  tpep_pickup_datetime   timestamp(3)
  tpep_dropoff_datetime  timestamp(3)
  passenger_count        float64
  trip_distance          float64
  ratecodeid             float64
  store_and_fwd_flag     string
  pulocationid           int64
  dolocationid           int64
  payment_type           int64
  fare_amount            float64
  extra                  float64
  mta_tax                float64
  tip_amount             float64
  tolls_amount           float64
  improvement_surcharge  float64
  total_amount           float64
  congestion_surcharge   float64
  airport_fee            float64
}
zonelookup.schema()
ibis.Schema {
  locationid    string
  borough       string
  zone          string
  service_zone  string
}

We’re going to preview the dataset with ibis slice method. We can see the first 10 rows here. We also included ibis.options.interactive = True at the start of our notebook which allows us to display the ibis tables in a prettified way.

nycjantrips[0:10]
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ vendorid  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  ratecodeid  store_and_fwd_flag  pulocationid  dolocationid  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  airport_fee ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64timestamp(3)timestamp(3)float64float64float64stringint64int64int64float64float64float64float64float64float64float64float64float64     │
├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┼─────────────┤
│        22023-01-01 00:32:102023-01-01 00:40:361.00.971.0N                 16114129.31.000.50.000.01.014.302.50.00 │
│        22023-01-01 00:55:082023-01-01 01:01:271.01.101.0N                 4323717.91.000.54.000.01.016.902.50.00 │
│        22023-01-01 00:25:042023-01-01 00:37:491.02.511.0N                 48238114.91.000.515.000.01.034.902.50.00 │
│        12023-01-01 00:03:482023-01-01 00:13:250.01.901.0N                 1387112.17.250.50.000.01.020.850.01.25 │
│        22023-01-01 00:10:292023-01-01 00:21:191.01.431.0N                 10779111.41.000.53.280.01.019.682.50.00 │
│        22023-01-01 00:50:342023-01-01 01:02:521.01.841.0N                 161137112.81.000.510.000.01.027.802.50.00 │
│        22023-01-01 00:09:222023-01-01 00:19:491.01.661.0N                 239143112.11.000.53.420.01.020.522.50.00 │
│        22023-01-01 00:27:122023-01-01 00:49:561.011.701.0N                 142200145.71.000.510.743.01.064.442.50.00 │
│        22023-01-01 00:21:442023-01-01 00:36:401.02.951.0N                 164236117.71.000.55.680.01.028.382.50.00 │
│        22023-01-01 00:39:422023-01-01 00:50:361.03.011.0N                 141107214.91.000.50.000.01.019.902.50.00 │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───────────────┴────────────┴────────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┘

To understand the dataset a little more we can try an order by. Looks like there are some columns with passenger count of undefined. In this case we’re going to want to curate the dataset and clean it up a bit to ensure more accurate data.

nycjantrips.order_by(nycjantrips.trip_distance.desc())
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ vendorid  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  ratecodeid  store_and_fwd_flag  pulocationid  dolocationid  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  airport_fee ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64timestamp(3)timestamp(3)float64float64float64stringint64int64int64float64float64float64float64float64float64float64float64float64     │
├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┼─────────────┤
│        22023-01-23 17:14:002023-01-23 17:20:00NULL258928.15NULLNULL238239013.650.00.53.530.01.021.18NULLNULL │
│        22023-01-07 18:18:002023-01-07 18:47:00NULL225987.37NULLNULL261263056.490.00.512.100.01.072.59NULLNULL │
│        22023-01-15 08:33:002023-01-15 08:43:00NULL187872.33NULLNULL229142014.020.00.53.600.01.021.62NULLNULL │
│        22023-01-13 05:53:002023-01-13 06:02:00NULL116439.71NULLNULL239230013.220.00.53.000.01.020.22NULLNULL │
│        22023-01-31 07:18:002023-01-31 07:57:00NULL85543.66NULLNULL4187039.900.00.57.900.01.051.80NULLNULL │
│        22023-01-12 08:45:002023-01-12 09:09:00NULL76886.52NULLNULL14868021.490.00.55.100.01.030.59NULLNULL │
│        22023-01-01 19:40:432023-01-01 20:21:481.062359.524.0N                 794820.000.00.00.000.00.02.502.50.0 │
│        22023-01-31 21:24:002023-01-31 21:28:00NULL52042.30NULLNULL23675014.400.00.51.840.01.020.24NULLNULL │
│        22023-01-22 11:25:002023-01-22 11:33:00NULL33205.32NULLNULL23043012.870.00.53.370.01.020.24NULLNULL │
│        22023-01-23 07:29:002023-01-23 07:51:00NULL16562.61NULLNULL26290028.870.00.54.930.01.037.80NULLNULL │
│         │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───────────────┴────────────┴────────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┘

We can chain together expressions with filter - similar to a WHERE clause in SQL. We can see nan (not a number) involved, ibis also has built-in support for that.

nyc_filtered = nycjantrips.filter((nycjantrips.passenger_count != 0) | (nycjantrips.passenger_count.isnan() is False))
nyc_filtered
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ vendorid  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  ratecodeid  store_and_fwd_flag  pulocationid  dolocationid  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  airport_fee ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64timestamp(3)timestamp(3)float64float64float64stringint64int64int64float64float64float64float64float64float64float64float64float64     │
├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┼─────────────┤
│        22023-01-01 00:32:102023-01-01 00:40:361.00.971.0N                 16114129.31.00.50.000.01.014.302.50.0 │
│        22023-01-01 00:55:082023-01-01 01:01:271.01.101.0N                 4323717.91.00.54.000.01.016.902.50.0 │
│        22023-01-01 00:25:042023-01-01 00:37:491.02.511.0N                 48238114.91.00.515.000.01.034.902.50.0 │
│        22023-01-01 00:10:292023-01-01 00:21:191.01.431.0N                 10779111.41.00.53.280.01.019.682.50.0 │
│        22023-01-01 00:50:342023-01-01 01:02:521.01.841.0N                 161137112.81.00.510.000.01.027.802.50.0 │
│        22023-01-01 00:09:222023-01-01 00:19:491.01.661.0N                 239143112.11.00.53.420.01.020.522.50.0 │
│        22023-01-01 00:27:122023-01-01 00:49:561.011.701.0N                 142200145.71.00.510.743.01.064.442.50.0 │
│        22023-01-01 00:21:442023-01-01 00:36:401.02.951.0N                 164236117.71.00.55.680.01.028.382.50.0 │
│        22023-01-01 00:39:422023-01-01 00:50:361.03.011.0N                 141107214.91.00.50.000.01.019.902.50.0 │
│        22023-01-01 00:53:012023-01-01 01:01:451.01.801.0N                 23468111.41.00.53.280.01.019.682.50.0 │
│         │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───────────────┴────────────┴────────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┘

You can see with the command below that nan has been filtered out!

nyc_filtered.order_by(nyc_filtered.trip_distance.desc())
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ vendorid  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  ratecodeid  store_and_fwd_flag  pulocationid  dolocationid  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  airport_fee ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64timestamp(3)timestamp(3)float64float64float64stringint64int64int64float64float64float64float64float64float64float64float64float64     │
├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┼─────────────┤
│        22023-01-01 19:40:432023-01-01 20:21:481.062359.524.0N                 794820.00.00.00.000.000.02.502.50.00 │
│        22023-01-31 15:53:232023-01-31 16:50:382.014098.552.0N                 142132170.00.00.50.000.001.074.002.50.00 │
│        22023-01-23 08:18:502023-01-23 08:47:221.09947.031.0N                 239140123.30.00.51.910.001.029.212.50.00 │
│        22023-01-26 15:20:462023-01-26 16:01:481.09684.003.0N                 164265197.90.00.024.9826.001.0149.880.00.00 │
│        22023-01-18 19:37:412023-01-18 20:14:291.09683.761.0N                 264112163.92.50.55.000.001.074.150.01.25 │
│        22023-01-03 15:25:052023-01-03 16:09:021.09680.611.0N                 264244147.10.00.510.220.001.061.322.50.00 │
│        22023-01-18 19:43:552023-01-18 20:02:331.09679.831.0N                 26449135.27.50.53.000.001.048.450.01.25 │
│        22023-01-05 15:26:432023-01-05 16:04:171.09679.731.0N                 264244147.15.00.512.536.551.076.432.51.25 │
│        22023-01-19 19:36:582023-01-19 20:01:441.09675.571.0N                 264151124.02.50.51.580.001.032.082.50.00 │
│        22023-01-09 15:44:582023-01-09 16:02:351.09674.671.0N                 264239217.70.00.50.000.001.019.200.00.00 │
│         │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───────────────┴────────────┴────────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┘

Let’s add a column to our dataset. I want to add a column to help calculate the average ride duration. We are going to use the Ibis ‘Delta’ function for this result Ibis is also pretty cool and can simply visualize a column in isolation:

ride_duration = nyc_filtered.tpep_dropoff_datetime.delta(nyc_filtered.tpep_pickup_datetime, "minute").name("rideminutes")
ride_duration
┏━━━━━━━━━━━━━┓
┃ rideminutes ┃
┡━━━━━━━━━━━━━┩
│ int64       │
├─────────────┤
│           8 │
│           6 │
│          12 │
│          11 │
│          12 │
│          10 │
│          22 │
│          15 │
│          11 │
│           8 │
│            │
└─────────────┘

We can also combine the column with our original table using the ‘mutate’ method shown here.

nycjanduration = nyc_filtered.mutate(rideminutes=nyc_filtered.tpep_dropoff_datetime.delta(nyc_filtered.tpep_pickup_datetime, "minute"))
nycjanduration["vendorid","rideminutes","trip_distance"]
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ vendorid  rideminutes  trip_distance ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ int64int64float64       │
├──────────┼─────────────┼───────────────┤
│        280.97 │
│        261.10 │
│        2122.51 │
│        2111.43 │
│        2121.84 │
│        2101.66 │
│        22211.70 │
│        2152.95 │
│        2113.01 │
│        281.80 │
│         │
└──────────┴─────────────┴───────────────┘
nycjanduration["vendorid","rideminutes","trip_distance"].head(3)
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ vendorid  rideminutes  trip_distance ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ int64int64float64       │
├──────────┼─────────────┼───────────────┤
│        280.97 │
│        261.10 │
│        2122.51 │
└──────────┴─────────────┴───────────────┘

Next up are some basic analytics and aggregations in Ibis - let’s get total revenue with sum(), longest trip with max(), and average trip duration with mean(). Ibis is able to chain expressions similar to pandas.

#some basic analytics - let's get total revenue, longest trip. 
insights = nycjanduration.agg(
    [
        ibis._.count().name("total_trips"),
        ibis._["total_amount"].sum().name("total_revenue"),
        ibis._["trip_distance"].sum().name("total_distance_all"),
        ibis._["rideminutes"].max().name("longest"),    
        ibis._["rideminutes"].mean().round(2).name("average_ride")
    ]
)
insights
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ total_trips  total_revenue  total_distance_all  longest  average_ride ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ int64float64float64int64float64      │
├─────────────┼───────────────┼────────────────────┼─────────┼──────────────┤
│     29438597.953883e+071.015018e+071003015.69 │
└─────────────┴───────────────┴────────────────────┴─────────┴──────────────┘

Wait, the longest trip seems a bit… lengthy… Note: we added a .round function to display the average ride more nicely. Let’s check out the ride itself.

nycjanduration.filter(nycjanduration["rideminutes"] == 10030)
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ vendorid  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  ratecodeid  store_and_fwd_flag  pulocationid  dolocationid  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  airport_fee  rideminutes ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64timestamp(3)timestamp(3)float64float64float64stringint64int64int64float64float64float64float64float64float64float64float64float64int64       │
├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┼─────────────┼─────────────┤
│        22023-01-23 11:21:512023-01-30 10:31:021.00.01.0N                 2079523.00.00.50.00.01.04.50.00.010030 │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───────────────┴────────────┴────────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┴─────────────┘

7 day trip? looks like the trip distance is zero, we can decide to remove the row from future calculations of average Let’s remove the outliers and join with a lookup table to get more information about the “where” of our analytical datasets - zones.

nycjanduration_new = (
    nycjanduration.filter(nycjanduration.trip_distance != 0.0)
)
nycjanduration_new
    
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ vendorid  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  ratecodeid  store_and_fwd_flag  pulocationid  dolocationid  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  airport_fee  rideminutes ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ int64timestamp(3)timestamp(3)float64float64float64stringint64int64int64float64float64float64float64float64float64float64float64float64int64       │
├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┼─────────────┼─────────────┤
│        22023-01-01 00:32:102023-01-01 00:40:361.00.971.0N                 16114129.31.00.50.000.01.014.302.50.08 │
│        22023-01-01 00:55:082023-01-01 01:01:271.01.101.0N                 4323717.91.00.54.000.01.016.902.50.06 │
│        22023-01-01 00:25:042023-01-01 00:37:491.02.511.0N                 48238114.91.00.515.000.01.034.902.50.012 │
│        22023-01-01 00:10:292023-01-01 00:21:191.01.431.0N                 10779111.41.00.53.280.01.019.682.50.011 │
│        22023-01-01 00:50:342023-01-01 01:02:521.01.841.0N                 161137112.81.00.510.000.01.027.802.50.012 │
│        22023-01-01 00:09:222023-01-01 00:19:491.01.661.0N                 239143112.11.00.53.420.01.020.522.50.010 │
│        22023-01-01 00:27:122023-01-01 00:49:561.011.701.0N                 142200145.71.00.510.743.01.064.442.50.022 │
│        22023-01-01 00:21:442023-01-01 00:36:401.02.951.0N                 164236117.71.00.55.680.01.028.382.50.015 │
│        22023-01-01 00:39:422023-01-01 00:50:361.03.011.0N                 141107214.91.00.50.000.01.019.902.50.011 │
│        22023-01-01 00:53:012023-01-01 01:01:451.01.801.0N                 23468111.41.00.53.280.01.019.682.50.08 │
│         │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───────────────┴────────────┴────────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┴─────────────┘

Let’s create a cleaner set similar to before.

insights_new = nycjanduration_new.agg(
    [
        ibis._.count().name("total_trips"),
        ibis._["total_amount"].sum().name("total_revenue"),
        ibis._["trip_distance"].sum().name("total_distance_all"),
        ibis._["rideminutes"].max().name("longest"),    
        ibis._["rideminutes"].mean().round(2).name("average_ride")
    ]
)
insights_new
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ total_trips  total_revenue  total_distance_all  longest  average_ride ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ int64float64float64int64float64      │
├─────────────┼───────────────┼────────────────────┼─────────┼──────────────┤
│     29066707.838373e+071.015018e+07617915.76 │
└─────────────┴───────────────┴────────────────────┴─────────┴──────────────┘

You can already see a slightly more massaged dataset - the longest trip is lower, alongside average_ride has changed and the total number of trips has gone down by almost 40k

Next up we want to do something more powerful - join with related datasets to get more insights on geographical behaviour of taxi trips around NYC. Let’s look over the zonelookup table again.

zonelookup
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ locationid  borough        zone                     service_zone ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ stringstringstringstring       │
├────────────┼───────────────┼─────────────────────────┼──────────────┤
│ 1         EWR          Newark Airport         EWR          │
│ 2         Queens       Jamaica Bay            Boro Zone    │
│ 3         Bronx        Allerton/Pelham GardensBoro Zone    │
│ 4         Manhattan    Alphabet City          Yellow Zone  │
│ 5         Staten IslandArden Heights          Boro Zone    │
│ 6         Staten IslandArrochar/Fort WadsworthBoro Zone    │
│ 7         Queens       Astoria                Boro Zone    │
│ 8         Queens       Astoria Park           Boro Zone    │
│ 9         Queens       Auburndale             Boro Zone    │
│ 10        Queens       Baisley Park           Boro Zone    │
│             │
└────────────┴───────────────┴─────────────────────────┴──────────────┘

We can see pulocationid is int64, so we must cast to have the tables fully joined. Ibis supports casting data types within its library as well. In the line below, we use .cast(“str”) to ensure the two tables can be joined together. You can try without the cast and see what happens :).

joineddata = nycjanduration_new.inner_join(zonelookup, nycjanduration_new.pulocationid.cast("str") == zonelookup.locationid)
joineddata
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ vendorid  tpep_pickup_datetime  tpep_dropoff_datetime  passenger_count  trip_distance  ratecodeid  store_and_fwd_flag  pulocationid  dolocationid  payment_type  fare_amount  extra    mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  airport_fee  rideminutes  locationid  borough    zone                   service_zone ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ int64timestamp(3)timestamp(3)float64float64float64stringint64int64int64float64float64float64float64float64float64float64float64float64int64stringstringstringstring       │
├──────────┼──────────────────────┼───────────────────────┼─────────────────┼───────────────┼────────────┼────────────────────┼──────────────┼──────────────┼──────────────┼─────────────┼─────────┼─────────┼────────────┼──────────────┼───────────────────────┼──────────────┼──────────────────────┼─────────────┼─────────────┼────────────┼───────────┼───────────────────────┼──────────────┤
│        22023-01-01 00:32:102023-01-01 00:40:361.00.971.0N                 16114129.31.00.50.000.01.014.302.50.08161       ManhattanMidtown Center       Yellow Zone  │
│        22023-01-01 00:55:082023-01-01 01:01:271.01.101.0N                 4323717.91.00.54.000.01.016.902.50.0643        ManhattanCentral Park         Yellow Zone  │
│        22023-01-01 00:25:042023-01-01 00:37:491.02.511.0N                 48238114.91.00.515.000.01.034.902.50.01248        ManhattanClinton East         Yellow Zone  │
│        22023-01-01 00:10:292023-01-01 00:21:191.01.431.0N                 10779111.41.00.53.280.01.019.682.50.011107       ManhattanGramercy             Yellow Zone  │
│        22023-01-01 00:50:342023-01-01 01:02:521.01.841.0N                 161137112.81.00.510.000.01.027.802.50.012161       ManhattanMidtown Center       Yellow Zone  │
│        22023-01-01 00:09:222023-01-01 00:19:491.01.661.0N                 239143112.11.00.53.420.01.020.522.50.010239       ManhattanUpper West Side SouthYellow Zone  │
│        22023-01-01 00:27:122023-01-01 00:49:561.011.701.0N                 142200145.71.00.510.743.01.064.442.50.022142       ManhattanLincoln Square East  Yellow Zone  │
│        22023-01-01 00:21:442023-01-01 00:36:401.02.951.0N                 164236117.71.00.55.680.01.028.382.50.015164       ManhattanMidtown South        Yellow Zone  │
│        22023-01-01 00:39:422023-01-01 00:50:361.03.011.0N                 141107214.91.00.50.000.01.019.902.50.011141       ManhattanLenox Hill West      Yellow Zone  │
│        22023-01-01 00:53:012023-01-01 01:01:451.01.801.0N                 23468111.41.00.53.280.01.019.682.50.08234       ManhattanUnion Sq             Yellow Zone  │
│                    │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴───────────────┴────────────┴────────────────────┴──────────────┴──────────────┴──────────────┴─────────────┴─────────┴─────────┴────────────┴──────────────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┴─────────────┴────────────┴───────────┴───────────────────────┴──────────────┘

Now we can do more cool things in ibis with group bys and aggregate by with zones and boroughs!

groupbyboroughtrips = (
    joineddata
    .group_by("zone")
    .aggregate(
        trips=joineddata.vendorid.count(),
        totalrev=joineddata.fare_amount.sum(),
        totalpassengers=joineddata.passenger_count.sum(),
        averageride=joineddata.rideminutes.mean().round(2)
        
        )
    .order_by(ibis.desc("totalrev"))
    .limit(10)
)
groupbyboroughtrips
    
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ zone                          trips   totalrev      totalpassengers  averageride ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ stringint64float64float64float64     │
├──────────────────────────────┼────────┼──────────────┼─────────────────┼─────────────┤
│ JFK Airport                 1544929.148091e+06223590.037.12 │
│ LaGuardia Airport           873523.577509e+06118834.026.77 │
│ Midtown Center              1300311.956489e+06179987.014.58 │
│ Upper East Side South       1420431.732204e+06191430.011.78 │
│ Upper East Side North       1315811.708854e+06179243.012.53 │
│ Times Sq/Theatre District   950981.623739e+06140826.015.71 │
│ Penn Station/Madison Sq West1055661.619161e+06142369.015.60 │
│ Midtown East                1014561.490681e+06136471.014.05 │
│ Lincoln Square East         955301.282554e+06133103.012.64 │
│ Murray Hill                 845671.237674e+06114587.013.50 │
└──────────────────────────────┴────────┴──────────────┴─────────────────┴─────────────┘

If you want to see what sql ibis generates, you can use the ibis.to_sql() method.

ibis.to_sql(groupbyboroughtrips)
WITH t0 AS (
  SELECT
    t3.vendorid AS vendorid,
    t3.tpep_pickup_datetime AS tpep_pickup_datetime,
    t3.tpep_dropoff_datetime AS tpep_dropoff_datetime,
    t3.passenger_count AS passenger_count,
    t3.trip_distance AS trip_distance,
    t3.ratecodeid AS ratecodeid,
    t3.store_and_fwd_flag AS store_and_fwd_flag,
    t3.pulocationid AS pulocationid,
    t3.dolocationid AS dolocationid,
    t3.payment_type AS payment_type,
    t3.fare_amount AS fare_amount,
    t3.extra AS extra,
    t3.mta_tax AS mta_tax,
    t3.tip_amount AS tip_amount,
    t3.tolls_amount AS tolls_amount,
    t3.improvement_surcharge AS improvement_surcharge,
    t3.total_amount AS total_amount,
    t3.congestion_surcharge AS congestion_surcharge,
    t3.airport_fee AS airport_fee
  FROM dh_nyctaxi_video.taxizonenyc AS t3
  WHERE
    t3.passenger_count <> 0 OR FALSE
), t1 AS (
  SELECT
    t0.vendorid AS vendorid,
    t0.tpep_pickup_datetime AS tpep_pickup_datetime,
    t0.tpep_dropoff_datetime AS tpep_dropoff_datetime,
    t0.passenger_count AS passenger_count,
    t0.trip_distance AS trip_distance,
    t0.ratecodeid AS ratecodeid,
    t0.store_and_fwd_flag AS store_and_fwd_flag,
    t0.pulocationid AS pulocationid,
    t0.dolocationid AS dolocationid,
    t0.payment_type AS payment_type,
    t0.fare_amount AS fare_amount,
    t0.extra AS extra,
    t0.mta_tax AS mta_tax,
    t0.tip_amount AS tip_amount,
    t0.tolls_amount AS tolls_amount,
    t0.improvement_surcharge AS improvement_surcharge,
    t0.total_amount AS total_amount,
    t0.congestion_surcharge AS congestion_surcharge,
    t0.airport_fee AS airport_fee,
    DATE_DIFF(
      'minute',
      DATE_TRUNC('minute', t0.tpep_pickup_datetime),
      DATE_TRUNC('minute', t0.tpep_dropoff_datetime)
    ) AS rideminutes
  FROM t0
  WHERE
    t0.trip_distance <> 0.0
)
SELECT
  t2.zone,
  t2.trips,
  t2.totalrev,
  t2.totalpassengers,
  t2.averageride
FROM (
  SELECT
    t3.zone AS zone,
    COUNT(t1.vendorid) AS trips,
    SUM(t1.fare_amount) AS totalrev,
    SUM(t1.passenger_count) AS totalpassengers,
    ROUND(AVG(t1.rideminutes), 2) AS averageride
  FROM t1
  JOIN dh_nyctaxi_video.zonelookup AS t3
    ON CAST(t1.pulocationid AS VARCHAR) = t3.locationid
  GROUP BY
    1
) AS t2
ORDER BY
  t2.totalrev DESC
LIMIT 10

Airport rides give the most revenue to taxi companies, that makes a lot of sense.

Let’s write our result tables back to trino (to show some write functionality, of course).

con.create_table("groupbyboroughtrips_new", groupbyboroughtrips)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ zone                          trips   totalrev      totalpassengers  averageride ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ stringint64float64float64float64     │
├──────────────────────────────┼────────┼──────────────┼─────────────────┼─────────────┤
│ JFK Airport                 1544929.148091e+06223590.037.12 │
│ LaGuardia Airport           873523.577509e+06118834.026.77 │
│ Midtown Center              1300311.956489e+06179987.014.58 │
│ Upper East Side South       1420431.732204e+06191430.011.78 │
│ Upper East Side North       1315811.708854e+06179243.012.53 │
│ Times Sq/Theatre District   950981.623739e+06140826.015.71 │
│ Penn Station/Madison Sq West1055661.619161e+06142369.015.60 │
│ Midtown East                1014561.490681e+06136471.014.05 │
│ Lincoln Square East         955301.282554e+06133103.012.64 │
│ Murray Hill                 845671.237674e+06114587.013.50 │
└──────────────────────────────┴────────┴──────────────┴─────────────────┴─────────────┘

There you have it, a quick tutorial with Ibis, and Starburst Galaxy!