layout: true <div class="my-footer"> <span> <!-- You could put a footer link in here, which I used to do for a lot of my other slide decks, but it probably doesn't work well with this theme --> </span> </div> --- class: title-slide, theme-green-maximal <!-- title slide style --> .center[.large[.boldface[Doing more with data: An introduction to Arrow for R users]]] <!-- title, one line --> .center[.midi[by Danielle Navarro]] <!-- author, one line --> .center[.midi[[djnavarro.net/slides-data-thread-2022](http://djnavarro.net/slides-data-thread-2022)]] <!-- general info, one line --> <!-- font-size classes (relative to normal 100%) .larger = 200% .large = 130% .midi = 85% .small = 70% .xsmall = 60% .tiny = 50% --> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green ## Who is Danielle Navarro? .pull-left-narrow[ .pull-right-wide[ <img src="img/danielle.png" width="100%" /> ] ] .pull-right-wide[ <br> - Developer advocate at Voltron Data - Data scientist and generative artist - Tweets incessantly at [@djnavarro](https://twitter.com/djnavarro) ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-maximal ## What is Apache Arrow? > A multi-language toolbox <br> > for accelerated data interchange <br> > and in-memory processing <!-- ## What is Apache Arrow? --> <!-- - Toolbox for in-memory analytics --> <!-- - Toolbox for larger than memory data --> <!-- - Toolbox for connecting languages and hardware --> <!-- - Efficient, performant, open source --> <!-- - Details at [arrow.apache.org](https://arrow.apache.org) --> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Accelerating data interchange <img src="img/data-interchange-1.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Accelerating data interchange <img src="img/data-interchange-2.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Efficient in-memory processing <img src="img/simd-1.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Efficient in-memory processing <img src="img/simd-2.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Efficient in-memory processing <img src="img/simd-3.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Efficient in-memory processing <img src="img/simd-4.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-maximal ## Where does R fit (in Arrow)? <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does R fit (in Arrow)? <img src="img/arrow-ecosystem-1.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does R fit (in Arrow)? <img src="img/arrow-ecosystem-2.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does R fit (in Arrow)? <img src="img/arrow-ecosystem-3.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-maximal ## Where does Arrow fit (in R)? <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does Arrow fit (in R)? - Analyze, process, and write multi-file parquet, csv, S3 buckets, ... -- - Larger-than-memory data sets -- - ... many other things... -- - Analyze Arrow data using dplyr verbs <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does Arrow fit (in R)? <img src="img/dplyr-backends-1.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does Arrow fit (in R)? <img src="img/dplyr-backends-2.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does Arrow fit (in R)? <img src="img/dplyr-backends-3.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## dplyr connects to an Arrow backend <img src="img/dplyr-backends-4.svg" width="70%" /> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-maximal ## Let's see it in action: ## Analyzing Arrow data in R <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Installing {arrow} - Install arrow by typing `install.packages("arrow")` - To use dplyr pipelines both arrow and dplyr must be loaded ```r library(arrow) library(dplyr) ``` -- - Installation resources: - [arrow.apache.org/docs/r](https://arrow.apache.org/docs/r/) - [blog.djnavarro.net/starting-apache-arrow-in-r](https://blog.djnavarro.net/starting-apache-arrow-in-r/) <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-minimal background-image: url("img/redd-Ra8VlGo0tyI-unsplash.jpg") background-size: cover background-repeat: no-repeat background-position: left ## Step 1: Get data <div class="my-footer"> <span> <a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/Ra8VlGo0tyI"><b>Image freely available courtesy of Redd, via Unsplash</b></a> </span> </div> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## The NYC taxi data - The data set: - Original data from the New York City Taxi and Limosine Service - Pickup and dropoff locations and time, fees, tips, etc - Approximately ~1.7 billion rows in a single table, from 2009 to 2022 -- - The case study: - Extract taxi rides from the three major airports - Visualise the destinations by taxi zone <!-- https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page --> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Downloading the data .pull-left-wide[ ```r copy_files( from = s3_bucket("ursa-labs-taxi-data-v2"), to = "~/Datasets/nyc-taxi" ) ``` - Warning: this is 69GB in total - It takes a long time! ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Examining the data .pull-left[ - NYC taxi is a multi-file data set - 158 separate parquet files - Partitioned by year and month ```r fs::dir_tree("~/Datasets/nyc-taxi/") ``` ] -- .pull-right[ .scroll-box-14[ ``` ## ~/Datasets/nyc-taxi/ ## ├── year=2009 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2010 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2011 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2012 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2013 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2014 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2015 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2016 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2017 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2018 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2019 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2020 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## ├── year=2021 ## │ ├── month=1 ## │ │ └── part-0.parquet ## │ ├── month=10 ## │ │ └── part-0.parquet ## │ ├── month=11 ## │ │ └── part-0.parquet ## │ ├── month=12 ## │ │ └── part-0.parquet ## │ ├── month=2 ## │ │ └── part-0.parquet ## │ ├── month=3 ## │ │ └── part-0.parquet ## │ ├── month=4 ## │ │ └── part-0.parquet ## │ ├── month=5 ## │ │ └── part-0.parquet ## │ ├── month=6 ## │ │ └── part-0.parquet ## │ ├── month=7 ## │ │ └── part-0.parquet ## │ ├── month=8 ## │ │ └── part-0.parquet ## │ └── month=9 ## │ └── part-0.parquet ## └── year=2022 ## ├── month=1 ## │ └── part-0.parquet ## └── month=2 ## └── part-0.parquet ``` ] ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Opening a dataset .pull-left-wide[ ```r nyc_taxi <- open_dataset("~/Datasets/nyc-taxi") nyc_taxi ``` ] -- .pull-left-wide[ .scroll-box-10[ ``` ## FileSystemDataset with 158 Parquet files ## vendor_name: string ## pickup_datetime: timestamp[ms] ## dropoff_datetime: timestamp[ms] ## passenger_count: int64 ## trip_distance: double ## pickup_longitude: double ## pickup_latitude: double ## rate_code: string ## store_and_fwd: string ## dropoff_longitude: double ## dropoff_latitude: double ## payment_type: string ## fare_amount: double ## extra: double ## mta_tax: double ## tip_amount: double ## tolls_amount: double ## total_amount: double ## improvement_surcharge: double ## congestion_surcharge: double ## pickup_location_id: int64 ## dropoff_location_id: int64 ## year: int32 ## month: int32 ``` ] ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Opening a dataset .pull-left-wide[ - Is it really 1.7 billion rows? ] -- .pull-left-wide[ - Absolutely yes! ```r nyc_taxi |> nrow() ``` ``` ## [1] 1672590319 ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Opening a dataset .pull-left-wide[ - Is it really 1.7 billion rows? (Yes) - Is all this data somehow loaded in R? ] -- .pull-left-wide[ - Absolutely not! ```r nyc_taxi |> lobstr::obj_size() ``` ``` ## 261,088 B ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Opening a dataset .pull-left-wide[ - Is it really 1.7 billion rows? (Yes) - Is all this data somehow loaded in R? (No) - Is there a trick? (Yes: read files as needed) ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Using dplyr verbs: Select and filter .pull-left-wide[ ```r nyc_taxi |> filter(year == 2019) |> select(matches("pickup"), matches("dropoff")) ``` ] -- .pull-left-wide[ ``` ## FileSystemDataset (query) ## pickup_datetime: timestamp[ms] ## pickup_longitude: double ## pickup_latitude: double ## pickup_location_id: int64 ## dropoff_datetime: timestamp[ms] ## dropoff_longitude: double ## dropoff_latitude: double ## dropoff_location_id: int64 ## ## * Filter: (year == 2019) ## See $.data for the source Arrow object ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Using dplyr verbs: Select and filter .pull-left-wide[ ```r nyc_taxi |> filter(year == 2019) |> select(matches("pickup"), matches("dropoff")) ``` ] .pull-left-wide[ - The output is a unevaluated query ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Check before you `collect()`! .pull-left-wide[ ```r nyc_taxi |> filter(year == 2019) |> select(matches("pickup"), matches("dropoff")) |> nrow() ``` ``` ## [1] 84393604 ``` ] .pull-left-wide[ - 84 million rows is more than I want - Perhaps better to collect a subset... - Calling `compute()` evaluates in Arrow - Calling `collect()` evaluates and returns to R ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Use `head()` to collect a tiny subset ```r nyc_taxi |> filter(year == 2019) |> select(matches("pickup"), matches("dropoff")) |> head() |> collect() ``` - Only collects the first few rows - Can be evaluated efficiently <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Aha! Spatial information in "location_id" ```r nyc_taxi |> filter(year == 2019) |> select(matches("pickup"), matches("dropoff")) |> head() |> collect() ``` ``` ## # A tibble: 6 × 8 ## pickup_datetime pickup_longitude pickup_latitude pickup_location_id ## <dttm> <dbl> <dbl> <int> ## 1 2019-11-02 04:31:11 NA NA 161 ## 2 2019-11-02 04:31:11 NA NA 161 ## 3 2019-11-02 04:43:43 NA NA 233 ## 4 2019-11-02 04:57:46 NA NA 164 ## 5 2019-11-02 04:57:46 NA NA 164 ## 6 2019-11-02 04:36:58 NA NA 261 ## # … with 4 more variables: dropoff_datetime <dttm>, dropoff_longitude <dbl>, ## # dropoff_latitude <dbl>, dropoff_location_id <int> ``` <!-- <code class ='r hljs remark-code'>nyc_taxi |> <br> filter(year == 2019) |><br> select(matches("pickup"), matches("dropoff")) |><br> <span style='background-color:#000000'>head</span>() |><br> collect()</code> ``` ## # A tibble: 6 × 8 ## pickup_datetime pickup_longitude pickup_latitude pickup_location_id ## <dttm> <dbl> <dbl> <int> ## 1 2019-01-02 03:11:05 NA NA 233 ## 2 2019-01-02 02:56:45 NA NA 186 ## 3 2019-01-02 03:18:28 NA NA 114 ## 4 2019-01-02 03:25:43 NA NA 234 ## 5 2019-01-02 03:51:35 NA NA 246 ## 6 2019-01-02 03:51:54 NA NA 211 ## # … with 4 more variables: dropoff_datetime <dttm>, dropoff_longitude <dbl>, ## # dropoff_latitude <dbl>, dropoff_location_id <int> ``` --> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-minimal background-image: url("img/paolo-chiabrando-1fBiGUm_jbw-unsplash.jpg") background-size: cover background-repeat: no-repeat background-position: left ## Step 2: Find airports <div class="my-footer"> <span> <a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/1fBiGUm_jbw"><b>Image freely available courtesy of Paolo Chiabrando, via Unsplash</b></a> </span> </div> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## A secondary table - NYC TLC also supplies auxiliary files in the source data - One file is a CSV with extra information about taxi zones - It's a small table, we can handle it natively in R: .pull-left-wide[ ```r nyc_taxi_zones <- "data/taxi_zone_lookup.csv" |> read_csv_arrow() |> janitor::clean_names() ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## A secondary table .pull-left-wide[ ```r nyc_taxi_zones ``` ``` ## # A tibble: 265 × 4 ## location_id borough zone service_zone ## <int> <chr> <chr> <chr> ## 1 1 EWR Newark Airport EWR ## 2 2 Queens Jamaica Bay Boro Zone ## 3 3 Bronx Allerton/Pelham Gardens Boro Zone ## 4 4 Manhattan Alphabet City Yellow Zone ## 5 5 Staten Island Arden Heights Boro Zone ## 6 6 Staten Island Arrochar/Fort Wadsworth Boro Zone ## 7 7 Queens Astoria Boro Zone ## 8 8 Queens Astoria Park Boro Zone ## 9 9 Queens Auburndale Boro Zone ## 10 10 Queens Baisley Park Boro Zone ## # … with 255 more rows ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## A secondary table .pull-left-wide[ ```r nyc_taxi_zones |> filter(str_detect(zone, "Airport")) ``` ``` ## # A tibble: 3 × 4 ## location_id borough zone service_zone ## <int> <chr> <chr> <chr> ## 1 1 EWR Newark Airport EWR ## 2 132 Queens JFK Airport Airports ## 3 138 Queens LaGuardia Airport Airports ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Extract airport zones .pull-left-wide[ ```r airport_zones <- nyc_taxi_zones |> filter(str_detect(zone, "Airport")) |> pull(location_id) airport_zones ``` ``` ## [1] 1 132 138 ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-minimal background-image: url("img/helena-lopes-bunLmHpu1hg-unsplash.jpg") background-size: cover background-repeat: no-repeat background-position: left ## Step 3: Wrangle data <div class="my-footer"> <span> <a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/bunLmHpu1hg"><b>Image freely available courtesy of Helena Lopes, via Unsplash</b></a> </span> </div> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Let's find the airport pickups... .pull-left[ ```r nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) ``` ] -- .pull-right[ ``` ## FileSystemDataset (query) ## pickup_datetime: timestamp[ms] ## dropoff_datetime: timestamp[ms] ## pickup_location_id: int64 ## dropoff_location_id: int64 ## ## * Filter: is_in(pickup_location_id, {value_set=int32:[ ## 1, ## 132, ## 138 ## ], skip_nulls=true}) ## See $.data for the source Arrow object ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## We need to use database joins .pull-left[ ```r nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join( nyc_taxi_zones, by = c( "dropoff_location_id" = "location_id" ) ) ``` ] -- .pull-right[ .scroll-output[ ``` ## FileSystemDataset (query) ## pickup_datetime: timestamp[ms] ## dropoff_datetime: timestamp[ms] ## pickup_location_id: int64 ## dropoff_location_id: int64 ## borough: string ## zone: string ## service_zone: string ## ## See $.data for the source Arrow object ``` ] ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## We need to use database joins .pull-left[ ```r nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join( nyc_taxi_zones, by = c( "dropoff_location_id" = "location_id" ) ) |> collect() ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Wait... why does the join throw an error? .pull-left[ ```r nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join( nyc_taxi_zones, by = c( "dropoff_location_id" = "location_id" ) ) |> collect() ``` ] .pull-right[ .scroll-output[ ``` ## Error in `collect()`: ## ! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(dropoff_location_id) of type int64 and FieldRef.Name(location_id) of type int32 ``` ] ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Wait... why does the join throw an error? .pull-left[ ```r nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join( nyc_taxi_zones, by = c( "dropoff_location_id" = "location_id" ) ) |> collect() ``` ] .pull-right[ - `location_id`: - 32-bit integer - `dropoff_location_id`: - 64-bit integer - Needs to be handled explicitly - May be new to some R users ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Solution: use schemas - The problem: - Data types in R and Arrow are not identical - The data import defaults in {arrow} are good but... - ... they don't always work -- - The solution: - Schemas describe types for each variable - User supplies types explicitly <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Solution: use schemas .pull-left[ ```r schema( dropoff_location_id = int64(), dropoff_borough = utf8(), dropoff_zone = utf8() ) ``` ``` ## Schema ## dropoff_location_id: int64 ## dropoff_borough: string ## dropoff_zone: string ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Solution: use schemas .pull-left[ ```r nyc_taxi_zones_2 <- nyc_taxi_zones |> transmute( dropoff_location_id = location_id, dropoff_borough = borough, dropoff_zone = zone) |> as_arrow_table( schema = schema( dropoff_location_id = int64(), dropoff_borough = utf8(), dropoff_zone = utf8() ) ) nyc_taxi_zones_2 ``` ] -- .pull-right[ .scroll-output[ ``` ## Table ## 265 rows x 3 columns ## $dropoff_location_id <int64> ## $dropoff_borough <string> ## $dropoff_zone <string> ``` ] ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Now it all works... .pull-left[ ```r nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join(nyc_taxi_zones_2) |> collect() ``` ] -- .pull-right[ .scroll-output[ ``` ## # A tibble: 22,228,736 × 6 ## pickup_datetime dropoff_datetime pickup_location_id dropoff_location_… ## <dttm> <dttm> <int> <int> ## 1 2016-10-01 21:25:28 2016-10-01 22:12:36 132 142 ## 2 2016-10-01 21:25:31 2016-10-01 21:54:12 132 7 ## 3 2016-10-01 21:25:38 2016-10-01 21:54:09 138 181 ## 4 2016-10-01 21:25:44 2016-10-01 21:53:04 138 17 ## 5 2016-10-01 21:25:47 2016-10-01 22:46:14 132 162 ## 6 2016-10-01 21:25:48 2016-10-01 22:22:16 132 230 ## 7 2016-10-01 21:25:53 2016-10-01 22:11:04 132 4 ## 8 2016-10-01 21:26:04 2016-10-01 21:39:15 138 56 ## 9 2016-10-01 21:26:15 2016-10-01 22:10:25 132 25 ## 10 2016-10-01 21:26:18 2016-10-01 21:46:38 138 132 ## # … with 22,228,726 more rows, and 2 more variables: dropoff_borough <chr>, ## # dropoff_zone <chr> ``` ] ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-light-minimal background-image: url("img/chris-lawton-5IHz5WhosQE-unsplash.jpg") background-size: cover background-repeat: no-repeat background-position: left ## Step 4: Summarize <div class="my-footer"> <span> <a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/5IHz5WhosQE"><b>Image freely available courtesy of Chris Lawton, via Unsplash</b></a> </span> </div> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Our query so far .pull-left[ ```r nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join(nyc_taxi_zones_2) ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Count rides by dropoff zone .pull-left[ ```r nyc_taxi_zone_counts <- nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join(nyc_taxi_zones_2) |> count(dropoff_zone) |> arrange(desc(n)) |> collect() ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Count rides by dropoff zone .pull-left[ ```r nyc_taxi_zone_counts <- nyc_taxi |> filter( pickup_location_id %in% airport_zones ) |> select( matches("datetime"), matches("location_id") ) |> left_join(nyc_taxi_zones_2) |> count(dropoff_zone) |> arrange(desc(n)) |> collect() nyc_taxi_zone_counts ``` ] .pull-right[ .scroll-output[ ``` ## # A tibble: 262 × 2 ## dropoff_zone n ## <chr> <int> ## 1 Times Sq/Theatre District 1068981 ## 2 Midtown Center 691807 ## 3 Midtown East 691641 ## 4 JFK Airport 645621 ## 5 Midtown South 551206 ## 6 Clinton East 544932 ## 7 Murray Hill 534570 ## 8 Midtown North 515244 ## 9 <NA> 419348 ## 10 LaGuardia Airport 307649 ## # … with 252 more rows ``` ] ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## (A tiny bit more work) .pull-left-wide[ ```r dat <- "data/taxi_zones/taxi_zones.shp" |> sf::read_sf() |> janitor::clean_names() |> left_join(nyc_taxi_zone_counts, by = c("zone" = "dropoff_zone")) pic <- dat |> ggplot(aes(fill = n)) + geom_sf() + theme_void() ``` ] <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-light-minimal background-image: url("img/airport_map.png") background-size: cover background-repeat: no-repeat background-position: left ## Data visualization <!-- let's make this super pretty --> <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-maximal ## Getting a deeper understanding <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## Where does the new R user go? - Documentation: [arrow.apache.org/docs/r](https://arrow.apache.org/docs/r/) - Cookbook: [arrow.apache.org/cookbook/r](https://arrow.apache.org/cookbook/r/) - Resource list: [github.com/thisisnic/awesome-arrow-r](https://github.com/thisisnic/awesome-arrow-r/) <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-green-minimal ## How can the Arrow community help? - Looking at our docs from a new user perspective - Putting together novice friendly tutorials - Being visible and helpful where we can! <!----------------------------- SLIDE BREAK -----------------------------> --- class: theme-dark-maximal ## Making connections with data! .pull-left[ - Apache Arrow is cool - Data wrangling in R is cool - {arrow} brings them together ] .pull-right[ <ol class="fa-ul"> <li><span class="fa-li"><i class="fa fa-twitter"></i></span>twitter.com/djnavarro</li> <li><span class="fa-li"><i class="fa fa-github"></i></span>github.com/djnavarro</li> <li><span class="fa-li"><i class="fa fa-globe"></i></span>djnavarro.net</li> </ol> ]