Doing more with data: An introduction to Arrow for R users

---

.center[.large[.boldface[Doing more with data: An introduction to Arrow for R users]]]

.center[.midi[by Danielle Navarro]] 
.center[.midi[[djnavarro.net/slides-data-thread-2022](http://djnavarro.net/slides-data-thread-2022)]]

<!--
font-size classes (relative to normal 100%)

.larger = 200% 
.large  = 130% 
.midi   = 85% 
.small  = 70% 
.xsmall = 60% 
.tiny   = 50% 
-->

---

## Who is Danielle Navarro?

.pull-right-wide[
<br>
- Developer advocate at Voltron Data
- Data scientist and generative artist
- Tweets incessantly at [@djnavarro](https://twitter.com/djnavarro)
]

---

## What is Apache Arrow?

> A multi-language toolbox <br>
> for accelerated data interchange <br> 
> and in-memory processing

---

## Accelerating data interchange

---

## Accelerating data interchange

---

## Efficient in-memory processing

---

## Efficient in-memory processing

---

## Efficient in-memory processing

---

## Efficient in-memory processing

---

## Where does R fit (in Arrow)?

---

## Where does R fit (in Arrow)?

---

## Where does R fit (in Arrow)?

---

## Where does R fit (in Arrow)?

---

## Where does Arrow fit (in R)?

---

## Where does Arrow fit (in R)?

- Analyze, process, and write multi-file parquet, csv, S3 buckets, ...
--

- Larger-than-memory data sets
--

- ... many other things...
--

- Analyze Arrow data using dplyr verbs

---

## Where does Arrow fit (in R)?

---

## Where does Arrow fit (in R)?

---

## Where does Arrow fit (in R)?

---

## dplyr connects to an Arrow backend

---

## Let's see it in action: 
## Analyzing Arrow data in R

---

## Installing {arrow}

- Install arrow by typing `install.packages("arrow")`
- To use dplyr pipelines both arrow and dplyr must be loaded

```r
library(arrow)
library(dplyr)
```

- Installation resources:
  - [arrow.apache.org/docs/r](https://arrow.apache.org/docs/r/)
  - [blog.djnavarro.net/starting-apache-arrow-in-r](https://blog.djnavarro.net/starting-apache-arrow-in-r/)

---

class: theme-dark-minimal
background-image: url("img/redd-Ra8VlGo0tyI-unsplash.jpg")
background-size: cover
background-repeat: no-repeat
background-position: left

## Step 1: Get data

<div class="my-footer">
<span>
<a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/Ra8VlGo0tyI"><b>Image freely available courtesy of Redd, via Unsplash</b></a>
</span>
</div>

---

## The NYC taxi data

- The data set:

- Original data from the New York City Taxi and Limosine Service
  - Pickup and dropoff locations and time, fees, tips, etc
  - Approximately ~1.7 billion rows in a single table, from 2009 to 2022
  
--

- The case study:
  - Extract taxi rides from the three major airports
  - Visualise the destinations by taxi zone

---

## Downloading the data

```r
copy_files(
  from = s3_bucket("ursa-labs-taxi-data-v2"),
  to = "~/Datasets/nyc-taxi"
)
```

- Warning: this is 69GB in total
- It takes a long time!
]

---

## Examining the data

- NYC taxi is a multi-file data set
  - 158 separate parquet files
  - Partitioned by year and month

```r
fs::dir_tree("~/Datasets/nyc-taxi/")
```
]

```
## ~/Datasets/nyc-taxi/
## ├── year=2009
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2010
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2011
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2012
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2013
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2014
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2015
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2016
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2017
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2018
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2019
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2020
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## ├── year=2021
## │   ├── month=1
## │   │   └── part-0.parquet
## │   ├── month=10
## │   │   └── part-0.parquet
## │   ├── month=11
## │   │   └── part-0.parquet
## │   ├── month=12
## │   │   └── part-0.parquet
## │   ├── month=2
## │   │   └── part-0.parquet
## │   ├── month=3
## │   │   └── part-0.parquet
## │   ├── month=4
## │   │   └── part-0.parquet
## │   ├── month=5
## │   │   └── part-0.parquet
## │   ├── month=6
## │   │   └── part-0.parquet
## │   ├── month=7
## │   │   └── part-0.parquet
## │   ├── month=8
## │   │   └── part-0.parquet
## │   └── month=9
## │       └── part-0.parquet
## └── year=2022
##     ├── month=1
##     │   └── part-0.parquet
##     └── month=2
##         └── part-0.parquet
```
]
]

---

## Opening a dataset

```r
nyc_taxi <- open_dataset("~/Datasets/nyc-taxi")
nyc_taxi
```
]
--
.pull-left-wide[
.scroll-box-10[

```
## FileSystemDataset with 158 Parquet files
## vendor_name: string
## pickup_datetime: timestamp[ms]
## dropoff_datetime: timestamp[ms]
## passenger_count: int64
## trip_distance: double
## pickup_longitude: double
## pickup_latitude: double
## rate_code: string
## store_and_fwd: string
## dropoff_longitude: double
## dropoff_latitude: double
## payment_type: string
## fare_amount: double
## extra: double
## mta_tax: double
## tip_amount: double
## tolls_amount: double
## total_amount: double
## improvement_surcharge: double
## congestion_surcharge: double
## pickup_location_id: int64
## dropoff_location_id: int64
## year: int32
## month: int32
```
]
]

---

## Opening a dataset

```r
nyc_taxi |> 
  nrow()
```

```
## [1] 1672590319
```
]

---

## Opening a dataset

.pull-left-wide[
- Is it really 1.7 billion rows? (Yes)
- Is all this data somehow loaded in R?
]
--
.pull-left-wide[
- Absolutely not!

```r
nyc_taxi |> 
  lobstr::obj_size()
```

```
## 261,088 B
```
]

---

## Opening a dataset

.pull-left-wide[
- Is it really 1.7 billion rows? (Yes)
- Is all this data somehow loaded in R? (No)
- Is there a trick? (Yes: read files as needed)
]

---

## Using dplyr verbs: Select and filter

```r
nyc_taxi |> 
  filter(year == 2019) |>
  select(matches("pickup"), matches("dropoff")) 
```
]
--

```
## FileSystemDataset (query)
## pickup_datetime: timestamp[ms]
## pickup_longitude: double
## pickup_latitude: double
## pickup_location_id: int64
## dropoff_datetime: timestamp[ms]
## dropoff_longitude: double
## dropoff_latitude: double
## dropoff_location_id: int64
## 
## * Filter: (year == 2019)
## See $.data for the source Arrow object
```
]

---

## Using dplyr verbs: Select and filter

```r
nyc_taxi |> 
  filter(year == 2019) |>
  select(matches("pickup"), matches("dropoff")) 
```
]

---

## Check before you `collect()`!

```r
nyc_taxi |> 
  filter(year == 2019) |>
  select(matches("pickup"), matches("dropoff")) |>
  nrow()
```

```
## [1] 84393604
```
]

- 84 million rows is more than I want
- Perhaps better to collect a subset...
  - Calling `compute()` evaluates in Arrow
  - Calling `collect()` evaluates and returns to R
]

---

## Use `head()` to collect a tiny subset

```r
nyc_taxi |> 
  filter(year == 2019) |>
  select(matches("pickup"), matches("dropoff")) |>
  head() |>
  collect()
```

- Only collects the first few rows
- Can be evaluated efficiently

---

## Aha! Spatial information in "location_id"

```r
nyc_taxi |> 
  filter(year == 2019) |>
  select(matches("pickup"), matches("dropoff")) |>
  head() |>
  collect()
```

```
## # A tibble: 6 × 8
##   pickup_datetime     pickup_longitude pickup_latitude pickup_location_id
##   <dttm>                         <dbl>           <dbl>              <int>
## 1 2019-11-02 04:31:11               NA              NA                161
## 2 2019-11-02 04:31:11               NA              NA                161
## 3 2019-11-02 04:43:43               NA              NA                233
## 4 2019-11-02 04:57:46               NA              NA                164
## 5 2019-11-02 04:57:46               NA              NA                164
## 6 2019-11-02 04:36:58               NA              NA                261
## # … with 4 more variables: dropoff_datetime <dttm>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, dropoff_location_id <int>
```

<!--

<code class ='r hljs remark-code'>nyc_taxi |> <br>&nbsp;&nbsp;filter(year == 2019) |><br>&nbsp;&nbsp;select(matches("pickup"), matches("dropoff")) |><br>&nbsp;&nbsp;<span style='background-color:#000000'>head</span>() |><br>&nbsp;&nbsp;collect()</code>

```

## # A tibble: 6 × 8
##   pickup_datetime     pickup_longitude pickup_latitude pickup_location_id
##   <dttm>                         <dbl>           <dbl>              <int>
## 1 2019-01-02 03:11:05               NA              NA                233
## 2 2019-01-02 02:56:45               NA              NA                186
## 3 2019-01-02 03:18:28               NA              NA                114
## 4 2019-01-02 03:25:43               NA              NA                234
## 5 2019-01-02 03:51:35               NA              NA                246
## 6 2019-01-02 03:51:54               NA              NA                211
## # … with 4 more variables: dropoff_datetime <dttm>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, dropoff_location_id <int>

```
-->

---

class: theme-dark-minimal
background-image: url("img/paolo-chiabrando-1fBiGUm_jbw-unsplash.jpg")
background-size: cover
background-repeat: no-repeat
background-position: left

## Step 2: Find airports

<div class="my-footer">
<span>
<a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/1fBiGUm_jbw"><b>Image freely available courtesy of Paolo Chiabrando, via Unsplash</b></a>
</span>
</div>

---

## A secondary table

- NYC TLC also supplies auxiliary files in the source data
- One file is a CSV with extra information about taxi zones
- It's a small table, we can handle it natively in R:

```r
nyc_taxi_zones <- "data/taxi_zone_lookup.csv" |> 
  read_csv_arrow() |>
  janitor::clean_names()
```
]

---

## A secondary table

```r
nyc_taxi_zones
```

```
## # A tibble: 265 × 4
##    location_id borough       zone                    service_zone
##          <int> <chr>         <chr>                   <chr>       
##  1           1 EWR           Newark Airport          EWR         
##  2           2 Queens        Jamaica Bay             Boro Zone   
##  3           3 Bronx         Allerton/Pelham Gardens Boro Zone   
##  4           4 Manhattan     Alphabet City           Yellow Zone 
##  5           5 Staten Island Arden Heights           Boro Zone   
##  6           6 Staten Island Arrochar/Fort Wadsworth Boro Zone   
##  7           7 Queens        Astoria                 Boro Zone   
##  8           8 Queens        Astoria Park            Boro Zone   
##  9           9 Queens        Auburndale              Boro Zone   
## 10          10 Queens        Baisley Park            Boro Zone   
## # … with 255 more rows
```
]

---

## A secondary table

```r
nyc_taxi_zones |> 
  filter(str_detect(zone, "Airport"))
```

```
## # A tibble: 3 × 4
##   location_id borough zone              service_zone
##         <int> <chr>   <chr>             <chr>       
## 1           1 EWR     Newark Airport    EWR         
## 2         132 Queens  JFK Airport       Airports    
## 3         138 Queens  LaGuardia Airport Airports
```
]

---

## Extract airport zones

```r
airport_zones <- nyc_taxi_zones |> 
  filter(str_detect(zone, "Airport")) |>
  pull(location_id)

airport_zones
```

```
## [1]   1 132 138
```
]

---

class: theme-dark-minimal
background-image: url("img/helena-lopes-bunLmHpu1hg-unsplash.jpg")
background-size: cover
background-repeat: no-repeat
background-position: left

## Step 3: Wrangle data

<div class="my-footer">
<span>
<a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/bunLmHpu1hg"><b>Image freely available courtesy of Helena Lopes, via Unsplash</b></a>
</span>
</div>

---

## Let's find the airport pickups...

```r
nyc_taxi |> 
  filter(
    pickup_location_id %in% airport_zones
  ) |>
  select(
    matches("datetime"), 
    matches("location_id")
  )
```
]

```
## FileSystemDataset (query)
## pickup_datetime: timestamp[ms]
## dropoff_datetime: timestamp[ms]
## pickup_location_id: int64
## dropoff_location_id: int64
## 
## * Filter: is_in(pickup_location_id, {value_set=int32:[
##   1,
##   132,
##   138
## ], skip_nulls=true})
## See $.data for the source Arrow object
```
]

---

## We need to use database joins

```r
nyc_taxi |> 
  filter(
    pickup_location_id %in% airport_zones
  ) |>
  select(
    matches("datetime"), 
    matches("location_id")
  ) |> 
  left_join(
    nyc_taxi_zones,
    by = c(
      "dropoff_location_id" = "location_id"
    )
  )
```
]
--
.pull-right[
.scroll-output[

```
## FileSystemDataset (query)
## pickup_datetime: timestamp[ms]
## dropoff_datetime: timestamp[ms]
## pickup_location_id: int64
## dropoff_location_id: int64
## borough: string
## zone: string
## service_zone: string
## 
## See $.data for the source Arrow object
```
]
]

---

## We need to use database joins

---

## Wait... why does the join throw an error?

```
## Error in `collect()`:
## ! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(dropoff_location_id) of type int64 and FieldRef.Name(location_id) of type int32
```
]
]

---

## Wait... why does the join throw an error?

.pull-right[
- `location_id`: 
    - 32-bit integer
- `dropoff_location_id`: 
    - 64-bit integer
- Needs to be handled explicitly
- May be new to some R users
]

---

## Solution: use schemas

- The problem:

- Data types in R and Arrow are not identical
  - The data import defaults in {arrow} are good but...
  - ... they don't always work

--
  
- The solution:

- Schemas describe types for each variable
  - User supplies types explicitly

---

## Solution: use schemas

```r
schema(
  dropoff_location_id = int64(),
  dropoff_borough = utf8(),
  dropoff_zone = utf8()
)
```

```
## Schema
## dropoff_location_id: int64
## dropoff_borough: string
## dropoff_zone: string
```
]

---

## Solution: use schemas

```r
nyc_taxi_zones_2 <- nyc_taxi_zones |>
  transmute(
    dropoff_location_id = location_id, 
    dropoff_borough = borough,
    dropoff_zone = zone) |>
  as_arrow_table(
    schema = schema(
      dropoff_location_id = int64(),
      dropoff_borough = utf8(),
      dropoff_zone = utf8()
    )
  )

nyc_taxi_zones_2
```
]
--
.pull-right[
.scroll-output[

```
## Table
## 265 rows x 3 columns
## $dropoff_location_id <int64>
## $dropoff_borough <string>
## $dropoff_zone <string>
```
]
]

---

## Now it all works...

```r
nyc_taxi |> 
  filter(
    pickup_location_id %in% airport_zones
  ) |>
  select(
    matches("datetime"), 
    matches("location_id")
  ) |> 
  left_join(nyc_taxi_zones_2) |>
  collect()
```
]
--
.pull-right[
.scroll-output[

```
## # A tibble: 22,228,736 × 6
##    pickup_datetime     dropoff_datetime    pickup_location_id dropoff_location_…
##    <dttm>              <dttm>                           <int>              <int>
##  1 2016-10-01 21:25:28 2016-10-01 22:12:36                132                142
##  2 2016-10-01 21:25:31 2016-10-01 21:54:12                132                  7
##  3 2016-10-01 21:25:38 2016-10-01 21:54:09                138                181
##  4 2016-10-01 21:25:44 2016-10-01 21:53:04                138                 17
##  5 2016-10-01 21:25:47 2016-10-01 22:46:14                132                162
##  6 2016-10-01 21:25:48 2016-10-01 22:22:16                132                230
##  7 2016-10-01 21:25:53 2016-10-01 22:11:04                132                  4
##  8 2016-10-01 21:26:04 2016-10-01 21:39:15                138                 56
##  9 2016-10-01 21:26:15 2016-10-01 22:10:25                132                 25
## 10 2016-10-01 21:26:18 2016-10-01 21:46:38                138                132
## # … with 22,228,726 more rows, and 2 more variables: dropoff_borough <chr>,
## #   dropoff_zone <chr>
```
]
]

---

class: theme-light-minimal
background-image: url("img/chris-lawton-5IHz5WhosQE-unsplash.jpg")
background-size: cover
background-repeat: no-repeat
background-position: left

## Step 4: Summarize

<div class="my-footer">
<span>
<a style="color:#f97b64; background-color: #22222288; padding: 5px" href="https://unsplash.com/photos/5IHz5WhosQE"><b>Image freely available courtesy of Chris Lawton, via Unsplash</b></a>
</span>
</div>

---

## Our query so far

```r
nyc_taxi |> 
  filter(
    pickup_location_id %in% airport_zones
  ) |>
  select(
    matches("datetime"), 
    matches("location_id")
  ) |> 
  left_join(nyc_taxi_zones_2)
```
]

---

## Count rides by dropoff zone

---

## Count rides by dropoff zone

nyc_taxi_zone_counts
```
]

```
## # A tibble: 262 × 2
##    dropoff_zone                    n
##    <chr>                       <int>
##  1 Times Sq/Theatre District 1068981
##  2 Midtown Center             691807
##  3 Midtown East               691641
##  4 JFK Airport                645621
##  5 Midtown South              551206
##  6 Clinton East               544932
##  7 Murray Hill                534570
##  8 Midtown North              515244
##  9 <NA>                       419348
## 10 LaGuardia Airport          307649
## # … with 252 more rows
```
]
]

---

## (A tiny bit more work)

```r
dat <- "data/taxi_zones/taxi_zones.shp" |> 
  sf::read_sf() |>
  janitor::clean_names() |>
  left_join(nyc_taxi_zone_counts,
            by = c("zone" = "dropoff_zone"))

pic <- dat |> 
  ggplot(aes(fill = n)) + 
  geom_sf() + 
  theme_void()
```
]

---

class: theme-light-minimal
background-image: url("img/airport_map.png")
background-size: cover
background-repeat: no-repeat
background-position: left

## Data visualization

---

## Getting a deeper understanding

---

## Where does the new R user go?

- Documentation: [arrow.apache.org/docs/r](https://arrow.apache.org/docs/r/)
- Cookbook: [arrow.apache.org/cookbook/r](https://arrow.apache.org/cookbook/r/)
- Resource list: [github.com/thisisnic/awesome-arrow-r](https://github.com/thisisnic/awesome-arrow-r/)

---

## How can the Arrow community help?

- Looking at our docs from a new user perspective
- Putting together novice friendly tutorials
- Being visible and helpful where we can!

---

## Making connections with data!

.pull-right[
<ol class="fa-ul">
  <li><span class="fa-li"><i class="fa fa-twitter"></i></span>twitter.com/djnavarro</li>
  <li><span class="fa-li"><i class="fa fa-github"></i></span>github.com/djnavarro</li>
  <li><span class="fa-li"><i class="fa fa-globe"></i></span>djnavarro.net</li>
</ol>
]