Appendix C — The gt datasets

Every dataset tells a story. The eighteen datasets bundled with gt were not assembled arbitrarily or pulled from dusty archives of convenient CSV files. Each one emerged from a specific need, a personal curiosity, or a gap in what was publicly available. Some model human behavior with surprising fidelity. Others preserve scientific data that would otherwise remain scattered across obscure government pages. A few exist simply because I thought “I wish this dataset existed” and then made it so.

This appendix takes you behind the scenes of these datasets. You will learn where they came from, how they were constructed, and why they matter beyond their immediate utility as demonstration data. Along the way, we will explore the broader contexts they represent: the economics of a neighborhood pizza shop, the 125-year evolution of urban transit in Paris, the delicate chemistry of Earth’s atmosphere, and the quiet dramas of population change in Ontario’s small towns. The datasets are more than rows and columns. They are windows into worlds worth understanding.

C.1 pizzaplace

The pizzaplace dataset contains 49,574 rows representing every pizza sold at a fictional pizzeria during the year 2015. Each row records a transaction with its timestamp, pizza type, size, and price. On the surface it appears to be straightforward sales data. Underneath, it is an elaborate simulation of human behavior, kitchen operations, and the unpredictable rhythms of a small food business.

The inspiration for this dataset came from Plateau Pizza, a real establishment in Coquitlam, British Columbia. The restaurant occupies a pleasant spot in a suburban plaza alongside a dollar store called Dollars & Cents and an IGA grocery store. It is the kind of neighborhood pizza place that survives on regulars and convenience, offering pizzas with names that are equal parts cheesy and memorable. Goat Supreme. The Calabrese. Names that stick in your head even if you cannot quite remember what toppings they included.

The dataset borrowed liberally from Plateau Pizza’s menu. Their category structure (Classic, Supreme, Veggie and Vegan, Chicken) was adopted directly. Many pizza names and ingredient combinations came straight from their website, though some were embellished and others invented entirely. The fictional additions drew inspiration from Food Network recipes and local salumist offerings that seemed appropriately gourmet. Pricing followed a similar pattern: real prices served as the baseline, then adjustments were made based on perceived ingredient costs. The fancier cheeses and cured meats commanded premium prices, as they should.

What makes pizzaplace interesting is not the menu but the simulation that generated a year’s worth of orders. The modeling script (preserved in the gt package’s source repository at data-raw/05-pizzaplace.R) creates synthetic customers who arrive throughout each day with realistic patterns. Barret Schloerke contributed refinements to the behavioral model, adding nuance to timing and preference distributions. Weekends see more traffic than weekdays, reflecting the work-leisure split that governs so much of consumer behavior. Holidays disrupt normal patterns in expected ways.

The simulation also includes what might be called narrative events. A kitchen fire at one point disrupts operations. These occasional calamities add verisimilitude to what would otherwise be suspiciously smooth data. Real businesses experience interruptions, equipment failures, staff shortages, and the occasional minor disaster. The pizzaplace data reflects this messiness, making it more valuable for realistic analysis exercises than perfectly clean synthetic data would be.

The dataset originally included plans for appetizers, side dishes, and beverages, but these were ultimately cut in favor of simplicity. A pizzeria that only sells pizza is easier to understand and analyze than one with a full menu. The constraint also keeps the focus on what makes the dataset distinctive: the careful modeling of how people order pizza throughout a year.

C.1.1 The pizzas of pizzaplace

The Complete pizzaplace Menu
All 32 pizza varieties by category.

Pizza                                        Code           Available Sizes    Price Range

chicken
The Barbecue Chicken Pizza                   bbq_ckn        S, M, L            $12.75–$20.75
The California Chicken Pizza                 cali_ckn       S, M, L            $12.75–$20.75
The Chicken Alfredo Pizza                    ckn_alfredo    S, M, L            $12.75–$20.75
The Chicken Pesto Pizza                      ckn_pesto      S, M, L            $12.75–$20.75
The Southwest Chicken Pizza                  southw_ckn     S, M, L            $12.75–$20.75
The Thai Chicken Pizza                       thai_ckn       S, M, L            $12.75–$20.75

classic
The Big Meat Pizza                           big_meat       S                  $12.00
The Classic Deluxe Pizza                     classic_dlx    S, M, L            $12.00–$20.50
The Greek Pizza                              the_greek      S, M, L, XL, XXL   $12.00–$35.95
The Hawaiian Pizza                           hawaiian       S, M, L            $10.50–$16.50
The Italian Capocollo Pizza                  ital_cpcllo    S, M, L            $12.00–$20.50
The Napolitana Pizza                         napolitana     S, M, L            $12.00–$20.50
The Pepperoni Pizza                          pepperoni      S, M, L            $9.75–$15.25
The Pepperoni, Mushroom, and Peppers Pizza   pep_msh_pep    S, M, L            $11.00–$17.50

supreme
The Brie Carre Pizza                         brie_carre     S                  $23.65
The Calabrese Pizza                          calabrese      S, M, L            $12.25–$20.25
The Italian Supreme Pizza                    ital_supr      S, M, L            $12.50–$20.75
The Pepper Salami Pizza                      peppr_salami   S, M, L            $12.50–$20.75
The Prosciutto and Arugula Pizza             prsc_argla     S, M, L            $12.50–$20.75
The Sicilian Pizza                           sicilian       S, M, L            $12.25–$20.25
The Soppressata Pizza                        soppressata    S, M, L            $12.50–$20.75
The Spicy Italian Pizza                      spicy_ital     S, M, L            $12.50–$20.75
The Spinach Supreme Pizza                    spinach_supr   S, M, L            $12.50–$20.75

veggie
The Five Cheese Pizza                        five_cheese    L                  $18.50
The Four Cheese Pizza                        four_cheese    M, L               $14.75–$17.95
The Green Garden Pizza                       green_garden   S, M, L            $12.00–$20.25
The Italian Vegetables Pizza                 ital_veggie    S, M, L            $12.75–$21.00
The Mediterranean Pizza                      mediterraneo   S, M, L            $12.00–$20.25
The Mexicana Pizza                           mexicana       S, M, L            $12.00–$20.25
The Spinach Pesto Pizza                      spin_pesto     S, M, L            $12.50–$20.75
The Spinach and Feta Pizza                   spinach_fet    S, M, L            $12.00–$20.25
The Vegetables + Vegetables Pizza            veggie_veg     S, M, L            $12.00–$20.25

The menu reveals the character of the fictional establishment. Classic pizzas stick to familiar territory: pepperoni, Hawaiian, the combinations everyone recognizes. The Supreme category ventures into more elaborate ingredient lists with names that promise indulgence. Veggie options cater to those avoiding meat, while the Chicken category builds meals around that particular protein. Most pizzas come in several sizes, though not every pizza is available in every size. The XL (and XXL!) options appear only for The Greek, and presumably that one pizza is popular enough to warrant the larger formats (could be that the ingredient mix works well at scale).

Pricing follows intuitive logic. A basic cheese pizza costs less than one loaded with prosciutto and artichoke hearts. Size increases bring proportional price increases, though the per-square-inch cost typically decreases as you go larger (the eternal economic argument for ordering the bigger pizza). These pricing patterns create natural opportunities for analysis: which pizzas generate the most revenue? Which sizes sell best and for which types? How does day of week affect the popularity of the vegetarian options?

C.1.2 A pizza tier list

Any discussion of pizza invites opinions about which pizzas are best. The following tier list represents one possible ranking of the pizzaplace menu, from essential classics to more adventurous options that may not suit every palate. Reasonable people will disagree, and that disagreement is part of what makes pizza culture endlessly entertaining.

The Definitive pizzaplace Tier List
A highly subjective ranking of 15 notable pizzas.

S Tier — Essential
The Pepperoni (Mozzarella, Pepperoni, Tomato Sauce): The benchmark against which all pizzas are measured
The Big Meat (Bacon, Pepperoni, Italian Sausage, Chorizo, Mozzarella, Tomato Sauce): Maximalist approach that somehow works
The Classic Deluxe (Pepperoni, Mushrooms, Red Onions, Red Peppers, Bacon, Mozzarella, Tomato Sauce): Everything a pizza should be

A Tier — Excellent
The Barbecue Chicken (Barbecued Chicken, Red Peppers, Green Peppers, Tomatoes, Red Onions, Mozzarella, Barbecue Sauce): Sweet and savory in perfect balance
The Hawaiian (Sliced Ham, Pineapple, Mozzarella, Tomato Sauce): Controversial but undeniably popular
The Italian Capocollo (Capocollo, Red Peppers, Tomatoes, Goat Cheese, Garlic, Oregano, Mozzarella, Tomato Sauce): Elevated ingredients lift the whole experience
The Calabrese (Nduja, Italian Sausage, Pepperoni, Tomatoes, Red Onions, Mozzarella, Tomato Sauce): Spicy nduja provides serious depth
The Prosciutto (Prosciutto, Arugula, Mozzarella, Tomato Sauce): Simple elegance, fresh and light

B Tier — Good
The Four Cheese (Ricotta, Gorgonzola, Romano, Mozzarella, Tomato Sauce): For dedicated cheese enthusiasts only
The Vegetables (Mushrooms, Tomatoes, Red Peppers, Green Peppers, Red Onions, Zucchini, Spinach, Garlic, Mozzarella, Tomato Sauce): The vegetable abundance can overwhelm
The Spinach Pesto (Spinach, Artichokes, Tomatoes, Sun-dried Tomatoes, Garlic, Pesto Sauce, Mozzarella): Pesto base divides opinion
The Greek (Kalamata Olives, Feta, Tomatoes, Red Onions, Red Peppers, Garlic, Mozzarella, Tomato Sauce): Mediterranean flavors work better as salad

C Tier — Passable
The Brie Carre (Brie, Prosciutto, Caramelized Onions, Pears, Thyme, Garlic, Mozzarella): Ambitious but confused identity
The Chicken Pesto (Chicken, Tomatoes, Red Peppers, Spinach, Garlic, Pesto Sauce, Mozzarella): Pesto and chicken compete rather than complement
The Soppressata (Soppressata, Fontina, Mozzarella, Garlic, Tomato Sauce): Underwhelming given premium ingredients

Rankings reflect one person's taste. YMMV.

The S-tier pizzas represent the core of what a pizzeria should do well. The Pepperoni is the foundation, the pizza against which all others are implicitly compared. When someone says “let’s get pizza”, this is what most people imagine. The Big Meat takes maximalism seriously but avoids the trap of incoherence: bacon, pepperoni, sausage, and chorizo sound excessive, but they harmonize around a common meatiness (it really works!). The Classic Deluxe achieves balance, incorporating vegetables (mushrooms, peppers, onions) alongside meat without letting any single ingredient dominate.

The A-tier pizzas represent successful experiments. The Hawaiian remains perpetually controversial, inspiring passionate defenses and equally passionate condemnations. But I’m a fan. The combination of salty ham and sweet pineapple against tangy tomato sauce works for a lot of people, and the dataset shows it selling consistently throughout the year. The Italian Capocollo and Calabrese elevate the genre through premium cured meats, offering pizzas that could credibly appear on a trattoria menu. The Prosciutto demonstrates that restraint can be a virtue: just ham, arugula, and cheese, but executed with quality ingredients.

The B-tier pizzas are solid but not essential. The Four Cheese appeals to dedicated turophiles but can feel monotonous without the textural variety that vegetables or meats provide. The vegetable-heavy options (Vegetables, Greek, Spinach Pesto) often release too much moisture during cooking, resulting in soggy centers that undermine the crust. Don't get me wrong: they are fine pizzas, just rarely anyone's first choice.

The C-tier pizzas represent experiments that did not quite succeed. The Brie Carre attempts a French-inspired combination of brie, pears, and prosciutto that sounds sophisticated but tastes confused. The ingredients come from different culinary traditions and never fully integrate. These pizzas do seem to sell in this particular dataset, but they probably rarely inspire repeat orders or enthusiastic recommendations.

C.1.3 What makes a good pizza

The tier list above reflects accumulated pizza wisdom, but the principles underlying it deserve articulation. A good pizza balances several competing demands. The crust must be sturdy enough to support toppings without becoming soggy, yet thin and pliable enough to fold. The sauce should be present but not overwhelming, providing acidity and moisture without drowning the other flavors. The cheese needs to melt smoothly and brown slightly without becoming rubbery or releasing pools of grease. The toppings must be distributed evenly and cut to appropriate sizes so that each bite contains a representative sample. And freshness is non-negotiable. A pizza that has been languishing in a display case for hours, slowly drying out under a heat lamp, is a shadow of its former self. The crust toughens, the cheese congeals, and the toppings develop that dispiriting sheen of oxidation. At the extreme end of neglect, one reviewer actually found mold growing on the underside of a display pizza, which is the sort of discovery that makes you reconsider every grab-and-go slice you have ever eaten.

Beyond these structural requirements, good pizza demonstrates restraint. The impulse to add more toppings is understandable but usually misguided. Each additional ingredient dilutes the impact of everything else. The best pizzas feature three to five components (beyond sauce and cheese) that complement one another. Pepperoni alone is better than pepperoni with six other meats. Mushrooms and peppers work together because their textures contrast and their flavors do not compete.

Temperature matters enormously. A pizza that has been sitting for twenty minutes bears little resemblance to one fresh from the oven. The cheese firms up, the crust softens, the toppings cool into sad disconnection. This is why pizzerias survive on dine-in and delivery rather than takeout that sits in a car. And do not even get me started on microwave pizza. The microwave does something deeply wrong to pizza crust, transforming it into a chewy, rubbery substance that no longer qualifies as bread by any reasonable definition. The cheese melts unevenly, the sauce superheats into lava pockets, and the whole experience leaves you wondering why you bothered. Cold leftover pizza eaten standing at the refrigerator at midnight is honestly preferable (I’ve actually grown to like it more and more these days). The pizzaplace dataset implicitly captures this reality: the simulation models customers who order and receive pizzas within a reasonable timeframe, not pizzas boxed and forgotten (and certainly not pizzas reheated in a microwave).

Finally, good pizza requires good ingredients. This seems obvious but explains why some inexpensive pizzas disappoint while others satisfy. The quality of the mozzarella matters. The tomatoes in the sauce matter. Even the olive oil brushed on the crust matters. Plateau Pizza, the real establishment that inspired the dataset, succeeds partly because it sources decent ingredients and does not cut corners that customers would notice. The fictional pizzaplace inherits this philosophy.

And yet, after all this talk of balance and restraint, it is worth acknowledging that pizza can abandon every one of these conventions and still be legitimate. A pizza can have no sauce. It can have no cheese. It can be nothing more than dough, olive oil, and anchovies. This sounds shocking to anyone raised on North American delivery pizza, but it is a perfectly valid expression of pizzadom with deep roots in Italian tradition. Pizza bianca, pizza marinara, and the various focaccia-adjacent flatbreads of southern Italy predate the mozzarella-laden versions by centuries. The rules above describe one tradition. Pizza is generous enough to accommodate others.

C.1.4 Why pizza data matters

Pizza occupies a unique position in the landscape of consumer goods. It is simultaneously simple (bread, sauce, cheese, toppings) and infinitely variable. A pizzeria’s menu encodes assumptions about its customers: their adventurousness, their price sensitivity, their dietary restrictions. Sales patterns encode behavior: when people eat, what they celebrate, how weather affects appetite.

Part of my motivation for creating this dataset was to draw attention to pizza analytics as a legitimate field of inquiry. The phrase sounds ridiculous, and that is partly the point. If a dataset can make someone smile while also teaching them time series decomposition and category performance analysis, it has done more work than a dataset that only accomplishes the second thing. Nobody has ever felt intimidated by pizza data. Nobody has ever opened a CSV of pizza orders and thought “I am not qualified to analyze this”. The approachability is a feature (not a frivolity!).

The pizzaplace dataset serves as a sandbox for the kinds of analysis that businesses perform constantly. Revenue breakdowns, time series decomposition, category performance, seasonal adjustment. These techniques apply far beyond pizza. Anyone learning to analyze transactional data will find the patterns in pizzaplace transferable to retail, hospitality, and service industries generally. The dataset is large enough to be realistic (nearly 50,000 transactions) but small enough to process quickly on any modern computer.
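As a minimal sketch of the first of those questions (revenue by category), using dplyr for the aggregation; the type and price columns are part of pizzaplace, but the summary names here are my own:

```r
library(gt)     # provides the pizzaplace dataset
library(dplyr)

# Revenue and order volume for each pizza category
rev_by_type <-
  pizzaplace |>
  group_by(type) |>
  summarize(
    pizzas_sold = n(),
    revenue = sum(price)
  ) |>
  arrange(desc(revenue))

rev_by_type
```

The same group-summarize-arrange pattern extends to size, month, or hour of day by swapping the grouping column.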

For gt specifically, pizzaplace demonstrates grouped data, aggregation, currency formatting, and the presentation of time-based information. A year of pizza sales can become a monthly summary table, a daily heatmap, a ranked list of bestsellers, or a comparison across categories. The richness of the underlying data supports dozens of different table designs.

C.2 exibble

The name exibble is a portmanteau of “example tibble” and it serves exactly that purpose. This tiny dataset of eight rows exists to demonstrate gt’s formatting capabilities without the distraction of meaningful content. Each column represents a different data type: numeric values, character strings, currency amounts, dates, times, datetimes, and logical values. Missing values appear in strategic locations to demonstrate sub_missing() and related substitution functions.

The exibble dataset
8 rows and 9 columns.
num char fctr date time datetime currency row group
1.111e-01 apricot one 2015-01-15 13:35 2018-01-01 02:22 49.950 row_1 grp_a
2.222e+00 banana two 2015-02-15 14:40 2018-02-02 14:33 17.950 row_2 grp_a
3.333e+01 coconut three 2015-03-15 15:45 2018-03-03 03:44 1.390 row_3 grp_a
4.444e+02 durian four 2015-04-15 16:50 2018-04-04 15:55 65100.000 row_4 grp_a
5.550e+03 NA five 2015-05-15 17:55 2018-05-05 04:00 1325.810 row_5 grp_b
NA fig six 2015-06-15 NA 2018-06-06 16:11 13.255 row_6 grp_b
7.770e+05 grapefruit seven NA 19:10 2018-07-07 05:22 NA row_7 grp_b
8.880e+06 honeydew eight 2015-08-15 20:20 NA 0.440 row_8 grp_b

The column names are deliberately generic (num, char, currency, date, time, datetime) because the content does not matter. What matters is having every common data type available in a single compact dataset. When documenting a date formatter, you need a date column. When showing number formatting options, you need numbers. When explaining how to handle NA values, you need NA values in predictable locations.
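A short sketch of exibble in that role, pairing each column with a type-appropriate formatter; the specific choices here (decimal places, date style, missing-value text) are mine, not canonical:

```r
library(gt)     # provides the exibble dataset

# Apply a formatter suited to each column's data type
tbl <-
  exibble |>
  gt() |>
  fmt_number(columns = num, decimals = 2) |>
  fmt_currency(columns = currency, currency = "USD") |>
  fmt_date(columns = date, date_style = "m_day_year") |>
  sub_missing(missing_text = "N/A")

tbl
```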

C.2.1 Anatomy of a reference dataset

Each row and column in exibble was chosen to exercise different aspects of table formatting:

exibble Column Structure
Column  R Type  Role in Examples
num numeric Demonstrates numeric formatting across scales
char character Provides recognizable text labels (fruits)
currency numeric Tests currency with decimals, zeros, NAs
date Date Shows date formatting patterns
time character Character-encoded times for parsing
datetime POSIXct Full datetime objects for formatting
row character Stub/rowname labels for tables
group character Group categories for row grouping

The fruit names in the char column (apricot, banana, coconut, and so forth) follow alphabetical order, which makes them easy to verify when demonstrating sorting or filtering operations. The numeric values span several orders of magnitude, from fractions to millions, ensuring that formatters must handle both small precise values and large rounded ones. The currency column includes a missing value and one very small amount, testing edge cases that might trip up naive formatting approaches.

The row and group columns transform exibble from a formatting showcase into a structural one. With row serving as a stub and group organizing rows into categories, the same eight-row dataset can demonstrate virtually every gt feature. Headers, stubs, row groups, column formatting, substitution, styling… all can be shown using just exibble.

exibble with Row Groups and Stub
Demonstrating structural features.
num char fctr date time datetime currency
grp_a
row_1 0.11 apricot one 1/15/2015 13:35 1/1/2018 02:22 $49.95
row_2 2.22 banana two 2/15/2015 14:40 2/2/2018 14:33 $17.95
row_3 33.33 coconut three 3/15/2015 15:45 3/3/2018 03:44 $1.39
row_4 444.40 durian four 4/15/2015 16:50 4/4/2018 15:55 $65,100.00
grp_b
row_5 5,550.00 five 5/15/2015 17:55 5/5/2018 04:00 $1,325.81
row_6 fig six 6/15/2015 6/6/2018 16:11 $13.26
row_7 777,000.00 grapefruit seven 19:10 7/7/2018 05:22
row_8 8,880,000.00 honeydew eight 8/15/2015 20:20 $0.44
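A table along these lines can be sketched with the two structural arguments to gt(); the formatting calls are approximate, not an exact recipe for the rendering above:

```r
library(gt)     # provides the exibble dataset

# row becomes the stub labels; group becomes the row groups
tbl <-
  exibble |>
  gt(rowname_col = "row", groupname_col = "group") |>
  fmt_number(columns = num, decimals = 2) |>
  fmt_currency(columns = currency, currency = "USD")

tbl
```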

Datasets like exibble rarely get much attention, but they are essential infrastructure for documentation examples. Every example in gt's documentation that needs a quick formatting demonstration reaches for exibble rather than constructing throwaway data inline. This consistency helps readers recognize the dataset and focus on what is being demonstrated rather than puzzling over unfamiliar data structures.

C.3 gtcars

The gtcars dataset contains specifications for 47 luxury and performance automobiles, with an emphasis on grand touring vehicles. The name works on two levels: these are GT (grand tourer) cars, and the dataset lives in a package called gt. The wordplay is intentional but understated. I try not to make a big deal about it.

German and Italian Grand Tourers
Precision engineering meets Mediterranean passion.

Model            HP    Torque   MPG City   MPG Hwy   MSRP

Audi
  R8             430   317      11         20        $115,900
  RS 7           560   516      15         25        $108,900
  S6             450   406      18         27        $70,900
  S7             450   406      17         27        $82,900
  S8             520   481      15         25        $114,900
BMW
  6-Series       315   330      20         30        $77,300
  M4             425   406      17         24        $65,700
  M5             560   500      15         22        $94,100
  M6             560   500      15         22        $113,400
  i8             357   420      28         29        $140,700
Mercedes-Benz
  AMG GT         503   479      16         22        $129,900
  SL-Class       329   354      20         27        $85,050
Porsche
  718 Boxster    300   280      21         28        $56,000
  718 Cayman     300   280      20         29        $53,900
  911            350   287      20         28        $84,300
  Panamera       310   295      18         28        $78,100
Ferrari
  458 Italia     562   398      13         17        $233,509
  458 Speciale   597   398      13         17        $291,744
  458 Spider     562   398      13         17        $263,553
  488 GTB        661   561      15         22        $245,400
  California     553   557      16         23        $198,973
  F12Berlinetta  731   509      11         16        $319,995
  FF             652   504      11         16        $295,000
  GTC4Lusso      680   514      12         17        $298,000
  LaFerrari      949   664      12         16        $1,416,362
Lamborghini
  Aventador      700   507      11         18        $397,500
  Gallardo       550   398      12         20        $191,900
  Huracan        610   413      16         20        $237,250
Maserati
  Ghibli         345   369      17         24        $70,600
  Granturismo    454   384      13         21        $132,825
  Quattroporte   404   406      16         23        $99,900

The dataset was assembled from Motor Trend articles about grand touring vehicles, with additional research filling in gaps for fuel economy and torque figures. Most vehicles date from around 2015, reflecting when the source articles were published. The selection criteria emphasized true grand tourers: vehicles designed for high-speed, long-distance driving in comfort, typically with powerful engines, refined interiors, and substantial price tags.

C.3.1 What makes a grand tourer?

The grand touring concept originated in 1950s Europe, when wealthy motorists began taking extended driving holidays across the continent. A proper GT needed range (400+ kilometers between fuel stops), performance (for the autobahns and mountain passes), and comfort (for hours behind the wheel). The Ferrari 250 GT established the template: front-mounted V12, leather interior, elegant coachwork by Pininfarina or Scaglietti. The grand tourer was always as much about aspiration as transportation.

Most Expensive Cars in gtcars 💰
Top 10 by manufacturer's suggested retail price.
Manufacturer Model MSRP
Ferrari LaFerrari $1,416,362
Ford GT $447,000
Lamborghini Aventador $397,500
Rolls-Royce Dawn $335,000
Ferrari F12Berlinetta $319,995
Rolls-Royce Wraith $304,350
Ferrari GTC4Lusso $298,000
Ferrari FF $295,000
Ferrari 458 Speciale $291,744
Aston Martin Vanquish $287,250

The top ten most expensive cars in the dataset tell a clear story about where the money goes. Ferrari dominates the list with five entries, led by the LaFerrari at over $1.4 million (more than three times the price of any other car in the dataset). The Ford GT makes a surprising appearance at number two, representing America's answer to European exotica. A Lamborghini, two Rolls-Royces, and an Aston Martin round out the list, each occupying a different niche of the ultra-luxury market. Notably absent from the top ten are the German manufacturers, whose cars offer serious performance at comparatively accessible price points.

Performance Tiers
Grouping the 47 cars in gtcars by horsepower output.

Tier                  Models   Avg Torque (lb-ft)   Avg Price
Modest (<400 HP)      10       319                  $78,545
Strong (400-499 HP)   8        374                  $95,416
High (500-599 HP)     17       469                  $188,800
Extreme (600+ HP)     12       548                  $363,024

The relationship between horsepower and price is neither linear nor deterministic. Some modest-horsepower vehicles (certain Porsches, for instance) command premium prices through brand cachet and driving dynamics. Some extremely powerful vehicles achieve their output through brute displacement rather than exotic engineering, keeping prices relatively accessible. The correlation exists, but the exceptions tell interesting stories.
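The tiering itself is a one-step binning exercise; a sketch using base R's cut(), with break points taken from the tier labels above (the abbreviated tier names are mine):

```r
library(gt)     # provides the gtcars dataset
library(dplyr)

# Bin the 47 cars into the four horsepower tiers
tiers <-
  gtcars |>
  mutate(
    tier = cut(
      hp,
      breaks = c(0, 400, 500, 600, Inf),
      labels = c("Modest", "Strong", "High", "Extreme"),
      right = FALSE    # e.g., 400 HP falls in the 400-499 tier
    )
  ) |>
  group_by(tier) |>
  summarize(
    models = n(),
    avg_torque = mean(trq, na.rm = TRUE),
    avg_price = mean(msrp, na.rm = TRUE)
  )

tiers
```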

The choice to create gtcars was motivated by a desire for a modern equivalent to the venerable mtcars dataset that has shipped with R for decades. The original mtcars contains 1974 Motor Trend data on 32 automobiles, and it has been used in countless examples and tutorials. But cars from 1974 feel increasingly remote from contemporary experience. A dataset of modern luxury vehicles offers familiar reference points (Ferrari, Porsche, Aston Martin) and specifications that relate to cars people actually see on roads today.

For table-making purposes, gtcars provides natural groupings by manufacturer, multiple numeric columns suitable for formatting and comparison, and a mix of discrete and continuous variables. The manufacturer and model columns enable row grouping and stub labeling. The price column practically demands currency formatting. The horsepower and torque columns work well for bar chart visualizations within cells. It is a dataset that seems designed for beautiful tables because, in fact, it was.
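A minimal sketch of those table-making affordances, grouping by manufacturer for one country of origin; the column selection and labels are illustrative choices, not the only sensible ones:

```r
library(gt)     # provides the gtcars dataset
library(dplyr)

# German cars only, grouped by manufacturer in the table
germans <-
  gtcars |>
  filter(ctry_origin == "Germany") |>
  select(mfr, model, hp, trq, mpg_c, mpg_h, msrp)

tbl <-
  germans |>
  gt(rowname_col = "model", groupname_col = "mfr") |>
  fmt_currency(columns = msrp, decimals = 0) |>
  cols_label(
    hp = "HP", trq = "Torque",
    mpg_c = "City", mpg_h = "Hwy", msrp = "MSRP"
  )

tbl
```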

C.4 countrypops

The countrypops dataset tracks population estimates for countries worldwide from 1960 through the present (currently extending to 2024). The data comes from the World Bank, which compiles demographic estimates from national statistical offices, census data, and the United Nations Population Division. With over 13,000 rows covering more than 200 countries across six decades, it is one of the larger datasets in the gt collection.

Population Growth in Five Major Nations
1960 1980 2000 2020
Brazil 72.39M 121.21M 174.02M 208.66M
China 667.07M 981.23M 1.26B 1.41B
India 435.99M 687.35M 1.06B 1.40B
Nigeria 45.05M 73.76M 126.38M 214.00M
United States 180.67M 227.22M 282.16M 331.58M

Population data might seem straightforward, but it encodes profound stories of human migration, economic development, public health, and political change. China’s population trajectory shows the demographic impact of the one-child policy. Nigeria’s explosive growth reflects patterns common across sub-Saharan Africa. European countries exhibit the stagnation and aging that accompany developed economies. Each row is a snapshot of millions of individual lives aggregated into a single number.

C.4.1 Understanding population data

Population counts are harder to obtain than one might assume. Only a handful of countries conduct reliable censuses at regular intervals. Many estimates rely on birth and death registrations (which vary in completeness), surveys of representative samples (which involve statistical uncertainty), or projections from previous counts (which compound errors over time). The World Bank’s task is to synthesize these imperfect sources into consistent estimates that allow comparison across countries and years.

World's Most Populous Countries
2023 estimated population.
Population (2023)
India 1.44B
China 1.41B
United States 337M
Indonesia 281M
Pakistan 248M
Nigeria 228M
Brazil 211M
Bangladesh 171M
Russia 144M
Mexico 130M
Ethiopia 129M
Japan 125M
Philippines 115M
Egypt 115M
Congo (DRC) 106M

The uncertainties in population data matter for policy and planning. A country that believes it has 100 million people will allocate resources differently than one that believes it has 120 million. Census undercounts (common in remote areas, among marginalized populations, and in places where people distrust government) lead to underinvestment in precisely the communities that need services most. The countrypops figures represent best estimates, not ground truth, and users should remember this limitation.

That said, the trends in population data are generally reliable even when the absolute numbers carry uncertainty. If the World Bank estimates that Nigeria’s population doubled between 1990 and 2020, the actual growth was almost certainly substantial even if the precise figures might be revised. Trends matter more than point estimates for most analytical purposes, and the countrypops dataset captures these trends across the entire modern era of demographic record-keeping.

Population Change in Aging Societies
Index: 1990 = 100
1990 2000 2010 2020
Germany 100.0 103.5 103.0 104.7
Spain 100.0 104.4 119.8 121.8
Italy 100.0 100.4 105.5 104.8
Japan 100.0 102.7 103.7 102.3
South Korea 100.0 109.7 115.6 120.9
Values show population relative to 1990 baseline

The table above shows population indexed to 1990 for five countries facing demographic aging. Japan’s population has declined in absolute terms. Germany and Italy have barely grown. South Korea’s growth is slowing rapidly. These patterns reflect low birth rates, increased longevity, and (in some cases) restrictive immigration policies. The economic and social implications of aging populations (pension systems, healthcare costs, labor force composition) represent some of the most significant policy challenges of the coming decades.

The dataset updates whenever the World Bank publishes new estimates, rather than on any fixed release schedule. This ongoing maintenance means that examples in documentation and books remain current. A population figure for China in 2024 becomes available, and shortly thereafter it appears in countrypops. This currency makes the dataset more useful for teaching than static historical data would be.

For gt demonstrations, countrypops excels at time series comparisons, geographic groupings, and the handling of large numbers. The population values range from thousands (small island nations) to billions (China and India), exercising formatters across their full dynamic range. The longitudinal structure supports year-over-year comparisons, growth rate calculations, and the kind of decade-by-decade summary tables that appear in demographic reports.
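As a sketch of that dynamic range in action, fmt_number() with suffixing = TRUE condenses raw counts into the "208.66M" and "1.41B" style used in the tables above; the year and country selection here is arbitrary:

```r
library(gt)     # provides the countrypops dataset
library(dplyr)

# Five countries in 2020, formatted with magnitude suffixes
pops <-
  countrypops |>
  filter(
    year == 2020,
    country_name %in% c("Brazil", "China", "India",
                        "Nigeria", "United States")
  ) |>
  select(country_name, population)

tbl <-
  pops |>
  gt(rowname_col = "country_name") |>
  fmt_number(columns = population, suffixing = TRUE, decimals = 2)

tbl
```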

C.5 towny

While countrypops takes a global view, towny focuses on a single Canadian province: Ontario. The dataset contains population figures for 414 municipalities, including data from every Canadian census between 1996 and 2021 (conducted every five years) plus various geographic and administrative attributes. It exists because I actually live in Ontario and wanted an excuse to know more about the places surrounding me.

Ontario's Largest Municipalities
Population and density for the top 10, 2001 vs. 2021.

Municipality   Pop. 2001   Density 2001   Pop. 2021   Density 2021
Toronto        2,481,494   3,932.0        2,794,356   4,427.8
Ottawa         774,072     277.6          1,017,449   364.9
Mississauga    612,925     2,093.8        717,961     2,452.6
Brampton       325,428     1,223.9        656,480     2,469.0
Hamilton       490,268     438.4          569,353     509.1
London         336,359     799.9          422,324     1,004.3
Markham        208,615     989.0          338,503     1,604.8
Vaughan        182,022     668.1          323,103     1,186.0
Kitchener      190,399     1,391.7        256,885     1,877.7
Windsor        208,402     1,427.2        229,660     1,572.8
Density is measured in persons per km².

The data comes from Statistics Canada and reveals patterns that might surprise those unfamiliar with Canadian geography. Toronto dominates, of course, but the surrounding municipalities (Mississauga, Brampton, Hamilton) have grown substantially over twenty years. Some smaller towns have declined as economic opportunities concentrated elsewhere. The dataset captures this quiet drama of population redistribution that plays out across every country’s regions.

Ontario municipality names offer their own entertainment. Some are indigenous place names with beautiful sounds. Others commemorate British royalty or colonial administrators. A few seem almost whimsical when encountered for the first time. These names appear on highway signs and maps, marking places where real communities exist with their own histories and concerns. The towny dataset transforms those signs into data, inviting exploration of what lies behind the familiar names.

Fastest Growing Ontario Municipalities
Among places with 10,000+ residents in 2001.

Municipality                Pop. 2001   Pop. 2021   Growth
Milton                         31,471     132,979   322.5%
Whitchurch-Stouffville         22,859      49,864   118.1%
Brampton                      325,428     656,480   101.7%
Wasaga Beach                   12,419      24,862   100.2%
Bradford West Gwillimbury      22,228      42,880    92.9%
Vaughan                       182,022     323,103    77.5%
Ajax                           73,753     126,666    71.7%
East Gwillimbury               20,555      34,637    68.5%
New Tecumseth                  26,141      43,948    68.1%
Markham                       208,615     338,503    62.3%

The fastest-growing municipalities cluster around the Greater Toronto Area, where housing demand has driven expansion into formerly rural townships. Milton, Brampton, and Markham have transformed from small towns into substantial cities within a generation. The infrastructure challenges of this growth (roads, schools, healthcare, transit) consume enormous resources and dominate local politics. The towny data captures the before and after of this transformation but cannot convey the lived experience of watching farmland become subdivisions.

Not every municipality grew. Some communities in northern and eastern Ontario lost population as young people left for opportunities elsewhere. Factory closures, mine exhaustions, and the general drift of economic activity toward metropolitan areas hollowed out places that had thrived in earlier decades. The dataset does not distinguish between population loss from out-migration and loss from natural decrease (more deaths than births), but both dynamics contribute to the patterns visible in the numbers.

For table-making, towny provides opportunities for population density calculations, before-after comparisons, and growth rate analysis across its six census years. The land area column enables density visualization. The municipality names work naturally as row labels in grouped tables organized by population tier or geographic region.
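Both headline calculations, density and growth, are one-liners. A minimal Python sketch using Milton's census figures from the tables above (the function names are illustrative, not towny column names):

```python
# Population density and percentage growth: the two derived columns
# that towny-based tables most often need.

def density(population: int, land_area_km2: float) -> float:
    """Persons per square kilometre."""
    return population / land_area_km2

def growth_pct(pop_start: int, pop_end: int) -> float:
    """Percentage change between two census years."""
    return (pop_end - pop_start) / pop_start * 100

# Milton, 2001 -> 2021 (figures from the table above):
print(f"{growth_pct(31_471, 132_979):.1f}%")  # 322.5%
```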

C.6 peeps

The peeps dataset contains fictional personal information for 100 imaginary people: names, addresses, phone numbers, email addresses, and nationalities. These fake individuals were generated using an online tool that produces realistic-seeming demographic data, then verified for plausible formatting of addresses and contact information across different countries.

A Random Selection of Peeps

First Name   Last Name   Email                       Country
Krzysztof    Kowalczyk   krzysztof_k@example.com     Poland
Gaweł        Zając       gawelzajac@example.com      Poland
Eva          Simpson     eva_simpson@example.com     Canada
Rolla        Skov        rollaskov@example.com       Denmark
Oliver       Mikkelsen   oli_mikkelsen@example.com   Denmark
Letizia      Moretti     l_moretti@example.com       United Kingdom

The international scope was intentional. peeps was created specifically to demonstrate formatters like fmt_email(), fmt_country(), and fmt_flag(). Having people from various countries ensures that flag icons and country name formatting can be shown in realistic contexts. A realistic address book or contact directory contains international entries, and peeps provides exactly that.

C.6.1 The problem of synthetic data

Generating realistic fake data is harder than it sounds. Names must fit cultural expectations (a person from Japan should have a Japanese name). Addresses must follow country-specific formats (postal codes before or after city names, province abbreviations versus full names). Phone numbers must have correct country codes and plausible internal structure. Email addresses must look like real email addresses while clearly being fictional.

The country distribution in peeps emphasizes variety over statistical representativeness. Having multiple people from smaller countries ensures that formatting edge cases get tested. A dataset with 90 Americans and 10 others would not exercise international formatting as thoroughly as one with broader distribution.

The email domains follow patterns typical of real email usage: major providers dominate, with country-specific services appearing for non-English-speaking regions. This realism helps ensure that fmt_email() handles the variety of domain lengths and TLD formats that appear in actual contact databases.

Every person in the dataset is entirely fictional. The addresses do not correspond to real residences. The phone numbers should not connect to anyone. But the formatting follows authentic patterns for each country represented. A French address looks like a French address. A Japanese name follows Japanese naming conventions. This verisimilitude matters because formatting functions must handle real-world variation, and peeps provides test cases for that variation without compromising anyone’s actual privacy.

C.7 sza (solar zenith angles)

The sza dataset originates from atmospheric chemistry research, specifically from data tables published in textbooks by Finlayson-Pitts and Pitts. It records solar zenith angles (the angle between the sun and the vertical) across different latitudes and months. The original data came from a US government source that may no longer be online, but the values remain scientifically accurate and useful.

Solar Zenith Angles by Latitude and Month
Mid-latitude (30°)

        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
0530                                88.9   85.3   84.7   87.2
0600                         87.2   82.7   79.2   78.7   81.0   85.9
0630                  87.5   81.4   76.3   73.1   72.5   74.7   79.4   85.1
0700    89.4   86.2   81.1   74.9   69.9   66.8   66.3   68.3   72.9   78.7   84.6   88.7
0730    83.7   80.3   74.9   68.5   63.4   60.4   60.0   61.9   66.4   72.4   78.4   82.8
0800    78.3   74.6   68.8   62.1   56.9   54.0   53.5   55.4   60.0   66.2   72.8   77.3
0830    73.2   69.1   62.9   55.8   50.4   47.5   47.1   48.9   53.6   60.1   67.1   72.2
0900    68.4   64.1   57.4   49.6   44.0   41.0   40.6   42.4   47.3   54.4   62.0   67.3
0930    64.1   59.4   52.2   43.8   37.6   34.5   34.1   36.0   41.2   48.8   57.0   63.0
1000    60.4   55.3   47.5   38.2   31.4   28.1   27.7   29.7   35.5   43.8   53.0   59.2
1030    57.3   51.9   43.5   33.3   25.6   21.7   21.2   23.6   30.2   39.5   49.3   56.0
1100    55.0   49.3   40.3   29.3   20.4   15.7   15.1   18.1   25.8   36.1   46.7   53.7
1130    53.5   47.7   38.4   26.5   16.5   10.5    9.6   13.7   22.8   33.9   44.9   52.2
1200    53.0   47.2   37.7   25.5   15.0    8.0    6.9   11.9   21.6   33.1   44.4   51.8

Blank cells indicate times before sunrise in those months.

Solar zenith angles matter because they determine how much solar radiation reaches Earth’s surface and at what angle. This affects everything from climate modeling to photovoltaic panel efficiency to the rate of photochemical reactions in the atmosphere. At high latitudes in winter, the sun barely rises above the horizon (large zenith angles). At the equator, the sun passes nearly overhead at noon year-round (small zenith angles). The interplay between latitude, season, and time of day creates the patterns visible in the sza data.
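All of the patterns in the table fall out of a single spherical-trigonometry identity: cos Z = sin φ sin δ + cos φ cos δ cos h, where φ is latitude, δ is solar declination, and h is the hour angle. A Python sketch of the textbook approximation (ignoring atmospheric refraction):

```python
import math

def solar_zenith_deg(lat_deg: float, decl_deg: float,
                     hour_angle_deg: float) -> float:
    """Solar zenith angle Z from latitude, solar declination, and hour
    angle (0 at local solar noon, 15 degrees per hour away from it).
    Ignores atmospheric refraction."""
    lat, decl, h = map(math.radians, (lat_deg, decl_deg, hour_angle_deg))
    cos_z = (math.sin(lat) * math.sin(decl)
             + math.cos(lat) * math.cos(decl) * math.cos(h))
    return math.degrees(math.acos(cos_z))

# 30° N at solar noon near the June solstice (declination ~ +23.4°):
print(round(solar_zenith_deg(30, 23.4, 0), 1))  # 6.6
```

At solar noon the formula collapses to Z = |φ − δ|, which is why the midsummer noon values at 30° latitude sit in the single digits.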

C.7.1 Why zenith angles matter

When the sun sits directly overhead, sunlight travels through the minimum possible amount of atmosphere before reaching the surface. As the sun moves toward the horizon, light must traverse increasingly long atmospheric paths. This path length, often expressed as “air mass,” affects both the intensity and spectral composition of sunlight reaching the ground. Ultraviolet radiation attenuates more strongly through longer path lengths, which is why sunburns are most severe around solar noon in summer at lower latitudes.

Solar Zenith Angles Throughout the Day
By latitude and season.

        Jan   Apr   Jul   Oct
20°N
06:00         88°   82°
09:00   62°   46°   42°   50°
12:00   43°   16°         23°
40°N
06:00         87°   75°
09:00   76°   54°   41°   60°
12:00   63°   36°   17°   43°

The seasonal pattern emerges clearly at mid and high latitudes. January brings high zenith angles in the Northern Hemisphere as the sun traces its winter arc low in the southern sky. July reverses the pattern, with the sun climbing high overhead at noon. At the equator, seasonal variation is minimal: the sun is always nearly overhead at midday. These patterns have shaped human civilization, determining growing seasons, driving migration patterns, and inspiring astronomical observations throughout history.

Atmospheric chemists care about zenith angles because photochemical reactions require light. Photolysis rates vary with the intensity and spectrum of incoming solar radiation. A nitrogen dioxide molecule that might photolyze within seconds at tropical noon may persist for hours at polar twilight. Models simulating urban smog formation or stratospheric ozone depletion must account for these variations, typically by looking up appropriate photolysis rates from tables indexed by zenith angle and altitude.

The dataset reflects my background in atmospheric chemistry. Such data tables would be directly imported into atmospheric box models for simulating photolysis-based reactions of volatile organic compounds (VOCs). Having the data readily available in R format eliminates the tedious work of transcribing values from printed tables or scraping data from web pages.

For gt, the sza dataset demonstrates heatmap-style coloring (values naturally vary from low to high in meaningful patterns), missing value handling (the sun does not rise at certain latitude-time combinations), and the presentation of scientific lookup tables. The structure (rows indexed by time, columns by month, grouped by latitude) maps naturally onto the table designs that scientists create when presenting such reference data.

C.8 constants, reactions, photolysis, and nuclides

These four datasets form a cluster of scientific reference data, each addressing a gap in publicly available resources. While the underlying information exists in various forms (government pages, textbooks, specialized databases), it had not been consolidated into convenient data frames. The gt package provided an opportunity to change that.

The constants dataset contains 30 fundamental physical constants with their values, units, and uncertainties. These are the numbers that appear in physics and chemistry textbooks: the speed of light, Planck’s constant, Avogadro’s number, the gravitational constant. Each value comes with associated metadata specifying units and measurement precision.

Numbers That Define the Universe
Ten fundamental physical constants.

Constant                            Value          Uncertainty    Units
Avogadro constant                   6.02 × 10²³                   mol⁻¹
Bohr radius                         5.29 × 10⁻¹¹   8.00 × 10⁻²¹   m
Boltzmann constant                  1.38 × 10⁻²³                  J K⁻¹
electron mass                       9.11 × 10⁻³¹   2.80 × 10⁻⁴⁰   kg
elementary charge                   1.60 × 10⁻¹⁹                  C
fine-structure constant             7.30 × 10⁻³    1.10 × 10⁻¹²
Newtonian constant of gravitation   6.67 × 10⁻¹¹   1.50 × 10⁻¹⁵   m³ kg⁻¹ s⁻²
Planck constant                     6.63 × 10⁻³⁴                  J Hz⁻¹
proton mass                         1.67 × 10⁻²⁷   5.10 × 10⁻³⁷   kg
speed of light in vacuum            3.00 × 10⁸                    m s⁻¹

C.8.1 The certainty of physical constants

Physical constants occupy a peculiar position in science. They are measured quantities, subject to experimental uncertainty, yet they describe fundamental features of the universe that we believe to be genuinely constant. The speed of light in vacuum, for instance, is now defined as exactly 299,792,458 meters per second, but this definition only became possible after decades of increasingly precise measurements. Before 1983, we measured the speed of light; now the meter is defined in terms of light’s speed. The historical progression of uncertainty shrinking toward zero tells a story of experimental ingenuity.

Physical Constants by Measurement Precision

Precision                 Constants   Example
Extraordinarily precise   51          alpha particle-electron mass ratio
Extremely precise         133         alpha particle mass
Very precise              59          Angstrom star
Moderately precise        30          deuteron rms charge radius

The measurement precision varies dramatically across constants. The fine-structure constant (approximately 1/137) has been measured to extraordinary precision through quantum electrodynamics experiments. The gravitational constant, despite being one of the first constants ever measured (by Cavendish in 1798), remains relatively imprecise because gravity is so weak that experiments must contend with tiny forces and correspondingly large relative uncertainties.
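Precision categories like those above follow from the relative uncertainty, the uncertainty divided by the value. A sketch using two rows from the constants table (the category thresholds themselves are not shown here):

```python
# Relative uncertainty: the dimensionless ratio that makes constants
# of wildly different magnitudes comparable.

def relative_uncertainty(value: float, uncertainty: float) -> float:
    return uncertainty / value

# Fine-structure constant: known to about 1 part in 10^10
print(f"{relative_uncertainty(7.30e-3, 1.10e-12):.1e}")   # 1.5e-10
# Newtonian constant of gravitation: only about 2 parts in 10^5
print(f"{relative_uncertainty(6.67e-11, 1.50e-15):.1e}")  # 2.2e-05
```

Five orders of magnitude separate the two, which is exactly the gap between the "extraordinarily precise" and "moderately precise" ends of the table.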

C.8.2 Atmospheric chemistry

The reactions dataset catalogs 1,344 atmospheric chemical reactions with their rate constants and temperature dependencies. The photolysis dataset provides photolysis rates for organic compounds, including spectral data stored in list columns. These datasets rarely exist in such accessible form. Researchers typically extract reaction rates from individual journal articles or specialized databases with restrictive access. Having a curated collection in R format simplifies atmospheric modeling exercises.

Selected Atmospheric Reactions with OH

Compound                    Formula    OH Rate at 298 K
beta-caryophyllene          C₁₅H₂₄     2.00 × 10⁻¹⁰
3-hydroxy-2-butanone        C₄H₈O₂     9.70 × 10⁻¹²
O-methyl-N-ethylcarbamate   C₄H₉NO₂    1.05 × 10⁻¹¹
4-methyl-1,3-dioxane        C₅H₁₀O₂    1.13 × 10⁻¹¹
morpholine                  C₄H₉NO     1.10 × 10⁻¹⁰
n-propyl propanoate         C₆H₁₂O₂    4.00 × 10⁻¹²
anthracene                  C₁₄H₁₀     1.17 × 10⁻¹⁰
triethyl phosphate          C₆H₁₅O₄P   4.68 × 10⁻¹¹

Rate constants are in cm³ molecule⁻¹ s⁻¹.

Understanding atmospheric chemistry requires knowing how fast different reactions proceed under various conditions. The hydroxyl radical (OH) drives much of daytime atmospheric chemistry, attacking volatile organic compounds and beginning their oxidation chains. Nitrate radicals take over at night. Photolysis reactions require sunlight, their rates varying with solar zenith angle and wavelength. Each rate constant in the dataset represents experimental determinations, often from smog chamber studies or theoretical calculations validated against field measurements.
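A standard use of such rate constants is estimating a compound's atmospheric lifetime against OH attack: τ = 1 / (k · [OH]). A Python sketch; the OH concentration of 1.5 × 10⁶ molecules cm⁻³ is a commonly cited daytime global mean, assumed here purely for illustration:

```python
# Atmospheric lifetime against OH attack: tau = 1 / (k_OH * [OH]).

def oh_lifetime_s(k_oh: float, oh_conc: float = 1.5e6) -> float:
    """Lifetime in seconds. k_oh in cm^3 molecule^-1 s^-1,
    oh_conc in molecules cm^-3."""
    return 1.0 / (k_oh * oh_conc)

# beta-caryophyllene (k = 2.00e-10 from the table above):
tau = oh_lifetime_s(2.00e-10)
print(f"{tau:.0f} s, roughly {tau / 60:.0f} minutes")  # 3333 s, roughly 56 minutes
```

A lifetime under an hour explains why highly reactive biogenic compounds like beta-caryophyllene rarely travel far from their sources.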

C.8.3 Nuclear data

The nuclides dataset compiles nuclear data for isotopes: half-lives, decay modes, particle emissions. Like the chemical datasets, this information exists scattered across various sources but had not been unified into a single convenient data frame. Nuclear chemistry courses and research often require looking up isotope properties, and nuclides consolidates those lookups.

Radioactive Decay Modes

Decay Mode         Nuclides
Alpha              473
Beta plus          4
Beta minus         1224
Electron capture   137

Radioactive decay follows several pathways depending on the nuclear configuration. Beta-minus decay converts a neutron to a proton, moving the nucleus one step higher in atomic number. Alpha decay ejects a helium nucleus, reducing both atomic number and mass. Electron capture pulls an inner-shell electron into the nucleus. Each decay mode appears in the dataset, along with the half-life governing how quickly unstable nuclei transform.

For gt demonstrations, these scientific datasets showcase formatters like fmt_scientific(), fmt_chem(), and fmt_units(). They also demonstrate from_column() usage, where the number of decimal places might come from an adjacent precision column rather than being hard-coded. Scientific communication demands careful attention to significant figures and uncertainty, and these datasets provide realistic contexts for that attention.

C.9 sp500

The sp500 dataset contains daily stock market data for the S&P 500 index from 1950 through 2015: opening price, closing price, high, low, and trading volume for each trading day. With over 16,000 rows, it provides a substantial corpus of financial time series data.

Black Monday and the Week That Shook Wall Street
S&P 500 daily prices, October 14–23, 1987.

Day                    Open      Close     Volume
1987-10-14 Wednesday   $314.52   $305.23   207.40M
1987-10-15 Thursday    $305.21   $298.08   263.20M
1987-10-16 Friday      $298.08   $282.70   338.50M
The Weekend
1987-10-19 Monday      $282.70   $224.84   604.30M
1987-10-20 Tuesday     $225.06   $236.83   608.10M
1987-10-21 Wednesday   $236.83   $258.38   449.60M
1987-10-22 Thursday    $258.24   $248.25   392.20M
1987-10-23 Friday      $248.29   $248.22   245.60M

Financial data demands specific formatting conventions: currency symbols, appropriate decimal precision, volume abbreviations. The sp500 dataset exercises these requirements across decades of market history. Bull markets, bear markets, crashes, and recoveries all appear in the data. The 1987 Black Monday crash, the dot-com bubble, the 2008 financial crisis: each left its signature in closing prices and trading volumes.

C.9.1 Reading the market’s history

The S&P 500 tracks 500 large-cap American companies, weighted by market capitalization. It serves as the benchmark against which most US equity investments are measured. A fund that “beats the market” outperforms the S&P 500. An investor who wants “market returns” buys an index fund tracking the S&P 500. The index represents the collective judgment of millions of market participants about the value of corporate America.

S&P 500 Annual Summary
2006–2015

Year   Open     Close    Annual Return   Year High   Year Low
2006   $1,248   $1,418   +13.6%          $1,432      $1,219
2007   $1,418   $1,468   +3.5%           $1,576      $1,364
2008   $1,468   $903     −38.5%          $1,472      $741
2009   $903     $1,115   +23.5%          $1,130      $667
2010   $1,117   $1,258   +12.6%          $1,263      $1,011
2011   $1,258   $1,258   0.0%            $1,371      $1,075
2012   $1,259   $1,426   +13.3%          $1,475      $1,259
2013   $1,426   $1,848   +29.6%          $1,849      $1,426
2014   $1,846   $2,059   +11.5%          $2,094      $1,738
2015   $2,059   $2,044   −0.7%           $2,135      $1,867

The years 2006 through 2015 illustrate the market’s capacity for dramatic swings. 2008 stands out with its catastrophic decline during the financial crisis, when the index lost more than a third of its value. The subsequent years show gradual recovery, with the index reaching new highs by the early 2010s. Anyone who sold during the panic of 2008 locked in losses. Anyone who held through the crisis recovered and then some. The data tells both stories depending on which slices you examine.

The dataset originated from a web search that turned up historical market data, possibly compiled from Kaggle or similar sources. The exact provenance matters less than the utility: a long time series of financial data in a format ready for analysis and visualization. For teaching purposes, the sp500 dataset provides realistic data for demonstrating time series analysis, returns calculations, volatility measurement, and the kind of financial tables that appear in annual reports and investment presentations.
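The annual-return column in the summary table above is simply the ratio of closing to opening index level, minus one. A minimal Python sketch using two years from that table:

```python
# Simple (non-compounded) return between two index levels.

def annual_return(open_level: float, close_level: float) -> float:
    return close_level / open_level - 1

print(f"2008: {annual_return(1468, 903):+.1%}")   # -38.5%
print(f"2013: {annual_return(1426, 1848):+.1%}")  # +29.6%
```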

C.10 metro

The Paris Métro is one of the world’s great urban transit systems. Opened in 1900, it has grown to 16 lines serving 308 stations across the city and surrounding communes. The metro dataset captures this network: station names, locations, line assignments, opening dates, and ridership figures.

Busiest Paris Métro Stations

Station                            Lines             Annual Passengers
Gare du Nord                       4, 5              34.50M
Saint-Lazare                       3, 12, 13, 14     33.13M
Gare de Lyon                       1, 14             28.64M
Montparnasse—Bienvenüe             4, 6, 12, 13      20.41M
Gare de l'Est                      4, 5, 7           15.54M
Bibliothèque François Mitterrand   14                11.10M
République                         3, 5, 8, 9, 11    11.08M
Les Halles                         4                 10.62M
La Défense                         1                 9.26M
Châtelet                           1, 4, 7, 11, 14   8.35M

The dataset exists because I really admire the Paris Métro. Among the world’s subway systems, it stands out for its density, connectivity, and integration with other transit modes (RER commuter rail, buses, trams, and high-speed TGV connections). The wayfinding and signage are exemplary. The expansion plans are ambitious and consistently executed. It represents what urban transit can be when treated as essential infrastructure rather than an afterthought.

C.10.1 A brief history of the Métro

The story of the Paris Métro begins in the late nineteenth century, when Paris faced the same urban transportation crisis that afflicted every growing industrial city. Horse-drawn omnibuses clogged the boulevards. The wealthy rode in private carriages while workers walked miles to reach their jobs. London had opened its Underground in 1863, demonstrating that subterranean railways could move masses of people efficiently. Paris, perennially competitive with its cross-Channel rival, needed its own solution.

The first line opened on July 19, 1900, timed to coincide with the Exposition Universelle that drew millions of visitors to Paris that summer. Line 1 ran from Porte de Vincennes to Porte Maillot, connecting the eastern and western edges of the city through its commercial heart. The stations featured distinctive Art Nouveau entrances designed by Hector Guimard, with their sinuous cast-iron curves and amber glass panels that remain iconic more than a century later. Not all survived (many were removed during mid-century modernization campaigns and later regretted), but those that remain are protected monuments.

The network expanded rapidly in its early decades. By 1910, Paris had six lines. By 1920, ten. This breakneck pace was not entirely the product of unified planning. The Compagnie du chemin de fer métropolitain de Paris (CMP) held the primary concession, but it faced competition from the Nord-Sud Company, which built what would become Lines 12 and 13. The two companies raced to serve lucrative routes, and their rivalry accelerated construction beyond what a single monopoly might have achieved. The Nord-Sud stations were arguably more elegant, with ceramic tile work and distinctive lettering that enthusiasts still admire. When the companies merged in 1930, Paris inherited a network that had been built fast precisely because multiple actors were competing to build it.

This frenetic early growth distinguished Paris from its European peers. London’s Underground, though older, expanded more cautiously under a patchwork of private companies that often duplicated routes rather than extending coverage. The Berlin U-Bahn, which opened in 1902, grew steadily but faced the complication of serving multiple municipalities that would not unify until 1920. Paris benefited from centralized city planning within the relatively compact boundaries of the twenty arrondissements, allowing the CMP and Nord-Sud to build a coherent network even while competing. By 1930, Paris had more stations than London despite London’s forty-year head start.

The guiding philosophy was density: stations placed close together (often just 500 meters apart) so that no Parisian would have to walk more than a few minutes to reach the Métro. This density distinguishes Paris from systems like Washington DC or the Bay Area’s BART, where stations are spaced miles apart and require feeder buses or long walks. The Paris approach sacrifices speed between stations for convenience of access, a tradeoff that makes sense for a compact, dense city.

125 Years of Métro Expansion
How the network grew decade by decade.

Decade   Stations Opened   Notable Events
1900s    65                Line 1 opens for World's Fair
1910s    85                Rapid expansion across Paris
1920s    60                Network reaches most arrondissements
1930s    25                Great Depression slows construction
1940s    5                 World War II occupation
1950s    15                Post-war reconstruction begins
1960s    12                RER regional express network planned
1970s    8                 RER lines A and B open
1980s    18                Line 14 planning begins
1990s    10                Line 14 opens (first automated line)
2000s    8                 Line 14 extended
2010s    14                Line extensions to suburbs
2020s    10                Grand Paris Express under construction

The interwar period saw continued expansion but also financial difficulties. The 1930s depression slowed construction, and the network that had seemed destined for endless growth began to stabilize. World War II brought occupation and disruption. The Métro continued to operate (the Germans found it useful for moving troops and supplies), but expansion halted and maintenance suffered. Several stations were closed and converted to other uses, some serving as air raid shelters.

Post-war reconstruction proceeded slowly. The immediate decades after 1945 focused on repairing damage and updating aging infrastructure rather than building new lines. The real transformation came in the 1960s and 1970s with the creation of the RER (Réseau Express Régional), a network of express lines that tunneled through central Paris but extended far into the suburbs. The RER was not technically part of the Métro but integrated seamlessly with it, allowing commuters to transfer between the dense inner-city network and the faster regional lines.

C.10.2 The modern network

Today’s Métro comprises 16 lines totaling over 220 kilometers of track. The numbering seems haphazard (there are lines 1 through 14, plus 3bis and 7bis), reflecting historical accidents rather than logical planning. Lines 3bis and 7bis were originally branches of their parent lines that later gained operational independence. The system carries approximately 4 million passengers daily, making it one of the world’s busiest rapid transit networks.

Paris Métro Lines
Current network statistics.

Line   Length (km)   Stations   Automated
1      16.6          25         Yes
2      12.4          25
3      11.7          25
3bis   1.3           4
4      12.1          27         Yes
5      14.6          22
6      13.6          28
7      22.4          38
7bis   3.1           8
8      23.4          38
9      19.6          37
10     11.7          23
11     6.3           13
12     13.9          29
13     24.3          32
14     14.0          13         Yes

Line 14 deserves special attention as the system’s showcase. Opened in 1998, it was the first fully automated line on the network, operating without drivers. Platform screen doors prevent accidents (a significant concern on older lines) and allow trains to run with shorter headways. The stations feel modern and spacious compared to the cramped nineteenth-century tunnels of earlier lines. Line 14 demonstrated that new construction was possible and could achieve standards superior to the historical network. It has since been extended multiple times and serves as the template for future expansion.

The Grand Paris Express, currently under construction, represents the most ambitious expansion since the network’s founding. This project will add four new automated lines (15, 16, 17, and 18) encircling the existing network and connecting suburban centers that currently require traveling through central Paris to reach one another. When complete, probably sometime in the 2030s, the Grand Paris Express will nearly double the length of the automated network and fundamentally reshape mobility patterns in the Île-de-France region.

C.10.3 What makes the Métro work

Several design principles distinguish the Paris Métro from less successful transit systems. First, the high density of stations means that walking to the Métro is almost always faster than driving to a parking lot. This convenience generates ridership that justifies the investment. Second, the integration with other modes is seamless. The same ticket works on Métro, RER, buses, and trams within Paris. Transfer stations connect lines at useful angles rather than requiring passengers to exit one system and enter another. Third, the frequency of service makes timetables irrelevant. During peak hours, trains arrive every two minutes on busy lines. Even late at night, waits rarely exceed ten minutes. Passengers simply show up and go.

The signage and wayfinding deserve particular praise. Station names appear in a consistent typeface (Parisine, designed specifically for the Métro in 1996) on tiled walls visible from passing trains. Corridor signs point toward exits, transfers, and surface landmarks with clarity that serves tourists and commuters alike. The colored line numbers and terminus names provide all the information needed to navigate without consulting maps. Many transit systems aspire to this legibility but few achieve it so thoroughly.

Ridership by Line Assignment
Stations grouped by their line connections.

Line(s)   Station Count   Total Ridership   Avg per Station
7         28              63.28M            2.26M
9         23              60.64M            2.64M
13        23              57.83M            2.51M
1         14              55.93M            3.99M
4         16              53.66M            3.35M
8         26              50.48M            1.94M
12        21              39.59M            1.89M
3         17              38.72M            2.28M
2         15              35.11M            2.34M
4, 5      1               34.50M            34.50M

The Métro also benefits from Paris’s urban form. The city is dense and compact, with most destinations within walking distance of a station. Zoning never separated residential from commercial uses as strictly as in American cities, so people live near where they work and shop. The Métro did not create this urban form (it predates the Métro by centuries), but the two reinforce one another. Dense cities need mass transit, and mass transit makes density livable.

For the metro dataset, this context matters. The station names are not arbitrary labels but markers of neighborhoods with distinct characters. The ridership figures reflect how Parisians actually move through their city. The line assignments show which routes carry the heaviest loads and which serve more specialized purposes. Understanding the Métro as a living system, constantly adapting over 125 years of operation, makes the dataset more meaningful than raw numbers alone could convey.

C.10.4 The future of Paris transit

The Grand Paris Express will transform the region, but it is only part of a broader vision. Line 14 continues to extend northward and southward. Line 11 is being extended to connect new suburbs. Tram lines are expanding along the outer boulevards, and bus networks are being reorganized to feed into the rail system more efficiently. The goal is a regional transit network that allows travel between any two points without necessarily passing through central Paris.

The dataset updates periodically to reflect these changes. New stations appear as they open. Ridership figures are updated with each annual release. The metro data is not a static snapshot but an evolving portrait of a transit system that continues to grow and adapt. Future versions will include the Grand Paris Express stations, extending coverage far beyond the historical city limits.

For gt demonstrations, the dataset provides geographic data with French language station names, offering opportunities to demonstrate locale handling and the presentation of transit network information. The ridership figures support ranking tables. The line assignments (stored as comma-separated values) demonstrate handling of multi-valued fields. The opening dates span over a century, creating interesting timelines. But beyond these technical uses, the dataset offers a window into one of humanity’s great collective achievements: a transit system that moves millions of people daily, efficiently and reliably, through one of the world’s most beautiful cities.
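The comma-separated line assignments mentioned above come apart with ordinary string handling. A Python sketch of that first step toward per-line grouping:

```python
# metro stores a station's line assignments as comma-separated text,
# e.g. "4, 5" for Gare du Nord. Splitting them is the first step
# toward per-line aggregation and counting.

def parse_lines(lines_field: str) -> list[str]:
    return [line.strip() for line in lines_field.split(",")]

print(parse_lines("3, 12, 13, 14"))         # ['3', '12', '13', '14']
print(len(parse_lines("1, 4, 7, 11, 14")))  # 5 lines at Châtelet
```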

C.11 gibraltar

The gibraltar dataset contains hourly weather observations from Gibraltar during May 2023: temperature, humidity, wind speed, cloud cover, and other meteorological variables. It provides 744 rows representing each hour of a single month in this small but fascinating territory.

Gibraltar Morning Weather
May 1, 2023: fog clearing to fair skies.

Time    Temp (°C)   Humidity   Wind (km/h)   Direction   Condition
06:50   17.2        72%        0.4           W           Fair
07:50   17.8        88%        0.9           NE          Patches of Fog
08:50   17.2        82%        0.9           W           Patches of Fog
09:20   17.8        77%        2.7           WSW         Patches of Fog
09:50   17.8        77%        2.2           WSW         Fair
10:20   18.9        73%        2.7           SW          Fair
10:50   21.1        64%        1.3           WSW         Fair
11:20   21.1        68%        2.7           ESE         Fair
11:50   22.2        60%        2.2           SE          Fair
12:20   22.2        60%        2.2           E           Fair
12:50   22.2        60%        2.2           E           Fair
13:20   22.2        64%        2.7           E           Fair
13:50   22.2        64%        2.7           E           Fair

Gibraltar sits at the southern tip of the Iberian Peninsula, a British Overseas Territory of barely seven square kilometers guarding the entrance to the Mediterranean Sea. It is the kind of place that captures the imagination precisely because it seems improbable: a limestone promontory with its own airport runway crossing the main road, Barbary macaques roaming the upper rock, and a rather complex history.

C.11.1 Understanding the Rock

The Rock of Gibraltar rises 426 meters above sea level, a dramatic limestone formation that has served as a strategic landmark for millennia. The ancient Greeks called it one of the Pillars of Hercules, marking the edge of the known world. Every Mediterranean power has recognized its importance: control Gibraltar and you control access between the Atlantic and the Mediterranean. The British acquired it in 1704 during the War of the Spanish Succession and have held it ever since, despite periodic Spanish objections and one famous siege that lasted nearly four years.

Gibraltar Weather by Time of Day
May 2023

           Temperature (°C)           Humidity  Wind
           Average  Maximum  Minimum
Morning       19.0     23.9     13.9        1%   3.5
Afternoon     21.6     30.0     15.0        1%   4.7
Evening       21.3     28.9     15.0        1%   4.4
Night         18.9     27.2     13.9        1%   4.0

The May weather data captures Gibraltar in spring, before the intense heat of Mediterranean summer arrives. Temperatures climb through the afternoon hours and descend through the evening, following the familiar diurnal pattern. Humidity inversely tracks temperature, rising as the air cools. Wind direction matters at Gibraltar: the Levante wind blows from the east through the strait, often bringing fog as Mediterranean moisture condenses against the Rock. The Poniente arrives from the west, drier and clearer. These wind patterns shaped navigation through the strait for centuries of sailing ships.

Wind Direction Frequency
May 2023

Direction  Hours
E            271
W            216
ENE          171
WSW          165
NE           118
SSW          115
ESE          111
SW           102
S             52
NNE           33
SE            33
WNW           15
NNW           10
NW             8
N              6
SSE            4
CALM           1

The predominance of certain wind directions reflects the geography of the strait. Air flows through the narrow gap between Europe and Africa, channeled by the mountains on either side. Local topography further complicates matters: the Rock itself creates wind shadows and acceleration zones. Pilots landing at Gibraltar Airport must contend with these effects, making it one of the more challenging airports in Europe. The runway crosses Winston Churchill Avenue, requiring traffic to stop when aircraft land or take off.

May was chosen simply to provide pre-summer weather data. Gibraltar’s Mediterranean climate means mild, pleasant conditions that month, with temperatures climbing toward but not yet reaching peak summer heat. The specific year (2023) holds no particular significance beyond being recent enough for the data to feel current. The data comes from weather APIs providing historical observations, typical of the sources that make meteorological data increasingly accessible for analysis and visualization.

For gt, the dataset demonstrates time series formatting, weather data presentation, and the handling of multiple related numeric columns. Temperature formatting, wind direction encoding, and the diurnal patterns visible in hourly data all provide teaching opportunities.

C.12 films

The films dataset is a labor of love: a comprehensive record of every film that has competed for the Palme d’Or at the Cannes Film Festival. It contains 1,607 entries spanning the festival’s history, with each row recording a film’s title (in both English and original language), director, year, country of origin, spoken languages, and IMDb link.

Cannes Film Festival 2019
Official Competition

Film                              Director                                  Country
A Hidden Life                     Terrence Malick                           United Kingdom, Germany, United States
Atlantics                         Mati Diop                                 France, Senegal, Belgium
Bacurau                           Juliano Dornelles, Kleber Mendonça Filho  Brazil, France
Pain and Glory                    Pedro Almodóvar                           Spain, France
Frankie                           Ira Sachs                                 France, Portugal
Parasite                          Bong Joon Ho                              South Korea
The Traitor                       Marco Bellocchio                          Italy, France, Germany, Brazil
It Must Be Heaven                 Elia Suleiman                             France, Qatar, Germany, Canada, Turkey, Palestine
The Whistlers                     Corneliu Porumboiu                        Romania, France, Germany, Switzerland, Sweden
Young Ahmed                       Jean-Pierre Dardenne, Luc Dardenne        Belgium, France
Les Misérables                    Ladj Ly                                   France
Little Joe                        Jessica Hausner                           Austria, United Kingdom, Germany, France
Matthias & Maxime                 Xavier Dolan                              Canada
Mektoub, My Love: Intermezzo      Abdellatif Kechiche                       France
The Wild Goose Lake               Yi'nan Diao                               China, France
Once Upon a Time in... Hollywood  Quentin Tarantino                         United States, United Kingdom, China
Portrait of a Lady on Fire        Céline Sciamma                            France
Oh Mercy!                         Arnaud Desplechin                         France
Sibyl                             Justine Triet                             France, Belgium
Sorry We Missed You               Ken Loach                                 United Kingdom, France, Belgium
The Dead Don't Die                Jim Jarmusch                              United States

The dataset exists because I really like watching movies. My Letterboxd account (letterboxd.com/rich_i/) tracks my viewing history and provides an ongoing record of films watched and opinions formed (expressed as star ratings). Film festivals provide endless opportunities for discovery, surfacing works that might never reach mainstream distribution. The Cannes Film Festival, as the most prestigious venue for international cinema, seemed like essential data that should exist in an accessible format. But no such dataset was publicly available. The only logical solution was to create one.

C.12.1 Building the Cannes dataset

Construction required extensive research spanning months of work. The festival’s official website provided the foundation, listing competition entries by year. But the website alone was insufficient. Many older entries appeared only with French titles, requiring investigation to find corresponding English names (or vice versa for English-language films shown under French titles). Some films had been released under multiple names in different markets, demanding careful verification of which title was authoritative.

Cannes Competition Entries by Year
Sample of years from 1970 onward.

Year  In-Competition Films
1970  25
1971  26
1972  25
1973  24
1974  26
1975  22
1976  20
1977  23
1978  23
1979  21
1980  23
1981  22
1982  22
1983  22
1984  19
1985  20
1986  20
1987  20
1988  21
1989  22

IMDb links were tracked down for each entry, providing viewers easy access to cast lists, synopses, and user ratings. This was straightforward for recent films but required detective work for older or more obscure entries. Some films from the 1950s and 1960s had minimal online presence, with IMDb pages containing little additional information. But the links exist for completeness, allowing interested viewers to explore further.

Spoken languages and countries of origin required the most careful coding. International co-productions muddy the concept of a film’s “country”. Is a film shot in France, funded by German and Italian producers, directed by a Polish filmmaker, and starring British actors a French film? The dataset records all countries involved in production, accepting that many films defy simple national categorization. Languages posed similar challenges: a film might be primarily in French with scenes in Arabic and English, and all three languages deserve acknowledgment. Where multiple languages appear, I tried to arrange them roughly by the quantity of words spoken in each, so the first language listed is generally the one that dominates the dialogue.

C.12.2 The festival and its significance

The Cannes Film Festival has operated since 1946 (with a brief predecessor event in 1939 interrupted by war). It functions simultaneously as a trade show for film distribution, a competition for artistic achievement, and a showcase for celebrity culture. The Palme d’Or, awarded to the best film in competition, carries considerable prestige. Winners enter the canon of international cinema, their directors’ careers transformed by the recognition.

Countries by Cannes Competition Entries
Single-country productions across festival history.

Country         Films in Competition
United States   201
France          150
United Kingdom   81
Italy            73
Japan            56
USSR             44
Spain            39
Germany          36
Hungary          32
Sweden           27
Mexico           25
Poland           24

The table above reveals which national cinemas have received the most recognition at Cannes. The United States leads the count of single-country productions, with France close behind, unsurprisingly for a French festival on the French Riviera. The United Kingdom and Italy follow, both countries with robust film industries and strong traditions of auteur filmmaking. Japan’s presence reflects the festival’s long appreciation for directors like Kurosawa, Ozu, and more recently Kore-eda and Hamaguchi. The geographic diversity of competition entries has increased over time, with films from Korea, Iran, Thailand, and other countries appearing regularly in recent decades.

The festival also reflects changing tastes and priorities in world cinema. In its early decades, Cannes emphasized European art cinema and established masters. The 1970s brought more adventurous programming, with controversial entries and recognition for directors working outside commercial constraints. Recent years have seen increased attention to women directors (historically underrepresented) and to cinemas from regions previously marginalized in international distribution.

Most Frequent Cannes Competitors
Directors with 5+ competition entries.

Director                            Competition Entries  Years Active
Ken Loach                                            15  1981-2023
Jean-Pierre Dardenne, Luc Dardenne                   10  1999-2025
Wim Wenders                                          10  1976-2023
Carlos Saura                                          9  1960-1988
Lars von Trier                                        9  1984-2011
Nanni Moretti                                         9  1978-2023
Ettore Scola                                          8  1970-1989
Jim Jarmusch                                          8  1986-2019
Marco Bellocchio                                      8  1980-2023
Marco Ferreri                                         8  1963-1991

The directors who return repeatedly to Cannes competition form a roster of international cinema’s most celebrated figures. Their repeated presence reflects both the festival’s loyalty to directors it has championed and these filmmakers’ continued production of work deemed worthy of the world’s most competitive showcase. For many, a Cannes premiere represents the peak of artistic recognition, the moment when a new work enters the conversation of global cinema.

C.12.3 Film as data

The films dataset demonstrates that even cultural artifacts can be structured for analysis. Each film becomes a row with attributes: title, year, director, country, language. These attributes support queries that would be tedious to answer through casual browsing. Which directors have competed most often? How has the linguistic diversity of competition entries changed over time? What proportion of recent competitors are first-time Cannes directors versus returning favorites?

For gt specifically, films demonstrates fmt_flag() and fmt_country() in realistic contexts. The country codes translate directly to flag icons, creating visual tables that communicate nationality at a glance. The categorical structure (years, directors, countries) provides natural grouping opportunities. The IMDb URLs demonstrate link formatting for external references. It is a dataset that makes beautiful tables almost by accident, because film data is inherently interesting to display.
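A minimal sketch of that flag formatting, assuming a column of comma-separated ISO 3166-1 alpha-2 codes (the tiny stand-in data frame and its column names are assumptions for illustration, not the dataset’s real structure):

```r
library(gt)

# Stand-in for a slice of films: comma-separated country codes per entry
entries <- data.frame(
  title     = c("Parasite", "Atlantics", "Les Misérables"),
  countries = c("KR", "FR,SN,BE", "FR")
)

entries |>
  gt() |>
  fmt_flag(columns = countries)  # each code is rendered as a small flag icon
```

Swapping fmt_flag() for fmt_country() on the same kind of column would print full country names instead of icons.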

The dataset updates annually as each new festival adds to the historical record. Every May, the Cannes competition announces its official selection, and those entries will appear in future versions of films. The ongoing maintenance reflects both practical utility (keeping examples current) and personal interest (following each year’s festival with the attention of a devoted fan).

C.12.4 My Letterboxd

The films dataset exists because I really enjoy movies, and that love extends well beyond festival competition entries. Below is a searchable, sortable table of every film I’ve watched (and mostly rated) on Letterboxd. The data was assembled using scripts in this book’s repository (scripts/scrape-letterboxd.R), which merge the Letterboxd data export files and fetch director information from individual film pages.

Rich’s Letterboxd
All 966 watched films.

One detail worth noting is that the star ratings are formatted with fmt() rather than stored as a pre-formatted text column. This matters for interactive tables because fmt() changes only the display while preserving the underlying numeric values. When a user clicks the Rating column header to sort, the table sorts on the original numbers (5, 4.5, 4, …) rather than on rendered text like “★★★★½”, which would sort alphabetically and produce nonsensical results. It is a small trick but an important one whenever you need sortable columns with custom formatting in opt_interactive() tables!
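A sketch of the idea with a hypothetical slice of the watch list (the stars() helper and the sample data are mine; fmt() and opt_interactive() are the gt functions the text describes):

```r
library(gt)

# Render a numeric rating as stars; the data column itself stays numeric
stars <- function(x) {
  paste0(strrep("\u2605", floor(x)), ifelse(x %% 1 >= 0.5, "\u00bd", ""))
}

# Hypothetical excerpt of the Letterboxd data
watched <- data.frame(
  title  = c("Parasite", "Atlantics", "Frankie"),
  rating = c(5, 4, 3.5)
)

watched |>
  gt() |>
  fmt(columns = rating, fns = stars) |>  # display only; sorting stays numeric
  opt_interactive(use_sorting = TRUE)
```

Because fmt() only touches the rendered output, clicking the Rating header sorts on 5 > 4 > 3.5, not on the star glyphs.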

C.13 illness

The illness dataset takes a different approach than the others. Rather than modeling behavior or compiling reference data, it reproduces a single table from a published scientific article. The source is “A fatal yellow fever virus infection in China: description and lessons” from Emerging Microbes & Infections (July 2016), which documented laboratory test results for a patient who contracted yellow fever during travel to Angola.

Test         Units          Day 3     Day 7    Day 9
Viral load   copies per mL  12000.00  760.00   250.00
WBC          ×10⁹/L         5.26      24.77    19.03
Neutrophils  ×10⁹/L         4.87      22.08    16.59
RBC          ×10¹²/L        5.72      4.12     3.32
Hb           g/L            153.00    75.00    95.00
PLT          ×10⁹/L         67.00     74.10    25.60
ALT          U/L            12835.00  1623.70  512.40
AST          U/L            23672.00  2189.00  782.50
TBIL         µmol/L         117.20    127.30   163.20
DBIL         µmol/L         71.40     117.80   126.30

The article is freely available under a Creative Commons license, making reproduction appropriate. The dataset was created specifically to test gt’s fmt_units() function and its ability to render scientific unit notation correctly. Medical laboratory results frequently include units like mL, μL, g/dL, U/L, and ×10³/μL that require careful formatting (the last of those being particularly tedious to typeset correctly without dedicated tooling). The question was whether gt could faithfully reproduce the original Table 1 from the article.
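The reproduction hinges on that one function. A minimal sketch, assuming the bundled illness data exposes the test names and a units column written in gt’s units notation (check names(illness) if the column names differ):

```r
library(gt)

# illness ships with gt; its units column stores strings like "x10^9 / L"
# in gt's units notation, which fmt_units() typesets with proper superscripts
illness |>
  gt(rowname_col = "test") |>
  fmt_units(columns = units)
```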

C.13.1 Reading laboratory values

Medical laboratory tests generate data that require specialized interpretation. Each test has reference ranges defining normal values, and deviations above or below those ranges signal pathology. A white blood cell count of 3.0 × 10⁹/L might indicate leukopenia (low white cells), potentially signifying infection, bone marrow problems, or medication side effects. Liver enzymes elevated beyond normal ranges suggest hepatic damage. Reading the illness dataset means tracking multiple indicators as they evolve day by day through a fatal disease progression.

Laboratory Test Reference Ranges

                        Normal Range
Test         Units        Low   High
WBC          ×10⁹/L       4.0   10.0
Neutrophils  ×10⁹/L       2.0    8.0
RBC          ×10¹²/L      4.0    5.5
Hb           g/L        120.0  160.0
PLT          ×10⁹/L     100.0  300.0
ALT          U/L          9.0   50.0
AST          U/L         15.0   40.0
TBIL         µmol/L       0.0   18.8
DBIL         µmol/L       0.0    6.8
NH3          mmol/L      10.0   47.0
PT           s            9.4   12.5
APTT         s           25.1   36.5

The normal ranges provide context for interpreting measurements. When day 9 values fall far outside these ranges, the severity becomes apparent. Bilirubin rising dramatically indicates liver failure. Creatinine elevation signals kidney involvement. The cascade of organ dysfunction visible in sequential laboratory values explains why this case study merited publication and why it serves as a teaching resource.
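That kind of reading translates directly into a table style rule: flag any value that falls outside its reference range. A sketch under assumed column names (the little labs frame is illustrative, not the dataset itself):

```r
library(gt)

# Hypothetical subset of day-9 results alongside reference-range bounds
labs <- data.frame(
  test   = c("WBC", "PLT", "TBIL"),
  day_9  = c(19.03, 25.60, 163.20),
  norm_l = c(4.0, 100.0, 0.0),
  norm_u = c(10.0, 300.0, 18.8)
)

labs |>
  gt(rowname_col = "test") |>
  tab_style(
    style = cell_text(color = "red", weight = "bold"),
    locations = cells_body(
      columns = day_9,
      rows = day_9 < norm_l | day_9 > norm_u  # out-of-range values
    )
  )
```

The rows argument of cells_body() accepts an expression evaluated against the data, so the highlighting tracks the numbers rather than hand-picked rows.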

The dataset thus serves as a benchmark: if you can recreate a published scientific table using gt, the package’s formatting capabilities are proven sufficient for real-world use. The illness data provides that proof of concept while also documenting a tragic case that contributed to medical understanding of yellow fever progression.

C.14 rx_adsl and rx_addv

These two datasets represent gt’s connection to the pharmaceutical industry, where clinical trial tables must meet rigorous standards for regulatory submission. The datasets follow CDISC (Clinical Data Interchange Standards Consortium) conventions, specifically the ADaM (Analysis Data Model) structure used throughout the pharmaceutical industry.

rx_adsl contains subject-level data (ADSL format) for 182 participants in a fictional clinical trial. rx_addv provides protocol deviation records (ADDV format) with 910 entries documenting when and how trial participants deviated from study protocols. Both datasets use standard variable names and coding conventions that pharmaceutical statisticians will immediately recognize.

Subject ID  Age  Sex     Ethnicity               Treatment
GT1000       37  Male    Hispanic or Latino      NA
GT1001       41  Male    Not Hispanic or Latino  Placebo
GT1002       39  Female  Not Hispanic or Latino  Placebo
GT1003       38  Male    Not Hispanic or Latino  Placebo
GT1004       45  Male    Not Hispanic or Latino  Placebo
GT1005       35  Female  Hispanic or Latino      Placebo
GT1006       42  Female  Not Hispanic or Latino  Placebo
GT1007       35  Male    Not Hispanic or Latino  Placebo

C.14.1 The language of clinical trials

Pharmaceutical data follows conventions that seem arcane to outsiders but enable precise communication among specialists. USUBJID uniquely identifies a subject across all studies from a sponsor. TRTA indicates the actual treatment received (as opposed to the treatment assigned). SAFFL flags subjects in the safety population. This vocabulary, defined by CDISC standards, appears in regulatory submissions worldwide. A statistician in Switzerland reviewing a submission from Japan knows exactly what TRTA means because the standards are universal.

Treatment Arm Demographics

Treatment  Subjects  Mean Age  Female  Ethnicities
Placebo          90      41.2      0%            3
Drug 1           90      39.2      0%            3
NA                2      38.5      0%            1

The treatment arms in clinical trials typically include the experimental treatment at one or more doses, a placebo or active comparator, and sometimes multiple dosing regimens. Demographic balance across arms helps ensure that observed differences reflect treatment effects rather than baseline differences. Age, sex, ethnicity, disease severity at baseline, and prior treatments all require documentation and comparison.

Protocol Deviation Categories

Category  Deviations
                 187
Major            104

Protocol deviations document when trial participants did not follow the study plan. Some deviations are minor (a visit occurring outside the allowed window). Others are major (taking prohibited medications, missing doses). The rx_addv dataset catalogs these deviations, enabling sensitivity analyses that exclude subjects with major violations. Regulators scrutinize deviation patterns for evidence that the trial was conducted properly and that deviations do not undermine the conclusions.

These datasets were contributed by Alexandra Lauer as part of ongoing collaboration between gt developers and pharmaceutical industry users. The package website includes a dedicated case study article demonstrating how to create clinical tables that meet industry standards. For pharmaceutical statisticians evaluating gt for regulatory work, these datasets provide immediately relevant examples.

The inclusion of pharmaceutical data reflects gt’s ambition to serve professional communities with specialized requirements. Clinical trials generate enormous quantities of tabular output, much of it following strict formatting conventions. Having sample datasets in the standard format lowers the barrier for pharmaceutical users to adopt gt and verify that it meets their needs.

C.15 The value of curated datasets

Looking across all eighteen datasets, certain patterns emerge. Many fill gaps where public data existed but not in convenient form. The scientific datasets consolidate information scattered across journals and government pages. The films dataset creates a resource that simply did not exist before. Even the simulated pizzaplace data serves a purpose: realistic transactional data is rarely available publicly due to business confidentiality.

Other datasets reflect personal curiosity. Ontario towns. The Paris Métro. Gibraltar’s weather. Cannes films. These choices say something about my interests and the particular corners of the world that captured my attention. The datasets are better for this personal investment. Someone who cares about the Paris Métro will notice details that a disinterested compiler would miss.

For users of gt, the datasets provide reliable materials for learning and experimentation. The variety ensures that nearly any table type (financial, scientific, demographic, geographic, categorical) has relevant sample data available. The careful construction means edge cases are present: missing values, unusual formatting requirements, multi-valued fields. The documentation grounds each dataset in context that makes the data more meaningful to work with.

Datasets are infrastructure. Good ones get used for years, appearing in examples, tutorials, homework assignments, and documentation. The eighteen datasets in gt aspire to that longevity. They are not throwaway data generated to fill a requirement but carefully assembled resources intended to remain useful across many versions and use cases. The stories behind them, now recorded here, add another layer of value: not just what the data contains but why it exists and where it came from.