Appendix C — The gt datasets

Every dataset tells a story. The eighteen datasets bundled with gt were not assembled arbitrarily or pulled from dusty archives of convenient CSV files. Each one emerged from a specific need, a personal curiosity, or a gap in what was publicly available. Some model human behavior with surprising fidelity. Others preserve scientific data that would otherwise remain scattered across obscure government pages. A few exist simply because I thought “I wish this dataset existed” and then made it so.

This appendix takes you behind the scenes of these datasets. You will learn where they came from, how they were constructed, and why they matter beyond their immediate utility as demonstration data. Along the way, we will explore the broader contexts they represent: the economics of a neighborhood pizza shop, the 125-year evolution of urban transit in Paris, the delicate chemistry of Earth’s atmosphere, and the quiet dramas of population change in Ontario’s small towns. The datasets are more than rows and columns. They are windows into worlds worth understanding.

C.1 pizzaplace

The pizzaplace dataset contains 49,574 rows representing every pizza sold at a fictional pizzeria during the year 2015. Each row records a transaction with its timestamp, pizza type, size, and price. On the surface it appears to be straightforward sales data. Underneath, it is an elaborate simulation of human behavior, kitchen operations, and the unpredictable rhythms of a small food business.

The inspiration for this dataset came from Plateau Pizza, a real establishment in Coquitlam, British Columbia. The restaurant occupies a pleasant spot in a suburban plaza alongside a dollar store called Dollars & Cents and an IGA grocery store. It is the kind of neighborhood pizza place that survives on regulars and convenience, offering pizzas with names that are equal parts cheesy and memorable. Goat Supreme. The Calabrese. Names that stick in your head even if you cannot quite remember what toppings they included.

The dataset borrowed liberally from Plateau Pizza’s menu. Their category structure (Classic, Supreme, Veggie and Vegan, Chicken) was adopted directly. Many pizza names and ingredient combinations came straight from their website, though some were embellished and others invented entirely. The fictional additions drew inspiration from Food Network recipes and local salumist offerings that seemed appropriately gourmet. Pricing followed a similar pattern: real prices served as the baseline, then adjustments were made based on perceived ingredient costs. The fancier cheeses and cured meats commanded premium prices, as they should.

What makes pizzaplace interesting is not the menu but the simulation that generated a year’s worth of orders. The modeling script (preserved in the gt package’s source repository at data-raw/05-pizzaplace.R) creates synthetic customers who arrive throughout each day with realistic patterns. Barret Schloerke contributed refinements to the behavioral model, adding nuance to timing and preference distributions. Weekends see more traffic than weekdays, reflecting the work-leisure split that governs so much of consumer behavior. Holidays disrupt normal patterns in expected ways.

The simulation also includes what might be called narrative events. A kitchen fire at one point disrupts operations. These occasional calamities add verisimilitude to what would otherwise be suspiciously smooth data. Real businesses experience interruptions, equipment failures, staff shortages, and the occasional minor disaster. The pizzaplace data reflects this messiness, making it more valuable for realistic analysis exercises than perfectly clean synthetic data would be.

The dataset originally included plans for appetizers, side dishes, and beverages, but these were ultimately cut in favor of simplicity. A pizzeria that only sells pizza is easier to understand and analyze than one with a full menu. The constraint also keeps the focus on what makes the dataset distinctive: the careful modeling of how people order pizza throughout a year.

C.1.1 The pizzas of pizzaplace

The Complete pizzaplace Menu
All 32 pizza varieties by category.

Pizza                                        Code           Available Sizes    Price Range

chicken
The Barbecue Chicken Pizza                   bbq_ckn        S, M, L            $12.75–$20.75
The California Chicken Pizza                 cali_ckn       S, M, L            $12.75–$20.75
The Chicken Alfredo Pizza                    ckn_alfredo    S, M, L            $12.75–$20.75
The Chicken Pesto Pizza                      ckn_pesto      S, M, L            $12.75–$20.75
The Southwest Chicken Pizza                  southw_ckn     S, M, L            $12.75–$20.75
The Thai Chicken Pizza                       thai_ckn       S, M, L            $12.75–$20.75

classic
The Big Meat Pizza                           big_meat       S                  $12.00
The Classic Deluxe Pizza                     classic_dlx    S, M, L            $12.00–$20.50
The Greek Pizza                              the_greek      S, M, L, XL, XXL   $12.00–$35.95
The Hawaiian Pizza                           hawaiian       S, M, L            $10.50–$16.50
The Italian Capocollo Pizza                  ital_cpcllo    S, M, L            $12.00–$20.50
The Napolitana Pizza                         napolitana     S, M, L            $12.00–$20.50
The Pepperoni Pizza                          pepperoni      S, M, L            $9.75–$15.25
The Pepperoni, Mushroom, and Peppers Pizza   pep_msh_pep    S, M, L            $11.00–$17.50

supreme
The Brie Carre Pizza                         brie_carre     S                  $23.65
The Calabrese Pizza                          calabrese      S, M, L            $12.25–$20.25
The Italian Supreme Pizza                    ital_supr      S, M, L            $12.50–$20.75
The Pepper Salami Pizza                      peppr_salami   S, M, L            $12.50–$20.75
The Prosciutto and Arugula Pizza             prsc_argla     S, M, L            $12.50–$20.75
The Sicilian Pizza                           sicilian       S, M, L            $12.25–$20.25
The Soppressata Pizza                        soppressata    S, M, L            $12.50–$20.75
The Spicy Italian Pizza                      spicy_ital     S, M, L            $12.50–$20.75
The Spinach Supreme Pizza                    spinach_supr   S, M, L            $12.50–$20.75

veggie
The Five Cheese Pizza                        five_cheese    L                  $18.50
The Four Cheese Pizza                        four_cheese    M, L               $14.75–$17.95
The Green Garden Pizza                       green_garden   S, M, L            $12.00–$20.25
The Italian Vegetables Pizza                 ital_veggie    S, M, L            $12.75–$21.00
The Mediterranean Pizza                      mediterraneo   S, M, L            $12.00–$20.25
The Mexicana Pizza                           mexicana       S, M, L            $12.00–$20.25
The Spinach Pesto Pizza                      spin_pesto     S, M, L            $12.50–$20.75
The Spinach and Feta Pizza                   spinach_fet    S, M, L            $12.00–$20.25
The Vegetables + Vegetables Pizza            veggie_veg     S, M, L            $12.00–$20.25

The menu reveals the character of the fictional establishment. Classic pizzas stick to familiar territory: pepperoni, Hawaiian, the combinations everyone recognizes. The Supreme category ventures into more elaborate ingredient lists with names that promise indulgence. Veggie options cater to those avoiding meat, while the Chicken category builds meals around that particular protein. Most pizzas come in several sizes, though not every pizza is available in every size. The XL (and XXL!) options appear only for The Greek, and presumably that one pizza is popular enough to warrant the larger formats (could be that the ingredient mix works well at scale).

Pricing follows intuitive logic. A basic cheese pizza costs less than one loaded with prosciutto and artichoke hearts. Size increases bring proportional price increases, though the per-square-inch cost typically decreases as you go larger (the eternal economic argument for ordering the bigger pizza). These pricing patterns create natural opportunities for analysis: which pizzas generate the most revenue? Which sizes sell best and for which types? How does day of week affect the popularity of the vegetarian options?

C.1.2 A pizza tier list

Any discussion of pizza invites opinions about which pizzas are best. The following tier list represents one possible ranking of the pizzaplace menu, from essential classics to more adventurous options that may not suit every palate. Reasonable people will disagree, and that disagreement is part of what makes pizza culture endlessly entertaining.

The Definitive pizzaplace Tier List
A highly subjective ranking of 15 notable pizzas.

S Tier — Essential
The Pepperoni (Mozzarella, Pepperoni, Tomato Sauce): The benchmark against which all pizzas are measured
The Big Meat (Bacon, Pepperoni, Italian Sausage, Chorizo, Mozzarella, Tomato Sauce): Maximalist approach that somehow works
The Classic Deluxe (Pepperoni, Mushrooms, Red Onions, Red Peppers, Bacon, Mozzarella, Tomato Sauce): Everything a pizza should be

A Tier — Excellent
The Barbecue Chicken (Barbecued Chicken, Red Peppers, Green Peppers, Tomatoes, Red Onions, Mozzarella, Barbecue Sauce): Sweet and savory in perfect balance
The Hawaiian (Sliced Ham, Pineapple, Mozzarella, Tomato Sauce): Controversial but undeniably popular
The Italian Capocollo (Capocollo, Red Peppers, Tomatoes, Goat Cheese, Garlic, Oregano, Mozzarella, Tomato Sauce): Elevated ingredients lift the whole experience
The Calabrese (Nduja, Italian Sausage, Pepperoni, Tomatoes, Red Onions, Mozzarella, Tomato Sauce): Spicy nduja provides serious depth
The Prosciutto (Prosciutto, Arugula, Mozzarella, Tomato Sauce): Simple elegance, fresh and light

B Tier — Good
The Four Cheese (Ricotta, Gorgonzola, Romano, Mozzarella, Tomato Sauce): For dedicated cheese enthusiasts only
The Vegetables (Mushrooms, Tomatoes, Red Peppers, Green Peppers, Red Onions, Zucchini, Spinach, Garlic, Mozzarella, Tomato Sauce): The vegetable abundance can overwhelm
The Spinach Pesto (Spinach, Artichokes, Tomatoes, Sun-dried Tomatoes, Garlic, Pesto Sauce, Mozzarella): Pesto base divides opinion
The Greek (Kalamata Olives, Feta, Tomatoes, Red Onions, Red Peppers, Garlic, Mozzarella, Tomato Sauce): Mediterranean flavors work better as salad

C Tier — Passable
The Brie Carre (Brie, Prosciutto, Caramelized Onions, Pears, Thyme, Garlic, Mozzarella): Ambitious but confused identity
The Chicken Pesto (Chicken, Tomatoes, Red Peppers, Spinach, Garlic, Pesto Sauce, Mozzarella): Pesto and chicken compete rather than complement
The Soppressata (Soppressata, Fontina, Mozzarella, Garlic, Tomato Sauce): Underwhelming given premium ingredients

Rankings reflect one person's taste. YMMV.

The S-tier pizzas represent the core of what a pizzeria should do well. The Pepperoni is the foundation, the pizza against which all others are implicitly compared. When someone says “let’s get pizza”, this is what most people imagine. The Big Meat takes maximalism seriously but avoids the trap of incoherence: bacon, pepperoni, sausage, and chorizo sound excessive, but they harmonize around a common meatiness (it really works!). The Classic Deluxe achieves balance, incorporating vegetables (mushrooms, peppers, onions) alongside meat without letting any single ingredient dominate.

The A-tier pizzas represent successful experiments. The Hawaiian remains perpetually controversial, inspiring passionate defenses and equally passionate condemnations. But I’m a fan. The combination of salty ham and sweet pineapple against tangy tomato sauce works for a lot of people, and the dataset shows it selling consistently throughout the year. The Italian Capocollo and Calabrese elevate the genre through premium cured meats, offering pizzas that could credibly appear on a trattoria menu. The Prosciutto demonstrates that restraint can be a virtue: just ham, arugula, and cheese, but executed with quality ingredients.

The B-tier pizzas are solid but not essential. The Four Cheese appeals to dedicated turophiles but can feel monotonous without the textural variety that vegetables or meats provide. The vegetable-heavy options (Vegetables, Greek, Spinach Pesto) often release too much moisture during cooking, resulting in soggy centers that undermine the crust. Don't get me wrong: they are fine pizzas, just rarely anyone's first choice.

The C-tier pizzas represent experiments that did not quite succeed. The Brie Carre attempts a French-inspired combination of brie, pears, and prosciutto that sounds sophisticated but tastes confused. The ingredients come from different culinary traditions and never fully integrate. These pizzas do seem to sell in this particular dataset, but they probably rarely inspire repeat orders or enthusiastic recommendations.

C.1.3 What makes a good pizza

The tier list above reflects accumulated pizza wisdom, but the principles underlying it deserve articulation. A good pizza balances several competing demands. The crust must be sturdy enough to support toppings without becoming soggy, yet thin and pliable enough to fold. The sauce should be present but not overwhelming, providing acidity and moisture without drowning the other flavors. The cheese needs to melt smoothly and brown slightly without becoming rubbery or releasing pools of grease. The toppings must be distributed evenly and cut to appropriate sizes so that each bite contains a representative sample. And freshness is non-negotiable. A pizza that has been languishing in a display case for hours, slowly drying out under a heat lamp, is a shadow of its former self. The crust toughens, the cheese congeals, and the toppings develop that dispiriting sheen of oxidation. At the extreme end of neglect, one reviewer actually found mold growing on the underside of a display pizza, which is the sort of discovery that makes you reconsider every grab-and-go slice you have ever eaten.

Beyond these structural requirements, good pizza demonstrates restraint. The impulse to add more toppings is understandable but usually misguided. Each additional ingredient dilutes the impact of everything else. The best pizzas feature three to five components (beyond sauce and cheese) that complement one another. Pepperoni alone is better than pepperoni with six other meats. Mushrooms and peppers work together because their textures contrast and their flavors do not compete.

Temperature matters enormously. A pizza that has been sitting for twenty minutes bears little resemblance to one fresh from the oven. The cheese firms up, the crust softens, the toppings cool into sad disconnection. This is why pizzerias survive on dine-in and delivery rather than takeout that sits in a car. And do not even get me started on microwave pizza. The microwave does something deeply wrong to pizza crust, transforming it into a chewy, rubbery substance that no longer qualifies as bread by any reasonable definition. The cheese melts unevenly, the sauce superheats into lava pockets, and the whole experience leaves you wondering why you bothered. Cold leftover pizza eaten standing at the refrigerator at midnight is honestly preferable (I’ve actually grown to like it more and more these days). The pizzaplace dataset implicitly captures this reality: the simulation models customers who order and receive pizzas within a reasonable timeframe, not pizzas boxed and forgotten (and certainly not pizzas reheated in a microwave).

Finally, good pizza requires good ingredients. This seems obvious but explains why some inexpensive pizzas disappoint while others satisfy. The quality of the mozzarella matters. The tomatoes in the sauce matter. Even the olive oil brushed on the crust matters. Plateau Pizza, the real establishment that inspired the dataset, succeeds partly because it sources decent ingredients and does not cut corners that customers would notice. The fictional pizzaplace inherits this philosophy.

And yet, after all this talk of balance and restraint, it is worth acknowledging that pizza can abandon every one of these conventions and still be legitimate. A pizza can have no sauce. It can have no cheese. It can be nothing more than dough, olive oil, and anchovies. This sounds shocking to anyone raised on North American delivery pizza, but it is a perfectly valid expression of pizzadom with deep roots in Italian tradition. Pizza bianca, pizza marinara, and the various focaccia-adjacent flatbreads of southern Italy predate the mozzarella-laden versions by centuries. The rules above describe one tradition. Pizza is generous enough to accommodate others.

C.1.4 Why pizza data matters

Pizza occupies a unique position in the landscape of consumer goods. It is simultaneously simple (bread, sauce, cheese, toppings) and infinitely variable. A pizzeria’s menu encodes assumptions about its customers: their adventurousness, their price sensitivity, their dietary restrictions. Sales patterns encode behavior: when people eat, what they celebrate, how weather affects appetite.

Part of my motivation for creating this dataset was to draw attention to pizza analytics as a legitimate field of inquiry. The phrase sounds ridiculous, and that is partly the point. If a dataset can make someone smile while also teaching them time series decomposition and category performance analysis, it has done more work than a dataset that only accomplishes the second thing. Nobody has ever felt intimidated by pizza data. Nobody has ever opened a CSV of pizza orders and thought “I am not qualified to analyze this”. The approachability is a feature (not a frivolity!).

The pizzaplace dataset serves as a sandbox for the kinds of analysis that businesses perform constantly. Revenue breakdowns, time series decomposition, category performance, seasonal adjustment. These techniques apply far beyond pizza. Anyone learning to analyze transactional data will find the patterns in pizzaplace transferable to retail, hospitality, and service industries generally. The dataset is large enough to be realistic (nearly 50,000 transactions) but small enough to process quickly on any modern computer.
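As a minimal sketch of the first of those questions (revenue by category), using dplyr for the aggregation; the type and price columns are part of pizzaplace, but the summary names here are my own:

```r
library(gt)     # provides the pizzaplace dataset
library(dplyr)

# Revenue and order volume for each pizza category
rev_by_type <-
  pizzaplace |>
  group_by(type) |>
  summarize(
    pizzas_sold = n(),
    revenue = sum(price)
  ) |>
  arrange(desc(revenue))

rev_by_type
```

The same group-summarize-arrange pattern extends to size, month, or hour of day by swapping the grouping column.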

For gt specifically, pizzaplace demonstrates grouped data, aggregation, currency formatting, and the presentation of time-based information. A year of pizza sales can become a monthly summary table, a daily heatmap, a ranked list of bestsellers, or a comparison across categories. The richness of the underlying data supports dozens of different table designs.

C.2 exibble

The name exibble is a portmanteau of “example tibble” and it serves exactly that purpose. This tiny dataset of eight rows exists to demonstrate gt’s formatting capabilities without the distraction of meaningful content. Each column represents a different data type: numeric values, character strings, currency amounts, dates, times, datetimes, and logical values. Missing values appear in strategic locations to demonstrate sub_missing() and related substitution functions.

The exibble dataset
8 rows and 9 columns.
num char fctr date time datetime currency row group
1.111e-01 apricot one 2015-01-15 13:35 2018-01-01 02:22 49.950 row_1 grp_a
2.222e+00 banana two 2015-02-15 14:40 2018-02-02 14:33 17.950 row_2 grp_a
3.333e+01 coconut three 2015-03-15 15:45 2018-03-03 03:44 1.390 row_3 grp_a
4.444e+02 durian four 2015-04-15 16:50 2018-04-04 15:55 65100.000 row_4 grp_a
5.550e+03 NA five 2015-05-15 17:55 2018-05-05 04:00 1325.810 row_5 grp_b
NA fig six 2015-06-15 NA 2018-06-06 16:11 13.255 row_6 grp_b
7.770e+05 grapefruit seven NA 19:10 2018-07-07 05:22 NA row_7 grp_b
8.880e+06 honeydew eight 2015-08-15 20:20 NA 0.440 row_8 grp_b

The column names are deliberately generic (num, char, currency, date, time, datetime) because the content does not matter. What matters is having every common data type available in a single compact dataset. When documenting a date formatter, you need a date column. When showing number formatting options, you need numbers. When explaining how to handle NA values, you need NA values in predictable locations.
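A short sketch of exibble in that role, pairing each column with a type-appropriate formatter; the specific choices here (decimal places, date style, missing-value text) are mine, not canonical:

```r
library(gt)     # provides the exibble dataset

# Apply a formatter suited to each column's data type
tbl <-
  exibble |>
  gt() |>
  fmt_number(columns = num, decimals = 2) |>
  fmt_currency(columns = currency, currency = "USD") |>
  fmt_date(columns = date, date_style = "m_day_year") |>
  sub_missing(missing_text = "N/A")

tbl
```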

C.2.1 Anatomy of a reference dataset

Each row and column in exibble was chosen to exercise different aspects of table formatting:

exibble Column Structure
Column  R Type  Role in Examples
num numeric Demonstrates numeric formatting across scales
char character Provides recognizable text labels (fruits)
currency numeric Tests currency with decimals, zeros, NAs
date Date Shows date formatting patterns
time character Character-encoded times for parsing
datetime POSIXct Full datetime objects for formatting
row character Stub/rowname labels for tables
group character Group categories for row grouping

The fruit names in the char column (apricot, banana, coconut, and so forth) follow alphabetical order, which makes them easy to verify when demonstrating sorting or filtering operations. The numeric values span several orders of magnitude, from fractions to millions, ensuring that formatters must handle both small precise values and large rounded ones. The currency column includes a missing value and one very small amount, testing edge cases that might trip up naive formatting approaches.

The row and group columns transform exibble from a formatting showcase into a structural one. With row serving as a stub and group organizing rows into categories, the same eight-row dataset can demonstrate virtually every gt feature. Headers, stubs, row groups, column formatting, substitution, styling… all can be shown using just exibble.

exibble with Row Groups and Stub
Demonstrating structural features.
num char fctr date time datetime currency
grp_a
row_1 0.11 apricot one 1/15/2015 13:35 1/1/2018 02:22 $49.95
row_2 2.22 banana two 2/15/2015 14:40 2/2/2018 14:33 $17.95
row_3 33.33 coconut three 3/15/2015 15:45 3/3/2018 03:44 $1.39
row_4 444.40 durian four 4/15/2015 16:50 4/4/2018 15:55 $65,100.00
grp_b
row_5 5,550.00 five 5/15/2015 17:55 5/5/2018 04:00 $1,325.81
row_6 fig six 6/15/2015 6/6/2018 16:11 $13.26
row_7 777,000.00 grapefruit seven 19:10 7/7/2018 05:22
row_8 8,880,000.00 honeydew eight 8/15/2015 20:20 $0.44
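A table along these lines can be sketched with the two structural arguments to gt(); the formatting calls are approximate, not an exact recipe for the rendering above:

```r
library(gt)     # provides the exibble dataset

# row becomes the stub labels; group becomes the row groups
tbl <-
  exibble |>
  gt(rowname_col = "row", groupname_col = "group") |>
  fmt_number(columns = num, decimals = 2) |>
  fmt_currency(columns = currency, currency = "USD")

tbl
```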

Datasets like exibble rarely get much attention, but they are essential infrastructure for documentation examples. Every example in gt's documentation that needs a quick formatting demonstration reaches for exibble rather than constructing throwaway data inline. This consistency helps readers recognize the dataset and focus on what is being demonstrated rather than puzzling over unfamiliar data structures.

C.3 gtcars

The gtcars dataset contains specifications for 47 luxury and performance automobiles, with an emphasis on grand touring vehicles. The name works on two levels: these are GT (grand tourer) cars, and the dataset lives in a package called gt. The wordplay is intentional but understated. I try not to make a big deal about it.

German and Italian Grand Tourers
Precision engineering meets Mediterranean passion.

Model            HP    Torque   MPG City   MPG Hwy   MSRP

Audi
  R8             430   317      11         20        $115,900
  RS 7           560   516      15         25        $108,900
  S6             450   406      18         27        $70,900
  S7             450   406      17         27        $82,900
  S8             520   481      15         25        $114,900
BMW
  6-Series       315   330      20         30        $77,300
  M4             425   406      17         24        $65,700
  M5             560   500      15         22        $94,100
  M6             560   500      15         22        $113,400
  i8             357   420      28         29        $140,700
Mercedes-Benz
  AMG GT         503   479      16         22        $129,900
  SL-Class       329   354      20         27        $85,050
Porsche
  718 Boxster    300   280      21         28        $56,000
  718 Cayman     300   280      20         29        $53,900
  911            350   287      20         28        $84,300
  Panamera       310   295      18         28        $78,100
Ferrari
  458 Italia     562   398      13         17        $233,509
  458 Speciale   597   398      13         17        $291,744
  458 Spider     562   398      13         17        $263,553
  488 GTB        661   561      15         22        $245,400
  California     553   557      16         23        $198,973
  F12Berlinetta  731   509      11         16        $319,995
  FF             652   504      11         16        $295,000
  GTC4Lusso      680   514      12         17        $298,000
  LaFerrari      949   664      12         16        $1,416,362
Lamborghini
  Aventador      700   507      11         18        $397,500
  Gallardo       550   398      12         20        $191,900
  Huracan        610   413      16         20        $237,250
Maserati
  Ghibli         345   369      17         24        $70,600
  Granturismo    454   384      13         21        $132,825
  Quattroporte   404   406      16         23        $99,900

The dataset was assembled from Motor Trend articles about grand touring vehicles, with additional research filling in gaps for fuel economy and torque figures. Most vehicles date from around 2015, reflecting when the source articles were published. The selection criteria emphasized true grand tourers: vehicles designed for high-speed, long-distance driving in comfort, typically with powerful engines, refined interiors, and substantial price tags.

C.3.1 What makes a grand tourer?

The grand touring concept originated in 1950s Europe, when wealthy motorists began taking extended driving holidays across the continent. A proper GT needed range (400+ kilometers between fuel stops), performance (for the autobahns and mountain passes), and comfort (for hours behind the wheel). The Ferrari 250 GT established the template: front-mounted V12, leather interior, elegant coachwork by Pininfarina or Scaglietti. The grand tourer was always as much about aspiration as transportation.

Most Expensive Cars in gtcars 💰
Top 10 by manufacturer's suggested retail price.
Manufacturer Model MSRP
Ferrari LaFerrari $1,416,362
Ford GT $447,000
Lamborghini Aventador $397,500
Rolls-Royce Dawn $335,000
Ferrari F12Berlinetta $319,995
Rolls-Royce Wraith $304,350
Ferrari GTC4Lusso $298,000
Ferrari FF $295,000
Ferrari 458 Speciale $291,744
Aston Martin Vanquish $287,250

The top ten most expensive cars in the dataset tell a clear story about where the money goes. Ferrari dominates the list with five entries, led by the LaFerrari at over $1.4 million (more than three times the price of any other car in the dataset). The Ford GT makes a surprising appearance at number two, representing America's answer to European exotica. A Lamborghini, two Rolls-Royces, and an Aston Martin round out the list, each occupying a different niche of the ultra-luxury market. Notably absent from the top ten are the German manufacturers, whose cars offer serious performance at comparatively accessible price points.

Performance Tiers
Grouping the 47 cars in gtcars by horsepower output.

Tier                  Models   Avg Torque (lb-ft)   Avg Price
Modest (<400 HP)      10       319                  $78,545
Strong (400-499 HP)   8        374                  $95,416
High (500-599 HP)     17       469                  $188,800
Extreme (600+ HP)     12       548                  $363,024

The relationship between horsepower and price is neither linear nor deterministic. Some modest-horsepower vehicles (certain Porsches, for instance) command premium prices through brand cachet and driving dynamics. Some extremely powerful vehicles achieve their output through brute displacement rather than exotic engineering, keeping prices relatively accessible. The correlation exists, but the exceptions tell interesting stories.
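The tiering itself is a one-step binning exercise; a sketch using base R's cut(), with break points taken from the tier labels above (the abbreviated tier names are mine):

```r
library(gt)     # provides the gtcars dataset
library(dplyr)

# Bin the 47 cars into the four horsepower tiers
tiers <-
  gtcars |>
  mutate(
    tier = cut(
      hp,
      breaks = c(0, 400, 500, 600, Inf),
      labels = c("Modest", "Strong", "High", "Extreme"),
      right = FALSE    # e.g., 400 HP falls in the 400-499 tier
    )
  ) |>
  group_by(tier) |>
  summarize(
    models = n(),
    avg_torque = mean(trq, na.rm = TRUE),
    avg_price = mean(msrp, na.rm = TRUE)
  )

tiers
```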

The choice to create gtcars was motivated by a desire for a modern equivalent to the venerable mtcars dataset that has shipped with R for decades. The original mtcars contains 1974 Motor Trend data on 32 automobiles, and it has been used in countless examples and tutorials. But cars from 1974 feel increasingly remote from contemporary experience. A dataset of modern luxury vehicles offers familiar reference points (Ferrari, Porsche, Aston Martin) and specifications that relate to cars people actually see on roads today.

For table-making purposes, gtcars provides natural groupings by manufacturer, multiple numeric columns suitable for formatting and comparison, and a mix of discrete and continuous variables. The manufacturer and model columns enable row grouping and stub labeling. The price column practically demands currency formatting. The horsepower and torque columns work well for bar chart visualizations within cells. It is a dataset that seems designed for beautiful tables because, in fact, it was.
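A minimal sketch of those table-making affordances, grouping by manufacturer for one country of origin; the column selection and labels are illustrative choices, not the only sensible ones:

```r
library(gt)     # provides the gtcars dataset
library(dplyr)

# German cars only, grouped by manufacturer in the table
germans <-
  gtcars |>
  filter(ctry_origin == "Germany") |>
  select(mfr, model, hp, trq, mpg_c, mpg_h, msrp)

tbl <-
  germans |>
  gt(rowname_col = "model", groupname_col = "mfr") |>
  fmt_currency(columns = msrp, decimals = 0) |>
  cols_label(
    hp = "HP", trq = "Torque",
    mpg_c = "City", mpg_h = "Hwy", msrp = "MSRP"
  )

tbl
```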

C.4 countrypops

The countrypops dataset tracks population estimates for countries worldwide from 1960 through the present (currently extending to 2024). The data comes from the World Bank, which compiles demographic estimates from national statistical offices, census data, and the United Nations Population Division. With over 13,000 rows covering more than 200 countries across six decades, it is one of the larger datasets in the gt collection.

Population Growth in Five Major Nations
1960 1980 2000 2020
Brazil 72.39M 121.21M 174.02M 208.66M
China 667.07M 981.23M 1.26B 1.41B
India 435.99M 687.35M 1.06B 1.40B
Nigeria 45.05M 73.76M 126.38M 214.00M
United States 180.67M 227.22M 282.16M 331.58M

Population data might seem straightforward, but it encodes profound stories of human migration, economic development, public health, and political change. China’s population trajectory shows the demographic impact of the one-child policy. Nigeria’s explosive growth reflects patterns common across sub-Saharan Africa. European countries exhibit the stagnation and aging that accompany developed economies. Each row is a snapshot of millions of individual lives aggregated into a single number.

C.4.1 Understanding population data

Population counts are harder to obtain than one might assume. Only a handful of countries conduct reliable censuses at regular intervals. Many estimates rely on birth and death registrations (which vary in completeness), surveys of representative samples (which involve statistical uncertainty), or projections from previous counts (which compound errors over time). The World Bank’s task is to synthesize these imperfect sources into consistent estimates that allow comparison across countries and years.

World's Most Populous Countries
2023 estimated population.
Population (2023)
India 1.44B
China 1.41B
United States 337M
Indonesia 281M
Pakistan 248M
Nigeria 228M
Brazil 211M
Bangladesh 171M
Russia 144M
Mexico 130M
Ethiopia 129M
Japan 125M
Philippines 115M
Egypt 115M
Congo (DRC) 106M

The uncertainties in population data matter for policy and planning. A country that believes it has 100 million people will allocate resources differently than one that believes it has 120 million. Census undercounts (common in remote areas, among marginalized populations, and in places where people distrust government) lead to underinvestment in precisely the communities that need services most. The countrypops figures represent best estimates, not ground truth, and users should remember this limitation.

That said, the trends in population data are generally reliable even when the absolute numbers carry uncertainty. If the World Bank estimates that Nigeria’s population doubled between 1990 and 2020, the actual growth was almost certainly substantial even if the precise figures might be revised. Trends matter more than point estimates for most analytical purposes, and the countrypops dataset captures these trends across the entire modern era of demographic record-keeping.

Population Change in Aging Societies
Index: 1990 = 100
1990 2000 2010 2020
Germany 100.0 103.5 103.0 104.7
Spain 100.0 104.4 119.8 121.8
Italy 100.0 100.4 105.5 104.8
Japan 100.0 102.7 103.7 102.3
South Korea 100.0 109.7 115.6 120.9
Values show population relative to 1990 baseline

The table above shows population indexed to 1990 for five countries facing demographic aging. Japan’s population has declined in absolute terms. Germany and Italy have barely grown. South Korea’s growth is slowing rapidly. These patterns reflect low birth rates, increased longevity, and (in some cases) restrictive immigration policies. The economic and social implications of aging populations (pension systems, healthcare costs, labor force composition) represent some of the most significant policy challenges of the coming decades.

The dataset updates whenever the World Bank publishes new estimates, rather than on any fixed release schedule. This ongoing maintenance means that examples in documentation and books remain current. A population figure for China in 2024 becomes available, and shortly thereafter it appears in countrypops. This currency makes the dataset more useful for teaching than static historical data would be.

For gt demonstrations, countrypops excels at time series comparisons, geographic groupings, and the handling of large numbers. The population values range from thousands (small island nations) to billions (China and India), exercising formatters across their full dynamic range. The longitudinal structure supports year-over-year comparisons, growth rate calculations, and the kind of decade-by-decade summary tables that appear in demographic reports.
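As a sketch of that dynamic range in action, fmt_number() with suffixing = TRUE condenses raw counts into the "208.66M" and "1.41B" style used in the tables above; the year and country selection here is arbitrary:

```r
library(gt)     # provides the countrypops dataset
library(dplyr)

# Five countries in 2020, formatted with magnitude suffixes
pops <-
  countrypops |>
  filter(
    year == 2020,
    country_name %in% c("Brazil", "China", "India",
                        "Nigeria", "United States")
  ) |>
  select(country_name, population)

tbl <-
  pops |>
  gt(rowname_col = "country_name") |>
  fmt_number(columns = population, suffixing = TRUE, decimals = 2)

tbl
```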

C.5 towny

While countrypops takes a global view, towny focuses on a single Canadian province: Ontario. The dataset contains population figures for 414 municipalities, including data from every Canadian census between 1996 and 2021 (conducted every five years) plus various geographic and administrative attributes. It exists because I actually live in Ontario and wanted an excuse to know more about the places surrounding me.

Ontario's Largest Municipalities
Population and density for the top 10, 2001 vs. 2021.

Municipality   Pop. 2001   Density 2001   Pop. 2021   Density 2021
Toronto        2,481,494   3,932.0        2,794,356   4,427.8
Ottawa         774,072     277.6          1,017,449   364.9
Mississauga    612,925     2,093.8        717,961     2,452.6
Brampton       325,428     1,223.9        656,480     2,469.0
Hamilton       490,268     438.4          569,353     509.1
London         336,359     799.9          422,324     1,004.3
Markham        208,615     989.0          338,503     1,604.8
Vaughan        182,022     668.1          323,103     1,186.0
Kitchener      190,399     1,391.7        256,885     1,877.7
Windsor        208,402     1,427.2        229,660     1,572.8
Density is measured in persons per km².

The data comes from Statistics Canada and reveals patterns that might surprise those unfamiliar with Canadian geography. Toronto dominates, of course, but the surrounding municipalities (Mississauga, Brampton, Hamilton) have grown substantially over twenty years. Some smaller towns have declined as economic opportunities concentrated elsewhere. The dataset captures this quiet drama of population redistribution that plays out across every country’s regions.

Ontario municipality names offer their own entertainment. Some are indigenous place names with beautiful sounds. Others commemorate British royalty or colonial administrators. A few seem almost whimsical when encountered for the first time. These names appear on highway signs and maps, marking places where real communities exist with their own histories and concerns. The towny dataset transforms those signs into data, inviting exploration of what lies behind the familiar names.

Fastest Growing Ontario Municipalities
Among places with 10,000+ residents in 2001.

Municipality                Pop. 2001   Pop. 2021   Growth
Milton                         31,471     132,979   322.5%
Whitchurch-Stouffville         22,859      49,864   118.1%
Brampton                      325,428     656,480   101.7%
Wasaga Beach                   12,419      24,862   100.2%
Bradford West Gwillimbury      22,228      42,880    92.9%
Vaughan                       182,022     323,103    77.5%
Ajax                           73,753     126,666    71.7%
East Gwillimbury               20,555      34,637    68.5%
New Tecumseth                  26,141      43,948    68.1%
Markham                       208,615     338,503    62.3%

The fastest-growing municipalities cluster around the Greater Toronto Area, where housing demand has driven expansion into formerly rural townships. Milton, Brampton, and Markham have transformed from small towns into substantial cities within a generation. The infrastructure challenges of this growth (roads, schools, healthcare, transit) consume enormous resources and dominate local politics. The towny data captures the before and after of this transformation but cannot convey the lived experience of watching farmland become subdivisions.

Not every municipality grew. Some communities in northern and eastern Ontario lost population as young people left for opportunities elsewhere. Factory closures, mine exhaustions, and the general drift of economic activity toward metropolitan areas hollowed out places that had thrived in earlier decades. The dataset does not distinguish between population loss from out-migration and loss from natural decrease (more deaths than births), but both dynamics contribute to the patterns visible in the numbers.

For table-making, towny provides opportunities for population density calculations, before-after comparisons, and growth rate analysis across its six census years. The land area column enables density visualization. The municipality names work naturally as row labels in grouped tables organized by population tier or geographic region.
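Both headline calculations, density and growth, are one-liners. A minimal Python sketch using Milton's census figures from the tables above (the function names are illustrative, not towny column names):

```python
# Population density and percentage growth: the two derived columns
# that towny-based tables most often need.

def density(population: int, land_area_km2: float) -> float:
    """Persons per square kilometre."""
    return population / land_area_km2

def growth_pct(pop_start: int, pop_end: int) -> float:
    """Percentage change between two census years."""
    return (pop_end - pop_start) / pop_start * 100

# Milton, 2001 -> 2021 (figures from the table above):
print(f"{growth_pct(31_471, 132_979):.1f}%")  # 322.5%
```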

C.6 peeps

The peeps dataset contains fictional personal information for 100 imaginary people: names, addresses, phone numbers, email addresses, and nationalities. These fake individuals were generated using an online tool that produces realistic-seeming demographic data, then verified for plausible formatting of addresses and contact information across different countries.

A Random Selection of Peeps

First Name   Last Name   Email                       Country
Krzysztof    Kowalczyk   krzysztof_k@example.com     Poland
Gaweł        Zając       gawelzajac@example.com      Poland
Eva          Simpson     eva_simpson@example.com     Canada
Rolla        Skov        rollaskov@example.com       Denmark
Oliver       Mikkelsen   oli_mikkelsen@example.com   Denmark
Letizia      Moretti     l_moretti@example.com       United Kingdom

The international scope was intentional. peeps was created specifically to demonstrate formatters like fmt_email(), fmt_country(), and fmt_flag(). Having people from various countries ensures that flag icons and country name formatting can be shown in realistic contexts. A realistic address book or contact directory contains international entries, and peeps provides exactly that.

C.6.1 The problem of synthetic data

Generating realistic fake data is harder than it sounds. Names must fit cultural expectations (a person from Japan should have a Japanese name). Addresses must follow country-specific formats (postal codes before or after city names, province abbreviations versus full names). Phone numbers must have correct country codes and plausible internal structure. Email addresses must look like real email addresses while clearly being fictional.

The country distribution in peeps emphasizes variety over statistical representativeness. Having multiple people from smaller countries ensures that formatting edge cases get tested. A dataset with 90 Americans and 10 others would not exercise international formatting as thoroughly as one with broader distribution.

The email domains follow patterns typical of real email usage: major providers dominate, with country-specific services appearing for non-English-speaking regions. This realism helps ensure that fmt_email() handles the variety of domain lengths and TLD formats that appear in actual contact databases.

Every person in the dataset is entirely fictional. The addresses do not correspond to real residences. The phone numbers should not connect to anyone. But the formatting follows authentic patterns for each country represented. A French address looks like a French address. A Japanese name follows Japanese naming conventions. This verisimilitude matters because formatting functions must handle real-world variation, and peeps provides test cases for that variation without compromising anyone’s actual privacy.

C.7 sza (solar zenith angles)

The sza dataset originates from atmospheric chemistry research, specifically from data tables published in textbooks by Finlayson-Pitts and Pitts. It records solar zenith angles (the angle between the sun and the vertical) across different latitudes and months. The original data came from a US government source that may no longer be online, but the values remain scientifically accurate and useful.

Solar Zenith Angles by Latitude and Month
Mid-latitude (30°)

        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
0530                                88.9   85.3   84.7   87.2
0600                         87.2   82.7   79.2   78.7   81.0   85.9
0630                  87.5   81.4   76.3   73.1   72.5   74.7   79.4   85.1
0700    89.4   86.2   81.1   74.9   69.9   66.8   66.3   68.3   72.9   78.7   84.6   88.7
0730    83.7   80.3   74.9   68.5   63.4   60.4   60.0   61.9   66.4   72.4   78.4   82.8
0800    78.3   74.6   68.8   62.1   56.9   54.0   53.5   55.4   60.0   66.2   72.8   77.3
0830    73.2   69.1   62.9   55.8   50.4   47.5   47.1   48.9   53.6   60.1   67.1   72.2
0900    68.4   64.1   57.4   49.6   44.0   41.0   40.6   42.4   47.3   54.4   62.0   67.3
0930    64.1   59.4   52.2   43.8   37.6   34.5   34.1   36.0   41.2   48.8   57.0   63.0
1000    60.4   55.3   47.5   38.2   31.4   28.1   27.7   29.7   35.5   43.8   53.0   59.2
1030    57.3   51.9   43.5   33.3   25.6   21.7   21.2   23.6   30.2   39.5   49.3   56.0
1100    55.0   49.3   40.3   29.3   20.4   15.7   15.1   18.1   25.8   36.1   46.7   53.7
1130    53.5   47.7   38.4   26.5   16.5   10.5    9.6   13.7   22.8   33.9   44.9   52.2
1200    53.0   47.2   37.7   25.5   15.0    8.0    6.9   11.9   21.6   33.1   44.4   51.8

Blank cells indicate times before sunrise in those months.

Solar zenith angles matter because they determine how much solar radiation reaches Earth’s surface and at what angle. This affects everything from climate modeling to photovoltaic panel efficiency to the rate of photochemical reactions in the atmosphere. At high latitudes in winter, the sun barely rises above the horizon (large zenith angles). At the equator, the sun passes nearly overhead at noon year-round (small zenith angles). The interplay between latitude, season, and time of day creates the patterns visible in the sza data.
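All of the patterns in the table fall out of a single spherical-trigonometry identity: cos Z = sin φ sin δ + cos φ cos δ cos h, where φ is latitude, δ is solar declination, and h is the hour angle. A Python sketch of the textbook approximation (ignoring atmospheric refraction):

```python
import math

def solar_zenith_deg(lat_deg: float, decl_deg: float,
                     hour_angle_deg: float) -> float:
    """Solar zenith angle Z from latitude, solar declination, and hour
    angle (0 at local solar noon, 15 degrees per hour away from it).
    Ignores atmospheric refraction."""
    lat, decl, h = map(math.radians, (lat_deg, decl_deg, hour_angle_deg))
    cos_z = (math.sin(lat) * math.sin(decl)
             + math.cos(lat) * math.cos(decl) * math.cos(h))
    return math.degrees(math.acos(cos_z))

# 30° N at solar noon near the June solstice (declination ~ +23.4°):
print(round(solar_zenith_deg(30, 23.4, 0), 1))  # 6.6
```

At solar noon the formula collapses to Z = |φ − δ|, which is why the midsummer noon values at 30° latitude sit in the single digits.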

C.7.1 Why zenith angles matter

When the sun sits directly overhead, sunlight travels through the minimum possible amount of atmosphere before reaching the surface. As the sun moves toward the horizon, light must traverse increasingly long atmospheric paths. This path length, often expressed as “air mass,” affects both the intensity and spectral composition of sunlight reaching the ground. Ultraviolet radiation attenuates more strongly through longer path lengths, which is why sunburns are most severe around solar noon in summer at lower latitudes.

Solar Zenith Angles Throughout the Day
By latitude and season.

        Jan   Apr   Jul   Oct
20°N
06:00         88°   82°
09:00   62°   46°   42°   50°
12:00   43°   16°         23°
40°N
06:00         87°   75°
09:00   76°   54°   41°   60°
12:00   63°   36°   17°   43°

The seasonal pattern emerges clearly at mid and high latitudes. January brings high zenith angles in the Northern Hemisphere as the sun traces its winter arc low in the southern sky. July reverses the pattern, with the sun climbing high overhead at noon. At the equator, seasonal variation is minimal: the sun is always nearly overhead at midday. These patterns have shaped human civilization, determining growing seasons, driving migration patterns, and inspiring astronomical observations throughout history.

Atmospheric chemists care about zenith angles because photochemical reactions require light. Photolysis rates vary with the intensity and spectrum of incoming solar radiation. A nitrogen dioxide molecule that might photolyze within seconds at tropical noon may persist for hours at polar twilight. Models simulating urban smog formation or stratospheric ozone depletion must account for these variations, typically by looking up appropriate photolysis rates from tables indexed by zenith angle and altitude.

The dataset reflects my background in atmospheric chemistry. Such data tables would be directly imported into atmospheric box models for simulating photolysis-based reactions of volatile organic compounds (VOCs). Having the data readily available in R format eliminates the tedious work of transcribing values from printed tables or scraping data from web pages.

For gt, the sza dataset demonstrates heatmap-style coloring (values naturally vary from low to high in meaningful patterns), missing value handling (the sun does not rise at certain latitude-time combinations), and the presentation of scientific lookup tables. The structure (rows indexed by time, columns by month, grouped by latitude) maps naturally onto the table designs that scientists create when presenting such reference data.

C.8 constants, reactions, photolysis, and nuclides

These four datasets form a cluster of scientific reference data, each addressing a gap in publicly available resources. While the underlying information exists in various forms (government pages, textbooks, specialized databases), it had not been consolidated into convenient data frames. The gt package provided an opportunity to change that.

The constants dataset contains 30 fundamental physical constants with their values, units, and uncertainties. These are the numbers that appear in physics and chemistry textbooks: the speed of light, Planck’s constant, Avogadro’s number, the gravitational constant. Each value comes with associated metadata specifying units and measurement precision.

Numbers That Define the Universe
Ten fundamental physical constants.

Constant                            Value          Uncertainty    Units
Avogadro constant                   6.02 × 10²³                   mol⁻¹
Bohr radius                         5.29 × 10⁻¹¹   8.00 × 10⁻²¹   m
Boltzmann constant                  1.38 × 10⁻²³                  J K⁻¹
electron mass                       9.11 × 10⁻³¹   2.80 × 10⁻⁴⁰   kg
elementary charge                   1.60 × 10⁻¹⁹                  C
fine-structure constant             7.30 × 10⁻³    1.10 × 10⁻¹²
Newtonian constant of gravitation   6.67 × 10⁻¹¹   1.50 × 10⁻¹⁵   m³ kg⁻¹ s⁻²
Planck constant                     6.63 × 10⁻³⁴                  J Hz⁻¹
proton mass                         1.67 × 10⁻²⁷   5.10 × 10⁻³⁷   kg
speed of light in vacuum            3.00 × 10⁸                    m s⁻¹

C.8.1 The certainty of physical constants

Physical constants occupy a peculiar position in science. They are measured quantities, subject to experimental uncertainty, yet they describe fundamental features of the universe that we believe to be genuinely constant. The speed of light in vacuum, for instance, is now defined as exactly 299,792,458 meters per second, but this definition only became possible after decades of increasingly precise measurements. Before 1983, we measured the speed of light; now the meter is defined in terms of light’s speed. The historical progression of uncertainty shrinking toward zero tells a story of experimental ingenuity.

Physical Constants by Measurement Precision

Precision                 Constants   Example
Extraordinarily precise   51          alpha particle-electron mass ratio
Extremely precise         133         alpha particle mass
Very precise              59          Angstrom star
Moderately precise        30          deuteron rms charge radius

The measurement precision varies dramatically across constants. The fine-structure constant (approximately 1/137) has been measured to extraordinary precision through quantum electrodynamics experiments. The gravitational constant, despite being one of the first constants ever measured (by Cavendish in 1798), remains relatively imprecise because gravity is so weak that experiments must contend with tiny forces and correspondingly large relative uncertainties.
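Precision categories like those above follow from the relative uncertainty, the uncertainty divided by the value. A sketch using two rows from the constants table (the category thresholds themselves are not shown here):

```python
# Relative uncertainty: the dimensionless ratio that makes constants
# of wildly different magnitudes comparable.

def relative_uncertainty(value: float, uncertainty: float) -> float:
    return uncertainty / value

# Fine-structure constant: known to about 1 part in 10^10
print(f"{relative_uncertainty(7.30e-3, 1.10e-12):.1e}")   # 1.5e-10
# Newtonian constant of gravitation: only about 2 parts in 10^5
print(f"{relative_uncertainty(6.67e-11, 1.50e-15):.1e}")  # 2.2e-05
```

Five orders of magnitude separate the two, which is exactly the gap between the "extraordinarily precise" and "moderately precise" ends of the table.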

C.8.2 Atmospheric chemistry

The reactions dataset catalogs 1,344 atmospheric chemical reactions with their rate constants and temperature dependencies. The photolysis dataset provides photolysis rates for organic compounds, including spectral data stored in list columns. These datasets rarely exist in such accessible form. Researchers typically extract reaction rates from individual journal articles or specialized databases with restrictive access. Having a curated collection in R format simplifies atmospheric modeling exercises.

Selected Atmospheric Reactions with OH

Compound                    Formula    OH Rate at 298 K
beta-caryophyllene          C₁₅H₂₄     2.00 × 10⁻¹⁰
3-hydroxy-2-butanone        C₄H₈O₂     9.70 × 10⁻¹²
O-methyl-N-ethylcarbamate   C₄H₉NO₂    1.05 × 10⁻¹¹
4-methyl-1,3-dioxane        C₅H₁₀O₂    1.13 × 10⁻¹¹
morpholine                  C₄H₉NO     1.10 × 10⁻¹⁰
n-propyl propanoate         C₆H₁₂O₂    4.00 × 10⁻¹²
anthracene                  C₁₄H₁₀     1.17 × 10⁻¹⁰
triethyl phosphate          C₆H₁₅O₄P   4.68 × 10⁻¹¹

Rate constants are in cm³ molecule⁻¹ s⁻¹.

Understanding atmospheric chemistry requires knowing how fast different reactions proceed under various conditions. The hydroxyl radical (OH) drives much of daytime atmospheric chemistry, attacking volatile organic compounds and beginning their oxidation chains. Nitrate radicals take over at night. Photolysis reactions require sunlight, their rates varying with solar zenith angle and wavelength. Each rate constant in the dataset represents experimental determinations, often from smog chamber studies or theoretical calculations validated against field measurements.
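A standard use of such rate constants is estimating a compound's atmospheric lifetime against OH attack: τ = 1 / (k · [OH]). A Python sketch; the OH concentration of 1.5 × 10⁶ molecules cm⁻³ is a commonly cited daytime global mean, assumed here purely for illustration:

```python
# Atmospheric lifetime against OH attack: tau = 1 / (k_OH * [OH]).

def oh_lifetime_s(k_oh: float, oh_conc: float = 1.5e6) -> float:
    """Lifetime in seconds. k_oh in cm^3 molecule^-1 s^-1,
    oh_conc in molecules cm^-3."""
    return 1.0 / (k_oh * oh_conc)

# beta-caryophyllene (k = 2.00e-10 from the table above):
tau = oh_lifetime_s(2.00e-10)
print(f"{tau:.0f} s, roughly {tau / 60:.0f} minutes")  # 3333 s, roughly 56 minutes
```

A lifetime under an hour explains why highly reactive biogenic compounds like beta-caryophyllene rarely travel far from their sources.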

C.8.3 Nuclear data

The nuclides dataset compiles nuclear data for isotopes: half-lives, decay modes, particle emissions. Like the chemical datasets, this information exists scattered across various sources but had not been unified into a single convenient data frame. Nuclear chemistry courses and research often require looking up isotope properties, and nuclides consolidates those lookups.

Radioactive Decay Modes

Decay Mode         Nuclides
Alpha              473
Beta plus          4
Beta minus         1224
Electron capture   137

Radioactive decay follows several pathways depending on the nuclear configuration. Beta-minus decay converts a neutron to a proton, moving the nucleus one step higher in atomic number. Alpha decay ejects a helium nucleus, reducing both atomic number and mass. Electron capture pulls an inner-shell electron into the nucleus. Each decay mode appears in the dataset, along with the half-life governing how quickly unstable nuclei transform.

For gt demonstrations, these scientific datasets showcase formatters like fmt_scientific(), fmt_chem(), and fmt_units(). They also demonstrate from_column() usage, where the number of decimal places might come from an adjacent precision column rather than being hard-coded. Scientific communication demands careful attention to significant figures and uncertainty, and these datasets provide realistic contexts for that attention.

C.9 sp500

The sp500 dataset contains daily stock market data for the S&P 500 index from 1950 through 2015: opening price, closing price, high, low, and trading volume for each trading day. With over 16,000 rows, it provides a substantial corpus of financial time series data.

Black Monday and the Week That Shook Wall Street
S&P 500 daily prices, October 14–23, 1987.

Day                    Open      Close     Volume
1987-10-14 Wednesday   $314.52   $305.23   207.40M
1987-10-15 Thursday    $305.21   $298.08   263.20M
1987-10-16 Friday      $298.08   $282.70   338.50M
The Weekend
1987-10-19 Monday      $282.70   $224.84   604.30M
1987-10-20 Tuesday     $225.06   $236.83   608.10M
1987-10-21 Wednesday   $236.83   $258.38   449.60M
1987-10-22 Thursday    $258.24   $248.25   392.20M
1987-10-23 Friday      $248.29   $248.22   245.60M

Financial data demands specific formatting conventions: currency symbols, appropriate decimal precision, volume abbreviations. The sp500 dataset exercises these requirements across decades of market history. Bull markets, bear markets, crashes, and recoveries all appear in the data. The 1987 Black Monday crash, the dot-com bubble, the 2008 financial crisis: each left its signature in closing prices and trading volumes.

C.9.1 Reading the market’s history

The S&P 500 tracks 500 large-cap American companies, weighted by market capitalization. It serves as the benchmark against which most US equity investments are measured. A fund that “beats the market” outperforms the S&P 500. An investor who wants “market returns” buys an index fund tracking the S&P 500. The index represents the collective judgment of millions of market participants about the value of corporate America.

S&P 500 Annual Summary
2006–2015

Year   Open     Close    Annual Return   Year High   Year Low
2006   $1,248   $1,418   +13.6%          $1,432      $1,219
2007   $1,418   $1,468   +3.5%           $1,576      $1,364
2008   $1,468   $903     −38.5%          $1,472      $741
2009   $903     $1,115   +23.5%          $1,130      $667
2010   $1,117   $1,258   +12.6%          $1,263      $1,011
2011   $1,258   $1,258   0.0%            $1,371      $1,075
2012   $1,259   $1,426   +13.3%          $1,475      $1,259
2013   $1,426   $1,848   +29.6%          $1,849      $1,426
2014   $1,846   $2,059   +11.5%          $2,094      $1,738
2015   $2,059   $2,044   −0.7%           $2,135      $1,867

The years 2006 through 2015 illustrate the market’s capacity for dramatic swings. 2008 stands out with its catastrophic decline during the financial crisis, when the index lost more than a third of its value. The subsequent years show gradual recovery, with the index reaching new highs by the early 2010s. Anyone who sold during the panic of 2008 locked in losses. Anyone who held through the crisis recovered and then some. The data tells both stories depending on which slices you examine.

The dataset originated from a web search that turned up historical market data, possibly compiled from Kaggle or similar sources. The exact provenance matters less than the utility: a long time series of financial data in a format ready for analysis and visualization. For teaching purposes, the sp500 dataset provides realistic data for demonstrating time series analysis, returns calculations, volatility measurement, and the kind of financial tables that appear in annual reports and investment presentations.
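The annual-return column in the summary table above is simply the ratio of closing to opening index level, minus one. A minimal Python sketch using two years from that table:

```python
# Simple (non-compounded) return between two index levels.

def annual_return(open_level: float, close_level: float) -> float:
    return close_level / open_level - 1

print(f"2008: {annual_return(1468, 903):+.1%}")   # -38.5%
print(f"2013: {annual_return(1426, 1848):+.1%}")  # +29.6%
```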

C.10 metro

The Paris Métro is one of the world’s great urban transit systems. Opened in 1900, it has grown to 16 lines serving 308 stations across the city and surrounding communes. The metro dataset captures this network: station names, locations, line assignments, opening dates, and ridership figures.

Busiest Paris Métro Stations

Station                            Lines             Annual Passengers
Gare du Nord                       4, 5              34.50M
Saint-Lazare                       3, 12, 13, 14     33.13M
Gare de Lyon                       1, 14             28.64M
Montparnasse—Bienvenüe             4, 6, 12, 13      20.41M
Gare de l'Est                      4, 5, 7           15.54M
Bibliothèque François Mitterrand   14                11.10M
République                         3, 5, 8, 9, 11    11.08M
Les Halles                         4                 10.62M
La Défense                         1                 9.26M
Châtelet                           1, 4, 7, 11, 14   8.35M

The dataset exists because I really admire the Paris Métro. Among the world’s subway systems, it stands out for its density, connectivity, and integration with other transit modes (RER commuter rail, buses, trams, and high-speed TGV connections). The wayfinding and signage are exemplary. The expansion plans are ambitious and consistently executed. It represents what urban transit can be when treated as essential infrastructure rather than an afterthought.

C.10.1 A brief history of the Métro

The story of the Paris Métro begins in the late nineteenth century, when Paris faced the same urban transportation crisis that afflicted every growing industrial city. Horse-drawn omnibuses clogged the boulevards. The wealthy rode in private carriages while workers walked miles to reach their jobs. London had opened its Underground in 1863, demonstrating that subterranean railways could move masses of people efficiently. Paris, perennially competitive with its cross-Channel rival, needed its own solution.

The first line opened on July 19, 1900, timed to coincide with the Exposition Universelle that drew millions of visitors to Paris that summer. Line 1 ran from Porte de Vincennes to Porte Maillot, connecting the eastern and western edges of the city through its commercial heart. The stations featured distinctive Art Nouveau entrances designed by Hector Guimard, with their sinuous cast-iron curves and amber glass panels that remain iconic more than a century later. Not all survived (many were removed during mid-century modernization campaigns and later regretted), but those that remain are protected monuments.

The network expanded rapidly in its early decades. By 1910, Paris had six lines. By 1920, ten. This breakneck pace was not entirely the product of unified planning. The Compagnie du chemin de fer métropolitain de Paris (CMP) held the primary concession, but it faced competition from the Nord-Sud Company, which built what would become Lines 12 and 13. The two companies raced to serve lucrative routes, and their rivalry accelerated construction beyond what a single monopoly might have achieved. The Nord-Sud stations were arguably more elegant, with ceramic tile work and distinctive lettering that enthusiasts still admire. When the companies merged in 1930, Paris inherited a network that had been built fast precisely because multiple actors were competing to build it.

This frenetic early growth distinguished Paris from its European peers. London’s Underground, though older, expanded more cautiously under a patchwork of private companies that often duplicated routes rather than extending coverage. The Berlin U-Bahn, which opened in 1902, grew steadily but faced the complication of serving multiple municipalities that would not unify until 1920. Paris benefited from centralized city planning within the relatively compact boundaries of the twenty arrondissements, allowing the CMP and Nord-Sud to build a coherent network even while competing. By 1930, Paris had more stations than London despite London’s forty-year head start.

The guiding philosophy was density: stations placed close together (often just 500 meters apart) so that no Parisian would have to walk more than a few minutes to reach the Métro. This density distinguishes Paris from systems like Washington DC or the Bay Area’s BART, where stations are spaced miles apart and require feeder buses or long walks. The Paris approach sacrifices speed between stations for convenience of access, a tradeoff that makes sense for a compact, dense city.

125 Years of Métro Expansion
How the network grew decade by decade.

Decade   Stations Opened   Notable Events
1900s    65                Line 1 opens for World's Fair
1910s    85                Rapid expansion across Paris
1920s    60                Network reaches most arrondissements
1930s    25                Great Depression slows construction
1940s    5                 World War II occupation
1950s    15                Post-war reconstruction begins
1960s    12                RER regional express network planned
1970s    8                 RER lines A and B open
1980s    18                Line 14 planning begins
1990s    10                Line 14 opens (first automated line)
2000s    8                 Line 14 extended
2010s    14                Line extensions to suburbs
2020s    10                Grand Paris Express under construction

The interwar period saw continued expansion but also financial difficulties. The 1930s depression slowed construction, and the network that had seemed destined for endless growth began to stabilize. World War II brought occupation and disruption. The Métro continued to operate (the Germans found it useful for moving troops and supplies), but expansion halted and maintenance suffered. Several stations were closed and converted to other uses, some serving as air raid shelters.

Post-war reconstruction proceeded slowly. The immediate decades after 1945 focused on repairing damage and updating aging infrastructure rather than building new lines. The real transformation came in the 1960s and 1970s with the creation of the RER (Réseau Express Régional), a network of express lines that tunneled through central Paris but extended far into the suburbs. The RER was not technically part of the Métro but integrated seamlessly with it, allowing commuters to transfer between the dense inner-city network and the faster regional lines.

C.10.2 The modern network

Today’s Métro comprises 16 lines totaling over 220 kilometers of track. The numbering seems haphazard (there are lines 1 through 14, plus 3bis and 7bis), reflecting historical accidents rather than logical planning. Lines 3bis and 7bis were originally branches of their parent lines that later gained operational independence. The system carries approximately 4 million passengers daily, making it one of the world’s busiest rapid transit networks.

Paris Métro Lines
Current network statistics.

Line   Length (km)   Stations   Automated
1      16.6          25         Yes
2      12.4          25
3      11.7          25
3bis   1.3           4
4      12.1          27         Yes
5      14.6          22
6      13.6          28
7      22.4          38
7bis   3.1           8
8      23.4          38
9      19.6          37
10     11.7          23
11     6.3           13
12     13.9          29
13     24.3          32
14     14.0          13         Yes

Line 14 deserves special attention as the system’s showcase. Opened in 1998, it was the first fully automated line on the network, operating without drivers. Platform screen doors prevent accidents (a significant concern on older lines) and allow trains to run with shorter headways. The stations feel modern and spacious compared to the cramped nineteenth-century tunnels of earlier lines. Line 14 demonstrated that new construction was possible and could achieve standards superior to the historical network. It has since been extended multiple times and serves as the template for future expansion.

The Grand Paris Express, currently under construction, represents the most ambitious expansion since the network’s founding. This project will add four new automated lines (15, 16, 17, and 18) encircling the existing network and connecting suburban centers that currently require traveling through central Paris to reach one another. When complete, probably sometime in the 2030s, the Grand Paris Express will nearly double the length of the automated network and fundamentally reshape mobility patterns in the Île-de-France region.

C.10.3 What makes the Métro work

Several design principles distinguish the Paris Métro from less successful transit systems. First, the high density of stations means that walking to the Métro is almost always faster than driving to a parking lot. This convenience generates ridership that justifies the investment. Second, the integration with other modes is seamless. The same ticket works on Métro, RER, buses, and trams within Paris. Transfer stations connect lines at useful angles rather than requiring passengers to exit one system and enter another. Third, the frequency of service makes timetables irrelevant. During peak hours, trains arrive every two minutes on busy lines. Even late at night, waits rarely exceed ten minutes. Passengers simply show up and go.

The signage and wayfinding deserve particular praise. Station names appear in a consistent typeface (Parisine, designed specifically for the Métro in 1996) on tiled walls visible from passing trains. Corridor signs point toward exits, transfers, and surface landmarks with clarity that serves tourists and commuters alike. The colored line numbers and terminus names provide all the information needed to navigate without consulting maps. Many transit systems aspire to this legibility but few achieve it so thoroughly.

Ridership by Line Assignment
Stations grouped by their line connections.

Line(s)   Station Count   Total Ridership   Avg per Station
7         28              63.28M            2.26M
9         23              60.64M            2.64M
13        23              57.83M            2.51M
1         14              55.93M            3.99M
4         16              53.66M            3.35M
8         26              50.48M            1.94M
12        21              39.59M            1.89M
3         17              38.72M            2.28M
2         15              35.11M            2.34M
4, 5      1               34.50M            34.50M

The Métro also benefits from Paris’s urban form. The city is dense and compact, with most destinations within walking distance of a station. Zoning never separated residential from commercial uses as strictly as in American cities, so people live near where they work and shop. The Métro did not create this urban form (it predates the Métro by centuries), but the two reinforce one another. Dense cities need mass transit, and mass transit makes density livable.

For the metro dataset, this context matters. The station names are not arbitrary labels but markers of neighborhoods with distinct characters. The ridership figures reflect how Parisians actually move through their city. The line assignments show which routes carry the heaviest loads and which serve more specialized purposes. Understanding the Métro as a living system, constantly adapting over 125 years of operation, makes the dataset more meaningful than raw numbers alone could convey.

C.10.4 The future of Paris transit

The Grand Paris Express will transform the region, but it is only part of a broader vision. Line 14 continues to extend northward and southward. Line 11 is being extended to connect new suburbs. Tram lines are expanding along the outer boulevards, and bus networks are being reorganized to feed into the rail system more efficiently. The goal is a regional transit network that allows travel between any two points without necessarily passing through central Paris.

The dataset updates periodically to reflect these changes. New stations appear as they open. Ridership figures are updated with each annual release. The metro data is not a static snapshot but an evolving portrait of a transit system that continues to grow and adapt. Future versions will include the Grand Paris Express stations, extending coverage far beyond the historical city limits.

For gt demonstrations, the dataset provides geographic data with French language station names, offering opportunities to demonstrate locale handling and the presentation of transit network information. The ridership figures support ranking tables. The line assignments (stored as comma-separated values) demonstrate handling of multi-valued fields. The opening dates span over a century, creating interesting timelines. But beyond these technical uses, the dataset offers a window into one of humanity’s great collective achievements: a transit system that moves millions of people daily, efficiently and reliably, through one of the world’s most beautiful cities.
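The comma-separated line assignments mentioned above come apart with ordinary string handling. A Python sketch of that first step toward per-line grouping:

```python
# metro stores a station's line assignments as comma-separated text,
# e.g. "4, 5" for Gare du Nord. Splitting them is the first step
# toward per-line aggregation and counting.

def parse_lines(lines_field: str) -> list[str]:
    return [line.strip() for line in lines_field.split(",")]

print(parse_lines("3, 12, 13, 14"))         # ['3', '12', '13', '14']
print(len(parse_lines("1, 4, 7, 11, 14")))  # 5 lines at Châtelet
```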

C.11 gibraltar

The gibraltar dataset contains hourly weather observations from Gibraltar during May 2023: temperature, humidity, wind speed, cloud cover, and other meteorological variables. It provides 744 rows representing each hour of a single month in this small but fascinating territory.

Gibraltar Morning Weather
May 1, 2023: fog clearing to fair skies.

Time    Temp (°C)   Humidity   Wind (km/h)   Direction   Condition
06:50   17.2        72%        0.4           W           Fair
07:50   17.8        88%        0.9           NE          Patches of Fog
08:50   17.2        82%        0.9           W           Patches of Fog
09:20   17.8        77%        2.7           WSW         Patches of Fog
09:50   17.8        77%        2.2           WSW         Fair
10:20   18.9        73%        2.7           SW          Fair
10:50   21.1        64%        1.3           WSW         Fair
11:20   21.1        68%        2.7           ESE         Fair
11:50   22.2        60%        2.2           SE          Fair
12:20   22.2        60%        2.2           E           Fair
12:50   22.2        60%        2.2           E           Fair
13:20   22.2        64%        2.7           E           Fair
13:50   22.2        64%        2.7           E           Fair

Gibraltar sits at the southern tip of the Iberian Peninsula, a British Overseas Territory of barely seven square kilometers guarding the entrance to the Mediterranean Sea. It is the kind of place that captures the imagination precisely because it seems improbable: a limestone promontory with its own airport runway crossing the main road, Barbary macaques roaming the upper rock, and a rather complex history.

C.11.1 Understanding the Rock

The Rock of Gibraltar rises 426 meters above sea level, a dramatic limestone formation that has served as a strategic landmark for millennia. The ancient Greeks called it one of the Pillars of Hercules, marking the edge of the known world. Every Mediterranean power has recognized its importance: control Gibraltar and you control access between the Atlantic and the Mediterranean. The British acquired it in 1704 during the War of the Spanish Succession and have held it ever since, despite periodic Spanish objections and one famous siege that lasted nearly four years.

Gibraltar Weather by Time of Day
May 2023

           Temperature (°C)           Humidity  Wind
           Average  Maximum  Minimum
Morning       19.0     23.9     13.9        1%   3.5
Afternoon     21.6     30.0     15.0        1%   4.7
Evening       21.3     28.9     15.0        1%   4.4
Night         18.9     27.2     13.9        1%   4.0

The May weather data captures Gibraltar in spring, before the intense heat of Mediterranean summer arrives. Temperatures climb through the afternoon hours and descend through the evening, following the familiar diurnal pattern. Humidity inversely tracks temperature, rising as the air cools. Wind direction matters at Gibraltar: the Levante wind blows from the east through the strait, often bringing fog as Mediterranean moisture condenses against the Rock. The Poniente arrives from the west, drier and clearer. These wind patterns shaped navigation through the strait for centuries of sailing ships.

Wind Direction Frequency
May 2023

Direction  Hours
E            271
W            216
ENE          171
WSW          165
NE           118
SSW          115
ESE          111
SW           102
S             52
NNE           33
SE            33
WNW           15
NNW           10
NW             8
N              6
SSE            4
CALM           1

The predominance of certain wind directions reflects the geography of the strait. Air flows through the narrow gap between Europe and Africa, channeled by the mountains on either side. Local topography further complicates matters: the Rock itself creates wind shadows and acceleration zones. Pilots landing at Gibraltar Airport must contend with these effects, making it one of the more challenging airports in Europe. The runway crosses Winston Churchill Avenue, requiring traffic to stop when aircraft land or take off.

May was chosen simply to provide pre-summer weather data. Gibraltar’s Mediterranean climate means mild, pleasant conditions that month, with temperatures climbing toward but not yet reaching peak summer heat. The specific year (2023) holds no particular significance beyond being recent enough for the data to feel current. The data comes from weather APIs providing historical observations, typical of the sources that make meteorological data increasingly accessible for analysis and visualization.

For gt, the dataset demonstrates time series formatting, weather data presentation, and the handling of multiple related numeric columns. Temperature formatting, wind direction encoding, and the diurnal patterns visible in hourly data all provide teaching opportunities.

C.12 films

The films dataset is a labor of love: a comprehensive record of every film that has competed for the Palme d’Or at the Cannes Film Festival. It contains 1,607 entries spanning the festival’s history, with each row recording a film’s title (in both English and original language), director, year, country of origin, spoken languages, and IMDb link.

Cannes Film Festival 2019
Official Competition

Film                              Director                                  Country
A Hidden Life                     Terrence Malick                           United Kingdom, Germany, United States
Atlantics                         Mati Diop                                 France, Senegal, Belgium
Bacurau                           Juliano Dornelles, Kleber Mendonça Filho  Brazil, France
Pain and Glory                    Pedro Almodóvar                           Spain, France
Frankie                           Ira Sachs                                 France, Portugal
Parasite                          Bong Joon Ho                              South Korea
The Traitor                       Marco Bellocchio                          Italy, France, Germany, Brazil
It Must Be Heaven                 Elia Suleiman                             France, Qatar, Germany, Canada, Turkey, Palestine
The Whistlers                     Corneliu Porumboiu                        Romania, France, Germany, Switzerland, Sweden
Young Ahmed                       Jean-Pierre Dardenne, Luc Dardenne        Belgium, France
Les Misérables                    Ladj Ly                                   France
Little Joe                        Jessica Hausner                           Austria, United Kingdom, Germany, France
Matthias & Maxime                 Xavier Dolan                              Canada
Mektoub, My Love: Intermezzo      Abdellatif Kechiche                       France
The Wild Goose Lake               Yi'nan Diao                               China, France
Once Upon a Time in... Hollywood  Quentin Tarantino                         United States, United Kingdom, China
Portrait of a Lady on Fire        Céline Sciamma                            France
Oh Mercy!                         Arnaud Desplechin                         France
Sibyl                             Justine Triet                             France, Belgium
Sorry We Missed You               Ken Loach                                 United Kingdom, France, Belgium
The Dead Don't Die                Jim Jarmusch                              United States

The dataset exists because I really like watching movies. My Letterboxd account (letterboxd.com/rich_i/) tracks my viewing history and provides an ongoing record of films watched and opinions formed (expressed as star ratings). Film festivals provide endless opportunities for discovery, surfacing works that might never reach mainstream distribution. The Cannes Film Festival, as the most prestigious venue for international cinema, seemed like essential data that should exist in an accessible format. But no such dataset was publicly available. The only logical solution was to create one.

C.12.1 Building the Cannes dataset

Construction required extensive research spanning months of work. The festival’s official website provided the foundation, listing competition entries by year. But the website alone was insufficient. Many older entries appeared only with French titles, requiring investigation to find corresponding English names (or vice versa for English-language films shown under French titles). Some films had been released under multiple names in different markets, demanding careful verification of which title was authoritative.

Cannes Competition Entries by Year
Sample of years from 1970 onward.

Year  In-Competition Films
1970  25
1971  26
1972  25
1973  24
1974  26
1975  22
1976  20
1977  23
1978  23
1979  21
1980  23
1981  22
1982  22
1983  22
1984  19
1985  20
1986  20
1987  20
1988  21
1989  22

IMDb links were tracked down for each entry, providing viewers easy access to cast lists, synopses, and user ratings. This was straightforward for recent films but required detective work for older or more obscure entries. Some films from the 1950s and 1960s had minimal online presence, with IMDb pages containing little additional information. But the links exist for completeness, allowing interested viewers to explore further.

Spoken languages and countries of origin required the most careful coding. International co-productions muddy the concept of a film’s “country”. Is a film shot in France, funded by German and Italian producers, directed by a Polish filmmaker, and starring British actors a French film? The dataset records all countries involved in production, accepting that many films defy simple national categorization. Languages posed similar challenges: a film might be primarily in French with scenes in Arabic and English, and all three languages deserve acknowledgment. Where multiple languages appear, I tried to arrange them roughly by the quantity of words spoken in each, so the first language listed is generally the one that dominates the dialogue.

C.12.2 The festival and its significance

The Cannes Film Festival has operated since 1946 (with a brief predecessor event in 1939 interrupted by war). It functions simultaneously as a trade show for film distribution, a competition for artistic achievement, and a showcase for celebrity culture. The Palme d’Or, awarded to the best film in competition, carries considerable prestige. Winners enter the canon of international cinema, their directors’ careers transformed by the recognition.

Countries by Cannes Competition Entries
Single-country productions across festival history.

Country         Films in Competition
United States   201
France          150
United Kingdom   81
Italy            73
Japan            56
USSR             44
Spain            39
Germany          36
Hungary          32
Sweden           27
Mexico           25
Poland           24

The table above reveals which national cinemas have received the most recognition at Cannes. The United States leads the count of single-country productions, with France close behind, unsurprisingly for a French festival on the French Riviera. The United Kingdom and Italy follow, both countries with robust film industries and strong traditions of auteur filmmaking. Japan’s presence reflects the festival’s long appreciation for directors like Kurosawa, Ozu, and more recently Kore-eda and Hamaguchi. The geographic diversity of competition entries has increased over time, with films from Korea, Iran, Thailand, and other countries appearing regularly in recent decades.

The festival also reflects changing tastes and priorities in world cinema. In its early decades, Cannes emphasized European art cinema and established masters. The 1970s brought more adventurous programming, with controversial entries and recognition for directors working outside commercial constraints. Recent years have seen increased attention to women directors (historically underrepresented) and to cinemas from regions previously marginalized in international distribution.

Most Frequent Cannes Competitors
Directors with 5+ competition entries.

Director                            Competition Entries  Years Active
Ken Loach                                            15  1981-2023
Jean-Pierre Dardenne, Luc Dardenne                   10  1999-2025
Wim Wenders                                          10  1976-2023
Carlos Saura                                          9  1960-1988
Lars von Trier                                        9  1984-2011
Nanni Moretti                                         9  1978-2023
Ettore Scola                                          8  1970-1989
Jim Jarmusch                                          8  1986-2019
Marco Bellocchio                                      8  1980-2023
Marco Ferreri                                         8  1963-1991

The directors who return repeatedly to Cannes competition form a roster of international cinema’s most celebrated figures. Their repeated presence reflects both the festival’s loyalty to directors it has championed and these filmmakers’ continued production of work deemed worthy of the world’s most competitive showcase. For many, a Cannes premiere represents the peak of artistic recognition, the moment when a new work enters the conversation of global cinema.

C.12.3 Film as data

The films dataset demonstrates that even cultural artifacts can be structured for analysis. Each film becomes a row with attributes: title, year, director, country, language. These attributes support queries that would be tedious to answer through casual browsing. Which directors have competed most often? How has the linguistic diversity of competition entries changed over time? What proportion of recent competitors are first-time Cannes directors versus returning favorites?

For gt specifically, films demonstrates fmt_flag() and fmt_country() in realistic contexts. The country codes translate directly to flag icons, creating visual tables that communicate nationality at a glance. The categorical structure (years, directors, countries) provides natural grouping opportunities. The IMDb URLs demonstrate link formatting for external references. It is a dataset that makes beautiful tables almost by accident, because film data is inherently interesting to display.
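A minimal sketch of that flag formatting, assuming a column of comma-separated ISO 3166-1 alpha-2 codes (the tiny stand-in data frame and its column names are assumptions for illustration, not the dataset’s real structure):

```r
library(gt)

# Stand-in for a slice of films: comma-separated country codes per entry
entries <- data.frame(
  title     = c("Parasite", "Atlantics", "Les Misérables"),
  countries = c("KR", "FR,SN,BE", "FR")
)

entries |>
  gt() |>
  fmt_flag(columns = countries)  # each code is rendered as a small flag icon
```

Swapping fmt_flag() for fmt_country() on the same kind of column would print full country names instead of icons.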

The dataset updates annually as each new festival adds to the historical record. Every May, the Cannes competition announces its official selection, and those entries will appear in future versions of films. The ongoing maintenance reflects both practical utility (keeping examples current) and personal interest (following each year’s festival with the attention of a devoted fan).

C.12.4 My Letterboxd

The films dataset exists because I really enjoy movies, and that love extends well beyond festival competition entries. Below is a searchable, sortable table of every film I’ve watched (and mostly rated) on Letterboxd. The data was assembled using scripts in this book’s repository (scripts/scrape-letterboxd.R), which merge the Letterboxd data export files and fetch director information from individual film pages.

Rich’s Letterboxd
All 966 watched films.

One detail worth noting is that the star ratings are formatted with fmt() rather than stored as a pre-formatted text column. This matters for interactive tables because fmt() changes only the display while preserving the underlying numeric values. When a user clicks the Rating column header to sort, the table sorts on the original numbers (5, 4.5, 4, …) rather than on rendered text like “★★★★½”, which would sort alphabetically and produce nonsensical results. It is a small trick but an important one whenever you need sortable columns with custom formatting in opt_interactive() tables!
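A sketch of the idea with a hypothetical slice of the watch list (the stars() helper and the sample data are mine; fmt() and opt_interactive() are the gt functions the text describes):

```r
library(gt)

# Render a numeric rating as stars; the data column itself stays numeric
stars <- function(x) {
  paste0(strrep("\u2605", floor(x)), ifelse(x %% 1 >= 0.5, "\u00bd", ""))
}

# Hypothetical excerpt of the Letterboxd data
watched <- data.frame(
  title  = c("Parasite", "Atlantics", "Frankie"),
  rating = c(5, 4, 3.5)
)

watched |>
  gt() |>
  fmt(columns = rating, fns = stars) |>  # display only; sorting stays numeric
  opt_interactive(use_sorting = TRUE)
```

Because fmt() only touches the rendered output, clicking the Rating header sorts on 5 > 4 > 3.5, not on the star glyphs.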

C.13 illness

The illness dataset takes a different approach than the others. Rather than modeling behavior or compiling reference data, it reproduces a single table from a published scientific article. The source is “A fatal yellow fever virus infection in China: description and lessons” from Emerging Microbes & Infections (July 2016), which documented laboratory test results for a patient who contracted yellow fever during travel to Angola.

Test         Units          Day 3     Day 7    Day 9
Viral load   copies per mL  12000.00  760.00   250.00
WBC          ×10⁹/L         5.26      24.77    19.03
Neutrophils  ×10⁹/L         4.87      22.08    16.59
RBC          ×10¹²/L        5.72      4.12     3.32
Hb           g/L            153.00    75.00    95.00
PLT          ×10⁹/L         67.00     74.10    25.60
ALT          U/L            12835.00  1623.70  512.40
AST          U/L            23672.00  2189.00  782.50
TBIL         µmol/L         117.20    127.30   163.20
DBIL         µmol/L         71.40     117.80   126.30

The article is freely available under a Creative Commons license, making reproduction appropriate. The dataset was created specifically to test gt’s fmt_units() function and its ability to render scientific unit notation correctly. Medical laboratory results frequently include units like mL, μL, g/dL, U/L, and ×10³/μL that require careful formatting (the last of those being particularly tedious to typeset correctly without dedicated tooling). The question was whether gt could faithfully reproduce the original Table 1 from the article.
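The reproduction hinges on that one function. A minimal sketch, assuming the bundled illness data exposes the test names and a units column written in gt’s units notation (check names(illness) if the column names differ):

```r
library(gt)

# illness ships with gt; its units column stores strings like "x10^9 / L"
# in gt's units notation, which fmt_units() typesets with proper superscripts
illness |>
  gt(rowname_col = "test") |>
  fmt_units(columns = units)
```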

C.13.1 Reading laboratory values

Medical laboratory tests generate data that require specialized interpretation. Each test has reference ranges defining normal values, and deviations above or below those ranges signal pathology. A white blood cell count of 3.0 × 10⁹/L might indicate leukopenia (low white cells), potentially signifying infection, bone marrow problems, or medication side effects. Liver enzymes elevated beyond normal ranges suggest hepatic damage. Reading the illness dataset means tracking multiple indicators as they evolve day by day through a fatal disease progression.

Laboratory Test Reference Ranges

                        Normal Range
Test         Units        Low   High
WBC          ×10⁹/L       4.0   10.0
Neutrophils  ×10⁹/L       2.0    8.0
RBC          ×10¹²/L      4.0    5.5
Hb           g/L        120.0  160.0
PLT          ×10⁹/L     100.0  300.0
ALT          U/L          9.0   50.0
AST          U/L         15.0   40.0
TBIL         µmol/L       0.0   18.8
DBIL         µmol/L       0.0    6.8
NH3          mmol/L      10.0   47.0
PT           s            9.4   12.5
APTT         s           25.1   36.5

The normal ranges provide context for interpreting measurements. When day 9 values fall far outside these ranges, the severity becomes apparent. Bilirubin rising dramatically indicates liver failure. Creatinine elevation signals kidney involvement. The cascade of organ dysfunction visible in sequential laboratory values explains why this case study merited publication and why it serves as a teaching resource.
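That kind of reading translates directly into a table style rule: flag any value that falls outside its reference range. A sketch under assumed column names (the little labs frame is illustrative, not the dataset itself):

```r
library(gt)

# Hypothetical subset of day-9 results alongside reference-range bounds
labs <- data.frame(
  test   = c("WBC", "PLT", "TBIL"),
  day_9  = c(19.03, 25.60, 163.20),
  norm_l = c(4.0, 100.0, 0.0),
  norm_u = c(10.0, 300.0, 18.8)
)

labs |>
  gt(rowname_col = "test") |>
  tab_style(
    style = cell_text(color = "red", weight = "bold"),
    locations = cells_body(
      columns = day_9,
      rows = day_9 < norm_l | day_9 > norm_u  # out-of-range values
    )
  )
```

The rows argument of cells_body() accepts an expression evaluated against the data, so the highlighting tracks the numbers rather than hand-picked rows.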

The dataset thus serves as a benchmark: if you can recreate a published scientific table using gt, the package’s formatting capabilities are proven sufficient for real-world use. The illness data provides that proof of concept while also documenting a tragic case that contributed to medical understanding of yellow fever progression.

C.14 rx_adsl and rx_addv

These two datasets represent gt’s connection to the pharmaceutical industry, where clinical trial tables must meet rigorous standards for regulatory submission. The datasets follow CDISC (Clinical Data Interchange Standards Consortium) conventions, specifically the ADaM (Analysis Data Model) structure used throughout the pharmaceutical industry.

rx_adsl contains subject-level data (ADSL format) for 182 participants in a fictional clinical trial. rx_addv provides protocol deviation records (ADDV format) with 910 entries documenting when and how trial participants deviated from study protocols. Both datasets use standard variable names and coding conventions that pharmaceutical statisticians will immediately recognize.

Subject ID  Age  Sex     Ethnicity               Treatment
GT1000       37  Male    Hispanic or Latino      NA
GT1001       41  Male    Not Hispanic or Latino  Placebo
GT1002       39  Female  Not Hispanic or Latino  Placebo
GT1003       38  Male    Not Hispanic or Latino  Placebo
GT1004       45  Male    Not Hispanic or Latino  Placebo
GT1005       35  Female  Hispanic or Latino      Placebo
GT1006       42  Female  Not Hispanic or Latino  Placebo
GT1007       35  Male    Not Hispanic or Latino  Placebo

C.14.1 The language of clinical trials

Pharmaceutical data follows conventions that seem arcane to outsiders but enable precise communication among specialists. USUBJID uniquely identifies a subject across all studies from a sponsor. TRTA indicates the actual treatment received (as opposed to the treatment assigned). SAFFL flags subjects in the safety population. This vocabulary, defined by CDISC standards, appears in regulatory submissions worldwide. A statistician in Switzerland reviewing a submission from Japan knows exactly what TRTA means because the standards are universal.

Treatment Arm Demographics

Treatment  Subjects  Mean Age  Female  Ethnicities
Placebo          90      41.2      0%            3
Drug 1           90      39.2      0%            3
NA                2      38.5      0%            1

The treatment arms in clinical trials typically include the experimental treatment at one or more doses, a placebo or active comparator, and sometimes multiple dosing regimens. Demographic balance across arms helps ensure that observed differences reflect treatment effects rather than baseline differences. Age, sex, ethnicity, disease severity at baseline, and prior treatments all require documentation and comparison.

Protocol Deviation Categories

Category  Deviations
                 187
Major            104

Protocol deviations document when trial participants did not follow the study plan. Some deviations are minor (a visit occurring outside the allowed window). Others are major (taking prohibited medications, missing doses). The rx_addv dataset catalogs these deviations, enabling sensitivity analyses that exclude subjects with major violations. Regulators scrutinize deviation patterns for evidence that the trial was conducted properly and that deviations do not undermine the conclusions.

These datasets were contributed by Alexandra Lauer as part of ongoing collaboration between gt developers and pharmaceutical industry users. The package website includes a dedicated case study article demonstrating how to create clinical tables that meet industry standards. For pharmaceutical statisticians evaluating gt for regulatory work, these datasets provide immediately relevant examples.

The inclusion of pharmaceutical data reflects gt’s ambition to serve professional communities with specialized requirements. Clinical trials generate enormous quantities of tabular output, much of it following strict formatting conventions. Having sample datasets in the standard format lowers the barrier for pharmaceutical users to adopt gt and verify that it meets their needs.

C.15 The value of curated datasets

Looking across all eighteen datasets, certain patterns emerge. Many fill gaps where public data existed but not in convenient form. The scientific datasets consolidate information scattered across journals and government pages. The films dataset creates a resource that simply did not exist before. Even the simulated pizzaplace data serves a purpose: realistic transactional data is rarely available publicly due to business confidentiality.

Other datasets reflect personal curiosity. Ontario towns. The Paris Métro. Gibraltar’s weather. Cannes films. These choices say something about my interests and the particular corners of the world that captured my attention. The datasets are better for this personal investment. Someone who cares about the Paris Métro will notice details that a disinterested compiler would miss.

For users of gt, the datasets provide reliable materials for learning and experimentation. The variety ensures that nearly any table type (financial, scientific, demographic, geographic, categorical) has relevant sample data available. The careful construction means edge cases are present: missing values, unusual formatting requirements, multi-valued fields. The documentation grounds each dataset in context that makes the data more meaningful to work with.

Datasets are infrastructure. Good ones get used for years, appearing in examples, tutorials, homework assignments, and documentation. The eighteen datasets in gt aspire to that longevity. They are not throwaway data generated to fill a requirement but carefully assembled resources intended to remain useful across many versions and use cases. The stories behind them, now recorded here, add another layer of value: not just what the data contains but why it exists and where it came from.