| The Complete pizzaplace Menu | ||
| All 32 pizza varieties by category. | ||
| Pizza | Available Sizes | Price Range |
|---|---|---|
| chicken | ||
| The Barbecue Chicken Pizza bbq_ckn | S, M, L | $12.75–$20.75 |
| The California Chicken Pizza cali_ckn | S, M, L | $12.75–$20.75 |
| The Chicken Alfredo Pizza ckn_alfredo | S, M, L | $12.75–$20.75 |
| The Chicken Pesto Pizza ckn_pesto | S, M, L | $12.75–$20.75 |
| The Southwest Chicken Pizza southw_ckn | S, M, L | $12.75–$20.75 |
| The Thai Chicken Pizza thai_ckn | S, M, L | $12.75–$20.75 |
| classic | ||
| The Big Meat Pizza big_meat | S | $12.00 |
| The Classic Deluxe Pizza classic_dlx | S, M, L | $12.00–$20.50 |
| The Greek Pizza the_greek | S, M, L, XL, XXL | $12.00–$35.95 |
| The Hawaiian Pizza hawaiian | S, M, L | $10.50–$16.50 |
| The Italian Capocollo Pizza ital_cpcllo | S, M, L | $12.00–$20.50 |
| The Napolitana Pizza napolitana | S, M, L | $12.00–$20.50 |
| The Pepperoni Pizza pepperoni | S, M, L | $9.75–$15.25 |
| The Pepperoni, Mushroom, and Peppers Pizza pep_msh_pep | S, M, L | $11.00–$17.50 |
| supreme | ||
| The Brie Carre Pizza brie_carre | S | $23.65 |
| The Calabrese Pizza calabrese | S, M, L | $12.25–$20.25 |
| The Italian Supreme Pizza ital_supr | S, M, L | $12.50–$20.75 |
| The Pepper Salami Pizza peppr_salami | S, M, L | $12.50–$20.75 |
| The Prosciutto and Arugula Pizza prsc_argla | S, M, L | $12.50–$20.75 |
| The Sicilian Pizza sicilian | S, M, L | $12.25–$20.25 |
| The Soppressata Pizza soppressata | S, M, L | $12.50–$20.75 |
| The Spicy Italian Pizza spicy_ital | S, M, L | $12.50–$20.75 |
| The Spinach Supreme Pizza spinach_supr | S, M, L | $12.50–$20.75 |
| veggie | ||
| The Five Cheese Pizza five_cheese | L | $18.50 |
| The Four Cheese Pizza four_cheese | M, L | $14.75–$17.95 |
| The Green Garden Pizza green_garden | S, M, L | $12.00–$20.25 |
| The Italian Vegetables Pizza ital_veggie | S, M, L | $12.75–$21.00 |
| The Mediterranean Pizza mediterraneo | S, M, L | $12.00–$20.25 |
| The Mexicana Pizza mexicana | S, M, L | $12.00–$20.25 |
| The Spinach Pesto Pizza spin_pesto | S, M, L | $12.50–$20.75 |
| The Spinach and Feta Pizza spinach_fet | S, M, L | $12.00–$20.25 |
| The Vegetables + Vegetables Pizza veggie_veg | S, M, L | $12.00–$20.25 |
Appendix C — The gt datasets
Every dataset tells a story. The eighteen datasets bundled with gt were not assembled arbitrarily or pulled from dusty archives of convenient CSV files. Each one emerged from a specific need, a personal curiosity, or a gap in what was publicly available. Some model human behavior with surprising fidelity. Others preserve scientific data that would otherwise remain scattered across obscure government pages. A few exist simply because I thought “I wish this dataset existed” and then made it so.
This appendix takes you behind the scenes of these datasets. You will learn where they came from, how they were constructed, and why they matter beyond their immediate utility as demonstration data. Along the way, we will explore the broader contexts they represent: the economics of a neighborhood pizza shop, the 125-year evolution of urban transit in Paris, the delicate chemistry of Earth’s atmosphere, and the quiet dramas of population change in Ontario’s small towns. The datasets are more than rows and columns. They are windows into worlds worth understanding.
C.1 pizzaplace
The pizzaplace dataset contains 49,574 rows representing every pizza sold at a fictional pizzeria during the year 2015. Each row records a transaction with its timestamp, pizza type, size, and price. On the surface it appears to be straightforward sales data. Underneath, it is an elaborate simulation of human behavior, kitchen operations, and the unpredictable rhythms of a small food business.
The inspiration for this dataset came from Plateau Pizza, a real establishment in Coquitlam, British Columbia. The restaurant occupies a pleasant spot in a suburban plaza alongside a dollar store called Dollars & Cents and an IGA grocery store. It is the kind of neighborhood pizza place that survives on regulars and convenience, offering pizzas with names that are equal parts cheesy and memorable. Goat Supreme. The Calabrese. Names that stick in your head even if you cannot quite remember what toppings they included.
The dataset borrowed liberally from Plateau Pizza’s menu. Their category structure (Classic, Supreme, Veggie and Vegan, Chicken) was adopted directly. Many pizza names and ingredient combinations came straight from their website, though some were embellished and others invented entirely. The fictional additions drew inspiration from Food Network recipes and local salumist offerings that seemed appropriately gourmet. Pricing followed a similar pattern: real prices served as the baseline, then adjustments were made based on perceived ingredient costs. The fancier cheeses and cured meats commanded premium prices, as they should.
What makes pizzaplace interesting is not the menu but the simulation that generated a year’s worth of orders. The modeling script (preserved in the gt package’s source repository at data-raw/05-pizzaplace.R) creates synthetic customers who arrive throughout each day with realistic patterns. Barret Schloerke contributed refinements to the behavioral model, adding nuance to timing and preference distributions. Weekends see more traffic than weekdays, reflecting the work-leisure split that governs so much of consumer behavior. Holidays disrupt normal patterns in expected ways.
The simulation also includes what might be called narrative events: at one point, for example, a kitchen fire disrupts operations. Minor catastrophes like this add verisimilitude to what would otherwise be suspiciously smooth data. Real businesses experience interruptions, equipment failures, staff shortages, and the occasional small disaster. The pizzaplace data reflects this messiness, making it more valuable for realistic analysis exercises than perfectly clean synthetic data would be.
The dataset originally included plans for appetizers, side dishes, and beverages, but these were ultimately cut in favor of simplicity. A pizzeria that only sells pizza is easier to understand and analyze than one with a full menu. The constraint also keeps the focus on what makes the dataset distinctive: the careful modeling of how people order pizza throughout a year.
C.1.1 The pizzas of pizzaplace
The menu reveals the character of the fictional establishment. Classic pizzas stick to familiar territory: pepperoni, Hawaiian, the combinations everyone recognizes. The Supreme category ventures into more elaborate ingredient lists with names that promise indulgence. Veggie options cater to those avoiding meat, while the Chicken category builds meals around that particular protein. Most pizzas come in multiple sizes, though not every pizza is available in every size. The XL and XXL options appear only for The Greek Pizza; presumably that one pizza is popular enough to warrant the larger formats (or perhaps its ingredient mix simply works well at scale).
Pricing follows intuitive logic. A pizza with simple toppings costs less than one loaded with prosciutto and artichoke hearts. Larger sizes cost more, though the per-square-inch cost typically decreases as you go larger (the eternal economic argument for ordering the bigger pizza). These pricing patterns create natural opportunities for analysis: which pizzas generate the most revenue? Which sizes sell best and for which types? How does day of week affect the popularity of the vegetarian options?
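The first two of those questions can be sketched in a few lines of R. This is only one possible approach, assuming the gt and dplyr packages are installed; the object names (revenue_by_pizza, size_by_type) are illustrative and not part of the package. Note that in the pizzaplace data the type column holds the category and the name column holds the short pizza ID.

```r
library(gt)
library(dplyr)

# Units sold and total revenue for each pizza, highest revenue first
revenue_by_pizza <- pizzaplace |>
  group_by(type, name) |>
  summarize(
    n_sold = n(),
    revenue = sum(price),
    .groups = "drop"
  ) |>
  arrange(desc(revenue))

# How many pizzas of each size were sold within each category
size_by_type <- pizzaplace |>
  count(type, size, name = "n_sold") |>
  arrange(type, desc(n_sold))

head(revenue_by_pizza, 5)
```

Either result can be passed straight to gt() for presentation.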
C.1.2 A pizza tier list
Any discussion of pizza invites opinions about which pizzas are best. The following tier list represents one possible ranking of the pizzaplace menu, from essential classics to more adventurous options that may not suit every palate. Reasonable people will disagree, and that disagreement is part of what makes pizza culture endlessly entertaining.
| The Definitive pizzaplace Tier List | ||
| A highly subjective ranking of 15 notable pizzas. | ||
| Pizza | Ingredients | Assessment |
|---|---|---|
| S Tier — Essential | ||
| The Pepperoni | Mozzarella, Pepperoni, Tomato Sauce | The benchmark against which all pizzas are measured |
| The Big Meat | Bacon, Pepperoni, Italian Sausage, Chorizo, Mozzarella, Tomato Sauce | Maximalist approach that somehow works |
| The Classic Deluxe | Pepperoni, Mushrooms, Red Onions, Red Peppers, Bacon, Mozzarella, Tomato Sauce | Everything a pizza should be |
| A Tier — Excellent | ||
| The Barbecue Chicken | Barbecued Chicken, Red Peppers, Green Peppers, Tomatoes, Red Onions, Mozzarella, Barbecue Sauce | Sweet and savory in perfect balance |
| The Hawaiian | Sliced Ham, Pineapple, Mozzarella, Tomato Sauce | Controversial but undeniably popular |
| The Italian Capocollo | Capocollo, Red Peppers, Tomatoes, Goat Cheese, Garlic, Oregano, Mozzarella, Tomato Sauce | Elevated ingredients lift the whole experience |
| The Calabrese | Nduja, Italian Sausage, Pepperoni, Tomatoes, Red Onions, Mozzarella, Tomato Sauce | Spicy nduja provides serious depth |
| The Prosciutto | Prosciutto, Arugula, Mozzarella, Tomato Sauce | Simple elegance, fresh and light |
| B Tier — Good | ||
| The Four Cheese | Ricotta, Gorgonzola, Romano, Mozzarella, Tomato Sauce | For dedicated cheese enthusiasts only |
| The Vegetables | Mushrooms, Tomatoes, Red Peppers, Green Peppers, Red Onions, Zucchini, Spinach, Garlic, Mozzarella, Tomato Sauce | The vegetable abundance can overwhelm |
| The Spinach Pesto | Spinach, Artichokes, Tomatoes, Sun-dried Tomatoes, Garlic, Pesto Sauce, Mozzarella | Pesto base divides opinion |
| The Greek | Kalamata Olives, Feta, Tomatoes, Red Onions, Red Peppers, Garlic, Mozzarella, Tomato Sauce | Mediterranean flavors work better as salad |
| C Tier — Passable | ||
| The Brie Carre | Brie, Prosciutto, Caramelized Onions, Pears, Thyme, Garlic, Mozzarella | Ambitious but confused identity |
| The Chicken Pesto | Chicken, Tomatoes, Red Peppers, Spinach, Garlic, Pesto Sauce, Mozzarella | Pesto and chicken compete rather than complement |
| The Soppressata | Soppressata, Fontina, Mozzarella, Garlic, Tomato Sauce | Underwhelming given premium ingredients |
| Rankings reflect one person's taste. YMMV. | ||
The S-tier pizzas represent the core of what a pizzeria should do well. The Pepperoni is the foundation, the pizza against which all others are implicitly compared. When someone says “let’s get pizza”, this is what most people imagine. The Big Meat takes maximalism seriously but avoids the trap of incoherence: bacon, pepperoni, sausage, and chorizo sound excessive, but they harmonize around a common meatiness (it really works!). The Classic Deluxe achieves balance, incorporating vegetables (mushrooms, peppers, onions) alongside meat without letting any single ingredient dominate.
The A-tier pizzas represent successful experiments. The Hawaiian remains perpetually controversial, inspiring passionate defenses and equally passionate condemnations. But I’m a fan. The combination of salty ham and sweet pineapple against tangy tomato sauce works for a lot of people, and the dataset shows it selling consistently throughout the year. The Italian Capocollo and Calabrese elevate the genre through premium cured meats, offering pizzas that could credibly appear on a trattoria menu. The Prosciutto demonstrates that restraint can be a virtue: just ham, arugula, and cheese, but executed with quality ingredients.
The B-tier pizzas are solid but not essential. The Four Cheese appeals to dedicated turophiles but can feel monotonous without the textural variety that vegetables or meats provide. The vegetable-heavy options (Vegetables, Greek, Spinach Pesto) often release too much moisture during cooking, resulting in soggy centers that undermine the crust. Don’t get me wrong, they are fine pizzas but rarely anyone’s first choice.
The C-tier pizzas represent experiments that did not quite succeed. The Brie Carre attempts a French-inspired combination of brie, pears, and prosciutto that sounds sophisticated but tastes confused. The ingredients come from different culinary traditions and never fully integrate. These pizzas do seem to sell in this particular dataset, but they probably rarely inspire repeat orders or enthusiastic recommendations.
C.1.3 What makes a good pizza
The tier list above reflects accumulated pizza wisdom, but the principles underlying it deserve articulation. A good pizza balances several competing demands. The crust must be sturdy enough to support toppings without becoming soggy, yet thin and pliable enough to fold. The sauce should be present but not overwhelming, providing acidity and moisture without drowning the other flavors. The cheese needs to melt smoothly and brown slightly without becoming rubbery or releasing pools of grease. The toppings must be distributed evenly and cut to appropriate sizes so that each bite contains a representative sample. And freshness is non-negotiable. A pizza that has been languishing in a display case for hours, slowly drying out under a heat lamp, is a shadow of its former self. The crust toughens, the cheese congeals, and the toppings develop that dispiriting sheen of oxidation. At the extreme end of neglect, one reviewer actually found mold growing on the underside of a display pizza, which is the sort of discovery that makes you reconsider every grab-and-go slice you have ever eaten.
Beyond these structural requirements, good pizza demonstrates restraint. The impulse to add more toppings is understandable but usually misguided. Each additional ingredient dilutes the impact of everything else. The best pizzas feature three to five components (beyond sauce and cheese) that complement one another. Pepperoni alone is better than pepperoni with six other meats. Mushrooms and peppers work together because their textures contrast and their flavors do not compete.
Temperature matters enormously. A pizza that has been sitting for twenty minutes bears little resemblance to one fresh from the oven. The cheese firms up, the crust softens, the toppings cool into sad disconnection. This is why pizzerias survive on dine-in and delivery rather than takeout that sits in a car. And do not even get me started on microwave pizza. The microwave does something deeply wrong to pizza crust, transforming it into a chewy, rubbery substance that no longer qualifies as bread by any reasonable definition. The cheese melts unevenly, the sauce superheats into lava pockets, and the whole experience leaves you wondering why you bothered. Cold leftover pizza eaten standing at the refrigerator at midnight is honestly preferable (I’ve actually grown to like it more and more these days). The pizzaplace dataset implicitly captures this reality: the simulation models customers who order and receive pizzas within a reasonable timeframe, not pizzas boxed and forgotten (and certainly not pizzas reheated in a microwave).
Finally, good pizza requires good ingredients. This seems obvious but explains why some inexpensive pizzas disappoint while others satisfy. The quality of the mozzarella matters. The tomatoes in the sauce matter. Even the olive oil brushed on the crust matters. Plateau Pizza, the real establishment that inspired the dataset, succeeds partly because it sources decent ingredients and does not cut corners that customers would notice. The fictional pizzaplace inherits this philosophy.
And yet, after all this talk of balance and restraint, it is worth acknowledging that pizza can abandon every one of these conventions and still be legitimate. A pizza can have no sauce. It can have no cheese. It can be nothing more than dough, olive oil, and anchovies. This sounds shocking to anyone raised on North American delivery pizza, but it is a perfectly valid expression of pizzadom with deep roots in Italian tradition. Pizza bianca, pizza marinara, and the various focaccia-adjacent flatbreads of southern Italy predate the mozzarella-laden versions by centuries. The rules above describe one tradition. Pizza is generous enough to accommodate others.
C.1.4 Why pizza data matters
Pizza occupies a unique position in the landscape of consumer goods. It is simultaneously simple (bread, sauce, cheese, toppings) and infinitely variable. A pizzeria’s menu encodes assumptions about its customers: their adventurousness, their price sensitivity, their dietary restrictions. Sales patterns encode behavior: when people eat, what they celebrate, how weather affects appetite.
Part of my motivation for creating this dataset was to draw attention to pizza analytics as a legitimate field of inquiry. The phrase sounds ridiculous, and that is partly the point. If a dataset can make someone smile while also teaching them time series decomposition and category performance analysis, it has done more work than a dataset that only accomplishes the second thing. Nobody has ever felt intimidated by pizza data. Nobody has ever opened a CSV of pizza orders and thought “I am not qualified to analyze this”. The approachability is a feature (not a frivolity!).
The pizzaplace dataset serves as a sandbox for the kinds of analysis that businesses perform constantly. Revenue breakdowns, time series decomposition, category performance, seasonal adjustment. These techniques apply far beyond pizza. Anyone learning to analyze transactional data will find the patterns in pizzaplace transferable to retail, hospitality, and service industries generally. The dataset is large enough to be realistic (nearly 50,000 transactions) but small enough to process quickly on any modern computer.
For gt specifically, pizzaplace demonstrates grouped data, aggregation, currency formatting, and the presentation of time-based information. A year of pizza sales can become a monthly summary table, a daily heatmap, a ranked list of bestsellers, or a comparison across categories. The richness of the underlying data supports dozens of different table designs.
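As a rough illustration of that workflow, the sketch below aggregates the year into a monthly summary by category and formats it with gt. It assumes gt and dplyr are available; the names monthly_sales and monthly_tbl, and the table title, are mine rather than anything defined by the package.

```r
library(gt)
library(dplyr)

# Aggregate: pizzas sold and revenue per month and category
monthly_sales <- pizzaplace |>
  mutate(month = as.integer(format(as.Date(date), "%m"))) |>
  group_by(month, type) |>
  summarize(pizzas = n(), revenue = sum(price), .groups = "drop")

# Present: months become row groups, categories become stub labels
monthly_tbl <- monthly_sales |>
  mutate(month = month.name[month]) |>
  gt(rowname_col = "type", groupname_col = "month") |>
  fmt_integer(columns = pizzas) |>
  fmt_currency(columns = revenue, currency = "USD") |>
  tab_header(title = "Monthly pizzaplace sales by category")

monthly_tbl
```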
C.2 exibble
The name exibble is a portmanteau of “example tibble” and it serves exactly that purpose. This tiny dataset of eight rows exists to demonstrate gt’s formatting capabilities without the distraction of meaningful content. Each column represents a different kind of data: numeric values, character strings, factors, dates, times, datetimes, and currency amounts. Missing values appear in strategic locations to demonstrate sub_missing() and related substitution functions.
| The exibble dataset | ||||||||
| 8 rows and 9 columns. | ||||||||
| num | char | fctr | date | time | datetime | currency | row | group |
|---|---|---|---|---|---|---|---|---|
| 1.111e-01 | apricot | one | 2015-01-15 | 13:35 | 2018-01-01 02:22 | 49.950 | row_1 | grp_a |
| 2.222e+00 | banana | two | 2015-02-15 | 14:40 | 2018-02-02 14:33 | 17.950 | row_2 | grp_a |
| 3.333e+01 | coconut | three | 2015-03-15 | 15:45 | 2018-03-03 03:44 | 1.390 | row_3 | grp_a |
| 4.444e+02 | durian | four | 2015-04-15 | 16:50 | 2018-04-04 15:55 | 65100.000 | row_4 | grp_a |
| 5.550e+03 | NA | five | 2015-05-15 | 17:55 | 2018-05-05 04:00 | 1325.810 | row_5 | grp_b |
| NA | fig | six | 2015-06-15 | NA | 2018-06-06 16:11 | 13.255 | row_6 | grp_b |
| 7.770e+05 | grapefruit | seven | NA | 19:10 | 2018-07-07 05:22 | NA | row_7 | grp_b |
| 8.880e+06 | honeydew | eight | 2015-08-15 | 20:20 | NA | 0.440 | row_8 | grp_b |
The column names are deliberately generic (num, char, currency, date, time, datetime) because the content does not matter. What matters is having every common data type available in a single compact dataset. When documenting a date formatter, you need a date column. When showing number formatting options, you need numbers. When explaining how to handle NA values, you need NA values in predictable locations.
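A minimal sketch of that idea, assuming only that gt is installed: one formatter per column type, plus sub_missing() to replace the NA values. The particular styles chosen here (two decimals, the "m_day_year" date style) are arbitrary picks for illustration.

```r
library(gt)

exibble_fmt <- exibble |>
  gt() |>
  # Numbers: fixed two decimal places across all magnitudes
  fmt_number(columns = num, decimals = 2) |>
  # Dates: the character dates are parsed and restyled
  fmt_date(columns = date, date_style = "m_day_year") |>
  # Currency: dollar sign, thousands separators, two decimals
  fmt_currency(columns = currency, currency = "USD") |>
  # NA values: replaced with the default placeholder in every column
  sub_missing()

exibble_fmt
```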
C.2.1 Anatomy of a reference dataset
Each row and column in exibble was chosen to exercise different aspects of table formatting:
| exibble Column Structure | ||
| Column | R Type | Role in Examples |
|---|---|---|
| num | numeric | Demonstrates numeric formatting across scales |
| char | character | Provides recognizable text labels (fruits) |
| currency | numeric | Tests currency with decimals, zeros, NAs |
| date | character | Shows date formatting patterns |
| time | character | Character-encoded times for parsing |
| datetime | character | Date-time strings for datetime formatting |
| row | character | Stub/rowname labels for tables |
| group | character | Group categories for row grouping |
The fruit names in the char column (apricot, banana, coconut, and so forth) follow alphabetical order, which makes them easy to verify when demonstrating sorting or filtering operations. The numeric values span several orders of magnitude, from fractions to millions, ensuring that formatters must handle both small precise values and large rounded ones. The currency column includes a missing value and one very small amount, testing edge cases that might trip up naive formatting approaches.
The row and group columns transform exibble from a formatting showcase into a structural one. With row serving as a stub and group organizing rows into categories, the same eight-row dataset can demonstrate virtually every gt feature. Headers, stubs, row groups, column formatting, substitution, styling… all can be shown using just exibble.
| exibble with Row Groups and Stub | |||||||
| Demonstrating structural features. | |||||||
| | num | char | fctr | date | time | datetime | currency |
|---|---|---|---|---|---|---|---|
| grp_a | |||||||
| row_1 | 0.11 | apricot | one | 1/15/2015 | 13:35 | 1/1/2018 02:22 | $49.95 |
| row_2 | 2.22 | banana | two | 2/15/2015 | 14:40 | 2/2/2018 14:33 | $17.95 |
| row_3 | 33.33 | coconut | three | 3/15/2015 | 15:45 | 3/3/2018 03:44 | $1.39 |
| row_4 | 444.40 | durian | four | 4/15/2015 | 16:50 | 4/4/2018 15:55 | $65,100.00 |
| grp_b | |||||||
| row_5 | 5,550.00 | — | five | 5/15/2015 | 17:55 | 5/5/2018 04:00 | $1,325.81 |
| row_6 | — | fig | six | 6/15/2015 | — | 6/6/2018 16:11 | $13.26 |
| row_7 | 777,000.00 | grapefruit | seven | — | 19:10 | 7/7/2018 05:22 | — |
| row_8 | 8,880,000.00 | honeydew | eight | 8/15/2015 | 20:20 | — | $0.44 |
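A table along these lines could be built roughly as follows. This is a sketch, not necessarily the code behind the table above: it sets up the stub and row groups and applies a few formatters, but does not try to reproduce the exact date and datetime styles shown.

```r
library(gt)

exibble_structured <- exibble |>
  # row becomes the stub; group splits the rows into grp_a and grp_b
  gt(rowname_col = "row", groupname_col = "group") |>
  tab_header(
    title = "exibble with Row Groups and Stub",
    subtitle = "Demonstrating structural features."
  ) |>
  fmt_number(columns = num, decimals = 2) |>
  fmt_currency(columns = currency, currency = "USD") |>
  sub_missing()

exibble_structured
```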
Datasets like exibble don’t get a lot of attention, but they are essential infrastructure for documentation examples. Every example in gt’s documentation that needs to show a quick formatting demonstration reaches for exibble rather than constructing throwaway data inline. This consistency helps readers recognize the dataset and focus on what is being demonstrated rather than puzzling over unfamiliar data structures.
C.3 gtcars
The gtcars dataset contains specifications for 47 luxury and performance automobiles, with an emphasis on grand touring vehicles. The name works on two levels: these are GT (grand tourer) cars, and the dataset lives in a package called gt. The wordplay is intentional but understated. I try not to make a big deal about it.
| German and Italian Grand Tourers | |||||
| Precision engineering meets Mediterranean passion. | |||||
| | HP | Torque | MPG (City) | MPG (Hwy) | MSRP |
|---|---|---|---|---|---|
| Audi | |||||
| R8 | 430 | 317 | 11 | 20 | $115,900 |
| RS 7 | 560 | 516 | 15 | 25 | $108,900 |
| S6 | 450 | 406 | 18 | 27 | $70,900 |
| S7 | 450 | 406 | 17 | 27 | $82,900 |
| S8 | 520 | 481 | 15 | 25 | $114,900 |
| BMW | |||||
| 6-Series | 315 | 330 | 20 | 30 | $77,300 |
| M4 | 425 | 406 | 17 | 24 | $65,700 |
| M5 | 560 | 500 | 15 | 22 | $94,100 |
| M6 | 560 | 500 | 15 | 22 | $113,400 |
| i8 | 357 | 420 | 28 | 29 | $140,700 |
| Mercedes-Benz | |||||
| AMG GT | 503 | 479 | 16 | 22 | $129,900 |
| SL-Class | 329 | 354 | 20 | 27 | $85,050 |
| Porsche | |||||
| 718 Boxster | 300 | 280 | 21 | 28 | $56,000 |
| 718 Cayman | 300 | 280 | 20 | 29 | $53,900 |
| 911 | 350 | 287 | 20 | 28 | $84,300 |
| Panamera | 310 | 295 | 18 | 28 | $78,100 |
| Ferrari | |||||
| 458 Italia | 562 | 398 | 13 | 17 | $233,509 |
| 458 Speciale | 597 | 398 | 13 | 17 | $291,744 |
| 458 Spider | 562 | 398 | 13 | 17 | $263,553 |
| 488 GTB | 661 | 561 | 15 | 22 | $245,400 |
| California | 553 | 557 | 16 | 23 | $198,973 |
| F12Berlinetta | 731 | 509 | 11 | 16 | $319,995 |
| FF | 652 | 504 | 11 | 16 | $295,000 |
| GTC4Lusso | 680 | 514 | 12 | 17 | $298,000 |
| LaFerrari | 949 | 664 | 12 | 16 | $1,416,362 |
| Lamborghini | |||||
| Aventador | 700 | 507 | 11 | 18 | $397,500 |
| Gallardo | 550 | 398 | 12 | 20 | $191,900 |
| Huracan | 610 | 413 | 16 | 20 | $237,250 |
| Maserati | |||||
| Ghibli | 345 | 369 | 17 | 24 | $70,600 |
| Granturismo | 454 | 384 | 13 | 21 | $132,825 |
| Quattroporte | 404 | 406 | 16 | 23 | $99,900 |
The dataset was assembled from Motor Trend articles about grand touring vehicles, with additional research filling in gaps for fuel economy and torque figures. Most vehicles date from around 2015, reflecting when the source articles were published. The selection criteria emphasized true grand tourers: vehicles designed for high-speed, long-distance driving in comfort, typically with powerful engines, refined interiors, and substantial price tags.
C.3.1 What makes a grand tourer?
The grand touring concept originated in 1950s Europe, when wealthy motorists began taking extended driving holidays across the continent. A proper GT needed range (400+ kilometers between fuel stops), performance (for the autobahns and mountain passes), and comfort (for hours behind the wheel). The Ferrari 250 GT established the template: front-mounted V12, leather interior, elegant coachwork by Pininfarina or Scaglietti. The grand tourer was always as much about aspiration as transportation.
| Most Expensive Cars in gtcars 💰 | ||
| Top 10 by manufacturer's suggested retail price. | ||
| Manufacturer | Model | MSRP |
|---|---|---|
| Ferrari | LaFerrari | $1,416,362 |
| Ford | GT | $447,000 |
| Lamborghini | Aventador | $397,500 |
| Rolls-Royce | Dawn | $335,000 |
| Ferrari | F12Berlinetta | $319,995 |
| Rolls-Royce | Wraith | $304,350 |
| Ferrari | GTC4Lusso | $298,000 |
| Ferrari | FF | $295,000 |
| Ferrari | 458 Speciale | $291,744 |
| Aston Martin | Vanquish | $287,250 |
The top ten most expensive cars in the dataset tell a clear story about where the money goes. Ferrari dominates the list with five entries, led by the LaFerrari at over $1.4 million (more than three times the price of any other car in the dataset). The Ford GT makes a surprising appearance at number two, representing America’s answer to European exotica. A Lamborghini, two Rolls-Royces, and an Aston Martin round out the list, each occupying a different niche of the ultra-luxury market. Notably absent from the top ten are the German manufacturers, whose cars offer serious performance at comparatively accessible price points.
| Performance Tiers | |||
| Grouping the 47 cars in gtcars by horsepower output. | |||
| | Models | Avg Torque (lb-ft) | Avg Price |
|---|---|---|---|
| Modest (<400 HP) | 10 | 319 | $78,545 |
| Strong (400-499 HP) | 8 | 374 | $95,416 |
| High (500-599 HP) | 17 | 469 | $188,800 |
| Extreme (600+ HP) | 12 | 548 | $363,024 |
The relationship between horsepower and price is neither linear nor deterministic. Some modest-horsepower vehicles (certain Porsches, for instance) command premium prices through brand cachet and driving dynamics. Some extremely powerful vehicles achieve their output through brute displacement rather than exotic engineering, keeping prices relatively accessible. The correlation exists, but the exceptions tell interesting stories.
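The tier summary above can be approximated with a short dplyr pipeline. The bin edges below are inferred from the tier labels, so treat them as an assumption; the variable names tiers and tiers_tbl are illustrative. The hp, trq, and msrp columns are the horsepower, torque, and price columns in gtcars.

```r
library(gt)
library(dplyr)

tiers <- gtcars |>
  mutate(
    # right = FALSE makes each bin include its lower edge: [400, 500), etc.
    tier = cut(
      hp,
      breaks = c(-Inf, 400, 500, 600, Inf),
      labels = c(
        "Modest (<400 HP)", "Strong (400-499 HP)",
        "High (500-599 HP)", "Extreme (600+ HP)"
      ),
      right = FALSE
    )
  ) |>
  group_by(tier) |>
  summarize(
    models = n(),
    avg_torque = mean(trq, na.rm = TRUE),
    avg_price = mean(msrp, na.rm = TRUE)
  )

tiers_tbl <- tiers |>
  gt(rowname_col = "tier") |>
  fmt_number(columns = avg_torque, decimals = 0) |>
  fmt_currency(columns = avg_price, currency = "USD", decimals = 0)

tiers_tbl
```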
The choice to create gtcars was motivated by a desire for a modern equivalent to the venerable mtcars dataset that has shipped with R for decades. The original mtcars contains 1974 Motor Trend data on 32 automobiles, and it has been used in countless examples and tutorials. But cars from 1974 feel increasingly remote from contemporary experience. A dataset of modern luxury vehicles offers familiar reference points (Ferrari, Porsche, Aston Martin) and specifications that relate to cars people actually see on roads today.
For table-making purposes, gtcars provides natural groupings by manufacturer, multiple numeric columns suitable for formatting and comparison, and a mix of discrete and continuous variables. The manufacturer and model columns enable row grouping and stub labeling. The price column practically demands currency formatting. The horsepower and torque columns work well for bar chart visualizations within cells. It is a dataset that seems designed for beautiful tables because, in fact, it was.
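Here is a hedged sketch of how those pieces fit together, using the German cars as an example. It assumes gt and dplyr are installed and relies on the gtcars column names mfr, model, hp, trq, mpg_c, mpg_h, msrp, and ctry_origin; the object name german_gt is mine.

```r
library(gt)
library(dplyr)

german_gt <- gtcars |>
  filter(ctry_origin == "Germany") |>
  select(mfr, model, hp, trq, mpg_c, mpg_h, msrp) |>
  # Manufacturer becomes the row group, model the stub label
  gt(rowname_col = "model", groupname_col = "mfr") |>
  # Group the two fuel economy columns under one spanner
  tab_spanner(label = "MPG", columns = c(mpg_c, mpg_h)) |>
  cols_label(
    hp = "HP", trq = "Torque",
    mpg_c = "City", mpg_h = "Hwy", msrp = "MSRP"
  ) |>
  fmt_currency(columns = msrp, currency = "USD", decimals = 0)

german_gt
```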
C.4 countrypops
The countrypops dataset tracks population estimates for countries worldwide from 1960 through the present (currently extending to 2024). The data comes from the World Bank, which compiles demographic estimates from national statistical offices, census data, and the United Nations Population Division. With over 13,000 rows covering more than 200 countries across six decades, it is one of the larger datasets in the gt collection.
| Population Growth in Five Major Nations | ||||
| | 1960 | 1980 | 2000 | 2020 |
|---|---|---|---|---|
| Brazil | 72.39M | 121.21M | 174.02M | 208.66M |
| China | 667.07M | 981.23M | 1.26B | 1.41B |
| India | 435.99M | 687.35M | 1.06B | 1.40B |
| Nigeria | 45.05M | 73.76M | 126.38M | 214.00M |
| United States | 180.67M | 227.22M | 282.16M | 331.58M |
Population data might seem straightforward, but it encodes profound stories of human migration, economic development, public health, and political change. China’s population trajectory shows the demographic impact of the one-child policy. Nigeria’s explosive growth reflects patterns common across sub-Saharan Africa. European countries exhibit the stagnation and aging that accompany developed economies. Each row is a snapshot of millions of individual lives aggregated into a single number.
C.4.1 Understanding population data
Population counts are harder to obtain than one might assume. Only a handful of countries conduct reliable censuses at regular intervals. Many estimates rely on birth and death registrations (which vary in completeness), surveys of representative samples (which involve statistical uncertainty), or projections from previous counts (which compound errors over time). The World Bank’s task is to synthesize these imperfect sources into consistent estimates that allow comparison across countries and years.
| World's Most Populous Countries | ||
| 2023 estimated population. | ||
| | Population (2023) | |
|---|---|---|
| India | 1.44B | |
| China | 1.41B | |
| United States | 337M | |
| Indonesia | 281M | |
| Pakistan | 248M | |
| Nigeria | 228M | |
| Brazil | 211M | |
| Bangladesh | 171M | |
| Russia | 144M | |
| Mexico | 130M | |
| Ethiopia | 129M | |
| Japan | 125M | |
| Philippines | 115M | |
| Egypt | 115M | |
| Congo (DRC) | 106M | |
The uncertainties in population data matter for policy and planning. A country that believes it has 100 million people will allocate resources differently than one that believes it has 120 million. Census undercounts (common in remote areas, among marginalized populations, and in places where people distrust government) lead to underinvestment in precisely the communities that need services most. The countrypops figures represent best estimates, not ground truth, and users should remember this limitation.
That said, the trends in population data are generally reliable even when the absolute numbers carry uncertainty. If the World Bank estimates that Nigeria’s population doubled between 1990 and 2020, the actual growth was almost certainly substantial even if the precise figures might be revised. Trends matter more than point estimates for most analytical purposes, and the countrypops dataset captures these trends across the entire modern era of demographic record-keeping.
| Population Change in Aging Societies | ||||
| Index: 1990 = 100 | ||||
| | 1990 | 2000 | 2010 | 2020 |
|---|---|---|---|---|
| Germany | 100.0 | 103.5 | 103.0 | 104.7 |
| Spain | 100.0 | 104.4 | 119.8 | 121.8 |
| Italy | 100.0 | 100.4 | 105.5 | 104.8 |
| Japan | 100.0 | 102.7 | 103.7 | 102.3 |
| South Korea | 100.0 | 109.7 | 115.6 | 120.9 |
| Values show population relative to 1990 baseline | ||||
The table above shows population indexed to 1990 for five countries facing demographic aging. Japan’s population has declined in absolute terms since 2010. Germany and Italy have barely grown. South Korea’s growth is slowing from decade to decade. These patterns reflect low birth rates, increased longevity, and (in some cases) restrictive immigration policies. The economic and social implications of aging populations (pension systems, healthcare costs, labor force composition) represent some of the most significant policy challenges of the coming decades.
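The index arithmetic behind such a table can be sketched in a few lines. This is plain Python rather than gt code, the function names are illustrative, and the index values are copied from the table above.

```python
def to_index(values, base_position=0):
    """Rescale a series so the value at base_position equals 100."""
    base = values[base_position]
    return [round(v / base * 100, 1) for v in values]


def pct_change(index_start, index_end):
    """Percent change between two index values (the base year cancels out)."""
    return (index_end / index_start - 1) * 100


# Index values for 1990, 2000, 2010, 2020, taken from the table above.
japan = [100.0, 102.7, 103.7, 102.3]
south_korea = [100.0, 109.7, 115.6, 120.9]

# Japan: change over the last decade shown (a negative value means decline).
japan_2010s = pct_change(japan[2], japan[3])

# South Korea: growth in each successive decade.
korea_growth = [pct_change(a, b) for a, b in zip(south_korea, south_korea[1:])]
```

Because both values in `pct_change()` share the same 1990 base, the ratio of two index values equals the ratio of the underlying populations, so decade-over-decade changes can be read directly from the indexed table.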
The dataset updates whenever the World Bank publishes new estimates, rather than on any fixed release schedule. This ongoing maintenance means that examples in documentation and books remain current. When a 2024 population figure for China becomes available, it appears in countrypops shortly thereafter. This currency makes the dataset more useful for teaching than static historical data would be.
For gt demonstrations, countrypops excels at time series comparisons, geographic groupings, and the handling of large numbers. The population values range from thousands (small island nations) to billions (China and India), exercising formatters across their full dynamic range. The longitudinal structure supports year-over-year comparisons, growth rate calculations, and the kind of decade-by-decade summary tables that appear in demographic reports.
C.5 towny
While countrypops takes a global view, towny focuses on a single Canadian province: Ontario. The dataset contains population figures for 414 municipalities, including data from every Canadian census between 1996 and 2021 (conducted every five years) plus various geographic and administrative attributes. It exists because I actually live in Ontario and wanted an excuse to know more about the places surrounding me.
| Ontario's Largest Municipalities | ||||
| Population and density for the top 10, 2001 vs. 2021. | ||||
| | Population (2001) | Density (2001) | Population (2021) | Density (2021) |
|---|---|---|---|---|
| Toronto | 2,481,494 | 3,932.0 | 2,794,356 | 4,427.8 |
| Ottawa | 774,072 | 277.6 | 1,017,449 | 364.9 |
| Mississauga | 612,925 | 2,093.8 | 717,961 | 2,452.6 |
| Brampton | 325,428 | 1,223.9 | 656,480 | 2,469.0 |
| Hamilton | 490,268 | 438.4 | 569,353 | 509.1 |
| London | 336,359 | 799.9 | 422,324 | 1,004.3 |
| Markham | 208,615 | 989.0 | 338,503 | 1,604.8 |
| Vaughan | 182,022 | 668.1 | 323,103 | 1,186.0 |
| Kitchener | 190,399 | 1,391.7 | 256,885 | 1,877.7 |
| Windsor | 208,402 | 1,427.2 | 229,660 | 1,572.8 |
| Density is measured in persons per km². | ||||
The data comes from Statistics Canada and reveals patterns that might surprise those unfamiliar with Canadian geography. Toronto dominates, of course, but the surrounding municipalities (Mississauga, Brampton, Hamilton) have grown substantially over twenty years. Some smaller towns have declined as economic opportunities concentrated elsewhere. The dataset captures this quiet drama of population redistribution that plays out across every country’s regions.
Ontario municipality names offer their own entertainment. Some are indigenous place names with beautiful sounds. Others commemorate British royalty or colonial administrators. A few seem almost whimsical when encountered for the first time. These names appear on highway signs and maps, marking places where real communities exist with their own histories and concerns. The towny dataset transforms those signs into data, inviting exploration of what lies behind the familiar names.
| Fastest Growing Ontario Municipalities | |||
| Among places with 10,000+ residents in 2001. | |||
| | Pop. 2001 | Pop. 2021 | Growth |
|---|---|---|---|
| Milton | 31,471 | 132,979 | 322.5% |
| Whitchurch-Stouffville | 22,859 | 49,864 | 118.1% |
| Brampton | 325,428 | 656,480 | 101.7% |
| Wasaga Beach | 12,419 | 24,862 | 100.2% |
| Bradford West Gwillimbury | 22,228 | 42,880 | 92.9% |
| Vaughan | 182,022 | 323,103 | 77.5% |
| Ajax | 73,753 | 126,666 | 71.7% |
| East Gwillimbury | 20,555 | 34,637 | 68.5% |
| New Tecumseth | 26,141 | 43,948 | 68.1% |
| Markham | 208,615 | 338,503 | 62.3% |
The fastest-growing municipalities cluster around the Greater Toronto Area, where housing demand has driven expansion into formerly rural townships. Milton, Brampton, and Markham have transformed from small towns into substantial cities within a generation. The infrastructure challenges of this growth (roads, schools, healthcare, transit) consume enormous resources and dominate local politics. The towny data captures the before and after of this transformation but cannot convey the lived experience of watching farmland become subdivisions.
Not every municipality grew. Some communities in northern and eastern Ontario lost population as young people left for opportunities elsewhere. Factory closures, exhausted mines, and the general drift of economic activity toward metropolitan areas hollowed out places that had thrived in earlier decades. The dataset does not distinguish between population loss from out-migration and loss from natural decrease (more deaths than births), but both dynamics contribute to the patterns visible in the numbers.
For table-making, towny provides opportunities for population density calculations, before-after comparisons, and growth rate analysis across its six census years. The land area column enables density visualization. The municipality names work naturally as row labels in grouped tables organized by population tier or geographic region.
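The growth and density calculations mentioned here are simple ratios. The sketch below is plain Python rather than gt code, with illustrative function names; the census counts are copied from the growth table above, and the land area argument stands in for the dataset's land area column, whose values are not shown in this appendix.

```python
def growth_pct(pop_start, pop_end):
    """Percent change between two census counts."""
    return (pop_end - pop_start) / pop_start * 100


def density(population, land_area_km2):
    """Persons per square kilometre."""
    return population / land_area_km2


# (2001 population, 2021 population), taken from the growth table above.
census = {
    "Markham": (208_615, 338_503),
    "Milton": (31_471, 132_979),
    "Brampton": (325_428, 656_480),
}

# Growth rounded to one decimal place, as in the table.
growth = {name: round(growth_pct(a, b), 1) for name, (a, b) in census.items()}

# Municipality names ordered from fastest to slowest growth.
ranked = sorted(growth, key=growth.get, reverse=True)
```

Rounding happens only at the end, so the ranking is based on the same one-decimal figures that a reader would see in the finished table.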
C.6 peeps
The peeps dataset contains fictional personal information for 100 imaginary people: names, addresses, phone numbers, email addresses, and nationalities. These fake individuals were generated using an online tool that produces realistic-seeming demographic data, then verified for plausible formatting of addresses and contact information across different countries.
| A Random Selection of Peeps | ||||
| First Name | Last Name | Email | Country | |
|---|---|---|---|---|
| Krzysztof | Kowalczyk | krzysztof_k@example.com | Poland | |
| Gaweł | Zając | gawelzajac@example.com | Poland | |
| Eva | Simpson | eva_simpson@example.com | Canada | |
| Rolla | Skov | rollaskov@example.com | Denmark | |
| Oliver | Mikkelsen | oli_mikkelsen@example.com | Denmark | |
| Letizia | Moretti | l_moretti@example.com | United Kingdom | |
The international scope was intentional. peeps was created specifically to demonstrate formatters like fmt_email(), fmt_country(), and fmt_flag(). Having people from various countries ensures that flag icons and country name formatting can be shown in realistic contexts. An address book or contact directory table should contain international entries, and peeps provides exactly that.
C.6.1 The problem of synthetic data
Generating realistic fake data is harder than it sounds. Names must fit cultural expectations (a person from Japan should have a Japanese name). Addresses must follow country-specific formats (postal codes before or after city names, province abbreviations versus full names). Phone numbers must have correct country codes and plausible internal structure. Email addresses must look like real email addresses while clearly being fictional.
The country distribution in peeps emphasizes variety over statistical representativeness. Having multiple people from smaller countries ensures that formatting edge cases get tested. A dataset with 90 Americans and 10 others would not exercise international formatting as thoroughly as one with broader distribution.
The email domains follow patterns typical of real email usage: major providers dominate, with country-specific services appearing for non-English-speaking regions. This realism helps ensure that fmt_email() handles the variety of domain lengths and TLD formats that appear in actual contact databases.
Every person in the dataset is entirely fictional. The addresses do not correspond to real residences. The phone numbers should not connect to anyone. But the formatting follows authentic patterns for each country represented. A French address looks like a French address. A Japanese name follows Japanese naming conventions. This verisimilitude matters because formatting functions must handle real-world variation, and peeps provides test cases for that variation without compromising anyone’s actual privacy.
C.7 sza (solar zenith angles)
The sza dataset originates from atmospheric chemistry research, specifically from data tables published in textbooks by Pitts and Finlayson-Pitts. It records solar zenith angles (the angle between the sun and the vertical) across different latitudes and months. The original data came from a US government source that may no longer be online, but the values remain scientifically accurate and useful.
| Solar Zenith Angles by Latitude and Month | ||||||||||||
| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mid-latitude (30°) | ||||||||||||
| 0700 | 89.4 | 86.2 | 81.1 | 74.9 | 69.9 | 66.8 | 66.3 | 68.3 | 72.9 | 78.7 | 84.6 | 88.7 |
| 0730 | 83.7 | 80.3 | 74.9 | 68.5 | 63.4 | 60.4 | 60.0 | 61.9 | 66.4 | 72.4 | 78.4 | 82.8 |
| 0800 | 78.3 | 74.6 | 68.8 | 62.1 | 56.9 | 54.0 | 53.5 | 55.4 | 60.0 | 66.2 | 72.8 | 77.3 |
| 0830 | 73.2 | 69.1 | 62.9 | 55.8 | 50.4 | 47.5 | 47.1 | 48.9 | 53.6 | 60.1 | 67.1 | 72.2 |
| 0900 | 68.4 | 64.1 | 57.4 | 49.6 | 44.0 | 41.0 | 40.6 | 42.4 | 47.3 | 54.4 | 62.0 | 67.3 |
| 0930 | 64.1 | 59.4 | 52.2 | 43.8 | 37.6 | 34.5 | 34.1 | 36.0 | 41.2 | 48.8 | 57.0 | 63.0 |
| 1000 | 60.4 | 55.3 | 47.5 | 38.2 | 31.4 | 28.1 | 27.7 | 29.7 | 35.5 | 43.8 | 53.0 | 59.2 |
| 1030 | 57.3 | 51.9 | 43.5 | 33.3 | 25.6 | 21.7 | 21.2 | 23.6 | 30.2 | 39.5 | 49.3 | 56.0 |
| 1100 | 55.0 | 49.3 | 40.3 | 29.3 | 20.4 | 15.7 | 15.1 | 18.1 | 25.8 | 36.1 | 46.7 | 53.7 |
| 1130 | 53.5 | 47.7 | 38.4 | 26.5 | 16.5 | 10.5 | 9.6 | 13.7 | 22.8 | 33.9 | 44.9 | 52.2 |
| 1200 | 53.0 | 47.2 | 37.7 | 25.5 | 15.0 | 8.0 | 6.9 | 11.9 | 21.6 | 33.1 | 44.4 | 51.8 |
| 0630 | — | — | 87.5 | 81.4 | 76.3 | 73.1 | 72.5 | 74.7 | 79.4 | 85.1 | — | — |
| 0600 | — | — | — | 87.2 | 82.7 | 79.2 | 78.7 | 81.0 | 85.9 | — | — | — |
| 0530 | — | — | — | — | 88.9 | 85.3 | 84.7 | 87.2 | — | — | — | — |
Solar zenith angles matter because they determine how much solar radiation reaches Earth’s surface and at what angle. This affects everything from climate modeling to photovoltaic panel efficiency to the rate of photochemical reactions in the atmosphere. At high latitudes in winter, the sun barely rises above the horizon (large zenith angles). At the equator, the sun passes nearly overhead at noon year-round (small zenith angles). The interplay between latitude, season, and time of day creates the patterns visible in the sza data.
C.7.1 Why zenith angles matter
When the sun sits directly overhead, sunlight travels through the minimum possible amount of atmosphere before reaching the surface. As the sun moves toward the horizon, light must traverse increasingly long atmospheric paths. This path length, often expressed as “air mass,” affects both the intensity and spectral composition of sunlight reaching the ground. Ultraviolet radiation attenuates more strongly through longer path lengths, which is why sunburns are most severe around solar noon in summer at lower latitudes.
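The link between zenith angle and path length can be made concrete with the common plane-parallel approximation, in which relative air mass is 1 / cos(zenith angle). The sketch below is plain Python with an illustrative function name. The approximation treats the atmosphere as a flat slab, so it overstates the path length as the sun nears the horizon. The two angles are the noon values at 30° latitude from the table above.

```python
import math


def relative_air_mass(zenith_deg: float) -> float:
    """Plane-parallel estimate of relative air mass: 1 / cos(zenith angle)."""
    if not 0 <= zenith_deg < 90:
        raise ValueError("zenith angle must be in [0, 90) degrees")
    return 1.0 / math.cos(math.radians(zenith_deg))


# Noon zenith angles at 30° latitude, read from the January and July columns.
noon_january = relative_air_mass(53.0)
noon_july = relative_air_mass(6.9)
```

A zenith angle of 60° doubles the path length relative to an overhead sun, which is a convenient reference point when scanning the sza values.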
| Solar Zenith Angles Throughout the Day | ||||
| By latitude and season. | ||||
| | Jan | Apr | Jul | Oct |
|---|---|---|---|---|
| 20°N | ||||
| 09:00 | 62° | 46° | 42° | 50° |
| 12:00 | 43° | 16° | 3° | 23° |
| 06:00 | — | 88° | 82° | — |
| 40°N | ||||
| 09:00 | 76° | 54° | 41° | 60° |
| 12:00 | 63° | 36° | 17° | 43° |
| 06:00 | — | 87° | 75° | — |
The seasonal pattern emerges clearly at mid and high latitudes. January brings high zenith angles in the Northern Hemisphere as the sun traces its winter arc low in the southern sky. July reverses the pattern, with the sun climbing high overhead at noon. At the equator, seasonal variation is minimal: the sun is always nearly overhead at midday. These patterns have shaped human civilization, determining growing seasons, driving migration patterns, and inspiring astronomical observations throughout history.
Atmospheric chemists care about zenith angles because photochemical reactions require light. Photolysis rates vary with the intensity and spectrum of incoming solar radiation. A nitrogen dioxide molecule that might photolyze within seconds at tropical noon may persist for hours at polar twilight. Models simulating urban smog formation or stratospheric ozone depletion must account for these variations, typically by looking up appropriate photolysis rates from tables indexed by zenith angle and altitude.
The dataset reflects my background in atmospheric chemistry. Such data tables would be directly imported into atmospheric box models for simulating photolysis-based reactions of volatile organic compounds (VOCs). Having the data readily available in R format eliminates the tedious work of transcribing values from printed tables or scraping data from web pages.
For gt, the sza dataset demonstrates heatmap-style coloring (values naturally vary from low to high in meaningful patterns), missing value handling (the sun does not rise at certain latitude-time combinations), and the presentation of scientific lookup tables. The structure (rows indexed by time, columns by month, grouped by latitude) maps naturally onto the table designs that scientists create when presenting such reference data.
C.8 constants, reactions, photolysis, and nuclides
These four datasets form a cluster of scientific reference data, each addressing a gap in publicly available resources. While the underlying information exists in various forms (government pages, textbooks, specialized databases), it had not been consolidated into convenient data frames. The gt package provided an opportunity to change that.
The constants dataset contains several hundred fundamental physical constants with their values, units, and uncertainties. These are the numbers that appear in physics and chemistry textbooks: the speed of light, Planck’s constant, Avogadro’s number, the gravitational constant. Each value comes with associated metadata specifying units and measurement precision.
| Numbers That Define the Universe | |||
| Ten fundamental physical constants. | |||
| | Value | Uncertainty | Units |
|---|---|---|---|
| Avogadro constant | 6.02 × 10²³ | | mol⁻¹ |
| Bohr radius | 5.29 × 10⁻¹¹ | 8.00 × 10⁻²¹ | m |
| Boltzmann constant | 1.38 × 10⁻²³ | | J K⁻¹ |
| electron mass | 9.11 × 10⁻³¹ | 2.80 × 10⁻⁴⁰ | kg |
| elementary charge | 1.60 × 10⁻¹⁹ | | C |
| fine-structure constant | 7.30 × 10⁻³ | 1.10 × 10⁻¹² | |
| Newtonian constant of gravitation | 6.67 × 10⁻¹¹ | 1.50 × 10⁻¹⁵ | m³ kg⁻¹ s⁻² |
| Planck constant | 6.63 × 10⁻³⁴ | | J Hz⁻¹ |
| proton mass | 1.67 × 10⁻²⁷ | 5.10 × 10⁻³⁷ | kg |
| speed of light in vacuum | 3.00 × 10⁸ | | m s⁻¹ |
C.8.1 The certainty of physical constants
Physical constants occupy a peculiar position in science. They are measured quantities, subject to experimental uncertainty, yet they describe fundamental features of the universe that we believe to be genuinely constant. The speed of light in vacuum, for instance, is now defined as exactly 299,792,458 meters per second, but this definition only became possible after decades of increasingly precise measurements. Before 1983, we measured the speed of light; now the meter is defined in terms of light’s speed. The historical progression of uncertainty shrinking toward zero tells a story of experimental ingenuity.
| Physical Constants by Measurement Precision | ||
| | Constants | Example |
|---|---|---|
| Extraordinarily precise | 51 | alpha particle-electron mass ratio |
| Extremely precise | 133 | alpha particle mass |
| Moderately precise | 30 | deuteron rms charge radius |
| Very precise | 59 | Angstrom star |
The measurement precision varies dramatically across constants. The fine-structure constant (approximately 1/137) has been measured to extraordinary precision through quantum electrodynamics experiments. The gravitational constant, despite being one of the first constants ever measured (by Cavendish in 1798), remains relatively imprecise because gravity is so weak that experiments must contend with tiny forces and correspondingly large relative uncertainties.
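One way to compare precision across constants with very different magnitudes is relative uncertainty: the absolute uncertainty divided by the value. The sketch below is plain Python with an illustrative function name, and it uses the rounded value and uncertainty figures shown in the table of ten constants earlier in this section.

```python
def relative_uncertainty(value: float, uncertainty: float) -> float:
    """Absolute uncertainty expressed as a fraction of the value."""
    return uncertainty / abs(value)


# (value, uncertainty) pairs as displayed in the ten-constant table.
fine_structure = relative_uncertainty(7.30e-3, 1.10e-12)
gravitation = relative_uncertainty(6.67e-11, 1.50e-15)

# How many times less precisely G is known than the fine-structure constant.
precision_gap = gravitation / fine_structure
```

On these figures the gravitational constant is known to roughly two parts in 100,000, while the fine-structure constant is known to better than two parts in ten billion.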
C.8.2 Atmospheric chemistry
The reactions dataset catalogs 1,344 atmospheric chemical reactions with their rate constants and temperature dependencies. The photolysis dataset provides photolysis rates for organic compounds, including spectral data stored in list columns. These datasets rarely exist in such accessible form. Researchers typically extract reaction rates from individual journal articles or specialized databases with restrictive access. Having a curated collection in R format simplifies atmospheric modeling exercises.
| Selected Atmospheric Reactions with OH | ||
| Compound | Formula | OH Rate at 298K |
|---|---|---|
| beta-caryophyllene | C₁₅H₂₄ | 2.00 × 10⁻¹⁰ |
| 3-hydroxy-2-butanone | C₄H₈O₂ | 9.70 × 10⁻¹² |
| O-methyl-N-ethylcarbamate | C₄H₉NO₂ | 1.05 × 10⁻¹¹ |
| 4-methyl-1,3-dioxane | C₅H₁₀O₂ | 1.13 × 10⁻¹¹ |
| morpholine | C₄H₉NO | 1.10 × 10⁻¹⁰ |
| n-propyl propanoate | C₆H₁₂O₂ | 4.00 × 10⁻¹² |
| anthracene | C₁₄H₁₀ | 1.17 × 10⁻¹⁰ |
| triethyl phosphate | C₆H₁₅O₄P | 4.68 × 10⁻¹¹ |
Understanding atmospheric chemistry requires knowing how fast different reactions proceed under various conditions. The hydroxyl radical (OH) drives much of daytime atmospheric chemistry, attacking volatile organic compounds and beginning their oxidation chains. Nitrate radicals take over at night. Photolysis reactions require sunlight, their rates varying with solar zenith angle and wavelength. Each rate constant in the dataset represents an experimental determination, often from smog chamber studies, or a theoretical calculation validated against field measurements.
C.8.3 Nuclear data
The nuclides dataset compiles nuclear data for isotopes: half-lives, decay modes, particle emissions. Like the chemical datasets, this information exists scattered across various sources but had not been unified into a single convenient data frame. Nuclear chemistry courses and research often require looking up isotope properties, and nuclides consolidates those lookups.
| Radioactive Decay Modes | |
| | Nuclides |
|---|---|
| Alpha | 473 |
| Beta plus | 4 |
| Beta minus | 1224 |
| Electron capture | 137 |
Radioactive decay follows several pathways depending on the nuclear configuration. Beta-minus decay converts a neutron to a proton, moving the nucleus one step higher in atomic number. Alpha decay ejects a helium nucleus, reducing both atomic number and mass. Electron capture pulls an inner-shell electron into the nucleus. Each decay mode appears in the dataset, along with the half-life governing how quickly unstable nuclei transform.
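The role of the half-life can be illustrated with the standard exponential decay law, under which the fraction of nuclei remaining after time t is 0.5 raised to the power t / half-life. The sketch below is plain Python with an illustrative function name; it assumes both arguments are given in the same time unit and does not depend on any column of the nuclides dataset.

```python
def remaining_fraction(elapsed: float, half_life: float) -> float:
    """Fraction of the original nuclei left after `elapsed` time.

    Both arguments must use the same time unit.
    """
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (elapsed / half_life)


# After one, two, and three half-lives: a half, a quarter, an eighth.
fractions = [remaining_fraction(n, 1.0) for n in (1, 2, 3)]
```

Because only the ratio of elapsed time to half-life matters, the same function works for half-lives measured in microseconds or in billions of years.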
For gt demonstrations, these scientific datasets showcase formatters like fmt_scientific(), fmt_chem(), and fmt_units(). They also demonstrate from_column() usage, where the number of decimal places might come from an adjacent precision column rather than being hard-coded. Scientific communication demands careful attention to significant figures and uncertainty, and these datasets provide realistic contexts for that attention.
C.9 sp500
The sp500 dataset contains daily stock market data for the S&P 500 index from 1950 through 2015: opening price, closing price, high, low, and trading volume for each trading day. With over 16,000 rows, it provides a substantial corpus of financial time series data.
| Black Monday and the Week That Shook Wall Street | ||||
| S&P 500 daily prices, October 14–23, 1987. | ||||
| | Day | Open | Close | Volume |
|---|---|---|---|---|
| 1987-10-14 | Wednesday | $314.52 | $305.23 | 207.40M |
| 1987-10-15 | Thursday | $305.21 | $298.08 | 263.20M |
| 1987-10-16 | Friday | $298.08 | $282.70 | 338.50M |
| The Weekend | ||||
| 1987-10-19 | Monday | $282.70 | $224.84 | 604.30M |
| 1987-10-20 | Tuesday | $225.06 | $236.83 | 608.10M |
| 1987-10-21 | Wednesday | $236.83 | $258.38 | 449.60M |
| 1987-10-22 | Thursday | $258.24 | $248.25 | 392.20M |
| 1987-10-23 | Friday | $248.29 | $248.22 | 245.60M |
Financial data demands specific formatting conventions: currency symbols, appropriate decimal precision, volume abbreviations. The sp500 dataset exercises these requirements across decades of market history. Bull markets, bear markets, crashes, and recoveries all appear in the data. The 1987 Black Monday crash, the dot-com bubble, the 2008 financial crisis: each left its signature in closing prices and trading volumes.
C.9.1 Reading the market’s history
The S&P 500 tracks 500 large-cap American companies, weighted by market capitalization. It serves as the benchmark against which most US equity investments are measured. A fund that “beats the market” outperforms the S&P 500. An investor who wants “market returns” buys an index fund tracking the S&P 500. The index represents the collective judgment of millions of market participants about the value of corporate America.
| S&P 500 Annual Summary | |||||
| 2006–2015 | |||||
| | Open | Close | Annual Return | Year High | Year Low |
|---|---|---|---|---|---|
| 2006 | $1,248 | $1,418 | +13.6% | $1,432 | $1,219 |
| 2007 | $1,418 | $1,468 | +3.5% | $1,576 | $1,364 |
| 2008 | $1,468 | $903 | −38.5% | $1,472 | $741 |
| 2009 | $903 | $1,115 | +23.5% | $1,130 | $667 |
| 2010 | $1,117 | $1,258 | +12.6% | $1,263 | $1,011 |
| 2011 | $1,258 | $1,258 | 0.0% | $1,371 | $1,075 |
| 2012 | $1,259 | $1,426 | +13.3% | $1,475 | $1,259 |
| 2013 | $1,426 | $1,848 | +29.6% | $1,849 | $1,426 |
| 2014 | $1,846 | $2,059 | +11.5% | $2,094 | $1,738 |
| 2015 | $2,059 | $2,044 | −0.7% | $2,135 | $1,867 |
The years 2006 through 2015 illustrate the market’s capacity for dramatic swings. 2008 stands out with its catastrophic decline during the financial crisis, when the index lost more than a third of its value. The subsequent years show gradual recovery, with the index reaching new highs by the early 2010s. Anyone who sold during the panic of 2008 locked in losses. Anyone who held through the crisis recovered and then some. The data tells both stories depending on which slices you examine.
The dataset originated from a web search that turned up historical market data, possibly compiled from Kaggle or similar sources. The exact provenance matters less than the utility: a long time series of financial data in a format ready for analysis and visualization. For teaching purposes, the sp500 dataset provides realistic data for demonstrating time series analysis, returns calculations, volatility measurement, and the kind of financial tables that appear in annual reports and investment presentations.
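The returns calculation mentioned above is a simple percent change between two prices. The sketch below is plain Python with an illustrative function name; the prices are copied from the two sp500 tables in this section, so the annual figures are computed from whole-dollar values.

```python
def simple_return_pct(start_price: float, end_price: float) -> float:
    """Percent change from start_price to end_price."""
    return (end_price - start_price) / start_price * 100


# Black Monday, October 19, 1987: open and close from the daily table.
black_monday = simple_return_pct(282.70, 224.84)

# Annual returns from the yearly open and close in the summary table.
annual_2008 = simple_return_pct(1468, 903)
annual_2013 = simple_return_pct(1426, 1848)
```

Rounded to one decimal place, the 2008 and 2013 results reproduce the Annual Return column, and the Black Monday figure is a single-day loss of a little over 20%.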
C.10 metro
The Paris Métro is one of the world’s great urban transit systems. Opened in 1900, it has grown to 16 lines serving 308 stations across the city and surrounding communes. The metro dataset captures this network: station names, locations, line assignments, opening dates, and ridership figures.
| Busiest Paris Métro Stations | ||
| | Lines | Annual Passengers |
|---|---|---|
| Gare du Nord | 4, 5 | 34.50M |
| Saint-Lazare | 3, 12, 13, 14 | 33.13M |
| Gare de Lyon | 1, 14 | 28.64M |
| Montparnasse—Bienvenüe | 4, 6, 12, 13 | 20.41M |
| Gare de l'Est | 4, 5, 7 | 15.54M |
| Bibliothèque François Mitterrand | 14 | 11.10M |
| République | 3, 5, 8, 9, 11 | 11.08M |
| Les Halles | 4 | 10.62M |
| La Défense | 1 | 9.26M |
| Châtelet | 1, 4, 7, 11, 14 | 8.35M |
The dataset exists because I really admire the Paris Métro. Among the world’s subway systems, it stands out for its density, connectivity, and integration with other transit modes (RER commuter rail, buses, trams, and high-speed TGV connections). The wayfinding and signage are exemplary. The expansion plans are ambitious and consistently executed. It represents what urban transit can be when treated as essential infrastructure rather than an afterthought.
C.10.1 A brief history of the Métro
The story of the Paris Métro begins in the late nineteenth century, when Paris faced the same urban transportation crisis that afflicted every growing industrial city. Horse-drawn omnibuses clogged the boulevards. The wealthy rode in private carriages while workers walked miles to reach their jobs. London had opened its Underground in 1863, demonstrating that subterranean railways could move masses of people efficiently. Paris, perennially competitive with its cross-Channel rival, needed its own solution.
The first line opened on July 19, 1900, timed to coincide with the Exposition Universelle that drew millions of visitors to Paris that summer. Line 1 ran from Porte de Vincennes to Porte Maillot, connecting the eastern and western edges of the city through its commercial heart. The stations featured distinctive Art Nouveau entrances designed by Hector Guimard, with their sinuous cast-iron curves and amber glass panels that remain iconic more than a century later. Not all survived (many were removed during mid-century modernization campaigns and later regretted), but those that remain are protected monuments.
The network expanded rapidly in its early decades. By 1910, Paris had six lines. By 1920, ten. This breakneck pace was not entirely the product of unified planning. The Compagnie du chemin de fer métropolitain de Paris (CMP) held the primary concession, but it faced competition from the Nord-Sud Company, which built what would become Lines 12 and 13. The two companies raced to serve lucrative routes, and their rivalry accelerated construction beyond what a single monopoly might have achieved. The Nord-Sud stations were arguably more elegant, with ceramic tile work and distinctive lettering that enthusiasts still admire. When the companies merged in 1930, Paris inherited a network that had been built fast precisely because multiple actors were competing to build it.
This frenetic early growth distinguished Paris from its European peers. London’s Underground, though older, expanded more cautiously under a patchwork of private companies that often duplicated routes rather than extending coverage. The Berlin U-Bahn, which opened in 1902, grew steadily but faced the complication of serving multiple municipalities that would not unify until 1920. Paris benefited from centralized city planning within the relatively compact boundaries of the twenty arrondissements, allowing the CMP and Nord-Sud to build a coherent network even while competing. By 1930, Paris had more stations than London despite London’s forty-year head start.
The guiding philosophy was density: stations placed close together (often just 500 meters apart) so that no Parisian would have to walk more than a few minutes to reach the Métro. This density distinguishes Paris from systems like Washington DC or the Bay Area’s BART, where stations are spaced miles apart and require feeder buses or long walks. The Paris approach sacrifices speed between stations for convenience of access, a tradeoff that makes sense for a compact, dense city.
| 125 Years of Métro Expansion | ||
| How the network grew decade by decade. | ||
| | Stations Opened | Notable Events |
|---|---|---|
| 1900s | 65 | Line 1 opens for World's Fair |
| 1910s | 85 | Rapid expansion across Paris |
| 1920s | 60 | Network reaches most arrondissements |
| 1930s | 25 | Great Depression slows construction |
| 1940s | 5 | World War II occupation |
| 1950s | 15 | Post-war reconstruction begins |
| 1960s | 12 | RER regional express network planned |
| 1970s | 8 | RER lines A and B open |
| 1980s | 18 | Line 14 planning begins |
| 1990s | 10 | Line 14 construction starts |
| 2000s | 8 | Line 14 opens (first automated line) |
| 2010s | 14 | Line extensions to suburbs |
| 2020s | 10 | Grand Paris Express under construction |
The interwar period saw continued expansion but also financial difficulties. The 1930s depression slowed construction, and the network that had seemed destined for endless growth began to stabilize. World War II brought occupation and disruption. The Métro continued to operate (the Germans found it useful for moving troops and supplies), but expansion halted and maintenance suffered. Several stations were closed and converted to other uses, some serving as air raid shelters.
Post-war reconstruction proceeded slowly. The immediate decades after 1945 focused on repairing damage and updating aging infrastructure rather than building new lines. The real transformation came in the 1960s and 1970s with the creation of the RER (Réseau Express Régional), a network of express lines that tunneled through central Paris but extended far into the suburbs. The RER was not technically part of the Métro but integrated seamlessly with it, allowing commuters to transfer between the dense inner-city network and the faster regional lines.
C.10.2 The modern network
Today’s Métro comprises 16 lines totaling over 220 kilometers of track. The numbering seems haphazard (there are lines 1 through 14, plus 3bis and 7bis), reflecting historical accidents rather than logical planning. Lines 3bis and 7bis were originally branches of their parent lines that later gained operational independence. The system carries approximately 4 million passengers daily, making it one of the world’s busiest rapid transit networks.
| Paris Métro Lines | ||||
| Current network statistics. | ||||
| Line | Length (km) | Stations | Automated | |
|---|---|---|---|---|
| 1 | 16.6 | 25 | ✔ | |
| 2 | 12.4 | 25 | ✘ | |
| 3 | 11.7 | 25 | ✘ | |
| 3bis | 1.3 | 4 | ✘ | |
| 4 | 12.1 | 27 | ✔ | |
| 5 | 14.6 | 22 | ✘ | |
| 6 | 13.6 | 28 | ✘ | |
| 7 | 22.4 | 38 | ✘ | |
| 7bis | 3.1 | 8 | ✘ | |
| 8 | 23.4 | 38 | ✘ | |
| 9 | 19.6 | 37 | ✘ | |
| 10 | 11.7 | 23 | ✘ | |
| 11 | 6.3 | 13 | ✘ | |
| 12 | 13.9 | 29 | ✘ | |
| 13 | 24.3 | 32 | ✘ | |
| 14 | 14.0 | 13 | ✔ | |
Line 14 deserves special attention as the system’s showcase. Opened in 1998, it was the first fully automated line on the network, operating without drivers. Platform screen doors prevent accidents (a significant concern on older lines) and allow trains to run with shorter headways. The stations feel modern and spacious compared to the cramped tunnels of the earliest lines. Line 14 demonstrated that new construction was possible and could achieve standards superior to the historical network. It has since been extended multiple times and serves as the template for future expansion.
The Grand Paris Express, currently under construction, represents the most ambitious expansion since the network’s founding. This project will add four new automated lines (15, 16, 17, and 18) encircling the existing network and connecting suburban centers that currently require traveling through central Paris to reach one another. When complete, probably sometime in the 2030s, the Grand Paris Express will nearly double the length of the network and fundamentally reshape mobility patterns in the Île-de-France region.
C.10.3 What makes the Métro work
Several design principles distinguish the Paris Métro from less successful transit systems. First, the high density of stations means that walking to the Métro is almost always faster than driving to a parking lot. This convenience generates ridership that justifies the investment. Second, the integration with other modes is seamless. The same ticket works on Métro, RER, buses, and trams within Paris. Transfer stations connect lines at useful angles rather than requiring passengers to exit one system and enter another. Third, the frequency of service makes timetables irrelevant. During peak hours, trains arrive every two minutes on busy lines. Even late at night, waits rarely exceed ten minutes. Passengers simply show up and go.
The signage and wayfinding deserve particular praise. Station names appear in a consistent typeface (Parisine, designed specifically for the Métro in 1996) on tiled walls visible from passing trains. Corridor signs point toward exits, transfers, and surface landmarks with clarity that serves tourists and commuters alike. The colored line numbers and terminus names provide all the information needed to navigate without consulting maps. Many transit systems aspire to this legibility but few achieve it so thoroughly.
| Ridership by Line Assignment | |||
| Stations grouped by their line connections. | |||
| Line(s) | Station Count | Total Ridership | Avg per Station |
|---|---|---|---|
| 7 | 28 | 63.28M | 2.26M |
| 9 | 23 | 60.64M | 2.64M |
| 13 | 23 | 57.83M | 2.51M |
| 1 | 14 | 55.93M | 3.99M |
| 4 | 16 | 53.66M | 3.35M |
| 8 | 26 | 50.48M | 1.94M |
| 12 | 21 | 39.59M | 1.89M |
| 3 | 17 | 38.72M | 2.28M |
| 2 | 15 | 35.11M | 2.34M |
| 4, 5 | 1 | 34.50M | 34.50M |
The Métro also benefits from Paris’s urban form. The city is dense and compact, with most destinations within walking distance of a station. Zoning never separated residential from commercial uses as strictly as in American cities, so people live near where they work and shop. The Métro did not create this urban form (it predates the Métro by centuries), but the two reinforce one another. Dense cities need mass transit, and mass transit makes density livable.
For the metro dataset, this context matters. The station names are not arbitrary labels but markers of neighborhoods with distinct characters. The ridership figures reflect how Parisians actually move through their city. The line assignments show which routes carry the heaviest loads and which serve more specialized purposes. Understanding the Métro as a living system, constantly adapting over 125 years of operation, makes the dataset more meaningful than raw numbers alone could convey.
C.10.4 The future of Paris transit
The Grand Paris Express will transform the region, but it is only part of a broader vision. Line 14 continues to extend northward and southward. Line 11 is being extended to connect new suburbs. Tram lines are expanding along the outer boulevards, and bus networks are being reorganized to feed into the rail system more efficiently. The goal is a regional transit network that allows travel between any two points without necessarily passing through central Paris.
The dataset updates periodically to reflect these changes. New stations appear as they open. Ridership figures are updated with each annual release. The metro data is not a static snapshot but an evolving portrait of a transit system that continues to grow and adapt. Future versions will include the Grand Paris Express stations, extending coverage far beyond the historical city limits.
For gt demonstrations, the dataset provides geographic data with French language station names, offering opportunities to demonstrate locale handling and the presentation of transit network information. The ridership figures support ranking tables. The line assignments (stored as comma-separated values) demonstrate handling of multi-valued fields. The opening dates span over a century, creating interesting timelines. But beyond these technical uses, the dataset offers a window into one of humanity’s great collective achievements: a transit system that moves millions of people daily, efficiently and reliably, through one of the world’s most beautiful cities.
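The multi-valued line field can be sketched in a few lines of base R. This is a minimal illustration only: the data frame, its column names (`name`, `lines`), and the station values are hypothetical stand-ins, not rows taken from the metro dataset.

```r
# Hypothetical stand-in for a table with a comma-separated line field
stations <- data.frame(
  name  = c("Station A", "Station B", "Station C"),
  lines = c("1", "4, 5", "2, 5, 9"),
  stringsAsFactors = FALSE
)

# Split each field on commas, tolerating optional surrounding spaces
line_list <- strsplit(stations$lines, "\\s*,\\s*")

# Reshape to one row per station-line pair ("long" form)
long <- data.frame(
  name = rep(stations$name, lengths(line_list)),
  line = unlist(line_list),
  stringsAsFactors = FALSE
)

# Count how many stations each line serves
stations_per_line <- table(long$line)
print(stations_per_line)
```

Splitting into a long form first keeps later steps simple: counting, grouping, or joining by line then works on ordinary single-valued columns.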
C.11 gibraltar
The gibraltar dataset contains weather observations from Gibraltar during May 2023: temperature, humidity, wind speed, cloud cover, and other meteorological variables. Readings were taken roughly every half hour, giving 1,431 rows that cover a single month in this small but fascinating territory.
| Gibraltar Morning Weather | |||||
| May 1, 2023: fog clearing to fair skies. | |||||
| Time | Temp (°C) | Humidity | Wind (km/h) | Direction | Condition |
|---|---|---|---|---|---|
| 06:50 | 17.2 | 72% | 0.4 | W | Fair |
| 07:50 | 17.8 | 88% | 0.9 | NE | Patches of Fog |
| 08:50 | 17.2 | 82% | 0.9 | W | Patches of Fog |
| 09:20 | 17.8 | 77% | 2.7 | WSW | Patches of Fog |
| 09:50 | 17.8 | 77% | 2.2 | WSW | Fair |
| 10:20 | 18.9 | 73% | 2.7 | SW | Fair |
| 10:50 | 21.1 | 64% | 1.3 | WSW | Fair |
| 11:20 | 21.1 | 68% | 2.7 | ESE | Fair |
| 11:50 | 22.2 | 60% | 2.2 | SE | Fair |
| 12:20 | 22.2 | 60% | 2.2 | E | Fair |
| 12:50 | 22.2 | 60% | 2.2 | E | Fair |
| 13:20 | 22.2 | 64% | 2.7 | E | Fair |
| 13:50 | 22.2 | 64% | 2.7 | E | Fair |
Gibraltar sits at the southern tip of the Iberian Peninsula, a British Overseas Territory of barely seven square kilometers guarding the entrance to the Mediterranean Sea. It is the kind of place that captures the imagination precisely because it seems improbable: a limestone promontory with its own airport runway crossing the main road, Barbary macaques roaming the upper rock, and a rather complex history.
C.11.1 Understanding the Rock
The Rock of Gibraltar rises 426 meters above sea level, a dramatic limestone formation that has served as a strategic landmark for millennia. The ancient Greeks called it one of the Pillars of Hercules, marking the edge of the known world. Every Mediterranean power has recognized its importance: control Gibraltar and you control access between the Atlantic and the Mediterranean. The British acquired it in 1704 during the War of the Spanish Succession and have held it ever since, despite periodic Spanish objections and one famous siege that lasted nearly four years.
| Gibraltar Weather by Time of Day | |||||
| May 2023 | |||||
| Time of Day | Temp (°C), Average | Temp (°C), Maximum | Temp (°C), Minimum | Humidity | Wind |
|---|---|---|---|---|---|
| Morning | 19.0 | 23.9 | 13.9 | 1% | 3.5 |
| Afternoon | 21.6 | 30.0 | 15.0 | 1% | 4.7 |
| Evening | 21.3 | 28.9 | 15.0 | 1% | 4.4 |
| Night | 18.9 | 27.2 | 13.9 | 1% | 4.0 |
The May weather data captures Gibraltar in spring, before the intense heat of Mediterranean summer arrives. Temperatures climb through the afternoon hours and descend through the evening, following the familiar diurnal pattern. Humidity inversely tracks temperature, rising as the air cools. Wind direction matters at Gibraltar: the Levante wind blows from the east through the strait, often bringing fog as Mediterranean moisture condenses against the Rock. The Poniente arrives from the west, drier and clearer. These wind patterns shaped navigation through the strait for centuries of sailing ships.
| Wind Direction Frequency | |
| May 2023 | |
| Direction | Observations |
|---|---|
| E | 271 |
| W | 216 |
| ENE | 171 |
| WSW | 165 |
| NE | 118 |
| SSW | 115 |
| ESE | 111 |
| SW | 102 |
| S | 52 |
| NNE | 33 |
| SE | 33 |
| WNW | 15 |
| NNW | 10 |
| NW | 8 |
| N | 6 |
| SSE | 4 |
| CALM | 1 |
The predominance of certain wind directions reflects the geography of the strait. Air flows through the narrow gap between Europe and Africa, channeled by the mountains on either side. Local topography further complicates matters: the Rock itself creates wind shadows and acceleration zones. Pilots landing at Gibraltar Airport must contend with these effects, making it one of the more challenging airports in Europe. The runway crosses Winston Churchill Avenue, requiring traffic to stop when aircraft land or take off.
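To put rough numbers on the east-west split, the counts from the wind direction table can be grouped in base R. The grouping below is an assumption made for illustration: every direction with an easterly component is counted as "easterly" (Levante-like) and every direction with a westerly component as "westerly" (Poniente-like), which is cruder than a meteorologist's definition.

```r
# Counts transcribed from the wind direction frequency table (May 2023)
wind <- c(E = 271, W = 216, ENE = 171, WSW = 165, NE = 118, SSW = 115,
          ESE = 111, SW = 102, S = 52, NNE = 33, SE = 33, WNW = 15,
          NNW = 10, NW = 8, N = 6, SSE = 4, CALM = 1)

# Assumed grouping: any direction with an easterly or westerly component
easterly <- c("NNE", "NE", "ENE", "E", "ESE", "SE", "SSE")
westerly <- c("SSW", "SW", "WSW", "W", "WNW", "NW", "NNW")

sector <- ifelse(names(wind) %in% easterly, "easterly",
          ifelse(names(wind) %in% westerly, "westerly", "other"))

# Total observations per sector and each sector's share of all readings
totals <- tapply(wind, sector, sum)
shares <- round(100 * totals / sum(wind), 1)

print(totals)
print(shares)
```

Under this grouping, 741 of the 1,431 readings have an easterly component and 631 a westerly one, with due north, due south, and calm making up the rest.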
May was chosen simply to provide pre-summer weather data. Gibraltar’s Mediterranean climate means mild, pleasant conditions that month, with temperatures climbing toward but not yet reaching peak summer heat. The specific year (2023) holds no particular significance beyond being recent enough for the data to feel current. The data comes from weather APIs providing historical observations, typical of the sources that make meteorological data increasingly accessible for analysis and visualization.
For gt, the dataset demonstrates time series formatting, weather data presentation, and the handling of multiple related numeric columns. Temperature formatting, wind direction encoding, and the diurnal patterns visible across each day's observations all provide teaching opportunities.
C.12 films
The films dataset is a labor of love: a comprehensive record of every film that has competed for the Palme d’Or at the Cannes Film Festival. It contains 1,607 entries spanning the festival’s history, with each row recording a film’s title (in both English and original language), director, year, country of origin, spoken languages, and IMDB link.
| Cannes Film Festival 2019 | ||
| Official Competition | ||
| Film | Director | Country |
|---|---|---|
| A Hidden Life | Terrence Malick | United Kingdom, Germany, United States |
| Atlantics | Mati Diop | France, Senegal, Belgium |
| Bacurau | Juliano Dornelles, Kleber Mendonça Filho | Brazil, France |
| Pain and Glory | Pedro Almodóvar | Spain, France |
| Frankie | Ira Sachs | France, Portugal |
| Parasite | Bong Joon Ho | South Korea |
| The Traitor | Marco Bellocchio | Italy, France, Germany, Brazil |
| It Must Be Heaven | Elia Suleiman | France, Qatar, Germany, Canada, Turkey, Palestine |
| The Whistlers | Corneliu Porumboiu | Romania, France, Germany, Switzerland, Sweden |
| Young Ahmed | Jean-Pierre Dardenne, Luc Dardenne | Belgium, France |
| Les Misérables | Ladj Ly | France |
| Little Joe | Jessica Hausner | Austria, United Kingdom, Germany, France |
| Matthias & Maxime | Xavier Dolan | Canada |
| Mektoub, My Love: Intermezzo | Abdellatif Kechiche | France |
| The Wild Goose Lake | Yi'nan Diao | China, France |
| Once Upon a Time in... Hollywood | Quentin Tarantino | United States, United Kingdom, China |
| Portrait of a Lady on Fire | Céline Sciamma | France |
| Oh Mercy! | Arnaud Desplechin | France |
| Sibyl | Justine Triet | France, Belgium |
| Sorry We Missed You | Ken Loach | United Kingdom, France, Belgium |
| The Dead Don't Die | Jim Jarmusch | United States |
The dataset exists because I really like watching movies. My Letterboxd account (letterboxd.com/rich_i/) tracks my viewing history and provides an ongoing record of films watched and opinions formed (manifesting in star ratings). Film festivals provide endless opportunities for discovery, surfacing works that might never reach mainstream distribution. The Cannes Film Festival, as the most prestigious venue for international cinema, seemed like essential data that should exist in an accessible format. But no such dataset was publicly available. The only logical solution was to create one.
C.12.1 Building the Cannes dataset
Construction required extensive research spanning months of work. The festival’s official website provided the foundation, listing competition entries by year. But the website alone was insufficient. Many older entries appeared only with French titles, requiring investigation to find corresponding English names (or vice versa for English-language films shown under French titles). Some films had been released under multiple names in different markets, demanding careful verification of which title was authoritative.
| Cannes Competition Entries by Year | |
| Sample of years from 1970 onward. | |
| Year | In-Competition Films |
|---|---|
| 1970 | 25 |
| 1971 | 26 |
| 1972 | 25 |
| 1973 | 24 |
| 1974 | 26 |
| 1975 | 22 |
| 1976 | 20 |
| 1977 | 23 |
| 1978 | 23 |
| 1979 | 21 |
| 1980 | 23 |
| 1981 | 22 |
| 1982 | 22 |
| 1983 | 22 |
| 1984 | 19 |
| 1985 | 20 |
| 1986 | 20 |
| 1987 | 20 |
| 1988 | 21 |
| 1989 | 22 |
IMDB links were tracked down for each entry, providing viewers easy access to cast lists, synopses, and user ratings. This was straightforward for recent films but required detective work for older or more obscure entries. Some films from the 1950s and 1960s had minimal online presence, with IMDB pages containing little additional information. But the links exist for completeness, allowing interested viewers to explore further.
Spoken languages and countries of origin required the most careful coding. International co-productions muddy the concept of a film’s “country”. Is a film shot in France, funded by German and Italian producers, directed by a Polish filmmaker, and starring British actors a French film? The dataset records all countries involved in production, accepting that many films defy simple national categorization. Languages posed similar challenges: a film might be primarily in French with scenes in Arabic and English, and all three languages deserve acknowledgment. Where multiple languages appear, I tried to arrange them roughly by the quantity of words spoken in each, so the first language listed is generally the one that dominates the dialogue.
C.12.2 The festival and its significance
The Cannes Film Festival has operated since 1946 (with a brief predecessor event in 1939 interrupted by war). It functions simultaneously as a trade show for film distribution, a competition for artistic achievement, and a showcase for celebrity culture. The Palme d’Or, awarded to the best film in competition, carries considerable prestige. Winners enter the canon of international cinema, their directors’ careers transformed by the recognition.
| Countries by Cannes Competition Entries | |
| Single-country productions across festival history. | |
| Country | Films in Competition |
|---|---|
| United States | 201 |
| France | 150 |
| United Kingdom | 81 |
| Italy | 73 |
| Japan | 56 |
| USSR | 44 |
| Spain | 39 |
| Germany | 36 |
| Hungary | 32 |
| Sweden | 27 |
| Mexico | 25 |
| Poland | 24 |
The table above shows which national cinemas have placed the most single-country productions in competition at Cannes. The United States leads, followed by France, whose strong showing is unsurprising given that Cannes is a French festival on the French Riviera. The United Kingdom and Italy come next, both countries with robust film industries and strong traditions of auteur filmmaking. Japan’s presence reflects the festival’s long appreciation for directors like Kurosawa, Ozu, and more recently Kore-eda and Hamaguchi. The geographic diversity of competition entries has increased over time, with films from Korea, Iran, Thailand, and other countries appearing regularly in recent decades.
The festival also reflects changing tastes and priorities in world cinema. In its early decades, Cannes emphasized European art cinema and established masters. The 1970s brought more adventurous programming, with controversial entries and recognition for directors working outside commercial constraints. Recent years have seen increased attention to women directors (historically underrepresented) and to cinemas from regions previously marginalized in international distribution.
| Most Frequent Cannes Competitors | ||
| Directors with 5+ competition entries. | ||
| Director(s) | Competition Entries | Years Active |
|---|---|---|
| Ken Loach | 15 | 1981-2023 |
| Jean-Pierre Dardenne, Luc Dardenne | 10 | 1999-2025 |
| Wim Wenders | 10 | 1976-2023 |
| Carlos Saura | 9 | 1960-1988 |
| Lars von Trier | 9 | 1984-2011 |
| Nanni Moretti | 9 | 1978-2023 |
| Ettore Scola | 8 | 1970-1989 |
| Jim Jarmusch | 8 | 1986-2019 |
| Marco Bellocchio | 8 | 1980-2023 |
| Marco Ferreri | 8 | 1963-1991 |
The directors who return repeatedly to Cannes competition form a roster of international cinema’s most celebrated figures. Their repeated presence reflects both the festival’s loyalty to directors it has championed and these filmmakers’ continued production of work deemed worthy of the world’s most competitive showcase. For many, a Cannes premiere represents the peak of artistic recognition, the moment when a new work enters the conversation of global cinema.
C.12.3 Film as data
The films dataset demonstrates that even cultural artifacts can be structured for analysis. Each film becomes a row with attributes: title, year, director, country, language. These attributes support queries that would be tedious to answer through casual browsing. Which directors have competed most often? How has the linguistic diversity of competition entries changed over time? What proportion of recent competitors are first-time Cannes directors versus returning favorites?
For gt specifically, films demonstrates fmt_flag() and fmt_country() in realistic contexts. The country codes translate directly to flag icons, creating visual tables that communicate nationality at a glance. The categorical structure (years, directors, countries) provides natural grouping opportunities. The IMDB URLs demonstrate link formatting for external references. It is a dataset that makes beautiful tables almost by accident, because film data is inherently interesting to display.
The dataset updates annually as each new festival adds to the historical record. Every May, the Cannes competition announces its official selection, and those entries will appear in future versions of films. The ongoing maintenance reflects both practical utility (keeping examples current) and personal interest (following each year’s festival with the attention of a devoted fan).
C.12.4 My Letterboxd
The films dataset exists because I really enjoy movies, and that love extends well beyond festival competition entries. Below is a searchable, sortable table of every film I’ve watched (and mostly rated) on Letterboxd. The data was assembled using scripts in this book’s repository (scripts/scrape-letterboxd.R), which merge the Letterboxd data export files and fetch director information from individual film pages.
One detail worth noting is that the star ratings used fmt() for formatting rather than a pre-formatted text column. This matters for interactive tables because fmt() changes only the display while preserving the underlying numeric values. When a user clicks the Rating column header to sort, the table sorts on the original numbers (5, 4.5, 4, …) rather than on rendered text like “★★★★½”, which would sort alphabetically and produce nonsensical results. It is a small trick but an important one whenever you need sortable columns with custom formatting in opt_interactive() tables!
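The principle can be illustrated outside gt with a few lines of base R. This is a sketch of the idea only, not a description of how gt implements fmt(): the `render_stars()` helper, the data frame, and the film titles are all hypothetical. The numeric column is kept for ordering, while a separate display column carries the rendered text.

```r
# Hypothetical helper: turn a numeric rating (0.5 to 5) into star text
render_stars <- function(rating) {
  full <- floor(rating)
  has_half <- (rating - full) >= 0.5
  paste0(strrep("\u2605", full), ifelse(has_half, "\u00bd", ""))
}

# Hypothetical viewing log with numeric ratings
log <- data.frame(
  title  = c("Film A", "Film B", "Film C", "Film D"),
  rating = c(4, 5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# The display column is derived; the numeric column stays untouched
log$display <- render_stars(log$rating)

# Sorting uses the numbers, so the order never depends on how text collates
by_rating <- log[order(log$rating, decreasing = TRUE), ]
print(by_rating[, c("title", "display")])
```

How a text column of star glyphs would sort depends on the collation rules in play, which is exactly why it is safer to keep the numbers underneath and treat the stars as presentation.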
C.13 illness
The illness dataset takes a different approach than the others. Rather than modeling behavior or compiling reference data, it reproduces a single table from a published scientific article. The source is “A fatal yellow fever virus infection in China: description and lessons” from Emerging Microbes & Infections (July 2016), which documented laboratory test results for a patient who contracted yellow fever during travel to Angola.
| Test | Units | Day 3 | Day 7 | Day 9 |
|---|---|---|---|---|
| Viral load | copies per mL | 12000.00 | 760.00 | 250.00 |
| WBC | ×10⁹/L | 5.26 | 24.77 | 19.03 |
| Neutrophils | ×10⁹/L | 4.87 | 22.08 | 16.59 |
| RBC | ×10¹²/L | 5.72 | 4.12 | 3.32 |
| Hb | g/L | 153.00 | 75.00 | 95.00 |
| PLT | ×10⁹/L | 67.00 | 74.10 | 25.60 |
| ALT | U/L | 12835.00 | 1623.70 | 512.40 |
| AST | U/L | 23672.00 | 2189.00 | 782.50 |
| TBIL | µmol/L | 117.20 | 127.30 | 163.20 |
| DBIL | µmol/L | 71.40 | 117.80 | 126.30 |
The article is freely available under a Creative Commons license, making reproduction appropriate. The dataset was created specifically to test gt’s fmt_units() function and its ability to render scientific unit notation correctly. Medical laboratory results frequently include units like mL, μL, g/dL, U/L, and ×10³/μL that require careful formatting (the last of those being particularly tedious to typeset correctly without dedicated tooling). The question was whether gt could faithfully reproduce the original Table 1 from the article.
C.13.1 Reading laboratory values
Medical laboratory tests generate data that require specialized interpretation. Each test has reference ranges defining normal values, and deviations above or below those ranges signal pathology. A white blood cell count of 3.0 × 10⁹/L might indicate leukopenia (low white cells), potentially signifying infection, bone marrow problems, or medication side effects. Liver enzymes elevated beyond normal ranges suggest hepatic damage. Reading the illness dataset means tracking multiple indicators as they evolve day by day through a fatal disease progression.
| Laboratory Test Reference Ranges | |||
| Test | Units | Normal Range, Low | Normal Range, High |
|---|---|---|---|
| WBC | ×10⁹/L | 4.0 | 10.0 |
| Neutrophils | ×10⁹/L | 2.0 | 8.0 |
| RBC | ×10¹²/L | 4.0 | 5.5 |
| Hb | g/L | 120.0 | 160.0 |
| PLT | ×10⁹/L | 100.0 | 300.0 |
| ALT | U/L | 9.0 | 50.0 |
| AST | U/L | 15.0 | 40.0 |
| TBIL | µmol/L | 0.0 | 18.8 |
| DBIL | µmol/L | 0.0 | 6.8 |
| NH3 | mmol/L | 10.0 | 47.0 |
| PT | s | 9.4 | 12.5 |
| APTT | s | 25.1 | 36.5 |
The normal ranges provide context for interpreting measurements. When day 9 values fall far outside these ranges, the severity becomes apparent. Bilirubin rising dramatically indicates liver failure. Creatinine elevation signals kidney involvement. The cascade of organ dysfunction visible in sequential laboratory values explains why this case study merited publication and why it serves as a teaching resource.
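Comparing a measurement with its reference range is simple enough to sketch in base R. The values below are the day 9 results and normal ranges for five tests, transcribed from the two tables above; the data frame layout and column names (`day_9`, `low`, `high`, `flag`) are assumptions chosen for this illustration rather than the structure of the illness dataset.

```r
# Day 9 results and reference ranges transcribed from the tables above
labs <- data.frame(
  test  = c("WBC", "RBC", "Hb", "PLT", "ALT"),
  day_9 = c(19.03, 3.32, 95, 25.6, 512.4),
  low   = c(4, 4, 120, 100, 9),
  high  = c(10, 5.5, 160, 300, 50),
  stringsAsFactors = FALSE
)

# Flag each value as below, above, or within its normal range
labs$flag <- ifelse(labs$day_9 < labs$low, "low",
             ifelse(labs$day_9 > labs$high, "high", "normal"))

print(labs[, c("test", "day_9", "flag")])
```

All five tests fall outside their ranges on day 9: white cells and ALT are high, while red cells, hemoglobin, and platelets are low.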
The dataset thus serves as a benchmark: if you can recreate a published scientific table using gt, the package’s formatting capabilities are proven sufficient for real-world use. The illness data provides that proof of concept while also documenting a tragic case that contributed to medical understanding of yellow fever progression.
C.14 rx_adsl and rx_addv
These two datasets represent gt’s connection to the pharmaceutical industry, where clinical trial tables must meet rigorous standards for regulatory submission. The datasets follow CDISC (Clinical Data Interchange Standards Consortium) conventions, specifically the ADaM (Analysis Data Model) structure used throughout the pharmaceutical industry.
rx_adsl contains subject-level data (ADSL format) for 182 participants in a fictional clinical trial. rx_addv provides protocol deviation records (ADDV format) with 291 entries documenting when and how trial participants deviated from study protocols. Both datasets use standard variable names and coding conventions that pharmaceutical statisticians will immediately recognize.
| Subject ID | Age | Sex | Ethnicity | Treatment |
|---|---|---|---|---|
| GT1000 | 37 | Male | Hispanic or Latino | NA |
| GT1001 | 41 | Male | Not Hispanic or Latino | Placebo |
| GT1002 | 39 | Female | Not Hispanic or Latino | Placebo |
| GT1003 | 38 | Male | Not Hispanic or Latino | Placebo |
| GT1004 | 45 | Male | Not Hispanic or Latino | Placebo |
| GT1005 | 35 | Female | Hispanic or Latino | Placebo |
| GT1006 | 42 | Female | Not Hispanic or Latino | Placebo |
| GT1007 | 35 | Male | Not Hispanic or Latino | Placebo |
C.14.1 The language of clinical trials
Pharmaceutical data follows conventions that seem arcane to outsiders but enable precise communication among specialists. USUBJID uniquely identifies a subject across all studies from a sponsor. TRTA indicates the actual treatment received (as opposed to the treatment assigned). SAFFL flags subjects in the safety population. This vocabulary, defined by CDISC standards, appears in regulatory submissions worldwide. A statistician in Switzerland reviewing a submission from Japan knows exactly what TRTA means because the standards are universal.
| Treatment Arm Demographics | ||||
| Treatment Arm | Subjects | Mean Age | Female | Ethnicities |
|---|---|---|---|---|
| Placebo | 90 | 41.2 | 0% | 3 |
| Drug 1 | 90 | 39.2 | 0% | 3 |
| NA | 2 | 38.5 | 0% | 1 |
The treatment arms in clinical trials typically include the experimental treatment at one or more doses, a placebo or active comparator, and sometimes multiple dosing regimens. Demographic balance across arms helps ensure that observed differences reflect treatment effects rather than baseline differences. Age, sex, ethnicity, disease severity at baseline, and prior treatments all require documentation and comparison.
| Protocol Deviation Categories | |
| Category | Deviations |
|---|---|
| | 187 |
| Major | 104 |
Protocol deviations document when trial participants did not follow the study plan. Some deviations are minor (a visit occurring outside the allowed window). Others are major (taking prohibited medications, missing doses). The rx_addv dataset catalogs these deviations, enabling sensitivity analyses that exclude subjects with major violations. Regulators scrutinize deviation patterns for evidence that the trial was conducted properly and that deviations do not undermine the conclusions.
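A sensitivity analysis of this kind begins by identifying which subjects to set aside. The base R sketch below shows one way to do that, using small hypothetical data frames; the column names (`subject_id`, `arm`, `severity`) and all values are invented for illustration and are not the ADaM variable names used in rx_adsl or rx_addv.

```r
# Hypothetical subject-level records
subjects <- data.frame(
  subject_id = c("S01", "S02", "S03", "S04"),
  arm        = c("Placebo", "Drug 1", "Placebo", "Drug 1"),
  stringsAsFactors = FALSE
)

# Hypothetical deviation records (a subject may appear more than once)
deviations <- data.frame(
  subject_id = c("S02", "S03", "S03"),
  severity   = c("Minor", "Major", "Minor"),
  stringsAsFactors = FALSE
)

# Subjects with at least one major deviation
major_ids <- unique(deviations$subject_id[deviations$severity == "Major"])

# Sensitivity population: everyone without a major deviation
sensitivity_set <- subjects[!subjects$subject_id %in% major_ids, ]
print(sensitivity_set)
```

Subjects with only minor deviations stay in the analysis set; a single major deviation is enough to exclude a subject here, which is one possible rule among several a study team might adopt.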
These datasets were contributed by Alexandra Lauer as part of ongoing collaboration between gt developers and pharmaceutical industry users. The package website includes a dedicated case study article demonstrating how to create clinical tables that meet industry standards. For pharmaceutical statisticians evaluating gt for regulatory work, these datasets provide immediately relevant examples.
The inclusion of pharmaceutical data reflects gt’s ambition to serve professional communities with specialized requirements. Clinical trials generate enormous quantities of tabular output, much of it following strict formatting conventions. Having sample datasets in the standard format lowers the barrier for pharmaceutical users to adopt gt and verify that it meets their needs.
C.15 The value of curated datasets
Looking across all eighteen datasets, certain patterns emerge. Many fill gaps where public data existed but not in convenient form. The scientific datasets consolidate information scattered across journals and government pages. The films dataset creates a resource that simply did not exist before. Even the simulated pizzaplace data serves a purpose: realistic transactional data is rarely available publicly due to business confidentiality.
Other datasets reflect personal curiosity. Ontario towns. The Paris Métro. Gibraltar’s weather. Cannes films. These choices say something about my interests and the particular corners of the world that captured my attention. The datasets are better for this personal investment. Someone who cares about the Paris Métro will notice details that an uninterested compiler would miss.
For users of gt, the datasets provide reliable materials for learning and experimentation. The variety ensures that nearly any table type (financial, scientific, demographic, geographic, categorical) has relevant sample data available. The careful construction means edge cases are present: missing values, unusual formatting requirements, multi-valued fields. The documentation grounds each dataset in context that makes the data more meaningful to work with.
Datasets are infrastructure. Good ones get used for years, appearing in examples, tutorials, homework assignments, and documentation. The eighteen datasets in gt aspire to that longevity. They are not throwaway data generated to fill a requirement but carefully assembled resources intended to remain useful across many versions and use cases. The stories behind them, now recorded here, add another layer of value: not just what the data contains but why it exists and where it came from.