Forbes Magazine Archive (1917–1930)
Dataset Documentation

Overview

The Forbes Magazine Archive contains the complete pre-1930 run of one of America’s most influential business publications, founded by B.C. Forbes. The dataset comprises 9,267 rows of clean, structured text, beginning with Volume 1, Number 2 (September 29, 1917).

 

📊 AI Training Readiness Report:
Overall Score: 75 / 100

Assessment Summary:

Strong candidate for RAG and historical business intelligence applications. Excellent completeness and content richness, with primary limitations in metadata quality and residual OCR artifacts.

1. Dataset Overview

Metric                  Value
Total Records           9,267
Unique Titles           6,601
Unique Issues Covered   213
Unique Authors          298
Content Type            All classified as “article”
Date Range              1917–1930 (13 years of Forbes publishing)

2. Completeness Assessment — ✅ Excellent

Field         Null/Empty Count   Completeness
Text (body)   0                  100%
Title         0                  100%
Author        0                  100%
Issue         0                  100%

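Verdict: Zero missing values across all fields. This is outstanding for a historical OCR-processed dataset and demonstrates thorough reconstruction work. Note that completeness here means every field is populated: most Author values are the placeholder “Unknown” (see section 4), so a populated field is not always an informative one.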

3. Text Quality Analysis

Metric                                        Value
Average text length                           ~8,344 characters
Median text length                            ~2,837 characters
Minimum text length                           22 characters
Maximum text length                           216,392 characters
Records < 100 characters                      1,580 (17.0%)
Records with potential OCR artifacts (~, |)   2,111 (22.8%)

Observations:

  • The majority of records contain substantive, multi-paragraph text suitable for RAG retrieval and LLM training.
  • ~17% of records are very short fragments (<100 characters) — likely advertisement headers, image captions, or OCR remnants. These may introduce noise in training scenarios.
  • ~23% of records still carry residual OCR artifacts (tilde characters, pipes), though paragraph-level reconstruction appears solid.
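The figures above can be recomputed with a few lines of Python. The following is a minimal sketch, not the method used to produce this report: it assumes each record body is available as a plain string, and the function name `text_quality_summary`, the constants, and the sample strings are all hypothetical.

```python
import statistics

# Characters treated as likely OCR artifacts, following the table above (~ and |).
OCR_ARTIFACT_CHARS = ("~", "|")
# Records shorter than this many characters count as short fragments.
SHORT_THRESHOLD = 100


def text_quality_summary(texts):
    """Return length statistics and noise counts for a list of record bodies."""
    lengths = [len(t) for t in texts]
    total = len(texts)
    short = sum(1 for n in lengths if n < SHORT_THRESHOLD)
    noisy = sum(1 for t in texts if any(c in t for c in OCR_ARTIFACT_CHARS))
    return {
        "total": total,
        "mean_length": statistics.mean(lengths),
        "median_length": statistics.median(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "short_count": short,
        "short_pct": round(100 * short / total, 1),
        "artifact_count": noisy,
        "artifact_pct": round(100 * noisy / total, 1),
    }


# Hypothetical sample records, for illustration only.
sample = [
    "PUBLISHED EVERY TWO WEEKS BY",
    "The bond market ~ steadied this week | after a sharp decline. " * 3,
    "A" * 250,
    "Investors should weigh earnings before buying. " * 5,
]
summary = text_quality_summary(sample)
print(summary)
```

Counting a record as noisy when it contains any tilde or pipe is a coarse heuristic: it will also flag legitimate uses of those characters, so the resulting share is best read as an upper bound.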

4. Metadata Quality: ⚠️ Moderate Concern

Author Field

  • 91.3% of records are attributed to “Unknown” (8,462 of 9,267)
  • Remaining author values are predominantly single-name fragments (e.g., “Charles”, “Paul”, “Fisher”), not full bylines
  • This field has limited utility for attribution-based queries or author-centric RAG retrieval
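A quick profile of the author column makes these proportions easy to re-check after any enrichment pass. The sketch below is illustrative only: it assumes author values are available as a list of strings with “Unknown” as the placeholder, and the function name `author_field_profile` and the sample value “A. B. Example” are hypothetical.

```python
def author_field_profile(authors, placeholder="Unknown"):
    """Count placeholder, single-token, and multi-token author values."""
    cleaned = [a.strip() for a in authors]
    unknown = sum(1 for a in cleaned if a == placeholder)
    # A single whitespace-delimited token suggests a name fragment, not a byline.
    single = sum(1 for a in cleaned if a != placeholder and len(a.split()) == 1)
    # Two or more tokens suggest a fuller byline; empty values fall in neither group.
    multi = sum(1 for a in cleaned if a != placeholder and len(a.split()) > 1)
    return {
        "unknown": unknown,
        "single_token": single,
        "multi_token": multi,
        "unknown_pct": round(100 * unknown / len(cleaned), 1),
    }


# Hypothetical sample; the three fragments are the examples quoted above.
sample_authors = [
    "Unknown", "Unknown", "Charles", "Paul",
    "Fisher", "A. B. Example", "Unknown", "Unknown",
]
profile = author_field_profile(sample_authors)
print(profile)

# The share reported above for the full corpus: 8,462 of 9,267 records.
reported_pct = round(100 * 8462 / 9267, 1)
print(reported_pct)
```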

Issue Field

  • Issue identifiers use a non-standard format (e.g., 1329-Louis7209-Company1922-N) rather than clean dates
  • While all 213 issues are populated, the encoding is opaque for temporal filtering without a lookup reference

Recurring/Boilerplate Content

  • 264 records are provenance/license notices (non-editorial content)
  • Recurring titles like “PUBLISHED EVERY TWO WEEKS BY” (38x), “120 FIFTH AVENUE NEW YORK, N.Y.” (31x) indicate publisher boilerplate was captured alongside editorial content
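Recurring boilerplate titles of this kind can be surfaced by counting how often each normalized title appears. This is a minimal sketch under stated assumptions: titles are plain strings, the function name `recurring_titles` is hypothetical, and the `min_count` threshold of 10 is an arbitrary choice that would need tuning against the real corpus.

```python
from collections import Counter


def recurring_titles(titles, min_count=10):
    """Return titles that repeat at least min_count times, with their counts."""
    # Collapse internal whitespace and ignore case so trivial variants group together.
    normalized = [" ".join(t.split()).upper() for t in titles]
    counts = Counter(normalized)
    return {title: n for title, n in counts.items() if n >= min_count}


# Sample built from the two recurring titles quoted above plus one
# hypothetical editorial title.
titles = (
    ["PUBLISHED EVERY TWO WEEKS BY"] * 38
    + ["120 FIFTH AVENUE NEW YORK, N.Y."] * 31
    + ["Hypothetical Article Title"]
)
candidates = recurring_titles(titles)
print(candidates)
```

A high repeat count is only a hint: recurring editorial columns may also share a title, so flagged titles are best reviewed before any records are tagged or removed.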

5. Content Richness: ✅ Strong

Topic Domain                                                   Records Mentioning   % of Corpus
Financial topics (investment, stock, bond)                     5,143                55.5%
War & military                                                 4,588                49.5%
Major business leaders (Rockefeller, Ford, Carnegie, Morgan)   2,464                26.6%

The corpus is rich in domain-specific content covering early 20th-century American capitalism, WWI-era economics, industrial organization, and investment strategy — exactly the kind of specialized knowledge that adds unique value to AI systems.

6. RAG Readiness Breakdown

Criterion                              Score    Notes
Data Completeness                      20/20    Zero nulls, all fields populated
Text Quality & Length                  13/20    Good avg. depth, but 17% short fragments + OCR noise
Metadata Quality                       8/20     Author field largely unusable; issue format non-standard
Content Richness & Topical Diversity   18/20    Excellent historical business domain coverage
Structural RAG Readiness               16/20    Title + text pairing works well for chunked retrieval
TOTAL                                  75/100
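The overall score is the plain sum of the five criteria, each scored out of 20. The short check below reproduces that arithmetic; the dictionary keys are simply the criterion names from the breakdown, and the variable names are hypothetical.

```python
# Criterion scores copied from the breakdown above; each is out of 20.
MAX_PER_CRITERION = 20
scores = {
    "Data Completeness": 20,
    "Text Quality & Length": 13,
    "Metadata Quality": 8,
    "Content Richness & Topical Diversity": 18,
    "Structural RAG Readiness": 16,
}

total = sum(scores.values())
maximum = MAX_PER_CRITERION * len(scores)
print(f"{total}/{maximum}")
```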

7. Recommendations for Improvement

  1. Filter boilerplate & license notices — Remove or tag the ~264 provenance records and recurring publisher boilerplate to reduce noise in RAG retrieval.
  2. Flag short fragments — Mark or exclude the 1,580 records under 100 characters, as these add minimal information value and may confuse embedding models.
  3. Enrich the author field — Where possible, cross-reference issue-level metadata or known Forbes contributor lists to resolve “Unknown” attributions.
  4. Normalize the issue identifier — Create a parsed date or volume/number field from the issue codes to enable temporal filtering and time-series analysis.
  5. OCR artifact cleanup pass — A targeted regex pass to remove stray tildes, pipes, and line-break artifacts would improve embedding quality.
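Recommendations 1, 2, and 5 can be combined into a single preparation step that tags records instead of deleting them. The following is a rough sketch under stated assumptions, not a tested pipeline for this dataset: it assumes each record is a dict with `title` and `text` keys, the boilerplate list contains only the two recurring titles named in this report, and the names `clean_text` and `prepare_record` are hypothetical.

```python
import re

# Records shorter than this many characters are flagged as short fragments.
SHORT_THRESHOLD = 100
# Known boilerplate titles from this report; a real list would be longer.
BOILERPLATE_TITLES = {
    "PUBLISHED EVERY TWO WEEKS BY",
    "120 FIFTH AVENUE NEW YORK, N.Y.",
}

_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")  # word split across a line break
_STRAY_CHARS = re.compile(r"[~|]")          # stray tildes and pipes
_SPACE_RUNS = re.compile(r"[ \t]{2,}")      # runs of spaces or tabs


def clean_text(text):
    """Remove common OCR debris from a record body."""
    text = _HYPHEN_BREAK.sub(r"\1\2", text)  # rejoin hyphenated line breaks
    text = _STRAY_CHARS.sub(" ", text)       # use a space so words do not fuse
    text = _SPACE_RUNS.sub(" ", text)
    return text.strip()


def prepare_record(record):
    """Return a copy of the record with cleaned text and filter flags added."""
    cleaned = clean_text(record["text"])
    title = " ".join(record["title"].split()).upper()
    return {
        **record,
        "text": cleaned,
        "is_boilerplate": title in BOILERPLATE_TITLES,
        "is_short": len(cleaned) < SHORT_THRESHOLD,
    }


# Hypothetical records, for illustration only.
fixed = clean_text("The mar-\nket ~ rose | sharply")
print(fixed)
flagged = prepare_record(
    {"title": "Published every two weeks by", "text": "120 FIFTH AVENUE"}
)
print(flagged)
```

Tagging keeps the original rows available for audit, and downstream retrieval can filter on `is_boilerplate` and `is_short`. Rejoining words at a hyphen followed by a line break will also merge genuine hyphenated compounds that happen to wrap, so that rule deserves a spot check on real text.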

Summary:

The Forbes Archive 1917–1930 dataset is a well-structured, complete text corpus with strong RAG potential for historical business intelligence applications. Its zero-null completeness and substantial article depth are major assets. The primary limitations are weak author metadata, some residual OCR artifacts, and the presence of non-editorial boilerplate. With targeted cleanup of the short fragments and boilerplate, this dataset could score in the mid-80s and serve as an excellent foundation for specialized AI applications in business history, economic research, and leadership studies.

Questions?

Return to Snowflake to inquire about this dataset.
