YouTube, the largest platform for content creators, hosts an enormous amount of valuable metadata. All of it is technically public and accessible via the YouTube Data API v3. In practice, though, getting a good dataset is hard. The official API is heavily rate-limited, and I found no bulk downloads that researchers can access, even though such crawls exist: the Internet Archive hosts over 400K crawls made since 2007, but none of them are publicly available (yet). So I took matters into my own hands and built a fairly capable crawl pipeline.



Disclaimer: Using this data for commercial or advertising purposes is against YouTube's ToS and probably illegal! Use it ethically.

10 Billion Comments (CSV)

Grouped by comment author ID. RFC 4180 CSV with 10,365,014,153 rows posted by 576,551,936 author accounts. A small number of duplicates is possible. The first row is the header.


Columns: author_id, id, video_id, parent_id, crawled_at, likes, replies, author, content

Coverage of authors, uploaders and videos is unknown and still to be aggregated. The crawl ended in early November 2019.
Unsorted crawl results (with partial timestamps) are also available; contact me if interested.
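
Because the file is grouped by comment author ID, it can be processed one author at a time without loading everything into memory. The following is a minimal sketch of that idea, not part of the crawl pipeline itself. It assumes comma-delimited fields as in RFC 4180 and the column names listed above; the function names and the sample rows are invented for illustration, and the crawled_at values are placeholders since its format is not documented here.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def iter_authors(lines):
    """Yield (author_id, rows) for each run of consecutive rows with the same author_id.

    `lines` is any iterable of text lines, e.g. a file opened with newline="".
    """
    reader = csv.DictReader(lines)  # the first row is used as the header
    for author_id, rows in groupby(reader, key=itemgetter("author_id")):
        yield author_id, list(rows)


def comment_counts(lines):
    """Count comments per author, tolerating an author appearing in more than one run."""
    counts = {}
    for author_id, rows in iter_authors(lines):
        counts[author_id] = counts.get(author_id, 0) + len(rows)
    return counts


# Invented sample rows, NOT real data from the dataset.
SAMPLE = (
    "author_id,id,video_id,parent_id,crawled_at,likes,replies,author,content\r\n"
    'a1,c1,v1,,0,3,1,Alice,"Nice video, thanks!"\r\n'
    'a1,c2,v2,,0,0,0,Alice,"Line one\r\nline two"\r\n'
    "b2,c3,v1,c1,0,1,0,Bob,Agreed\r\n"
)

if __name__ == "__main__":
    # newline="" keeps line breaks inside quoted fields intact for the csv module.
    print(comment_counts(io.StringIO(SAMPLE, newline="")))
```

For the real file, something like `open(path, newline="", encoding="utf-8")` would replace the `io.StringIO` object; the UTF-8 encoding is an assumption. With hundreds of millions of authors, a per-author dictionary like the one in `comment_counts` will not fit comfortably in memory, so in practice each group from `iter_authors` is better handled and discarded as it arrives.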

100 Million Channel Names (TXT)

Just a small sample for testing. One channel name per line, with the lines in sorted order.

