YouTube, the largest platform for content creators, hosts an enormous amount of valuable metadata. All of it is technically public and accessible via the YouTube Data API v3. In practice, though, getting a good dataset is hard. The official API is heavily rate-limited, and I found no bulk downloads that researchers can actually access, even though such crawls exist: the Internet Archive hosts over 400K crawls since 2007, but none of them are publicly available (yet). So I took matters into my own hands and built a pretty capable crawl pipeline.
Disclaimer: Using this data for commercial or advertising purposes is against YouTube's ToS and probably illegal! Use it ethically.
Grouped by comment author ID. RFC 4180 CSV with 10,365,014,153 rows posted by 576,551,936 author accounts.
A small number of duplicate rows is possible. The first row is the header.
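Because the rows are grouped by comment author ID, the file can be processed as a stream without loading all of it into memory. Here is a minimal Python sketch of that idea. It assumes a header column named `author_id`, which is a hypothetical name: check the real header row first. It also drops exact duplicate rows within each author's group, on the assumption that duplicates repeat the whole row.

```python
import csv
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def comments_per_author(
    lines: Iterable[str],
    author_col: str = "author_id",  # hypothetical column name, adjust to the real header
) -> Iterator[Tuple[str, int]]:
    """Yield (author ID, number of distinct rows) for a CSV grouped by author."""
    reader = csv.reader(lines)       # handles RFC 4180 quoting and embedded newlines
    header = next(reader)            # first row is the header
    idx = header.index(author_col)   # position of the author ID column
    # groupby only merges adjacent rows, which is enough for a grouped file
    for author, rows in groupby(reader, key=lambda row: row[idx]):
        # a set of row tuples removes exact duplicates within one author
        yield author, len({tuple(row) for row in rows})
```

To run it against a file, open the file with `newline=""` so the `csv` module sees the original line endings, for example `open("comments.csv", newline="", encoding="utf-8")`, where `comments.csv` is a placeholder name. Memory use is bounded by the largest single author group, not by the file size.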
Coverage of authors, uploaders and videos unknown / to be aggregated. Crawl ended early 2019-11.
Unsorted crawl results (partial timestamps) also available, contact me if interested.
Just a small sample for testing. One line per channel, sorted by lines.
If this was useful to you, please consider supporting me by seeding the torrents,
This site is JS- & ad- & bs-free.
Crawling, indexing, sorting and moving terabytes around all day is not cheap 😶.