W3C standards grow at a pace of about one POSIX every 4 to 6 months

My local copy of all of the W3C specs is 3.2G and I'm still not done scraping

@sir if W3C specs become law, you'll probably have to pay to access them.

@sir are you about to write your own web browser? 🙃

@sir Don't worry, a lot of it is not implemented by any browser and won't be anytime soon.

@carl the ones web browsers don't implement have a tendency to be the smallest (aside from AP, lol)

@sir my last estimate was about 40G. But now, most of it is on github anyway

@rigo do you have a mirror available somewhere? Can you measure a few things for me?

@sir I don't have that, but I can ask Vivien whether he can do that for you

@sir Note that those measures will necessarily be imprecise, as everything moved to github a while ago. In TR there are only a few iterations of a Specification left because they work with editor's drafts on github

@sir there are no W3C standards. There are only W3C Recommendations. But maybe you mean Community Group Reports. And yes, that's wanted, but those are not Recommendations 🙂

@sir ah, ok, if you do a cvs co of everything, you get ALL the Specifications and every single iteration thereof since 1994. That may be a lot, but it isn't all relevant stuff

@rigo even 2019 alone is ~4M words of new specs, and that's a consistent figure for most years from 2010 forward

@sir The policy always was to publish every 3 months for transparency (back when the groups were member-only). Again, the word count is meaningless if you count 10 iterations of WebRTC or the iterations of XML Query with lots of examples

@rigo given that the github repos are so inconsistently organized and managed entirely by each specification's interested parties, it's not a good source, either, simply because it'd be too hard to measure consistently

@rigo open to suggestions. I deliberately made generous estimates for w3c to account for these issues

@sir I think they have a tool with an overview, but I can ask back. Send me the questions and I will see what I can do for you. If I understand your goal I can be even more helpful, because measuring prolific text production is different from measuring a corpus of rules. For the first you take all Specs, for the last only the Recommendations. There should be some Linked Data source somewhere. But I haven't put my nose into this in ages

@rigo thanks, I'll shoot you an email. BTW: even after adding too-broad filters for pulling out WDs, RECs, etc, the dataset does not change substantially in size

@sir looks normal. I suspect those are the schemata, images, illustrations, pseudo code, tests, css, js and the like. We could single something out and I can check it out and look

@rigo note I was already only including HTML, XHTML, and XML files in the count. I'm looking into removing XML as well but I don't think it's entirely fair
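
A minimal sketch of the kind of count described above, assuming a local mirror on disk. The extension set, the function names, and the choice to drop `<script>` and `<style>` content are illustrative assumptions, not the method actually used in the thread.

```python
import os
import re
from html.parser import HTMLParser

# Assumption: only these extensions are counted, matching the
# "HTML, XHTML, and XML" filter mentioned in the post.
COUNTED_EXTENSIONS = {".html", ".xhtml", ".xml"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_words_in_markup(markup):
    """Approximate word count of the visible text in one document."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    # Joining with spaces keeps adjacent block elements from merging words;
    # it can also split a word that spans inline tags, so this is a rough count.
    return len(re.findall(r"\w+", " ".join(parser.chunks)))


def count_words_in_tree(root):
    """Sum the word counts of every counted file below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in COUNTED_EXTENSIONS:
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    total += count_words_in_markup(f.read())
    return total
```

Dropping `.xml` from `COUNTED_EXTENSIONS` would show how much of the total comes from schemata and similar non-prose files.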

@rigo new approach: download only the html files found directly under the link to the latest draft/rec/spec
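
One way to read "directly under the link" is: same host, same directory as the spec's landing page, with an HTML extension. The sketch below selects such links from an already-downloaded page. The function name, the `.html`/`.xhtml` filter, and the requirement that `base_url` ends in `/` are assumptions made for illustration; fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def direct_html_links(base_url, markup):
    """Return HTML links in `markup` that sit in the same directory as base_url.

    Assumption: base_url is the directory URL of the latest draft/rec
    and ends with '/'.
    """
    parser = _LinkCollector()
    parser.feed(markup)
    parser.close()

    base = urlparse(base_url)
    found = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        target = urlparse(absolute)
        if (target.scheme, target.netloc) != (base.scheme, base.netloc):
            continue  # different site
        if target.path.rsplit("/", 1)[0] + "/" != base.path:
            continue  # parent directory or nested subdirectory
        if not target.path.lower().endswith((".html", ".xhtml")):
            continue  # stylesheets, images, scripts, and so on
        if absolute not in found:
            found.append(absolute)
    return found
```

This keeps multi-page specs together while leaving out older dated iterations, which usually live under different directories.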

@sir has the W3C ever produced a good IETF-quality standard?

@steph @sir W3C has never produced a good standard, let alone an IETF-quality standard.

@sir Who, in your estimation, actually uses that? Or is the W3C just a containment device to keep docuphiles from bothering the rest of us?
