New Study Shows Up to 96% of Megaupload Files Are Infringing

A new study from Northeastern University is getting some attention, and it’s interesting to see how some people are spinning the numbers, so we decided to take a look.

“For Megaupload (MU) the researchers found that 31% of all uploads were infringing, while 4.3% of uploads were clearly legitimate. This means that with an estimated 250 million uploads, 10.75 million uploads were non-infringing. For the remaining 65% the copyrighted status was either unknown, or the raters couldn’t reach consensus.” – TorrentFreak

Simple math: 31% + 65% = 96%

5 thoughts on “New Study Shows Up to 96% of Megaupload Files Are Infringing”

  1. This is not surprising. If you look at the history of Kim Dotcom, he made his fortune in illegal activity including blatant pump and dump of stocks and other clearly illicit activities.

    Anyone with common sense would not trust the integrity of their data to someone who made their fortune with criminal activity, so it is not surprising that the bulk of the use is for illegal content as opposed to legal personal files.

  2. I’m a bit wary of summing it up like that, at least without a further breakdown of the “unknown” category into “unknown” and “disputed”. In the latter case, a prudent business should assume that such files are – in fact – infringing and a legal risk.

    Such reservations aside, I think the proportion of legal to infringing uploads is a much less interesting number than the equivalent proportions of downloads. Since Megaupload made its money from downloaders (albeit in indirect ways), I think that is a much better illustration of whether their business was criminally-minded at the root or not.

    Ceterum censeo that any business that earns money from downloads is automatically suspect and should be held to higher standards of evidence with regard to the source of its content, since it is essentially a publisher. Storage providers who charge uploaders for space and bandwidth aren’t particularly attractive “sharing” platforms – at least not for “community-minded” uploaders.

  3. I’m sorry, but this is a pretty terrible abuse of the data. You can’t simply draw this sort of inference to suit your own spin (there are other legitimate ways, but this is flat out incorrect).

    The pull quote you used has three groupings:
    1) Clearly infringing – 31%
    2) Clearly legitimate – 4.3%
    3) Unclear or unknown – 65%

    You cannot simply group the third into either of the others for your own purpose. If you use it to claim “up to 96% of files on MegaUpload were infringing!” you lay yourself open to the same argument in reverse with “up to 69.3% of files on Megaupload were legitimate!” They’re both incorrect, and abuse the data.

    Now, if you REALLY want to spin this, the best way would be to say “88% of identified files on Megaupload were infringing” (31 / (31 + 4.3) ≈ 87.8%). The other option would be to say “7.2 times as many infringing files were uploaded as legitimate files”. You still have the message which you seem to be aiming for, but you don’t abuse the data as much and don’t have to use the weasel-word phrase “up to”.
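
    To make the arithmetic above concrete, here is a minimal sketch that derives the "up to" bounds, the share among identified files, and the ratio from the three quoted percentages. The function and variable names are illustrative only and do not come from the paper. The 65% figure is used as quoted, so the three shares sum to 100.3% because of rounding in the quote.

```python
# Shares as quoted from the study, in percent of all uploads.
INFRINGING = 31.0
LEGITIMATE = 4.3
UNKNOWN = 65.0  # unknown or no consensus, as rounded in the quote


def bounds(known, unknown):
    """Return (lower, upper) bounds for a category, in percent.

    The lower bound assumes none of the unknown files belong to the
    category; the upper bound assumes all of them do.
    """
    return known, known + unknown


def share_of_identified(part, other):
    """Share of `part` among files that received a definite label."""
    return 100.0 * part / (part + other)


infringing_bounds = bounds(INFRINGING, UNKNOWN)
legitimate_bounds = bounds(LEGITIMATE, UNKNOWN)
identified_infringing = share_of_identified(INFRINGING, LEGITIMATE)
ratio = INFRINGING / LEGITIMATE

print(f"infringing: {infringing_bounds[0]:.1f}% to {infringing_bounds[1]:.1f}%")
print(f"legitimate: {legitimate_bounds[0]:.1f}% to {legitimate_bounds[1]:.1f}%")
print(f"infringing share of identified files: {identified_infringing:.1f}%")
print(f"infringing to legitimate ratio: {ratio:.1f}")
```

    Both "up to" headlines are simply the upper ends of these two ranges, which is why neither is informative on its own.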

    As to spin from the other side of the argument, there doesn’t actually seem to be that much. The TorrentFreak article focuses on the absolute values to get their point across (“New research from Boston’s Northeastern University shows that with the shutdown of Megaupload, the U.S. Government took down at least 10.75 million legitimate files”), but there doesn’t appear to be too much manipulation of the paper’s figures. They do the same “simple math” exercise that you’ve done, but they caveat it: “While unlikely, this means that in the most optimistic scenario 69.3% of the files uploaded to Megaupload could be perfectly legal. This means that the Megaupload raid could in theory have destroyed 172,500,000 million non-infringing files.”

    If you read section 5.2 in the paper you can get some more information about how the classifications were reached. To fall into infringing or legitimate there had to be agreement between the three labellers, rather than a majority rule situation (two saying infringing or legitimate and one saying unknown would result in unknown rather than a classification). The paper discusses the possibility of using the majority, but “this comes at the cost of lower confidence in the accuracy of the labels, thus we decided to retain the more conservative consensus merging for the remainder of this paper”. Grouping the unknown section with either of the other two destroys confidence in the accuracy: files that two labellers judged to be infringing (or legitimate) while the third couldn’t classify them end up lumped together with files that all three agreed were the opposite(!).
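
    The difference between the two merging rules can be sketched as follows. This is only an illustration of the description above, not the paper's actual code: the label strings and function names are made up for the example, and the majority rule shown is one plausible reading of "using the majority".

```python
from collections import Counter

# Hypothetical label values for the illustration.
INFRINGING, LEGITIMATE, UNKNOWN = "infringing", "legitimate", "unknown"


def merge_consensus(labels):
    """Keep a label only if every labeller chose it; otherwise unknown."""
    first = labels[0]
    if first != UNKNOWN and all(label == first for label in labels):
        return first
    return UNKNOWN


def merge_majority(labels):
    """Keep the label chosen by more than half the labellers; otherwise unknown."""
    label, count = Counter(labels).most_common(1)[0]
    if label != UNKNOWN and count > len(labels) / 2:
        return label
    return UNKNOWN


# Two labellers say infringing, one cannot decide.
votes = [INFRINGING, INFRINGING, UNKNOWN]
print(merge_consensus(votes))  # unknown
print(merge_majority(votes))   # infringing
```

    Under the consensus rule this file lands in the 65% bucket even though no labeller thought it was legitimate, which is why counting that whole bucket toward either side misrepresents it.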

    The problem with abusing data in a tabloid manner like this is that it immediately makes me suspicious of anything else cited elsewhere on the site.

    • We actually agree with you, Matt. Our post was intended to illustrate exactly the abuse of data you describe, which we’ve seen reported just about everywhere else. We felt the need to quickly show that the same data could be presented in the completely opposite way using the same logic.

  4. When I saw an article about this initially, the author claimed that up to 69% of Megaupload files were legal. She did this by taking the numbers they knew, 31% illegal and 4.3% legal, and then adding the remainder, 65%, to the legal total. While it is not an exact science, one can extrapolate that the remaining 65% of files will likely be in the same ratio as those studied, so I think it is fair to say that roughly 90% of their files were illegal and perhaps as much as 10% were legal.
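
    As a rough check on that extrapolation, the sketch below splits the unknown files in the same ratio as the labelled ones. The assumption that the unknown files resemble the labelled ones is the commenter's, not the study's, and the names are illustrative only.

```python
# Shares as quoted from the study, in percent of all uploads.
INFRINGING = 31.0
LEGITIMATE = 4.3
UNKNOWN = 65.0  # as rounded in the quote


def extrapolate(infringing, legitimate, unknown):
    """Split the unknown share in the same ratio as the labelled files.

    Returns (infringing_pct, legitimate_pct) of all files, assuming the
    unknown files look like the labelled ones.
    """
    identified = infringing + legitimate
    total = identified + unknown
    infringing_total = infringing + unknown * infringing / identified
    legitimate_total = legitimate + unknown * legitimate / identified
    return 100.0 * infringing_total / total, 100.0 * legitimate_total / total


infringing_pct, legitimate_pct = extrapolate(INFRINGING, LEGITIMATE, UNKNOWN)
print(f"extrapolated infringing: {infringing_pct:.1f}%")
print(f"extrapolated legitimate: {legitimate_pct:.1f}%")
```

    Under that assumption the split comes out near 88% to 12%, the same as the share among identified files, so the 90/10 figure is a rounding of it.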

