Also a user ID, which seems to be 36 base64 characters (can't have one user count for multiple votes).
Round up to 500 raw bytes per row (perhaps including time/IP and other random garbage, plus indexes), 3x replication/redundancy or something, and for 6 million users each having voted on 500 videos you're at about 4.5TB; still some ways off from 15TB, but not insurmountably far.
(votes/user is rather tricky to get; but, as a bit of random garbage statistics math: YT gets ~5B views/day and has ~3B users; 6M downloads of the extension means ~0.2% of users use it, so 10M extension-user views/day = 15B over 4 years, or 2.5K/user; assuming 20% vote rate (rather high but lets say extension users care more for voting and/or watch YT more than an average person), that's 500 votes/user)
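The back-of-envelope math above can be sketched out explicitly; all the inputs here are the rough figures from the comment, not measured values:

```python
# Rough estimate of total vote-row storage. Every input is a guess
# taken from the discussion, not a measured number.
views_per_day_total = 5e9        # ~5B YouTube views/day
total_users = 3e9                # ~3B YouTube users
extension_users = 6e6            # ~6M extension installs

share = extension_users / total_users              # ~0.2% of users
ext_views_per_day = share * views_per_day_total    # ~10M views/day
ext_views_4y = ext_views_per_day * 365 * 4         # ~15B views over 4 years
views_per_user = ext_views_4y / extension_users    # ~2.4K views/user
votes_per_user = views_per_user * 0.20             # ~490 votes/user

row_bytes = 500                  # generous per-row guess incl. indexes
replication = 3
total_bytes = extension_users * votes_per_user * row_bytes * replication
print(f"{total_bytes / 1e12:.1f} TB")
```

Under these assumptions the total lands in the 4-5TB range, so the 500-bytes-per-row guess alone doesn't get you to 15TB.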
500 bytes? A user ID couldn't be more than 8, a date is another 8, a video ID is another 8, and an IP is 16. Even if you assume there is some overhead, a database cannot possibly need more than 100 bytes per row.
That's assuming all of those are stored in packed formats, but even then it's not that low.
I already mentioned that the user IDs are 36 base64 chars, or 27 bytes if you store them max-packed; YT video IDs are 11 base64 chars, so 66 bits, doesn't quite fit in 8 bytes (not to mention that trying to pack the video IDs would mean your db becomes useless if youtube suddenly added a new video ID format). IP needs another bit somewhere for ipv4 vs ipv6, so likely 24 bytes (or just a string could be used).
Then you have some overhead for padding and string field lengths, and whatever overhead for packing the data for the disk storage (padding for all entries to stay within a page? maintaining a percentage of free space to ensure the B-tree property?). Then you have a copy of the fields in indexes, with whatever overhead those come with.
Granted, even that's probably around 200 bytes, not 500, in a reasonable db, but who's to say that the db used is a reasonable & well-configured one; of course it's possible that a bunch more metadata is stored for user trustworthiness statistics or something, or duplicated tables where relations would work.
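Tallying the fields discussed above with guessed overhead factors (a sketch, not a measurement; the multipliers are invented for illustration):

```python
# Rough per-row tally. Field widths come from the discussion above;
# the overhead factors are guesses, not measurements of any real DB.
fields = {
    "user_id_packed": 27,   # 36 base64 chars * 6 bits = 216 bits
    "video_id": 11,         # stored as the raw 11-char string
    "ip": 24,               # 16-byte address + version/family info, padded
    "timestamp": 8,
    "vote": 1,
}
payload = sum(fields.values())                  # 71 bytes of raw payload
with_page_overhead = payload * 1.5              # page/B-tree slack (guess)
with_indexes = with_page_overhead + 27 + 11 + 16  # field copies in indexes
print(round(with_indexes))
```

That lands in the mid-100s of bytes per row, which is the same ballpark as the "around 200, not 500" figure above.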
> user IDs are 36 base64 chars, or 27 bytes if you store them max-packed
Stupid. You aren't going to have 2^288 users, why do you need that many user IDs? A 64-bit integer is already overkill.
>YT video IDs are 11 base64 chars, so 66 bits, doesn't quite fit in 8 bytes (not to mention that trying to pack the video IDs would mean your db becomes useless if youtube suddenly added a new video ID format)
Your video table has a 64-bit integral row ID. You have a column that is a foreign key to it. Join on them.
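A minimal sketch of that layout, using SQLite with made-up table and column names: the 11-char YouTube ID is stored exactly once in a `videos` table, and each vote row carries only a 64-bit integer foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE videos (
        id INTEGER PRIMARY KEY,      -- 64-bit rowid
        yt_id TEXT UNIQUE NOT NULL   -- the 11-char YouTube ID, stored once
    );
    CREATE TABLE votes (
        user_id INTEGER NOT NULL,    -- 64-bit int, not a 36-char string
        video_id INTEGER NOT NULL REFERENCES videos(id),
        vote INTEGER NOT NULL        -- +1 like, -1 dislike
    );
""")
conn.execute("INSERT INTO videos (yt_id) VALUES ('dQw4w9WgXcQ')")
conn.execute("INSERT INTO votes VALUES (42, 1, -1)")

# Join back to the external ID only when it's actually needed:
row = conn.execute("""
    SELECT v.yt_id, t.vote
    FROM votes t JOIN videos v ON t.video_id = v.id
""").fetchone()
print(row)  # ('dQw4w9WgXcQ', -1)
```

This also sidesteps the format-change worry: if YouTube ever changes its ID format, only the `yt_id` column cares.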
>IP needs another bit somewhere for ipv4 vs ipv6, so likely 24 bytes (or just a string could be used).
All IPv4 addresses can be encoded as IPv6 addresses so this only requires 16 bytes.
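For instance, with Python's `ipaddress` module, an IPv4 address embedded in the IPv4-mapped range (`::ffff:0:0/96`) packs into the same fixed 16 bytes as a native IPv6 address, and the original IPv4 address stays recoverable:

```python
import ipaddress

v4 = ipaddress.IPv4Address("203.0.113.7")
mapped = ipaddress.IPv6Address(f"::ffff:{v4}")   # IPv4-mapped IPv6 form
v6 = ipaddress.IPv6Address("2001:db8::1")

print(len(mapped.packed), len(v6.packed))  # 16 16 -- one fixed-width column
print(mapped.ipv4_mapped)                  # 203.0.113.7 -- recoverable
```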
>Then you have some overhead for padding and string field lengths,
None of these things are strings.
>and whatever overhead for packing the data for the disk storage (padding for all entires to stay within a page? maintaining a percentage of free space to ensure the B-tree property?).
This is pretty low overhead. It won't take you to hundreds of bytes per row.
>Granted, even that's probably around 200 bytes, not 500, in a reasonable db, but who's to say that the db used used is a reasonable & well-configured one; of course it's possible that a bunch more metadata is stored for user trustworthiness statistics or something, or duplicated tables where relations would work.
If it stores IDs as strings then the DB probably won't be set up correctly either. That would be clearly wrong and wrong people are usually wrong about other things too.
Unfortunately, nice as it would be, your level of perfectionism is not particularly common; indeed, it's possible to do things much more efficiently, but for most purposes "it works" is enough; and by the time it starts to not be, you already have terabytes of db, and just adding more disk is much easier than the hassle of migrating the entire thing to something different.
Why not summarize every 3 months? It would allow someone to downvote the same video again every 3 months, but it's easier to just install this extension on another profile anyway.
It would bring the size down to under 1TB and allow the developer to go ad free. Hope this message reaches him.
That'd easily result in accidental re-votes if a user watches the same video every couple months (and such re-watchers would likely be ones that like the video, thereby skewing the data away from dislikes over time).
Especially if the extension just sends the vote status instead of only reacting on a press (which'd allow it to send forward a vote done originally on, say, mobile; don't know if it does this, but it seems like a useful thing to do).
The way the plugin works (in my simplified understanding) is that it guesses how many dislikes there are based on the like/dislike ratio of the people that have the plugin installed. So if 100 people with the plugin installed voted at a 90/10 like/dislike ratio, and the actual video has 1000 likes, it will estimate roughly 111 dislikes (1000 × 10/90). YouTube not only took away the dislike UI, but also stopped exposing the dislike count even through the API.
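That ratio extrapolation can be written down directly; this is a simplified sketch of the idea as described above, not the service's actual estimator (which is reportedly more involved):

```python
def estimate_dislikes(public_likes, ext_likes, ext_dislikes):
    """Extrapolate dislikes from extension users' like/dislike ratio.

    Simplified sketch of the idea -- the real service's estimator
    is more involved (e.g. it can blend in archived counts).
    """
    if ext_likes == 0:
        return None  # no basis for a ratio
    return public_likes * ext_dislikes / ext_likes

# 100 plugin users voting 90 likes / 10 dislikes; video shows 1000 likes:
print(round(estimate_dislikes(1000, 90, 10)))  # 111
```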
But even then, the database couldn't get that big; you'd only need a few simple tables: one that tracks every plugin user's like/dislike on each video they voted on, and another that does the aggregations. 15TB sounds crazy.
I'm not a YouTuber so I don't know what content creators can see, but it would have been smarter for them to go after the content creators that have the plugin installed instead of YouTube users; not sure why we would care about those kinds of analytics.
It's not a representative sample, so the dislike counts it shows aren't accurate; it's a bad estimate. I've also heard some content creators say they compared it with real dislike numbers and it was way off.
I wasn't really upset about the removal of the button but this add-on seems superfluous. What benefit does it give users to see how many other users of this extension disliked a video? I would understand if it helped shape your recommendations or home page feed but I'm at a loss here.
Imagine a video is recommended that is for a specific how-to search; if it has a poor rating then you can be confident it's a bad match for your search.
E.g. a plumbing video for fixing a tap with a bad rating is unlikely to actually tell you how to do so.
I mean, isn't that the point of comments? I have a hard time believing a video can have high likes without a single highly upvoted comment countering it, at least in a realistic example like household repair or something. I also tend to skim videos for content to verify or find what I need. So maybe I'm just more diligent than most.
Maybe I am missing something but how does a database which just needs to store video ID and a number become 15TB in size?