
Also a user ID, which seems to be 36 base64 characters (can't have one user count for multiple votes).

Round up to 500 raw bytes per row (perhaps including time/IP and other random garbage, plus indexes), 3x replication/redundancy or something, for 6 million users each having voted on 500 videos, and you're at about 4.5TB; still some ways off from 15TB, but not insurmountably far.

(votes/user is rather tricky to get; but, as a bit of random garbage statistics math: YT gets ~5B views/day and has ~3B users; 6M downloads of the extension means ~0.2% of users use it, so 10M extension-user views/day = 15B over 4 years, or 2.5K/user; assuming a 20% vote rate (rather high, but let's say extension users care more about voting and/or watch YT more than an average person), that's 500 votes/user)
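For anyone who wants to poke at the numbers, here is the same napkin math as a small script. Every input is one of the rough guesses quoted above, not a measured value, and the variable names are made up for illustration. With these inputs the storage product comes out to about 4.5 TB.

```python
# Back-of-envelope inputs taken from the comment above (all rough guesses).
YT_VIEWS_PER_DAY = 5e9
YT_USERS = 3e9
EXTENSION_USERS = 6e6
YEARS = 4
VOTE_RATE = 0.20        # assumed fraction of views that produce a vote
BYTES_PER_ROW = 500     # deliberately generous per-row size
REPLICATION = 3

# Share of YouTube users running the extension (~0.2%).
share = EXTENSION_USERS / YT_USERS
# Views per day attributable to extension users (~10M).
ext_views_per_day = YT_VIEWS_PER_DAY * share
# Views over the whole period (~15B) and per user (~2.5K).
total_views = ext_views_per_day * 365 * YEARS
views_per_user = total_views / EXTENSION_USERS
# Votes per user at the assumed vote rate (a bit under 500).
votes_per_user = views_per_user * VOTE_RATE

# Storage estimate using the rounded figure of 500 votes per user.
ROUNDED_VOTES_PER_USER = 500
total_bytes = EXTENSION_USERS * ROUNDED_VOTES_PER_USER * BYTES_PER_ROW * REPLICATION
total_tb = total_bytes / 1e12

print(f"votes/user ~ {votes_per_user:.0f}, storage ~ {total_tb:.1f} TB")
```

Changing BYTES_PER_ROW to 100 or 200 shows how much of the estimate hangs on that one guess.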



500 bytes? A user ID couldn't be more than 8, a date is another 8, a video ID is another 8, and an IP is 16. Even if you assume there is some overhead, a database cannot possibly need more than 100 bytes per row.
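To make that concrete, here is a minimal sketch of such a packed row using Python's `struct` module. The field order and the format string are my own assumptions for illustration; nothing here reflects how the extension's database is actually laid out.

```python
import struct

# Hypothetical fixed-width row: little-endian, no padding ("<").
#   Q   = 8-byte unsigned user ID
#   q   = 8-byte signed timestamp (e.g. Unix seconds)
#   Q   = 8-byte unsigned video row ID
#   16s = 16-byte IP address
ROW_FORMAT = "<QqQ16s"

row_size = struct.calcsize(ROW_FORMAT)

# Pack one example row; the values are arbitrary placeholders.
example_row = struct.pack(ROW_FORMAT, 1, 1_700_000_000, 42, bytes(16))

print(row_size, len(example_row))
```

That is 40 bytes of payload before any per-row or index overhead the database adds on top.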


That's assuming all of those are stored in packed formats, but even then it's not that low.

I already mentioned that the user IDs are 36 base64 chars, or 27 bytes if you store them max-packed; YT video IDs are 11 base64 chars, so 66 bits, doesn't quite fit in 8 bytes (not to mention that trying to pack the video IDs would mean your db becomes useless if youtube suddenly added a new video ID format). IP needs another bit somewhere for ipv4 vs ipv6, so likely 24 bytes (or just a string could be used).
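The bit counting, as a quick sketch (the helper name is made up; it only assumes 6 bits of information per base64 character):

```python
def packed_size(n_chars: int, bits_per_char: int = 6) -> tuple[int, int]:
    """Return (bits, bytes) needed to store n_chars base64 characters
    when packed as tightly as possible."""
    bits = n_chars * bits_per_char
    # Round up to whole bytes.
    return bits, (bits + 7) // 8

user_id_bits, user_id_bytes = packed_size(36)    # 36-char user ID
video_id_bits, video_id_bytes = packed_size(11)  # 11-char video ID

print(user_id_bits, user_id_bytes)    # bits and bytes for the user ID
print(video_id_bits, video_id_bytes)  # bits and bytes for the video ID
```

So the user ID packs to 27 bytes, and the video ID needs 9 bytes if all 66 bits are treated as significant.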

Then you have some overhead for padding and string field lengths, and whatever overhead for packing the data for the disk storage (padding for all entries to stay within a page? maintaining a percentage of free space to ensure the B-tree property?). Then you have a copy of the fields in indexes, with whatever overhead those come with.

Granted, even that's probably around 200 bytes, not 500, in a reasonable db, but who's to say that the db used is a reasonable & well-configured one; of course it's possible that a bunch more metadata is stored for user trustworthiness statistics or something, or duplicated tables where relations would work.


> user IDs are 36 base64 chars, or 27 bytes if you store them max-packed

Stupid. You aren't going to have 2^216 users, why do you need that many user IDs? A 64-bit integer is already overkill.

>YT video IDs are 11 base64 chars, so 66 bits, doesn't quite fit in 8 bytes (not to mention that trying to pack the video IDs would mean your db becomes useless if youtube suddenly added a new video ID format)

Your video table has a 64-bit integral row ID. You have a column that is a foreign key to it. Join on them.
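A minimal sketch of that layout, using in-memory SQLite just because it ships with Python. The table and column names are invented for illustration, and the video ID is a placeholder string, not a real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The string video ID is stored once; everything else references the integer key.
CREATE TABLE videos (
    id    INTEGER PRIMARY KEY,      -- 64-bit integer row ID
    yt_id TEXT NOT NULL UNIQUE      -- opaque external ID, any format
);
CREATE TABLE votes (
    user_id  INTEGER NOT NULL,
    video_id INTEGER NOT NULL REFERENCES videos(id),
    vote     INTEGER NOT NULL,      -- +1 like, -1 dislike
    PRIMARY KEY (user_id, video_id) -- one vote per user per video
);
""")

# Register a video once, then look up its integer key.
conn.execute("INSERT INTO videos (yt_id) VALUES (?)", ("abcdefghijk",))
video_row_id = conn.execute(
    "SELECT id FROM videos WHERE yt_id = ?", ("abcdefghijk",)
).fetchone()[0]

# Two users dislike the same video; only integers go into the votes table.
conn.executemany(
    "INSERT INTO votes (user_id, video_id, vote) VALUES (?, ?, ?)",
    [(1, video_row_id, -1), (2, video_row_id, -1)],
)

# Join back to the string ID only when reporting.
rows = conn.execute("""
    SELECT v.yt_id, SUM(t.vote)
    FROM votes AS t JOIN videos AS v ON v.id = t.video_id
    GROUP BY v.yt_id
""").fetchall()
print(rows)
```

If the external ID format ever changed, only the `yt_id` column would be affected; the votes table would stay as it is.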

>IP needs another bit somewhere for ipv4 vs ipv6, so likely 24 bytes (or just a string could be used).

All IPv4 addresses can be encoded as IPv6 addresses so this only requires 16 bytes.
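As a sketch of what that looks like with Python's standard `ipaddress` module, using the IPv4-mapped form (`::ffff:a.b.c.d`). The two helper names are made up, and the addresses are documentation-range examples.

```python
import ipaddress

# 10 zero bytes + 0xffff, the prefix of an IPv4-mapped IPv6 address.
IPV4_MAPPED_PREFIX = b"\x00" * 10 + b"\xff\xff"

def to_16_bytes(addr: str) -> bytes:
    """Encode an IPv4 or IPv6 address string as exactly 16 bytes."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return IPV4_MAPPED_PREFIX + ip.packed
    return ip.packed

def from_16_bytes(raw: bytes):
    """Decode 16 bytes back to an IPv4Address or IPv6Address."""
    ip6 = ipaddress.IPv6Address(raw)
    # ipv4_mapped is None unless the address is in ::ffff:0:0/96.
    return ip6.ipv4_mapped or ip6

v4 = to_16_bytes("192.0.2.1")
v6 = to_16_bytes("2001:db8::1")
print(len(v4), from_16_bytes(v4))
print(len(v6), from_16_bytes(v6))
```

No separate v4/v6 flag is needed, since the mapped prefix itself marks the address as IPv4.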

>Then you have some overhead for padding and string field lengths,

None of these things are strings.

>and whatever overhead for packing the data for the disk storage (padding for all entries to stay within a page? maintaining a percentage of free space to ensure the B-tree property?).

This is pretty low overhead. It won't take you to hundreds of bytes per row.

>Granted, even that's probably around 200 bytes, not 500, in a reasonable db, but who's to say that the db used is a reasonable & well-configured one; of course it's possible that a bunch more metadata is stored for user trustworthiness statistics or something, or duplicated tables where relations would work.

If it stores IDs as strings then the DB probably won't be set up correctly either. That would be clearly wrong and wrong people are usually wrong about other things too.


Nice as it would be, your level of perfectionism is unfortunately not particularly common. It's certainly possible to do things much more efficiently, but for most purposes "it works" is enough; and by the time it stops being enough, you already have terabytes of db, and just adding more disk is much easier than the hassle of migrating the entire thing to something different.


Why not summarize every 3 months? It would allow someone to downvote the same video again every 3 months, but it's already easier to just install this extension on another profile.

It would bring the size down to under 1TB and allow the developer to go ad-free. Hope this message reaches him.
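Roughly what I mean, as a toy sketch: collapse per-user vote rows into one total per video per quarter, after which the per-user rows for that quarter could be dropped. The row shape `(user_id, video_id, vote, day)` and the function names are assumptions for illustration only.

```python
from collections import Counter
from datetime import date

def quarter_of(day: date) -> tuple[int, int]:
    """Map a date to (year, quarter), with quarters numbered 1 to 4."""
    return day.year, (day.month - 1) // 3 + 1

def summarize(votes):
    """Sum votes per (video_id, quarter), discarding who cast them."""
    totals = Counter()
    for user_id, video_id, vote, day in votes:
        totals[(video_id, quarter_of(day))] += vote
    return totals

# Placeholder rows: (user_id, video_id, vote, day).
sample = [
    (1, 42, -1, date(2023, 1, 15)),
    (2, 42, -1, date(2023, 2, 20)),
    (3, 42, +1, date(2023, 5, 1)),
    (1, 7,  -1, date(2023, 3, 31)),
]
summary = summarize(sample)
print(dict(summary))
```

The summary grows with videos times quarters rather than with users times videos, which is where the size reduction would come from.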


That'd easily result in accidental re-votes if a user watches the same video every couple of months (and such re-watchers would likely be ones that... like the video, thereby skewing data away from dislikes over time).

Especially if the extension just sends the vote status instead of only reacting on a press (which'd allow it to send forward a vote done originally on, say, mobile; don't know if it does this, but it seems like a useful thing to do).



