Thanks for posting this again! It's a year later and I still haven't had to touch the web scraper in production, which is great to reflect on. It seems running the YouTube command from the post still produces the exact same data too.
Did you ever make another blog post about how to choose properties working backward from the visible data on the web page to the data structure containing said data?
Searching the heap manually is not working very well. The data I want is in a (very) long list of irrelevant values within a "strings" key. It might have something to do with the data on the page that I want to scrape being rendered by JavaScript.
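For what it's worth, here is a rough sketch of one way to work backward from a visible value to the object that holds it, rather than scrolling through "strings" by hand. It assumes the usual `.heapsnapshot` JSON layout (flat `nodes` and `edges` arrays described by `snapshot.meta`, plus the `strings` table). The names `findOwners`, `Hit` and `sample` are made up for illustration, and this is not necessarily how puppeteer-heap-snapshot does it internally.

```typescript
// Only the parts of a .heapsnapshot file that this sketch reads.
interface HeapSnapshot {
  snapshot: {
    meta: {
      node_fields: string[];
      node_types: (string | string[])[];
      edge_fields: string[];
      edge_types: (string | string[])[];
    };
  };
  nodes: number[];
  edges: number[];
  strings: string[];
}

// Hypothetical result shape: which object references the matching string.
interface Hit {
  value: string;      // the matching string found in the heap
  property: string;   // key (or index) on the owner that points at it
  ownerName: string;  // owner's node name, usually its constructor name
  ownerType: string;  // owner's node type, e.g. "object" or "array"
  siblings: string[]; // all property keys on the owner
}

function findOwners(snap: HeapSnapshot, needle: string): Hit[] {
  const meta = snap.snapshot.meta;
  const nSize = meta.node_fields.length;
  const eSize = meta.edge_fields.length;
  const nType = meta.node_fields.indexOf("type");
  const nName = meta.node_fields.indexOf("name");
  const nEdgeCount = meta.node_fields.indexOf("edge_count");
  const eType = meta.edge_fields.indexOf("type");
  const eName = meta.edge_fields.indexOf("name_or_index");
  const eTo = meta.edge_fields.indexOf("to_node");
  const nodeTypeNames = meta.node_types[nType] as string[];
  const edgeTypeNames = meta.edge_types[eType] as string[];

  // Step 1: offsets (into `nodes`) of string nodes whose text contains the needle.
  const targets = new Set<number>();
  for (let n = 0; n < snap.nodes.length; n += nSize) {
    const typeName = nodeTypeNames[snap.nodes[n + nType]];
    if (!typeName.includes("string")) continue;
    if (snap.strings[snap.nodes[n + nName]].includes(needle)) targets.add(n);
  }

  // Step 2: edges are stored contiguously per node, in node order, so walking
  // the nodes while advancing one edge cursor tells us who owns each edge.
  const hits: Hit[] = [];
  let e = 0;
  for (let n = 0; n < snap.nodes.length; n += nSize) {
    const count = snap.nodes[n + nEdgeCount];
    const props: string[] = [];
    const matched: { property: string; to: number }[] = [];
    for (let i = 0; i < count; i++, e += eSize) {
      const kind = edgeTypeNames[snap.edges[e + eType]];
      const raw = snap.edges[e + eName];
      // "element" and "hidden" edges carry a numeric index, others a string index.
      const label = kind === "element" || kind === "hidden" ? String(raw) : snap.strings[raw];
      if (kind === "property") props.push(label);
      const to = snap.edges[e + eTo];
      if (targets.has(to)) matched.push({ property: label, to });
    }
    for (const m of matched) {
      hits.push({
        value: snap.strings[snap.nodes[m.to + nName]],
        property: m.property,
        ownerName: snap.strings[snap.nodes[n + nName]],
        ownerType: nodeTypeNames[snap.nodes[n + nType]],
        siblings: props,
      });
    }
  }
  return hits;
}

// Tiny hand-built snapshot (not real data): one object with two string properties.
const sample: HeapSnapshot = {
  snapshot: {
    meta: {
      node_fields: ["type", "name", "id", "self_size", "edge_count"],
      node_types: [["hidden", "array", "string", "object"], "string", "number", "number", "number"],
      edge_fields: ["type", "name_or_index", "to_node"],
      edge_types: [["context", "element", "property", "internal", "hidden"], "string_or_number", "node"],
    },
  },
  nodes: [3, 0, 1, 0, 2, /* "My Video Title" */ 2, 2, 2, 0, 0, /* "UC123" */ 2, 4, 3, 0, 0],
  edges: [2, 1, 5, 2, 3, 10],
  strings: ["Object", "title", "My Video Title", "channelId", "UC123"],
};

console.log(findOwners(sample, "UC123"));
```

The `siblings` list is the useful part: once you know which object holds the value you can see on the page, its other keys are candidates for the property names to query. On a real snapshot I'd expect extra hits from internal references, so some filtering by edge type is probably needed.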
Anecdotally I agree with you, but doesn't this blog post suggest the reverse: that clickbait does well? The model was trained on a fairly comprehensive set of HN titles and it scores clickbaity titles with a high "Good" probability, e.g. `"Beware! Uninstalling this PC game deletes your hard drive"` with a `62.0% Good prob`. There's a ton of hidden complexity involved here, but if clickbait were generally downvoted by the HN community, we should expect a low "Good" score, right?
Apologies for the shameless self-promotion here, but it was for this very problem that I built puppeteer-heap-snapshot. It decouples the scraper from the HTML; instead, we inspect the booted app’s memory. Not nearly as performant, but a lot more reliable. I wrote about it here: https://www.adriancooney.ie/blog/web-scraping-via-javascript...
Hi! Your application looks interesting! I have a question regarding your YouTube example: where do you get property names like channelId, viewCount and keywords from? Thanks
If you haven't already, give the Fall of Civilizations podcast [1] a listen. It's one of my favourites - informative, engaging and peaceful listening - about how civilizations rise and fall. Episode 8 is about the Sumerians in Iraq and might give you a picture of how these people lived (if nearly 1500 years earlier).
The History of Rome was the first podcast I ever listened to. It is such a treat to listen to. I tried to continue with The History of Byzantium and it was just not the same. So now I've picked up Revolutions, since I think it was really Mike Duncan's style I appreciated (well, that and the Romans).
If it’s rendered server-side - no. The data likely won’t be loaded into the JS heap when you visit the page (the DOM isn’t included in the heap snapshots). However, you might be in luck if the website executes JavaScript to augment the server-side rendered page. If it does, your data may be loaded into memory in a way that lets you extract it.
Do you offer historical exports of data? I’d love to create some visualisations of the housing situation in Ireland over time.