+1 for shared sentiments. Do you have a blog post(s) that explains your setup wi...

renegat0x0 · on May 28, 2024

I do not have any blog entry worth sharing. I am running an "ethical web scraper". I do not think I can speak about process; I may simply be lacking the knowledge. I think it is more about "experience" than process.

The web crawling core is in this file: https://github.com/rumca-js/Django-link-archive/blob/main/rs...

Some more project-specific things are in https://github.com/rumca-js/Django-link-archive/blob/main/rs...

I know that there are already spiders and metadata processing packages for Python, but I like having control over the process.

Old man yelling at the cloud. I also hate:

- being blocked with a 403 because my user agent is not "mainstream". Why do I have to use undetected Chrome to read some RSS feeds? Why can't I use third-party clients? The content can still carry adverts. I just want my own layout and buttons

- RSS feeds protected with Cloudflare, so tools cannot read the feeds easily

- not using, or outright blocking, RSS functionality in WordPress. Some sites could be more open that way, but no: the RSS feeds are closed or removed

- some sites have a "/blog" location, but the main domain is empty, or nearly empty, or returns 404. Can I trust such a location?

- when HTML metadata is not available. I like YouTube: it lets me scrape metadata but protects the video content, and that is good

- weird redirects. The domain has no content and does not describe what it is. It just has JavaScript redirects from the main domain to some weird locations within the domain

- URL shorteners and vanity links. You do not know where you will be taken. I understand they are counting sheep, but they sacrifice my security

- Google returning links in the form "https://www.google.com/url" instead of directly. YouTube does the same with "https://www.youtube.com/redirect". For me this is again a vulnerability
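
For the last point, a crawler can unwrap such links before following them. This is only a minimal sketch, not code from the linked repository: the function name `unwrap_redirect` and the `WRAPPERS` table are made up for illustration, and it assumes the real target is carried in the `q` (or, for Google, sometimes `url`) query parameter, which may change at any time.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: wrapper endpoints and the query parameters that hold the target.
# These are observed conventions, not a documented contract.
WRAPPERS = {
    ("www.google.com", "/url"): ("q", "url"),
    ("www.youtube.com", "/redirect"): ("q",),
}


def unwrap_redirect(link: str) -> str:
    """Return the wrapped target URL, or the link unchanged if not a wrapper."""
    parts = urlsplit(link)
    param_names = WRAPPERS.get((parts.netloc.lower(), parts.path))
    if param_names is None:
        return link  # not a known wrapper, leave it alone

    query = parse_qs(parts.query)  # also percent-decodes the values
    for name in param_names:
        values = query.get(name)
        # Only accept plain http(s) targets; anything else stays wrapped.
        if values and values[0].startswith(("http://", "https://")):
            return values[0]
    return link


if __name__ == "__main__":
    print(unwrap_redirect(
        "https://www.youtube.com/redirect?event=video_description"
        "&q=https%3A%2F%2Fexample.org%2F"
    ))
```

The scheme check is deliberate: if the parameter holds something other than an http(s) URL, the link is returned as-is so the caller can decide whether to trust it.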

The results of my ethical web scraper are published at https://github.com/rumca-js/Internet-Places-Database.

0x445442 · on May 29, 2024

Thanks for the info.