I do not have any blog entry worth sharing. I am running an "ethical web scraper". I do not think I can speak about processes; I may just be lacking knowledge. I think it is more about "experience" than process.
I know that there are already spiders and metadata processing packages for Python, but I like having control over the process.
Old man yelling at the cloud. I also hate:
- being blocked with a 403 because my user agent is not "mainstream". Why do I have to use an undetected Chrome to read some RSS feeds? Why can't I use third-party clients? The content can have adverts. I just want my own layout and buttons
- RSS feeds protected by Cloudflare, so tools cannot read the feeds easily
- not using, or outright blocking, RSS functionality in WordPress. Some sites could be more open that way, but no: the RSS feeds are closed or removed
- sites that have a "/blog" location while the main domain is empty, nearly empty, or returns 404. Can I trust such a location?
- when HTML metadata is not available. I like YouTube: it allows me to scrape metadata, but it protects the video contents, and that is good
- weird redirects. The domain does not have any content and does not describe what it is. It just has JavaScript redirects from the main domain to some weird locations within the domain
- URL shorteners and vanity links. You do not know where you will be transported. I understand they are counting sheep, but they sacrifice my security
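
For the shortener and redirect complaints, here is a rough sketch of how one could look at where a link leads before actually visiting it. It is only an illustration, not what my scraper does: the names `head_location` and `redirect_chain`, the user agent string and the hop limit are all made up for the example. It uses only the Python standard library and sends HEAD requests without following redirects automatically.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "not handled"; urllib then raises HTTPError.
        return None


def head_location(url, user_agent="example-feed-reader/0.1"):
    """Send one HEAD request; return the Location header, or None if there is no redirect.

    The user agent string is a placeholder (assumption), not a real client.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with opener.open(req, timeout=10):
            return None  # 2xx answer: this is the final destination
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise  # e.g. 403 or 404: let the caller decide what to do


def redirect_chain(url, get_location=head_location, max_hops=10):
    """Return every URL visited, starting with `url`, following at most `max_hops` redirects."""
    chain = [url]
    while len(chain) <= max_hops:
        location = get_location(chain[-1])
        if location is None:
            break
        next_url = urljoin(chain[-1], location)  # Location may be relative
        if next_url in chain:
            break  # redirect loop, stop here
        chain.append(next_url)
    return chain
```

Calling `redirect_chain("https://example.com/short")` would return the list of hops, which can be checked against a blocklist before opening anything. This only sees HTTP-level redirects. JavaScript redirects like the ones I complained about above are invisible to it, because the server just answers 200 and the redirect happens in the page script.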