While crawling a page, my search engine finds all the anchor tags and extracts the hrefs inside them. I'm currently manually reviewing every hostname I allow my web crawler to explore, because I have specific criteria:
- It must be a personal website
- It must have a robots.txt file with a content-signal of `search=yes` for the `*` agent
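As a sketch, a robots.txt file satisfying the second criterion might look like this (assuming the `Content-Signal` directive syntax from the Content Signals proposal; the exact format of the file my crawler expects may differ):

```text
User-Agent: *
Content-Signal: search=yes
Allow: /
```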
That way I am only indexing personal websites that are opting in to being indexed by a search engine. So when my web crawler explores a page, for each hyperlink it checks whether the hostname of that hyperlink has a robots.txt file. The issue is that if a page links to the same host multiple times, my web crawler would fetch the robots.txt file multiple times with no delay.
I fixed this by keeping a map of all hostnames the web crawler has explored during the current session. Whenever it finds a new hostname, it fetches the robots.txt file and parses the rules for the `*` agent. Once parsed, that object is mapped to the hostname.
```go
robotsMap := make(map[string]Robots)
```
Now, whenever a hyperlink is found on a page, the crawler first checks the map before attempting to fetch the robots.txt. If the hostname is already there, it just uses the cached rules. If a host doesn't have a robots.txt file, its hostname is added to the map with a default rules object, so the host isn't pinged over and over.
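The lookup-then-cache logic above can be sketched roughly like this (the `Robots` type, the `SearchAllowed` field, and the stubbed `fetchRobots` are hypothetical stand-ins for my actual implementation; the real fetch goes over HTTP):

```go
package main

import "fmt"

// Robots holds the parsed rules for the * agent.
// Hypothetical stand-in for the real Robots type.
type Robots struct {
	SearchAllowed bool
}

// robotsMap caches rules per hostname for the current session.
var robotsMap = make(map[string]Robots)

// fetchRobots stands in for the real HTTP fetch and parse.
// Here it simulates a host with no robots.txt file at all.
func fetchRobots(host string) (Robots, bool) {
	return Robots{}, false // not found
}

// rulesFor returns cached rules when present; otherwise it fetches
// once and caches a default object on failure, so a missing
// robots.txt is never re-fetched during the session.
func rulesFor(host string) Robots {
	if rules, ok := robotsMap[host]; ok {
		return rules
	}
	rules, found := fetchRobots(host)
	if !found {
		rules = Robots{} // default: treat the host as not opted in
	}
	robotsMap[host] = rules
	return rules
}

func main() {
	_ = rulesFor("example.com") // first sighting: fetch and cache
	_ = rulesFor("example.com") // second sighting: cache hit, no fetch
	fmt.Println(len(robotsMap)) // → 1
}
```

One consequence of caching the default object is that a host which adds a robots.txt mid-session won't be picked up until the next session, which is an acceptable trade-off here.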
You can see the exact changes in this commit.