Entering the Matrix: Robots.txt URL Sourcing

List diving and web scraping have become an art form in recent years. I’m lucky enough to chat with Aaron Lintz once in a while and absorb some advanced techniques for unlocking the secrets of the Matrix. Usually, the stuff is too advanced for me the first time he explains it.

One day, while looking for directories and attendee lists, we got to talking about data extraction and some specific syntax. He pulled up this robots.txt trick I’d never seen before. When Lintz opened the robots file and started indexing entire websites, I thought the sentinels would seek and destroy to disallow access. However, this is publicly shared information, so anyone can view the .txt files for company websites (without fear of vengeful gigantic robots). The outcome has been pretty eye-opening, particularly on the medical side of things.


Enter the Matrix of Robots.txt

So what the heck is robots.txt? It’s a plain-text file placed on many organizations’ websites to “block” or “disallow” automated bots, search engines, and web crawlers from accessing specific URLs. By reading the robots.txt file, you are looking for clues that point past the “disallow.” Typically, you can access a public sitemap (which is often an index of the entire website). Sometimes you can stumble upon something bigger. I’ve had some success (and a ton of fun) looking for indexing trends or trying promising URL variations.

Keep in mind this method requires a lot of trial and error and is best used once you’ve exhausted most of the normal outlets. This is truly meant for the deep web dives.
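If you want to skim a robots file without eyeballing the whole thing, the interesting directives are easy to pull out programmatically. Here is a minimal sketch in Python (standard library only); the sample text is illustrative, not a copy of any real site’s file:

```python
# Sketch: pull the interesting directives out of a robots.txt body.
# The sample below is made up for illustration.
sample = """\
User-agent: *
Disallow: /admin/
Disallow: /doctors.htm
Sitemap: https://www.example.org/sitemap.xml
"""

def parse_robots(text):
    """Return (disallowed paths, sitemap URLs) from robots.txt text."""
    disallows, sitemaps = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("disallow:"):
            # Split only on the first colon so URLs stay intact.
            disallows.append(line.split(":", 1)[1].strip())
        elif line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return disallows, sitemaps

disallows, sitemaps = parse_robots(sample)
print(disallows)  # ['/admin/', '/doctors.htm']
print(sitemaps)   # ['https://www.example.org/sitemap.xml']
```

The Disallow lines are your “bent rules” to try later, and the Sitemap lines are the table of contents we go after next.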


Following the White Rabbit

Start with the main company website. This typically works best with larger organizations, but anything is worth a quick look:

Let’s index a hospital (based in NY).

Go to the main website: www.mskcc.org

Add /robots.txt to the tail end of the URL: https://www.mskcc.org/robots.txt
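In code, that step is just joining the well-known path onto the site root. A quick sketch with Python’s standard library (the actual fetch is commented out since it needs network access):

```python
from urllib.parse import urljoin

def robots_url(base):
    """Build the well-known robots.txt URL for a site root."""
    return urljoin(base, "/robots.txt")

print(robots_url("https://www.mskcc.org"))
# -> https://www.mskcc.org/robots.txt

# To actually fetch the file (requires network access):
# from urllib.request import urlopen
# text = urlopen(robots_url("https://www.mskcc.org")).read().decode("utf-8")
```

Because `/robots.txt` is an absolute path, the same call works even if you start from a deeper page on the site.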



This pulls up a pretty expansive list of programming jargon (with a nice explanation of the robots file). Ignore that for now; we are looking for a sitemap, which serves as a table of contents for the website (and usually sits at the bottom of the text file).




The highlighted link is the sitemap. Plug this into your browser: https://www.mskcc.org/sitemap.xml




This sitemap leads to 31 other sitemaps. The bio clinical URL looks promising, as does the doctor file. Trial and error led us to this full list of clinical profiles:
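A sitemap index is just XML, so you can list its child sitemaps instead of clicking through each one. A sketch using Python’s built-in XML parser on a made-up two-entry index (real indexes, like the one above, list many more):

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap index -- real files list many more child sitemaps.
index_xml = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.org/sitemap-doctors.xml</loc></sitemap>
  <sitemap><loc>https://www.example.org/sitemap-bio-clinical.xml</loc></sitemap>
</sitemapindex>
"""

# The sitemaps.org schema puts every element in this namespace.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def child_sitemaps(xml_text):
    """Return every <loc> URL found in a sitemap (or sitemap index)."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(NS + "loc")]

for url in child_sitemaps(index_xml):
    print(url)
```

The same function works on a leaf sitemap too, since those also wrap each page URL in a `<loc>` tag.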





Clicking directly on a link sends us straight to a Nurse Practitioner.


Now you can use some of the cross-referencing techniques here to dig in further.


Learning Kung Fu and Trendsetting

Again, not all these searches are this clean and easy. Dead ends are common when messing around with this stuff, but you can identify certain trends when exploring sitemaps from similar organizations.

Sometimes you can try a certain URL segment and pull similar results.

For example, put this into your web browser: http://memorialhermann.org/robots.txt


Notice the following line: Disallow: /doctors.htm

Like in the Matrix, you can bend the rules a bit. Bypass the “disallow” and substitute /doctors.htm in the URL instead of /robots.txt:
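That substitution is a one-liner once you have the Disallow value. A quick sketch, assuming the path has already been pulled from the robots file:

```python
from urllib.parse import urljoin

base = "http://memorialhermann.org"
disallowed = "/doctors.htm"  # taken from the Disallow: line above

# Swap the disallowed path in where /robots.txt was:
target = urljoin(base, disallowed)
print(target)  # http://memorialhermann.org/doctors.htm
```

Each Disallow line in a robots file is a candidate for this same swap.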




We just found a Zion of MDs. All those glorious links with the doctors’ names and listed departments, just waiting to be sourced!


Speak URL, not Boolean, Neo

We can take it a step further from here. Say we need MDs based in emergency medicine. We search the matrix for “emergency medicine” and find a short list of potential candidates. Notice you’ll need to use a hyphen (-) between words in this search, since we are speaking URL, not Boolean.

Press Ctrl-F (to pull up your browser’s find function) and search: emergency-medicine
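The Ctrl-F step can also be scripted once you have a list of profile links. A sketch over a few hypothetical URLs (the names and paths below are invented for illustration):

```python
# Hypothetical profile links scraped from a doctors listing page.
links = [
    "http://memorialhermann.org/doctors/jane-doe-emergency-medicine/",
    "http://memorialhermann.org/doctors/john-roe-cardiology/",
    "http://memorialhermann.org/doctors/amy-poe-emergency-medicine-pediatrics/",
]

# Speak URL, not Boolean: the keyword is hyphenated, not quoted.
keyword = "emergency-medicine"
matches = [url for url in links if keyword in url]
for url in matches:
    print(url)
```

Swapping the keyword and re-running is faster than paging through a site’s own profile finder.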


We found our candidate, and it looks like she has experience with pediatrics as well.

This can be much easier than working through a company’s contact finder. You can see the full list of results and change keywords quickly, instead of working within a profile finder that may be more limiting or restrict total access.

Now you know a little URL kung fu. As you can see, it can land a big score if you look at a website through another doorway. I hope this helps. Happy hunting!

  • http://tomordonez.com/ Tom Ordonez

You often use robots.txt to tell bots not to crawl some parts of your site: in particular, server-related files, configuration files, and staging sites.

    But reading a large robots file can quickly give you a headache.

    The same can be done with the site: operator. For this example, you can use: site:memorialhermann.org/doctors to see all categories of doctors.

    Then use:
    site:memorialhermann.org/doctors/dermatologists to see all doctors in this category.

    If you like wild stuff you could do:
    site:*memorialhermann.org -www

  • http://www.recruiting-online.com/ Glenn Gutmacher

    Excellent post to clearly explain how to find a lot of *publicly-available* people data (so anyone concerned about crossing the line between ethical and illegal should realize you are on the safe side using this method) that Google and other search engines will not show you in their results.