
Creating a Command Line SEO Crawler

[image: report generated from the Node.js script at github.com/sanfrancesco/prerendercloud-crawler]

Suppose you, the webmaster, want to compile a table with a row for each page on a given domain/host.

Each row should have a screenshot of the page and the SEO-related tags (title, h1, meta description, Open Graph, etc.). How is this typically done?

You could tediously visit each page in your browser, take a screenshot, and then "view source" to grab the meta tags
...or issue curl requests for each page on your site
...or use various Chrome extensions, commercial software, or SaaS tools

Or, you could "do it yourself": write a script to crawl your site, take screenshots, extract SEO tags, and generate a report.

What do we need to write a basic crawler?

something to download the HTML of a webpage
something to parse the HTML of a webpage (to extract the SEO tags and links)
something to coordinate the crawling of multiple pages using the links found in step 2

If you're crawling standard HTML pages, you could do this with curl and some regular expressions:


curl -s "https://headless-render-api.com" | awk -F'"' '/href="\/[^"]/{print $2}' | grep "^/" | sed 's/\?.*$//' | sort | uniq

Add a loop and you've got a basic crawler. But our use case isn't just parsing static HTML: we also need to handle JavaScript apps, which don't render until they're loaded in a browser, so we need something more.
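For completeness, here is roughly what that loop could look like in Node.js (a rough sketch using Node 18+'s built-in fetch and a naive regex for link extraction, fine for a quick pass over static HTML but nothing more):

// staticCrawl.js: a rough sketch of a static-HTML crawler (no headless browser)
const seen = new Set();

async function crawl(startUrl) {
  const queue = [startUrl];
  seen.add(startUrl);

  while (queue.length) {
    const url = queue.shift();
    const html = await (await fetch(url)).text();

    // naive link extraction, roughly what the awk/grep pipeline above does
    for (const [, href] of html.matchAll(/href="(\/[^"#?]*)/g)) {
      const absolute = new URL(href, startUrl).href;
      if (!seen.has(absolute)) {
        seen.add(absolute);
        queue.push(absolute);
      }
    }
  }

  return seen;
}

crawl("https://headless-render-api.com").then((urls) => console.log(urls));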

How do we crawl JavaScript apps?

We need all of the above, but instead of cURL in step 1, we use a headless browser capable of waiting for a JavaScript app to finish loading before serializing the DOM to HTML.

We could do this by running a headless browser like Puppeteer locally and then parsing the resulting HTML ourselves. We wrote a blog post on this back in 2018.
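That approach looks roughly like the sketch below (the URL and options are only illustrative). Note that you still have to parse the serialized HTML yourself, e.g. with cheerio, to pull out links and meta tags:

// npm install puppeteer
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // wait for the network to go idle so the JavaScript app has a chance to render
  await page.goto("https://example.com", { waitUntil: "networkidle0" });

  const title = await page.title();
  const html = await page.content(); // serialized DOM, post-JavaScript
  await page.screenshot({ path: "example.png" });

  console.log(title, html.length);
  await browser.close();
})();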

This time around, we'll use Headless-Render-API's scrape API to do the heavy lifting: wait for the page to finish loading, serialize the DOM to HTML, parse the SEO tags and links, take a screenshot, and return the results in a single API call.

Using Headless-Render-API to scrape meta from a JavaScript app

// npm install prerendercloud
const prerendercloud = require("prerendercloud");
// set your API key via env var PRERENDER_TOKEN or the following line:
// prerendercloud.set('prerenderToken', 'mySecretToken')

// note: the await below assumes an async context (e.g. wrap this in an async function)
const {
  meta: {
    title,
    h1,
    description,
    ogImage,
    ogTitle,
    ogType,
    ogDescription,
    twitterCard,
  },
  links,
  body, // Buffer of HTML
  screenshot, // Buffer of png file
  statusCode, // number, e.g. 200, 301/302 if redirect
  headers, // object, e.g. { 'content-type': 'text/html' }
} = await prerendercloud.scrape("https://example.com", {
  withMetadata: true,
  withScreenshot: true,
  deviceIsMobile: false, // Default: false, whether the meta viewport tag is taken into account
  followRedirects: false, // Default: false (redirects will have statusCode 301/302 and headers.location)
});

This single API call returns the various SEO-related meta tags, all of the links, and a screenshot of the page. It works for both static HTML and JavaScript apps.
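Since body and screenshot come back as Buffers, persisting them is straightforward. Continuing from the destructured result above (the file names here are arbitrary):

const fs = require("fs");

// write the serialized HTML and the screenshot returned by scrape() to disk
fs.writeFileSync("example-com.html", body);
fs.writeFileSync("example-com.png", screenshot);

console.log(`${statusCode} ${title} (${links.length} links found)`);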

But that's just a single page; we still need to "crawl", that is, visit every link on that page, then the links on those pages, and so on.

We can add a loop to keep crawling as we find links. Here's an example of how to do that in Node.js using Headless-Render-API:

const prerendercloud = require("prerendercloud");

const urlToScrape = "https://headless-render-api.com";
const processedUrls = new Set([urlToScrape]); // mark the seed URL so we don't re-crawl it
const queue = [urlToScrape];

(async () => {
  // loop until the queue is empty
  while (queue.length) {
    const { links } = await prerendercloud.scrape(queue.shift(), {
      withMetadata: true,
    });

    links.forEach((link) => {
      // resolve relative links against the start URL and strip fragments
      const url = new URL(link, urlToScrape).href.split("#")[0];

      // only follow links on our own host that we haven't already queued
      if (url.startsWith(urlToScrape) && !processedUrls.has(url)) {
        processedUrls.add(url);
        queue.push(url);
      }
    });
  }

  console.log(processedUrls);
})();

That snippet will visit every link on your site and print the URLs to the console. Still, we want more: screenshots, an HTML table, concurrency control, and handling for edge cases like redirects, non-HTML files, and relative vs. absolute links.

For that, see the repo: github.com/sanfrancesco/prerendercloud-crawler
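To give a flavor of just one of those pieces, concurrency control, here is a minimal sketch that scrapes the queue in small batches rather than one URL at a time (the CONCURRENCY value is arbitrary, and the repo's actual implementation differs):

const prerendercloud = require("prerendercloud");

const CONCURRENCY = 3; // arbitrary batch size; tune to your rate limits
const start = "https://headless-render-api.com";
const seen = new Set([start]);
const queue = [start];

(async () => {
  while (queue.length) {
    // take up to CONCURRENCY URLs off the queue and scrape them in parallel
    const batch = queue.splice(0, CONCURRENCY);
    const results = await Promise.all(
      batch.map((url) => prerendercloud.scrape(url, { withMetadata: true }))
    );

    for (const { links } of results) {
      for (const link of links) {
        const url = new URL(link, start).href.split("#")[0];
        if (url.startsWith(start) && !seen.has(url)) {
          seen.add(url);
          queue.push(url);
        }
      }
    }
  }

  console.log(`crawled ${seen.size} pages`);
})();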

Git clone and run prerendercloud-crawler

git clone git@github.com:sanfrancesco/prerendercloud-crawler.git
cd prerendercloud-crawler
npm install
HOST_TO_SCRAPE=example.com npm start

see output/crawl-results.html for a list of screenshots and meta tags per page
see output/crawl-results.json for the full data

The console output will look something like this:

$ PRERENDER_TOKEN="secretToken" HOST_TO_SCRAPE=headless-render-api.com npm start

scraping https://headless-render-api.com/
scraping https://headless-render-api.com/pricing
scraping https://headless-render-api.com/docs
scraping https://headless-render-api.com/support
scraping https://headless-render-api.com/blog
scraping https://headless-render-api.com/users/sign-in
scraping https://headless-render-api.com/docs/api/prerender
scraping https://headless-render-api.com/docs/api/examples
scraping https://headless-render-api.com/docs/api/usage
 // skipping https://hub.docker.com/r/prerendercloud/webserver
scraping https://headless-render-api.com/users/sign-up
scraping https://headless-render-api.com/docs/api/screenshot-examples
scraping https://headless-render-api.com/docs/api/screenshot

What's next?

The prerendercloud-crawler repo is meant more as an example of how a little scripting can go a long way. Modify it, rewrite it, and submit a PR if you'd like.

Email us at support@headless-render-api.com with feature requests or questions.