Report generated by the Node.js script at github.com/sanfrancesco/prerendercloud-crawler
Suppose you, the webmaster, want to compile a table with a row for each page on a given domain/host.
Each row should have a screenshot of the page, and the SEO-related tags (title, h1, meta, open graph, etc.). How is this typically done?
You could tediously visit each page in your browser, take a screenshot, and then "view source" to grab the meta tags
...or issue curl requests for each page on your site
...or use various chrome extensions, commercial software, or SaaS tools
Or, you could "do it yourself": write a script to crawl your site, take screenshots, extract SEO tags, and generate a report.
What do we need to write a basic crawler?
something to download the HTML of a webpage
something to parse the HTML of a webpage (to extract the SEO tags and links)
something to coordinate the crawling of multiple pages using the links found in step 2
If you're crawling standard HTML pages, you could do this with curl and some regular expressions:
curl -s "https://headless-render-api.com" | awk -F'"' '/href="\/[^"]/{print $2}' | grep "^/" | sed 's/\?.*$//' | sort | uniq
Add a loop and you've got a basic crawler. But our use case isn't just parsing static HTML: we need to handle JavaScript apps, which don't render their content until they're loaded in a browser, so we need something more.
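For the static-HTML case, that loop might look something like the following rough Node.js sketch (assuming Node 18+ for the built-in fetch, and a crude regex for href extraction that only works on server-rendered HTML):

// crude static-HTML crawler: fetch a page, regex out same-site hrefs, repeat
const start = "https://headless-render-api.com";
const seen = new Set([start]);
const queue = [start];

(async () => {
  while (queue.length) {
    const url = queue.shift();
    const html = await (await fetch(url)).text();

    // naive href extraction; good enough for static pages only
    for (const [, href] of html.matchAll(/href="(\/[^"#?]*)"/g)) {
      const absolute = new URL(href, start).href;
      if (!seen.has(absolute)) {
        seen.add(absolute);
        queue.push(absolute);
      }
    }
  }
  console.log(seen);
})();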
How do we crawl JavaScript apps?
We need all of the above, but instead of cURL in step 1, we use a headless browser capable of waiting for a JavaScript app to finish loading before serializing the DOM to HTML.
We could do this by running a headless browser locally, like Puppeteer, and then parsing the HTML locally. We wrote a blog post on this back in 2018.
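For reference, here's a minimal sketch of that DIY approach with Puppeteer; the networkidle0 wait heuristic and file names are illustrative choices, not taken from that post:

// npm install puppeteer
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // wait for network activity to settle so the JS app has a chance to render
  await page.goto("https://example.com", { waitUntil: "networkidle0" });

  const html = await page.content(); // serialized DOM after JS has run
  const title = await page.title();

  await page.screenshot({ path: "screenshot.png" });
  await browser.close();

  console.log(title, html.length);
})();

From there you'd still have to parse the HTML and extract the tags and links yourself.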
This time around, we'll use Headless-Render-API's scrape API to do the heavy lifting: wait for the page to finish loading, serialize the DOM to HTML, parse the SEO tags and links, take a screenshot, and return the results in a single API call.
Using Headless-Render-API to scrape meta from a JavaScript app
// npm install prerendercloud
const prerendercloud = require("prerendercloud");
// set your API key via env var PRERENDER_TOKEN or the following line:
// prerendercloud.set('prerenderToken', 'mySecretToken')
const {
  meta: {
    title,
    h1,
    description,
    ogImage,
    ogTitle,
    ogType,
    ogDescription,
    twitterCard,
  },
  links,
  body, // Buffer of HTML
  screenshot, // Buffer of png file
  statusCode, // number, e.g. 200, 301/302 if redirect
  headers, // object, e.g. { 'content-type': 'text/html' }
} = await prerendercloud.scrape("https://example.com", {
  withMetadata: true,
  withScreenshot: true,
  deviceIsMobile: false, // Default: false, whether the meta viewport tag is taken into account
  followRedirects: false, // Default: false (redirects will have statusCode 301/302 and headers.location)
});
This single API call returns the various SEO-related meta tags, all of the links on the page, and a screenshot. It works for both static HTML and JavaScript apps.
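For example, if you leave followRedirects at its default of false, a redirecting URL comes back with statusCode 301/302 and the target URL in headers.location, which you can follow yourself. Here's a minimal sketch, assuming the fields behave as described in the comments above (scrapeFollowingRedirect is our own helper name, not a library function):

const prerendercloud = require("prerendercloud");

// follow a single level of redirect manually when followRedirects is false
async function scrapeFollowingRedirect(url) {
  const result = await prerendercloud.scrape(url, { withMetadata: true });

  const isRedirect = result.statusCode === 301 || result.statusCode === 302;
  if (isRedirect && result.headers.location) {
    // resolve relative Location headers against the original URL
    const target = new URL(result.headers.location, url).href;
    return prerendercloud.scrape(target, { withMetadata: true });
  }

  return result;
}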
But that's just a single page; we still need to "crawl", that is, visit every link on that page, then the links on those pages, and so on.
We can add a loop that keeps crawling as we find links. Here's an example of how to do that in Node.js using Headless-Render-API:
const prerendercloud = require("prerendercloud");

// normalize the start URL (e.g. adds a trailing slash) so it matches resolved links
const urlToScrape = new URL("https://headless-render-api.com").href;

const processedUrls = new Set([urlToScrape]);
const queue = [urlToScrape];

// loop until the queue is empty
while (queue.length) {
  const { links } = await prerendercloud.scrape(queue.shift(), {
    withMetadata: true,
  });

  links.forEach((link) => {
    // resolve relative links against the start URL and drop any #fragment
    const url = new URL(link, urlToScrape).href.split("#")[0];

    // only queue same-site links we haven't already seen
    if (url.startsWith(urlToScrape) && !processedUrls.has(url)) {
      processedUrls.add(url);
      queue.push(url);
    }
  });
}

console.log(processedUrls);
That snippet will visit every link on your site and print the full set of URLs to the console. Still, we want more: screenshots, an HTML report table, concurrency control, and handling for edge cases like redirects, non-HTML files, and relative vs. absolute links.
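One of those refinements, concurrency control, can be sketched by batching the queue and scraping a few URLs in parallel with Promise.all; the CONCURRENCY constant and crawlBatch helper below are illustrative, not part of the library or the repo:

const prerendercloud = require("prerendercloud");

const CONCURRENCY = 3; // illustrative: how many pages to scrape in parallel

// scrape up to CONCURRENCY queued URLs at once and
// return the links found across the whole batch
async function crawlBatch(queue) {
  const batch = queue.splice(0, CONCURRENCY);
  const results = await Promise.all(
    batch.map((url) => prerendercloud.scrape(url, { withMetadata: true }))
  );
  return results.flatMap((result) => result.links);
}

Swapping crawlBatch into the while loop above would replace one scrape call per iteration with a small batch of them.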
For that, see the repo: github.com/sanfrancesco/prerendercloud-crawler
Git clone and run prerendercloud-crawler
git clone git@github.com:sanfrancesco/prerendercloud-crawler.git
cd prerendercloud-crawler
npm install
HOST_TO_SCRAPE=example.com npm start
see output/crawl-results.html for a list of screenshots and meta tags per page
see output/crawl-results.json for the full data
console output will look something like:
$ PRERENDER_TOKEN="secretToken" HOST_TO_SCRAPE=headless-render-api.com npm start

scraping https://headless-render-api.com/
scraping https://headless-render-api.com/pricing
scraping https://headless-render-api.com/docs
scraping https://headless-render-api.com/support
scraping https://headless-render-api.com/blog
scraping https://headless-render-api.com/users/sign-in
scraping https://headless-render-api.com/docs/api/prerender
scraping https://headless-render-api.com/docs/api/examples
scraping https://headless-render-api.com/docs/api/usage
// skipping https://hub.docker.com/r/prerendercloud/webserver
scraping https://headless-render-api.com/users/sign-up
scraping https://headless-render-api.com/docs/api/screenshot-examples
scraping https://headless-render-api.com/docs/api/screenshot
What's next?
The prerendercloud-crawler repo is meant more as an example of how a little scripting can go a long way than as a finished product. Modify it, rewrite it, and submit a PR if you'd like.
Email us at support@headless-render-api.com with feature requests or questions.