
How to build a price scraping service

Teague Stockwell · 2022-09-24
[Image: container ship]

The web is an ocean where schools of e-commerce stores swim. This post will show you how to reel in a product's title, price, and image without standing behind a fishing pole casting one line at a time!

Cross-Origin Resource Sharing (CORS)

With the URL for a product, you might first consider sending an HTTP GET from the client's browser to retrieve the product's HTML.

const getHtml = async (url: string) => {
  const res = await fetch(url)
  if (!res.ok) {
    throw new Error(`${res.status}: ${url}`)
  }
  return res.text()
}

What may happen instead is that the server responds without an access-control-allow-origin header, or with one whose value doesn't match the origin the request was made from. In either case, the browser blocks the script from reading the response. This is a security measure that prevents websites from abusing a visitor's browser to request resources from origins they don't control.
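To see the mechanism from the other side, here is a minimal sketch of a server that opts in to cross-origin reads (the route and allowed origin are made up for illustration); scripts on any origin not echoed back in this header are blocked by the browser:

import express from 'express'

const app = express()

// Hypothetical endpoint: only scripts running on the listed origin
// may read this response; browsers block everyone else.
app.get('/product', (_req, res) => {
  res.set('Access-Control-Allow-Origin', 'https://shop.example.com')
  res.json({name: 'Ainsley Sofa', price: 589.99})
})

app.listen(3000)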

[Image: CORS error in the browser console]

Client-Side Rendered (CSR) Web Pages

Given that a user's browser can't complete the request because it's running JavaScript deployed to a different origin, let's try making the same request from a server.

<body>
  <noscript>You need to enable JavaScript to run this app.</noscript>
  <div id="root"></div>
</body>

The initial response may not contain much content if the page was meant to be rendered by the client. When a browser loads a URL, it often fires secondary AJAX requests to make the page contentful. This means that although a server can ignore CORS, the response may end up being a shell of the page without useful product metadata. When visiting an Amazon product page, for example, a browser makes over 300 requests for resources needed to make the page contentful and interactive.
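This is where driving a real browser helps. Below is a minimal sketch, assuming the playwright package is installed, that renders a CSR page until the network settles and returns the hydrated HTML:

import {chromium} from 'playwright'

// Render the page so its secondary AJAX requests run, then return
// the hydrated HTML instead of the empty shell.
const getRenderedHtml = async (url: string) => {
  const browser = await chromium.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url, {waitUntil: 'networkidle'})
    return await page.content()
  } finally {
    await browser.close()
  }
}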

Parsing Structured Data

A browser can render HTML into contentful web pages for humans, but how can a name, price, and picture be extracted from that HTML? Luckily, it is in the best interest of e-commerce stores to use structured data so search engines can display rich search results. For example, this sofa from Wayfair embeds a format of structured data in the head of the document called JSON-LD.

[Image: rich search results]
<head>
  <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Product",
      "name": "Ainsley 73.6'' Rolled Arm Sofa",
      "sku": "W004362250",
      "url": "https://www.wayfair.com/furniture/pdp/steelside-ainsley-7401-rolled-arm-sofa-w004362250.html",
      "brand": {
        "@type": "Brand",
        "name": "Steelside\u2122"
      },
      "image": "https://secure.img1-fg.wfcdn.com/im/06305386/compr-r85/1695/169530576/ainsley-736-rolled-arm-sofa.jpg",
      "description": "Form meets function with this vintage-inspired sofa. It features a solid oak frame with four dowel legs to give your space a modern vibe. This couch is also wrapped in faux leather(Suede fabric) upholstery for a lived-in look, and it comes in a rich brown hue that blends in with a variety of color schemes. Foam filling gives you extra support whenever you need some time to relax. Round arms and a cushion back round out this classically industrial design. Colors may slightly vary due to difference in monitor\u00a0settings and lighting.",
      ...rest
    }
  </script>
</head>
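Pulling that blob out by hand only takes a few lines with an HTML parser; a minimal sketch, assuming the cheerio package:

import * as cheerio from 'cheerio'

// Grab the first JSON-LD script tag and parse its contents.
const getJsonLd = (html: string) => {
  const $ = cheerio.load(html)
  const raw = $('script[type="application/ld+json"]').first().html()
  return raw ? JSON.parse(raw) : undefined
}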

Rather than hand-rolling fallbacks for every format, the HTML can be fed into metascraper, an extensible npm package that parses product information using a combination of clever fallbacks like Open Graph meta tags.

import metaScraper from 'metascraper'
import shopping from '@samirrayani/metascraper-shopping'
import metaTitle from 'metascraper-title'
import metaImage from 'metascraper-image'
import metaUrl from 'metascraper-url'
import metaDescription from 'metascraper-description'

const metascraper = metaScraper([
  shopping(),
  metaTitle(),
  metaImage(),
  metaUrl(),
  metaDescription(),
])

const getMeta = (html: string, url: string) => metascraper({html, url})
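Chaining the two helpers together might look like this, assuming the HTML was fetched server-side (or rendered as above); the exact fields returned depend on which rules match the page:

const url = 'https://www.wayfair.com/furniture/pdp/steelside-ainsley-7401-rolled-arm-sofa-w004362250.html'
const html = await getHtml(url)
const meta = await getMeta(html, url)
// e.g. {title: "Ainsley 73.6'' Rolled Arm Sofa", image: '...', ...}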

Creating an HTTP Server

There are a handful of tools that can be used to control a browser, like Puppeteer, Cypress, and Playwright. I found Playwright the best supported for this use case because it publishes a Docker image that abstracts away the environment setup for a Chromium browser. I created the npm package price-scraper for an Express server to:

  1. Visit web pages and make them contentful by rendering them inside Playwright's Chromium
  2. Parse structured data from the browser's HTML into product metadata (a rough sketch of these two steps follows the server code below)

import express, {Application} from 'express'
import {scrapePrice} from 'price-scraper/dist'

const app: Application = express()
app.use(express.json())

app.get('/scrape', async (req, res) => {
  const url = req.query.url as string

  try {
    const priceMeta = await scrapePrice({url, backOffCoefficient: 1.3})
    res.status(200).json(priceMeta)
  } catch (e) {
    const msg = e instanceof Error ? e.message : String(e)
    console.error(msg)
    res.status(500).json({msg})
  }
})

app.listen(Number(process.env.PORT ?? 3000))
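
For a sense of what scrapePrice boils down to, here is a rough, hypothetical sketch of the two steps from the list above. It is not the package's actual code, and I'm assuming backOffCoefficient scales the delay between retried page loads:

import {chromium} from 'playwright'
import metaScraper from 'metascraper'
import shopping from '@samirrayani/metascraper-shopping'

const metascraper = metaScraper([shopping()])

// Hypothetical sketch of the render-then-parse loop. The real
// package also retries failed loads, presumably growing the wait
// between attempts by backOffCoefficient.
const scrapePriceSketch = async (url: string) => {
  const browser = await chromium.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url, {waitUntil: 'networkidle'})
    const html = await page.content()
    return metascraper({html, url})
  } finally {
    await browser.close()
  }
}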

Continuous Deployment (CD)

Great! There is now an HTTP server that uses a headless browser to load a web page and parse its structured data into product metadata. Let's pack this up so many instances of it can be deployed behind an HTTPS load balancer when the repo is pushed to GitHub! I encourage you to take a look at:

This template for CD of a Node server to Azure Kubernetes Service with GitHub Actions.

[Image: CI]

And here's how I'm using this service in a social media platform to explore projects by the products used to build them.

[Image: scraping app]