Simple Web Scraping WooCommerce Store Using Google Sheets

Posted Nov 07, 2021

Successful brand promotion depends a lot on your willingness to think broadly and plan for the long term. That's why data collection is one of the essential steps in branding: even the most brilliant idea needs structured, up-to-date information.

Maybe you need to collect information for your online store, or build a campaign to promote products. Either way, you need data. As the niche market grows, there are more and more online stores and marketplaces, and therefore more competitors. Where do they get the data for their stores, such as images or product descriptions? From e-commerce platforms, of course, and the most popular one is WooCommerce, which holds the largest share among online store platforms. We'll use it as an example and collect data from it.

But first, let's talk a little about the tool we will use to collect the data. There are many tools that let you extract data from websites, and one of them is Google Sheets.

Google Sheets in Web Scraping

In fact, Google Sheets can be used as a basic web scraper. Without any knowledge of programming languages, you can gather website information such as stock prices, site analytics, and so on.

There are three functions you can use for web scraping with Google Sheets: IMPORTFEED, IMPORTHTML, and IMPORTXML. These functions extract data from a particular website based on the input you provide (see the quick examples after the list below).

  • IMPORTFEED → to gather data from the RSS feed
  • IMPORTHTML → to retrieve data from tables and lists
  • IMPORTXML → for scraping data from structured data types
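
For illustration, the first two functions might be used roughly like this (the URLs are placeholders, and the semicolon argument separator matches the locale used in the formulas later in this article):

=IMPORTHTML("https://example.com/page-with-a-table";"table";1)
=IMPORTFEED("https://example.com/feed";"items title";FALSE;10)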

In our case, the main task of getting the data is performed by the IMPORTXML function. IMPORTXML imports data from various structured data types, including XML, HTML, CSV, TSV, and RSS and Atom XML feeds. The formula has the following syntax:

=IMPORTXML(url,xpath_query)

Where:

  • = is the equals sign that starts any function in Google Sheets.
  • 'url' is the URL of the web page from which we want to retrieve data.
  • 'xpath_query' is the identifier that tells the formula exactly what to scrape.
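
For example, a minimal query that pulls the page title of the demo store used later in this article looks like this (any page with a single <title> tag would behave the same way):

=IMPORTXML("https://themes.woocommerce.com/";"//title")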

WooCommerce Store Web Scraping

So let's get back to e-commerce platform scraping. We'll take a demo store built on WooCommerce: https://themes.woocommerce.com/.

Our goal is to get the assortment of a category and the data for each product in it. We will work with a single category, but everything we do can be repeated for the whole site if necessary.

For convenience, we reserve column A of the sheet for the input data: the URL of the first category page and the following (paginated) pages:
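
For illustration, cells A2:A4 could hold URLs along these lines (the category slug here is only a guess; WooCommerce category pages are typically paginated with a /page/N/ suffix):

https://themes.woocommerce.com/homestore/product-category/furniture/
https://themes.woocommerce.com/homestore/product-category/furniture/page/2/
https://themes.woocommerce.com/homestore/product-category/furniture/page/3/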

First, we need the XPath of the links to the product pages. To get it, we open DevTools in the browser and copy the XPath of a product link.

Result:

//*[@id="main"]/div[2]/ul/li[1]/a[1]

Then we change this value a little. First, replace the double quotes with single quotes, otherwise the formula will break. Second, remove the predicate [1] from the "li" tag, because we need all the products, not only the first one. The predicate on the "a" tag should stay, otherwise the results will include the extra "add to cart" links. Finally, we add the @href attribute to our path, since our goal is the product links.

We get the final XPath query:

//*[@id='main']/div[2]/ul/li/a[1]/@href

And the formula for cell B2:

=IMPORTXML(A2;"//*[@id='main']/div[2]/ul/li/a[1]/@href")

The result is shown in the screenshot below.

We get the result for the first page. But what about the paginated pages? Simply copying the formula from cell B2 into cell B3 makes no sense: it will cause the error "Array result was not expanded because it would overwrite data in B3."

The simplest way is to merge the resulting arrays in cell B2. To do this, we list the formulas for the first, second, and third pages of the category inside curly brackets, separated by semicolons.

We get:

={IMPORTXML(A2;"//*[@id='main']/div[2]/ul/li/a[1]/@href");
IMPORTXML(A3;"//*[@id='main']/div[2]/ul/li/a[1]/@href");
IMPORTXML(A4;"//*[@id='main']/div[2]/ul/li/a[1]/@href")}

So we've got all the product URLs from the category we need; now we have to get information about each product.

We switch to any product page and use DevTools to copy the XPath of the desired item.

We'll put the product names in column C. In our case there is only one <h1> title tag on the page, so there is no point in complicating the formula by specifying the full path to the element.

<h1 itemprop="name" class="product_title entry-title">Comfortable gray bed</h1>

We write the formula in cell C2:

=IMPORTXML(B2;"//h1")

Fill in the remaining rows with the formula and get:

Now we have to get the product descriptions. They are specified in the div block with the “woocommerce-product-details__short-description” class:

<div class="woocommerce-product-details__short-description">
<p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac
turpis egestas. Vestibulum tortor quam, feugiat vitae, ultricies eget, tempor sit amet,
ante. Donec eu libero sit amet quam egestas semper. Aenean ultricies mi vitae est.
Mauris placerat eleifend leo.</p>
</div>

Copy the XPath of the block and use it in the formula, which we write in cell D2:

=IMPORTXML(B2;"//div[@class='woocommerce-product-details__short-description']")

Fortunately, in our demo store the description consists of a single paragraph, so it takes up only one cell. Very often, though, the description consists of several paragraphs, and the imported array will then take up several cells. In this case, you need to merge the values of the received array, adding a separator between them. There are two functions you can use for that: CHAR and JOIN. The first converts a numeric code into a character according to the current version of the Unicode table, and we will use it to get the newline control character. The second combines the values of an array using the specified delimiter.

As a result, you'll get a formula like this:

=JOIN(CHAR(10);
IMPORTXML(B2;"//div[@class='woocommerce-product-details__short-description']"))

Now let's get the product prices:

<p class="price">
<del>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">£</span>300.00 </span>
</del>
<ins>
<span class="woocommerce-Price-amount amount">
<span class="woocommerce-Price-currencySymbol">£</span>250.00 </span>
</ins>
</p>

For now, let's start with the paragraph that has the "price" class. We use this formula in cell E2:

=IMPORTXML(B2;"//p[@class='price']")

Copy the formula into the rows below. Result:

As shown, we got two prices for several products: a regular price and a promo price. But what if we need only one? Looking at the markup of the price paragraph, we see that the old price is inside the <del> tag and the promotional price is inside the <ins> tag. Let's refine our XPath and get:

=IMPORTXML(B2;"//div[@class='summary entry-summary']/p[@class='price']/
ins/span[@class='woocommerce-Price-amount amount']/text()")

Note that we use the text() function at the end of the XPath query, since we only need the price, without the currency symbol, which is in a separate <span> tag on the page.

If we fill in all the rows with this formula, we will get the error "Imported content is empty" instead of the price for products that have no discount. That's because those pages have no <ins> tag. Let's write an XPath query for these pages:

//div[@class='summary entry-summary']/p[@class='price']/
span[@class='woocommerce-Price-amount amount']/text()

Now we combine the two queries in one formula, using the XPath union operator "|". We get this formula:

=IMPORTXML(B2;"//div[@class='summary entry-summary']/p[@class='price']/
ins/span[@class='woocommerce-Price-amount amount']/text()|
//div[@class='summary entry-summary']/p[@class='price']/
span[@class='woocommerce-Price-amount amount']/text()")

Result:

Probably the last thing worth scraping from the product page is the product image links, so let's collect them too. First, we study the document structure to find where the address of the highest-resolution product image is located. In our case these links appear both in the Open Graph meta tags and directly in the image output on the page. We don't recommend taking the address from the Open Graph tags, because a product may have several images and the meta tags contain only the first one. Therefore, we take the images from the <img> tag, namely from its "data-large_image" attribute, because that is where the address of the original image is specified, from which the smaller-resolution versions are generated.

<img width="600" height="600"
src="https://themes.woocommerce.com/homestore/wp-content/uploads/sites/88/2015/11/bedroom-4-e1448555746616.jpg?w=600&h=600&crop=1"
class="attachment-shop_single size-shop_single wp-post-image" alt="" loading="lazy" title="Camera da letto con parquet" data-caption=""
data-src="https://themes.woocommerce.com/homestore/wp-content/uploads/sites/88/2015/11/bedroom-4-e1448555746616.jpg"
data-large_image="https://themes.woocommerce.com/homestore/wp-content/uploads/sites/88/2015/11/bedroom-4-e1448555746616.jpg"
data-large_image_width="980" data-large_image_height="1043"
srcset="https://themes.woocommerce.com/homestore/wp-content/uploads/sites/88/2015/11/bedroom-4-e1448555746616.jpg?resize=150,150 150w,
https://themes.woocommerce.com/homestore/wp-content/uploads/sites/88/2015/11/bedroom-4-e1448555746616.jpg?resize=600,600 600w"
sizes="(max-width: 600px) 100vw, 600px" />

We get the formula:

=IMPORTXML(B2;"//img/@data-large_image")

Only one image is added for each product in this example. But if there were more than one, we would use the JOIN function again to display all the image addresses in the same cell.
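
In that multi-image case, the combined formula could look something like this (a sketch that reuses the newline separator from the description column; any other delimiter would work just as well):

=JOIN(CHAR(10);IMPORTXML(B2;"//img/@data-large_image"))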

Result:

View the final version of the table here.

To Sum Up

Today, information and data are key drivers of high performance, and the ability to quickly and easily turn them into useful material can be invaluable.

This method is quite simple, reduces the likelihood of errors, and is absolutely free. However, we'd like to point out that not every site can be scraped with IMPORTXML: sites that rely on dynamic rendering, for example, will not work. In addition, with a large number of rows, refreshing the data can take a very long time.

So if you need data on a large number of products, don't hesitate to contact us and ask your questions. We take an individual approach to each customer and can not only collect the data but also solve the underlying problem. For example, we can set up tracking and monitoring of competitor prices, or of dynamic price changes for products in your online store or on a marketplace. Or we can combine the product data with competitive analytics services to determine which products are most popular and how much profit they can bring the seller.

We will always be glad to advise you and solve your problem.
