Ensuring Your Extracted Web Data Is Extensive Enough

What do you need to make sure your extracted web data is extensive enough for sound statistical analysis or reliable product matching? This post gives you guidelines on how to create extractions that meet the requirements of a complete data set, so that you can take full advantage of Sniffie’s ecommerce analytics platform.

What happens if you extract too little product data from the web?

You might be fooled into thinking that it’s always enough to extract product name and price. More often than not, however, that is not the case. Let’s consider the following example:

Website A sells a product in its electronics category called Generic HDMI Cable for $4.99. Let’s call this Product A.
Website B also happens to sell a product in its electronics category called Generic HDMI Cable, but its price is $24.99. Let’s call this Product B.

Is Website B just way overpriced? Well, it might be, but most likely its product is completely different from the one that Website A sells. Supposing the products are different and you only extract product name and price, Sniffie will still suggest a match based on the available information. In other words, the match by Sniffie AI is made using the product name alone. Because no other information is available, Sniffie will suggest that Product A matches Product B.
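The problem can be sketched in a few lines of Python. This is only an illustration of name-only matching in general, not Sniffie's actual matching logic; the function names and the record layout are assumptions made for the example.

```python
# Hypothetical sketch: a matcher that only sees the product name.
# This is NOT Sniffie's algorithm, just an illustration of the problem.

def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different names compare equal."""
    return " ".join(name.lower().split())


def name_only_match(a: dict, b: dict) -> bool:
    """Return True when the two records have the same normalized name."""
    return normalize(a["name"]) == normalize(b["name"])


# The two records from the example above, with only name and price extracted.
product_a = {"name": "Generic HDMI Cable", "price": 4.99}
product_b = {"name": "Generic HDMI Cable", "price": 24.99}

# With nothing but the name to go on, the two products look identical.
matched = name_only_match(product_a, product_b)
print(matched)
```

The price difference is visible in the data, but a price gap alone cannot tell you whether the products differ or one store is simply more expensive.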

When you manually override or confirm matches, you also lack the information needed to verify that the match is correct. Supposing you still confirm the match, you end up with the following match.

How to ensure that web data extractions have enough data?

In many cases you should try to extract all available data for each product. This includes product code, product image, description, technical details, manufacturer, brand, variant, and so on. The more data you extract from the web on any given product, the more complete your data set becomes. This allows both Sniffie and you to make better product matches.
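One way to check an extraction before relying on it is to list which of these fields are still missing from each record. The sketch below assumes records are plain dictionaries; the field names are hypothetical and simply mirror the list above.

```python
from typing import List

# Hypothetical field names mirroring the attributes listed above.
RECOMMENDED_FIELDS = (
    "name",
    "price",
    "product_code",
    "image_url",
    "description",
    "technical_details",
    "manufacturer",
    "brand",
    "variant",
)


def missing_fields(record: dict) -> List[str]:
    """Return the recommended fields that are absent or empty in the record."""
    return [
        field
        for field in RECOMMENDED_FIELDS
        if record.get(field) in (None, "", [], {})
    ]


# A record where only name and price were extracted.
sparse_record = {"name": "Generic HDMI Cable", "price": 4.99}
gaps = missing_fields(sparse_record)
print(gaps)
```

A long list of gaps is a sign that the extraction should capture more of the product page before the data is used for matching.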

Let’s continue with our example. Products A and B have the same manufacturer, Generic Cable Manufacturer. They have the same generic product picture. They are both the same length, 1 m. However, the conductive material in Product B is gold, whereas in Product A it’s copper. They also have different product codes. When you extract all this information, both you and Sniffie can be sure that Product A is not actually an exact match for Product B. They can be substitutes for each other, however. If you confirm the substitute match, you get the following match.
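The reasoning in this example can be written down as a small comparison. The rules below are assumptions chosen for illustration (equal product codes mean an exact match; same manufacturer and length suggest a substitute), not Sniffie's matching rules, and the product codes and field names are made up.

```python
# Hypothetical records with the richer attributes from the example.
# The product codes are invented placeholders.
product_a = {
    "name": "Generic HDMI Cable",
    "price": 4.99,
    "manufacturer": "Generic Cable Manufacturer",
    "length_m": 1.0,
    "conductor": "copper",
    "product_code": "CODE-A",
}
product_b = {
    "name": "Generic HDMI Cable",
    "price": 24.99,
    "manufacturer": "Generic Cable Manufacturer",
    "length_m": 1.0,
    "conductor": "gold",
    "product_code": "CODE-B",
}


def differing_fields(a: dict, b: dict) -> list:
    """Return the fields present in both records whose values differ."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


def classify(a: dict, b: dict, shared=("manufacturer", "length_m")) -> str:
    """Assumed rule set: same code -> exact; same key attributes -> substitute candidate."""
    code_a, code_b = a.get("product_code"), b.get("product_code")
    if code_a and code_a == code_b:
        return "exact"
    if all(a.get(f) is not None and a.get(f) == b.get(f) for f in shared):
        return "substitute_candidate"
    return "no_match"


differences = differing_fields(product_a, product_b)
verdict = classify(product_a, product_b)
print(differences, verdict)
```

With the extra attributes available, the comparison surfaces the conductor and product code differences, so the pair is flagged as a possible substitute for you to confirm or reject rather than as an exact match.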

Clearly, both you and Sniffie can now differentiate between the two products. Marking the pair as a substitute match allows you to include it in or exclude it from your analysis using filters. If you don’t want to set them as substitutes, no match between Product A and Product B will be made, and they are analyzed as separate products.

Use data enrichment to make sure you get all necessary data

Starting in August 2017, Sniffie provides you with a simple way to enrich your extracted data with more data. This process is called data enrichment. It allows you to easily combine additional extracted web data with already extracted web data. Data enrichment does not use your active URL quota, which makes it very easy to scrape vast amounts of extra information from any web store.
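Conceptually, enrichment is a merge of a second extraction into the first. The sketch below shows that idea on plain dictionaries keyed by product URL; it is an assumption about the general technique, not a description of how Sniffie implements the feature, and the URL and field names are placeholders.

```python
def enrich(base: list, extra: list, key: str = "url") -> list:
    """Merge extra fields into base records that share the same key.

    Existing non-empty values in the base record are kept; only missing
    or empty fields are filled in from the extra extraction.
    """
    extra_by_key = {row[key]: row for row in extra}
    enriched = []
    for row in base:
        merged = dict(row)  # copy so the original record is not modified
        for field, value in extra_by_key.get(row[key], {}).items():
            if merged.get(field) in (None, ""):
                merged[field] = value
        enriched.append(merged)
    return enriched


# First extraction: name and price only (placeholder URL).
base_rows = [
    {"url": "https://example.com/hdmi", "name": "Generic HDMI Cable", "price": 4.99},
]
# Second extraction from the same page: additional attributes.
extra_rows = [
    {"url": "https://example.com/hdmi", "manufacturer": "Generic Cable Manufacturer", "price": 5.49},
]

result = enrich(base_rows, extra_rows)
print(result)
```

In this sketch the originally extracted price is preserved and only the missing manufacturer is added; whether newer values should overwrite older ones is a design choice that depends on your use case.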
