Show HN: I scraped 200M Shopify products to build a search engine https://ift.tt/r30G8IC

Hi HN! In December I launched an MVP for Agora here: https://ift.tt/AcIiDs9

After posting, we got thousands of users and hundreds of comments with valuable feedback from the community. I spent a couple of sleepless nights frantically pacing around my room trying to keep the product live and relatively performant. After getting some sleep, I got back to work to make the product better.

A few updates:

1. We've grown from 25 million to 200 million products on Shopify and WooCommerce. The team at WooCommerce reached out after the HN launch to help us figure out how to index their stores. Similar to Shopify, we found that there's a public file available for all stores that use WordPress and WooCommerce at [Base URL]/wp-json/wc/v1/products. For example, the file for Good Works Tractors is available here: https://ift.tt/Lv4wGD3... So I bought a list of 3.5 million active WooCommerce stores on a website called BuiltWith, adapted the product data model, and started the crawler to go down the list. We've indexed around 515k stores so far.

2. We improved the search experience. We're using Mongo to host the 200 million product records. First, we switched from Mongo Atlas Search to Typesense. After testing Typesense with our product records, we found most searches to be under 200ms. We're not storing the product images, which slows down the loading speed at times. This week, we set up a server using Paperspace to run SBERT embeddings on a GPU (I'm new to the AI workflow, so apologies if I get the lingo wrong). We quickly realized that the dimension size of the embeddings matters a lot here, given the size of the data set. The GPU is still running to process all 200 million records and we're about a week away from releasing AI-powered search.

3. We localized the user experience. There's now frontend and backend IP detection to only show users products that are 'based in' or 'ship to' their specific country.
This 'ships to' filter (the data is available for every Shopify store at the /meta.json route, like https://ift.tt/AOWjUcX ) significantly slows down the search results, but we're trying to get creative on the loading process and animation. For example, we're using revalidation in Next.js to give several pages a 'hard coded' feel while the data refreshes every 60 seconds. https://ift.tt/txeq5HD...

4. We got our first few paying customers. Store owners can sign up for free to track their store's performance on Agora. We validate that they are the store owner by making sure the email address and store URL match on sign up, and then send them an email verification link. They can upgrade to a subscription tier to 'verify' their products to get better placement in relevant search results. Additionally, they can pay to 'boost' products and guarantee that they'll show up in the first row of results. Given the high purchase-intent searches on Agora, I'm finding this to be the right business model.

The next challenge to solve: we need to improve the quality of products on Agora. There are a lot of resellers, dropshipping stores, and low quality images. Now, just because a product is sold on a reseller or dropshipping website doesn't mean it's a bad product. There are a lot of exceptions and edge cases to solve. One potential solution: we're considering coming up with an "Agora Score" that takes in several factors including the image quality, store name, brand name, website SEO, etc. to tell users how trustworthy we think the product is.

I'd love any feedback or advice. I did solve my original problem of finding 'red shoes' for my wife, but inadvertently created more problems for myself. I'm loving every minute of it though. My wife jokes that everything is now "Agora this...Agora that". Open to any advice on that as well.

https://ift.tt/cBCNwnT February 22, 2024 at 04:04AM
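
For anyone who wants to see the endpoint pattern from updates 1 and 3 concretely, here is a minimal Python sketch that turns a store's base URL into the WooCommerce products URL and the Shopify /meta.json URL mentioned above. The function names, the HTTPS default, and the handling of sub-directory installs are assumptions made for illustration, not Agora's actual crawler code. It only builds URLs and makes no network requests.

```python
from urllib.parse import urlsplit, urlunsplit

# The two paths below are the ones quoted in the post.
# Everything else in this sketch is a hypothetical helper.
WOOCOMMERCE_PRODUCTS_PATH = "/wp-json/wc/v1/products"
SHOPIFY_META_PATH = "/meta.json"


def store_endpoint(base_url: str, path: str) -> str:
    """Join a store's base URL with one of the public paths above."""
    if "://" not in base_url:
        # Assumption: bare domains from a purchased list default to HTTPS.
        base_url = "https://" + base_url
    parts = urlsplit(base_url)
    # Keep a sub-directory install (e.g. /shop), drop trailing slashes,
    # and discard any query string or fragment.
    prefix = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, prefix + path, "", ""))


def woocommerce_products_url(base_url: str) -> str:
    """URL of the public product file for a WooCommerce store."""
    return store_endpoint(base_url, WOOCOMMERCE_PRODUCTS_PATH)


def shopify_meta_url(base_url: str) -> str:
    """URL of the /meta.json route for a Shopify store."""
    return store_endpoint(base_url, SHOPIFY_META_PATH)
```

Normalizing the base URL first means a list with mixed formats (bare domains, trailing slashes, tracking parameters) maps to one canonical URL per store, which also makes de-duplicating the crawl queue easier.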
