Using AI and LLMs to scrape sites

Web scraping has long been a popular technique for gathering data from websites, but it has never been easy: scrapers have to navigate complex page structures and keep up with site changes. For years, programmers have automated the process with tools like Puppeteer, with varying degrees of success. Large language models such as GPT-4 now offer a different approach to web scraping, one that addresses these problems and also lets us be more creative with the output.

Traditional Web Scraping

Puppeteer, for example, is a popular Node.js library that lets you control a headless Chrome or Chromium browser programmatically. With it you can click buttons, fill in forms, and navigate between pages. Puppeteer is a capable web scraping tool, but it has some drawbacks:

  • You must manually identify the elements that need to be scraped.
  • It depends on a stable website structure.
  • Updating and maintaining scraping scripts can take a lot of time.
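
To make the first two drawbacks concrete, here is a minimal sketch of selector-based extraction. Puppeteer scripts are written in JavaScript; this sketch uses Python's standard-library html.parser instead, purely to illustrate the same dependency on a hard-coded selector. The markup, the "price" class, and the "product-cost" class are all hypothetical.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collects text inside <span class="price"> elements (a hard-coded selector)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The extractor only knows about this exact tag/class combination.
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.prices


# Hypothetical markup before and after a site redesign.
OLD_PAGE = '<div><span class="price">$10</span></div>'
NEW_PAGE = '<div><p class="product-cost">$10</p></div>'

print(extract_prices(OLD_PAGE))  # ['$10']
print(extract_prices(NEW_PAGE))  # []
```

The price is still on the redesigned page, but the scraper returns nothing until someone updates the selector by hand.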

Enter LLMs and AI: A New Era of Web Scraping

By integrating AI technologies like GPT-4 into web scraping, we can get around the constraints of conventional approaches and build a more adaptable and effective process. The main benefits of using LLMs for web scraping are:

Enhanced creativity: Standard approaches concentrate on extracting particular elements, whereas LLMs can also process the result creatively. You could, for instance, use scraped data to create an entirely new piece of content, or extract an article's main ideas and summarise them.

Robustness against site redesigns: LLMs can work from the underlying semantics and context of the content rather than relying on a particular site structure or element hierarchy. This means the AI can often still extract the desired information after a site undergoes a major redesign, without the need for manual updates.

Dynamic element identification: LLMs can recognise relevant items on a web page from context and meaning rather than from an element's precise location or structure. As a result, less manual element selection is required, and the scraper is better able to adjust to changes in site design.
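
As a rough illustration of the last two points, the sketch below reduces a page to its visible text, asks a model for named fields, and parses the JSON reply. It is not the prototype described in this post, only one way the idea could be wired up. The function names, the prompt wording, and the `complete` callable (standing in for whatever LLM client you use) are all assumptions, and the stub reply is canned so the example runs without a network call.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def page_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def build_prompt(html, fields):
    # No selectors here: the model is asked for fields by name, not by location.
    return (
        "Extract the following fields from the page text and reply with "
        "a single JSON object and nothing else.\n"
        f"Fields: {', '.join(fields)}\n\n"
        f"Page text:\n{page_text(html)}"
    )


def scrape(html, fields, complete):
    """`complete` is any callable that takes a prompt string and returns the model's reply text."""
    return json.loads(complete(build_prompt(html, fields)))


def fake_llm(prompt):
    # Stand-in for a real model call (hypothetical); returns a canned reply.
    return '{"title": "Example", "price": "$10"}'


# Hypothetical page using the redesigned markup; no class names appear in the code above.
html = '<h1>Example</h1><script>var x = 1;</script><p class="product-cost">$10</p>'
print(scrape(html, ["title", "price"], fake_llm))
```

Because the prompt carries only the text and the field names, a change of tag or class name does not require a code change. Whether the model actually returns the right values is a separate question that a real pipeline would need to check.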

AI web scraper prototype

Given the above, I coded up a prototype that uses AI to scrape any URL and return the results in JSON format. Here's what it looks like:

[Image: AI web scraper demo]

You can get pretty creative with it and ask it for some transformations (which LLMs are pretty good at). Here, I'm asking it to group actors of a TV show into categories, then sort them alphabetically.

[Image: AI web scraper demo - transformations]
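
A transformation request like this can be passed to the model as a plain instruction alongside the page text. The sketch below shows one hypothetical way to phrase it, plus a small check that each group in the reply really is in alphabetical order. The function names, the prompt wording, the reply shape (category name mapped to a list of names), and the placeholder names are assumptions, not the prototype's actual format.

```python
import json


def build_transform_prompt(page_text, instruction):
    # The transformation is expressed in natural language, not in code.
    return (
        "Using only the page text below, follow the instruction and reply "
        "with a single JSON object and nothing else.\n"
        f"Instruction: {instruction}\n\n"
        f"Page text:\n{page_text}"
    )


def groups_are_sorted(reply):
    """Parse a reply shaped like {"category": ["name", ...]} and check each list is alphabetical."""
    groups = json.loads(reply)
    return all(names == sorted(names, key=str.lower) for names in groups.values())


prompt = build_transform_prompt(
    "Cast: Actor C (guest), Actor A (main), Actor B (main)",  # placeholder text
    "Group the actors into categories, then sort each group alphabetically.",
)

# Placeholder replies standing in for model output.
good_reply = '{"main": ["Actor A", "Actor B"], "guest": ["Actor C"]}'
bad_reply = '{"main": ["Actor B", "Actor A"], "guest": ["Actor C"]}'

print(groups_are_sorted(good_reply))  # True
print(groups_are_sorted(bad_reply))   # False
```

A check like this is cheap to run on every reply and catches the cases where the model ignores part of the instruction.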