We also touched on the legal question that arises whenever scraping is discussed. Scraping publicly accessible data is generally accepted and widely used across industries, from price comparison to financial data feeds to AI training sets. The areas to avoid are personal data, content behind a login, and anything that puts undue load on a site, particularly smaller, nonprofit ones.
James explains why to use Gaffa over building your own scraper
The session then moved into a walkthrough of Gaffa itself. Gaffa is a web browser automation API. You send a POST request with a URL and a list of actions, and Gaffa executes them in a real, hosted browser and returns the result. No infrastructure to manage, no proxies to configure, no bot detection to fight.
The API Playground is the best place to get started. It lets you build and test browser requests interactively, with built-in examples covering common scenarios. During the session, I walked through a live form-filling example, including enabling request recording so you can see exactly what the browser did.
James demos scraping a Wikipedia page with AI
The first full demo showed how to scrape a Wikipedia article and use it as context for an OpenAI Q&A session. The workflow is straightforward: use Gaffa's generate_markdown action to strip a page down to clean, LLM-ready text, then pass that markdown to the model with a question.
The key insight here is that markdown is a much more efficient way to feed web content into a language model than raw HTML. It removes noise while preserving the page's structure and meaning. The demo showed the model correctly answering questions about the article content and, importantly, telling us when an answer wasn't present, a behavior we prompted for explicitly.
The full example is available in the Gaffa Python Examples GitHub repository.
James demonstrates structured data extraction with parse_json
The second demo is where things get particularly powerful. Rather than asking free-form questions, parse_json lets you define a data schema and have Gaffa use an AI model to extract exactly the fields you need from any page, regardless of its structure.
In the session, I used the Python Wikipedia page as an example, extracting the title, creator, release year, summary, and key features. The schema is defined as a JSON object with named fields, types, and per-field descriptions that act as mini-prompts to guide the model.
One practical detail that came up with a real client: you can use field descriptions to enforce a specific output format, for example, specifying that a country field should return a two-letter ISO Alpha-2 code rather than whatever format appears on the page. The model handles the mapping automatically.
The same action also works on online PDFs. I demonstrated this against a hosted academic paper, extracting the title, abstract, author names, and institutional affiliations, the kind of data that varies in layout across every paper you'd encounter, making it almost impossible to extract reliably with traditional selectors. The result was a clean JSON object ready to insert directly into a database.
Both examples are available in the Gaffa Python Examples GitHub repository.
James outlines the Gaffa MLH challenges
As part of Global Hack Week, we put together a set of Gaffa challenges for attendees:
- Sign up for a Gaffa account and redeem the MLH credit code for 20ドル of free credits
- Send your first request in the API Playground
- Use a browser request to subscribe to our newsletter via the Gaffa demo site
- Extract the title, summary, and author from a Gaffa blog post using parse_json
If you're working through these and run into any issues, reach out via support, and we'll help you get unstuck.
Had a great experience with Gaffa! It was my first time doing browser automation, and sending that first API request to print an HTML page to PDF felt like magic. The step-by-step challenges made a complex topic really approachable.
— A Global Hack Week participant
A huge thank you to the MLH team, particularly Rosendo, for hosting the opportunity to present to their community. It was a genuinely great audience, full of thoughtful questions about scraping legality, dynamic sites, speed, and cost. If you were in the session or are just now finding this post, thanks for watching and reading.
If you want to try everything covered in the session, sign up for a free Gaffa account and head to the API Playground to make your first request. The demo site, Python examples, and documentation are all there waiting for you.