Gaffa @ Major League Hacking's Global Hack Week

DEV Community

We also touched on the legal question that arises whenever scraping is discussed. Scraping publicly accessible data is generally accepted and widely used across industries, from price comparison to financial data feeds to AI training sets. The areas to avoid are personal data, content behind a login, and anything that puts undue load on a site, particularly smaller, nonprofit ones.

▶️ Introducing Gaffa and the API playground

James explains why to use Gaffa over building your own scraper

The session then moved into a walkthrough of Gaffa itself. Gaffa is a web browser automation API. You send a POST request with a URL and a list of actions, and Gaffa executes them in a real, hosted browser and returns the result. No infrastructure to manage, no proxies to configure, no bot detection to fight.

The API Playground is the best place to get started. It lets you build and test browser requests interactively, with built-in examples covering common scenarios. During the session, I walked through a live form-filling example, including enabling request recording so you can see exactly what the browser did.

▶️ Demo: Scraping a webpage and asking questions with AI

James demos scraping a Wikipedia page with AI

The first full demo showed how to scrape a Wikipedia article and use it as context for an OpenAI Q&A session. The workflow is straightforward: use Gaffa's generate_markdown action to strip a page down to clean, LLM-ready text, then pass that markdown to the model with a question.

The key insight here is that markdown is a much more efficient way to feed web content into a language model than raw HTML. It removes noise while preserving the page's structure and meaning. The demo showed the model correctly answering questions about the article content and, importantly, telling us when an answer wasn't present, a behavior we prompted for explicitly.

The full example is available in the Gaffa Python Examples GitHub repository.

▶️ Demo: Extracting structured data with parse_json

James demonstrates structured data extraction with parse_json

The second demo is where things get particularly powerful. Rather than asking free-form questions, parse_json lets you define a data schema and have Gaffa use an AI model to extract exactly the fields you need from any page, regardless of its structure.

In the session, I used the Python Wikipedia page as an example, extracting the title, creator, release year, summary, and key features. The schema is defined as a JSON object with named fields, types, and per-field descriptions that act as mini-prompts to guide the model.

One practical detail that came up with a real client: you can use field descriptions to enforce a specific output format, for example, specifying that a country field should return a two-letter ISO Alpha-2 code rather than whatever format appears on the page. The model handles the mapping automatically.

The same action also works on online PDFs. I demonstrated this against a hosted academic paper, extracting the title, abstract, author names, and institutional affiliations, the kind of data that varies in layout across every paper you'd encounter, making it almost impossible to extract reliably with traditional selectors. The result was a clean JSON object ready to insert directly into a database.

Both examples are available in the Gaffa Python Examples GitHub repository.

▶️ The MLH challenges

James outlines the Gaffa MLH challenges

As part of Global Hack Week, we put together a set of Gaffa challenges for attendees:

Sign up for a Gaffa account and redeem the MLH credit code for 20ドル of free credits
Send your first request in the API Playground
Use a browser request to subscribe to our newsletter via the Gaffa demo site
Extract the title, summary, and author from a Gaffa blog post using parse_json

If you're working through these and run into any issues, reach out via support, and we'll help you get unstuck.

Had a great experience with Gaffa! It was my first time doing browser automation, and sending that first API request to print an HTML page to PDF felt like magic. The step-by-step challenges made a complex topic really approachable.
— A Global Hack Week participant

A huge thank you to the MLH team, particularly Rosendo, for hosting the opportunity to present to their community. It was a genuinely great audience, full of thoughtful questions about scraping legality, dynamic sites, speed, and cost. If you were in the session or are just now finding this post, thanks for watching and reading.

If you want to try everything covered in the session, sign up for a free Gaffa account and head to the API Playground to make your first request. The demo site, Python examples, and documentation are all there waiting for you.

Top comments (1)

theycallmeswift profile image

Swift

I'm the CEO & Co-Founder of Major League Hacking (MLH) where I'm helping the next generation of developers launch their careers.

Location

New York, NY, USA
Education

Rutgers, the State University of New Jersey
Pronouns

He, Him, His
Work

CEO & Co-Founder @ Major League Hacking (MLH)
Joined

Dec 26, 2017

• May 1

Thanks for helping to make Global Hack Week awesome and for posting about your experience on DEV! 🙌