diff --git a/AGENTS.md b/AGENTS.md index 54bde2a05..9bdb876b0 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -143,8 +143,9 @@ src-layout reorg): - `src/browser_harness/*.py` (`daemon.py`, `admin.py`, `helpers.py`, `run.py`, `_ipc.py`) — protected. Pull verbatim. If behavior change is needed, upstream a PR to `browser-use/browser-harness`. -- `interaction-skills/`, `agent-workspace/domain-skills/` — verbatim. - Never edit. +- `interaction-skills/` — verbatim. Never edit. +- `(agent-workspace/)?domain-skills/` — **excluded** from vendored tree. + Sync agents skip these paths; see UPSTREAM.md §3 "Excluded paths". Sync workflow lives in `harness-sync.md`. diff --git a/UPSTREAM.md b/UPSTREAM.md index 8f418a3d5..940ade3cb 100644 --- a/UPSTREAM.md +++ b/UPSTREAM.md @@ -91,21 +91,38 @@ Each upstream has its own append-only table. Add a row every time you pull. --- -## 3. Harness divergences +## 3. Harness divergences and excluded paths -Per-file record of where `packages/bcode-browser/harness/` deliberately differs from upstream. Read this *before* a sync diff so intentional differences aren't mistaken for missing features. +Per-file record of where `packages/bcode-browser/harness/` deliberately differs from upstream, plus the list of paths excluded from the vendored tree entirely. Read this *before* a sync diff so intentional differences aren't mistaken for missing features and excluded paths aren't accidentally re-imported. Path-allowlist policy (decisions.md §3.7, §4.5; updated for upstream PR #229 src-layout reorg): - `agent-workspace/agent_helpers.py` — editable; primary BrowserCode extension surface. Divergences expected. - `src/browser_harness/*.py` (`daemon.py`, `admin.py`, `helpers.py`, `run.py`, `_ipc.py`) — protected. Pulled verbatim from upstream. If behavior change is needed, upstream a PR to `browser-use/browser-harness`. -- `interaction-skills/`, `agent-workspace/domain-skills/` — verbatim from upstream. We never edit these. +- `interaction-skills/` — verbatim from upstream. We never edit these. +- `(agent-workspace/)?domain-skills/` — **excluded.** See "Excluded paths" below. - Other files (`pyproject.toml`, `LICENSE`, `README.md`, etc.) — divergence allowed but discouraged. +### Excluded paths + +Upstream paths the vendored tree treats as if they don't exist. Sync agents skip them; the diff checker filters them out. The runtime guard in `helpers.py` (`if d.is_dir():` in `goto_url`) means absence is a clean no-op. + +| Pattern | Reason | +|---|---| +| `(agent-workspace/)?domain-skills/**` | User-contributed site recipes. Quality, maintenance, and prompt-injection concerns. Browsercode (cloud-first, performance-focused) curates its own skills server-side; OSS users get the harness without bundled recipes. Both upstream paths covered: post-PR-#229 `agent-workspace/domain-skills/` and the legacy/PR-#247 top-level `domain-skills/`. The exclusion is enforced in three places that all reference this row: `script/check-harness-diff.sh` (`IGNORED_PATHS_REGEX`), `harness-sync.md` step 5 ("Excluded paths" row), and the absence of these directories from the vendored tree. | + +### Modified files + | File | Section | Direction | Reason | |---|---|---|---| | `.gitignore` | venv entry | added `.venv/` | smoke-test workflow creates `.venv/` in the harness dir; we ignore it. Upstream uses CWD-level venv so doesn't need this. | +The vendored harness's `SKILL.md`, `README.md`, and `install.md` reference `agent-workspace/domain-skills/`, but we keep them verbatim from upstream. 
Rationale: + +- `README.md` and `install.md` are not referenced by any browsercode prompt or TS code — the agent never reads them. Their content is dead weight in the extracted cache, not agent-visible. +- `SKILL.md` is referenced by `packages/opencode/src/tool/browser-execute.txt` today, but the long-term plan (see ROADMAP) is to replace that pointer with a browsercode-owned prompt file, making vendored `SKILL.md` inert too. +- Trimming these files would generate per-sync drift forever for zero agent-behavior benefit. Keeping them verbatim costs nothing and keeps future syncs mechanical. + --- ## Drift checker diff --git a/harness-sync.md b/harness-sync.md index cdf0cba52..f9073fe65 100644 --- a/harness-sync.md +++ b/harness-sync.md @@ -28,7 +28,7 @@ git pull origin main Two things to read before touching anything: - **`UPSTREAM.md`** — the latest `To SHA` row under `### browser-use/browser-harness`. That is the last commit we synced to. It is the only source of truth for "what version is vendored." -- **`UPSTREAM.md` §3 Harness divergences** — the table of files where we deliberately differ from upstream, with reasons. Read this *before* the diff so you know which differences are intentional and not "missing features." +- **`UPSTREAM.md` §3 Harness divergences and excluded paths** — the table of files where we deliberately differ from upstream, plus the list of paths excluded from the vendored tree entirely. Read both *before* the diff so you know which differences are intentional and not "missing features," and which paths to skip outright. If the divergences table is empty (initial vendor state), every difference between us and upstream is unintentional drift; flag any in the PR. @@ -65,14 +65,16 @@ This is where the agent earns its keep. For each file changed in ` | File category | Action | |---|---| -| Files not in our divergences table (incl. `src/browser_harness/*.py`, `agent-workspace/domain-skills/`, `interaction-skills/`, `tests/`, `pyproject.toml`, `LICENSE`, etc.) | Take upstream verbatim — `cp temp/browser-harness/ packages/bcode-browser/harness/`. | +| **Excluded paths** (`(agent-workspace/)?domain-skills/...`) | **Skip entirely.** Never copy in, never resurrect. See UPSTREAM.md §3 "Excluded paths". `script/check-harness-diff.sh` filters these out automatically. | +| Files not in our divergences table (incl. `src/browser_harness/*.py`, `interaction-skills/`, `tests/`, `pyproject.toml`, `LICENSE`, etc.) | Take upstream verbatim — `cp temp/browser-harness/ packages/bcode-browser/harness/`. | | Files in our divergences table | Read each upstream hunk. For each, decide: **take** (apply upstream change to our file), **skip** (our divergence wins, ignore upstream change), or **adapt** (rewrite our divergence to coexist with the upstream change). Update the divergences row if its reason or scope shifts. | -| New upstream files | Copy in. | +| New upstream files | Copy in (unless under an excluded path). | | Files we have but upstream removed | Decide: keep ours (record in divergences) or delete. | Path-allowlist policy stays in force during sync resolution as well as normal development: - `agent-workspace/agent_helpers.py` — editable, agent's primary extension surface (post PR #229). - `src/browser_harness/*.py` (`daemon.py`, `admin.py`, `helpers.py`, `run.py`, `_ipc.py`) — protected. Always take upstream verbatim. If upstream regresses, file an issue at `browser-use/browser-harness` and pin to the prior SHA, do not patch locally. 
+- `(agent-workspace/)?domain-skills/` — **excluded.** Treat as if not in the upstream tree. Quality + prompt-injection concerns; user-contributed site recipes do not ship with browsercode. The runtime guard in `helpers.py` (`if d.is_dir():`) means this is a clean no-op. ### 6. Smoke test diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/amazon/product-search.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/amazon/product-search.md deleted file mode 100644 index 3deb07186..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/amazon/product-search.md +++ /dev/null @@ -1,198 +0,0 @@ -# Amazon — Product Search & Data Extraction - -Field-tested against amazon.com on 2025-04-18 using a logged-in Chrome session. -No CAPTCHA or bot detection was triggered during any test run. - -## Navigation - -### Direct search URL (fastest, always use this) -```python -goto_url("https://www.amazon.com/s?k=mechanical+keyboard") -wait_for_load() -wait(2) # dynamic content needs ~2s after readyState=complete -``` - -### Search box typing (use when you need category filtering) -```python -goto_url("https://www.amazon.com") -wait_for_load() -wait(1) -js("document.querySelector('#twotabsearchtextbox').focus()") -js("document.querySelector('#twotabsearchtextbox').click()") -wait(0.3) -type_text("wireless mouse") -wait(0.3) -press_key("Enter") -wait_for_load() -wait(2) -``` - -### Direct product page -```python -# URL pattern: /dp/{ASIN} or /dp/{ASIN}?th=1 (Amazon may redirect to add ?th=1) -goto_url("https://www.amazon.com/dp/B08Z6X4NK3") -wait_for_load() -wait(2) -``` - -## Session Gotcha - -**Always use `new_tab()` when opening Amazon for the first time in a harness session.** -`goto_url()` can silently fail to navigate if the current tab resists the navigation -(observed when the daemon attached to a different real tab). The safe pattern: - -```python -tid = new_tab("https://www.amazon.com/s?k=mechanical+keyboard") -wait_for_load() -wait(2) -``` - -After that, `goto_url()` works fine within the same Amazon session. - -## Search Results Extraction - -### Container selector -`[data-component-type="s-search-result"]` — confirmed working, yields ~22 results per page. - -### Full extraction (field-tested) -```python -results = js(""" - Array.from(document.querySelectorAll('[data-component-type="s-search-result"]')).map(el => ({ - asin: el.getAttribute('data-asin'), - title: el.querySelector('h2 span')?.innerText?.trim(), - price: el.querySelector('.a-price .a-offscreen')?.innerText, - list_price: el.querySelector('.a-text-price .a-offscreen')?.innerText, - rating: el.querySelector('[aria-label*="out of 5 stars"]')?.getAttribute('aria-label')?.split(' ')[0], - reviews: el.querySelector('[aria-label*="ratings"]')?.getAttribute('aria-label'), - is_sponsored: !!el.querySelector('.puis-sponsored-label-text'), - url: el.querySelector('h2 a')?.href - })) -""") -``` - -### Field notes -- **`asin`**: `data-asin` attribute on the container div — always present, matches the `/dp/{ASIN}` URL. -- **`title`**: `h2 span` works consistently. `h2 a.a-link-normal span` also works. -- **`price`**: `.a-price .a-offscreen` returns the formatted string e.g. `"$69.99"`. Use this, not `.a-price-whole`. -- **`list_price`**: `.a-text-price .a-offscreen` — only present when item is on sale (was/now pricing). -- **`rating`**: Use `aria-label` on `[aria-label*="out of 5 stars"]` — gives `"4.5 out of 5 stars, rating details"`, split on space for the number. 
-- **`reviews`**: Use `[aria-label*="ratings"]` attribute — gives `"1,514 ratings"`. Do NOT use `.a-size-base.s-underline-text` — that element exists on sponsored results and shows "Xbox" (a cross-sell widget text). -- **`is_sponsored`**: `.puis-sponsored-label-text` is present on sponsored listings; first 2-3 results are usually sponsored. -- **`url`**: `h2 a` href — contains the full `/dp/{ASIN}/...` URL. - -## Product Detail Page Extraction - -### Confirmed selectors (field-tested on B08Z6X4NK3) -```python -detail = js(""" - ({ - title: document.querySelector('#productTitle')?.innerText?.trim(), - price: (function() { - var whole = document.querySelector('.a-price-whole')?.innerText?.replace(/[\\n.]/g,''); - var frac = document.querySelector('.a-price-fraction')?.innerText; - return (whole && frac) ? '$' + whole + '.' + frac - : document.querySelector('.a-price .a-offscreen')?.innerText || null; - })(), - list_price: document.querySelector('.basisPrice .a-offscreen')?.innerText, - rating: document.querySelector('#acrPopover')?.getAttribute('title'), - review_count: document.querySelector('#acrCustomerReviewText')?.innerText, - availability: document.querySelector('#availability span')?.innerText?.trim(), - brand: document.querySelector('#bylineInfo')?.innerText?.trim(), - asin: document.querySelector('input[name="ASIN"]')?.value, - bullet_points: Array.from(document.querySelectorAll('#feature-bullets li span.a-list-item')) - .map(e => e.innerText?.trim()).filter(t => t) - }) -""") -``` - -### Price field notes -- `#priceblock_ourprice` and `#priceblock_dealprice` are **legacy** — they return `null` on modern product pages. -- Construct price from `.a-price-whole` + `.a-price-fraction` (both stripped of `\n` and `.`). -- As a fallback: first `.a-price .a-offscreen` on the page also works (confirmed `$69.99`). -- `list_price` from `.basisPrice .a-offscreen` shows the crossed-out "was" price when a discount exists. - -## Best Sellers Page - -URL: `https://www.amazon.com/Best-Sellers-{Category}/zgbs/{slug}/` -e.g. `https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/` - -### DOM structure (2025) -`.zg-item-immersion` **does not exist** — Amazon migrated to CSS modules. Use `[data-asin]` anchored on `[id="gridItemRoot"]`: - -```python -goto_url("https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics/") -wait_for_load() -wait(2) - -items = js(""" - Array.from(document.querySelectorAll('[data-asin]')).map(el => { - var container = el.closest('[id="gridItemRoot"]') || el; - return { - asin: el.getAttribute('data-asin'), - rank: container.querySelector('[class*="zg-bdg-text"]')?.innerText, - title: container.querySelector('img[alt]')?.getAttribute('alt'), - price: container.querySelector('.p13n-sc-price, .a-size-base.a-color-price')?.innerText, - url: 'https://www.amazon.com/dp/' + el.getAttribute('data-asin') - } - }).filter(r => r.rank) -""") -``` - -Note: Title comes from the product image `alt` attribute — the text title elements use obfuscated CSS module class names that change between deployments. 
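A small post-processing sketch for the extraction above. It assumes `items` is the list returned by the `js()` call and that rank badges render as `"#1"`-style strings (adjust if the live text differs); rows with an empty `data-asin` are dropped, per the Gotchas below.

```python
def clean_best_sellers(items):
    """Drop non-product rows and turn rank badges into ints (assumed '#N' format)."""
    cleaned = []
    for row in items or []:
        if not row.get("asin"):  # data-asin can be an empty string on non-product rows
            continue
        rank_text = (row.get("rank") or "").lstrip("#")
        row["rank"] = int(rank_text) if rank_text.isdigit() else None
        cleaned.append(row)
    # rows whose badge didn't parse sort last
    return sorted(cleaned, key=lambda r: r["rank"] if r["rank"] is not None else 10**9)

top_items = clean_best_sellers(items)
```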
- -## Pagination - -```python -# Get next page URL directly -next_url = js("document.querySelector('.s-pagination-next')?.href") -if next_url: - goto_url(next_url) - wait_for_load() - wait(2) - -# Or construct by page number -goto_url("https://www.amazon.com/s?k=wireless+mouse&page=2") -``` - -## Result Count - -```python -count_text = js("document.querySelector('[data-component-type=\"s-result-info-bar\"] h1')?.innerText?.trim()") -# Returns e.g.: '1-16 of over 40,000 results for "wireless mouse"\nSort by:\n...' -# Extract just the count: count_text.split('\n')[0] -``` - -## CAPTCHA Detection - -No CAPTCHA was encountered during testing with a logged-in Chrome session. To detect defensively: - -```python -def check_captcha(): - text = js("document.body.innerText.slice(0,500)") or "" - url = page_info()["url"] - return ( - "captcha" in text.lower() - or "enter the characters" in text.lower() - or "sorry, we just need to make sure" in text.lower() - or "captcha" in url.lower() - or "validateCaptcha" in url - ) - -if check_captcha(): - raise RuntimeError("Amazon CAPTCHA hit — stop and notify user") -``` - -Amazon may serve a CAPTCHA on fresh/anonymous sessions. Using the browser's existing logged-in session avoids this in practice. - -## Gotchas - -- **`goto_url()` silent failure**: On first visit, use `new_tab(url)` instead. After the tab is on Amazon, `goto_url()` works. -- **`.zg-item-immersion` is gone**: Best Sellers page uses CSS module classes (obfuscated). Use `[data-asin]` + `img[alt]` for title. -- **`.a-size-base.s-underline-text` is unreliable for review count**: On sponsored results it shows unrelated text (e.g. "Xbox"). Use `[aria-label*="ratings"]` instead. -- **`#priceblock_ourprice` is legacy**: Returns `null` on modern pages. Construct from `.a-price-whole` + `.a-price-fraction`. -- **Sponsored results appear first**: First 2-3 results are almost always `is_sponsored: true`. Filter them out with `!el.querySelector('.puis-sponsored-label-text')` when you need organic results. -- **`data-asin` can be empty string on non-product rows**: Filter with `.filter(r => r.asin)`. -- **Price split DOM**: `.a-price-whole` innerText includes a trailing `\n.` — strip it: `.replace(/[\n.]/g,'')`. -- **ASIN from URL**: Use `/dp/([A-Z0-9]{10})/` regex on the product URL. `data-asin` on search results is always the canonical ASIN. -- **`?th=1` redirect**: Amazon appends `?th=1` (and sometimes `?psc=1`) to product URLs after redirect. This is normal — `input[name="ASIN"]` always has the clean ASIN. -- **Wait 2s after `wait_for_load()`**: Amazon search results load the listing cards asynchronously. `readyState=complete` fires before cards render. A hard 2s wait is required. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/archive-org/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/archive-org/scraping.md deleted file mode 100644 index 692a00aae..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/archive-org/scraping.md +++ /dev/null @@ -1,341 +0,0 @@ -# Internet Archive / Wayback Machine — Scraping & Data Extraction - -`https://archive.org` / `https://web.archive.org` — all public data, no auth required. Every workflow here is pure `http_get` — no browser needed. - -## Do this first - -**Use the CDX API for anything Wayback-related — it is the reliable workhorse. 
The Wayback Availability API (`/wayback/available`) is known to return empty `archived_snapshots` even for well-archived URLs and should not be used as a primary mechanism.** - -```python -import json - -# Find snapshots of any URL — primary entry point for Wayback data -r = http_get( - "https://web.archive.org/cdx/search/cdx" - "?url=iana.org&output=json&limit=5" - "&fl=timestamp,original,statuscode,mimetype,length", - timeout=40.0 -) -rows = json.loads(r) -headers = rows[0] # ['timestamp', 'original', 'statuscode', 'mimetype', 'length'] -for row in rows[1:]: - ts, orig, status, mime, length = row - snap_url = f"https://web.archive.org/web/{ts}/{orig}" - print(f"{ts} {status} {snap_url}") -``` - -For item metadata (books, video, audio, software), go straight to: - -```python -data = json.loads(http_get("https://archive.org/metadata/{identifier}", timeout=30.0)) -``` - -## Common workflows - -### Find the nearest archived snapshot to a target date - -```python -import json - -# CDX sort=closest returns the single snapshot nearest to the given timestamp -r = http_get( - "https://web.archive.org/cdx/search/cdx" - "?url=iana.org&output=json&limit=1" - "&fl=timestamp,original,statuscode" - "&closest=20230601120000&sort=closest", - timeout=60.0 # CDX can be slow — always use timeout >= 40s -) -rows = json.loads(r) -# rows[0] = header, rows[1] = closest snapshot -ts, orig, status = rows[1] -snap_url = f"https://web.archive.org/web/{ts}/{orig}" -# Result: ts='20230601114925', orig='https://www.iana.org/', status='200' -# snap_url: https://web.archive.org/web/20230601114925/https://www.iana.org/ -``` - -Timestamp format is always 14-digit `YYYYMMDDHHMMSS`. Pass any prefix — `20230601` (day), `202306` (month), `2023` (year) — and CDX will match. - -### List all monthly snapshots for a URL (collapsed) - -```python -import json - -r = http_get( - "https://web.archive.org/cdx/search/cdx" - "?url=iana.org&output=json" - "&collapse=timestamp:6" # :6 = dedupe by YYYYMM (one per month) - "&from=20230101&to=20231231" - "&fl=timestamp,original", - timeout=60.0 -) -rows = json.loads(r) -# rows[0] = header ['timestamp', 'original'] -# rows[1:] = one row per month: -# ['20230101103807', 'https://www.iana.org/'] -# ['20230201144829', 'https://www.iana.org/'] -# ...12 rows for 2023 - -for ts, orig in rows[1:]: - print(f"{ts[:4]}-{ts[4:6]} https://web.archive.org/web/{ts}/{orig}") -``` - -`collapse=timestamp:N` deduplicates by the first N digits of the timestamp: -- `:4` = one per year, `:6` = one per month, `:8` = one per day - -### List snapshots for an entire domain (all pages) - -```python -import json - -# matchType=domain captures all URLs under that domain -r = http_get( - "https://web.archive.org/cdx/search/cdx" - "?url=iana.org&matchType=domain&output=json" - "&limit=10&fl=timestamp,original,statuscode" - "&collapse=timestamp:8", # one capture per URL per day - timeout=60.0 -) -rows = json.loads(r) -for row in rows[1:]: - print(row) -# ['19971210061738', 'http://www.iana.org:80/', '200'] -# ['19980211065537', 'http://www.iana.org:80/', '200'] -# ... -``` - -`matchType` options: `exact` (default), `prefix` (URL + subpaths), `host` (all subdomains), `domain` (host + all subdomains). 
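### Filter snapshots by status code

CDX also supports server-side `filter=` expressions (`field:regex`), which the examples above don't use. The parameter name and syntax here follow the public CDX server docs rather than a fresh field test; treat this as a sketch and confirm against a live response.

```python
import json

# Keep only successful captures; filter= takes field:regex pairs
# (e.g. statuscode:200, mimetype:text/html) (syntax assumed, not re-verified here)
r = http_get(
    "https://web.archive.org/cdx/search/cdx"
    "?url=iana.org&output=json&limit=10"
    "&fl=timestamp,original,statuscode"
    "&filter=statuscode:200",
    timeout=60.0
)
rows = json.loads(r)
for ts, orig, status in rows[1:]:
    print(ts, status, f"https://web.archive.org/web/{ts}/{orig}")
```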
- -### Filter snapshots by prefix path - -```python -import json - -# All archived pages under /domains/ path -r = http_get( - "https://web.archive.org/cdx/search/cdx" - "?url=iana.org/domains/&matchType=prefix&output=json" - "&limit=5&fl=timestamp,original,statuscode", - timeout=40.0 -) -rows = json.loads(r) -for row in rows[1:]: - print(row) -# ['20080509121811', 'http://www.iana.org/domains/', '200'] -# ['20080704174537', 'http://iana.org/domains/', '200'] -``` - -### Paginate CDX results with resumeKey - -```python -import json -from urllib.parse import quote - -def cdx_all_snapshots(url, fl="timestamp,original,statuscode", page_size=500): - """Iterate all CDX records for a URL, yielding rows (excluding header).""" - base = ( - f"https://web.archive.org/cdx/search/cdx" - f"?url={quote(url, safe='')}&output=json" - f"&fl={fl}&limit={page_size}&showResumeKey=true" - ) - resume_key = None - while True: - endpoint = base if resume_key is None else f"{base}&resumeKey={quote(resume_key)}" - rows = json.loads(http_get(endpoint, timeout=60.0)) - # rows structure with showResumeKey=true: - # [header, row1, row2, ..., [], [resume_key_string]] - # The second-to-last row is [] (separator), last row is [resume_key] - has_resume = len(rows) >= 2 and rows[-1] != [] and rows[-2] == [] - data_rows = rows[1:-2] if has_resume else rows[1:] - for row in data_rows: - yield row - if not has_resume: - break - resume_key = rows[-1][0] - -for row in cdx_all_snapshots("iana.org", fl="timestamp,original"): - ts, orig = row - # process... -``` - -### Retrieve the actual archived page - -```python -# Direct snapshot URL: /web/{14-digit-timestamp}/{original-url} -snap_url = "https://web.archive.org/web/19971210061738/http://www.iana.org:80/" -content = http_get(snap_url, timeout=30.0) -# Returns the archived HTML with Wayback toolbar injected at top -# The toolbar is inside comments - -# The calendar view URL pattern (for browser navigation, not http_get): -# https://web.archive.org/web/20230101000000*/python.org -# The * tells Wayback to show the calendar — returns HTML, not raw page -``` - -### Item metadata (books, video, audio, software, collections) - -```python -import json -from urllib.parse import quote - -identifier = "HardWonWisdomTrailer" -data = json.loads(http_get(f"https://archive.org/metadata/{identifier}", timeout=30.0)) - -# Top-level keys: -# alternate_locations, created, d1, d2, dir, files, files_count, -# is_collection, item_last_updated, item_size, metadata, server, uniq, workable_servers - -meta = data['metadata'] -# Common metadata fields (not all present on every item): -print(meta.get('identifier')) # 'HardWonWisdomTrailer' -print(meta.get('title')) # 'Hard Won Wisdom Trailer' -print(meta.get('mediatype')) # 'movies' | 'texts' | 'audio' | 'software' | 'collection' -print(meta.get('creator')) # 'jakemauz' -print(meta.get('date')) # '2017-02-18' -print(meta.get('description')) # HTML string — strip tags if needed -print(meta.get('subject')) # str OR list of str depending on item -print(meta.get('publicdate')) # '2017-02-18 11:51:16' -print(meta.get('collection')) # parent collection identifier - -files = data['files'] -# Each file entry: -# name, source ('original'|'derivative'|'metadata'), format, size (bytes as str), -# md5, sha1, crc32, mtime -# For video/audio: length (seconds as str), height, width -# For derivative: original (name of source file) - -# Find the primary original file -orig_files = [f for f in files if f.get('source') == 'original'] -# orig_files[0]: {'name': 'Hard-won 
wisdom trailer.mp4', 'source': 'original', -# 'format': 'MPEG4', 'size': '7532153', 'length': '94.13', -# 'height': '360', 'width': '640', 'md5': 'aaeebe0481...', ...} - -# Build download URL — two equivalent forms: -server = data['server'] # 'ia601405.us.archive.org' -dir_path = data['dir'] # '/2/items/HardWonWisdomTrailer' -fname = orig_files[0]['name'] -from urllib.parse import quote as urlquote -# Form 1: direct storage server (fastest) -url1 = f"https://{server}{dir_path}/{urlquote(fname)}" -# Form 2: standard redirect URL (always works, resolved by CDN) -url2 = f"https://archive.org/download/{identifier}/{urlquote(fname)}" -# Both confirmed status 200, Content-Type: video/mp4 -``` - -### Search items (books, audio, video, software) - -```python -import json - -# advancedsearch.php is the correct API — /search returns HTML -r = http_get( - "https://archive.org/advancedsearch.php" - "?q=artificial+intelligence+AND+mediatype:texts" - "&fl[]=identifier&fl[]=title&fl[]=creator&fl[]=date&fl[]=downloads" - "&rows=5&sort[]=downloads+desc&output=json", - timeout=30.0 -) -data = json.loads(r) -# data['responseHeader']['status'] = 0 (success) -# data['responseHeader']['QTime'] = query time ms -# data['response']['numFound'] = 25911 (total matches) -# data['response']['start'] = 0 (offset) -# data['response']['docs'] = list of item dicts - -resp = data['response'] -print(f"Total: {resp['numFound']}, showing: {len(resp['docs'])}") -for doc in resp['docs']: - print(f" {doc['identifier']} {doc.get('title', '')[:50]}") - # doc fields are only present if they have values — always use .get() -``` - -Pagination: use `start=` offset (not `page=`). Max `rows=` is not documented but 100 works reliably. - -### Search with all supported parameters - -```python -import json - -r = http_get( - "https://archive.org/advancedsearch.php" - "?q=machine+learning+AND+mediatype:texts" # Lucene query syntax - "&fl[]=identifier&fl[]=title&fl[]=date&fl[]=year" - "&fl[]=creator&fl[]=subject&fl[]=description&fl[]=downloads" - "&rows=3" - "&start=0" # pagination offset - "&sort[]=date+desc" # sort field + direction - "&output=json", - timeout=30.0 -) -data = json.loads(r) -# Confirmed fields in fl[]: -# identifier, title, date, year, creator, subject, description, -# downloads, mediatype, collection, language, avg_rating, num_reviews - -# mediatype values: texts, audio, movies, software, image, etree, data, collection, account -# Sort fields: date, downloads, avg_rating, num_reviews, publicdate, addeddate -``` - -## API reference - -| Endpoint | What it returns | Auth | -|---|---|---| -| `web.archive.org/cdx/search/cdx?url=...&output=json` | Snapshot index: all captures of a URL | None | -| `archive.org/wayback/available?url=...` | Nearest snapshot (DEGRADED — see gotchas) | None | -| `archive.org/metadata/{identifier}` | Item metadata + files list | None | -| `archive.org/advancedsearch.php?q=...&output=json` | Full-text + metadata search | None | -| `archive.org/download/{identifier}/{filename}` | Direct file download | None | -| `web.archive.org/web/{timestamp}/{url}` | Archived page HTML | None | - -## CDX field reference - -The CDX API returns a JSON array of arrays. The first row is always the header when `output=json`. 
- -| Field | Description | Example | -|---|---|---| -| `urlkey` | SURT-format URL (reversed domain, path in parens) | `org,iana)/` | -| `timestamp` | Capture time, 14-digit `YYYYMMDDHHMMSS` | `19971210061738` | -| `original` | Original crawled URL (exact, including port) | `http://www.iana.org:80/` | -| `mimetype` | Content-Type of the archived response | `text/html` | -| `statuscode` | HTTP status at crawl time | `200` | -| `digest` | SHA-1 of response body, base32-encoded | `I4YBMQ6PHPWE2TD6TIXNWHZB6MXRNTSR` | -| `length` | Content length in bytes (as string) | `1418` | - -Default `fl=` when omitted: `urlkey,timestamp,original,mimetype,statuscode,digest,length` (all 7 fields in that order). - -## Rate limits - -No auth, no API key. In practice: -- CDX API: **intermittently slow** — individual queries time out at 20s and succeed at 40–60s. Always use `timeout=40.0` minimum. 3 rapid sequential CDX calls in ~10s completed; 10 rapid calls produced 3 timeouts. -- Metadata API: Fast and reliable — 5 sequential calls completed in 3.0s with no errors. -- Search API: Fast — typically responds in 30–65ms (`QTime` in response header). -- No documented per-second or per-day limits. Archive.org's policy is to be respectful: add `time.sleep(1)` between CDX calls in loops. - -## Gotchas - -- **CDX times out — always set `timeout=40.0` or higher.** The default 20s is often too short for CDX. Metadata and search APIs are fine at 20–30s. CDX slowness is backend-side and unpredictable; add retry logic for production use. - -- **Wayback Availability API is unreliable.** `GET /wayback/available?url=iana.org` returns `{"url": "iana.org", "archived_snapshots": {}}` even for URLs confirmed archived via CDX. Tested 2026-04-18 across many URLs and timestamp combinations — consistently empty. Use `CDX ?sort=closest&limit=1` instead (confirmed working). - -- **CDX first row is always the header when `output=json`.** `rows[0]` is `['timestamp', 'original', ...]`, not a data row. Always slice `rows[1:]` for data. When `showResumeKey=true`, the last two rows are `[]` (separator) and `['']`. - -- **CDX `fl=` must match exactly what you iterate.** If you request `&fl=timestamp,original` you get 2-element rows; forgetting a field breaks destructuring. When in doubt, omit `fl=` entirely and get all 7 fields. - -- **`output=json` is required — there is no default JSON mode.** Omitting `output=json` returns space-separated text. `output=text` also works and is slightly faster for simple queries. - -- **`timestamp` is a string, not an integer.** Even in JSON, CDX returns all fields as strings: `'1418'` not `1418`, `'200'` not `200`. Cast explicitly: `int(row[4])`, `int(row[6])`. - -- **The `original` field preserves port numbers.** Old crawls captured `http://www.iana.org:80/` — the `:80` is part of the URL. When building a playback URL, use `original` verbatim: `f"https://web.archive.org/web/{ts}/{orig}"` works correctly with the port included. - -- **Metadata `{}` means the item doesn't exist or is private.** `http_get("https://archive.org/metadata/nonexistent")` returns `'{}'` (2-byte response) with HTTP 200. Always check `if not data` or `if not data.get('metadata')` before accessing fields. - -- **Metadata `subject` can be a string or a list.** When a single subject tag is set, the API returns `"subject": "short film"`. When multiple, it returns `"subject": ["short film", "spoken word"]`. Normalize with: `subjects = [meta['subject']] if isinstance(meta.get('subject'), str) else meta.get('subject', [])`. 
- -- **File `size` and `length` are strings, not numbers.** `files[0]['size']` is `'7532153'` (bytes). `files[0]['length']` is `'94.13'` (seconds for video/audio). Cast with `int()` and `float()` respectively. - -- **Use `archive.org/download/` not the raw storage server URL for reliability.** The raw URL (`ia601405.us.archive.org/2/items/...`) is faster but server-specific. `archive.org/download/{id}/{file}` redirects to the correct storage node and remains stable as items migrate. - -- **`/search?output=json` returns HTML, not JSON.** The `/search` endpoint is a React SPA — it ignores `output=json`. Always use `advancedsearch.php` for programmatic access. - -- **`collapse=timestamp:6` gives one row per month, but it keeps the FIRST capture of that month.** If you want the last, you'd need to reverse and re-collapse, or fetch all and filter client-side. The `collapse` parameter de-duplicates by truncating the timestamp to N digits and keeping the first matching row. - -- **CDX `from=` / `to=` accept partial timestamps.** `from=20230101` means `20230101000000`. `to=20231231` means `20231231000000` (exclusive). To include all of 2023, use `to=20240101`. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/arxiv-bulk/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/arxiv-bulk/scraping.md deleted file mode 100644 index d10adc117..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/arxiv-bulk/scraping.md +++ /dev/null @@ -1,333 +0,0 @@ -# arXiv Bulk Harvest + Semantic Scholar — OAI-PMH & Citation Enrichment - -Companion to `domain-skills/arxiv/scraping.md`. Use the **arxiv** skill for search-and-fetch workflows. Use **this skill** when you need: - -- Bulk-harvesting all papers in a subject area or date window (OAI-PMH) -- Citation counts, influential-citation scores, and cross-database IDs (Semantic Scholar) -- Per-paper version history and submitter info (`arXivRaw` metadata) - -No API key required for either endpoint. Both return JSON or XML over plain HTTP. - ---- - -## OAI-PMH bulk harvest - -### Endpoint (confirmed 2026-04-19) - -``` -https://oaipmh.arxiv.org/oai -``` - -`https://export.arxiv.org/oai2` is the old URL — it 301-redirects to the new one. Use the new URL directly to avoid the extra round-trip. 
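Before kicking off a long harvest, a quick `Identify` call (see the verbs table below) confirms the endpoint is reachable and reports the repository's earliest datestamp. A minimal sketch using the same `http_get` helper:

```python
import xml.etree.ElementTree as ET
from helpers import http_get

OAI_NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

# Sanity-check the endpoint before a long ListRecords run.
xml = http_get("https://oaipmh.arxiv.org/oai?verb=Identify")
root = ET.fromstring(xml)
print(root.findtext('.//oai:repositoryName', namespaces=OAI_NS))
print(root.findtext('.//oai:earliestDatestamp', namespaces=OAI_NS))  # 2005-09-16 per the verbs table
```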
- -### Harvest all cs papers from a date window - -```python -import xml.etree.ElementTree as ET -from helpers import http_get - -OAI_NS = { - 'oai': 'http://www.openarchives.org/OAI/2.0/', - 'arXiv': 'http://arxiv.org/OAI/arXiv/', -} - -def fetch_oai_page(url): - """Fetch one OAI-PMH page; return (records_xml_list, next_token_or_None).""" - xml = http_get(url) - root = ET.fromstring(xml) - records = root.findall('.//oai:record', OAI_NS) - token_el = root.find('.//oai:resumptionToken', OAI_NS) - token = token_el.text if token_el is not None and token_el.text else None - return records, token - -def parse_arxiv_record(rec): - """Extract fields from one element (metadataPrefix=arXiv).""" - header = rec.find('oai:header', OAI_NS) - meta = rec.find('.//arXiv:arXiv', OAI_NS) - if meta is None: - return None # deleted record (header has status="deleted") - authors_el = meta.findall('arXiv:authors/arXiv:author', OAI_NS) - authors = [] - for a in authors_el: - fn = (a.findtext('arXiv:forenames', namespaces=OAI_NS) or '').strip() - ln = (a.findtext('arXiv:keyname', namespaces=OAI_NS) or '').strip() - authors.append(f"{fn} {ln}".strip()) - return { - 'id': meta.findtext('arXiv:id', namespaces=OAI_NS), - 'datestamp': header.findtext('oai:datestamp', namespaces=OAI_NS), - 'created': meta.findtext('arXiv:created', namespaces=OAI_NS), - 'updated': meta.findtext('arXiv:updated', namespaces=OAI_NS), - 'title': (meta.findtext('arXiv:title', namespaces=OAI_NS) or '').strip(), - 'authors': authors, - 'categories': (meta.findtext('arXiv:categories', namespaces=OAI_NS) or '').split(), - 'abstract': (meta.findtext('arXiv:abstract', namespaces=OAI_NS) or '').strip(), - 'doi': meta.findtext('arXiv:doi', namespaces=OAI_NS), - 'journal_ref': meta.findtext('arXiv:journal-ref', namespaces=OAI_NS), - 'license': meta.findtext('arXiv:license', namespaces=OAI_NS), - } - -# --- Main harvest loop --- -import time - -BASE = 'https://oaipmh.arxiv.org/oai' -first_url = ( - f"{BASE}?verb=ListRecords" - f"&metadataPrefix=arXiv" - f"&set=cs" - f"&from=2024-01-01" - f"&until=2024-01-02" -) - -papers = [] -url = first_url -while url: - records, token = fetch_oai_page(url) - for rec in records: - p = parse_arxiv_record(rec) - if p: - papers.append(p) - print(f" fetched {len(records)} records, total so far: {len(papers)}") - if token: - url = f"{BASE}?verb=ListRecords&resumptionToken={token}" - time.sleep(5) # OAI-PMH policy: >=5s between pages - else: - url = None - -print(f"Done. {len(papers)} papers harvested.") -# Confirmed output for cs, 2024-01-01 to 2024-01-02: -# fetched 44 records, total so far: 44 -# Done. 44 papers harvested. 
-# For 2024-01-01 to 2024-01-07 (cs): multiple pages, resumptionToken issued when >~200 records -``` - -### Available verbs - -| Verb | Purpose | Key params | -|---|---|---| -| `Identify` | Repository info, earliest datestamp (`2005-09-16`) | — | -| `ListSets` | All harvestable sets (see table below) | — | -| `ListMetadataFormats` | `oai_dc`, `arXiv`, `arXivOld`, `arXivRaw` | — | -| `ListRecords` | Bulk harvest with date/set filter | `metadataPrefix`, `set`, `from`, `until` | -| `GetRecord` | Single record by OAI identifier | `identifier`, `metadataPrefix` | - -### Top-level sets (confirmed) - -| setSpec | Name | -|---|---| -| `cs` | Computer Science (all) | -| `cs:cs` | Computer Science (subset notation — same scope) | -| `math` | Mathematics | -| `physics` | Physics | -| `stat` | Statistics | -| `eess` | Electrical Engineering and Systems Science | -| `econ` | Economics | -| `q-bio` | Quantitative Biology | -| `q-fin` | Quantitative Finance | - -Subset sets use `topic:topic:SUBCATEGORY` notation, e.g. `cs:cs:LG` for Machine Learning. List all with `verb=ListSets`. - -### Available metadata formats - -- `arXiv` — rich: id, created/updated dates, authors (keyname + forenames separately), categories, abstract, doi, journal-ref, license. **Use this.** -- `arXivRaw` — adds ``, per-version history (`` with date and file size), author list as flat string. Use when you need version history. -- `oai_dc` — Dublin Core, minimal. Skip unless you need cross-system compatibility. -- `arXivOld` — legacy format pre-2007. Skip. - -### GetRecord + arXivRaw (version history) - -```python -import xml.etree.ElementTree as ET -from helpers import http_get - -RAW_NS = { - 'oai': 'http://www.openarchives.org/OAI/2.0/', - 'raw': 'http://arxiv.org/OAI/arXivRaw/', -} - -xml = http_get( - "https://oaipmh.arxiv.org/oai" - "?verb=GetRecord" - "&metadataPrefix=arXivRaw" - "&identifier=oai:arXiv.org:1706.03762" -) -root = ET.fromstring(xml) -meta = root.find('.//raw:arXivRaw', RAW_NS) - -title = meta.findtext('raw:title', namespaces=RAW_NS) -submitter = meta.findtext('raw:submitter', namespaces=RAW_NS) -versions = meta.findall('raw:version', RAW_NS) -for v in versions: - print(v.get('version'), v.findtext('raw:date', namespaces=RAW_NS)) -# Confirmed output for 1706.03762 ("Attention Is All You Need"): -# v1 Mon, 12 Jun 2017 17:57:34 GMT -# v2 Mon, 19 Jun 2017 16:49:45 GMT -# ... -# v7 Wed, 02 Aug 2023 00:41:18 GMT -# submitter: Llion Jones -``` - ---- - -## Semantic Scholar — citation enrichment for arXiv papers - -No API key required (unauthenticated: 1 req/s, 5000 req/day). With a free key the limit rises to 100 req/s. - -Base URL: `https://api.semanticscholar.org/graph/v1/` - -### Single paper lookup by arXiv ID - -```python -import json -from helpers import http_get - -paper = json.loads(http_get( - "https://api.semanticscholar.org/graph/v1/paper/arXiv:1706.03762" - "?fields=title,year,venue,publicationDate,citationCount," - "influentialCitationCount,authors,abstract,externalIds" -)) -print(paper['title']) # "Attention is All you Need" -print(paper['citationCount']) # 173155 (confirmed 2026-04-19) -print(paper['influentialCitationCount']) # 19629 -print(paper['venue']) # "Neural Information Processing Systems" -print(paper['externalIds']['ArXiv']) # "1706.03762" -print(paper['externalIds']['DOI']) # missing if no DOI -for a in paper['authors']: - print(a['name'], a['authorId']) -``` - -The ID format `arXiv:NNNN.NNNNN` is accepted directly — no conversion needed. 
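To bolt citation counts onto the OAI-PMH harvest above, a minimal single-lookup sketch. It assumes `papers` is the list built by the harvest loop, sleeps 1 s per call to stay under the unauthenticated limit, and omits error handling (papers missing from Semantic Scholar return a 404). For more than a few dozen IDs, use the batch endpoint in the next section instead.

```python
import json
import time
from helpers import http_get

for p in papers[:10]:
    # Old-style IDs containing '/' (e.g. cs/0501001) may need URL-encoding.
    data = json.loads(http_get(
        f"https://api.semanticscholar.org/graph/v1/paper/arXiv:{p['id']}"
        "?fields=citationCount,influentialCitationCount"
    ))
    p['citationCount'] = data.get('citationCount')
    p['influentialCitationCount'] = data.get('influentialCitationCount')
    time.sleep(1)  # unauthenticated limit is ~1 req/s (see Gotchas)
```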
- -### Batch lookup (up to 500 IDs per POST) - -```python -import json -from helpers import http_get -import urllib.request - -ids = ["arXiv:1706.03762", "arXiv:1810.04805", "arXiv:2005.14165"] -fields = "paperId,externalIds,title,year,citationCount,influentialCitationCount" - -body = json.dumps({"ids": ids}).encode() -req = urllib.request.Request( - f"https://api.semanticscholar.org/graph/v1/paper/batch?fields={fields}", - data=body, - headers={"Content-Type": "application/json"}, - method="POST", -) -with urllib.request.urlopen(req, timeout=20) as r: - results = json.loads(r.read()) - -for p in results: - print(p['externalIds'].get('ArXiv'), p['citationCount'], p['title'][:50]) -# Confirmed output (2026-04-19): -# 1706.03762 173155 Attention is All you Need -# 1810.04805 113138 BERT: Pre-training of Deep Bidirectional Tran... -# 2005.14165 (varies) Language Models are Few-Shot Learners -``` - -Note: `helpers.http_get` only does GET. For POST use `urllib.request.Request` directly as above. - -### Paper search - -```python -import json -from helpers import http_get - -results = json.loads(http_get( - "https://api.semanticscholar.org/graph/v1/paper/search" - "?query=large+language+model" - "&fields=paperId,externalIds,title,year,citationCount" - "&limit=5" -)) -total = results['total'] # e.g. 3473582 for "large language model" -for p in results['data']: - arxiv_id = p['externalIds'].get('ArXiv', 'no-arxiv') - print(arxiv_id, p['year'], p['citationCount'], p['title'][:50]) -# next page: use offset=5, offset=10, etc. -``` - -### Available fields (pass as comma-separated `fields=` query param) - -| Field | Type | Notes | -|---|---|---| -| `paperId` | str | Semantic Scholar internal ID | -| `externalIds` | dict | Keys: `ArXiv`, `DOI`, `DBLP`, `MAG`, `ACL`, `CorpusId` | -| `title` | str | | -| `abstract` | str | | -| `year` | int | Publication year | -| `publicationDate` | str | `YYYY-MM-DD` | -| `venue` | str | Conference/journal name | -| `citationCount` | int | Total citations | -| `influentialCitationCount` | int | Citations deemed highly influential | -| `authors` | list | Each: `{authorId, name}` | -| `references` | list | List of paper objects (needs own `fields`) | -| `citations` | list | Citing papers (needs own `fields`) | -| `openAccessPdf` | dict | `{url, status, license}` | - ---- - -## Downloading PDFs - -Direct PDF download — no auth, no redirect for versionless URLs (returns 200 + PDF body directly). - -```python -import urllib.request - -def download_pdf(arxiv_id, dest_path, version=None): - """ - arxiv_id: bare ID like '1706.03762' or versioned '1706.03762v7' - version: if given, appended as 'v{version}' — ignored if arxiv_id already has version - dest_path: where to save, e.g. '/tmp/paper.pdf' - """ - if 'v' not in arxiv_id.split('.')[-1] and version: - arxiv_id = f"{arxiv_id}v{version}" - url = f"https://arxiv.org/pdf/{arxiv_id}" - req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"}) - with urllib.request.urlopen(req, timeout=60) as r: - with open(dest_path, 'wb') as f: - f.write(r.read()) - print(f"Saved {r.headers.get('content-length', '?')} bytes to {dest_path}") - -download_pdf('1706.03762', '/tmp/attention.pdf') -# Confirmed: saves 2215244 bytes, filename hint in header: '1706.03762v7.pdf' -# Versionless URL resolves to latest version server-side (no redirect, 200 direct) -``` - ---- - -## Gotchas - -- **OAI-PMH endpoint moved.** `https://export.arxiv.org/oai2` 301-redirects to `https://oaipmh.arxiv.org/oai`. Use the new URL. 
`helpers.http_get` (which uses `urllib`) does NOT follow redirects — you'll get an empty string or error. Either use `urllib.request.urlopen` with `follow_redirects` logic, or just use the canonical URL directly. - -- **OAI-PMH rate limit: 5 seconds between pages.** The protocol requires a `Retry-After` interval. The server embeds an `expirationDate` on the resumptionToken. Violating the rate limit causes the token to be invalidated and the harvest fails silently. Always `time.sleep(5)` between pages. - -- **Resumption token is opaque but URL-encoded.** The token looks like `verb%3DListRecords%26...%26skip%3D247`. Pass it verbatim as `&resumptionToken=` — do not URL-encode it again. - -- **`datestamp` in OAI-PMH is last-modified date, not submission date.** A paper submitted in 2008 can appear in a 2024 harvest window if it was revised then. The `` and `` fields inside `` metadata are the actual submission/revision dates. - -- **Deleted records have no `` element.** The `
` will carry `status="deleted"`. Always check `meta is None` after `find('.//arXiv:arXiv', ...)`. - -- **Author structure differs between OAI-PMH formats.** In `arXiv` metadata, authors are structured: `VaswaniAshish`. In `arXivRaw`, they're a flat comma-separated string: `Ashish Vaswani, Noam Shazeer, ...`. In the Atom API, it's `Ashish Vaswani` (first-last order). Pick the source that matches your downstream use. - -- **Semantic Scholar 429 under unauthenticated bursts.** The unauthenticated limit is ~1 req/s. Rapid parallel calls return `{"code": "429"}`. Add `time.sleep(1)` between single lookups or use the batch POST endpoint (up to 500 IDs, single request) to stay under the limit. The batch endpoint itself counts as 1 request. - -- **Semantic Scholar `externalIds` may lack `ArXiv` key.** Not all papers have an arXiv preprint. When enriching an arXiv list with S2 data, always use `.get('ArXiv')` not `['ArXiv']`. - -- **Atom API rate limit: 1 request per 3 seconds for sustained crawls.** The API returns HTTP 429 `"Rate exceeded."` on rapid-fire requests. The OAI-PMH endpoint is designed for bulk and is more tolerant, but still requires the 5s sleep between resumption pages. - -- **OAI-PMH `set` param uses colon-separated hierarchy, not dot.** The Atom API uses `cat:cs.LG`; OAI-PMH uses `set=cs:cs:LG`. Using `set=cs.LG` returns zero results. - -- **`http_get` in helpers.py does NOT follow HTTP redirects.** If you must use it with the old OAI URL, you'll get an empty body. Either update the URL to the canonical one or use `urllib.request.urlopen` with a redirect handler. - ---- - -## How this complements the existing arxiv skill - -| Task | Use | -|---|---| -| Search by keyword, author, or category | `arxiv` skill — Atom API | -| Fetch 1–2000 specific papers by ID | `arxiv` skill — `id_list` batch | -| Harvest all papers in a subject over a date range | **this skill** — OAI-PMH | -| Get citation counts / influential citations | **this skill** — Semantic Scholar | -| Get per-version history and submitter name | **this skill** — OAI-PMH `arXivRaw` | -| Download a PDF | either skill (same URL structure) | diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/arxiv/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/arxiv/scraping.md deleted file mode 100644 index c731cd824..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/arxiv/scraping.md +++ /dev/null @@ -1,311 +0,0 @@ -# ArXiv — Scraping & Data Extraction - -`https://arxiv.org` — open-access preprint server. **Never use the browser for ArXiv.** All data is reachable via `http_get` using the Atom API or HTML meta tags. No API key required. - -## Do this first - -**Use the Atom API for any paper search or metadata fetch — one call, XML response, no auth.** - -```python -import xml.etree.ElementTree as ET -from helpers import http_get - -NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'} - -xml = http_get("http://export.arxiv.org/api/query?search_query=ti:transformer+AND+cat:cs.LG&max_results=5&sortBy=submittedDate&sortOrder=descending") -root = ET.fromstring(xml) -entries = root.findall('atom:entry', NS) -``` - -Use `id_list` for known paper IDs — supports comma-separated batch fetch in a single call. - -Use `http_get` on `https://arxiv.org/abs/{id}` + regex for `citation_*` meta tags when you need the full abstract from an HTML page. 
- -## Common workflows - -### Search papers (API) - -```python -import xml.etree.ElementTree as ET -from helpers import http_get - -NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'} - -xml = http_get( - "http://export.arxiv.org/api/query" - "?search_query=ti:transformer+AND+cat:cs.LG" - "&max_results=5&sortBy=submittedDate&sortOrder=descending" -) -root = ET.fromstring(xml) -entries = root.findall('atom:entry', NS) -for e in entries: - title = e.find('atom:title', NS).text.strip().replace('\n', ' ') - arxiv_id = e.find('atom:id', NS).text.split('/')[-1] # e.g. '2604.15259v1' - published = e.find('atom:published', NS).text[:10] # '2026-04-16' - updated = e.find('atom:updated', NS).text[:10] - abstract = e.find('atom:summary', NS).text.strip() - authors = [a.find('atom:name', NS).text for a in e.findall('atom:author', NS)] - cats = [c.get('term') for c in e.findall('atom:category', NS)] - primary = e.find('arxiv:primary_category', NS).get('term') - comment = e.find('arxiv:comment', NS) - pdf_link = next((l.get('href') for l in e.findall('atom:link', NS) if l.get('title') == 'pdf'), None) - abs_link = next((l.get('href') for l in e.findall('atom:link', NS) if l.get('rel') == 'alternate'), None) - print(arxiv_id, published, title[:60]) - print(" Authors:", authors[:2]) - print(" PDF:", pdf_link) -# Confirmed output (2026-04-18): -# 2604.15259v1 2026-04-16 Stability and Generalization in Looped Transformers -# Authors: ['Asher Labovich'] -# PDF: https://arxiv.org/pdf/2604.15259v1 -``` - -### Fetch single paper by ID (API) - -```python -import xml.etree.ElementTree as ET -from helpers import http_get - -NS = {'atom': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'} - -xml = http_get("http://export.arxiv.org/api/query?id_list=1706.03762") -root = ET.fromstring(xml) -e = root.find('atom:entry', NS) -title = e.find('atom:title', NS).text.strip() -abstract = e.find('atom:summary', NS).text.strip() -categories = [c.get('term') for c in e.findall('atom:category', NS)] -pdf_link = next((l.get('href') for l in e.findall('atom:link', NS) if l.get('title') == 'pdf'), None) -print("Title:", title) -print("Categories:", categories) -print("PDF:", pdf_link) -print("Abstract:", abstract[:200]) -# Confirmed output: -# Title: Attention Is All You Need -# Categories: ['cs.CL', 'cs.LG'] -# PDF: https://arxiv.org/pdf/1706.03762v7 -# Abstract: The dominant sequence transduction models are based on complex recurrent... -``` - -### Batch fetch by comma-separated IDs (single call — fast) - -Fetching 10 IDs in one call takes ~2s. Prefer this over parallel single-ID fetches. - -```python -import xml.etree.ElementTree as ET -from helpers import http_get - -NS = {'atom': 'http://www.w3.org/2005/Atom'} - -ids = ['1706.03762', '1810.04805', '2005.14165'] # Transformer, BERT, GPT-3 -xml = http_get(f"http://export.arxiv.org/api/query?id_list={','.join(ids)}&max_results={len(ids)}") -root = ET.fromstring(xml) -for e in root.findall('atom:entry', NS): - arxiv_id = e.find('atom:id', NS).text.split('/')[-1] - title = e.find('atom:title', NS).text.strip() - published = e.find('atom:published', NS).text[:10] - print(arxiv_id, published, title[:60]) -# Confirmed output: -# 1512.03385v1 2015-12-10 Deep Residual Learning for Image Recognition -# 1706.03762v7 2017-06-12 Attention Is All You Need -# 2005.14165v4 2020-05-28 Language Models are Few-Shot Learners -# 1810.04805v2 2018-10-11 BERT: Pre-training of Deep Bidirectional Transformers... 
-# Note: order returned may differ from order requested -``` - -### Parallel fetch (ThreadPoolExecutor for independent IDs) - -Use only when IDs are not known upfront or when mixing with other work. For pure batch, single comma-separated `id_list` call is faster. - -```python -import xml.etree.ElementTree as ET -from concurrent.futures import ThreadPoolExecutor -from helpers import http_get - -NS = {'atom': 'http://www.w3.org/2005/Atom'} - -def fetch_paper(arxiv_id): - xml = http_get(f"http://export.arxiv.org/api/query?id_list={arxiv_id}") - root = ET.fromstring(xml) - e = root.find('atom:entry', NS) - if e is None: - return None - return { - 'id': arxiv_id, - 'title': e.find('atom:title', NS).text.strip(), - 'published': e.find('atom:published', NS).text[:10], - } - -ids = ['1706.03762', '1810.04805', '2005.14165'] -with ThreadPoolExecutor(max_workers=3) as ex: - papers = list(ex.map(fetch_paper, ids)) -for p in papers: - print(p['id'], p['published'], p['title'][:60]) -# Confirmed working — max_workers=3 is safe; don't exceed 5 for continuous crawling -``` - -### HTML abstract page — citation_* meta tags - -Use this when you want the full abstract or the versionless PDF URL without parsing Atom XML. - -```python -import re -from helpers import http_get - -html = http_get("https://arxiv.org/abs/1706.03762", headers={"User-Agent": "Mozilla/5.0"}) -# HTML page is ~48 KB, fully static, no JS required - -title = re.search(r' 0` before accessing `entries[0]`. - -- **`arxiv:comment` and `arxiv:journal_ref` / `arxiv:doi` may be absent.** Not all papers have these fields. Use `e.find('arxiv:comment', NS)` and check `if el is not None and el.text`. - -- **Rate limit: 3 seconds between requests recommended for bulk crawling.** In practice, rapid bursts of 10 individual requests complete in ~6s (avg 0.63s/req) without being blocked. For sustained crawls over hundreds of papers, insert `time.sleep(3)` between requests. The API does not return rate limit headers — it just starts slowing responses or returns HTTP 503 silently. - -- **`citation_author` tags are in `"Last, First"` format**, not `"First Last"` like the Atom API. The Atom `atom:author/atom:name` field gives `"First Last"` order. Pick the format that matches your downstream use. - -- **The `arxiv:affiliation` sub-element of `atom:author` is rarely populated.** Most institutional affiliations are absent from the API response even when listed on the paper. The HTML abs page doesn't expose them in meta tags either. - -- **`sortBy=relevance` applies only with `search_query`.** Using `sortBy=relevance` with `id_list` has no effect — results still come back in date order. - -- **`max_results` cap is 2000 per call.** For bulk harvesting of a category, use `start` offset pagination and add 3s sleep between pages. `opensearch:totalResults` tells you the total so you can compute how many pages are needed. - -- **HTML `citation_abstract` meta tag contains the full abstract.** Unlike the Atom `atom:summary` which can have trailing whitespace and embedded newlines, the meta tag version is a single clean string — no `.strip()` needed. 
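A minimal sketch of the `citation_*` meta-tag extraction the last two bullets refer to. The tag names (`citation_title`, `citation_author`, `citation_abstract`, `citation_pdf_url`) are the standard ones arXiv abs pages expose, but confirm them against a live page before relying on this.

```python
import re
from helpers import http_get

html = http_get("https://arxiv.org/abs/1706.03762", headers={"User-Agent": "Mozilla/5.0"})

def meta(name, doc=html):
    m = re.search(rf'<meta name="{name}" content="([^"]*)"', doc)
    return m.group(1) if m else None

title = meta("citation_title")
authors = re.findall(r'<meta name="citation_author" content="([^"]*)"', html)  # "Last, First" order
abstract = meta("citation_abstract")  # single clean string, no .strip() needed
pdf_url = meta("citation_pdf_url")
```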
diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/atlas/overview.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/atlas/overview.md deleted file mode 100644 index 7b0236318..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/atlas/overview.md +++ /dev/null @@ -1,70 +0,0 @@ ---- -name: atlas-recruit -description: Atlas recruitment platform (my.recruitwithatlas.com) — routes, filters, GraphQL bootstrap for authenticated UI probes. ---- - -# Atlas — my.recruitwithatlas.com - -Gated recruitment SaaS. Auth via Google SSO (WebAuthn/passkey). GraphQL backend at `/graphql` (NextAuth session cookie, `credentials: 'include'` from the tab). - -## Routes - -| Route | What | -|---|---| -| `/home` | Dashboard (default landing after login) | -| `/sign-in` | Redirect target when unauthenticated | -| `/business-development/opportunities` | BD opportunities (kanban / list view) | -| `/business-development/leads` | Leads | -| `/business-development/prospects` | Prospects | -| `/business-development/playbook` | Playbook | -| `/candidates` | Candidate pipeline | -| `/projects/` | Specific job / project | -| `/graphql` | Authenticated GraphQL endpoint (POST) | - -## Filters in URL - -BD opportunities uses `?filters=[JSON]` (URL-encoded). Example "Me" filter: - -```json -[{"id":"opportunity_owner","selectedOptions":[{"id":"","title":"Me","excludeFromSearch":false}]}] -``` - -Filter IDs seen: `opportunity_owner`, `stage`, `industry`, `segment`, `conversion_probability`. - -## Finding your own user UUID - -- Apply a filter like "owner = Me" in `/business-development/opportunities`, then read `selectedOptions[0].id` out of the URL `filters=` param. -- Or: `query { me { id email } }` via the GraphQL endpoint (see below). -- User UUIDs are tenant-stable; keep them in a local secret store, not in this shared skill. - -## Stages (BD funnel) - -`Identified` → `Initial Outreach` → `Late Stage` → `Converted` → `Archived`. Seen as tab labels on `/business-development/opportunities`. - -## Auth quirks - -- Google SSO flows through `accounts.google.com/signin/oauth/id?...` — passkey / WebAuthn only, no password fallback visible. -- Session state lives in multiple cookies (JWE session + CSRF). Injecting only the JWE into a fresh Chrome profile is **not sufficient** for UI access — you land in a login loop. For UI work: log in once inside a persistent Chrome profile and let all cookies settle. For backend-only GraphQL calls: the `__Secure-authjs.session-token` JWE alone is enough when sent with `cookie: __Secure-authjs.session-token=` from an external HTTP client. - -## GraphQL endpoint - -POST `https://my.recruitwithatlas.com/graphql` using the tab's own cookies: - -```python -js(""" -fetch('/graphql', { - method: 'POST', - headers: {'Content-Type': 'application/json', 'apollo-require-preflight': 'true'}, - credentials: 'include', - body: JSON.stringify({query: 'query { me { id email } }'}) -}).then(r => r.json()).then(j => JSON.stringify(j)) -""") -``` - -This reuses the session cookies of the current tab — no JWE juggling needed when browsing from inside browser-harness. - -Known mutations (verified against production schema, April 2026): `opportunityCreate`, `opportunityUpdate`, `companyCreate`, `projectCreate`, `projectUpdate`, `opportunityAddLead`, `createOpportunityNote`. Create mutations return placeholder names; follow with an `opportunityUpdate` / `projectUpdate` to set the final name or description. 
`opportunityAddLead` side-effects `Project.company` onto `Opportunity.targetCompany` when the opp had none. - -## Page titles - -The app sets a green-dot emoji prefix on titles: `🟢 Atlas Agency` (sign-in), `🟢 Business development` (BD overview), etc. Useful for `wait_for` conditions — the emoji is consistent across routes. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/booking-com/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/booking-com/scraping.md deleted file mode 100644 index 1e9eaaa9b..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/booking-com/scraping.md +++ /dev/null @@ -1,578 +0,0 @@ -# Booking.com — Scraping & Data Extraction - -Field-tested against booking.com on 2026-04-18 using `http_get` and the -`dml/graphql` JSON API. All tests run without a browser session. - ---- - -## TL;DR - -**`http_get` returns nothing useful from booking.com.** Every HTML page — -search results, hotel pages, city pages, the homepage — is intercepted by an -AWS WAF JS challenge before any content is served. The challenge requires -JavaScript execution to complete a cryptographic puzzle and set an -`aws-waf-token` cookie. Without a real browser, you get a ~4-8 KB stub page. - -**What you can do without a browser:** -- Enumerate hotel/city/region URLs from XML sitemaps (Googlebot UA required). -- Read `robots.txt` for URL pattern documentation. -- Query the GraphQL endpoint `https://www.booking.com/dml/graphql` for schema - exploration (no auth = internal errors, but validation errors reveal the - schema). - -**For all actual data extraction, use the browser (`goto` + `js`).** - ---- - -## AWS WAF JS Challenge — What It Is - -Every `http_get` request to `www.booking.com` receives one of two variants of -a WAF stub: - -**Variant A (~3,962 bytes) — modern SDK:** -```html - - -``` - -**Variant B (~8,410 bytes) — with AJAX error reporting:** -Same AWS WAF SDK, plus an `XMLHttpRequest`-based error reporter that POSTs to -`https://reports.booking.com/chal_report`. This variant is more common on -non-browser UA strings. - -**Detection in your code:** -```python -def is_waf_blocked(html: str) -> bool: - return ( - 'AwsWafIntegration' in html - or 'awsWafCookieDomainList' in html - or 'challenge.js' in html - or len(html) < 10_000 and '' in html - ) -``` - -**What the challenge does:** -1. Loads a 1.3 MB obfuscated JS file (`challenge.js`) from a path-keyed URL. -2. Executes a cryptographic proof-of-work puzzle client-side. -3. Sets an `aws-waf-token` cookie on the `booking.com` domain. -4. Redirects to the original URL with `?chal_t={timestamp}&force_referer=` - appended. - -This challenge **cannot be solved by `http_get`**. It requires a real JS -engine. A `bkng` session cookie is set on the first blocked response, but it -has no value without the WAF token. - -**User agents tested — all blocked:** -- Chrome desktop (`Mozilla/5.0 ... Chrome/120`) -- iPhone/Safari mobile -- `Googlebot/2.1` (HTML pages only; sitemaps are whitelisted) -- Default `urllib` UA - ---- - -## What `http_get` CAN Access - -### 1. XML Sitemaps (URL discovery) - -Booking.com whitelists sitemap paths for Googlebot. This lets you enumerate -millions of property, city, region, and attraction URLs without a browser. 
- -```python -import gzip, re, urllib.request - -GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"} - -def fetch_sitemap_index(url: str) -> list[str]: - """Returns list of child sitemap URLs from an index sitemap.""" - xml = http_get(url, headers=GOOGLEBOT) - return re.findall(r'(https://[^<]+)', xml) - -def fetch_sitemap_gz(gz_url: str) -> list[str]: - """Decompresses a gzipped sitemap and returns all URLs.""" - req = urllib.request.Request(gz_url, headers=GOOGLEBOT) - with urllib.request.urlopen(req, timeout=30) as r: - data = gzip.decompress(r.read()) - return re.findall(r'(https://[^<]+)', data.decode()) - -# Example: get all en-gb hotel URLs -hotel_idx = http_get( - "https://www.booking.com/sitembk-hotel-index.xml", - headers=GOOGLEBOT -) -# 74 shards for en-gb; each shard has ~45,000-50,000 property URLs -en_gb_shards = re.findall( - r'(https://www\.booking\.com/sitembk-hotel-en-gb\.\d+\.xml\.gz)', - hotel_idx -) -# hotel_urls = fetch_sitemap_gz(en_gb_shards[0]) # ~50K URLs per shard -``` - -**Available sitemap categories (confirmed, 275 total):** - -| Index URL | Content | -|-----------|---------| -| `sitembk-hotel-index.xml` | All properties (~74 en-gb shards, ~3.5M URLs) | -| `sitembk-city-index.xml` | City landing pages (~6 en-gb shards, ~44K cities) | -| `sitembk-region-index.xml` | Region landing pages | -| `sitembk-country-index.xml` | Country landing pages | -| `sitembk-attractions-index.xml` | Attractions | -| `sitembk-hotel-review-index.xml` | Review pages | -| `sitembk-themed-city-{type}-index.xml` | Category-specific city pages (70+ types: hostels, luxury, spa, ski, etc.) | - -### 2. `robots.txt` - -```python -robots = http_get("https://www.booking.com/robots.txt", headers={"User-Agent": "Mozilla/5.0"}) -``` - -- Returns immediately, no WAF. -- 136 Disallow entries, 275 Sitemap declarations. -- Documents all URL structures (search results, hotel pages, booking flow, etc.). - -### 3. GraphQL Schema Exploration (no auth) - -The endpoint `https://www.booking.com/dml/graphql` is **not WAF-protected**. -It accepts POST requests and returns JSON. Without a session, most queries -return `Internal Server Error` from the backend (`irene` service), but -**GraphQL validation errors fire before the backend** and reveal the schema. 
- -```python -import json, urllib.request, gzip - -GQL_URL = "https://www.booking.com/dml/graphql?lang=en-gb" -GQL_HEADERS = { - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36", - "Accept": "application/json", - "Content-Type": "application/json", - "Origin": "https://www.booking.com", - "Referer": "https://www.booking.com/searchresults.html", - "x-booking-context-action-name": "searchresults", - "x-booking-context-aid": "376510", - "x-booking-site-type-id": "1", -} - -def gql(operation_name: str, query: str, variables: dict = None) -> dict: - payload = {"operationName": operation_name, "query": query} - if variables: - payload["variables"] = variables - req = urllib.request.Request( - GQL_URL, - data=json.dumps(payload).encode(), - headers=GQL_HEADERS, - method="POST" - ) - with urllib.request.urlopen(req, timeout=20) as r: - data = r.read() - if r.headers.get("Content-Encoding") == "gzip": - data = gzip.decompress(data) - return json.loads(data.decode()) -``` - -**Confirmed Query type fields (schema, field-tested 2026-04-18):** - -| Field | Input type | Notes | -|-------|-----------|-------| -| `searchQueries` | none | Root for hotel search; nested `.search(SearchQueryInput!)` | -| `searchBox` | `SearchBoxInput!` | Destination autocomplete / search form state | -| `searchProperties` | `SearchInput!` | Returns 500 without auth session | -| `propertyDetails` | `PropertyDetailsQueryInput!` | Returns 500 without auth session | -| `popularDestinations` | `PopularDestinationsInput!` | Returns validation error (type mismatch) | - -**Important:** Booking.com GraphQL uses an **operation name whitelist** for -some operations. If you get `GRAPHQL_UNKNOWN_OPERATION_NAME`, try any of the -following confirmed working names: `SearchResultsPage`, `SearchQuery`, -`HotelCardsList`, `SearchResultsList`, `PropertySearch`, `BookingSearch`. - -**Operation names that bypass the whitelist restriction** (all return -`{ data: { __typename: 'Query' } }` with `{ __typename }`): -- `SearchResultsPage` ✓ (confirmed, use this) - -**The search query structure** (known but returns 500 without session): -```graphql -query SearchResultsPage($input: SearchQueryInput!) 
{ - searchQueries { - search(input: $input) { - __typename # Returns SearchQueryResult type - } - } -} -``` - -With `SearchQueryInput` fields (inferred from URL parameters, confirmed -accepted by validation): -```json -{ - "dest_id": "-1456928", - "dest_type": "CITY", - "checkin": "2026-05-01", - "checkout": "2026-05-03", - "group_adults": "2", - "no_rooms": "1", - "group_children": "0", - "selected_currency": "USD" -} -``` - ---- - -## URL Parameter Reference - -### Search Results -`https://www.booking.com/searchresults.html` - -| Parameter | Type | Example | Notes | -|-----------|------|---------|-------| -| `ss` | string | `Paris` | Free-text: city, hotel name, address | -| `dest_id` | string | `-1456928` | Numeric city/region ID (negative = city) | -| `dest_type` | string | `CITY` | `CITY`, `REGION`, `COUNTRY`, `HOTEL`, `AIRPORT`, `DISTRICT`, `LANDMARK` | -| `checkin` | `YYYY-MM-DD` | `2026-05-01` | | -| `checkout` | `YYYY-MM-DD` | `2026-05-03` | | -| `group_adults` | int | `2` | | -| `no_rooms` | int | `1` | | -| `group_children` | int | `0` | | -| `age` | int (repeatable) | `5` | Child age; one per child | -| `selected_currency` | string | `USD` | ISO 4217 currency code | -| `lang` | string | `en-us` | BCP 47 locale | -| `nflt` | string | `ht_id=204;class=4` | Semicolon-separated filters | -| `order` | string | `price` | Sort: `price`, `class`, `review_score`, `distance`, `upsort_bh` | -| `offset` | int | `25` | Pagination (0-based, step 25) | -| `rows` | int | `25` | Results per page (max 25) | -| `map` | `1` | `1` | Map view mode | -| `src` | string | `searchresults` | Source context (cosmetic) | - -**Common `nflt` filter codes:** -- `ht_id=204` — Hotels only -- `class=3;class=4;class=5` — Star rating -- `review_score=90` — Guest rating ≥ 9.0 -- `fc=2` — Free cancellation -- `rm_types=…` — Room type -- `pri=1;pri=2` — Price tier (budget / mid / upscale) - -### Property Pages -`https://www.booking.com/hotel/{country_code}/{hotel_slug}.html` - -Confirmed from sitemap (74 shards, ~3.5M properties): -``` -https://www.booking.com/hotel/{cc}/{slug}.html -https://www.booking.com/hotel/{cc}/{slug}.en-gb.html -https://www.booking.com/hotel/{cc}/{slug}.{lang}.html -``` -- `cc` = 2-letter ISO country code (e.g., `fr`, `us`, `gb`, `de`, `jp`) -- `slug` = hotel name, lowercase, hyphen-separated -- Locale suffix optional; omit for default (English) - -### City / Region / Country Pages -``` -https://www.booking.com/city/{cc}/{city-slug}.html -https://www.booking.com/region/{cc}/{region-slug}.html -https://www.booking.com/country/{cc}.html -``` - ---- - -## Browser-Based Extraction (Required for All Data) - -Since `http_get` is blocked, all actual data extraction requires the browser -(`goto` + `js`). The WAF challenge resolves automatically in Chrome. - -### Initial Navigation - -```python -# Always use new_tab() for the first Booking.com load in a session -tid = new_tab("https://www.booking.com/searchresults.html?ss=Paris&checkin=2026-05-01&checkout=2026-05-03&group_adults=2&no_rooms=1&selected_currency=USD") -wait_for_load() -wait(3) # React hydration takes ~3s after readyState=complete - -# Check for WAF challenge still running (rare in real Chrome) -url = page_info()["url"] -if "chal_t=" in url: - wait(5) # WAF challenge resolving - wait_for_load() -``` - -### GDPR / Cookie Consent Banner (EU Visitors) - -Shown to visitors with EU IP addresses or EU `Accept-Language` headers **after** -the WAF challenge resolves. It blocks interaction until dismissed. 
- -```python -def dismiss_cookie_banner(): - # Booking.com uses data-testid="accept" on the Accept button - accepted = js(""" - (function() { - var btn = document.querySelector('[data-testid="accept"]') - || document.querySelector('#onetrust-accept-btn-handler') - || document.querySelector('[aria-label*="Accept"]'); - if (btn) { btn.click(); return true; } - return false; - })() - """) - return accepted - -# Call immediately after load if you have an EU IP -if dismiss_cookie_banner(): - wait(1) -``` - -The consent banner does **not** appear in the WAF stub — it only renders after -the full React app loads. Non-EU visitors (US IP, `Accept-Language: en-US`) -may not see it at all. - -### Search Results Page Extraction - -```python -results = js(""" - Array.from(document.querySelectorAll('[data-testid="property-card"]')).map(el => ({ - name: el.querySelector('[data-testid="title"]')?.innerText?.trim(), - url: el.querySelector('[data-testid="title-link"]')?.href, - price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(), - rating: el.querySelector('[data-testid="review-score"]')?.innerText?.trim(), - stars: el.querySelectorAll('[data-testid="rating-stars"] svg').length, - location: el.querySelector('[data-testid="address"]')?.innerText?.trim(), - availability_note: el.querySelector('[data-testid="availability-rate-information"]')?.innerText?.trim(), - is_genius: !!el.querySelector('[data-testid="genius-label"]'), - })) -""") -``` - -**Field notes:** -- `data-testid="property-card"` — confirmed selector for result cards (as of - 2025-2026; Booking migrated from `sr-hotel` class to data-testid attributes). -- `data-testid="price-and-discounted-price"` — contains the nightly rate; - may show original + discounted price together as text. -- `data-testid="review-score"` — contains both the numeric score (e.g., - `"9.2"`) and the label (e.g., `"Superb"`); use `.split('\n')[0]` for score. -- `data-testid="rating-stars"` — star rating icons; count SVG children for - star count. -- Results are loaded asynchronously; 3s wait after `wait_for_load()` is - required for all cards to render. - -### Pagination - -```python -# Method 1: Next page button -next_btn = js("document.querySelector('[data-testid=\"pagination-next\"]')?.href") -if next_btn: - goto_url(next_btn) - wait_for_load() - wait(3) - -# Method 2: Offset parameter (25 results per page) -current_url = page_info()["url"] -offset = 25 # next page -goto_url(current_url + f"&offset={offset}") -wait_for_load() -wait(3) -``` - -### Property / Hotel Page Extraction - -```python -detail = js(""" - ({ - name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim() - || document.querySelector('h2.hp__hotel-name, h1.pp-hotel-name-title')?.innerText?.trim(), - rating: document.querySelector('[data-testid="rating-squares"]') - ? 
document.querySelectorAll('[data-testid="rating-squares"] svg').length - : null, - score: document.querySelector('[data-testid="review-score-right-component"] .ac4a7896c7')?.innerText - || document.querySelector('[aria-label*="Scored"]')?.getAttribute('aria-label'), - address: document.querySelector('[data-testid="PropertyHeaderAddressDesktop"]')?.innerText?.trim() - || document.querySelector('[id="hotel_address"]')?.innerText?.trim(), - description: document.querySelector('[data-testid="property-description-content"]')?.innerText?.trim() - || document.querySelector('#property_description_content')?.innerText?.trim(), - amenities: Array.from(document.querySelectorAll('[data-testid="facility-list-item"]')) - .map(e => e.innerText?.trim()).filter(Boolean), - room_types: Array.from(document.querySelectorAll('[data-testid="roomstable-accordion"]')) - .map(el => ({ - name: el.querySelector('[data-testid="room-type-name"]')?.innerText?.trim(), - price: el.querySelector('[data-testid="price-and-discounted-price"]')?.innerText?.trim(), - })), - lat: document.querySelector('a[href*="maps.google"]') - ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[0], - lon: document.querySelector('a[href*="maps.google"]') - ?.href?.match(/[?&]q=([^&]+)/)?.[1]?.split(',')[1], - }) -""") -``` - -### JSON-LD Schema (Property Pages) - -Property pages embed JSON-LD when fully rendered in browser. The schema type -is `Hotel`: - -```python -ld_json = js(""" - (function() { - for (var s of document.querySelectorAll('script[type="application/ld+json"]')) { - try { - var d = JSON.parse(s.textContent); - if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d; - } catch(e) {} - } - return null; - })() -""") -# Returns: -# { -# "@type": "Hotel", -# "name": "Hotel de Crillon", -# "aggregateRating": {"ratingValue": "9.2", "reviewCount": "1423"}, -# "address": {"streetAddress": "10 Place de la Concorde", "addressLocality": "Paris", ...}, -# "geo": {"latitude": 48.865, "longitude": 2.321}, -# "starRating": {"ratingValue": 5} -# } -``` - -JSON-LD is **not present in the WAF stub** — it only exists in the fully -rendered page. `http_get` will never see it. - -### Embedded JavaScript Data (`__NEXT_DATA__` / `b_hotel_data`) - -Booking.com's React app may embed search state in `window.__NEXT_DATA__` or -legacy `b_hotel_data` globals. Access via: - -```python -next_data = js("window.__NEXT_DATA__") # dict or None -b_hotel = js("window.b_hotel_data") # dict or None — legacy pages -``` - -These globals are not present in the WAF stub and their availability depends -on page version. Prefer data-testid selectors which are more stable. 
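A minimal sketch combining the two sources above: prefer JSON-LD when it is present, fall back to the data-testid selectors otherwise. The helper name `get_property_basics` is illustrative and the fallback selectors may shift across page versions — treat this as a starting point, not a field-tested recipe.

```python
def get_property_basics():
    # Prefer JSON-LD: one clean object once the page is fully rendered
    ld = js("""
      (function() {
        for (var s of document.querySelectorAll('script[type="application/ld+json"]')) {
          try {
            var d = JSON.parse(s.textContent);
            if (d['@type'] === 'Hotel' || d['@type'] === 'LodgingBusiness') return d;
          } catch (e) {}
        }
        return null;
      })()
    """)
    if ld:
        rating = ld.get('aggregateRating') or {}
        return {
            'name': ld.get('name'),
            'score': rating.get('ratingValue'),
            'review_count': rating.get('reviewCount'),
        }
    # Fallback: data-testid selectors (score string needs splitting)
    return js("""
      ({
        name: document.querySelector('[data-testid="property-name"]')?.innerText?.trim(),
        score: document.querySelector('[data-testid="review-score-right-component"]')
          ?.innerText?.split('\\n')[0],
        review_count: null,
      })
    """)
```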
- ---- - -## Pricing Extraction Patterns - -Booking.com shows prices per night with multiple formatting variants: - -```python -price_patterns = js(""" - ({ - // Search results card price - search_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText, - // Property page room price - room_price: document.querySelector('[data-testid="price-and-discounted-price"]')?.innerText, - // Original (crossed-out) price before discount - original_price: document.querySelector('[data-testid="recommended-units-price"] s')?.innerText - || document.querySelector('.prco-valign-middle-helper del')?.innerText, - // "Price for X nights" summary - total_price: document.querySelector('[data-testid="checkout-price-summary"]')?.innerText, - // Genius discount tag - genius_discount: document.querySelector('[data-testid="genius-rate-badge"]')?.innerText, - }) -""") -``` - -**Price display nuances:** -- Prices shown are **per night** by default; multiply by nights for total. -- Currency is controlled by `selected_currency` URL param or user account - setting. -- Taxes/fees may or may not be included; look for `"Includes taxes and fees"` - or `"+ taxes & fees"` text adjacent to the price element. -- The `data-testid="price-and-discounted-price"` element returns a single - string that may contain both original and discounted price - (e.g., `"US$400\nUS$320"`). - ---- - -## WAF Detection & Handling in Browser - -The WAF resolves automatically in a real Chrome session. To detect if -something went wrong: - -```python -def check_booking_waf(): - url = page_info()["url"] - html_snippet = js("document.body?.innerHTML?.slice(0, 500)") or "" - return ( - "chal_t=" in url - or "AwsWafIntegration" in html_snippet - or "challenge-container" in html_snippet - ) - -def wait_past_waf(timeout=15): - import time - deadline = time.time() + timeout - while time.time() < deadline: - if not check_booking_waf(): - return True - wait(1) - return False # timed out — WAF didn't resolve - -# Use after goto_url(): -goto_url("https://www.booking.com/searchresults.html?ss=London&checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1") -wait_for_load() -wait_past_waf() -wait(2) # hydration -``` - ---- - -## Sitemap-Based URL Discovery Workflow - -Use this when you need a list of property URLs for a given country or city, -without needing to scrape search results pages in the browser: - -```python -import gzip, re, urllib.request - -GOOGLEBOT = {"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"} - -def get_hotel_urls_for_country(cc: str, lang: str = "en-gb", max_shards: int = 2) -> list[str]: - """Returns property page URLs for a country from sitemaps. 
No browser needed.""" - idx_url = f"https://www.booking.com/sitembk-hotel-index.xml" - idx = http_get(idx_url, headers=GOOGLEBOT) - pattern = rf'(https://www\.booking\.com/sitembk-hotel-{lang}\.\d+\.xml\.gz)' - shards = re.findall(pattern, idx)[:max_shards] - - urls = [] - for shard_url in shards: - req = urllib.request.Request(shard_url, headers=GOOGLEBOT) - with urllib.request.urlopen(req, timeout=60) as r: - xml = gzip.decompress(r.read()).decode() - all_urls = re.findall(r'(https://[^<]+)', xml) - # Filter by country code - country_urls = [u for u in all_urls if f"/hotel/{cc}/" in u] - urls.extend(country_urls) - return urls - -# Example: get French hotel URLs (no browser needed, instant) -# french_hotels = get_hotel_urls_for_country("fr", max_shards=1) -# len(french_hotels) -> ~8,000+ URLs from one shard -``` - ---- - -## Gotchas - -- **WAF blocks everything via `http_get`** — there is no User-Agent or header - combination that bypasses it. The challenge is cryptographic, not heuristic. -- **WAF has two page sizes** — ~3,962 bytes (newer SDK, no AJAX reporter) and - ~8,410 bytes (older with error reporting). Both are equally blocked. -- **Sitemaps whitelist Googlebot UA** — `Googlebot/2.1` UA works for sitemap - XML/GZ files but NOT for hotel/city/search HTML pages. -- **GraphQL endpoint is unprotected** but useless without a valid Booking.com - session (irene service requires authentication for all substantive queries). -- **GraphQL op-name whitelist**: introspection (`__schema`) is blocked by - operation name restriction. Use field validation errors to probe the schema. -- **GDPR consent banner**: shown after WAF resolves, before React renders - search results. Must be dismissed (click `[data-testid="accept"]`) before - interacting with EU sessions. Non-EU IPs may not see it. -- **React hydration delay**: `wait_for_load()` fires before card data renders. - Always add 2-3s of `wait()` after `wait_for_load()`. -- **`sr-hotel` class is legacy** — Booking.com migrated to data-testid - attributes. Use `[data-testid="property-card"]`, not `.sr-hotel`. -- **Price parsing**: the price element often contains the full string - `"US$400\nUS$320"` when a discount applies. Split on `\n` and take the last - item for current price. -- **Offset pagination cap**: Booking caps results at 1,000 properties per - search (offset 0–975, rows=25). For cities with >1,000 properties, use - filters (`nflt`) to segment results. -- **Currency must be set via URL param**: `selected_currency=USD` in the search - URL; the cookie-based currency selection may not persist across navigation. -- **`dest_id` for cities**: Paris = `-1456928`, Amsterdam = `-2140479`, - London = `-2601889`. Negative integers indicate city-level destinations. - Get the ID by reading it from the URL after using `ss=` search. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/capterra/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/capterra/scraping.md deleted file mode 100644 index e6eae7dec..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/capterra/scraping.md +++ /dev/null @@ -1,440 +0,0 @@ -# Capterra — Scraping & Data Extraction - -Field-tested against capterra.com on 2026-04-18. All code blocks validated with live requests. - -## Do this first - -**Use `User-Agent: ClaudeBot` — Capterra explicitly allows it in robots.txt and returns clean, pre-rendered Markdown instead of JavaScript-heavy HTML. 
No browser needed.** - -Capterra serves a fully structured Markdown representation of every page to AI bots (`ClaudeBot`, `GPTBot`, `PerplexityBot`, `Anthropic-AI` are all listed as `Allow: /` in robots.txt). The Markdown format is far easier to parse than HTML. - -With the default `Mozilla/5.0` UA (or any realistic browser UA), Capterra returns HTTP 403 with `Cf-Mitigated: challenge` — Cloudflare blocks all browser UA requests. There is no bypass via HTTP; those pages require a real browser session. - -```python -from helpers import http_get -import re, json - -# Works everywhere: -html = http_get( - "https://www.capterra.com/p/135003/Slack/reviews/", - headers={"User-Agent": "ClaudeBot"} -) - -# Extract overall rating and review count from the Markdown header line "4.7 (24059)" -m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE) -print(m.group(1), m.group(2)) # 4.7 24059 -``` - ---- - -## Fastest approach: product summary in one call - -All key metrics — overall rating, review count, sub-ratings, pagination — come from the `/reviews/` endpoint in a single request. - -```python -from helpers import http_get -import re, json - -def get_product_summary(product_id, slug): - """ - Returns overall rating, review count, sub-ratings. - product_id: Capterra numeric ID (e.g. 135003) - slug: URL slug (e.g. 'Slack') - """ - url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/" - html = http_get(url, headers={"User-Agent": "ClaudeBot"}) - - result = {"product_id": product_id, "slug": slug} - - # Overall rating + review count from header line "4.7 (24059)" - m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE) - if m: - result["overall_rating"] = float(m.group(1)) - result["review_count"] = int(m.group(2).replace(",", "")) - - # Page size and total pages from "Showing 1-25 of 24059 Reviews" - showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html) - if showing: - result["per_page"] = int(showing.group(2)) - result["total_pages"] = (int(showing.group(3).replace(",", "")) + 24) // 25 - - # Sub-ratings: "Ease of use\n\n4.6" and "Customer Service\n\n4.4" - lines = html.split("\n") - for i, line in enumerate(lines): - for label, key in [("Ease of use", "ease_of_use"), ("Customer Service", "customer_service")]: - if line.strip() == label: - for j in range(i + 1, min(i + 5, len(lines))): - try: - val = float(lines[j].strip()) - if 0 < val <= 5.0: - result[key] = val - break - except ValueError: - pass - - return result - -summary = get_product_summary(135003, "Slack") -print(json.dumps(summary, indent=2)) -# { -# "product_id": 135003, -# "slug": "Slack", -# "overall_rating": 4.7, -# "review_count": 24059, -# "per_page": 25, -# "total_pages": 963, -# "ease_of_use": 4.6, -# "customer_service": 4.4 -# } -``` - ---- - -## Common workflows - -### Get reviews (paginated) - -25 reviews per page. Use `?page=N` for pagination. - -```python -from helpers import http_get -import re - -def get_reviews_page(product_id, slug, page=1): - """ - Returns up to 25 reviews for one page. - Total pages = ceil(review_count / 25). 
- """ - url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}" - html = http_get(url, headers={"User-Agent": "ClaudeBot"}) - - # Total review count from header - m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', html, re.MULTILINE) - total = int(m.group(2).replace(",", "")) if m else 0 - - # Showing X-Y of Z - showing = re.search(r"Showing\s+(\d+)[-–](\d+)\s+of\s+([\d,]+)\s+Reviews", html) - - # Split by review title markers "### "Title"" - blocks = re.split(r'\n### "', html) - reviews = [] - - for block in blocks[1:]: - r = {} - - # Title (up to closing quote) - t = re.match(r'([^"]+)"', block) - if t: - r["title"] = t.group(1).strip() - - # Date - d = re.search( - r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}", - block - ) - if d: - r["date"] = d.group(0) - - # Overall rating for this review (first float 1.0–5.0 between blank lines) - rm = re.search(r"\n\n([\d.]+)\n\n", block) - if rm: - val = float(rm.group(1)) - if 1.0 <= val <= 5.0: - r["rating"] = val - - # Pros - pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL) - if pros: - r["pros"] = pros.group(1).strip() - - # Cons - cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL) - if cons: - r["cons"] = cons.group(1).strip() - - if r.get("title"): - reviews.append(r) - - return { - "total": total, - "page": page, - "showing": f"{showing.group(1)}-{showing.group(2)} of {showing.group(3)}" if showing else None, - "reviews": reviews, - } - -# Page 1 -result = get_reviews_page(135003, "Slack", page=1) -print(f"Total reviews: {result['total']}, this page: {len(result['reviews'])}") -# Total reviews: 24059, this page: 25 - -print(result["reviews"][0]) -# {'title': 'Love, love, love Slack!', 'date': 'April 14, 2026', 'rating': 5.0, -# 'pros': '...', 'cons': '...'} -``` - -### Scrape all reviews in bulk (parallel) - -10 pages in ~2s with 5 workers. No rate limiting observed during testing. - -```python -from helpers import http_get -import re -from concurrent.futures import ThreadPoolExecutor - -UA = {"User-Agent": "ClaudeBot"} - -def _fetch_page(args): - product_id, slug, page = args - url = f"https://www.capterra.com/p/{product_id}/{slug}/reviews/?page={page}" - html = http_get(url, headers=UA) - blocks = re.split(r'\n### "', html) - reviews = [] - for block in blocks[1:]: - r = {} - t = re.match(r'([^"]+)"', block) - if t: r["title"] = t.group(1).strip() - d = re.search(r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d+,\s+\d{4}", block) - if d: r["date"] = d.group(0) - rm = re.search(r"\n\n([\d.]+)\n\n", block) - if rm: - val = float(rm.group(1)) - if 1.0 <= val <= 5.0: r["rating"] = val - pros = re.search(r"\nPros\n\n(.+?)(?=\n\nCons|\n\nReview Source|\n\nSwitched|\Z)", block, re.DOTALL) - if pros: r["pros"] = pros.group(1).strip() - cons = re.search(r"\nCons\n\n(.+?)(?=\n\nReview Source|\n\nSwitched|\n\n##|\Z)", block, re.DOTALL) - if cons: r["cons"] = cons.group(1).strip() - if r.get("title"): reviews.append(r) - return reviews - -def get_all_reviews(product_id, slug, max_pages=None, workers=5): - """Fetch all reviews in parallel. 
max_pages=None fetches everything.""" - # First: get total pages - summary_html = http_get( - f"https://www.capterra.com/p/{product_id}/{slug}/reviews/", - headers=UA - ) - m = re.search(r'^([\d.]+)\s+\(([\d,]+)\)$', summary_html, re.MULTILINE) - total = int(m.group(2).replace(",", "")) if m else 0 - total_pages = (total + 24) // 25 - pages = range(1, (max_pages or total_pages) + 1) - - tasks = [(product_id, slug, p) for p in pages] - all_reviews = [] - with ThreadPoolExecutor(max_workers=workers) as ex: - for batch in ex.map(_fetch_page, tasks): - all_reviews.extend(batch) - return all_reviews - -# Fetch first 50 reviews (2 pages) in parallel -reviews = get_all_reviews(135003, "Slack", max_pages=2, workers=2) -print(f"Fetched {len(reviews)} reviews") -# Fetched 50 reviews -``` - -### Get a product's full overview (rating breakdown, sentiment, pricing) - -```python -from helpers import http_get -import re, json - -def get_product_overview(product_id, slug): - """Rating breakdown, sentiment, starting price from the product page.""" - url = f"https://www.capterra.com/p/{product_id}/{slug}/" - html = http_get(url, headers={"User-Agent": "ClaudeBot"}) - - result = {} - - # Overall rating and review count from the reviews section - # Appears as "\n4.7\n\nBased on 24,059 reviews\n" - m = re.search(r'\n([\d.]+)\n\nBased on ([\d,]+) reviews\n', html) - if m: - result["overall_rating"] = float(m.group(1)) - result["review_count"] = int(m.group(2).replace(",", "")) - - # Rating breakdown: "5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)" - breakdown = re.findall(r'\b([1-5])\((\d+)\)', html) - if breakdown: - result["rating_breakdown"] = {int(s): int(c) for s, c in breakdown if 1 <= int(s) <= 5} - - # Sentiment: "Positive\n\n96%\n\nNeutral\n\n4%\n\nNegative\n\n1%" - for label, key in [("Positive", "sentiment_positive"), ("Neutral", "sentiment_neutral"), ("Negative", "sentiment_negative")]: - sm = re.search(rf'{label}\s*\n+\s*(\d+)%', html) - if sm: - result[key] = int(sm.group(1)) - - # Starting price ("Starting price\n\n$8.75\n\nPer User") - pm = re.search(r'Starting price\s*\n+\$?([\d.]+)', html) - if pm: - result["starting_price_usd"] = float(pm.group(1)) - - # Categories ("What is X used for?" links) - cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html[:3000]) - if cats: - result["categories"] = [name for name, _ in cats] - - # Sub-ratings from product page - for label, key in [("Value for money", "value_for_money"), ("Features", "features_rating")]: - sub = re.search(rf'{label}\s*\n+\s*([\d.]+)', html) - if sub: - try: - val = float(sub.group(1)) - if 0 < val <= 5.0: - result[key] = val - except ValueError: - pass - - return result - -overview = get_product_overview(135003, "Slack") -print(json.dumps(overview, indent=2)) -# { -# "overall_rating": 4.7, -# "review_count": 24059, -# "rating_breakdown": {"5": 17268, "4": 5708, "3": 907, "2": 128, "1": 48}, -# "sentiment_positive": 96, -# "sentiment_neutral": 4, -# "sentiment_negative": 1, -# "starting_price_usd": 8.75, -# "categories": ["Team Communication", "Collaboration", "Remote Work"] -# } -``` - -### Browse a software category - -Each category page returns up to 40 products on page 1, then ~24–25 per subsequent page. Pagination works via `?page=N`. - -```python -from helpers import http_get -import re - -def get_category_products(category_slug, page=1): - """ - List products in a Capterra category. 
- category_slug examples: 'project-management-software', 'crm-software', 'accounting-software' - Full list: https://www.capterra.com/categories/ - """ - url = f"https://www.capterra.com/{category_slug}/" - if page > 1: - url = f"https://www.capterra.com/{category_slug}/?page={page}" - html = http_get(url, headers={"User-Agent": "ClaudeBot"}) - - # Ratings: [4.6 (5732)](https://www.capterra.com/p/147657/monday-com/reviews/) - raw = re.findall( - r'\[([\d.]+)\s+\(([\d,]+)\)\]\(https://www\.capterra\.com/p/(\d+)/([^/]+)/reviews/\)', - html - ) - # Product names from "Learn more about X" links - names = {pid: name for name, pid in re.findall( - r'\[Learn more about ([^\]]+)\]\(https://www\.capterra\.com/p/(\d+)/[^/]+/\)', html - )} - - items, seen = [], set() - for rating, review_count, pid, slug in raw: - if pid not in seen: - seen.add(pid) - items.append({ - "product_id": int(pid), - "name": names.get(pid, slug), - "slug": slug, - "overall_rating": float(rating), - "review_count": int(review_count.replace(",", "")), - "product_url": f"https://www.capterra.com/p/{pid}/{slug}/", - "reviews_url": f"https://www.capterra.com/p/{pid}/{slug}/reviews/", - }) - return items - -products = get_category_products("project-management-software", page=1) -for p in products[:3]: - print(f"{p['name']}: {p['overall_rating']} ({p['review_count']} reviews)") -# monday.com: 4.6 (5732 reviews) -# Jira: 4.4 (15325 reviews) -# Celoxis: 4.4 (327 reviews) -``` - -### Get all 1000+ software categories - -```python -from helpers import http_get -import re - -def get_all_categories(): - """Returns list of {name, slug} for all ~1003 Capterra software categories.""" - html = http_get("https://www.capterra.com/categories/", headers={"User-Agent": "ClaudeBot"}) - cats = re.findall(r'\[([^\]]+)\]\(https://www\.capterra\.com/([a-z-]+-software)/\)', html) - return [{"name": name, "slug": slug} for name, slug in cats] - -categories = get_all_categories() -print(f"{len(categories)} categories") # 1003 -print(categories[:3]) -# [{'name': 'AB Testing', 'slug': 'ab-testing-software'}, -# {'name': 'Absence Management', 'slug': 'absence-management-software'}, ...] -``` - ---- - -## URL patterns - -| Page type | URL pattern | -|-----------|-------------| -| Product overview | `https://www.capterra.com/p/{id}/{Slug}/` | -| Product reviews | `https://www.capterra.com/p/{id}/{Slug}/reviews/` | -| Reviews page N | `https://www.capterra.com/p/{id}/{Slug}/reviews/?page={N}` | -| Reviews (alt) | `https://www.capterra.com/reviews/{id}/{Slug}/` | -| Category listing | `https://www.capterra.com/{category}-software/` | -| Category page N | `https://www.capterra.com/{category}-software/?page={N}` | -| All categories | `https://www.capterra.com/categories/` | -| Product pricing | `https://www.capterra.com/p/{id}/{Slug}/pricing/` | -| Product alternatives | `https://www.capterra.com/p/{id}/{Slug}/alternatives/` | -| Compare A vs B | `https://www.capterra.com/compare/{id_a}-{id_b}/{Slug_a}-vs-{Slug_b}` | - -**Finding a product's ID:** Look in the URL of any product listing in a category page. The pattern `https://www.capterra.com/p/{id}/{Slug}/reviews/` appears in every category listing as the link target for each rating badge. The slug is case-sensitive in practice (e.g. `Slack`, not `slack`). - -Product IDs are stable numeric identifiers. Note that the same software vendor may have multiple product IDs under different names/versions. Always find the ID from a category search rather than guessing. 
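A minimal sketch of that ID-discovery step, reusing `get_category_products` from above. The helper name `find_product_id` is illustrative, not upstream code:

```python
def find_product_id(product_name, category_slug, max_pages=5):
    """Scan category listing pages for a product by case-insensitive name."""
    target = product_name.lower()
    for page in range(1, max_pages + 1):
        items = get_category_products(category_slug, page=page)
        if not items:
            break
        for item in items:
            if item["name"].lower() == target:
                return item  # contains product_id, slug, product_url, reviews_url
    return None

# slack = find_product_id("Slack", "team-communication-software")
# slack["product_id"], slack["slug"]  # -> 135003, 'Slack' per the category listing above
```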
- ---- - -## Anti-bot measures - -- **Cloudflare is active on all routes** (`Server: cloudflare`, `CF-RAY` present in all response headers). -- **Browser UAs (Chrome, Firefox, Safari) return HTTP 403** with `Cf-Mitigated: challenge` regardless of how complete the headers are. There is no HTTP-only bypass. -- **`ClaudeBot` UA bypasses Cloudflare** and receives clean pre-rendered Markdown. Capterra explicitly allows it in `robots.txt` via `User-agent: ClaudeBot / Allow: /`. This is a deliberate AI-accessibility feature. -- **Other AI bot UAs that also work**: `GPTBot`, `PerplexityBot` (also in `robots.txt` Allow list). `Anthropic-AI` was tested and returns 403 — only `ClaudeBot` is the correct UA. -- **The search endpoint (`/search/?q=...`) returns empty results** via ClaudeBot — the query parameter is not passed through. Use category browsing or direct product URLs instead. -- **No CAPTCHA observed** during testing with ClaudeBot. -- **No rate limiting observed**: 10 parallel requests across 5 workers completed in ~2s with all 200 responses. Sequential batches of 5 pages at 0.15–0.95s per request also worked cleanly. -- **The Markdown response has no JSON-LD, no `__NEXT_DATA__`** — these are HTML-only structures. The Markdown format is simpler to parse. -- **Disallowed paths** (from robots.txt): `/search`, `/ppc/clicks/`, `/sem-b/`, `/sem-compare-b/`, `/workspace/`, `/auth/login`. These 403 even with ClaudeBot. - ---- - -## Gotchas - -- **Old Capterra product IDs may be invalid.** The URL `https://www.capterra.com/p/56703/Slack/` (ID 56703) returns 404 even with ClaudeBot — this is a stale or merged product ID. Slack's current ID is 135003, found in the team-communication-software category listing. Always discover IDs by crawling category pages rather than hard-coding them. - -- **Slug is case-sensitive.** `Slack` works; `slack` returns 404. The slug is always in the category listing data. - -- **Response is Markdown, not HTML.** `http_get` returns pre-rendered Markdown with no HTML tags, no JSON-LD, and no `__NEXT_DATA__`. Do not attempt `BeautifulSoup` parsing. Use `re` on the text directly. - -- **`http_get` default UA is `Mozilla/5.0`** — this returns 403 from Capterra. Always pass `headers={"User-Agent": "ClaudeBot"}` explicitly. - -- **Reviews page vs product page**: The `/reviews/` page has a clean rating header (`4.7 (24059)`) on line 10. The product overview page (`/p/{id}/{Slug}/`) has the same number buried deeper in the page as `\n4.7\n\nBased on 24,059 reviews\n`. For rating extraction, the reviews page is simpler and more reliable. - -- **Category page 1 is larger than subsequent pages**: Page 1 includes editorial content (author bio, top-picks editorial) which can double the page size. Subsequent pages are ~20–30KB and contain only listings. - -- **Reviewer name is present in the text but not cleanly delimited**: The Markdown format for reviewer attribution uses plain text lines above the review body. It's easier to skip reviewer name extraction than to parse the ambiguous formatting. - -- **Sub-rating labels in reviews page**: "Ease of use" (lowercase 'u') and "Customer Service" (capitalized 'S') — match exactly. The product overview page may show additional sub-ratings like "Features" and "Value for money". - -- **`rating_breakdown` pattern caveat**: The pattern `[1-5]\(\d+\)` on the product page can also match feature ratings. 
To isolate the 5-star breakdown, find it within the "Filter by rating" section, which appears as a block like `5(17268)\n\n4(5708)\n\n3(907)\n\n2(128)\n\n1(48)`. - ---- - -## When to use the browser instead - -The browser is not needed for any common Capterra task — the ClaudeBot flow handles all of them. Use the browser only if: - -- You need to interact with a page element (e.g. submit a review, use the "fit-finder" wizard). -- You need to access a Capterra page that is explicitly blocked in robots.txt (e.g. `/workspace/`, `/auth/login/`). -- You need to simulate a logged-in user session with Capterra credentials. - -For read-only scraping of product data, reviews, and category listings, `http_get` with `ClaudeBot` UA is both faster and more reliable than a browser. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/centilebrain/generate-estimates.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/centilebrain/generate-estimates.md deleted file mode 100644 index bdd55bcd0..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/centilebrain/generate-estimates.md +++ /dev/null @@ -1,110 +0,0 @@ -# CentileBrain — Generate Normative Deviation Values - -URL: `https://centilebrain.org/#/model` - -Generates z-scores for a single subject's FreeSurfer-derived morphometry -against the CentileBrain normative reference. Three separate modalities -(`SubcorticalVolume`, `CorticalThickness`, `SurfaceArea`), two sexes, -each a distinct Shiny app. Login/account not required. - -## Site shape - -- The `/#/model` page is a thin wrapper around per-modality/sex **Shiny - iframes** at `https://centilebrain-app.shinyapps.io/{SV|CT|SA}-{MALE|FEMALE}/`. -- Switching modality swaps the iframe; switching sex swaps the iframe. - The top-page buttons and toggles are not forms — they just replace - the iframe `src`. -- The upload form, compute button, and download link all live **inside the iframe**. - `iframe_target("shinyapps.io/SV-MALE")` (etc.) returns the session to use. -- Requires `upload_file(..., target_id=...)` — the iframe-aware upload helper. - -## Form elements (inside the iframe) - -| Selector | Purpose | -|---|---| -| `#email` | Required text input. Any valid-looking string works; it does not send mail. | -| `#file1` | The file input. Accepts the official `.xlsx` template for that modality/sex (download from the site to see the expected schema). | -| `#confirm` | The **Compute** button. Click exactly once after upload. | -| `#downloadData1` | **Download Results** link once compute is done. Produces a zip of CSVs + xlsx. | - -## Waits - -- Upload: after `upload_file`, wait ~3 s for the Shiny server to read the file; the data preview table populates in-place. -- Compute: poll the iframe body text for `"Computation complete"` — typically 30-90 s. `"Computing… This may take a few seconds to a couple of minutes."` is the in-progress marker. -- Download: click `#downloadData1`, then poll the Chrome download directory for a `{SV|CT|SA}_{male|female}_YYYY-MM-DD-HH-MM-SS.zip`. Set `Browser.setDownloadBehavior` with a known `downloadPath` before clicking so you can find it deterministically. - -## Traps - -- **Iframe target_id goes stale across modality swaps.** After clicking `CORTICAL THICKNESS` or `SURFACE AREA`, re-call `iframe_target("shinyapps.io/CT-MALE")` — the old id from SV-MALE will not work even though `Target.getTargets` may still list it briefly. Add a 2-3 s sleep after the modality-swap click before re-resolving. 
-- **Sex toggles are MUI switches, not radio buttons.** They are `input[type=checkbox]` with `name=female` / `name=male`. Clicking one does not automatically uncheck the other visibly, but the iframe src changes based on which is `checked`. Easiest: `js("document.querySelector('input[name=male]').click()")`. -- **Top-level buttons scroll off-screen after first interaction.** The modality buttons are at `y ≈ 226`, but after scrolling/iframe expansion they report `y < 0`. Use `js("window.scrollTo(0, 0)")` then click via JS by text (`Array.from(document.querySelectorAll('button')).find(b => b.innerText.trim() === 'CORTICAL THICKNESS').click()`) instead of fixed coordinates. - -## End-to-end example - -```python -import time, os - -DL = "/tmp/centilebrain" -os.makedirs(DL, exist_ok=True) -cdp("Browser.setDownloadBehavior", behavior="allow", downloadPath=DL, eventsEnabled=True) - -new_tab("https://centilebrain.org/#/model") -wait_for_load() -time.sleep(2) - -# Pick modality + sex (SV + male shown; repeat for CT and SA as needed) -js("""Array.from(document.querySelectorAll('button')) - .find(b => b.innerText.trim() === 'SUBCORTICAL VOLUME').click()""") -time.sleep(1) -js("document.querySelector('input[name=male]').click()") -time.sleep(2) - -t = iframe_target("shinyapps.io/SV-MALE") -upload_file("#file1", "/abs/path/JMT_subcortical_volume.xlsx", target_id=t) -time.sleep(3) - -js("""const e=document.querySelector('#email'); - e.value='user@example.com'; - e.dispatchEvent(new Event('input',{bubbles:true}));""", target_id=t) - -js("document.querySelector('#confirm').click()", target_id=t) -for _ in range(40): - time.sleep(3) - if "Computation complete" in js("document.body.innerText", target_id=t): - break - -before = set(os.listdir(DL)) -js("document.querySelector('#downloadData1').click()", target_id=t) -for _ in range(30): - time.sleep(2) - after = set(os.listdir(DL)) - new = after - before - if new and not any(f.endswith(".crdownload") for f in after): - print("downloaded:", new) - break -``` - -## Output zip - -Unzipped contents (SV example): - -``` -output_file_YYYY-MM-DD-HH-MM-SS/ - zscore_SubcorticalVolume_male.csv # per-ROI z-scores - prediction_SubcorticalVolume_male.csv # model-predicted raw values - centile_SubcorticalVolume_male.xlsx # centile ranks - MAE_SubcorticalVolume_male.csv # model accuracy (not per-subject) - RMSE_SubcorticalVolume_male.csv - Corr_SubcorticalVolume_male.csv - EV_SubcorticalVolume_male.csv -``` - -The `zscore_*.csv` is the file you almost always want. Columns are -`SITE, SubjectID, Vendor, FreeSurfer_Version, age, `. - -## Multi-subject / batch uploads - -The `.xlsx` template accepts many rows, and CentileBrain processes them -all in one compute. Same flow, same iframe; the z-score CSV will have -one row per subject. No concurrency needed across modality/sex for a -typical cohort. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/coingecko/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/coingecko/scraping.md deleted file mode 100644 index c601393b5..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/coingecko/scraping.md +++ /dev/null @@ -1,325 +0,0 @@ -# CoinGecko — Data Extraction - -`https://api.coingecko.com/api/v3` — no API key needed for free tier. Pure JSON REST API, no browser required. 
- -## Do this first - -**Use the API directly with `http_get` — no browser, no parsing, fully structured JSON.** - -```python -import json -data = json.loads(http_get("https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd")) -print(data['bitcoin']['usd']) # 76286 -``` - -**Rate limit is tight: ~3 calls per minute on the free tier.** The API returns HTTP 429 with `Retry-After: 60` when you exceed it. Always add `time.sleep(5)` between calls in a loop. Confirmed: rapid-fire calls hit 429 on call 3-4 with no delay; with 5s gaps you stay safe. - -## Rate limits (confirmed live) - -- **Free tier**: ~3 calls/minute per IP (no API key) -- **429 response**: includes `Retry-After: 60` header — wait 60 seconds before retrying -- **Coin ID lookup** (`/coins/list`) counts against the limit — call it once and cache -- **`/ping`** still counts — don't use it as a keep-alive - -```python -import time, urllib.error, json - -def safe_get(url, retries=2): - for attempt in range(retries + 1): - try: - return json.loads(http_get(url)) - except urllib.error.HTTPError as e: - if e.code == 429 and attempt < retries: - print(f"Rate limited, sleeping 65s...") - time.sleep(65) - else: - raise -``` - -## Coin ID vs symbol — critical distinction - -**IDs are kebab-case strings, not ticker symbols.** The API ignores symbols entirely. - -| Intent | Wrong | Right | -|--------|-------|-------| -| Bitcoin price | `ids=BTC` | `ids=bitcoin` | -| Solana price | `ids=SOL` | `ids=solana` | -| Ethereum | `ids=ETH` | `ids=ethereum` | - -- Unknown or wrong IDs return an **empty `{}` dict** — no error, no warning -- Symbols are not unique: 17+ coins share the symbol `sol` (bridged versions, wrapped, etc.) -- Use `/coins/list` to resolve symbol → id, or just know the canonical id - -```python -# Resolve symbol to id -coins_list = json.loads(http_get("https://api.coingecko.com/api/v3/coins/list")) -# 17,564 entries as of April 2026 -# Each: {'id': 'bitcoin', 'symbol': 'btc', 'name': 'Bitcoin'} -sol_coins = [c for c in coins_list if c['symbol'].lower() == 'sol'] -# Returns 5+ entries — pick by name to get the real Solana: id='solana' -``` - -## Common workflows - -### Simple price (one or many coins) - -```python -import json -data = json.loads(http_get( - "https://api.coingecko.com/api/v3/simple/price" - "?ids=bitcoin,ethereum,solana" - "&vs_currencies=usd,eur" - "&include_market_cap=true" - "&include_24hr_change=true" -)) -for coin, info in data.items(): - print(f"{coin}: ${info['usd']:,.0f} | 24h: {info['usd_24h_change']:.1f}% | MCap: ${info['usd_market_cap']/1e9:.1f}B") -# bitcoin: $76,286 | 24h: 1.4% | MCap: $1528.0B -# ethereum: $2,361 | 24h: 0.8% | MCap: $284.9B -# solana: $87 | 24h: -1.0% | MCap: $50.2B -``` - -Response keys for each coin (when all flags enabled): -`usd`, `usd_market_cap`, `usd_24h_change`, `eur`, `eur_market_cap`, `eur_24h_change` - -### Top coins by market cap (paginated) - -```python -import json -data = json.loads(http_get( - "https://api.coingecko.com/api/v3/coins/markets" - "?vs_currency=usd" - "&order=market_cap_desc" - "&per_page=10" # max 250 - "&page=1" # 1-indexed; page=2 gives ranks 11-20 etc. 
- "&sparkline=false" - "&price_change_percentage=1h,7d,30d" # optional extra columns -)) -for c in data: - print(f"#{c['market_cap_rank']} {c['symbol'].upper()} ${c['current_price']:,.2f} | {c['price_change_percentage_24h']:.1f}%") -# #1 BTC $76,281.00 | 1.4% -# #2 ETH $2,360.45 | 0.8% -``` - -Full fields per entry: `id`, `symbol`, `name`, `image`, `current_price`, `market_cap`, `market_cap_rank`, `fully_diluted_valuation`, `total_volume`, `high_24h`, `low_24h`, `price_change_24h`, `price_change_percentage_24h`, `market_cap_change_24h`, `market_cap_change_percentage_24h`, `circulating_supply`, `total_supply`, `max_supply`, `ath`, `ath_change_percentage`, `ath_date`, `atl`, `atl_change_percentage`, `atl_date`, `roi`, `last_updated` - -Extra columns added by `price_change_percentage=1h,7d,30d`: `price_change_percentage_1h_in_currency`, `price_change_percentage_7d_in_currency`, `price_change_percentage_30d_in_currency` - -Pagination: use `page=2`, `page=3`, etc. with `per_page` up to 250. Results are 1-indexed — page 2 with per_page=5 returns ranks 6–10. - -### Coin detail (full metadata) - -```python -import json -data = json.loads(http_get( - "https://api.coingecko.com/api/v3/coins/bitcoin" - "?localization=false" # skip 40+ language translations - "&tickers=false" # skip exchange ticker list (can be huge) - "&market_data=true" - "&community_data=false" - "&developer_data=false" -)) -print(data['name']) # Bitcoin -print(data['symbol']) # btc -print(data['market_cap_rank']) # 1 -print(data['market_data']['current_price']['usd']) # 76279 -print(data['market_data']['ath']['usd']) # 126080 -print(data['market_data']['ath_date']['usd']) # 2025-10-06T18:57:42.558Z -print(data['market_data']['circulating_supply']) # 20017459.0 -print(data['description']['en'][:200]) -``` - -Top-level keys: `id`, `symbol`, `name`, `web_slug`, `asset_platform_id`, `platforms`, `categories`, `description`, `links`, `image`, `genesis_date`, `sentiment_votes_up_percentage`, `market_cap_rank`, `market_data`, `last_updated` - -`market_data` sub-keys include: `current_price`, `ath`, `ath_change_percentage`, `ath_date`, `atl`, `atl_change_percentage`, `atl_date`, `market_cap`, `fully_diluted_valuation`, `total_volume`, `high_24h`, `low_24h`, `price_change_percentage_24h`, `price_change_percentage_7d`, `price_change_percentage_14d`, `price_change_percentage_30d`, `price_change_percentage_60d`, `price_change_percentage_200d`, `price_change_percentage_1y`, `circulating_supply`, `total_supply`, `max_supply` - -All price/market fields are objects keyed by currency code: `data['market_data']['current_price']['usd']`, `['eur']`, `['btc']`, etc. 
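A minimal sketch for pulling detail payloads for several coins without tripping the free-tier limit, reusing the `safe_get` helper from the rate-limits section above. The wrapper name `get_coin_details` is illustrative only:

```python
import time

def get_coin_details(coin_ids):
    """Fetch /coins/{id} for each id, pacing calls for the free tier."""
    out = {}
    for i, coin_id in enumerate(coin_ids):
        if i:
            time.sleep(5)  # ~3 calls/min free tier; safe_get handles any stray 429
        out[coin_id] = safe_get(
            f"https://api.coingecko.com/api/v3/coins/{coin_id}"
            "?localization=false&tickers=false&market_data=true"
            "&community_data=false&developer_data=false"
        )
    return out

# details = get_coin_details(["bitcoin", "ethereum"])
# details["bitcoin"]["market_data"]["ath"]["usd"]
```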
- -### Historical OHLCV - -```python -import json -# OHLCV candles: granularity auto-determined by `days` -# 1d = 30-min candles, 7d = 4-hr candles, 14d+ = daily candles -data = json.loads(http_get( - "https://api.coingecko.com/api/v3/coins/ethereum/ohlc?vs_currency=usd&days=7" -)) -print(len(data)) # 42 candles for 7-day window -print(data[-1]) # [1776499200000, 2407.32, 2412.96, 2402.21, 2405.03] -# [timestamp_ms, open, high, low, close] - -# Convert timestamp: -import datetime -ts_ms = data[-1][0] -dt = datetime.datetime.fromtimestamp(ts_ms / 1000, tz=datetime.timezone.utc) -``` - -`days` options: `1`, `7`, `14`, `30`, `90`, `180`, `365`, `max` - -### Market chart (price + volume + market cap time series) - -```python -import json -# interval='daily' gives one point per day; omit for auto (hourly for <=90 days) -chart = json.loads(http_get( - "https://api.coingecko.com/api/v3/coins/bitcoin/market_chart" - "?vs_currency=usd&days=7&interval=daily" -)) -# Keys: 'prices', 'market_caps', 'total_volumes' -# Each is a list of [timestamp_ms, value] -print(len(chart['prices'])) # 8 points for 7-day daily -print(chart['prices'][-1]) # [1776508393000, 76286.699...] -print(chart['total_volumes'][-1]) # [1776508393000, 80459560788.47...] -``` - -### Market chart by date range - -```python -import json, time -now = int(time.time()) -thirty_days_ago = now - 30 * 86400 -chart = json.loads(http_get( - f"https://api.coingecko.com/api/v3/coins/bitcoin/market_chart/range" - f"?vs_currency=usd&from={thirty_days_ago}&to={now}" -)) -# Granularity: <1 day → minutely, 1-90 days → hourly, >90 days → daily -print(len(chart['prices'])) # 174 points for 7-day range (hourly) -``` - -### Search - -```python -import json -results = json.loads(http_get("https://api.coingecko.com/api/v3/search?query=solana")) -# Top-level keys: 'coins', 'exchanges', 'icos', 'categories', 'nfts' -for c in results['coins'][:3]: - print(f"{c['id']} | {c['symbol']} | rank {c['market_cap_rank']}") -# solana | SOL | rank 7 -# solana-name-service | SNS | rank 1902 -``` - -Search returns coins ordered by relevance, not market cap. First result is usually the canonical coin. - -### Trending (top 7 searched in last 24h) - -```python -import json -trending = json.loads(http_get("https://api.coingecko.com/api/v3/search/trending")) -# Top-level keys: 'coins', 'nfts', 'categories' -for item in trending['coins']: - c = item['item'] - print(f"{c['name']} ({c['symbol']}) #{c['market_cap_rank']}") -# Item keys: id, coin_id, name, symbol, market_cap_rank, thumb, small, large, -# slug, price_btc, score, data -``` - -`data` sub-object includes sparkline image URL, price/volume/market cap info if available. 
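A minimal sketch chaining trending into `/simple/price` for live quotes on the trending list. It assumes (as holds in practice) that `item['item']['id']` is the same id `/simple/price` expects; not field-tested here:

```python
import json, time

trending = json.loads(http_get("https://api.coingecko.com/api/v3/search/trending"))
ids = [item['item']['id'] for item in trending['coins']]

time.sleep(5)  # pace the second call to stay under the free-tier limit

prices = json.loads(http_get(
    "https://api.coingecko.com/api/v3/simple/price"
    f"?ids={','.join(ids)}&vs_currencies=usd&include_24hr_change=true"
))
for item in trending['coins']:
    c = item['item']
    q = prices.get(c['id'])
    if q:
        chg = q.get('usd_24h_change')
        chg_s = f"{chg:+.1f}%" if chg is not None else "n/a"
        print(f"{c['name']} ({c['symbol']}): ${q['usd']:,} | 24h {chg_s}")
```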
- -### Global market overview - -```python -import json -global_data = json.loads(http_get("https://api.coingecko.com/api/v3/global")) -gd = global_data['data'] -print(f"Total market cap: ${gd['total_market_cap']['usd']/1e12:.2f}T") # $2.66T -print(f"24h volume: ${gd['total_volume']['usd']/1e9:.1f}B") # $156.6B -print(f"BTC dominance: {gd['market_cap_percentage']['btc']:.1f}%") # 57.3% -print(f"Active coins: {gd['active_cryptocurrencies']}") # 17,564 -print(f"Active exchanges: {gd['markets']}") # 1,475 -``` - -### Coin categories (market cap by sector) - -```python -import json -cats = json.loads(http_get( - "https://api.coingecko.com/api/v3/coins/categories?order=market_cap_desc" -)) -# 691 categories as of April 2026 -for cat in cats[:5]: - print(f"{cat['name']}: ${cat['market_cap']/1e9:.1f}B | 24h: {cat['market_cap_change_24h']:.1f}%") -# Smart Contract Platform: $2204.8B | 24h: 0.9% -# Layer 1 (L1): $2171.5B | 24h: 1.1% - -# Category keys: id, name, market_cap, market_cap_change_24h, content, -# top_3_coins_id, top_3_coins, volume_24h, updated_at -``` - -### Token price by contract address (ERC-20 and other chains) - -```python -import json -# Platform IDs: ethereum, binance-smart-chain, polygon-pos, avalanche, solana, etc. -token = json.loads(http_get( - "https://api.coingecko.com/api/v3/simple/token_price/ethereum" - "?contract_addresses=0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48" # USDC - "&vs_currencies=usd" -)) -print(token) -# {'0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48': {'usd': 0.999861}} -# Key is the lowercased contract address -``` - -## vs_currencies options - -63 currencies supported (confirmed live). Common ones: - -**Fiat**: `usd`, `eur`, `gbp`, `jpy`, `aud`, `cad`, `chf`, `cny`, `inr`, `krw`, `brl`, `mxn`, `sgd`, `hkd`, `nok`, `sek`, `dkk`, `nzd`, `zar`, `thb`, `try`, `aed`, `sar`, `myr`, `php`, `idr`, `pln`, `czk`, `huf`, `ron` - -**Crypto**: `btc`, `eth`, `ltc`, `bch`, `bnb`, `eos`, `xrp`, `xlm`, `link`, `dot`, `yfi`, `sol` - -**Commodities**: `xag` (silver), `xau` (gold) - -Get the full list: -```python -currencies = json.loads(http_get("https://api.coingecko.com/api/v3/simple/supported_vs_currencies")) -# Returns list of 63 strings -``` - -## Endpoints that require Pro API (return HTTP 401) - -- `/coins/{id}/history?date=DD-MM-YYYY` — historical price on a specific date -- `/coins/markets` with `category=` filter (the parameter is silently ignored, not 401) -- `/coins/{id}/contract/{address}` — full contract token details - -Free tier alternatives: -- For historical price on date: use `/market_chart/range` with a narrow time window -- For category filtering: fetch `/coins/markets` unfiltered and filter client-side using `id` from `/coins/categories` - -## Ping / health check - -```python -import json -ping = json.loads(http_get("https://api.coingecko.com/api/v3/ping")) -print(ping) # {'gecko_says': '(V3) To the Moon!'} -``` - -Note: ping still counts against the rate limit. Don't use it to check if a 429 has resolved — just wait 65 seconds and retry your actual call. - -## Gotchas - -- **Rate limit is much stricter than advertised** — The official docs say "30 calls/min" but in practice you get 429 on call 3-4 with no delay between calls. Observed `Retry-After: 60` in the response header. Treat it as "3 calls/minute, wait 65s on 429." Using `time.sleep(5)` between calls in a loop is safe. - -- **Unknown coin IDs return `{}`, not an error** — `?ids=BTC` (uppercase) and `?ids=not_a_real_coin` both return an empty dict `{}`. 
Always check that the key you expect exists before accessing it. - -- **Symbol lookup requires `/coins/list` + client-side filter** — There's no "get by symbol" endpoint. Multiple coins share any given symbol. After fetching the list (17,564 entries), filter by `symbol` and pick by `name`. - -- **Coin ID casing matters** — IDs are always lowercase kebab-case: `bitcoin`, `ethereum`, `shiba-inu`. Uppercase or camelCase will silently return `{}`. - -- **OHLCV granularity is automatic** — The `days` parameter determines candle size automatically: `1` → 30-min candles, `7`/`14` → 4-hr candles, `30`+ → daily candles. You cannot override this on the free tier. - -- **`interval=daily` in market_chart affects point count** — Without `interval=daily`, a 7-day window returns hourly data (~168 points). With it, you get ~8 points. Choose based on whether you need resolution or summary. - -- **market_chart timestamps are in milliseconds** — Divide by 1000 for standard Unix time: `datetime.fromtimestamp(ts / 1000)`. - -- **`/coins/list` is expensive (rate-limit-wise)** — It returns 17,564 entries and costs one API call. Fetch once, store in a variable, filter locally. Don't call it in a loop. - -- **Pagination is 1-indexed** — `page=1` returns items 1–N, `page=2` returns N+1–2N. `page=0` returns the same as `page=1` (it doesn't error). - -- **`per_page` max is 250** — Requesting more than 250 per page silently returns 250. To get the full top-500, make two calls: `page=1&per_page=250` then `page=2&per_page=250`. - -- **Contract address keys are lowercased** — When using `/simple/token_price`, the response key is the lowercased contract address regardless of what case you sent. Always call `.lower()` before using addresses as dict keys. - -- **`tickers=false` is important for `/coins/{id}`** — Without it, the response includes a massive list of exchange tickers that can make the payload very large and slow to parse. Always set `tickers=false` unless you specifically need exchange data. - -- **ETH priced against BTC is supported** — `vs_currencies=btc` works: `ethereum` returns `{'btc': 0.03095861}`. Crypto-to-crypto pairs work the same as fiat pairs. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/coinmarketcap/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/coinmarketcap/scraping.md deleted file mode 100644 index e6a381523..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/coinmarketcap/scraping.md +++ /dev/null @@ -1,463 +0,0 @@ -# CoinMarketCap — Data Extraction - -`https://coinmarketcap.com` — crypto market data. Three access paths tested: internal JSON API (fastest, no auth required), `__NEXT_DATA__` from HTML pages, and browser DOM. All real-money price data confirmed accurate against displayed UI values. 
- -## Do this first: pick your access path - -| Goal | Best approach | Latency | -|------|--------------|---------| -| Top N coins by market cap | Internal listing API | ~200ms | -| Single coin price/stats/ATH | Internal detail API | ~100ms | -| Global market metrics | Internal global-metrics API | ~65ms | -| All coins on homepage (101 items) | `__NEXT_DATA__` main page | ~700ms | -| Coin detail + full stats | `__NEXT_DATA__` currency page | ~700ms | -| Historical OHLCV | Internal historical API | ~160ms | -| Exchange pairs for a coin | Internal market-pairs API | ~200ms | -| News/articles | Internal content API | ~220ms | - -**Never use the browser for read-only CMC tasks.** The internal API at `api.coinmarketcap.com` is accessible with no API key, no special headers, no auth — plain `http_get` works. - -**Do NOT use `pro-api.coinmarketcap.com`** — that is the paid API requiring a key. - ---- - -## Path 1: Internal listing API (fastest for ranked coins) - -Returns CMC-ranked coins with full price data in one call. No auth needed. - -```python -import json - -resp = json.loads(http_get( - "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing" - "?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD" -)) - -coins = resp['data']['cryptoCurrencyList'] # list of coin objects -total_available = resp['data']['totalCount'] # 8374 as of 2026-04-18 - -for c in coins: - usd = next(q for q in c['quotes'] if q['name'] == 'USD') - print( - f"#{c['cmcRank']} {c['symbol']}: " - f"${usd['price']:,.2f} | " - f"MCap ${usd['marketCap']/1e9:.1f}B | " - f"Vol24h ${usd['volume24h']/1e9:.1f}B | " - f"24h {usd['percentChange24h']:+.2f}% | " - f"CS {c['circulatingSupply']:,.0f}" - ) -``` - -### Coin object fields - -Top-level (`c` in the loop above): -``` -id, name, symbol, slug, cmcRank, marketPairCount, -circulatingSupply, selfReportedCirculatingSupply, -totalSupply, maxSupply, isActive, lastUpdated, dateAdded, -quotes, isAudited, auditInfoList, badges -``` - -Per-quote fields (inside `c['quotes']`, filtered by `name == 'USD'`): -``` -name, price, volume24h, volumePercentChange, marketCap, -percentChange1h, percentChange24h, percentChange7d, -percentChange30d, percentChange60d, percentChange90d, -percentChange1y, ytdPriceChangePercentage, -fullyDilluttedMarketCap, marketCapByTotalSupply, -dominance, turnover, lastUpdated -``` - -### Query parameters - -```python -# Pagination -"?start=1&limit=100" # page 1 of 100 -"?start=101&limit=100" # page 2 - -# Sort -"sortBy=market_cap" # default -"sortBy=volume_24h" -"sortBy=percent_change_24h" -"sortBy=price" -"sortBy=circulating_supply" -"sortType=desc" # or asc - -# Currency conversion (affects quote prices returned) -"convert=USD" # USD prices -"convert=BTC" # BTC-denominated - -# Filter by type -"cryptoType=all" # default — coins + tokens -"cryptoType=coins" # layer-1s only (633 results) -"cryptoType=tokens" # ERC-20 etc. - -# Filter by tag (DeFi, NFT, etc.) -"tagSlugs=defi" # 2698 results -"tagSlugs=nft" -``` - ---- - -## Path 2: Internal detail API (single coin, full stats) - -Best for fetching one coin's complete data including ATH, ATL, 52-week high/low, volume ranks. 
- -```python -import json - -# Look up by CMC coin ID (BTC=1, ETH=1027, XRP=52, SOL=5426, BNB=1839) -resp = json.loads(http_get( - "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/detail?id=1" -)) -data = resp['data'] -s = data['statistics'] - -print(f"Price: ${s['price']:,.2f}") -print(f"Rank: #{s['rank']}") -print(f"Market Cap: ${s['marketCap']:,.0f}") -print(f"Volume 24h: ${s['volume24h']:,.0f}") -print(f"Circulating Supply:{s['circulatingSupply']:,.0f}") -print(f"Total Supply: {s['totalSupply']:,.0f}") -print(f"Max Supply: {s['maxSupply']:,.0f}") -print(f"24h Change: {s['priceChangePercentage24h']:+.2f}%") -print(f"7d Change: {s['priceChangePercentage7d']:+.2f}%") -print(f"ATH: ${s['highAllTime']:,.2f} on {s['highAllTimeTimestamp']}") -print(f"ATL: ${s['lowAllTime']:,.4f} on {s['lowAllTimeTimestamp']}") -print(f"52w High: ${s['high52w']:,.2f}") -print(f"52w Low: ${s['low52w']:,.2f}") -print(f"MCap Dominance: {s['marketCapDominance']:.2f}%") -``` - -### All statistics fields - -``` -price, priceChangePercentage1h, priceChangePercentage24h, -priceChangePercentage7d, priceChangePercentage30d, -priceChangePercentage60d, priceChangePercentage90d, -priceChangePercentage1y, priceChangePercentageAll, -marketCap, marketCapChangePercentage24h, -fullyDilutedMarketCap, mintedMarketCap, -circulatingSupply, totalSupply, maxSupply, -marketCapDominance, rank, roi, -low24h, high24h, low7d, high7d, low30d, high30d, -low52w, high52w, low90d, high90d, -lowAllTime, highAllTime, -lowAllTimeChangePercentage, highAllTimeChangePercentage, -lowAllTimeTimestamp, highAllTimeTimestamp, -lowYesterday, highYesterday, openYesterday, closeYesterday, -priceChangePercentageYesterday, volumeYesterday, -ytdPriceChangePercentage, volumeRank, volumeMcRank, -volume24h, volume24hReported, volume7d, volume7d Reported, -volume30d, volume30dReported, turnover -``` - -### Top-level data fields (beyond statistics) - -``` -id, name, symbol, slug, category, description, dateAdded, -volume, volumeChangePercentage24h, cexVolume, dexVolume, -urls (website, explorer, twitter, reddit, etc.), -tags, platforms, relatedCoins, wallets, -holders, watchCount, watchListRanking -``` - ---- - -## Path 3: Global market metrics - -```python -import json - -resp = json.loads(http_get( - "https://api.coinmarketcap.com/data-api/v3/global-metrics/quotes/latest" -)) -data = resp['data'] -q = data['quotes'][0] # USD quote (cryptoId=2781) - -print(f"Total Market Cap: ${q['totalMarketCap']/1e12:.2f}T") -print(f"Total Volume 24h: ${q['totalVolume24H']/1e9:.1f}B") -print(f"Altcoin MCap: ${q['altcoinMarketCap']/1e12:.2f}T") -print(f"DeFi MCap: ${q['defiMarketCap']/1e9:.1f}B") -print(f"DeFi Vol 24h: ${q['defiVolume24H']/1e9:.1f}B") -print(f"Stablecoin MCap: ${q['stablecoinMarketCap']/1e9:.1f}B") -print(f"Derivatives Vol: ${q['derivativesVolume24H']/1e9:.1f}B") -print(f"BTC Dominance: {data['btcDominance']:.2f}%") -print(f"ETH Dominance: {data['ethDominance']:.2f}%") -print(f"Active Cryptos: {data['activeCryptoCurrencies']}") -print(f"Total Cryptos: {data['totalCryptoCurrencies']}") -print(f"Active Exchanges: {data['activeExchanges']}") -print(f"Active Pairs: {data['activeMarketPairs']}") - -# Yesterday comparison -print(f"\nMCap Yesterday: ${q['totalMarketCapYesterday']/1e12:.2f}T") -print(f"MCap Change: {q['totalMarketCapYesterdayPercentageChange']:+.2f}%") -``` - ---- - -## Path 4: Historical OHLCV (candlestick data) - -```python -import json, time - -now = int(time.time()) - -# Daily candles for BTC over last 7 days -resp = json.loads(http_get( - 
"https://api.coinmarketcap.com/data-api/v3/cryptocurrency/historical" - f"?id=1&convertId=2781&timeStart={now - 7*86400}&timeEnd={now}&interval=daily" -)) -candles = resp['data']['quotes'] # list of OHLCV dicts - -for candle in candles: - q = candle['quote'] - print( - f"{candle['timeOpen'][:10]} " - f"O={q['open']:,.0f} H={q['high']:,.0f} " - f"L={q['low']:,.0f} C={q['close']:,.0f} " - f"V=${q['volume']/1e9:.1f}B MCap=${q['marketCap']/1e12:.2f}T" - ) -``` - -Candle quote fields: `open, high, low, close, volume, marketCap, circulatingSupply, timestamp` - -Supported intervals: `daily`, `1h` (hourly). `5m` returns HTTP 500 — not supported. - -`convertId=2781` = USD. `timeStart`/`timeEnd` are Unix timestamps. - ---- - -## Path 5: Exchange market pairs for a coin - -```python -import json - -resp = json.loads(http_get( - "https://api.coinmarketcap.com/data-api/v3/cryptocurrency/market-pairs/latest" - "?id=1&start=1&limit=10&sort=volume" -)) -data = resp['data'] -print(f"Total pairs for {data['name']}: {data['numMarketPairs']}") - -for pair in data['marketPairs']: - print( - f" {pair['exchangeName']:20} {pair['marketPair']:12} " - f"${pair['price']:,.2f} Vol=${pair['volumeUsd']/1e6:.1f}M" - ) -``` - -Pair fields: `rank, exchangeId, exchangeName, exchangeSlug, marketId, marketPair, category (spot/futures), baseSymbol, quoteSymbol, baseCurrencyId, quoteCurrencyId, price, volumeUsd, effectiveLiquidity, lastUpdated, volumeBase, volumeQuote, depthUsdNegativeTwo, depthUsdPositiveTwo, feeType, isVerified, type (cex/dex)` - ---- - -## Path 6: Exchange listings - -```python -import json - -resp = json.loads(http_get( - "https://api.coinmarketcap.com/data-api/v3/exchange/listing" - "?start=1&limit=20&sortBy=score&sortType=desc" -)) -exchanges = resp['data']['exchanges'] -for ex in exchanges: - print(f" {ex['name']:30} score={ex.get('score')} trafficScore={ex.get('trafficScore')}") -``` - -Exchange fields: `id, name, slug, dexStatus, platformId, status, score, trafficScore, countries, fiats, filteredTotalVol24h` - ---- - -## Path 7: Price conversion (cross-rate) - -```python -import json - -# Convert 1 BTC → USD -resp = json.loads(http_get( - "https://api.coinmarketcap.com/data-api/v3/tools/price-conversion" - "?amount=1&id=1&convert_id=2781" -)) -result = resp['data'] -usd_price = result['quote'][0]['price'] -print(f"1 {result['symbol']} = ${usd_price:,.2f} USD") - -# Convert ETH → BTC -resp2 = json.loads(http_get( - "https://api.coinmarketcap.com/data-api/v3/tools/price-conversion" - "?amount=1&id=1027&convert_id=1" -)) -btc_price = resp2['data']['quote'][0]['price'] -print(f"1 ETH = {btc_price:.6f} BTC") -``` - -`id` = source coin CMC ID, `convert_id` = target currency CMC ID (2781=USD, 1=BTC, 1027=ETH, 825=USDT) - ---- - -## Path 8: News / articles - -```python -import json - -# News for a specific coin -resp = json.loads(http_get( - "https://api.coinmarketcap.com/content/v3/news?coins=1&limit=10" -)) -for article in resp['data']: - meta = article['meta'] - print(f" [{meta['sourceName']}] {meta['title']}") - print(f" {article['createdAt'][:10]} — {meta['sourceUrl']}") -``` - -Article fields: `slug, cover, assets, createdAt` + nested `meta` with `title, subtitle, sourceName, sourceUrl, language, type, status, id, createdAt, updatedAt, releasedAt` - -Omit `coins=` param for general crypto news. Supports `limit` up to observed 50+ without errors. - ---- - -## Path 9: __NEXT_DATA__ from HTML pages - -Use when you need data that isn't in the API (e.g. Fear & Greed index, CMC100 index, trending categories). 
### Main page (`coinmarketcap.com/`)

```python
import json, re

html = http_get("https://coinmarketcap.com/")
m = re.search(r'<script id="__NEXT_DATA__"[^>]+>(.*?)</script>', html)
nd = json.loads(m.group(1))
props = nd['props']

# Global market metrics (same data as global-metrics API, faster from HTML)
gm = props['pageProps']['globalMetrics']
print(f"Total cryptos: {gm['numCryptocurrencies']}")
print(f"BTC dominance: {gm['btcDominance']:.2f}%")
print(f"Total MCap: ${gm['marketCap']/1e12:.2f}T")
print(f"Total Vol 24h: ${gm['totalVol']/1e9:.1f}B")

# Spot prices for BTC/ETH/USD/SATS/BITS (the "ticker bar" data)
# props['quotesLatestData'] — 5 items with short field names
for q in props['quotesLatestData']:
    print(f"  {q['symbol']}: p={q['p']} p24h={q['p24h']:+.3f}%")
    # fields: id, symbol, p (price), p1h, p24h, p7d, p30d, p60d, p90d, pytd, t

# Top 101 coins with full USD quotes — from dehydratedState
queries = props['dehydratedState']['queries']
homepage_q = next(q for q in queries if q['queryKey'] == ['homepage-data', 1, 100])
listing = homepage_q['state']['data']['data']['listing']
coins = listing['cryptoCurrencyList']   # 101 coins
total = listing['totalCount']

for c in coins:
    if c['symbol'] == 'BTC':
        usd = next(q for q in c['quotes'] if q['name'] == 'USD')
        print(f"BTC: #{c['cmcRank']} ${usd['price']:,.2f}")
        break

# Page-level shared data (Fear & Greed index, CMC20, altcoin index)
psd = props['pageProps']['pageSharedData']
print("pageSharedData keys:", list(psd.keys()))
# keys: topCategories, fearGreedIndexData, cmc100, cmc20, faqData, altcoinIndex, halvingInfo, deviceInfo
```

**Gotcha — regex pattern**: Use `[^>]+` to match the `crossorigin="anonymous"` attribute on the script tag. `type="application/json"` alone will miss it:
```python
# CORRECT
m = re.search(r'<script id="__NEXT_DATA__"[^>]+>(.*?)</script>', html)

# WRONG — returns None because of crossorigin attr
m = re.search(r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>', html, re.DOTALL)
```

**`quotesLatestData` has only 5 entries** (SATS, BITS, BTC, ETH, USD) — it's the currency selector bar, not the full market ranking. For the full ranked listing use `dehydratedState`.

**`cmcRank` is at coin top level**, not inside the USD quote object. The `cmcRank` field inside the quote dict is `None`.

### Individual coin page (`/currencies/{slug}/`)

```python
import json, re

html = http_get("https://coinmarketcap.com/currencies/bitcoin/")
m = re.search(r'<script id="__NEXT_DATA__"[^>]+>(.*?)</script>', html)
nd = json.loads(m.group(1))

# All stats under props.pageProps.detailRes.detail.statistics
stats = nd['props']['pageProps']['detailRes']['detail']['statistics']

print(f"Price: ${stats['price']:,.2f}")
print(f"Rank: #{stats['rank']}")
print(f"MCap: ${stats['marketCap']:,.0f}")
print(f"Vol 24h: ${stats['volume24h']:,.0f}")
print(f"Circ Sup: {stats['circulatingSupply']:,.0f}")
print(f"24h: {stats['priceChangePercentage24h']:+.2f}%")
print(f"ATH: ${stats['highAllTime']:,.2f} ({stats['highAllTimeTimestamp']})")
print(f"ATL: ${stats['lowAllTime']:.4f}")
```

`detailRes.detail` also contains: `name, symbol, slug, description, tags, urls (website/explorer/twitter/reddit), platforms, relatedCoins, holders, watchCount`

**Note**: The currency page has no JSON-LD blocks — zero `<script type="application/ld+json">` blocks.

jsonld_blocks = re.findall(r'<script type="application/ld\+json"[^>]*>(.*?)</script>', html, re.DOTALL)
# Block 0: FAQPage schema (common Q&A about how courses work)
# Block 1: BreadcrumbList (category path, e.g.
Browse > Data Science > Machine Learning) -faq = json.loads(jsonld_blocks[0]) # {"@type": "FAQPage", "mainEntity": [...]} -crumb = json.loads(jsonld_blocks[1]) # {"@type": "BreadcrumbList", "itemListElement": [...]} - -# Extract breadcrumb categories -categories = [item["item"]["name"] for item in crumb["@graph"][0]["itemListElement"]] -# e.g. ["Browse", "Data Science", "Machine Learning"] -``` - -The HTML does NOT embed: description, rating, instructor names, enrollment count, -price, or any course-specific metadata as machine-readable fields. -Use the API (`courses.v1?ids=...`) to get those from the slug. - -### Slug-to-ID lookup pattern - -```python -# Get course data from slug (need ID first — get it from catalog or search) -# Pattern: enumerate catalog, match by slug -resp = http_get("https://api.coursera.org/api/courses.v1?fields=name,slug,description&limit=100&start=0") -data = json.loads(resp) -by_slug = {el["slug"]: el for el in data["elements"]} -course = by_slug.get("machine-learning") -``` - ---- - -## Endpoints Summary - -| Endpoint | Method | Result | -|---|---|---| -| `courses.v1` (list) | GET | 200 OK — full catalog, 20,659 courses | -| `courses.v1?ids=...` | GET | 200 OK — batch lookup by ID | -| `courses.v1?q=search&query=...` | GET | **405 Method Not Allowed** | -| `partners.v1` (list) | GET | 200 OK — 422 partners | -| `partners.v1?ids=...` | GET | 200 OK — with courseIds | -| `partners.v1?q=search&query=...` | GET | **405 Method Not Allowed** | -| `onDemandSpecializations.v1` (list) | GET | 200 OK — paginated (no total) | -| `onDemandSpecializations.v1?q=search&query=...` | GET | **405 Method Not Allowed** | -| `instructors.v1?ids=...` | GET | 200 OK — rich records by ID | -| `instructors.v1` (list) | GET | 200 OK — mostly empty records | -| `degrees.v1` | GET | 403 Forbidden | -| `/search?query=...` page HTML | GET | 200 OK — React shell only, no data | -| `/learn/{slug}` page HTML | GET | 200 OK — HTML with JSON-LD breadcrumb only | - ---- - -## Rate Limits - -No rate limiting observed in testing: -- 5 consecutive requests with no delay: all succeeded, avg 0.55s each. -- No `X-RateLimit-*` or `Retry-After` headers in responses. -- No auth headers needed for any working endpoint. - -Response headers that are present: `X-Coursera-Request-Id`, `X-Coursera-Trace-Id-Hex`, -`x-envoy-upstream-service-time`. No rate-limit indicators. - -Use a small delay (0.5s) between requests if doing bulk enumeration of the full 20K+ -catalog as a courtesy, but no hard cap was observed. - ---- - -## Gotchas - -- **`q=search` is POST-only**: All three resource types (courses, specializations, - partners) return 405 on GET when `q=search` is added. There is no documented public - POST endpoint. For keyword filtering, enumerate the catalog and filter client-side. - -- **`paging.total` absent after page 1**: Only the first page response includes - `paging.total`. Subsequent pages have only `paging.next`. Check for the `"next"` key - being absent to detect end-of-list. - -- **Specializations never include `paging.total`**: The `onDemandSpecializations.v1` - endpoint never returns `paging.total` in any page. Iterate until `"next"` is absent. - -- **`workload` is free-text, unnormalized**: Values include `"4-8 hours/week"`, - `"1 hour 30 minutes"`, `"4 weeks of study, 1-2 hours/week"`. Do not parse as a number - without normalization logic. - -- **`instructors.v1` list returns empty records**: The plain list endpoint returns many - instructors with empty `fullName`, `bio`, `title`. 
Always look up by `ids=` using - IDs from course records. - -- **`degrees.v1` is 403**: Degree programs are not accessible via the public API. - -- **HTML pages contain no embedded course data**: Both the search page and the course - detail page are React-rendered. `http_get` on `/search?query=...` returns an HTML - shell with no course listings. `http_get` on `/learn/{slug}` returns HTML with only - a FAQ JSON-LD and a breadcrumb JSON-LD — no course description, rating, price, or - enrollment data as machine-readable fields. - -- **`linked` resources don't populate**: Passing `includes=partners.v1` to the courses - endpoint returns an empty `linked: {}` object. Cross-resource joins require separate - requests by IDs. - -- **`previewLink` and `avgRating` fields**: These field names are accepted without error - but return no data in the response objects. Do not request them. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/craigslist/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/craigslist/scraping.md deleted file mode 100644 index 4e93c5289..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/craigslist/scraping.md +++ /dev/null @@ -1,390 +0,0 @@ -# Craigslist — Scraping via http_get - -Field-tested against sfbay.craigslist.org and multiple city subdomains on 2026-04-18. -`http_get` works without any bot detection — no CAPTCHA, no block, no rate limit observed. -Craigslist serves a full server-rendered HTML fallback (the `
    ` block) -intended for no-JS browsers. This fallback contains **all matching results in one response** (300–360 -items typical), regardless of the `s=` offset parameter. No browser needed. - -## Key discovery: static HTML returns everything at once - -When you `http_get` a Craigslist search URL, the server includes a `
      ` -block that contains every matching listing (up to ~360) in a single HTML response. The `s=` pagination -parameter is ignored by the static renderer — it is only meaningful for the JS-driven XHR path used by -real browsers. For scraping purposes, this means: - -- One `http_get` call per search query returns the full result set (no pagination loop needed). -- For broader searches, narrow via `query=`, `min_price=`, `max_price=`, and category code in the URL. -- If you need more than ~360 results, you must use a headless browser with JS. For most tasks, - one request is sufficient. - -## URL patterns - -### City subdomains -``` -https://{city}.craigslist.org/search/{category_code}?query=... -``` - -Confirmed working cities (exact subdomain names): - -| City | Subdomain | -|----------------|------------------| -| SF Bay Area | `sfbay` | -| New York | `newyork` | -| Chicago | `chicago` | -| Los Angeles | `losangeles` | -| Seattle | `seattle` | -| Boston | `boston` | -| Miami | `miami` | -| Denver | `denver` | -| Austin | `austin` | -| Portland | `portland` | -| San Diego | `sandiego` | -| Phoenix | `phoenix` | - -### Category codes (confirmed working) - -| Code | Category | -|-------|---------------------------| -| `sss` | For Sale — all | -| `for` | For Sale — general | -| `ela` | Electronics (listings) | -| `ele` | Electronics (search) | -| `fua` | Furniture | -| `clo` | Clothing & accessories | -| `spo` | Sporting goods | -| `toy` | Toys & games | -| `cto` | Cars+trucks — by owner | -| `cta` | Cars+trucks — by dealer | -| `hhh` | Housing — all | -| `apa` | Apartments | -| `roo` | Rooms & shares | -| `sub` | Sublets & temporary | -| `jjj` | Jobs — all | -| `sof` | Software/QA/DBA jobs | -| `bbb` | Services — all | -| `ggg` | Gigs — all | -| `com` | Community | -| `eve` | Events | -| `vol` | Volunteers | - -### Query parameters - -| Parameter | Effect | -|---------------|------------------------------------------------| -| `query=` | Keyword search | -| `sort=rel` | Sort by relevance (default) | -| `sort=date` | Sort by newest first | -| `sort=priceasc` | Price low to high | -| `sort=pricedsc` | Price high to low | -| `min_price=` | Minimum price filter | -| `max_price=` | Maximum price filter | -| `condition=10` | New (for-sale listings) | -| `condition=20` | Like new | -| `condition=30` | Excellent | -| `condition=40` | Good | -| `condition=50` | Fair | -| `condition=60` | Salvage | -| `bedrooms=` | Number of bedrooms (housing only) | -| `auto_make_model=` | Car make/model filter (cars category) | -| `s=` | Pagination offset — **ignored in static HTML** | - -### Example URLs -```python -# For-sale keyword search -"https://sfbay.craigslist.org/search/sss?query=macbook&sort=rel" - -# Price-filtered electronics -"https://sfbay.craigslist.org/search/ela?query=iphone&min_price=100&max_price=500" - -# Apartments, 2 bedrooms, price range -"https://sfbay.craigslist.org/search/apa?bedrooms=2&min_price=1000&max_price=2500" - -# Cars by owner, Toyota -"https://sfbay.craigslist.org/search/cto?auto_make_model=toyota" - -# Jobs in another city -"https://chicago.craigslist.org/search/jjj?query=python+developer" -``` - -## Listing card HTML structure - -Each listing is an `
<li class="cl-static-search-result">` inside the results `<ol>`.

```html
<li class="cl-static-search-result" title="MacBook Air M2 256GB 8GB RAM">
  <a href="https://sfbay.craigslist.org/sby/sys/d/macbook-air-m2-256gb/POST_ID.html">
    <div class="title">MacBook Air M2 256GB 8GB RAM</div>
    <div class="details">
      <div class="price">$900</div>
      <div class="location">San Jose</div>
    </div>
  </a>
</li>
```

Fields available in the listing card:
- **Title**: `title` attribute on the `<li>` OR text inside `<div class="title">`
- **URL**: `href` on the inner `<a>`
- **Price**: text of `<div class="price">` (may be empty)
- **Location**: text of `<div class="location">`
- **Post ID**: the numeric segment before `.html` in the listing URL

## Search extraction function

```python
import re
from urllib.parse import quote_plus
from helpers import http_get

def search_craigslist(city, category, query, min_price=None, max_price=None, sort="rel"):
    url = f"https://{city}.craigslist.org/search/{category}?query={quote_plus(query)}&sort={sort}"
    if min_price is not None:
        url += f"&min_price={min_price}"
    if max_price is not None:
        url += f"&max_price={max_price}"
    html = http_get(url, headers={"User-Agent": "Mozilla/5.0"})

    listings = re.findall(
        r'<li class="cl-static-search-result" title="([^"]*)"[^>]*>\s*'
        r'<a href="([^"]+)"[^>]*>.*?'
        r'<div class="price">([^<]*)</div>.*?'
        r'<div class="location">\s*([^<]*?)\s*</div>',
        html, re.DOTALL
    )

    results = []
    for title, url, price, location in listings:
        pid_match = re.search(r'/(\d+)\.html$', url)
        results.append({
            "post_id": pid_match.group(1) if pid_match else None,
            "title": title,
            "url": url,
            "price": price.strip() or None,  # None if listing has no price
            "location": location.strip(),
        })
    return results

# Usage
results = search_craigslist("sfbay", "sss", "macbook pro", max_price=1000)
for r in results[:5]:
    print(r["post_id"], r["price"], r["location"], r["title"][:50])
```

### Handling missing price

Listings without a price have no `<div class="price">` element. The regex above returns an empty string
for `price`; the example converts that to `None`. A more robust extraction:

```python
def parse_listings(html):
    results = []
    for block in re.findall(r'<li class="cl-static-search-result".*?</li>', html, re.DOTALL):
        title = re.search(r'title="([^"]+)"', block)
        url = re.search(r'href="([^"]+)"', block)
        price = re.search(r'<div class="price">([^<]+)</div>', block)
        loc = re.search(r'<div class="location">\s*([^<]*?)\s*</div>
        ', block) - if not url: continue - url_str = url.group(1) - pid = re.search(r'/(\d+)\.html$', url_str) - results.append({ - "post_id": pid.group(1) if pid else None, - "title": title.group(1) if title else None, - "url": url_str, - "price": price.group(1).strip() if price else None, - "location": loc.group(1).strip() if loc else None, - }) - return results -``` - -## Individual listing page extraction - -Listing pages are also fully server-rendered. All fields are present in the raw HTML. - -```python -def get_listing(url): - headers = {"User-Agent": "Mozilla/5.0"} - html = http_get(url, headers=headers) - - title = re.search(r'([^<]+)', html) - price = re.search(r'(\$[\d,]+)', html) - # Location is in parentheses right after the price span - location = re.search( - r'[^<]+\s*\(([^)]+)\)\s*', html - ) - posted = re.search(r'class="date timeago"[^>]+datetime="([^"]+)"', html) - post_id = re.search(r'post id:\s*(\d+)', html) - - # Description body - body_block = re.search(r'section id="postingbody"[^>]*>(.*?)', html, re.DOTALL) - body_text = "" - if body_block: - raw = re.sub(r'<[^>]+>', '', body_block.group(1)).strip() - # Remove the "QR Code Link to This Post" print-only block - body_text = re.sub(r'QR Code Link to This Post\s*', '', raw).strip() - body_text = re.sub(r'\s+', ' ', body_text) - - # Images - images = re.findall(r'https://images\.craigslist\.org/[^\s"\']+_600x450\.jpg', html) - - # Attributes (condition, make, model, etc.) - attrs = {} - for labl, valu in re.findall( - r'([^<]+).*?\s*(?:<[^>]+>\s*)*([^<\n]+?)(?:\s*` blocks for bedrooms/bathrooms and square footage, -separate from the `
        ` attribute grid: - -```python -# BR/BA -br_ba = re.search(r'(\d+)BR\s*/\s*(\d+(?:\.\d+)?)Ba', html) -# Square footage -sqft = re.search(r'(\d+)ft2', html) - -if br_ba: bedrooms, bathrooms = br_ba.groups() -if sqft: sqft_val = sqft.group(1) -``` - -## JSON-LD structured data (alternative extraction path) - -Each search page includes an `ItemList` JSON-LD block with up to 330 items. Useful when you want -structured data (price as float, geo coordinates) without regex parsing of HTML: - -```python -import json, re -from helpers import http_get - -html = http_get("https://sfbay.craigslist.org/search/sss?query=laptop", headers={"User-Agent": "Mozilla/5.0"}) -ld_blocks = re.findall(r'', html, re.DOTALL) - -for raw in ld_blocks: - data = json.loads(raw) - if data.get('@type') == 'ItemList': - for item in data['itemListElement']: - listing = item['item'] - print( - listing.get('name'), - listing.get('offers', {}).get('price'), - listing.get('offers', {}).get('priceCurrency'), - listing.get('offers', {}).get('availableAtOrFrom', {}).get('address', {}).get('addressLocality'), - ) -``` - -JSON-LD item fields available: `name`, `description`, `image` (list of URLs), -`offers.price` (float string e.g. `"900.00"`), `offers.priceCurrency`, `offers.availableAtOrFrom.address`, -`offers.availableAtOrFrom.geo.latitude`, `offers.availableAtOrFrom.geo.longitude`. - -Note: JSON-LD items do not include the listing URL or post ID — use the HTML parser for those. -Combine both: use JSON-LD for price/geo, HTML for URL/post ID. - -## Pagination behavior - -The `s=` offset parameter in the URL is only respected by the JS-driven XHR layer in a real browser. -When accessed via `http_get`, the static HTML fallback renders all results regardless of `s=`: - -``` -s=0 → same 342 listings -s=120 → same 342 listings (confirmed identical URL sets) -s=300 → same 342 listings -``` - -**Recommendation**: Do not attempt pagination via `http_get`. Use search filters to narrow results: - -```python -# Instead of paginating, narrow by price range -under_500 = search_craigslist("sfbay", "sss", "macbook", max_price=500) -over_500 = search_craigslist("sfbay", "sss", "macbook", min_price=501) -``` - -If true pagination is required (e.g. you need more than 350 results), you must use a browser session -with `goto_url()` + `wait_for_load()`. - -## Bot detection - -None observed. Craigslist does not block `http_get` requests. During testing: -- All 6+ test cities returned full HTML (HTML size 174K–530K bytes per page) -- No CAPTCHA page, no redirect to `robot-check`, no `403` -- No cookie or session required -- Works with minimal `User-Agent` header: `"Mozilla/5.0"` is sufficient - -Defensive check (in case behavior changes): - -```python -def is_blocked(html): - return ( - len(html) < 5000 or - "blocked" in html[:2000].lower() or - "captcha" in html[:2000].lower() or - "cl-static-search-result" not in html - ) -``` - -## Gotchas - -- **`data-pid` does not exist in static HTML**: Old Craigslist used `data-pid` attributes. The current - static renderer uses `
<li class="cl-static-search-result">` items with a `title` attribute and an embedded `<a href>`.
  Do not search for `data-pid`, `result-row`, or `cl-search-result` — they are absent.

- **Post ID comes from the URL, not an attribute**: Extract it as the numeric segment before `.html`
  in the listing URL: `re.search(r'/(\d+)\.html$', url).group(1)`.

- **Price may be absent**: Free listings and "contact for price" listings have no `<div class="price">
        `. - The regex returns an empty string; convert to `None`. - -- **`s=` pagination is a no-op in static HTML**: The fallback renderer always returns the full result set. - Don't loop over pages — filter instead. - -- **HTML entities in titles**: Titles may contain `&`, `"`, etc. Use - `html.unescape(title)` from the standard library if you need clean text. - -- **URL structure varies by area**: The area code in the URL (`/sby/`, `/sfc/`, `/eby/`) is the sub-area - of the city (e.g. South Bay, San Francisco, East Bay). It is part of the listing URL but not needed - for constructing search URLs (which use the city subdomain only). - -- **`
<li>` without a `title` attribute in the results `<ol>`** is
  a "see also" block. The regex patterns above skip it automatically because it has no `title` attribute.

- **JSON-LD count < HTML count**: JSON-LD block may contain ~330 items while the HTML block shows ~350.
  The HTML parser is authoritative; JSON-LD is a secondary data source.

- **Body text contains print-only junk**: The `<section id="postingbody">
          ` starts with a - "QR Code Link to This Post" print-only element. Strip it with a simple string replacement - (shown in the extractor above). - -- **HTML-escaped body text**: Description bodies may contain `&`, `<`, etc. Unescape if needed: - ```python - import html as html_lib - body_clean = html_lib.unescape(body_text) - ``` diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/crossref/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/crossref/scraping.md deleted file mode 100644 index 9d03cfc50..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/crossref/scraping.md +++ /dev/null @@ -1,568 +0,0 @@ -# CrossRef — Scraping & Data Extraction - -`https://api.crossref.org` — scholarly DOI and citation metadata. **Never use the browser for CrossRef.** Completely free, no auth required. All workflows use `http_get`. - -## Do this first - -**Always add `mailto=your@email.com` to every request** — it moves you into the polite pool, which doubles the rate limit and concurrency allowance. The difference is measurable and the cost is zero. - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" # set once, append to every URL - -# Single DOI lookup — fastest way to get metadata for a known paper -data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}")) -msg = data['message'] -# msg keys: DOI, title, author, published, type, container-title, volume, issue, -# page, is-referenced-by-count, references-count, abstract (optional), ... -``` - -## Common workflows - -### DOI lookup — single paper - -```python -from helpers import http_get -import json, re - -MAILTO = "mailto=your@email.com" - -def fetch_work(doi): - data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}")) - return data['message'] - -def parse_date(d): - """[[2021, 7, 15]] -> '2021-7-15'. Handles partial dates like [[2021]].""" - if not d: return None - parts = d.get('date-parts', [[]])[0] - return '-'.join(str(p) for p in parts if p is not None) - -def clean_abstract(raw): - """Strip JATS XML tags. Abstract field contains tags like , .""" - return re.sub(r'<[^>]+>', ' ', raw).strip() if raw else None - -w = fetch_work("10.1038/s41586-021-03819-2") # AlphaFold2 - -print("DOI:", w['DOI']) # 10.1038/s41586-021-03819-2 -print("Title:", w['title'][0]) # Highly accurate protein structure... 
-print("Type:", w['type']) # journal-article -print("Publisher:", w['publisher']) # Springer Science and Business Media LLC -print("Journal:", w.get('container-title', [''])[0]) # Nature -print("Volume:", w.get('volume')) # 596 -print("Issue:", w.get('issue')) # 7873 -print("Page:", w.get('page')) # 583-589 -print("published:", parse_date(w.get('published'))) # 2021-7-15 (online date) -print("published-online:", parse_date(w.get('published-online'))) # 2021-7-15 -print("published-print:", parse_date(w.get('published-print'))) # 2021-8-26 -print("Citations:", w.get('is-referenced-by-count')) # 40260 -print("References:", w.get('references-count')) # 84 -print("Abstract:", clean_abstract(w.get('abstract', ''))[:100] if w.get('abstract') else None) -# Confirmed output (2026-04-18): -# DOI: 10.1038/s41586-021-03819-2 -# Title: Highly accurate protein structure prediction with AlphaFold -# Type: journal-article -# Journal: Nature -# Volume: 596 | Issue: 7873 | Page: 583-589 -# published: 2021-7-15 | published-print: 2021-8-26 -# Citations: 40260 -``` - -### DOI lookup — extract authors with ORCID - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" -data = json.loads(http_get(f"https://api.crossref.org/works/10.1038/s41586-021-03819-2?{MAILTO}")) -authors = data['message'].get('author', []) - -for a in authors[:3]: - name = f"{a.get('given', '')} {a.get('family', '')}".strip() - # ORCID is a full URL, not a bare ID — strip the prefix - orcid_url = a.get('ORCID') # e.g. 'https://orcid.org/0000-0001-6169-6580' - orcid_id = orcid_url.replace('https://orcid.org/', '') if orcid_url else None - authenticated = a.get('authenticated-orcid', False) # False = self-reported, True = verified - affiliations = [aff.get('name', '') for aff in a.get('affiliation', [])] - print(f"{name} | ORCID: {orcid_id} | auth={authenticated} | seq={a['sequence']}") -# Confirmed output: -# John Jumper | ORCID: 0000-0001-6169-6580 | auth=False | seq=first -# Richard Evans | ORCID: None | auth=False | seq=additional -# Alexander Pritzel | ORCID: None | auth=False | seq=additional -``` - -### Batch DOI lookup (parallel — 5 calls in ~0.3s) - -```python -from helpers import http_get -from concurrent.futures import ThreadPoolExecutor -import json - -MAILTO = "mailto=your@email.com" - -def fetch_work(doi): - try: - data = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}")) - msg = data['message'] - return { - 'doi': doi, - 'title': msg.get('title', [''])[0], - 'year': (msg.get('published', {}).get('date-parts') or [[None]])[0][0], - 'citations': msg.get('is-referenced-by-count'), - 'type': msg.get('type'), - } - except Exception as e: - return {'doi': doi, 'error': str(e)} - -dois = [ - "10.1038/nature12345", - "10.1038/s41586-021-03819-2", - "10.1056/NEJMoa2034577", - "10.1126/science.1260419", - "10.1038/s41586-024-07487-w", -] - -# max_workers=5 safe; polite pool: 10 req/s, concurrency=3 (see Rate limits) -with ThreadPoolExecutor(max_workers=5) as ex: - results = list(ex.map(fetch_work, dois)) - -for r in results: - print(r['year'], f"cites={r['citations']}", r['title'][:50]) -# Confirmed output (2026-04-18, ~0.296s total): -# 2013 cites=465 LRG1 promotes angiogenesis by modulating endotheli -# 2021 cites=40260 Highly accurate protein structure prediction with -# 2020 cites=13752 Safety and Efficacy of the BNT162b2 mRNA Covid-19 -# 2015 cites=13553 Tissue-based map of the human proteome -# 2024 cites=12037 Accurate structure prediction of biomolecular inte -``` - -### 
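### Quick citation string from a DOI

If you just need a one-line human-readable reference, the fields above are enough. A minimal sketch — the helper name `cite` and the formatting are ad hoc choices here (not a CrossRef or CSL convention); field access follows the work-object reference later in this file:

```python
from helpers import http_get
import json

MAILTO = "mailto=your@email.com"

def cite(doi):
    w = json.loads(http_get(f"https://api.crossref.org/works/{doi}?{MAILTO}"))['message']
    # First three authors as "Family Given"; organizations fall back to 'name'
    authors = ", ".join(
        (f"{a.get('family', '')} {a.get('given', '')}".strip() or a.get('name', ''))
        for a in w.get('author', [])[:3]
    )
    year = (w.get('published', {}).get('date-parts') or [[None]])[0][0]
    journal = (w.get('container-title') or [''])[0]
    return f"{authors} ({year}). {w['title'][0]}. {journal}. https://doi.org/{w['DOI']}"

print(cite("10.1038/s41586-021-03819-2"))
# e.g. "Jumper John, Evans Richard, Pritzel Alexander (2021). Highly accurate protein
#       structure prediction with AlphaFold. Nature. https://doi.org/10.1038/s41586-021-03819-2"
```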
Search works by keyword - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -# Broad keyword search -data = json.loads(http_get( - f"https://api.crossref.org/works?query=machine+learning&rows=5&{MAILTO}" -)) -msg = data['message'] -print("Total results:", msg['total-results']) # 2,805,391 -for item in msg['items']: - title = item.get('title', ['(no title)'])[0][:60] - doi = item.get('DOI', '') - year = (item.get('published', {}).get('date-parts') or [[None]])[0][0] - type_ = item.get('type', '') - print(f" [{type_}] {year} {title}") - print(f" DOI: {doi}") -``` - -### Search by author + title (targeted) - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -data = json.loads(http_get( - f"https://api.crossref.org/works?query.author=Lecun&query.title=deep+learning&rows=5&{MAILTO}" -)) -msg = data['message'] -print("Total results:", msg['total-results']) # 62 -for item in msg['items'][:3]: - title = item.get('title', [''])[0][:60] - authors = ', '.join(a.get('family', '') for a in item.get('author', [])[:2]) - year = (item.get('published', {}).get('date-parts') or [[None]])[0][0] - print(f" {year} {title}") - print(f" Authors: {authors} DOI: {item.get('DOI')}") -# Confirmed output: -# 2015 Deep learning & convolutional networks -# Authors: LeCun DOI: 10.1109/hotchips.2015.7477328 -``` - -### Filter by date, type, and sort by citations - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -data = json.loads(http_get( - f"https://api.crossref.org/works" - f"?filter=from-pub-date:2024-01-01,type:journal-article" - f"&rows=5&sort=is-referenced-by-count&order=desc&{MAILTO}" -)) -msg = data['message'] -print("Total 2024+ journal articles:", msg['total-results']) # 14,565,456 -for item in msg['items'][:3]: - title = item.get('title', [''])[0][:60] - cites = item.get('is-referenced-by-count', 0) - year = (item.get('published', {}).get('date-parts') or [[None]])[0][0] - print(f" {year} cites={cites} {title}") -# Confirmed output: -# 2024 cites=17371 Global cancer statistics 2022: GLOBOCAN estimates... -# 2024 cites=12037 Accurate structure prediction of biomolecular int... -``` - -### Filter with `has-abstract:true` - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -# Only return works that have an abstract (useful since ~30-70% do not) -data = json.loads(http_get( - f"https://api.crossref.org/works" - f"?filter=from-pub-date:2023-01-01,until-pub-date:2023-12-31" - f",type:journal-article,has-abstract:true" - f"&rows=3&sort=is-referenced-by-count&order=desc&{MAILTO}" -)) -msg = data['message'] -print("2023 journal articles with abstract:", msg['total-results']) # 3,041,841 -for item in msg['items']: - print(item.get('title', [''])[0][:60], '| cites:', item.get('is-referenced-by-count')) -# Confirmed output: -# Cancer statistics, 2023 | cites: 12919 -# Evolutionary-scale prediction of atomic-level protein struct | cites: 4352 -``` - -### Cursor pagination (large result sets) - -Standard offset pagination (`start=`) caps at a few thousand results. Use cursor for full sweeps. 
- -```python -from helpers import http_get -from urllib.parse import quote -import json - -MAILTO = "mailto=your@email.com" - -# First page: cursor=* -data = json.loads(http_get( - f"https://api.crossref.org/works?query=covid&rows=100&cursor=*&{MAILTO}" -)) -msg = data['message'] -print("Total results:", msg['total-results']) # 897,660 -items = msg['items'] -next_cursor = msg['next-cursor'] # base64 string like "DnF1ZXJ5VGhlbkZldGNoJA..." - -# Next pages: pass URL-encoded cursor -while next_cursor and items: - data = json.loads(http_get( - f"https://api.crossref.org/works?query=covid&rows=100" - f"&cursor={quote(next_cursor)}&{MAILTO}" - )) - msg = data['message'] - items = msg.get('items', []) - next_cursor = msg.get('next-cursor') - # process items... - break # remove for full sweep -``` - -### Fetch specific fields only (`select=`) - -Reduces response size significantly for bulk operations: - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -data = json.loads(http_get( - f"https://api.crossref.org/works?query=cancer&rows=5" - f"&select=DOI,title,author&{MAILTO}" -)) -# Warning: if a field is absent for a record, it simply won't appear in that item -for item in data['message']['items']: - print(list(item.keys())) # only ['DOI', 'title'] or ['DOI', 'title', 'author'] - # Note: select= does NOT guarantee the field appears — absent fields are just omitted -``` - -### Count by type using facets - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -data = json.loads(http_get( - f"https://api.crossref.org/works?query=machine+learning&rows=0" - f"&facet=type-name:*&{MAILTO}" -)) -msg = data['message'] -type_facet = msg['facets']['type-name'] -for k, v in sorted(type_facet['values'].items(), key=lambda x: -x[1]): - print(f" {k}: {v:,}") -# Confirmed output (all CrossRef, 2026-04-18): -# Journal Article: 1,628,997 (for query=machine+learning scope) -# Conference Paper: 501,433 -# Chapter: 455,907 -# Posted Content: 87,937 -# ... -``` - -### Journal info by ISSN - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -# Nature (ISSN 0028-0836) -data = json.loads(http_get(f"https://api.crossref.org/journals/0028-0836?{MAILTO}")) -msg = data['message'] -print("Title:", msg['title']) # Nature -print("Publisher:", msg['publisher']) # Springer Science and Business Media LLC -print("ISSN:", msg['ISSN']) # ['0028-0836', '1476-4687'] -print("Total DOIs:", msg['counts']['total-dois']) # 445,417 -print("Subjects:", msg.get('subjects', [])) # [] (not always populated) - -# Search journals by name -data2 = json.loads(http_get(f"https://api.crossref.org/journals?query=nature&rows=3&{MAILTO}")) -for j in data2['message']['items']: - print(f"{j.get('title')} | ISSN: {j.get('ISSN')} | DOIs: {j.get('counts', {}).get('total-dois')}") -# Confirmed output: -# NatureJobs | ISSN: [] | DOIs: 0 -# Naturen | ISSN: ['0028-0887', '1504-3118'] | DOIs: 1055 -``` - -### Funder search - -```python -from helpers import http_get -import json - -MAILTO = "mailto=your@email.com" - -data = json.loads(http_get( - f"https://api.crossref.org/funders?query=national+science+foundation&rows=3&{MAILTO}" -)) -msg = data['message'] -print("Total funders:", msg['total-results']) # 108 -for f in msg['items']: - print(f" ID: {f['id']} | {f['name']}") - print(f" Alt names: {f.get('alt-names', [])[:2]}") - print(f" URI: {f.get('uri')}") -# Confirmed output: -# ID: 501100001711 | Schweizerischer Nationalfonds zur Förderung... 
-# ID: 100000143 | Division of Computing and Communication Foundations -``` - -### DOI content negotiation (alternative, no CrossRef API needed) - -The `doi.org` resolver can return formatted metadata directly via `Accept` header: - -```python -import urllib.request, json - -def doi_to_csl(doi): - """Fetch CSL-JSON via DOI content negotiation. Same data as CrossRef API.""" - req = urllib.request.Request( - f"https://doi.org/{doi}", - headers={"Accept": "application/vnd.citationstyles.csl+json", - "User-Agent": "Mozilla/5.0"} - ) - with urllib.request.urlopen(req, timeout=20) as r: - return json.loads(r.read().decode()) - -def doi_to_bibtex(doi): - """Fetch BibTeX via DOI content negotiation.""" - req = urllib.request.Request( - f"https://doi.org/{doi}", - headers={"Accept": "application/x-bibtex", "User-Agent": "Mozilla/5.0"} - ) - with urllib.request.urlopen(req, timeout=20) as r: - return r.read().decode() - -csl = doi_to_csl("10.1038/nature12345") -print("Title:", csl['title']) # LRG1 promotes angiogenesis... -print("Type:", csl['type']) # journal-article - -bib = doi_to_bibtex("10.1038/nature12345") -print(bib[:200]) -# @article{Wang_2013, title={LRG1 promotes angiogenesis... -``` - -## Field reference - -### Work object — complete field list - -All fields are potentially absent unless marked required. Fields marked (R) are always present. - -| Field | Type | Notes | -|---|---|---| -| `DOI` (R) | string | e.g. `"10.1038/s41586-021-03819-2"` | -| `URL` (R) | string | `"https://doi.org/10.1038/s41586-021-03819-2"` | -| `title` (R) | list[str] | Always a list; access `title[0]` | -| `type` (R) | string | e.g. `"journal-article"` — see type table below | -| `publisher` | string | | -| `container-title` | list[str] | Journal name; access `[0]` | -| `short-container-title` | list[str] | Abbreviated journal name | -| `ISSN` | list[str] | May contain print and online ISSN | -| `volume` | string | Note: string not int (`"596"`) | -| `issue` | string | | -| `page` | string | e.g. `"583-589"` | -| `author` | list[object] | See author fields below | -| `published` | date-object | Best single date — use this | -| `published-online` | date-object | Online-first date | -| `published-print` | date-object | Print edition date | -| `issued` | date-object | Usually same as `published` | -| `is-referenced-by-count` | int | Inbound citations to this work | -| `references-count` | int | Outbound references from this work | -| `reference` | list[object] | Full reference list (when deposited) | -| `abstract` | string | JATS XML markup; ~30-70% of works; strip tags before use | -| `subject` | list[str] | Subject classification (often empty) | -| `language` | string | e.g. `"en"` | -| `license` | list[object] | Each: `{URL, start, delay-in-days, content-version}` | -| `funder` | list[object] | Each: `{name, DOI, award}` | -| `link` | list[object] | Full-text links | -| `relation` | object | Related DOIs (e.g. preprint → article) | -| `assertion` | list[object] | Publisher-specific metadata | -| `alternative-id` | list[str] | Publisher's internal IDs | -| `member` | string | CrossRef member ID | -| `prefix` | string | DOI prefix | -| `score` | float | Relevance score (search results only) | -| `source` | string | e.g. 
`"Crossref"` | -| `indexed` | date-object | When CrossRef indexed this record | -| `deposited` | date-object | When publisher last deposited metadata | -| `created` | date-object | When CrossRef record was first created | - -### Author object fields - -| Field | Notes | -|---|---| -| `given` | Given/first name | -| `family` | Family/last name | -| `sequence` | `"first"` or `"additional"` | -| `affiliation` | list of `{name, place}` — usually `[]` | -| `ORCID` | Full URL `"https://orcid.org/0000-0001-..."` — strip prefix to get bare ID | -| `authenticated-orcid` | `true` = verified via ORCID OAuth; `false` = self-reported | -| `name` | Used instead of given/family for organizations | - -### Date object structure - -```python -# All date fields share this structure: -date_obj = { - "date-parts": [[2021, 7, 15]], # [[year, month, day]] — month/day may be absent - "date-time": "2021-07-15T00:00:00Z", # not always present - "timestamp": 1626307200000 # not always present -} - -# Safe extraction (handles [[2021]] or [[2021, 7]] partial dates): -def parse_date(d): - if not d: return None - parts = (d.get('date-parts') or [[]])[0] - return '-'.join(str(p) for p in parts if p is not None) -``` - -### Type identifiers (filter param values vs facet display names) - -Use these exact strings in `filter=type:...`. The facet `type-name` values are display names only. - -| filter `type:` value | Facet display name | Count (all CrossRef) | -|---|---|---| -| `journal-article` | Journal Article | 121,030,194 | -| `book-chapter` | Chapter | 24,359,059 | -| `proceedings-article` | Conference Paper | 9,744,754 | -| `dataset` | Dataset | 3,424,142 | -| `posted-content` | Posted Content (preprints) | 3,203,320 | -| `dissertation` | Dissertation | 1,044,461 | -| `peer-review` | Peer Review | 1,028,287 | -| `report` | Report | 906,301 | -| `book` | Book | 870,949 | -| `monograph` | Monograph | 788,401 | - -### Query parameters reference - -| Parameter | Notes | -|---|---| -| `query` | Full-text keyword search across title, abstract, author | -| `query.author` | Author name search only | -| `query.title` | Title search only | -| `query.bibliographic` | Combined title + author + journal search | -| `rows` | Results per page (default 20, max 1000) | -| `offset` | Offset for pagination (max ~10,000 effective) | -| `cursor` | Use `cursor=*` for first page, then URL-encode `next-cursor` value | -| `sort` | `relevance`, `is-referenced-by-count`, `published`, `indexed` | -| `order` | `asc` or `desc` | -| `filter` | Comma-separated `key:value` pairs (see filters below) | -| `select` | Comma-separated field names to return | -| `facet` | `type-name:*` for type counts; `publisher-name:10` for top publishers | -| `mailto` | Your email — enables polite pool (higher limits) | - -### Filter keys reference - -| Filter key | Example | Notes | -|---|---|---| -| `doi` | `doi:10.1038/nature12345` | Exact DOI match | -| `type` | `type:journal-article` | See type table above for valid values | -| `from-pub-date` | `from-pub-date:2024-01-01` | ISO date or `YYYY` | -| `until-pub-date` | `until-pub-date:2024-12-31` | | -| `from-index-date` | `from-index-date:2024-01-01` | When CrossRef indexed it | -| `has-abstract` | `has-abstract:true` | Only works with deposited abstract | -| `has-orcid` | `has-orcid:true` | At least one author has ORCID | -| `has-full-text` | `has-full-text:true` | Has full-text link | -| `has-references` | `has-references:true` | Has deposited reference list | -| `is-update` | `is-update:true` | Corrections, 
retractions | -| `issn` | `issn:0028-0836` | Filter by journal ISSN | -| `publisher-name` | `publisher-name:elsevier` | Partial match | -| `funder` | `funder:100000001` | Funder DOI or CrossRef funder ID | - -## Rate limits - -CrossRef has two pools based on whether `mailto=` is present: - -| Pool | Triggered by | Rate limit | Concurrency | -|---|---|---|---| -| **polite** | `mailto=` param present | 10 req/s | 3 concurrent | -| **public** | no `mailto=` | 5 req/s | 1 concurrent | - -Headers returned: `x-rate-limit-limit`, `x-rate-limit-interval`, `x-concurrency-limit`, `x-api-pool`. - -In practice with polite pool: 10 rapid sequential calls complete in ~2.7s (avg 0.27s/req) with no throttling. 5 parallel calls complete in ~0.3s. Stay at `max_workers=5` to respect the concurrency limit. - -No per-day or per-hour cap. If you exceed limits, responses slow or return HTTP 429. No ban. Add `time.sleep(0.1)` between calls for sustained bulk crawls. - -## Gotchas - -- **`mailto=` doubles your rate limit and concurrency.** Public pool: 5 req/s, concurrency=1. Polite pool: 10 req/s, concurrency=3. Always add `?mailto=your@email.com` to every request — confirmed by reading `x-api-pool` response header. - -- **`title`, `container-title`, `ISSN` are always lists, not strings.** Access with `title[0]`, `container-title[0]` etc. Do not rely on there being only one entry — `container-title` can have multiple values. - -- **Abstract contains JATS XML markup.** The `abstract` field is not plain text — it contains tags like ``, ``, ``. Strip with `re.sub(r'<[^>]+>', ' ', abstract)`. About 30-70% of works have an abstract at all; journal articles 2023 with `has-abstract:true` filter: 3,041,841 / ~5.5M total = ~55%. - -- **ORCID is a full URL, not just the ID.** `a['ORCID']` = `"https://orcid.org/0000-0001-6169-6580"`. Strip with `.replace('https://orcid.org/', '')` to get the bare ID. `authenticated-orcid: false` means self-asserted (not verified via OAuth). - -- **`published` vs `published-print` vs `published-online`.** Online-first is common in journals — a paper may be online months before its print issue. `published` is CrossRef's best single date and equals `published-online` when both exist. For preprints (`posted-content` type), look for `posted` instead of `published-print` — it may only have `posted` and `published`. Partial dates like `[[2023]]` (year only) are valid — always use `parse_date()` to handle missing month/day. - -- **404 raises `HTTPError`, not a JSON error response.** An invalid DOI (e.g. `10.9999/doesnotexist`) raises `urllib.error.HTTPError: HTTP Error 404: Not Found`. Wrap `fetch_work()` in try/except for any untrusted DOI list. - -- **`volume` and `issue` are strings, not integers.** CrossRef stores them as strings — `"596"`, not `596`. Don't compare with `==` to an int. - -- **Filter type values are hyphenated lowercase, not the facet display names.** `filter=type:journal-article` works. `filter=type:journal article`, `filter=type:Journal Article`, and `filter=type:conference-paper` all return HTTP 400. Conference papers are `proceedings-article`. - -- **`select=` does not guarantee field presence.** When you `select=DOI,title,author`, a record that has no author still omits the `author` key — it doesn't return `author: []`. Always use `.get()`. - -- **Cursor pagination required for >10,000 results.** Offset pagination (`offset=`) is limited to around 10,000 results. 
For bulk sweeps, use `cursor=*` for the first page, then URL-encode the returned `next-cursor` value with `urllib.parse.quote()`. The cursor expires if unused for too long. - -- **`rows` max is 1000 per call.** Requesting more silently returns 1000. For cursor-based sweeps of large result sets (millions of records), `rows=1000` with cursor is the most efficient approach. - -- **HTML entities in titles.** Titles may contain HTML entities like `&` — `"Deep learning & convolutional networks"`. Decode with `html.unescape()` if needed. - -- **`funder` search `works-count` field is `None`.** The funder search result object has a `works-count` key that is always `None` in the search response. To get actual work counts for a funder, fetch the funder directly: `GET /funders/{id}`. - -- **`subject` is often an empty list.** The `subject` field in works is populated inconsistently — many journal articles have `subject: []` even for well-indexed journals like Nature. - -- **Affiliation is usually empty.** `author[i]['affiliation']` is `[]` for the majority of records, even for papers published in 2024. CrossRef has been working on affiliation deposit, but coverage is inconsistent. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/dev-to/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/dev-to/scraping.md deleted file mode 100644 index c87c8509b..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/dev-to/scraping.md +++ /dev/null @@ -1,323 +0,0 @@ -# DEV Community (dev.to) — Data Extraction - -`https://dev.to` — developer blogging platform. Everything useful is available via a public REST API with no auth required. No browser needed for any read task. - -## Do this first - -**Use the REST API — it returns clean JSON in ~150–250ms with no browser, no login, no JS rendering.** - -```python -import json -articles = json.loads(http_get("https://dev.to/api/articles?per_page=10&tag=python")) -# Each article: id, title, description, url, cover_image, tag_list, tags, -# published_at, published_timestamp, readable_publish_date, -# reading_time_minutes, positive_reactions_count, -# public_reactions_count, comments_count, user, organization, -# flare_tag, collection_id, slug, path, canonical_url, -# social_image, language, subforem_id -``` - -The API serves **V0 (beta) by default** and emits a `Warning: 299` header on every response. Suppress it silently with the V1 `Accept` header (same data, no deprecated warning): - -```python -import json -import urllib.request, gzip - -def dev_get(url): - h = { - "User-Agent": "Mozilla/5.0", - "Accept-Encoding": "gzip", - "Accept": "application/vnd.forem.api-v1+json", - } - with urllib.request.urlopen(urllib.request.Request(url, headers=h), timeout=20) as r: - data = r.read() - if r.headers.get("Content-Encoding") == "gzip": - data = gzip.decompress(data) - return data.decode() - -articles = json.loads(dev_get("https://dev.to/api/articles?per_page=10&tag=python")) -``` - -Or just use `http_get` directly if you don't care about the warning header noise. - ---- - -## Common workflows - -### Articles by tag - -```python -import json -articles = json.loads(http_get("https://dev.to/api/articles?per_page=10&tag=python")) -# Paginate with &page=2, &page=3 etc. (1-indexed) -for a in articles: - print(a['id'], a['positive_reactions_count'], a['title'][:60]) -``` - -Confirmed working tags: `python`, `javascript`, `typescript`, `rust`, `go`, `webdev`, `tutorial`, `react`, `devops`, `ai`, `beginners`. 
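### Tag feed → full Markdown bodies

To go from the tag feed to full article text, chain the list call with the single-article lookup (covered below), which adds `body_markdown`. A minimal sketch — the tag, page size, and the three-article slice are arbitrary choices here, and each body fetch costs one extra request:

```python
import json

# Latest python-tagged articles, then full Markdown for the first three.
articles = json.loads(http_get("https://dev.to/api/articles?per_page=10&tag=python"))
for a in articles[:3]:
    full = json.loads(http_get(f"https://dev.to/api/articles/{a['id']}"))
    print(a['positive_reactions_count'], a['title'][:60])
    print(full['body_markdown'][:200].replace("\n", " "), "...")
```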
- -### Top articles by time window - -```python -import json -# top=N means "top articles from the last N days" -top_day = json.loads(http_get("https://dev.to/api/articles?per_page=10&top=1")) -top_week = json.loads(http_get("https://dev.to/api/articles?per_page=10&top=7")) -top_month = json.loads(http_get("https://dev.to/api/articles?per_page=10&top=30")) -top_year = json.loads(http_get("https://dev.to/api/articles?per_page=10&top=365")) -``` - -### Articles by username - -```python -import json -articles = json.loads(http_get("https://dev.to/api/articles?per_page=10&username=ben")) -# Paginates cleanly: page=1, page=2 etc. Return distinct IDs, no overlap. -``` - -### New and rising articles - -```python -import json -fresh = json.loads(http_get("https://dev.to/api/articles?per_page=10&state=fresh")) # very new -rising = json.loads(http_get("https://dev.to/api/articles?per_page=10&state=rising")) # gaining traction -# state=all returns 0 results (requires auth, not useful unauthenticated) -``` - -### Single article by ID (adds body_html and body_markdown) - -```python -import json -article = json.loads(http_get("https://dev.to/api/articles/3442047")) -# Full article adds two fields not in list response: -# body_html — rendered HTML (safe to display directly) -# body_markdown — raw Markdown source -print(len(article['body_html']), len(article['body_markdown'])) -``` - -### Single article by username/slug - -```python -import json -# path field from list response is "/username/slug" -article = json.loads(http_get("https://dev.to/api/articles/ben/some-article-slug")) -``` - -### Tags — popular list with colors - -```python -import json -tags = json.loads(http_get("https://dev.to/api/tags?per_page=10")) -# Fields: id, name, bg_color_hex, text_color_hex, short_summary -# Sorted by popularity. Paginate with &page=2 etc. -for t in tags: - print(t['name'], t['bg_color_hex'], t['text_color_hex']) -# e.g. webdev #562765 #ffffff -# javascript #f7df1e #000000 -# ai #17fd1a #ffffff -``` - -### User profile - -```python -import json -user = json.loads(http_get("https://dev.to/api/users/by_username?url=ben")) -# Fields: type_of, id, username, name, twitter_username, github_username, -# summary, location, website_url, joined_at, profile_image -print(user['id'], user['username'], user['summary']) -# e.g. 1 ben "A Canadian software developer who thinks he's funny." -``` - -`joined_at` is a human string like `"Dec 27, 2015"` — not ISO 8601. Parse with `datetime.strptime(user['joined_at'], "%b %d, %Y")`. 
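
A short worked example of that parse (username `ben` as above; the account-age computation is just illustrative):

```python
import json
from datetime import datetime
from helpers import http_get

user = json.loads(http_get("https://dev.to/api/users/by_username?url=ben"))
joined = datetime.strptime(user['joined_at'], "%b %d, %Y")  # "Dec 27, 2015" -> datetime
print(joined.date().isoformat())                            # 2015-12-27
print(f"~{(datetime.now() - joined).days / 365.25:.1f} years on DEV")
```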
- -### Comments on an article - -```python -import json -comments = json.loads(http_get("https://dev.to/api/comments?a_id=3442047")) -# Returns top-level comments only (replies nested under children key) -# Fields per comment: id_code (string, not int!), type_of, body_html, -# created_at, user (dict), children (list of same shape) -for c in comments: - print(c['id_code'], c['user']['username'], c['created_at']) - for reply in c.get('children', []): - print(" reply:", reply['id_code'], reply['user']['username']) -``` - -### Single comment by id_code - -```python -import json -comment = json.loads(http_get("https://dev.to/api/comments/36lnc")) -# Same fields as above: id_code, body_html, created_at, user, children -``` - -### Bulk tag fetch (parallel) - -```python -import json -from concurrent.futures import ThreadPoolExecutor - -tags = ['python', 'javascript', 'typescript', 'rust', 'go', - 'devops', 'webdev', 'tutorial', 'productivity', 'react'] - -def fetch_tag(tag): - data = json.loads(http_get(f"https://dev.to/api/articles?per_page=5&tag={tag}")) - return tag, data - -with ThreadPoolExecutor(max_workers=3) as ex: - results = dict(ex.map(lambda t: fetch_tag(t), tags)) -# 10 tags × 5 articles each: ~0.67s total with max_workers=3 -``` - ---- - -## Endpoint reference - -| Endpoint | Auth | Key params | Latency | -|----------|------|-----------|---------| -| `GET /api/articles` | None | `tag`, `username`, `top`, `state`, `page`, `per_page` | ~200ms | -| `GET /api/articles/{id}` | None | — | ~80ms | -| `GET /api/articles/{username}/{slug}` | None | — | ~200ms | -| `GET /api/tags` | None | `page`, `per_page` | ~190ms | -| `GET /api/users/by_username?url={username}` | None | — | ~190ms | -| `GET /api/comments?a_id={article_id}` | None | — | ~160ms | -| `GET /api/comments/{id_code}` | None | — | ~150ms | -| `GET /api/listings` | None | `category`, `page`, `per_page` | ~260ms (returns 0) | - -**Listings endpoint returns 0 results.** The `/api/listings` endpoint is documented but returns an empty array for all categories (`jobs`, `forsale`, `education`, `cfp`) without auth. Skip it. - ---- - -## Pagination - -All list endpoints paginate with `page=` (1-indexed) and `per_page=`: - -```python -import json - -def get_all_articles_by_tag(tag, max_pages=5): - results = [] - for page in range(1, max_pages + 1): - batch = json.loads(http_get( - f"https://dev.to/api/articles?per_page=30&tag={tag}&page={page}" - )) - if not batch: - break - results.extend(batch) - return results -``` - -- `per_page` supports up to **1000** (confirmed). No documented max, but 1000 works in testing. -- No `total_count` field in list responses — you paginate until an empty array. -- Page ordering is consistent — confirmed no ID overlap between page 1 and page 2. - ---- - -## Article field reference - -All fields returned in list responses (single article adds `body_html` and `body_markdown`): - -``` -id int — article ID, stable, use for single-article fetch -title str -description str — auto-excerpt, never null -slug str — URL slug component -path str — "/username/slug" -url str — full canonical URL -canonical_url str — same as url for native posts; author's site URL for cross-posts -cover_image str|null — CDN URL or null (~30% of articles have no cover image) -social_image str — always present (generated if no cover_image) -tag_list list — e.g. ['python', 'ai', 'tutorial'] ← use this for code -tags str — same tags as comma-separated string "python, ai, tutorial" -published_at str — ISO 8601 UTC e.g. 
"2026-04-18T03:49:36Z" -published_timestamp str — identical to published_at -readable_publish_date str — human string e.g. "Apr 18" -reading_time_minutes int -positive_reactions_count int — hearts/likes count -public_reactions_count int — total reactions (usually same as positive_reactions_count) -comments_count int -user dict — name, username, twitter_username, github_username, - user_id, website_url, profile_image, profile_image_90 -organization dict|null — present when posted under an org: name, username, slug, - profile_image, profile_image_90 -flare_tag dict|null — {name, bg_color_hex, text_color_hex} — discussion/challenge badge -collection_id int|null — series/collection ID if part of a series -language str — e.g. "en" -subforem_id int|null -crossposted_at str|null — ISO datetime if cross-posted -edited_at str|null -last_comment_at str|null -created_at str — ISO 8601 -type_of str — always "article" -``` - ---- - -## Rate limits - -- **Burst limit: ~6 rapid sequential requests**, then HTTP 429. -- **Recovery: `Retry-After: 1` second** — wait 1s after a 429 and you're good again. -- No `X-RateLimit-*` headers in 200 responses — you only see `Retry-After` on the 429 itself. -- With `ThreadPoolExecutor(max_workers=3)`, 10 concurrent requests succeed without hitting the limit. -- No difference in limits between V0 (default) and V1 (`Accept` header) — same underlying rate limit. -- **No auth token tested** — all endpoints above work without `api_key`. Authenticated requests likely have higher limits. - -Safe pattern for bulk fetching: - -```python -import json, time -from concurrent.futures import ThreadPoolExecutor - -def safe_fetch(url): - for attempt in range(3): - try: - return json.loads(http_get(url)) - except Exception as e: - if '429' in str(e): - time.sleep(1) # Retry-After is 1s - continue - raise - return [] - -urls = [ - f"https://dev.to/api/articles?per_page=10&tag={tag}" - for tag in ['python', 'javascript', 'typescript', 'rust'] -] -with ThreadPoolExecutor(max_workers=3) as ex: - results = list(ex.map(safe_fetch, urls)) -``` - ---- - -## Gotchas - -- **`tag_list` (list) vs `tags` (string)** — both fields always present. `tag_list` is a Python list; `tags` is the same data as a comma-separated string. Use `tag_list` in code. - -- **Comments have `id_code`, not `id`** — comment identifiers are alphanumeric strings like `"36lnc"`, not integers. The integer `id` field is absent from comment objects. Use `id_code` to fetch a specific comment via `GET /api/comments/{id_code}`. - -- **Comments endpoint returns top-level only** — replies are nested under `children` recursively, not returned as a flat list. A thread with 100 total comments may only show 60 top-level objects; walk `children` recursively to count all. - -- **`cover_image` can be null** — ~30% of articles have no cover image. Always guard: `a.get('cover_image') or a['social_image']` for a guaranteed image URL. - -- **`flare_tag` is null for most articles** — only discussion/challenge posts carry it. It's a dict `{name, bg_color_hex, text_color_hex}` when present. - -- **`published_at` == `published_timestamp`** — both fields contain identical ISO 8601 UTC strings. `readable_publish_date` is human-only (`"Apr 18"`, no year). - -- **`joined_at` on user profile is not ISO** — it's `"Dec 27, 2015"`. Parse: `datetime.strptime(u['joined_at'], "%b %d, %Y")`. - -- **`state=all` returns 0 results unauthenticated** — it's for the authenticated user's own feed. `state=fresh` and `state=rising` work without auth. 
- -- **`top=N` means last N days** — `top=1` is last 24h, `top=7` is last week, `top=30` is last month, `top=365` is last year. Results differ from the `state=` param. - -- **V0 warning header on every response** — `Warning: 299 - This endpoint is part of the V0 (beta) API…` appears on all responses without the `Accept` header. It's harmless but noisy. Suppress with `"Accept": "application/vnd.forem.api-v1+json"`. - -- **No `total_count` in list responses** — paginate until an empty array. There is no way to know upfront how many total results exist. - -- **Listings endpoint returns empty** — `GET /api/listings` and all category variants return `[]` without auth. Documented but non-functional publicly. - -- **`/api/articles/{id}/comments` returns 404** — comments must be fetched via `GET /api/comments?a_id={id}`, not as a sub-resource of articles. - -- **`canonical_url` may point off-site** — for cross-posted articles, `canonical_url` is the author's original blog URL, not dev.to. Use `url` for the dev.to link. - -- **`organization` field is null for personal posts** — only present when the article was posted under an org account. Check before accessing sub-fields. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/duckduckgo/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/duckduckgo/scraping.md deleted file mode 100644 index 8c86144c7..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/duckduckgo/scraping.md +++ /dev/null @@ -1,349 +0,0 @@ -# DuckDuckGo — Instant Answer API - -`https://api.duckduckgo.com` — completely public, no auth, no API key. Returns Wikipedia-sourced abstracts, infoboxes, and instant answers for well-known entities, calculations, and utility queries. Not a search engine — it does not return a list of web results for arbitrary queries. - -## Do this first: pick your query type - -| Query type | Example | `Type` | Returns | -|------------|---------|--------|---------| -| Named entity (specific) | `apple inc` | A | Full abstract + infobox | -| Ambiguous term | `python` | D | Disambiguation list in `RelatedTopics` | -| Instant answer | `random number` | E | Direct answer in `Answer` field | -| No match | `how to cook pasta` | `""` | All fields empty | - -**Use `skip_disambig=1` and `no_html=1` in almost every call.** `skip_disambig=1` upgrades D→A when there's an obvious primary result (e.g., `elon musk` goes from disambiguation to full article). `no_html=1` removes `` tags from the `Answer` field and strips bold markup from `Result` HTML strings. - -**Never use a browser.** Everything is a single `http_get` JSON call, 183–320ms. - ---- - -## Fastest path: entity lookup - -```python -import json, urllib.parse -from helpers import http_get - -def ddg_instant(query: str) -> dict: - q = urllib.parse.quote(query) - raw = http_get( - f"https://api.duckduckgo.com/?q={q}&format=json&no_html=1&skip_disambig=1" - ) - return json.loads(raw) - -# Entity with Wikipedia abstract + infobox -data = ddg_instant("openai") -# data['Type'] == 'A' -print(data['Heading']) # 'OpenAI' -print(data['AbstractText']) # 'OpenAI is an American artificial intelligence research...' 
-print(data['AbstractURL']) # 'https://en.wikipedia.org/wiki/OpenAI' -print(data['OfficialWebsite'])# 'https://openai.com/' -print(data['Entity']) # 'company' - -# Person lookup (skip_disambig resolves D→A automatically) -data = ddg_instant("elon musk") -print(data['Type']) # 'A' (was 'D' without skip_disambig) -print(data['AbstractText'][:100]) # 'Elon Reeve Musk is a businessman...' -print(data['Image']) # '/i/be2a8644.jpg' — prepend https://duckduckgo.com - -# Full image URL -img_url = f"https://duckduckgo.com{data['Image']}" if data['Image'] else None -``` - ---- - -## Instant answers (Type = E) - -```python -import json, urllib.parse -from helpers import http_get - -def ddg_answer(query: str) -> tuple[str, str]: - """Returns (answer_text, answer_type). answer_text is '' if no result.""" - q = urllib.parse.quote(query) - raw = http_get( - f"https://api.duckduckgo.com/?q={q}&format=json&no_html=1&no_redirect=1" - ) - data = json.loads(raw) - ans = data.get('Answer', '') - # Answer can be a dict when it's a widget (calculator, converter) — only string Answers are usable - return (ans if isinstance(ans, str) else '', data.get('AnswerType', '')) - -# Confirmed working instant answers: -text, kind = ddg_answer("random number") -# text='0.245013228691281 (random number)', kind='rand' - -text, kind = ddg_answer("generate password") -# text='ZCsbe8iY (random password)', kind='pw' - -text, kind = ddg_answer("ip address") -# text='Your IP address is 73.158.74.222 in San Francisco, California, United States (94121)', kind='ip' - -text, kind = ddg_answer("base64 encode hello") -# text='Base64 encode d: aGVsbG8=', kind='base64_conversion' - -text, kind = ddg_answer("md5 hash hello") -# text='5d41402abc4b2a76b9719d911017c592', kind='md5' - -text, kind = ddg_answer("pi") -# text='3.14159', kind='constants' - -text, kind = ddg_answer("today date") -# text='\nS M T W T F S April 2026\n...|18|...', kind='calendar' - -text, kind = ddg_answer("timer 5 minutes") -# text='300', kind='timer' — returns raw seconds - -text, kind = ddg_answer("lorem ipsum") -# text='Ea hic quia corporis. Minus consequuntur...', kind='lorem_ipsum' - -# Color lookup — must URL-encode the # sign: -text, kind = ddg_answer("color #FF5733") -# text='Hex: #FF5733 ~ RGBA(255, 87, 51, 1) ~ RGB(100%, 34%, 20%) ~ HSL(11, 100%, 60%) ~ CMYB(0%, 66%, 80%, ...', kind='color_code' -``` - -**Widget answers return a dict, not a string** — `sqrt(144)`, `1 mile in km`, `100 USD in EUR`, and `stopwatch` all return `Answer` as a dict like `{'from': 'calculator', 'id': 'calculator', 'result': '', ...}`. The `result` key is empty — the actual computation happens client-side in a JS widget. Treat dict `Answer` values as "not usable via API". - ---- - -## Full response schema - -Every response has exactly these 21 top-level keys (all always present): - -``` -Abstract # same as AbstractText (redundant, use AbstractText) -AbstractSource # "Wikipedia" when present, "" otherwise -AbstractText # Wikipedia-sourced summary paragraph (up to ~1000 chars) -AbstractURL # Wikipedia article URL -Answer # string or dict — instant answer result (see above) -AnswerType # string key identifying the answer plugin (e.g. "rand", "ip") -Definition # almost always "" — not reliably populated -DefinitionSource # almost always "" -DefinitionURL # almost always "" -Entity # entity type: "company", "programming language", "person", etc. -Heading # entity display name -Image # relative path e.g. 
"/i/4d83768732377cf3.png" — prepend https://duckduckgo.com -ImageHeight # int or "" when no image -ImageIsLogo # 0 or 1 integer when image present; "" otherwise -ImageWidth # int or "" when no image -Infobox # dict with "content" and "meta" lists, or "" if no infobox -OfficialDomain # e.g. "openai.com" — only for entities with a known website -OfficialWebsite # e.g. "https://openai.com/" — only when DDG knows it -Redirect # target URL when query is a bang (e.g. !g python) with no_redirect=1 -RelatedTopics # list — see below -Results # list — official site links (usually 0 or 1 item) -Type # "A", "D", "C", "N", "E", or "" -meta # API plugin metadata — rarely needed -``` - -### `RelatedTopics` item structure - -Each item is one of two shapes: - -**Plain topic** (the common case): -```python -{ - "FirstURL": "https://duckduckgo.com/Deep_learning", - "Icon": {"Height": "", "URL": "/i/abc123.png", "Width": ""}, # URL often "" - "Result": "Deep learning— branch of ML...", # HTML - "Text": "Deep learning — branch of ML concerned with artificial neural networks." -} -``` - -**Section** (disambiguation pages only — when `Type` is `D` without `skip_disambig`): -```python -{ - "Name": "Science & Technology", # section heading - "Topics": [ # list of plain topic objects - {"FirstURL": "...", "Icon": {...}, "Result": "...", "Text": "..."}, - ... - ] -} -``` - -For A-type results, `RelatedTopics` are Wikipedia category links (e.g. `"American aerospace engineers"` pointing to `https://duckduckgo.com/c/...`). These are not web search results — they are DDG topic pages. - -### `Results` item structure - -Usually 0 or 1 item. When present, it's the official website: -```python -{ - "FirstURL": "https://www.apple.com/", - "Icon": {"Height": 16, "URL": "/i/apple.com.ico", "Width": 16}, - "Result": "Official site...", - "Text": "Official site" -} -``` -Icon URLs in `Results` are relative — prepend `https://duckduckgo.com`. - -### `Infobox` structure - -```python -ib = data['Infobox'] # dict or "" (empty string when absent) -if isinstance(ib, dict): - content = ib['content'] # list of structured fields - # Each content item: - # {"data_type": "string", "label": "Founded", "value": "December 08, 2015"} - # {"data_type": "string", "label": "Founders", "value": "Sam Altman, Elon Musk, ..."} - - meta = ib['meta'] # list of metadata items - # {"data_type": "string", "label": "article_title", "value": "OpenAI"} - # {"data_type": "string", "label": "template_name", "value": "infobox company"} - -# Extract infobox as flat dict: -if isinstance(data['Infobox'], dict): - fields = {item['label']: item['value'] for item in data['Infobox']['content']} - # fields['Founded'] == 'December 08, 2015' - # fields['Products'] == 'ChatGPT, GPT-5...' -``` - -`Infobox` is `""` (empty string, not `None`, not `{}`) when absent. Always check with `isinstance(data['Infobox'], dict)`. - ---- - -## Query parameters - -| Parameter | Values | Effect | -|-----------|--------|--------| -| `q` | URL-encoded query | The search query | -| `format` | `json` | Required — omit for HTML response | -| `no_redirect` | `1` | Returns redirect URL in `Redirect` field instead of HTTP 302; required for bang queries (`!g`, `!yt`) | -| `no_html` | `1` | Strips `` from `Answer`; strips bold markup from `Result` HTML; use in almost every call | -| `skip_disambig` | `1` | Resolves ambiguous D-type queries to the primary result; upgrades D→A when unambiguous | -| `t` | any string | Source identifier tag (e.g. 
`t=myapp`); has no effect on results | -| `callback` | function name | Wraps response in JSONP: `mycallback({...})` | - ---- - -## Type field values - -| Type | Meaning | AbstractText | RelatedTopics | -|------|---------|--------------|---------------| -| `A` | Article — specific Wikipedia entity | Full paragraph | Category links | -| `D` | Disambiguation — ambiguous term | Empty `""` | List of possible meanings (may include sections) | -| `C` | Categories | Varies | Category items | -| `N` | Name | Varies | Name-related items | -| `E` | Exclusive — instant answer widget | Empty `""` | Empty `[]` | -| `""` | No result | Empty `""` | Empty `[]` | - -In practice, C and N types are rare. A, D, E, and empty cover nearly all queries. - ---- - -## What returns useful results vs empty - -**Returns AbstractText (A type):** -- Named companies: `apple inc`, `openai`, `google` -- Specific technologies: `python programming language`, `javascript`, `linux kernel` -- Well-known people with `skip_disambig=1`: `elon musk`, `ada lovelace` -- Scientific concepts: `machine learning`, `photosynthesis`, `circumference` -- Specific software: `vim`, `postgresql`, `nginx` - -**Returns RelatedTopics only (D type):** -- Ambiguous single words: `python`, `linux`, `react`, `programming` -- Ambiguous names: `apple` (returns empty — too ambiguous even for D), `new york` - -**Returns empty (Type = ""):** -- How-to queries: `how to cook pasta`, `how to learn python` -- Opinion/listicle: `best laptops 2024`, `top 10 programming languages` -- Current events: `weather london`, `bitcoin price` -- Site search operators: `site:example.com` -- Multi-word specifics not in DDG's dataset: `numpy python library`, `javascript tutorial` - -**Returns instant answer (E type):** -- Random: `random number`, `generate password`, `lorem ipsum` -- Math: `pi`, `timer 5 minutes` -- Network: `ip address` -- Encoding: `base64 encode `, `md5 hash ` -- Color lookup: `color #RRGGBB` (must URL-encode the `#`) - ---- - -## Gotchas - -**`Infobox` is `""` not `None` when absent.** Always check with `isinstance(data['Infobox'], dict)` — `if data['Infobox']` also works since `""` is falsy. - -**Image and Icon URLs are relative.** `data['Image']` is `/i/abc123.png`. Prepend `https://duckduckgo.com` to make it absolute. Same for Icon URLs in `RelatedTopics` and `Results`. - -**`Answer` can be a dict (widget), not a string.** Queries like `1 mile in km`, `100 USD in EUR`, `sqrt(144)`, and `stopwatch` return `Answer` as a dict with `{'from': 'calculator', 'result': '', ...}`. The `result` key is empty — the widget computes client-side. Only string `Answer` values are usable via the API. - -**`color #RRGGBB` requires URL encoding of `#`.** Using `q=color+#FF5733` returns an HTML page (HTTP redirect). Use `urllib.parse.quote("color #FF5733")` which encodes to `color+%23FF5733`. - -**Bang queries without `no_redirect=1` return HTML, not JSON.** `!g python` (without `no_redirect=1`) causes an HTTP 302 to `google.com/search?q=python`. The `http_get` helper follows the redirect and returns Google's HTML — `json.loads` fails. Always add `no_redirect=1` when the query might contain bangs. - -**`skip_disambig=1` can add latency for truly ambiguous terms.** For `apple` (no "inc"), DDG returns Type `""` even with `skip_disambig=1` — it's so ambiguous it gives nothing. For `elon musk`, `skip_disambig=1` switches from D to A and adds `RelatedTopics` (39 items vs 4), which means a larger response (~5x). 
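
To see that D-to-A upgrade concretely, probe the same query with and without the flag. A small sketch (`ddg_type` is a throwaway helper, and the expected outputs in the comments are the counts quoted above, not fresh measurements):

```python
import json, urllib.parse
from helpers import http_get

def ddg_type(query, skip_disambig):
    # Same call as ddg_instant above, with skip_disambig made optional.
    q = urllib.parse.quote(query)
    flag = "&skip_disambig=1" if skip_disambig else ""
    data = json.loads(http_get(
        f"https://api.duckduckgo.com/?q={q}&format=json&no_html=1{flag}"
    ))
    return data['Type'], len(data['RelatedTopics'])

print(ddg_type("elon musk", False))  # ('D', 4)  per the numbers above
print(ddg_type("elon musk", True))   # ('A', 39) full article, larger response
print(ddg_type("apple", True))       # ('', 0)   too ambiguous even with the flag
```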
- -**`AbstractText` is empty for D-type results.** When `Type == 'D'`, DDG only returns `RelatedTopics` (the disambiguation list). The abstract is only filled for `Type == 'A'`. - -**`RelatedTopics` for A-type are Wikipedia categories, not related searches.** For `openai`, the 4 `RelatedTopics` are `"American artificial intelligence companies"`, `"Companies in San Francisco"`, etc. — these are DDG category page links, not useful web search results. - -**`Definition` / `DefinitionSource` / `DefinitionURL` are always empty** in observed responses. These fields are part of the schema but not reliably populated by any current DDG plugin. - -**No rate limiting observed.** 15 rapid sequential requests completed in 3.11s (~208ms avg) with no throttling, no 429, and consistent response structure throughout. DDG does not publish rate limits; the API is designed for "reasonable" use with a `t=` source identifier. - -**`OfficialWebsite` is only set for a subset of A-type results.** `machine learning` (Type A) has no `OfficialWebsite`. `openai`, `python programming language`, and `linux kernel` all have one. Always check with `data.get('OfficialWebsite', '')`. - -**No_html does not affect the `Result` HTML string.** `Results[0]['Result']` still contains `` tags with `no_html=1`. The `no_html` flag only removes `` bold tags. Use `Results[0]['Text']` for the plain-text version, or `Results[0]['FirstURL']` for just the URL. - ---- - -## Complete working example - -```python -import json, urllib.parse -from helpers import http_get - -def ddg_entity(query: str) -> dict | None: - """ - Fetch a DuckDuckGo Instant Answer for a named entity. - Returns structured data or None if no result. - """ - q = urllib.parse.quote(query) - raw = http_get( - f"https://api.duckduckgo.com/?q={q}&format=json&no_html=1&skip_disambig=1" - ) - data = json.loads(raw) - if not data.get('AbstractText') and not data.get('Answer'): - return None - - result = { - 'type': data['Type'], - 'heading': data['Heading'], - 'abstract': data['AbstractText'], - 'abstract_url': data['AbstractURL'], - 'entity': data['Entity'], - 'official_website': data['OfficialWebsite'], - 'image': f"https://duckduckgo.com{data['Image']}" if data['Image'] else None, - 'answer': data['Answer'] if isinstance(data['Answer'], str) else None, - 'answer_type': data['AnswerType'], - } - - # Extract infobox as flat dict - if isinstance(data['Infobox'], dict): - result['infobox'] = { - item['label']: item['value'] - for item in data['Infobox']['content'] - } - - # Official site URL (from Results) - if data['Results']: - result['official_site_url'] = data['Results'][0]['FirstURL'] - - return result - -# Example outputs (validated 2026-04-18): -r = ddg_entity("openai") -# r['type'] == 'A' -# r['heading'] == 'OpenAI' -# r['abstract'][:50] == 'OpenAI is an American artificial intelligence res' -# r['entity'] == 'company' -# r['official_website']== 'https://openai.com/' -# r['image'] == 'https://duckduckgo.com/i/fb410946942ab334.png' -# r['infobox']['Founded'] == 'December 08, 2015' -# r['infobox']['Products'] == 'ChatGPT, GPT-5...' - -r = ddg_entity("python programming language") -# r['type'] == 'A' -# r['entity'] == 'programming language' -# r['official_website']== 'https://www.python.org/' -# r['infobox']['Paradigm'] == 'Multi-paradigm: object-oriented,...' 
-``` diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/ebay/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/ebay/scraping.md deleted file mode 100644 index d75886730..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/ebay/scraping.md +++ /dev/null @@ -1,435 +0,0 @@ -# eBay — Scraping & Data Extraction - -Field-tested against ebay.com on 2026-04-18 using `uv run python` with `http_get`. -Chrome is NOT required — `http_get` returns full HTML on first access. - -## Critical: Bot Detection ("Pardon Our Interruption") - -eBay's bot detection fires after roughly **5–10 requests per IP in a short window**. -The block page is ~13 KB, title `"Pardon Our Interruption..."`, and contains no listing data. - -**Always check before parsing:** -```python -def is_blocked(html): - return 'Pardon Our Interruption' in html or len(html) < 20_000 - -html = http_get("https://www.ebay.com/sch/i.html?_nkw=laptop&LH_BIN=1", headers=HEADERS) -if is_blocked(html): - raise RuntimeError("eBay bot-detection triggered — back off and retry later") -``` - -**When blocked:** wait at minimum 60–120 seconds before retrying. The block is IP-session-scoped, -not a hard IP ban; it clears after inactivity. - -**Headers required (minimal UA gets blocked faster, full browser UA lasts longer):** -```python -HEADERS = { - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36", - "Accept-Language": "en-US,en;q=0.9", - "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8", -} -``` - -A plain `"User-Agent": "Mozilla/5.0"` also works for the first few requests, -but the full Chrome UA lasts slightly longer before triggering the block. - -## Search URL Structure - -``` -https://www.ebay.com/sch/i.html?_nkw={query}&{filters} -``` - -Confirmed working URL examples: -```python -# Buy It Now only, sorted by lowest price -"https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&LH_BIN=1&_sop=15" - -# Auctions only -"https://www.ebay.com/sch/i.html?_nkw=vintage+camera&LH_Auction=1" - -# New condition only, page 2 -"https://www.ebay.com/sch/i.html?_nkw=laptop&LH_ItemCondition=1000&_pgn=2" -``` - -### Filter Parameters (all confirmed working) - -| Parameter | Value | Effect | -|-----------|-------|--------| -| `LH_BIN` | `1` | Buy It Now only | -| `LH_Auction` | `1` | Auctions only | -| `LH_ItemCondition` | see below | Filter by condition | -| `_sop` | see below | Sort order | -| `_pgn` | `2`, `3`, … | Page number (confirmed: returns ~65–88 items/page) | -| `_ipg` | `25`, `50`, `100`, `200` | Items per page (unconfirmed, standard eBay param) | - -### Condition Codes for `LH_ItemCondition` - -| Code | Label | -|------|-------| -| `1000` | New | -| `1500` | New Other (open box, no original packaging) | -| `2000` | Manufacturer Refurbished | -| `2500` | Seller Refurbished | -| `2750` | Like New | -| `3000` | Used | -| `4000` | Very Good | -| `5000` | Good | -| `6000` | Acceptable | -| `7000` | For parts or not working | - -### Sort Codes for `_sop` - -| Code | Sort Order | -|------|-----------| -| `1` | Best Match (default) | -| `10` | Ending Soonest | -| `12` | Newly Listed | -| `15` | Lowest Price + Shipping | -| `16` | Highest Price | - -### Item Detail URL - -``` -https://www.ebay.com/itm/{listing_id} -``` - -The listing ID is a plain integer (e.g. `167040158614`). 
Always strip query parameters -from extracted URLs — tracking params bloat the URL and are not needed for navigation. - -## Search Results: HTML Structure (No JSON-LD) - -**JSON-LD is absent on search results pages.** The listing data is embedded in HTML -with eBay-specific class names. The response is large (~1.5–1.8 MB uncompressed). - -### Card Structure - -Each result is an `
<li>` element with `data-listingid=`. Key elements within each card: - -| Data | Pattern | -|------|---------| -| Listing ID | `data-listingid=(\d+)` on the `
        2. ` | -| Item URL | `href=(https://(?:www\.)?ebay\.com/itm/(\d+))` | -| Title | `s-card__title` > `su-styled-text primary` > text | -| Current price | `class=price">\$([0-9,\.]+)<` | -| Original/list price | `strikethrough[^>]*>\$([0-9,\.]+)` | -| Image | `class=s-card__image[^>]*src=([^\s>]+)` | -| Alt title | `img[alt]` in the card (same as product title) | - -### Confirmed Extractor (field-tested, 60 items from a single search) - -```python -import re - -def extract_search_results(html): - """ - Parse eBay search results HTML into a list of dicts. - Returns [] if blocked or no results. - """ - if 'Pardon Our Interruption' in html or len(html) < 20_000: - return [] - - cards = re.split(r'(?=]+data-listingid=)', html) - results = [] - seen_ids = set() - - for card in cards[1:]: # skip preamble before first card - # Listing ID (dedup) - lid_m = re.search(r'data-listingid=(\d+)', card) - if not lid_m: - continue - listing_id = lid_m.group(1) - if listing_id in seen_ids: - continue - seen_ids.add(listing_id) - - # Item URL (clean, no tracking params) - url_m = re.search(r'href=(https://(?:www\.)?ebay\.com/itm/(\d+))', card) - item_url = url_m.group(1).split('?')[0] if url_m else None - - # Title from s-card__title - title_m = re.search(r's-card__title[^>]*>.*?primary[^>]*>([^<]+)', card, re.DOTALL) - title = title_m.group(1).strip() if title_m else None - - # Skip placeholder "Shop on eBay" stub cards - if not title or title == 'Shop on eBay': - continue - - # Current price - price_m = re.search(r'class=(?:["\'])?[a-z- ]*price["\']?>\$([0-9,\.]+)<', card) - if not price_m: - price_m = re.search(r'price">\$([0-9,\.]+)<', card) - price = '$' + price_m.group(1) if price_m else None - - # Original / list price (strikethrough — present when discounted) - orig_m = re.search(r'strikethrough[^>]*>\$([0-9,\.]+)', card) - original_price = '$' + orig_m.group(1) if orig_m else None - - # Thumbnail image URL - img_m = re.search(r'class=s-card__image[^>]*src=([^\s>]+)', card) - image = img_m.group(1) if img_m else None - - results.append({ - 'listing_id': listing_id, - 'url': item_url, - 'title': title, - 'price': price, - 'original_price': original_price, # None if not on sale - 'image': image, - }) - - return results -``` - -**Usage:** -```python -from helpers import http_get -import re - -HEADERS = { - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36", - "Accept-Language": "en-US,en;q=0.9", -} - -html = http_get("https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&LH_BIN=1&_sop=15", headers=HEADERS) -items = extract_search_results(html) -print(f"{len(items)} items") -for item in items[:5]: - print(f" {item['listing_id']} | {item['title'][:50]} | {item['price']}") -# Output (confirmed): 60 items -# 168219240588 | One Plus Keyboard 81 Pro Winter Bonfire Mecha... | $159.00 -# 167461643107 | Logitech 920-012869 G515 TKL Wired Low Profil... | $49.99 -# 167040158614 | Logitech - PRO X TKL LIGHTSPEED Wireless Mech... | $74.99 -``` - -## Item Detail Pages: JSON-LD (Reliable) - -Item detail pages at `/itm/{id}` serve **two JSON-LD blocks**: `BreadcrumbList` and `Product`. -The `Product` schema is the most useful — it contains price, condition, availability, brand, images, and return policy. - -```python -import re, json - -def extract_item_detail(html): - """ - Extract structured data from an eBay item page. - Returns dict or None if blocked. 
- """ - if 'Pardon Our Interruption' in html: - return None - - ld_blocks = re.findall(r'application/ld\+json[^>]*>(.*?)', html, re.DOTALL) - product = None - breadcrumbs = [] - - for ld_str in ld_blocks: - try: - d = json.loads(ld_str.strip()) - except Exception: - continue - - if d.get('@type') == 'Product': - product = d - elif d.get('@type') == 'BreadcrumbList': - breadcrumbs = [i.get('name') for i in d.get('itemListElement', [])] - - if not product: - return None - - offers = product.get('offers', {}) - if isinstance(offers, list): - offers = offers[0] - - # Schema.org condition URL -> human label - CONDITION_MAP = { - 'NewCondition': 'New', - 'UsedCondition': 'Used', - 'RefurbishedCondition': 'Refurbished', - 'DamagedCondition': 'For Parts / Not Working', - 'LikeNewCondition': 'Like New', - 'VeryGoodCondition': 'Very Good', - 'GoodCondition': 'Good', - 'AcceptableCondition': 'Acceptable', - } - cond_url = offers.get('itemCondition', '') - cond_key = cond_url.split('/')[-1] # e.g. "RefurbishedCondition" - condition = CONDITION_MAP.get(cond_key, cond_key) - - # List price from priceSpecification (only present when there's a "was" price) - price_spec = offers.get('priceSpecification', {}) - list_price = price_spec.get('price') if price_spec.get('name') == 'List Price' else None - - # Shipping (first destination) - shipping_details = offers.get('shippingDetails', []) - if shipping_details: - shipping_val = shipping_details[0].get('shippingRate', {}).get('value', '') - shipping = 'Free' if str(shipping_val) in ('0', '0.0') else f"${shipping_val}" - else: - shipping = None - - # Return policy - return_policies = offers.get('hasMerchantReturnPolicy', []) - return_days = return_policies[0].get('merchantReturnDays') if return_policies else None - - return { - 'listing_id': offers.get('url', '').split('/itm/')[-1], - 'name': product.get('name'), - 'brand': product.get('brand', {}).get('name') if isinstance(product.get('brand'), dict) else product.get('brand'), - 'price': offers.get('price'), - 'list_price': list_price, # was-price, None if no discount shown - 'currency': offers.get('priceCurrency'), - 'availability': offers.get('availability', '').split('/')[-1], # e.g. "InStock" - 'condition': condition, - 'condition_url': cond_url, - 'shipping': shipping, - 'return_days': return_days, - 'images': product.get('image', []), - 'gtin13': product.get('gtin13'), - 'mpn': product.get('mpn'), - 'color': product.get('color'), - 'breadcrumbs': breadcrumbs, - } -``` - -**Field-tested on item 167040158614:** -```python -html = http_get("https://www.ebay.com/itm/167040158614", headers=HEADERS) -detail = extract_item_detail(html) -# { -# 'listing_id': '167040158614', -# 'name': 'Logitech - PRO X TKL LIGHTSPEED Wireless Mechanical Gaming Keyboard - 920-012118', -# 'brand': 'Logitech', -# 'price': 74.99, -# 'list_price': '219.99', -# 'currency': 'USD', -# 'availability': 'InStock', -# 'condition': 'Refurbished', -# 'shipping': 'Free', -# 'return_days': 30, -# 'images': ['https://i.ebayimg.com/images/g/vwsAAeSwEcFpw~hW/s-l1600.jpg', ...], # 5 images -# 'gtin13': '097855189066', -# 'mpn': '920-012118', -# 'color': 'Black', -# 'breadcrumbs': ['eBay', 'Electronics', 'Computers/Tablets & Networking', ...], -# } -``` - -### Item Specifics from `ux-textspans` (complementary to JSON-LD) - -The `ux-textspans` elements in item pages contain additional data not in JSON-LD, -including seller name, feedback %, items sold, detailed condition text, and all item specifics. 
- -```python -import re - -def extract_ux_textspans(html): - """Return list of all ux-textspans text values from an item page.""" - return [m.group(1) for m in re.finditer(r'ux-textspans[^>]*>([^<]+)', html)] - -# From item 167040158614 (confirmed): -# Index [3] -> item title -# Index [4] -> subtitle / seller tagline -# Index [5] -> seller name ("Logitech") -# Index [6] -> seller feedback count ("(20742)") -# Index [7] -> seller feedback % ("99.6% positive") -# Index [10] -> current price ("US $74.99") -# Index [12] -> list price ("US $219.99") -# Index [33] -> condition label ("Excellent - Refurbished") -# Index [36] -> quantity sold ("45 sold") -# Pairs from [105] onward: item specifics as label/value pairs -``` - -## Pagination - -Use `_pgn=N` (confirmed working, returns ~65–88 items per page): -```python -for page in range(1, 4): - url = f"https://www.ebay.com/sch/i.html?_nkw=laptop&LH_BIN=1&_sop=15&_pgn={page}" - html = http_get(url, headers=HEADERS) - if is_blocked(html): - break - items = extract_search_results(html) - print(f"Page {page}: {len(items)} items") - # IMPORTANT: add delay between pages to avoid bot detection - time.sleep(3) -``` - -**Rate-limit safe pattern**: 3–5 second delay between requests. Beyond ~10 rapid requests -in a session, eBay returns "Pardon Our Interruption" for all subsequent requests from that IP. - -## APIs (All Require Auth or Are Dead) - -| API | Status | Notes | -|-----|--------|-------| -| Finding API (svcs.ebay.com) | **Dead** — HTTP 500 | Was free/JSONP, no longer works | -| Browse API (api.ebay.com) | **Requires OAuth** — HTTP 400 | Needs eBay developer account + token | -| Shopping API (open.api.ebay.com) | **Requires token** | Returns `"Token not available"` error | -| RSS feed (`_rss=1`) | **Blocked same as HTML** | Returns "Pardon Our Interruption" when rate-limited | - -**Bottom line**: There is no public unauthenticated eBay API in 2026. Use HTML scraping. - -## Practical Workflow - -### Scrape a search and follow top items - -```python -import re, json, time -from helpers import http_get - -HEADERS = { - "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36", - "Accept-Language": "en-US,en;q=0.9", -} - -def is_blocked(html): - return 'Pardon Our Interruption' in html or len(html) < 20_000 - -# Step 1: Search -html = http_get( - "https://www.ebay.com/sch/i.html?_nkw=mechanical+keyboard&LH_BIN=1&_sop=15&LH_ItemCondition=1000", - headers=HEADERS -) -if is_blocked(html): - raise RuntimeError("Rate limited — wait 60-120s and retry") - -items = extract_search_results(html) -print(f"Found {len(items)} items") - -# Step 2: Fetch details for top results (with delay) -details = [] -for item in items[:5]: - time.sleep(3) - detail_html = http_get(item['url'], headers=HEADERS) - if is_blocked(detail_html): - print(f"Blocked on item {item['listing_id']}, stopping") - break - detail = extract_item_detail(detail_html) - if detail: - details.append(detail) - print(f" {detail['name'][:50]} | {detail['price']} {detail['currency']} | {detail['condition']}") -``` - -## Gotchas - -- **"Pardon Our Interruption" is not a CAPTCHA** — it's eBay's bot-detection interstitial. It doesn't require solving — just wait and back off. `'captcha'` does NOT appear in the blocked page. - -- **No JSON-LD on search results** — The `application/ld+json` blocks that Amazon and other sites embed are absent from eBay search pages. Parse the HTML using regex on `s-card` class names. 
- -- **JSON-LD IS on item pages** — Two blocks: `BreadcrumbList` and `Product`. The `Product` block is authoritative. Use the regex `r'application/ld\+json[^>]*>(.*?)'` (note the `[^>]*` before `>` — eBay doesn't use `type="..."` quote style consistently in all contexts). - -- **Duplicate listing IDs in the HTML** — Each card's listing ID appears 2–3 times (image link, title link, watch button). Always deduplicate using a `seen_ids` set when splitting on `data-listingid`. - -- **Placeholder cards ("Shop on eBay")** — The first card slot may be a promoted/placeholder card with title `"Shop on eBay"` and listing ID `"123456"`. Filter these out. - -- **Item URLs have tracking params** — Raw extracted URLs look like `https://www.ebay.com/itm/167040158614?_skw=...&epid=...&hash=...&itmprp=...`. Always strip to `itm/{id}` with `.split('?')[0]`. - -- **`www.ebay.com` vs `ebay.com`** — Some item URLs in search results omit `www.`. Normalize with `url.replace('//ebay.com/', '//www.ebay.com/')`. - -- **Search response is large** — Uncompressed HTML is 1.5–1.8 MB per page. The `http_get` helper handles gzip transparently, so the actual transfer is much smaller, but parsing a 1.8 MB string is slow. Use `re.split` on card boundaries rather than an HTML parser for speed. - -- **`_sop` sort and `LH_ItemCondition` require full browser-like UA** — Requests with just `"Mozilla/5.0"` (minimal UA) return empty results for these parameters more quickly than full Chrome UA. Always use the full UA string. - -- **Condition in JSON-LD is a schema.org URL** — `offers.itemCondition` returns `"https://schema.org/RefurbishedCondition"`, not a human label. Split on `/` and map the last segment using `CONDITION_MAP` (see `extract_item_detail` above). - -- **`list_price` only present when discounted** — `offers.priceSpecification` only appears in JSON-LD when eBay shows a "List Price" comparison. Check `price_spec.get('name') == 'List Price'` before using. - -- **Seller data is NOT in JSON-LD** — `d.get('seller')` returns `None` on item pages. The seller name, feedback %, and items sold count are only in `ux-textspans` elements in the HTML body. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/etsy/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/etsy/scraping.md deleted file mode 100644 index 8b63370e7..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/etsy/scraping.md +++ /dev/null @@ -1,506 +0,0 @@ -# Etsy — Scraping & Data Extraction - -Field-tested against `www.etsy.com` on 2026-04-18 using `http_get` (no browser) and direct `urllib` probes. - -## Quick summary - -**`http_get` does NOT work on Etsy.** Every page type — search, listing, shop, category, market — returns HTTP 403 with DataDome bot protection. This is not negotiable: no header combination, User-Agent string, or cookie replay bypasses it. Etsy requires a real browser with JavaScript execution. - -- **All HTML pages (`/search`, `/listing/`, `/shop/`, `/c/`, `/market/`)** — HTTP 403, `Server: DataDome` -- **Official Etsy API v3 (`openapi.etsy.com/v3/`)** — requires a registered API key; returns JSON -- **`robots.txt`** — HTTP 200, plain text, no DataDome -- **Browser (Chrome CDP)** — works; Etsy is a React SPA with JSON-LD and `__NEXT_DATA__` embedded in SSR HTML - ---- - -## Bot detection: DataDome - -Etsy uses [DataDome](https://datadome.co/) for every user-facing HTML endpoint. 
- -### What you receive - -``` -HTTP 403 Forbidden -Server: DataDome -X-DataDome: protected -X-DataDome-riskscore: 0.14–0.95 (varies per request) -X-DD-B: 2 -Content-Type: text/html;charset=utf-8 -Set-Cookie: datadome=; Max-Age=31536000; Domain=.etsy.com; Secure; SameSite=Lax -``` - -Body (816 bytes — a JavaScript challenge, not a hard block): - -```html -etsy.com... - -

          Please enable JS and disable any ad blocker
          <!-- inline DataDome challenge script not reproduced here; see the 'rt' note below -->
          - - - -``` - -`'rt':'c'` means **challenge** (browser must run JS at `geo.captcha-delivery.com` to get a valid `datadome` cookie). `'rt':'b'` would be a hard block; `'rt':'i'` an interstitial. All tested requests returned `'rt':'c'` — the JS challenge variant. - -### What was tested (all 403) - -| URL pattern | Status | DataDome | -|---|---|---| -| `/search?q=handmade+candle&explicit=1` | **403** | JS challenge | -| `/search?q=handmade+candle&explicit=1&page=2` | **403** | JS challenge | -| `/listing/{id}/{slug}` | **403** | JS challenge | -| `/shop/{ShopName}` | **403** | JS challenge | -| `/c/home-living/candles-holders/candles` | **403** | JS challenge | -| `/market/handmade_candle` | **403** | JS challenge | - -### User-Agents tested (all blocked) - -- `Mozilla/5.0 (Macintosh; ...) Chrome/120` — **403** -- `facebookexternalhit/1.1` — **403** -- `Twitterbot/1.0` — **403** -- `LinkedInBot/1.0` — **403** -- `ia_archiver` — **403** -- `curl/7.68.0` — **403** -- `python-requests/2.28.0` — **403** -- `Googlebot/2.1` — **429** (rate-limited, different path) -- `Mozilla/5.0` (http_get default) — **403** - -**Conclusion**: No UA bypasses DataDome. The challenge requires TLS fingerprinting + JS execution that only a real browser provides. - ---- - -## What works without a browser - -### `robots.txt` (200 OK) - -```python -from helpers import http_get -text = http_get("https://www.etsy.com/robots.txt") -# Returns 51 KB plain-text file — no DataDome -``` - -The robots.txt reveals URL structure, disallowed parameters, and allowed paths. Etsy disallows `/search?*q=` (no-empty-q searches) and faceted search params (`attr_*`, `price_bucket`, `ship_to`, `search_type`). Basic search with `?q=keyword` is not explicitly disallowed by robots but is blocked by DataDome in practice. - -### Official Etsy API v3 (requires API key) - -The `openapi.etsy.com/v3/` endpoint is NOT DataDome-protected. It returns structured JSON but requires a free API key from [developer.etsy.com](https://developer.etsy.com/). - -```python -import json -from helpers import http_get - -API_KEY = "your_key_here" # from developer.etsy.com - -def etsy_api(path, **params): - from urllib.parse import urlencode - qs = urlencode(params) - url = f"https://openapi.etsy.com/v3/application/{path}?{qs}" - data = http_get(url, headers={"x-api-key": API_KEY}) - return json.loads(data) - -# Search listings -results = etsy_api("listings/active", limit=25, keywords="handmade candle", - sort_on="created", sort_order="desc") -# results['results'] is a list of listing dicts -# results['count'] is total match count - -# Get a single listing -listing = etsy_api("listings/1234567890") - -# Get all listings for a shop -shop_listings = etsy_api("shops/CandlesByNature/listings/active", limit=100) - -# Get shop info -shop = etsy_api("shops/CandlesByNature") -``` - -Error without a key: -``` -HTTP 403: {"error": "Invalid API key: should be in the format 'keystring:shared_secret'."} -``` - -Error with wrong key: -``` -HTTP 403: {"error": "API key not found or not active, or incorrect shared secret for API key."} -``` - -### API v3 key data fields - -``` -listings/active response: - results[i].listing_id → int (e.g. 1234567890) - results[i].title → string - results[i].description → string (full HTML, may be truncated by API) - results[i].price.amount → int (in currency subunit, e.g. 
2599 = $25.99) - results[i].price.divisor → int (100 for USD) - results[i].price.currency_code → "USD" - results[i].quantity → int (stock remaining) - results[i].tags → [string] (up to 13 tags) - results[i].materials → [string] - results[i].shipping_profile_id → int - results[i].shop_id → int - results[i].url → "https://www.etsy.com/listing/..." - results[i].views → int - results[i].num_favorers → int - results[i].featured_rank → int (-1 if not featured) - results[i].is_digital → bool - results[i].has_variations → bool - results[i].taxonomy_id → int (category) - results[i].state → "active" | "draft" | "expired" | "sold_out" - results[i].creation_timestamp → unix int - results[i].last_modified_timestamp → unix int -``` - ---- - -## Browser-based scraping (required for HTML data) - -Since http_get is blocked, all HTML scraping requires the Chrome browser via CDP. - -### Navigation pattern - -```python -from helpers import goto, wait_for_load, wait, js, new_tab - -# Always use new_tab() for the first Etsy navigation in a session -tid = new_tab("https://www.etsy.com/search?q=handmade+candle&explicit=1") -wait_for_load() -wait(3) # Etsy React SPA needs extra time after readyState=complete -``` - -### Search URL construction - -``` -https://www.etsy.com/search?q={query}&explicit=1 -``` - -Parameters: -- `q` — search query (URL-encoded, spaces as `+`) -- `explicit=1` — disables the "adult content" NSFW filter (safe to include always) -- `page=2`, `page=3` — pagination (confirmed from robots.txt URL patterns) -- `min_price=10.00&max_price=50.00` — price range filter -- `order=price_asc` / `order=price_desc` / `order=most_relevant` (default) / `order=newest` -- `ship_to=US` — filter by shipping destination (CAUTION: disallowed by robots.txt, use only with browser) -- `listing_type=handmade` / `listing_type=vintage` / `listing_type=supplies` - -**Disallowed URL params** (per robots.txt — avoid in automated crawls): -- `attr_*=*` — attribute filters -- `price_bucket=*` — price bucket filter -- `ship_to=*` — shipping destination -- `search_type=*` — search type - -### Search results extraction (browser) - -Etsy renders results as a React SPA. The listing cards use data attributes and consistent class patterns: - -```python -results = js(""" - Array.from(document.querySelectorAll('[data-listing-id]')).map(el => ({ - listing_id: el.getAttribute('data-listing-id'), - title: el.querySelector('h3, [class*="listing-link"]')?.innerText?.trim() - || el.querySelector('h2')?.innerText?.trim(), - price: el.querySelector('[class*="currency-value"]')?.innerText?.trim() - || el.querySelector('.currency-value')?.innerText?.trim(), - shop: el.querySelector('[class*="shop-name"], [data-shop-name]')?.innerText?.trim(), - url: el.querySelector('a[href*="/listing/"]')?.href, - thumbnail: el.querySelector('img[src*="etsystatic"]')?.src, - is_ad: !!el.querySelector('[class*="ad-label"], [class*="sponsored"]') - })).filter(r => r.listing_id) -""") -``` - -**Alternative — JSON-LD ItemList** (more reliable than DOM selectors): - -Etsy's SSR HTML embeds a `', html, re.DOTALL) -for block in ld_blocks: - parsed = json.loads(block) - if isinstance(parsed, dict) and parsed.get('@type') == 'ItemList': - for item in parsed['itemListElement']: - ev = item['item'] - print(ev['name'], ev['startDate'], ev['url']) - break -# Returns 18–40 events per page -``` - -**For a single event, fetch the detail page and extract the `Event` JSON-LD block.** It contains all fields including `offers` (pricing). 
There is also a richer `__NEXT_DATA__` block if you need venue coordinates, refund policy, or sales status. - -## URL structure - -### Search / listing pages - -``` -https://www.eventbrite.com/d/{location}/{category}/ -https://www.eventbrite.com/d/{location}/{category}/?page=2 -https://www.eventbrite.com/d/{location}/{category}/?start_date=2026-05-01&end_date=2026-05-31 -``` - -**Location format:** `{state-abbreviation}--{city}` (lowercase, hyphens for spaces) -- `ca--san-francisco` -- `ny--new-york` -- `ca--los-angeles` -- Use `online` for virtual events - -**Category slugs (confirmed working):** -- `tech` — Technology events -- `music` — Music -- `food--drink` — Food & Drink -- `health` — Health & Wellness -- `sports--fitness` — Sports & Fitness -- `arts--entertainment` — Arts & Entertainment -- `family--education` — Family & Education -- `business--professional` — Business & Networking -- `science--tech` — Science & Technology -- `community--culture` — Community & Culture -- `networking` — Networking -- `events` — All events (broadest, returns ~40/page) - -**Filter slugs (replace category):** -- `free--events` — Free events only -- `events--today` — Today -- `events--tomorrow` — Tomorrow -- `events--this-weekend` — This weekend - -**Query params:** -- `?page=N` — Pagination (page 2+ confirmed working, each returns 18–20 events) -- `?start_date=YYYY-MM-DD&end_date=YYYY-MM-DD` — Date range filter (confirmed, narrows results) - -### Event detail pages - -``` -https://www.eventbrite.com/e/{slug}-tickets-{event_id} -``` - -Example: `https://www.eventbrite.com/e/icontact-the-tactile-tech-opera-tickets-1982861003639` - -- `event_id` is a numeric string (10–13 digits) -- Extract with: `re.search(r'-tickets-(\d+)$', url).group(1)` -- Extract slug with: `re.search(r'/e/(.+)-tickets-\d+$', url).group(1)` - -Other TLDs (`.ca`, `.co.uk`, etc.) use the same structure — event IDs are globally unique across TLDs. 
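
Pulling those URL rules together, here is a small builder for listing URLs plus the slug/ID split from an `/e/` URL (a sketch: `listing_url` and `parse_event_url` are arbitrary names, the city and category values are examples, and combining `page=` with the date filters is assumed to behave the same as either filter alone):

```python
import re
from urllib.parse import urlencode

def listing_url(city_slug, category="events", page=1, start_date=None, end_date=None):
    """Build a /d/{location}/{category}/ listing URL, e.g. listing_url('ca--san-francisco', 'tech', 2)."""
    params = {"page": page}
    if start_date and end_date:
        params.update(start_date=start_date, end_date=end_date)
    return f"https://www.eventbrite.com/d/{city_slug}/{category}/?{urlencode(params)}"

def parse_event_url(url):
    """Split an /e/{slug}-tickets-{id} URL into (slug, event_id); (None, None) if it doesn't match."""
    m = re.search(r'/e/(.+)-tickets-(\d+)$', url)
    return (m.group(1), m.group(2)) if m else (None, None)

print(listing_url("ca--san-francisco", "tech", page=2))
print(parse_event_url("https://www.eventbrite.com/e/icontact-the-tactile-tech-opera-tickets-1982861003639"))
# -> ('icontact-the-tactile-tech-opera', '1982861003639')
```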
- -## Listing page: JSON-LD `ItemList` schema - -The first `', html, re.DOTALL) -event_data = None -for block in ld_blocks: - parsed = json.loads(block) - if isinstance(parsed, dict) and parsed.get('@type') in ('Event', 'BusinessEvent', 'MusicEvent', 'EducationEvent'): - event_data = parsed - break - -print(event_data['name']) # "iContact the tactile tech opera" -print(event_data['startDate']) # "2026-06-21T17:05:00-07:00" (ISO 8601 with TZ) -print(event_data['endDate']) # "2026-06-21T20:08:00-07:00" -print(event_data['eventStatus']) # "https://schema.org/EventScheduled" -print(event_data['eventAttendanceMode']) # "https://schema.org/OfflineEventAttendanceMode" -print(event_data['location']['name']) # "Little Boxes Theater" -print(event_data['location']['address']['streetAddress']) # "94107 1661 Tennessee Street, San Francisco, CA 94107" -print(event_data['organizer']['name']) # "Beth McNamara" -print(event_data['organizer']['url']) # "https://www.eventbrite.com/o/beth-mcnamara-120755148166" -``` - -Full confirmed schema on detail page: -``` -name str Event title -description str Short summary -url str Canonical event URL -image str Event banner image URL -startDate str ISO 8601 with timezone offset -endDate str ISO 8601 with timezone offset -eventStatus str URI: EventScheduled / EventCancelled / EventPostponed -eventAttendanceMode str URI: OfflineEventAttendanceMode / OnlineEventAttendanceMode / MixedEventAttendanceMode -location.@type str "Place" (in-person) or "VirtualLocation" (online) -location.name str Venue name -location.address.streetAddress str -location.address.addressLocality str City -location.address.addressRegion str State abbreviation -location.address.addressCountry str Country code -organizer.name str Organizer display name -organizer.url str Organizer profile URL -offers list AggregateOffer object(s) -``` - -### Offers / pricing - -```python -offers = event_data.get('offers', []) -if offers: - offer = offers[0] # always a list; typically one AggregateOffer - print(offer['@type']) # "AggregateOffer" - print(offer['lowPrice']) # "50.0" (string, not float) - print(offer['highPrice']) # "50.0" - print(offer['priceCurrency']) # "USD" - print(offer['availability']) # "InStock" / "SoldOut" - print(offer['availabilityStarts']) # ISO 8601 UTC - print(offer['availabilityEnds']) # ISO 8601 UTC - -# Free events: lowPrice="0.0", highPrice="0.0" -# Free check: float(offer['lowPrice']) == 0.0 -``` - -`@type` on the event itself varies by format (all scrape identically): -- `Event` — general -- `BusinessEvent` — networking, professional -- `MusicEvent` — concerts -- `EducationEvent` — classes, workshops - -## Detail page: `__NEXT_DATA__` (richer structured data) - -Every event detail page embeds a `', html, re.DOTALL) -nd = json.loads(nextjs.group(1)) -context = nd['props']['pageProps']['context'] - -bi = context['basicInfo'] -print(bi['id']) # "1982861003639" (event ID string) -print(bi['name']) # event title -print(bi['isFree']) # bool -print(bi['isOnline']) # bool -print(bi['currency']) # "USD" -print(bi['status']) # "live" / "completed" / "canceled" -print(bi['organizationId']) # numeric string -print(bi['formatId']) # numeric string (event format category) -print(bi['isProtected']) # bool — password-protected events -print(bi['isSeries']) # bool — recurring series -print(bi['created']) # ISO 8601 UTC creation timestamp - -# Venue with coordinates -venue = bi['venue'] -print(venue['name']) # "Little Boxes Theater" -print(venue['address']['city']) # "San Francisco" 
-print(venue['address']['region']) # "CA" -print(venue['address']['latitude']) # "37.7508806" -print(venue['address']['longitude']) # "-122.3881427" -print(venue['address']['localizedMultiLineAddressDisplay']) # list of strings - -# Organizer details -org = bi['organizer'] -print(org['name']) # "Beth McNamara" -print(org['url']) # organizer profile URL -print(org['numEvents']) # int -print(org['verified']) # bool - -# Sales status -ss = context['salesStatus'] -print(ss['salesStatus']) # "on_sale" / "sold_out" / "sales_ended" -print(ss['startSalesDate']['local']) # local datetime string - -# Good to know -gtk = context['goodToKnow']['highlights'] -print(gtk['ageRestriction']) # "18+" or null -print(gtk['durationInMinutes']) # int (e.g. 183) -print(gtk['doorTime']) # local datetime string or null -print(gtk['locationType']) # "in_person" or "online" - -# Refund policy -refund = context['goodToKnow']['refundPolicy'] -print(refund['policyType']) # "custom" / "no_refunds" / "standard" -print(refund['isRefundAllowed']) # bool -print(refund['validDays']) # int or null - -# Full event description (HTML) -for module in context['structuredContent']['modules']: - if module['type'] == 'text': - print(module['text']) # raw HTML, may need BeautifulSoup to strip tags -``` - -## Complete workflow: scrape events from a category - -```python -import re, json - -def get_events_from_listing(location, category, page=1): - """Returns list of event dicts with name, url, startDate, endDate, location.""" - headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"} - url = f"https://www.eventbrite.com/d/{location}/{category}/?page={page}" - html = http_get(url, headers=headers) - ld_blocks = re.findall(r'', html, re.DOTALL) - for block in ld_blocks: - parsed = json.loads(block) - if isinstance(parsed, dict) and parsed.get('@type') == 'ItemList': - return [item['item'] for item in parsed.get('itemListElement', [])] - return [] - -def get_event_detail(event_url): - """Returns full Event JSON-LD + NEXT_DATA context for a single event.""" - headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"} - html = http_get(event_url, headers=headers) - - # JSON-LD Event block - ld_blocks = re.findall(r'', html, re.DOTALL) - event_ld = None - for block in ld_blocks: - parsed = json.loads(block) - if isinstance(parsed, dict) and parsed.get('@type') in ('Event', 'BusinessEvent', 'MusicEvent', 'EducationEvent'): - event_ld = parsed - break - - # NEXT_DATA context - nextjs = re.search(r'', html, re.DOTALL) - context = None - if nextjs: - nd = json.loads(nextjs.group(1)) - context = nd['props']['pageProps']['context'] - - return event_ld, context - -# Usage -events = get_events_from_listing("ca--san-francisco", "tech", page=1) -print(f"Found {len(events)} events") # 18–20 typical - -for ev in events[:3]: - print(ev['name'], ev['startDate'], ev['url']) - -# Deep-fetch one event -ld, ctx = get_event_detail(events[0]['url']) -if ld and ld.get('offers'): - price = float(ld['offers'][0]['lowPrice']) - currency = ld['offers'][0]['priceCurrency'] - print(f"Price: {price} {currency}") # 0.0 USD (free) or e.g. 
50.0 USD -``` - -## Public API: requires auth - -The Eventbrite REST API (`https://www.eventbriteapi.com/v3/`) requires an OAuth token for all endpoints: - -- `GET /v3/events/{id}/` — HTTP 401 without auth -- `GET /v3/events/search/` — HTTP 404 (endpoint changed; auth also required) - -**Use HTML scraping instead** — the JSON-LD and `__NEXT_DATA__` data is equivalent to the API response and requires no credentials. - -If you have a token (`EVENTBRITE_TOKEN`): -```python -import os -token = os.environ.get('EVENTBRITE_TOKEN') -headers = { - "User-Agent": "Mozilla/5.0", - "Authorization": f"Bearer {token}" -} -data = json.loads(http_get(f"https://www.eventbriteapi.com/v3/events/{event_id}/", headers=headers)) -``` - -## Gotchas - -- **Event URLs in the HTML use relative `/e/` paths, not absolute URLs** — Search listing HTML contains `/e/slug-tickets-id?aff=...` relative paths (with tracking params). Extract event URLs from the JSON-LD `ItemList` instead — they are absolute, clean URLs without tracking params. - -- **`re.findall(r'href="https://www.eventbrite.com/e/...')` returns 0 results** — Confirmed: event cards in the HTML do not have `https://www.eventbrite.com/e/` in href attributes. Use JSON-LD extraction only. - -- **`__SERVER_DATA__` does not exist** — Both search and detail pages were checked. There is no `window.__SERVER_DATA__` or `window.__redux_state__`. The embedded data is in `', html, re.DOTALL) - ap = json.loads(nd.group(1))['props']['pageProps']['apolloState'] - - # The primary Book entity matches the URL's legacy ID - book = next(v for v in ap.values() - if v.get('__typename') == 'Book' and v.get('legacyId') == int(book_id)) - work = next((v for v in ap.values() if v.get('__typename') == 'Work'), {}) - author_ref = book['primaryContributorEdge']['node']['__ref'] - author = ap.get(author_ref, {}) - - stats = work.get('stats', {}) - work_details = work.get('details', {}) - book_details = book.get('details', {}) - - return { - 'title': book['title'], - 'title_complete': book['titleComplete'], - 'book_id': book['legacyId'], - 'url': book['webUrl'], - 'cover_url': book['imageUrl'], - # Strip HTML tags from description - 'description': re.sub(r'<[^>]+>', '', book.get('description({"stripped":true})', - book.get('description', ''))).strip(), - 'genres': [g['genre']['name'] for g in book.get('bookGenres', [])], - 'series': [{'name': s['series']['title'], 'position': s.get('userPosition')} - for s in book.get('bookSeries', [])], - # Author - 'author_name': author.get('name'), - 'author_url': author.get('webUrl'), - # Edition details - 'format': book_details.get('format'), - 'num_pages': book_details.get('numPages'), - 'publisher': book_details.get('publisher'), - 'language': (book_details.get('language') or {}).get('name'), - 'isbn': book_details.get('isbn'), - 'isbn13': book_details.get('isbn13'), - 'pub_timestamp_ms': book_details.get('publicationTime'), - # Ratings (from Work, not Book) - 'avg_rating': stats.get('averageRating'), - 'ratings_count': stats.get('ratingsCount'), - 'text_reviews': stats.get('textReviewsCount'), - # ratings_dist is list of counts for [1-star, 2-star, 3-star, 4-star, 5-star] - 'ratings_dist': stats.get('ratingsCountDist'), - # Awards - 'awards': [a['name'] + (' — ' + a['category'] if a.get('category') else '') - for a in work_details.get('awardsWon', [])], - } - -# Example -book = parse_book(149267) # The Stand by Stephen King -# book['title'] => "The Stand" -# book['avg_rating'] => 4.35 -# book['ratings_count']=> 845591 -# book['genres'] => ["Horror", 
"Fiction", "Fantasy", ...] -# book['awards'] => ["Locus Award — Best SF Novel", ...] -``` - -**Field notes:** -- `book['legacyId']` is the integer in the URL (e.g. `149267`). Use it to match the correct entity — the `apolloState` often contains 2-3 Book entries for different editions. -- Ratings and awards live in the `Work` entity, not `Book`. The `Work` is always `__typename == 'Work'`. -- `description` comes in two forms: `description` (HTML) and `description({"stripped":true})` (plain text). Prefer the stripped version. -- `pub_timestamp_ms` is a Unix timestamp in **milliseconds**. Convert: `datetime.fromtimestamp(ts/1000)`. -- `isbn` / `isbn13` are often `null` on older editions — the JSON-LD path (below) is no more reliable. - ---- - -## Book Page — Fast Path (JSON-LD) - -Use when you only need title, author, rating, page count, and awards. ~3× less parsing code. - -```python -import re, json -from helpers import http_get - -def parse_book_fast(book_id): - html = http_get(f"https://www.goodreads.com/book/show/{book_id}") - blocks = re.findall(r'', html, re.DOTALL) - if not blocks: - return None - ld = json.loads(blocks[0]) - return { - 'title': ld.get('name'), - 'author': ld['author'][0]['name'] if ld.get('author') else None, - 'avg_rating': ld.get('aggregateRating', {}).get('ratingValue'), - 'ratings_count':ld.get('aggregateRating', {}).get('ratingCount'), - 'review_count': ld.get('aggregateRating', {}).get('reviewCount'), - 'num_pages': ld.get('numberOfPages'), - 'isbn': ld.get('isbn'), - 'cover_url': ld.get('image'), - 'awards': ld.get('awards'), # single string, comma-separated - 'format': ld.get('bookFormat'), - } - -book = parse_book_fast(149267) -# book['avg_rating'] => 4.35 -# book['ratings_count']=> 845591 -``` - -**JSON-LD does NOT include:** description, genres, series membership, per-star rating distribution, publisher, language. -Use `parse_book()` (the `__NEXT_DATA__` path) when you need any of those. - ---- - -## Search Results - -URL: `https://www.goodreads.com/search?q={query}&search_type=books&page={n}` - -Search uses server-rendered HTML with schema.org microdata `` rows. No `__NEXT_DATA__`. - -```python -import re, json -from helpers import http_get - -def search_books(query, page=1): - from urllib.parse import quote_plus - url = f"https://www.goodreads.com/search?q={quote_plus(query)}&search_type=books&page={page}" - html = http_get(url) - - rows = re.findall( - r'(.*?)', - html, re.DOTALL - ) - - results = [] - for row in rows: - bid = re.search(r'
          ', row) - title = re.search(r"itemprop='name'[^>]*>([^<]+)", row) - author = re.search(r'class="authorName"[^>]*>]*>([^<]+)', row) - avg = re.search(r'(\d+\.\d+)\s*avg rating', row) - cnt = re.search(r'(\d[\d,]*)\s*rating', row) - cover = re.search(r'img alt="[^"]*" class="bookCover"[^>]*src="([^"]+)"', row) - if not (bid and title): - continue - results.append({ - 'book_id': bid.group(1), - 'title': title.group(1).strip(), - 'author': author.group(1).strip() if author else None, - 'avg_rating': float(avg.group(1)) if avg else None, - 'ratings_count':cnt.group(1).replace(',', '') if cnt else None, - 'cover_url': cover.group(1) if cover else None, - 'url': f"https://www.goodreads.com/book/show/{bid.group(1)}", - }) - - total_m = re.search(r'([\d,]+)\s+results', html) - total = int(total_m.group(1).replace(',', '')) if total_m else None - - return {'total': total, 'page': page, 'results': results} - -# Example -r = search_books("dune") -# r['total'] => 101026 -# r['results'] => [{'book_id':'44767458', 'title':'Dune (Dune, #1)', 'avg_rating':4.29, ...}, ...] -``` - -**Field notes:** -- Returns exactly 20 results per page. -- `total` is the result count shown in `"N results for…"` header. -- The `avg rating` regex uses `—` (HTML entity) in the raw HTML — the pattern above matches the decoded text. -- `ratings_count` regex hits the first occurrence of `\d+ rating` in the row, which is always the book's count (not a user review count). -- `cover_url` is a 75px thumbnail (`._SY75_.jpg`). Swap `_SY75_` → `_SX315_` for a larger image. - ---- - -## Author Page - -URL: `https://www.goodreads.com/author/show/{author_id}.{Slug}` - -Author pages are **not** Next.js — they use classic server-rendered HTML with OG meta tags and microdata. -The author ID and slug can be obtained from a book's `author_url` field. - -```python -import re, json -from helpers import http_get - -def parse_author(author_id_and_slug): - # author_id_and_slug e.g. "58.Frank_Patrick_Herbert" - html = http_get(f"https://www.goodreads.com/author/show/{author_id_and_slug}") - - # Name and basic info from OG/meta tags - name = re.search(r"", html) - img = re.search(r"", html) - website = re.search(r"Website\s*
          \s*]*>\s*]*href=\"([^\"]+)\"", html) - - # Full biography from hidden span (shown/hidden by "...more" toggle in browser) - bio_span = re.search( - r']*>(.*?)', - html, re.DOTALL - ) - bio = re.sub(r'<[^>]+>', '', bio_span.group(1)).strip() if bio_span else None - - # Top books listed on the page (10 rows, same microdata format as search) - rows = re.findall( - r'(.*?)', - html, re.DOTALL - ) - books = [] - for row in rows: - bid = re.search(r'
          ', row) - title = re.search(r"itemprop='name'[^>]*>([^<]+)", row) - avg = re.search(r'(\d+\.\d+)\s*avg rating', row) - cnt = re.search(r'(\d[\d,]*)\s*rating', row) - if bid and title: - books.append({ - 'book_id': bid.group(1), - 'title': title.group(1).strip(), - 'avg_rating': float(avg.group(1)) if avg else None, - 'ratings_count':cnt.group(1).replace(',', '') if cnt else None, - 'url': f"https://www.goodreads.com/book/show/{bid.group(1)}", - }) - - return { - 'name': name.group(1) if name else None, - 'profile_image':img.group(1) if img else None, - 'bio': bio, - 'website': website.group(1) if website else None, - 'top_books': books, - } - -# Example -author = parse_author("58.Frank_Patrick_Herbert") -# author['name'] => "Frank Patrick Herbert" -# author['bio'] => "Franklin Patrick Herbert Jr. was an American science fiction..." -# len(author['top_books']) => 10 -``` - -**Field notes:** -- Author IDs can be found in a book's `author_url` (from `__NEXT_DATA__` or JSON-LD). -- The slug is optional in the URL — numeric ID alone redirects correctly. -- `profile_image` from OG tag is a large portrait (p8 suffix = 800px). Swap to `p5` for 500px. -- The bio is server-rendered in a `` or `` — which variant appears depends on length. -- Follower count is **not** present in the static HTML — it requires JS execution to appear. -- Page lists exactly 10 books. To get all books, paginate `/author/list/{author_id}?page=N`. - ---- - -## Listopia List Page - -URL: `https://www.goodreads.com/list/show/{list_id}.{Slug}?page={n}` - -Returns 100 books per page with rank numbers. - -```python -import re, json -from helpers import http_get - -def parse_list(list_id_and_slug, page=1): - url = f"https://www.goodreads.com/list/show/{list_id_and_slug}?page={page}" - html = http_get(url) - - rows = re.findall( - r'(.*?)', - html, re.DOTALL - ) - - results = [] - for row in rows: - rank = re.search(r']*class="number"[^>]*>(\d+)', row) - bid = re.search(r'
          ', row) - title = re.search(r"itemprop='name'[^>]*>([^<]+)", row) - author = re.search(r'class="authorName"[^>]*>]*>([^<]+)', row) - avg = re.search(r'(\d+\.\d+)\s*avg rating', row) - cnt = re.search(r'(\d[\d,]*)\s*rating', row) - if not (bid and title): - continue - results.append({ - 'rank': int(rank.group(1)) if rank else None, - 'book_id': bid.group(1), - 'title': title.group(1).strip(), - 'author': author.group(1).strip() if author else None, - 'avg_rating': float(avg.group(1)) if avg else None, - 'ratings_count':cnt.group(1).replace(',', '') if cnt else None, - 'url': f"https://www.goodreads.com/book/show/{bid.group(1)}", - }) - - return {'page': page, 'results': results} - -# Example -lst = parse_list("1.Best_Books_Ever") -# lst['results'][0] => {'rank': 1, 'book_id': '2767052', -# 'title': 'The Hunger Games (The Hunger Games, #1)', -# 'author': 'Suzanne Collins', 'avg_rating': 4.35, ...} -``` - -**Field notes:** -- 100 rows per page. Ranks are sequential across pages (page 2 starts at rank 101). -- Paginate with `?page=2`, `?page=3` etc. -- List pages do not use `__NEXT_DATA__` — same classic HTML format as author pages. - ---- - -## Open Library API Fallback - -Use Open Library when you need structured JSON without HTML parsing, or when you want supplementary data (birth/death dates, ISBNs across editions, subjects). - -Open Library's ratings are from its own user base (~400 ratings vs. Goodreads' 800k+ for Dune) — use Goodreads ratings when accuracy matters. - -### Search - -```python -import json -from urllib.parse import quote_plus -from helpers import http_get - -def ol_search(query, limit=10): - url = f"https://openlibrary.org/search.json?q={quote_plus(query)}&limit={limit}" - data = json.loads(http_get(url)) - results = [] - for doc in data.get('docs', []): - cover_id = doc.get('cover_i') - results.append({ - 'ol_key': doc['key'], # e.g. "/works/OL893415W" - 'title': doc.get('title'), - 'author': (doc.get('author_name') or [''])[0], - 'author_key': (doc.get('author_key') or [''])[0], - 'first_pub_year': doc.get('first_publish_year'), - 'edition_count': doc.get('edition_count'), - 'series': doc.get('series_name'), - 'cover_url': f"https://covers.openlibrary.org/b/id/{cover_id}-M.jpg" if cover_id else None, - }) - return {'total': data.get('numFound'), 'results': results} - -r = ol_search("dune frank herbert", limit=5) -# r['results'][0]['ol_key'] => "/works/OL893415W" -# r['results'][0]['title'] => "Dune" -``` - -### Work (book details) - -```python -def ol_work(ol_key): - # ol_key like "/works/OL893415W" or just "OL893415W" - key = ol_key if ol_key.startswith('/') else f'/works/{ol_key}' - data = json.loads(http_get(f"https://openlibrary.org{key}.json")) - desc = data.get('description', '') - if isinstance(desc, dict): - desc = desc.get('value', '') - return { - 'title': data.get('title'), - 'subjects': data.get('subjects', []), - 'series': data.get('series', []), - 'description': desc, - 'covers': data.get('covers', []), - 'links': data.get('links', []), - } - -work = ol_work("OL893415W") -# work['title'] => "Dune" -# work['subjects'] => ["Dune (Imaginary place)", "Fiction", ...] 
-``` - -### Ratings for a work - -```python -def ol_ratings(ol_key): - key = ol_key if ol_key.startswith('/') else f'/works/{ol_key}' - data = json.loads(http_get(f"https://openlibrary.org{key}/ratings.json")) - return data.get('summary', {}) - -# {'average': 4.30, 'count': 414, 'sortable': 4.21} -``` - -### Author - -```python -def ol_author(author_key): - # author_key like "OL79034A" - data = json.loads(http_get(f"https://openlibrary.org/authors/{author_key}.json")) - bio = data.get('bio', '') - if isinstance(bio, dict): - bio = bio.get('value', '') - return { - 'name': data.get('name'), - 'birth_date': data.get('birth_date'), - 'death_date': data.get('death_date'), - 'bio': bio, - 'ol_key': data.get('key'), - } - -author = ol_author("OL79034A") -# author['name'] => "Frank Herbert" -# author['birth_date'] => "8 October 1920" -# author['death_date'] => "11 February 1986" -``` - ---- - -## Combining Goodreads + Open Library - -```python -# Get full book data: Goodreads for ratings/genres/description, OL for ISBNs/edition details -def get_book_full(goodreads_book_id, ol_work_key=None): - gr = parse_book(goodreads_book_id) - result = dict(gr) - if ol_work_key: - ol = ol_work(ol_work_key) - result['ol_subjects'] = ol['subjects'] - result['ol_description'] = ol['description'] - result['ol_covers'] = ol['covers'] - return result -``` - ---- - -## Gotchas - -- **Goodreads API is gone**: The official API was shut down in December 2020. All data must come from HTML scraping or the unofficial paths documented here. - -- **Book ID 5107 redirects**: The URL `goodreads.com/book/show/5107.The_Stand` actually resolves to *The Catcher in the Rye* (ID 5107). The Stand is ID `149267`. Always verify `book['legacyId']` matches the URL ID. - -- **Author page ID mismatch**: Author ID `10538` in the URL resolves to Carl Sagan, not Frank Herbert (ID `58`). Always obtain author IDs from the `author_url` field inside a book's data rather than guessing. - -- **Two Book entities in `apolloState`**: The `apolloState` contains multiple `Book:` entries — one is a stub (only has `legacyId` and `webUrl`), and one is full. Filter by `legacyId == int(book_id)` AND check that the entry has more than 3 fields. - -- **Ratings are on `Work`, not `Book`**: `avg_rating`, `ratingsCount`, and `ratingsCountDist` are in the `Work` entity's `stats` key. The `Book` entity has no rating fields. - -- **Author pages are old-style HTML**: Author pages (`/author/show/`) do not use Next.js or `__NEXT_DATA__`. Use OG meta tags and regex for extraction. The follower count only loads via JS — it will be missing from `http_get` responses. - -- **Search has no `__NEXT_DATA__`**: Search result pages (`/search`) are classic server-rendered HTML. JSON-LD is absent. Use the `` microdata rows. - -- **`ratings_count` regex order matters**: The pattern `r'(\d[\d,]*)\s*rating'` always matches the book's aggregate rating count first in each search row — this is reliable. Do not use `minirating` span text as it contains nested HTML. - -- **Open Library cover URLs return binary JPEG**: `http_get()` will raise a `UnicodeDecodeError` on cover image URLs. Use `urllib.request.urlopen()` directly and read bytes, or just store the URL string without fetching. - -- **Open Library ratings are sparse**: OL has ~400 community ratings for Dune vs. Goodreads' 1.6M. Use OL ratings only as a last resort. - -- **Search page `—` entity**: The raw HTML uses `—` (not `—`) between rating value and count in search and author pages. 
The regex patterns above match the decoded text because Python's `re` operates on the decoded string after `http_get()` decodes UTF-8. - -- **Book slug is optional**: `goodreads.com/book/show/44767458` (no slug) works identically to `goodreads.com/book/show/44767458-dune`. Redirects are transparent. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/gutenberg/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/gutenberg/scraping.md deleted file mode 100644 index 8a4800e51..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/gutenberg/scraping.md +++ /dev/null @@ -1,383 +0,0 @@ -# Project Gutenberg — Scraping & Data Extraction - -`https://www.gutenberg.org` — 78 000+ free public-domain ebooks. Every workflow here is pure `http_get` — no browser needed. - -## Do this first - -**Use the Gutendex REST API (`gutendex.com`) for all search and discovery. It is one call, returns clean JSON, and requires no auth. Go to gutenberg.org URLs only to fetch actual file content.** - -```python -import json - -# Search by title/author keyword -data = json.loads(http_get("https://gutendex.com/books/?search=pride+and+prejudice")) -# data['count'] = 6 (total matches) -# data['results'] = list of up to 32 book objects -book = data['results'][0] -# book['id'] = 1342 ← use this ID for all further calls -# book['formats']['text/plain; charset=utf-8'] = direct txt URL - -# Fetch the plain-text content of that book -text = http_get(book['formats']['text/plain; charset=utf-8']) -# Returns 763 083 chars including Project Gutenberg header/footer boilerplate -``` - -For a known book ID, skip search entirely: - -```python -book = json.loads(http_get("https://gutendex.com/books/1342/")) -``` - -## Common workflows - -### Search by keyword and get the first result - -```python -import json - -data = json.loads(http_get("https://gutendex.com/books/?search=frankenstein")) -if data['results']: - b = data['results'][0] - print(b['id'], b['title'], b['authors'][0]['name']) - # 84 Frankenstein; or, the modern prometheus Shelley, Mary Wollstonecraft - txt_url = b['formats'].get('text/plain; charset=utf-8') - if txt_url: - text = http_get(txt_url) -``` - -### Get the most downloaded books (popularity ranking) - -```python -import json - -data = json.loads(http_get("https://gutendex.com/books/?sort=popular")) -for b in data['results'][:10]: - authors = ', '.join(a['name'] for a in b['authors']) - print(f"[{b['id']}] {b['title']} — {authors} ({b['download_count']:,} downloads)") -# [84] Frankenstein — Shelley, Mary Wollstonecraft (178,271) -# [45304] The City of God, Volume I — Augustine, of Hippo, Saint (147,663) -# [2701] Moby Dick; Or, The Whale — Melville, Herman (112,302) -# [1342] Pride and Prejudice — Austen, Jane (107,502) -# [768] Wuthering Heights — Brontë, Emily (72,775) -# [1513] Romeo and Juliet — Shakespeare, William (70,272) -# [11] Alice's Adventures in Wonderland — Carroll, Lewis (65,243) -# [64317] The Great Gatsby — Fitzgerald, F. 
Scott (60,632) -# [100] Complete Works of Shakespeare — Shakespeare, William (60,527) -# [1260] Jane Eyre: An Autobiography — Brontë, Charlotte (57,602) -``` - -### Browse by genre / topic - -```python -import json - -# 'topic' matches both subjects and bookshelves fields -data = json.loads(http_get("https://gutendex.com/books/?topic=science+fiction")) -# data['count'] = 3473 total results, 32 per page - -data = json.loads(http_get("https://gutendex.com/books/?topic=detective+fiction")) -# data['count'] = 111 -# data['results'][0]: id=1661 The Adventures of Sherlock Holmes — Doyle, Arthur Conan - -# Filter by language (ISO 639-1 code) -data = json.loads(http_get("https://gutendex.com/books/?languages=fr&topic=roman")) -# data['count'] = 254 French books with 'roman' in topic -``` - -### Paginate through results - -```python -import json - -url = "https://gutendex.com/books/?topic=science+fiction" -books = [] -while url: - data = json.loads(http_get(url)) - books.extend(data['results']) - url = data['next'] # None on last page - # data['previous'] is also populated after page 1 - # e.g. data['next'] = "https://gutendex.com/books/?page=3&topic=science+fiction" -# All 3473 sci-fi books loaded across ~109 pages of 32 each -``` - -### Fetch multiple specific books by ID - -```python -import json - -data = json.loads(http_get("https://gutendex.com/books/?ids=1342,11,84")) -# Returns exactly those 3 books, count=3 -for b in data['results']: - print(b['id'], b['title']) -# 84 Frankenstein; or, the modern prometheus -# 1342 Pride and Prejudice -# 11 Alice's Adventures in Wonderland -``` - -### Read the plain text of a book (boilerplate stripped) - -```python -raw = http_get("https://www.gutenberg.org/cache/epub/1342/pg1342.txt") -# 763 083 chars total including PG licence header and footer - -START = "*** START OF THE PROJECT GUTENBERG EBOOK" -END = "*** END OF THE PROJECT GUTENBERG EBOOK" -s = raw.find(START) -e = raw.find(END) -if s != -1: - content = raw[raw.index('\n', s) + 1 : e].strip() - # 743 241 chars of actual novel text -``` - -The cache URL is the most reliable direct path. The `formats` dict in Gutendex also provides a redirect URL that resolves to the same file: - -```python -# Both of these return identical content (763 083 chars): -http_get("https://www.gutenberg.org/ebooks/1342.txt.utf-8") # redirect -http_get("https://www.gutenberg.org/cache/epub/1342/pg1342.txt") # direct cache -``` - -### Download formats available per book - -Every book's `formats` dict maps MIME type to URL. All URLs resolve to `/cache/epub/{id}/` files via redirect. 
- -| MIME type | URL pattern (after redirect) | Typical size | -|---|---|---| -| `text/plain; charset=utf-8` | `pg{id}.txt` | ~750 KB | -| `text/html` | `pg{id}-images.html` | ~850 KB | -| `application/epub+zip` | `pg{id}-images-3.epub` | ~25 MB | -| `application/x-mobipocket-ebook` | `pg{id}-images-kf8.mobi` | ~25 MB | -| `application/rdf+xml` | `{id}.rdf` via gutenberg.org | metadata XML | -| `image/jpeg` | `pg{id}.cover.medium.jpg` | cover image | -| `application/octet-stream` | `pg{id}-h.zip` | HTML+images zip | - -```python -import json - -b = json.loads(http_get("https://gutendex.com/books/1342/")) -# Grab every downloadable format URL: -for mime, url in b['formats'].items(): - print(mime, '->', url) -# text/html -> https://www.gutenberg.org/ebooks/1342.html.images -# application/epub+zip -> https://www.gutenberg.org/ebooks/1342.epub3.images -# application/x-mobipocket-ebook -> https://www.gutenberg.org/ebooks/1342.kf8.images -# application/rdf+xml -> https://www.gutenberg.org/ebooks/1342.rdf -# image/jpeg -> https://www.gutenberg.org/cache/epub/1342/pg1342.cover.medium.jpg -# application/octet-stream -> https://www.gutenberg.org/cache/epub/1342/pg1342-h.zip -# text/plain; charset=utf-8 -> https://www.gutenberg.org/ebooks/1342.txt.utf-8 -``` - -### Fetch RDF/XML metadata for a book - -```python -import re - -rdf = http_get("https://www.gutenberg.org/cache/epub/1342/pg1342.rdf") -# Also available as: http_get("https://www.gutenberg.org/ebooks/1342.rdf") - -title = re.search(r'(.*?)', rdf, re.DOTALL) -creator = re.findall(r'(.*?)', rdf) -birth = re.findall(r']*>(\d+)', rdf) -death = re.findall(r']*>(\d+)', rdf) -issued = re.search(r']*>(.*?)', rdf) -rights = re.search(r'(.*?)', rdf) -downloads = re.search(r']*>(\d+)', rdf) -language = re.search(r'.*?(.*?)', rdf, re.DOTALL) -subjects = re.findall(r'.*?(.*?).*?', rdf, re.DOTALL) - -print(title.group(1)) # Pride and Prejudice -print(creator) # ['Austen, Jane'] -print(birth, death) # ['1775'] ['1817'] -print(issued.group(1)) # 1998-06-01 -print(rights.group(1)) # Public domain in the USA. -print(int(downloads.group(1))) # 107502 -print(subjects[:3]) # ['England -- Fiction', 'Young women -- Fiction', 'Love stories'] -``` - -Note: `` value is a subject string, not a language code. For language codes use the Gutendex `languages` field instead. - -### Search the HTML catalog (25 results per page) - -Use this only when you need to leverage Gutenberg's own search index (author:, title:, subject: prefix syntax). - -```python -import re, json - -html = http_get( - "https://www.gutenberg.org/ebooks/search/" - "?query=shakespeare&sort_order=downloads" -) -# sort_order options: downloads, title, release_date, last_update, random - -entries = re.findall(r'
        3. ', html, re.DOTALL) -books = [] -for e in entries: - book_id = re.search(r'/ebooks/(\d+)', e) - title = re.search(r'(.*?)', e) - author = re.search(r'(.*?)', e) - downloads = re.search(r'([^<]+)', e) - books.append({ - 'id': int(book_id.group(1)) if book_id else None, - 'title': title.group(1) if title else '', - 'author': author.group(1) if author else '', - 'downloads': downloads.group(1).strip() if downloads else '', - }) - -# books[0] = {'id': 1513, 'title': 'Romeo and Juliet', -# 'author': 'William Shakespeare', 'downloads': '74316 downloads'} - -# Paginate with start_index (25 per page) -html_p2 = http_get( - "https://www.gutenberg.org/ebooks/search/" - "?query=shakespeare&sort_order=downloads&start_index=26" -) -``` - -### Browse a bookshelf (curated genre list) - -```python -import re - -# Bookshelf 68 = Science Fiction -html = http_get("https://www.gutenberg.org/ebooks/bookshelf/68") -titles = re.findall(r'(.*?)', html) -# ['Twenty Thousand Leagues under the Sea', 'The War of the Worlds', -# 'The Time Machine', 'Thuvia, Maid of Mars', ...] -``` - -### OPDS catalog (machine-readable Atom feed) - -```python -import re - -feed = http_get("https://www.gutenberg.org/ebooks/search.opds/?query=dracula") -# Returns Atom XML, 7 entries per page (including 1 metadata entry) -entries = re.findall(r'(.*?)', feed, re.DOTALL) -for e in entries: - title = re.search(r'(.*?)', e) - entry_id = re.search(r'(.*?)', e) - if title and entry_id and 'opds' in entry_id.group(1): - book_id = re.search(r'/ebooks/(\d+)\.opds', entry_id.group(1)) - print(book_id.group(1), title.group(1)) -# 345 Dracula -``` - -## Gutendex API — full response schema - -Validated against a real call to `GET https://gutendex.com/books/1342/`: - -```json -{ - "id": 1342, - "title": "Pride and Prejudice", - "authors": [ - {"name": "Austen, Jane", "birth_year": 1775, "death_year": 1817} - ], - "summaries": ["...automatically generated summary..."], - "editors": [], - "translators": [], - "subjects": [ - "Courtship -- Fiction", - "Domestic fiction", - "England -- Fiction", - "Love stories", - "Sisters -- Fiction", - "Women -- England -- Fiction", - "Young women -- Fiction" - ], - "bookshelves": [ - "Best Books Ever Listings", - "Category: British Literature", - "Category: Classics of Literature", - "Category: Novels", - "Category: Romance", - "Harvard Classics" - ], - "languages": ["en"], - "copyright": false, - "media_type": "Text", - "formats": { - "text/html": "https://www.gutenberg.org/ebooks/1342.html.images", - "application/epub+zip": "https://www.gutenberg.org/ebooks/1342.epub3.images", - "application/x-mobipocket-ebook": "https://www.gutenberg.org/ebooks/1342.kf8.images", - "application/rdf+xml": "https://www.gutenberg.org/ebooks/1342.rdf", - "image/jpeg": "https://www.gutenberg.org/cache/epub/1342/pg1342.cover.medium.jpg", - "application/octet-stream": "https://www.gutenberg.org/cache/epub/1342/pg1342-h.zip", - "text/plain; charset=utf-8": "https://www.gutenberg.org/ebooks/1342.txt.utf-8" - }, - "download_count": 107502 -} -``` - -List response wrapper (from `GET /books/`): - -```json -{ - "count": 6, - "next": null, - "previous": null, - "results": [...] -} -``` - -`count` is the total across all pages. `next` / `previous` are fully-formed URLs ready to pass to `http_get`, or `null` when absent. - -## Gutendex query parameters - -All parameters combine freely. 
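For example, one combined query (a sketch; the filter values are arbitrary):

```python
import json

# public-domain English science fiction by authors born 1800 or later, most popular first
url = ("https://gutendex.com/books/"
       "?topic=science+fiction&languages=en"
       "&author_year_start=1800&copyright=false&sort=popular")
data = json.loads(http_get(url))
print(data['count'], len(data['results']))  # total matches, up to 32 on this page
```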
- -| Parameter | Example | Notes | -|---|---|---| -| `search` | `search=moby+dick` | Matches title and author | -| `ids` | `ids=1342,11,84` | Comma-separated; returns only those books | -| `languages` | `languages=fr` | ISO 639-1 code; comma-separated for multiple | -| `topic` | `topic=science+fiction` | Matches subjects + bookshelves | -| `author_year_start` | `author_year_start=1800` | Author born on/after year | -| `author_year_end` | `author_year_end=1850` | Author born on/before year | -| `copyright` | `copyright=false` | `false`=public domain, `true`=copyrighted | -| `sort` | `sort=popular` | `popular` (default), `ascending`, `descending` | -| `page` | `page=2` | 1-based; 32 results per page (not configurable) | - -`page_size` is not supported — always 32 results per page regardless. - -## Finding book IDs - -Three ways, in order of preference: - -1. **Gutendex search** — returns `id` directly in JSON. -2. **Gutenberg HTML catalog** — `book_id = re.search(r'/ebooks/(\d+)', entry)`. IDs in the URL. -3. **URL pattern** — `https://www.gutenberg.org/ebooks/{id}` — if you already know the ID from any source. - -Notable IDs validated in tests: `84` (Frankenstein), `1342` (Pride and Prejudice), `11` (Alice in Wonderland), `2701` (Moby Dick), `64317` (The Great Gatsby), `1513` (Romeo and Juliet), `100` (Complete Works of Shakespeare), `1661` (Adventures of Sherlock Holmes), `345` (Dracula). - -## Rate limits - -Gutendex (`gutendex.com`) returns no `X-RateLimit-*` headers. Server is Apache/2.4.58 on Ubuntu. Rapid sequential calls can trigger connection resets — observed a timeout on the second call in a tight loop. Add a small delay between calls when paginating: - -```python -import time, json - -url = "https://gutendex.com/books/?sort=popular" -while url: - data = json.loads(http_get(url)) - # ... process data['results'] ... - url = data['next'] - if url: - time.sleep(0.5) # be respectful — no published rate limit but timeouts observed -``` - -For gutenberg.org file downloads (txt, epub, etc.) there is no documented rate limit but Gutenberg asks not to use automated bulk downloading; use their [offline catalogs](https://www.gutenberg.org/ebooks/offline_catalogs.html) for bulk access. - -## Gotchas - -- **`.opf` 404**: `https://www.gutenberg.org/cache/epub/1342/pg1342.opf` returns 404. Use `.rdf` instead — same path prefix, same data in RDF/XML. -- **`formats` URLs redirect**: URLs like `https://www.gutenberg.org/ebooks/1342.txt.utf-8` are redirect endpoints that resolve to `/cache/epub/1342/pg1342.txt`. Either form works with `http_get` (urllib follows redirects automatically), but the `/cache/epub/` direct URL avoids an extra round trip. -- **Two text files**: `/files/1342/1342-0.txt` (older Project Gutenberg edition, 729 KB) and `/cache/epub/1342/pg1342.txt` (modern edition, 763 KB) contain different versions of the same book. The Gutendex `formats` entry always points to the cache/modern version. -- **Boilerplate**: Every `.txt` file opens with a PG licence header and closes with a footer. Strip with `START`/`END` markers (see "Read the plain text" section above). -- **`summaries` field is AI-generated**: The `summaries` array in Gutendex responses contains automatically generated summaries, not the author's original blurb. -- **`copyright: false`** means public domain in the USA. Non-US copyright status is not tracked. -- **`page_size` ignored**: Passing `?page_size=5` to Gutendex has no effect — always returns 32 results. 
-- **Gutendex `sort=ascending/descending`** sorts by ID (oldest/newest book in the catalog), not by title or author name. -- **Catalog search `author:` prefix**: `?query=author:dickens` searches within author names but Gutenberg's relevance ranking is fuzzy and can return unexpected results. For precise author lookup use Gutendex `?search=charles+dickens`. -- **OPDS pagination**: Only 7 entries per page (1 metadata + 6 books). Slow for bulk extraction — use Gutendex instead. -- **HTML catalog `start_index`**: Pagination is 25 per page. Next page = `start_index=26`, then `51`, `76`, etc. The value appears in the rendered HTML (`re.findall(r'start_index=(\d+)', html)` returns the next page's value). diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/hackernews/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/hackernews/scraping.md deleted file mode 100644 index 86ac6b785..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/hackernews/scraping.md +++ /dev/null @@ -1,243 +0,0 @@ -# Hacker News — Data Extraction - -`https://news.ycombinator.com` — YCombinator's link aggregator. Three access paths tested: `http_get` DOM scraping, Algolia search API, and the official HN Firebase API. All work without a browser. - -## Do this first: pick your access path - -| Goal | Best approach | Latency | -|------|--------------|---------| -| Current front page (30 stories, real-time) | `http_get` + regex | ~170ms | -| Historical / keyword search | Algolia search API | ~400ms | -| Full comment tree (nested) | Algolia items API | ~300ms | -| Specific item by ID | Firebase API | ~200ms | -| 500 ranked story IDs | Firebase topstories | ~200ms (+ ~190ms/item after) | - -**Never use a browser for read-only HN tasks.** Everything is accessible over HTTP with no auth, no JS rendering needed. - ---- - -## Path 1: http_get front page (fastest for real-time data) - -The front page HTML is ~34KB. Story order matches Firebase `/topstories.json` exactly — confirmed identical on 2026-04-18. - -```python -import re, html as htmllib - -page = http_get("https://news.ycombinator.com") - -# Extract all 30 story IDs (in rank order) -story_ids = re.findall(r'', page) - -# Extract titles + URLs (same order as IDs) -titles_urls = re.findall( - r'class="titleline"[^>]*>
          ]*>(.*?)', page -) - -# Extract scores keyed by story ID (job posts have no score row) -scores_by_id = { - m.group(1): int(m.group(2)) - for m in re.finditer( - r'(\d+) points', page - ) -} - -# Extract authors keyed by story ID (anchor on score span) -authors_by_id = {} -for m in re.finditer( - r'\d+ points' - r'.*?class="hnuser">(.*?)', - page, re.DOTALL -): - authors_by_id[m.group(1)] = m.group(2) - -# Extract comment counts keyed by story ID -comments_by_id = { - m.group(1): int(m.group(2)) - for m in re.finditer( - r'href="item\?id=(\d+)">(\d+) comments', page - ) -} - -stories = [] -for i, sid in enumerate(story_ids): - url, raw_title = titles_urls[i] if i < len(titles_urls) else ('', '') - stories.append({ - 'rank': i + 1, - 'id': sid, - 'title': htmllib.unescape(raw_title), # MUST unescape — titles contain ' etc. - 'url': url, - 'score': scores_by_id.get(sid), # None for job posts - 'author': authors_by_id.get(sid), - 'comments': comments_by_id.get(sid, 0), - }) -``` - -**Gotchas:** -- Titles contain HTML entities (`'` `&` `"` `>`). Always call `html.unescape()`. -- `` — the class is `athing submission`, not just `athing`. The `athing comtr` class is for comment rows. -- Job/hiring posts (YC ads) appear in the list but have no score or author. `scores_by_id.get(sid)` returns `None` for them — check before comparing. -- `re.DOTALL` multi-line patterns can cross story boundaries. Use ID-anchored patterns (as above) instead of positional zip for score/author. -- The page only serves page 1 (30 items). Pages 2–4 exist at `?p=2` etc. but require a login cookie for page 3+. - ---- - -## Path 2: Algolia search API (best for historical / keyword search) - -No rate limiting observed. Returns up to 1000 hits per query (`hitsPerPage` max is capped at ~1000 per Algolia plan). - -```python -import json - -# Keyword search — sorted by relevance -data = json.loads(http_get( - "https://hn.algolia.com/api/v1/search" - "?query=llm&tags=story&hitsPerPage=20" -)) - -# Date-sorted (most recent first) -data = json.loads(http_get( - "https://hn.algolia.com/api/v1/search_by_date" - "?tags=story&hitsPerPage=20" -)) - -# Paginate: add &page=N (0-indexed), up to data['nbPages']-1 -``` - -**Fields returned per story hit:** -``` -objectID, title, url, author, points, num_comments, -created_at (ISO 8601), created_at_i (unix ts), story_id, -children (list of comment IDs — flat, not tree), -_tags, _highlightResult -``` - -**Fields returned per comment hit:** -``` -objectID, comment_text, author, story_id, story_title, story_url, -parent_id, created_at, created_at_i, points -``` -Note: comment hits use `comment_text`, NOT `text`. Story hits use `story_text` for self-post body. 
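A small sketch that flattens story hits into plain dicts, assuming the standard Algolia `hits` array wrapper (the field names are exactly the ones listed above):

```python
import json

data = json.loads(http_get(
    "https://hn.algolia.com/api/v1/search?query=llm&tags=story&hitsPerPage=5"
))
stories = [
    {
        'id': hit['objectID'],
        'title': hit.get('title'),
        'url': hit.get('url'),          # absent for self-posts; body is in story_text
        'author': hit.get('author'),
        'points': hit.get('points'),
        'comments': hit.get('num_comments'),
        'created': hit.get('created_at'),
    }
    for hit in data.get('hits', [])
]
```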
- -### Tag filters - -Tags are AND by default, OR with parentheses: - -```python -# Story types -"tags=story" # regular link/self posts -"tags=show_hn" # Show HN -"tags=ask_hn" # Ask HN -"tags=poll" # polls -"tags=job" # job posts - -# Combined AND -"tags=story,front_page" # currently on front page -"tags=story,author_pg" # stories submitted by pg - -# OR -"tags=(ask_hn,show_hn),story" # Ask OR Show HN - -# By story ID (gets story + all its comments) -"tags=story_47806725" -``` - -### Numeric filters - -```python -# Date range (unix timestamps) -"numericFilters=created_at_i>1745000000" -"numericFilters=created_at_i>1700000000,created_at_i<1750000000" - -# Point threshold -"numericFilters=points>100" -"numericFilters=points>500,points<1000" -``` - -### Full Algolia items API (nested comment tree) - -```python -import json - -thread = json.loads(http_get( - "https://hn.algolia.com/api/v1/items/47806725" -)) -# thread['children'] = list of top-level comment objects -# Each comment: author, text (HTML), created_at, children (nested replies) -# Recursively walk children for full thread - -# Total comment count (recursive walk with stack): -stack = list(thread.get('children', [])) -total = 0 -while stack: - node = stack.pop() - total += 1 - stack.extend(node.get('children', [])) -``` - -Confirmed: Algolia items returns 653 total comments for a 659-comment thread (some deleted). `text` field in items API is HTML with `
<p>
          ` tags and `` links — may need to strip tags. - ---- - -## Path 3: Official HN Firebase API - -Clean JSON, no scraping. Use for fetching specific items or building live feeds. - -```python -import json - -# Ranked story ID lists (no metadata — just IDs) -top = json.loads(http_get("https://hacker-news.firebaseio.com/v0/topstories.json")) # 500 IDs -new = json.loads(http_get("https://hacker-news.firebaseio.com/v0/newstories.json")) # 500 IDs -best = json.loads(http_get("https://hacker-news.firebaseio.com/v0/beststories.json")) # 200 IDs -ask = json.loads(http_get("https://hacker-news.firebaseio.com/v0/askstories.json")) # ~32 IDs -show = json.loads(http_get("https://hacker-news.firebaseio.com/v0/showstories.json")) # ~119 IDs -jobs = json.loads(http_get("https://hacker-news.firebaseio.com/v0/jobstories.json")) # ~31 IDs - -# Fetch a single item -item = json.loads(http_get( - "https://hacker-news.firebaseio.com/v0/item/47806725.json" -)) -# Fields: id, type, by, title, url, score, descendants (total comment count), -# time (unix ts), kids (list of top-level comment IDs), text (self-post body) - -# Fetch a user profile -user = json.loads(http_get( - "https://hacker-news.firebaseio.com/v0/user/pg.json" -)) -# Fields: id, karma, created (unix ts), about (HTML), submitted (list of item IDs) - -# Highest current item ID (useful for polling new items) -maxid = json.loads(http_get("https://hacker-news.firebaseio.com/v0/maxitem.json")) -``` - -**Firebase vs Algolia tradeoff:** -- Firebase `topstories` gives you 500 IDs in one call but then requires one HTTP call per item (~190ms each). Fetching all 500 items sequentially would take ~100 seconds. -- Algolia returns full story data (title, points, author, comments) in one call for up to ~1000 results. -- For "top 30 stories with full metadata": use `http_get` front page scrape (170ms total). For "top 500 stories with full metadata": use Algolia with `tags=front_page` or loop pages. - ---- - -## Comment thread HTML (item page) - -For a large thread, the item page HTML (~1MB for 659 comments) loads ALL comments flat in a single request — no pagination, no JS required. - -```python -import re, html as htmllib - -page = http_get("https://news.ycombinator.com/item?id=47806725") - -# Count all comment IDs -comment_ids = re.findall(r'', page) -# len(comment_ids) matches total comment count - -# Extract comment texts (careful: text spans multiple lines with
<p>
          tags) -# Use Algolia items API instead for structured access -``` - -For structured comment access prefer Algolia items API — it returns a proper nested tree. The HTML item page is useful only when you need approximate comment count without an API call. - ---- - -## Do NOT use a browser for HN - -All data is in plain HTML or JSON APIs. `goto_url()` + `wait_for_load()` takes 3–8 seconds; `http_get` takes 170–400ms. The JS `querySelectorAll` approach works (tested, returns correct data) but is 20–50x slower with no benefit. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/howlongtobeat/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/howlongtobeat/scraping.md deleted file mode 100644 index e93bba749..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/howlongtobeat/scraping.md +++ /dev/null @@ -1,473 +0,0 @@ -# HowLongToBeat — Scraping & Data Extraction - -Field-tested against howlongtobeat.com on 2026-04-18. All code blocks validated with live requests. - -## Do this first - -**Use the search API — it returns structured JSON with all completion times in one POST call.** - -HLTB runs a token-gated POST endpoint at `/api/find`. You must first fetch a session token from `/api/find/init`, then include it in the search request. Both steps are plain HTTP — no browser required. - -```python -import json, re, urllib.request, time -from helpers import http_get - -UA = "Mozilla/5.0" - -def get_token(): - """Fetch a fresh session token. Token encodes IP+UA+timestamp, reusable for ~15 min.""" - url = f"https://howlongtobeat.com/api/find/init?t={int(time.time()*1000)}" - data = http_get(url, headers={"Referer": "https://howlongtobeat.com/"}) - return json.loads(data) # {token, hpKey, hpVal} - -def search_hltb(title, size=20, page=1, token_data=None): - """ - Search HLTB for games. Returns raw API dict: - {count, pageCurrent, pageTotal, pageSize, data: [...]} - token_data can be reused across searches (fetch once, use many times). 
- """ - if token_data is None: - token_data = get_token() - hp_key, hp_val = token_data['hpKey'], token_data['hpVal'] - payload = { - "searchType": "games", - "searchTerms": title.split(), - "searchPage": page, - "size": size, - "searchOptions": { - "games": { - "userId": 0, "platform": "", "sortCategory": "popular", - "rangeCategory": "main", "rangeTime": {"min": None, "max": None}, - "gameplay": {"perspective": "", "flow": "", "genre": "", "difficulty": ""}, - "rangeYear": {"min": "", "max": ""}, "modifier": "" - }, - "users": {"sortCategory": "postcount"}, - "lists": {"sortCategory": "follows"}, - "filter": "", "sort": 0, "randomizer": 0 - }, - "useCache": True, - hp_key: hp_val # honeypot field — key and value vary per token - } - req = urllib.request.Request( - "https://howlongtobeat.com/api/find", - data=json.dumps(payload).encode(), - headers={ - "User-Agent": UA, - "Content-Type": "application/json", - "Origin": "https://howlongtobeat.com", - "Referer": "https://howlongtobeat.com/", - "x-auth-token": token_data['token'], - "x-hp-key": hp_key, - "x-hp-val": hp_val, - }, - method="POST" - ) - with urllib.request.urlopen(req, timeout=20) as r: - return json.loads(r.read().decode()) - -# Usage -tok = get_token() - -result = search_hltb("elden ring", token_data=tok, size=3) -for g in result['data']: - print(g['game_id'], g['game_name'], g['release_world']) - print(f" Main: {g['comp_main']/3600:.1f}h +Extras: {g['comp_plus']/3600:.1f}h 100%: {g['comp_100']/3600:.1f}h") - -# Confirmed output (2026-04-18): -# 68151 Elden Ring 2022 -# Main: 60.0h +Extras: 101.2h 100%: 135.5h -# 160589 Elden Ring: Nightreign 2025 -# Main: 28.1h +Extras: 40.1h 100%: 66.9h -# 139385 Elden Ring: Shadow of the Erdtree 2024 -# Main: 25.7h +Extras: 39.0h 100%: 51.1h -``` - -Token is reusable — fetch it once and pass it to multiple `search_hltb()` calls. No need to re-fetch per search. 
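For instance, a sketch that reuses one token across several lookups (the titles are taken from the verified examples further down):

```python
tok = get_token()                                        # fetch once
for title in ["the witcher 3", "hades", "celeste"]:
    res = search_hltb(title, size=1, token_data=tok)     # reuse for every search
    if res['data']:
        g = res['data'][0]
        print(g['game_name'], round(g['comp_main'] / 3600, 1), "h main")
```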
- ---- - -## Fastest approach: search + parse in one helper - -```python -import json, re, urllib.request, time -from helpers import http_get - -UA = "Mozilla/5.0" - -def hltb_search(title, size=5): - """One-shot: get token + search, return list of dicts with hours.""" - url = f"https://howlongtobeat.com/api/find/init?t={int(time.time()*1000)}" - tok = json.loads(http_get(url, headers={"Referer": "https://howlongtobeat.com/"})) - hp_key, hp_val = tok['hpKey'], tok['hpVal'] - payload = { - "searchType": "games", "searchTerms": title.split(), "searchPage": 1, "size": size, - "searchOptions": { - "games": {"userId": 0, "platform": "", "sortCategory": "popular", - "rangeCategory": "main", "rangeTime": {"min": None, "max": None}, - "gameplay": {"perspective": "", "flow": "", "genre": "", "difficulty": ""}, - "rangeYear": {"min": "", "max": ""}, "modifier": ""}, - "users": {"sortCategory": "postcount"}, "lists": {"sortCategory": "follows"}, - "filter": "", "sort": 0, "randomizer": 0 - }, - "useCache": True, hp_key: hp_val - } - req = urllib.request.Request( - "https://howlongtobeat.com/api/find", data=json.dumps(payload).encode(), - headers={"User-Agent": UA, "Content-Type": "application/json", - "Origin": "https://howlongtobeat.com", "Referer": "https://howlongtobeat.com/", - "x-auth-token": tok['token'], "x-hp-key": hp_key, "x-hp-val": hp_val}, - method="POST" - ) - with urllib.request.urlopen(req, timeout=20) as r: - data = json.loads(r.read().decode()) - - def h(secs): - return round(secs / 3600, 1) if secs else None - - return [ - { - "game_id": g["game_id"], - "name": g["game_name"], - "type": g["game_type"], # "game" | "dlc" | "expansion" | "hack" - "year": g["release_world"], - "platforms": g["profile_platform"], - "main": h(g["comp_main"]), # Main Story hours (polled average) - "main_plus": h(g["comp_plus"]), # Main + Extras hours - "completionist":h(g["comp_100"]), # Completionist hours - "all_styles": h(g["comp_all"]), # All playstyles combined - "main_count": g["comp_main_count"], # Number of submissions - "plus_count": g["comp_plus_count"], - "comp_count": g["comp_100_count"], - "review_score": g["review_score"], # 0–100 - "image_url": f"https://howlongtobeat.com/games/{g['game_image']}", - "page_url": f"https://howlongtobeat.com/game/{g['game_id']}", - } - for g in data["data"] - ] - -# Verified results (2026-04-18): -print(hltb_search("the witcher 3")[0]) -# {'game_id': 10270, 'name': 'The Witcher 3: Wild Hunt', 'type': 'game', 'year': 2015, -# 'main': 51.6, 'main_plus': 103.8, 'completionist': 174.4, 'all_styles': 103.8, -# 'main_count': 2681, 'plus_count': 6708, 'comp_count': 2327, 'review_score': 93, ...} - -print(hltb_search("gone home")[0]) -# {'game_id': 4010, 'name': 'Gone Home', 'main': 2.0, 'main_plus': 2.5, 'completionist': 3.1, ...} -``` - ---- - -## Game detail page (full stat breakdown, speedrun data, per-platform times) - -When you have a `game_id`, fetch the game page and extract `__NEXT_DATA__` for the complete dataset — includes median/avg/low/high times, speedrun data, co-op/multiplayer times, and per-platform breakdowns. - -```python -import json, re -from helpers import http_get - -def get_game_detail(game_id): - """ - Fetch complete game data from the HLTB game page. - Returns pageProps['game']['data'] with keys: 'game', 'individuality', 'relationships'. 
- """ - html = http_get(f"https://howlongtobeat.com/game/{game_id}") - nd = json.loads(re.search( - r'', html, re.DOTALL - ).group(1)) - return nd['props']['pageProps']['game']['data'] - -data = get_game_detail(10270) # Witcher 3 -g = data['game'][0] - -# Core completion times (all in seconds — divide by 3600 for hours) -print(g['comp_main'] / 3600) # 51.6 — Main Story (polled avg) -print(g['comp_main_med'] / 3600) # 50.0 — Main Story median -print(g['comp_main_l'] / 3600) # 32.7 — Main Story low -print(g['comp_main_h'] / 3600) # 85.8 — Main Story high -print(g['comp_main_count']) # 2681 — submission count - -print(g['comp_plus'] / 3600) # 103.8 — Main + Extras -print(g['comp_100'] / 3600) # 174.4 — Completionist -print(g['comp_all'] / 3600) # 103.8 — All Styles - -# Speedrun times -print(g['comp_lvl_spd']) # 1 if speedrun data exists, 0 if not -print(g['comp_speed'] / 3600) # 19.2 — any% (polled avg) -print(g['comp_speed_min'] / 3600) # 3.2 — fastest submission -print(g['comp_speed_max'] / 3600) # 30.0 — slowest speedrun -print(g['comp_speed_count']) # 15 — speedrun submissions - -print(g['comp_speed100'] / 3600) # 59.4 — 100% speedrun -print(g['comp_speed100_count']) # 4 - -# Multiplayer / co-op invested time -print(g['comp_lvl_co']) # 1 if co-op data exists -print(g['comp_lvl_mp']) # 1 if multiplayer data exists -print(g['invested_co'] / 3600) # hours in co-op mode -print(g['invested_mp'] / 3600) # hours in competitive multiplayer -print(g['invested_co_count']) # submission count - -# Metadata -print(g['profile_dev']) # "CD Projekt RED" -print(g['profile_pub']) # "CD Projekt, Warner Bros..." -print(g['profile_platform']) # "Nintendo Switch, PC, PlayStation 4, ..." -print(g['profile_genre']) # "Third-Person, Action, Open World, Role-Playing" -print(g['profile_steam']) # 292030 — Steam App ID (0 if not on Steam) -print(g['release_world']) # "2015-05-19" -print(g['rating_esrb']) # "M" -print(g['review_score']) # 93 (0–100) -print(g['count_comp']) # 26007 — times completed -print(g['count_backlog']) # 31083 - -# Per-platform breakdown (individuality) -for plat in data['individuality']: - print(plat['platform'], - int(plat['comp_main'])/3600, # main hours - int(plat['comp_plus'])/3600, # +extras hours - int(plat['comp_100'])/3600, # 100% hours - plat['count_comp']) # completions on this platform -# Example: -# Nintendo Switch 57.0h 112.3h 194.9h 236 -# PC, PS4, Xbox One 52.9h 110.0h 179.4h 11136 -# PS5, Xbox Series X/S 52.1h 92.5h 168.8h 343 - -# DLC / expansion completion times -for rel in data['relationships'][:3]: - print(rel['game_id'], rel['game_name'], rel['game_type'], - rel['comp_main']/3600 if rel['comp_main'] else None) -``` - ---- - -## Common workflows - -### Quick lookup: name → completion times - -```python -import json, re, urllib.request, time -from helpers import http_get - -UA = "Mozilla/5.0" - -def get_times(title): - """Return Main/+Extras/100% hours for the top search match.""" - tok_url = f"https://howlongtobeat.com/api/find/init?t={int(time.time()*1000)}" - tok = json.loads(http_get(tok_url, headers={"Referer": "https://howlongtobeat.com/"})) - hp_key, hp_val = tok['hpKey'], tok['hpVal'] - payload = { - "searchType": "games", "searchTerms": title.split(), "searchPage": 1, "size": 1, - "searchOptions": { - "games": {"userId": 0, "platform": "", "sortCategory": "popular", - "rangeCategory": "main", "rangeTime": {"min": None, "max": None}, - "gameplay": {"perspective": "", "flow": "", "genre": "", "difficulty": ""}, - "rangeYear": {"min": "", "max": ""}, "modifier": ""}, 
- "users": {"sortCategory": "postcount"}, "lists": {"sortCategory": "follows"}, - "filter": "", "sort": 0, "randomizer": 0 - }, - "useCache": True, hp_key: hp_val - } - req = urllib.request.Request( - "https://howlongtobeat.com/api/find", data=json.dumps(payload).encode(), - headers={"User-Agent": UA, "Content-Type": "application/json", - "Origin": "https://howlongtobeat.com", "Referer": "https://howlongtobeat.com/", - "x-auth-token": tok['token'], "x-hp-key": hp_key, "x-hp-val": hp_val}, - method="POST" - ) - with urllib.request.urlopen(req, timeout=20) as r: - data = json.loads(r.read().decode()) - if not data['data']: - return None - g = data['data'][0] - h = lambda s: round(s/3600, 1) if s else None - return { - "id": g['game_id'], "name": g['game_name'], - "main": h(g['comp_main']), "main_plus": h(g['comp_plus']), - "completionist": h(g['comp_100']) - } - -# Verified: -print(get_times("celeste")) -# {'id': 42818, 'name': 'Celeste', 'main': 8.3, 'main_plus': 14.6, 'completionist': 39.2} -print(get_times("stardew valley")) -# {'id': 34716, 'name': 'Stardew Valley', 'main': 53.4, 'main_plus': 94.6, 'completionist': 171.5} -print(get_times("hades")) -# {'id': 62941, 'name': 'Hades', 'main': 23.4, 'main_plus': 48.5, 'completionist': 95.0} -``` - -### Paginated search (all results for a query) - -`count` = total matches, `pageTotal` = total pages with current `size`. The same token works across all pages. - -```python -def search_all_pages(title, size=20): - """Yield every search result for a query across all pages.""" - tok_url = f"https://howlongtobeat.com/api/find/init?t={int(time.time()*1000)}" - tok = json.loads(http_get(tok_url, headers={"Referer": "https://howlongtobeat.com/"})) - hp_key, hp_val = tok['hpKey'], tok['hpVal'] - - page = 1 - while True: - payload = { - "searchType": "games", "searchTerms": title.split(), - "searchPage": page, "size": size, - "searchOptions": { - "games": {"userId": 0, "platform": "", "sortCategory": "popular", - "rangeCategory": "main", "rangeTime": {"min": None, "max": None}, - "gameplay": {"perspective": "", "flow": "", "genre": "", "difficulty": ""}, - "rangeYear": {"min": "", "max": ""}, "modifier": ""}, - "users": {"sortCategory": "postcount"}, "lists": {"sortCategory": "follows"}, - "filter": "", "sort": 0, "randomizer": 0 - }, - "useCache": True, hp_key: hp_val - } - req = urllib.request.Request( - "https://howlongtobeat.com/api/find", data=json.dumps(payload).encode(), - headers={"User-Agent": UA, "Content-Type": "application/json", - "Origin": "https://howlongtobeat.com", "Referer": "https://howlongtobeat.com/", - "x-auth-token": tok['token'], "x-hp-key": hp_key, "x-hp-val": hp_val}, - method="POST" - ) - with urllib.request.urlopen(req, timeout=20) as r: - data = json.loads(r.read().decode()) - yield from data['data'] - if page >= data['pageTotal']: - break - page += 1 - -# "mario" returns 308 results across 16 pages (size=20) -mario_games = list(search_all_pages("mario", size=20)) -print(len(mario_games)) # 308 -``` - -### Batch lookup by game ID (parallel) - -```python -import json, re, urllib.request -from concurrent.futures import ThreadPoolExecutor -from helpers import http_get - -def fetch_game(game_id): - html = http_get(f"https://howlongtobeat.com/game/{game_id}") - nd = json.loads(re.search( - r'', html, re.DOTALL - ).group(1)) - g = nd['props']['pageProps']['game']['data']['game'][0] - return { - "id": g['game_id'], "name": g['game_name'], - "main": round(g['comp_main']/3600, 1) if g['comp_main'] else None, - "main_plus": 
round(g['comp_plus']/3600, 1) if g['comp_plus'] else None, - "completionist": round(g['comp_100']/3600, 1) if g['comp_100'] else None, - } - -ids = [10270, 68151, 42818, 26803, 34716] # Witcher3, Elden Ring, Celeste, DS3, Stardew -with ThreadPoolExecutor(max_workers=5) as ex: - results = list(ex.map(fetch_game, ids)) - -for r in results: - print(f"[{r['id']}] {r['name']}: {r['main']}h / {r['main_plus']}h / {r['completionist']}h") - -# Confirmed output: -# [10270] The Witcher 3: Wild Hunt: 51.6h / 103.8h / 174.4h -# [68151] Elden Ring: 60.0h / 101.2h / 135.5h -# [42818] Celeste: 8.3h / 14.6h / 39.2h -# [26803] Dark Souls III: 31.2h / 48.4h / 100.5h -# [34716] Stardew Valley: 53.4h / 94.6h / 171.5h -``` - ---- - -## Search response field reference - -Every item in `data[]` from `/api/find`: - -| Field | Type | Description | -|-------|------|-------------| -| `game_id` | int | HLTB internal game ID | -| `game_name` | str | Full game title | -| `game_alias` | str | Alternate title / edition name | -| `game_type` | str | `"game"` \| `"dlc"` \| `"expansion"` \| `"hack"` | -| `game_image` | str | Image filename → `https://howlongtobeat.com/games/{game_image}` | -| `release_world` | int | Release year (just the year integer, not a date) | -| `profile_platform` | str | Comma-separated platform list | -| `comp_main` | int | Main Story seconds (polled average), 0 if no data | -| `comp_plus` | int | Main + Extras seconds | -| `comp_100` | int | Completionist seconds | -| `comp_all` | int | All Styles combined seconds | -| `comp_main_count` | int | Submission count for Main Story | -| `comp_plus_count` | int | Submission count for Main + Extras | -| `comp_100_count` | int | Submission count for Completionist | -| `comp_all_count` | int | Total submissions across all categories | -| `comp_lvl_sp` | int | 1 if single-player data exists | -| `comp_lvl_co` | int | 1 if co-op data exists | -| `comp_lvl_mp` | int | 1 if multiplayer data exists | -| `invested_co` | int | Average co-op time in seconds | -| `invested_mp` | int | Average multiplayer time in seconds | -| `count_comp` | int | Total completions logged | -| `count_backlog` | int | Users with game in backlog | -| `count_playing` | int | Currently playing | -| `count_speedrun` | int | Speedrun entries | -| `count_review` | int | Review count | -| `review_score` | int | Community review score 0–100 | -| `profile_popular` | int | Popularity rank | - -Additional fields in `__NEXT_DATA__` game page only: - -| Field | Description | -|-------|-------------| -| `comp_main_med/avg/l/h` | Median / average / low / high for main time | -| `comp_plus_med/avg/l/h` | Same for Main + Extras | -| `comp_100_med/avg/l/h` | Same for Completionist | -| `comp_speed` | Speedrun any% average seconds | -| `comp_speed_min/max/med` | Speedrun spread | -| `comp_speed100` | 100% speedrun average | -| `comp_speed_count` | Speedrun submission count | -| `comp_lvl_spd` | 1 if speedrun data exists | -| `profile_dev` | Developer name | -| `profile_pub` | Publisher name | -| `profile_genre` | Comma-separated genres | -| `profile_steam` | Steam App ID (0 if not on Steam) | -| `release_world` | Full release date `"YYYY-MM-DD"` | -| `rating_esrb` | ESRB rating string (may be empty) | -| `count_replay` | Times replayed | -| `count_total` | Total user entries | - ---- - -## Anti-bot measures - -- **Cloudflare** is present (confirmed by `CF-Ray` response header), but does not block plain HTTP with a browser UA. -- **Token system**: Every search requires a fresh token from `/api/find/init`. 
Token encodes `timestamp::IP|UA|hpKey|hmacHash`. The server validates that the UA used to fetch the token matches the UA used in the search POST. -- **Honeypot field**: `hpKey` and `hpVal` from the init response must appear as a top-level field in the POST body (e.g., `{"ign_7671546b": "a6679ea54598d502", ...}`). The key name rotates per request. -- **Required headers on search POST**: `Origin: https://howlongtobeat.com` AND `Referer: https://howlongtobeat.com/` — missing either causes HTTP 403 or 404. `x-auth-token`, `x-hp-key`, `x-hp-val` are also required. -- **Required header on init GET**: `Referer: https://howlongtobeat.com/` — missing causes HTTP 403. -- **Token reuse**: A single token works for multiple searches and multiple pages. No per-request token fetch needed. -- **No CAPTCHA** observed during testing with standard UA strings. -- **Rate limits**: Not triggered during testing (token fetches + 10+ searches sequentially). Fetching many game pages in parallel (5 workers) worked without 429s. - ---- - -## Gotchas - -- **Completion times are in seconds** — all `comp_*` fields are integer seconds. Divide by 3600 for hours. `0` means no data (not 0 hours). - -- **`release_world` is a year int in search, a full date in game page** — in the `/api/find` response, `release_world` is an integer year (e.g., `2015`). In `__NEXT_DATA__` on the game page, it's `"2015-05-19"`. - -- **UA fingerprinting** — the token from `/api/find/init` encodes the User-Agent. The search POST must use the identical UA that fetched the token, or you'll get HTTP 403. Since `http_get` sends `Mozilla/5.0`, use that same string for the search POST. - -- **Honeypot key name rotates** — `hpKey` is something like `ign_7671546b` (changes each token fetch). Always read it from the init response and use it dynamically. Never hardcode it. - -- **Both `x-hp-key`/`x-hp-val` headers AND the body field are required** — the server checks the request headers (`x-hp-key`, `x-hp-val`) against the dynamic key in the POST body. If either is wrong or missing, you get HTTP 404 (wrong body value) or HTTP 403 (missing/wrong header). - -- **`game_type` in search results** — can be `"game"`, `"dlc"`, `"expansion"`, or `"hack"`. Search results mix these by default. Filter with `if g['game_type'] == 'game'` if you only want base games. - -- **Games with no submission data** — `comp_main`, `comp_plus`, `comp_100` are `0` (not `None`) when no users have submitted times. Always check `if g['comp_main']:` before dividing. - -- **`individuality` (per-platform) data** — available only in `__NEXT_DATA__` on the game page, not in search results. `comp_main` etc. are strings, not ints, in this sub-object — cast with `int(plat['comp_main'])`. - -- **`profile_platform` in search** — a comma-separated string that HLTB displays. Not structured. Use game page `individuality` for per-platform time breakdowns. - -- **Token expiry** — if a long-running loop gets HTTP 403 with `{"error":"Session expired or invalid fingerprint"}`, call `get_token()` again and retry. Token lifetime appears to be ~15 minutes based on the timestamp embedded in the decoded value. - -- **No slug-based URLs** — HLTB uses integer `game_id` for all game pages, not slugs. There is no `title-to-slug` mapping; use search to find the `game_id` first. - -- **`sortCategory` options** — `"popular"` ranks by community engagement (best for "top result = intended game"). `"name"` sorts alphabetically. 
Other values (`"madnessTime"`, `"mainThenExtras"`) exist but return same results as `"name"` in testing. diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/imdb/scraping.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/imdb/scraping.md deleted file mode 100644 index 10e15f9de..000000000 --- a/packages/bcode-browser/harness/agent-workspace/domain-skills/imdb/scraping.md +++ /dev/null @@ -1,271 +0,0 @@ -# IMDb — Charts, Search, and "More Like This" Scraping - -`https://www.imdb.com` — the Internet Movie Database. Field-tested on 2026-04-24 against `chart/top`, `chart/moviemeter`, `find/?s=tt&q=`, and `title/tt{id}/` pages. - -IMDb's app shell is React with a shared design system (`ipc-*` classes). The same `li.ipc-metadata-list-summary-item` row primitive is reused across Top 250, MovieMeter, Search, and most other list pages — learn one selector set, scrape many pages. - -The `tt`-prefixed title ID in the URL (`/title/tt0111161/`) is IMDb's stable primary key. Titles, prefixes, rankings, and CSS class hashes change between releases; `tt`-ids do not. Always dedupe by `tt`-id. - ---- - -## Access path decision table - -| Goal | Method | Page | Notes | -|------|--------|------|-------| -| Top 250 films (ranked) | browser | `/chart/top` | 250 rows, fully rendered server-side | -| MovieMeter (trending top 100) | browser | `/chart/moviemeter` | 100 rows, same row structure as Top 250 | -| Keyword/title search | browser | `/find/?s=tt&q=KEYWORD` | `s=tt` restricts to titles | -| "More Like This" recommendations | browser | `/title/tt{id}/` | Lazy-loaded, requires scroll | -| Title metadata (year, runtime, genres) | `http_get` + JSON-LD | `/title/tt{id}/` | The `', - html, re.DOTALL - ): - ld = json.loads(block.strip()) - if ld.get('@type') == 'Product': - ld_product = ld - break - - # --- Info panel table (Status, Platforms, Genre, Tags, Author, etc.) --- - info = {} - panel_m = re.search( - r'class="game_info_panel_widget[^"]*"[^>]*>(.*?)
          ', - html, re.DOTALL - ) - if panel_m: - for row in re.finditer( - r'([^<]+)(.*?)', - panel_m.group(1), re.DOTALL - ): - key = row.group(1).strip() - val = re.sub(r'<[^>]+>', '', row.group(2)).strip() - # Multi-value fields become lists (Tags, Platforms, Genre, Links) - info[key] = [v.strip() for v in val.split(',')] if ',' in val else val - - # --- Cover image --- - cover_m = re.search(r'`. - -```python -import re -from helpers import http_get - -def paginate_listing(base_url, max_pages=10): - """ - Scrape multiple pages from any itch.io browse URL. - base_url: https://itch.io/games/top-rated (no ?page= suffix) - Returns flat list of game dicts. - Stops when HTTP 404 or no found. - """ - all_games = [] - page = 1 - while page <= max_pages: - url = base_url if page == 1 else f"{base_url}?page={page}" - try: - html = http_get(url) - except Exception: - break # 404 = past last page - all_games.extend(parse_game_cards(html)) - if not re.search(r']+rel="next"[^>]*/>', html): - break - page += 1 - return all_games - -# Confirmed: page 1 has -# page 2 has and -# past last page returns HTTP 404 -# top-rated has at least 200 pages (each 36 games); page 300+ -> 404 -``` - ---- - -## Browse URL patterns - -All confirmed working via `http_get`: - -```python -BASE = "https://itch.io/games" - -# Sort orders -f"{BASE}/top-rated" # all-time top rated (rated by community, 0–5 stars) -f"{BASE}/newest" # most recently published -f"{BASE}/featured" # itch.io staff picks -f"{BASE}/on-sale" # discounted games -f"{BASE}/free" # free games only - -# Genre/tag paths (append .xml for RSS) -f"{BASE}/tag-puzzle" # tag slug — prefix with 'tag-' -f"{BASE}/genre-action" # genre — prefix with 'genre-' (less common) - -# Combine: tag + sort via separate pages (no combined URL that survives http_get) -# Note: https://itch.io/games/top-rated/tag-puzzle -> HTTP 403 -# Note: ?tag= query param does NOT filter server-side (returns same games) - -# Pagination -f"{BASE}/top-rated?page=2" -f"{BASE}/tag-puzzle?page=3" - -# RSS equivalents (36 items, no pagination needed for small sets) -f"{BASE}/top-rated.xml" -f"{BASE}/tag-puzzle.xml" -f"{BASE}/tag-puzzle.xml?page=2" - -# Search (54 results/page, no server-side pagination beyond page 1 via http_get) -"https://itch.io/search?q=platformer" - -# Author profile -"https://.itch.io" -``` - ---- - -## API (requires key) - -itch.io has an official REST API. A free key is issued per-account with no rate limit published. -Get one at: `https://itch.io/user/settings/api-keys` - -Base URL: `https://itch.io/api/1//` - -```python -import json -from helpers import http_get - -ITCH_KEY = "your_api_key_here" # from https://itch.io/user/settings/api-keys - -def api(path): - return json.loads(http_get(f"https://itch.io/api/1/{ITCH_KEY}/{path}")) - -# Authenticated user info -api("me") -# -> {"user": {"id": ..., "username": "...", "url": "...", "display_name": "...", ...}} - -# Games owned by authenticated user -api("my-games") -# -> {"games": [{"id": ..., "title": "...", "url": "...", "created_at": "...", -# "published": true/false, "min_price": 0, ...}, ...]} - -# Download keys for a game (owner only) -api("game/434554/download_keys") - -# Credentials (for authenticated purchases) -api("game/434554/credentials") -``` - -**Error structure:** invalid/missing key returns `{"errors": ["invalid key"]}` with HTTP 200. -Non-existent endpoints return HTTP 404. - -**No unauthenticated game lookup API.** `https://itch.io/api/1/x/games` -> HTTP 404. 
-Use HTML scraping or RSS for unauthenticated game data.
-
----
-
-## Gotchas
-
-1. **Attribute order flips page 1 vs 2+.** On page 1, game cards use `class="game_cell ..." data-game_id="..."`. On pages 2+, the order is `data-game_id="..." class="game_cell ..."`. Always match `data-game_id` independently of class ordering.
-
-2. **Ratings absent on tag/genre listing pages.** The `data-tooltip` with rating is often missing from card HTML on `/games/tag-*` pages even though the game has ratings. Fetch the detail page for `aggregateRating` via JSON-LD.
-
-3. **`price_value` absent = Free.** Paid games have a `price_value` element in the card markup. Free games have no such element. Default to `'Free'` when absent.
-
-4. **Free-game JSON-LD has no `offers` block.** Only paid games include the `offers` object. For free games, use absence of `offers` as the signal, not presence of `price: 0`.
-
-5. **`/games/top-rated/tag-puzzle` returns HTTP 403.** Cannot combine sort + tag in a path. Use separate `/games/tag-puzzle` (top-rated is the default sort anyway).
-
-6. **`?tag=` query param is ignored server-side.** `https://itch.io/games/top-rated?tag=puzzle` returns the same games as plain `/games/top-rated`. Use the `/games/tag-puzzle` path instead.
-
-7. **Download/purchase counts are not public.** No count field appears anywhere in the public HTML, JSON-LD, RSS, or unauthenticated API. Game owners see their stats in the dashboard only.
-
-8. **Search beyond page 1 is AJAX-only.** `https://itch.io/search?q=X&page=2` via `http_get` returns the same 54 results as page 1. To get more search results use the browser and scroll/click "load more".
-
-9. **RSS is capped at 36 items per page.** Paginate with `?page=N`. Very high page numbers (300+) return HTTP 404 on browse pages.
-
-10. **Unicode zero-width space in some titles.** `\u200b` (zero-width space) appears at the start of certain titles (e.g. "Our Life: Beginnings & Always"). `.strip()` alone won't remove it; strip it explicitly with `title.replace('\u200b', '').strip()`.
diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md
deleted file mode 100644
index 7d14bf5ba..000000000
--- a/packages/bcode-browser/harness/agent-workspace/domain-skills/job-boards/indeed-glassdoor.md
+++ /dev/null
@@ -1,1021 +0,0 @@
-# Job Boards — Indeed, Glassdoor, Stepstone
-
-Covers: `indeed.com`, `glassdoor.com`, `stepstone.de`
-
----
-
-## Do this first: construct search URLs directly
-
-Never type into the search box on the homepage — bot detection triggers immediately. Build search URLs directly and navigate straight to results.
- -```python -from urllib.parse import quote_plus - -# Indeed — English (US) -query, location = "Python developer", "San Francisco" -goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}") -wait_for_load() -wait(2) - -# Indeed — last 24 hours -goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}&fromage=1") -wait_for_load() -wait(2) - -# Glassdoor — public search (no login required for result cards) -goto_url(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}") -wait_for_load() -wait(2) - -# Stepstone (Germany) -keyword, city = "Data Scientist", "Berlin" -goto_url(f"https://www.stepstone.de/jobs/{quote_plus(keyword)}/in-{quote_plus(city)}.html") -wait_for_load() -wait(2) -``` - ---- - -## URL patterns - -### Indeed - -| Goal | URL pattern | -|---|---| -| Keyword + location | `/jobs?q={title}&l={location}` | -| Last 24 hours | `/jobs?q={title}&l={location}&fromage=1` | -| Last 3 days | `/jobs?q={title}&l={location}&fromage=3` | -| Last week | `/jobs?q={title}&l={location}&fromage=7` | -| Remote only | `/jobs?q={title}&remotejob=032b3046-06a3-4876-8dfd-474eb5e7ed11` | -| Full-time only | `/jobs?q={title}&l={location}&jt=fulltime` | -| Part-time | `/jobs?q={title}&l={location}&jt=parttime` | -| With salary | `/jobs?q={title}&l={location}&rbl=%24{min}%2B` | -| Page 2 (results 11-20) | append `&start=10` | -| Page 3 (results 21-30) | append `&start=20` | -| Job detail page | `https://www.indeed.com/viewjob?jk={job_key}` | - -**Indeed country variants**: `.co.uk`, `.de`, `.fr`, `.com.au` — same URL structure, different base domain. - -### Glassdoor - -| Goal | URL pattern | -|---|---| -| Keyword search | `/Job/jobs.htm?sc.keyword={title}` | -| Keyword + city name | `/Job/jobs.htm?sc.keyword={title}&locT=C&locKeyword={city}` | -| Remote filter | `/Job/jobs.htm?sc.keyword={title}&remoteWorkType=1` | -| Next page | append `&p=2`, `&p=3` | - -### Stepstone (Germany) - -| Goal | URL pattern | -|---|---| -| Keyword in city | `/jobs/{keyword}/in-{city}.html` | -| Page 2 | `/jobs/{keyword}/in-{city}/page-2.html` | -| Page 3 | `/jobs/{keyword}/in-{city}/page-3.html` | -| Full-time | `/jobs/{keyword}/in-{city}.html?of=1` | - -For Stepstone, keyword and city go directly in the path — encode spaces as `-`: -```python -kw_path = keyword.replace(" ", "-") -city_path = city.replace(" ", "-") -goto_url(f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html") -``` - ---- - -## Cookie / consent banner dismissal - -Indeed (EU/UK) and Glassdoor show GDPR consent overlays. Dismiss before extraction. - -```python -def dismiss_cookie_banner(): - """Try common consent button patterns. 
Safe to call even if no banner is present.""" - dismissed = js(""" - (function() { - // Indeed: "Accept all cookies" button - var selectors = [ - 'button[id*="onetrust-accept"]', - 'button[id*="accept-all"]', - '#onetrust-accept-btn-handler', - 'button[data-testid="cookie-consent-accept"]', - // Glassdoor: consent modal - 'button[data-test="accept-cookies"]', - // Generic patterns - 'button[class*="accept"]', - 'button[class*="consent"]', - ]; - for (var i = 0; i < selectors.length; i++) { - var btn = document.querySelector(selectors[i]); - if (btn && btn.offsetParent !== null) { - btn.click(); - return selectors[i]; - } - } - return null; - })() - """) - if dismissed: - wait(1) - return dismissed -``` - -Call immediately after `wait_for_load()` on `.co.uk`, `.de`, or `glassdoor.com`: - -```python -goto_url("https://www.indeed.co.uk/jobs?q=Python+developer&l=London") -wait_for_load() -wait(2) -dismiss_cookie_banner() -wait(1) -``` - ---- - -## Workflow 1: Indeed — search result card extraction - -Each result card on Indeed carries a `data-jk` attribute (the job key). Use it to construct direct URLs. - -```python -import json -from urllib.parse import quote_plus - -query, location = "machine learning engineer", "New York" -goto_url(f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}") -wait_for_load() -wait(2) -dismiss_cookie_banner() - -jobs = js(""" -(function() { - // Cards live in
          or
        4. with data-jk attribute - var cards = document.querySelectorAll('[data-jk]'); - var out = []; - for (var i = 0; i < cards.length; i++) { - var c = cards[i]; - var jk = c.getAttribute('data-jk') || ''; - if (!jk) continue; - - // Title - var titleEl = c.querySelector('h2.jobTitle span[title], h2.jobTitle span:not(.visually-hidden), [data-testid="job-title"]'); - var title = titleEl ? titleEl.innerText.trim() : ''; - - // Company name - var compEl = c.querySelector('[data-testid="company-name"], .companyName, span[data-testid="company-name"]'); - var company = compEl ? compEl.innerText.trim() : ''; - - // Location - var locEl = c.querySelector('[data-testid="text-location"], .companyLocation'); - var location = locEl ? locEl.innerText.trim() : ''; - - // Salary — may not always be present in the card - var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container, .metadata.salary-snippet'); - var salary = salEl ? salEl.innerText.trim() : ''; - - // Posting date / age - var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date, .result-link-bar-container .date'); - var posted = dateEl ? dateEl.innerText.trim() : ''; - - // Direct URL via job key - var url = 'https://www.indeed.com/viewjob?jk=' + jk; - - if (title) { - out.push({jk, title, company, location, salary, posted, url}); - } - } - return JSON.stringify(out); -})() -""") - -results = json.loads(jobs) -for r in results: - print(r) -# Typically returns 10–15 cards per page -``` - ---- - -## Workflow 2: Indeed — pagination (multi-page extraction) - -Indeed paginates using `&start=N` where N increments by 10 per page. - -```python -import json -from urllib.parse import quote_plus - -query, location = "data scientist", "remote" -base_url = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}" - -all_jobs = [] - -for page in range(3): # 3 pages = up to ~30 results - start = page * 10 - url = base_url if start == 0 else f"{base_url}&start={start}" - goto_url(url) - wait_for_load() - wait(2) # mandatory — bot detection is aggressive on rapid loads - - if page == 0: - dismiss_cookie_banner() - - batch_json = js(""" - (function() { - var cards = document.querySelectorAll('[data-jk]'); - var out = []; - for (var i = 0; i < cards.length; i++) { - var c = cards[i]; - var jk = c.getAttribute('data-jk') || ''; - if (!jk) continue; - var titleEl = c.querySelector('h2.jobTitle span[title], [data-testid="job-title"]'); - var compEl = c.querySelector('[data-testid="company-name"], .companyName'); - var locEl = c.querySelector('[data-testid="text-location"], .companyLocation'); - var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container'); - var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date'); - out.push({ - jk, - title: titleEl ? titleEl.innerText.trim() : '', - company: compEl ? compEl.innerText.trim() : '', - location: locEl ? locEl.innerText.trim() : '', - salary: salEl ? salEl.innerText.trim() : '', - posted: dateEl ? 
dateEl.innerText.trim() : '', - url: 'https://www.indeed.com/viewjob?jk=' + jk, - }); - } - return JSON.stringify(out.filter(j => j.title)); - })() - """) - - batch = json.loads(batch_json) - if not batch: - break # no results on this page — stop - all_jobs.extend(batch) - -print(f"Collected {len(all_jobs)} jobs across {page+1} pages") -``` - -**For `fromage` (date filter) + pagination**: keep the `fromage` param in the base URL: -```python -base_url = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}&fromage=1" -``` - ---- - -## Workflow 3: Indeed — job detail page extraction - -Fetch the full job description from the detail page. The `viewjob?jk=` URL is canonical and stable. - -```python -import json, re - -def get_indeed_job_detail(jk: str) -> dict: - """Fetch full job details from an Indeed job key.""" - goto_url(f"https://www.indeed.com/viewjob?jk={jk}") - wait_for_load() - wait(2) - - detail = js(""" - (function() { - // Title - var titleEl = document.querySelector('[data-testid="jobsearch-JobInfoHeader-title"], h1.jobsearch-JobInfoHeader-title'); - var title = titleEl ? titleEl.innerText.trim() : ''; - - // Company - var compEl = document.querySelector('[data-testid="inlineHeader-companyName"] a, [data-company-name="true"]'); - var company = compEl ? compEl.innerText.trim() : ''; - - // Location - var locEl = document.querySelector('[data-testid="inlineHeader-companyLocation"], [data-testid="job-location"]'); - var location = locEl ? locEl.innerText.trim() : ''; - - // Salary — shown when available in header - var salEl = document.querySelector('[data-testid="jobsearch-OtherJobDetailsContainer"] [aria-label*="alary"], #salaryInfoAndJobType span'); - var salary = salEl ? salEl.innerText.trim() : ''; - - // Full job description text - var descEl = document.getElementById('jobDescriptionText'); - var description = descEl ? descEl.innerText.trim() : ''; - - // Job type (Full-time, Part-time, Contract, etc.) - var typeEl = document.querySelector('[data-testid="attribute_snippet_testid"]'); - var jobType = typeEl ? typeEl.innerText.trim() : ''; - - // "Apply on company site" link — external application URL - var externalBtn = document.querySelector('[data-jk][href*="indeed.com/applystart"], a[href*="indeed.com/applystart"]'); - var externalUrl = externalBtn ? externalBtn.href : ''; - - return JSON.stringify({title, company, location, salary, jobType, description, externalUrl}); - })() - """) - return json.loads(detail) - -# Example -detail = get_indeed_job_detail("abc123def456xyz") -print(detail["title"], "—", detail["salary"]) -print(detail["description"][:500]) # first 500 chars -``` - ---- - -## Workflow 4: Glassdoor — search result extraction - -Glassdoor shows a login modal after a few scrolls. Extract cards from the first visible load before triggering that wall. 
- -```python -import json -from urllib.parse import quote_plus - -query = "product manager" -goto_url(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={quote_plus(query)}") -wait_for_load() -wait(3) # Glassdoor JS rendering takes longer - -# Dismiss cookie banner if present -dismiss_cookie_banner() - -# Extract cards before any scroll (avoid triggering login modal) -jobs = js(""" -(function() { - // Glassdoor job cards: li[data-jobid] or article[data-id] - var cards = document.querySelectorAll('li[data-jobid], li[class*="JobsList_jobListItem"]'); - if (!cards.length) { - // Fallback: try generic article cards - cards = document.querySelectorAll('[data-test="jobListing"], [id^="job-listing-"]'); - } - var out = []; - for (var i = 0; i < cards.length; i++) { - var c = cards[i]; - - // Job ID (used for canonical URL) - var jobId = c.getAttribute('data-jobid') || c.getAttribute('data-id') || ''; - - // Title - var titleEl = c.querySelector('[data-test="job-title"], a[class*="JobCard_jobTitle"], .job-title'); - var title = titleEl ? titleEl.innerText.trim() : ''; - - // Company - var compEl = c.querySelector('[data-test="employer-name"], [class*="JobCard_employer"], .employer-name'); - var company = compEl ? compEl.innerText.trim() : ''; - - // Location - var locEl = c.querySelector('[data-test="emp-location"], [class*="JobCard_location"], .location'); - var location = locEl ? locEl.innerText.trim() : ''; - - // Salary estimate (not always shown in card) - var salEl = c.querySelector('[data-test="detailSalary"], [class*="salary"], .salaryEstimate'); - var salary = salEl ? salEl.innerText.trim() : ''; - - // Company rating - var ratingEl = c.querySelector('[data-test="rating"], [class*="ratingNumber"], .rating'); - var rating = ratingEl ? ratingEl.innerText.trim() : ''; - - // Canonical URL - var linkEl = c.querySelector('a[href*="/job-listing/"], a[href*="glassdoor.com/job"]'); - var url = linkEl ? linkEl.href : (jobId ? 'https://www.glassdoor.com/job-listing/glassdoor-jl' + jobId + '.htm' : ''); - - if (title) out.push({jobId, title, company, location, salary, rating, url}); - } - return JSON.stringify(out); -})() -""") - -results = json.loads(jobs) -for r in results: - print(r) -``` - -**If `jobs` returns an empty list**, Glassdoor has changed its DOM structure. Take a screenshot and inspect: - -```python -capture_screenshot() -# Look for the actual card selector, then update the querySelectorAll above -``` - ---- - -## Workflow 5: Glassdoor — handling the login wall - -Glassdoor increasingly shows a login modal after viewing a few listings. Detect and dismiss it. 
- -```python -def dismiss_glassdoor_login_modal(): - """Close the Glassdoor sign-in / register modal if it appears.""" - closed = js(""" - (function() { - // Close button on the modal - var closeBtn = document.querySelector( - '[alt="Close"], button[class*="modal_closeIcon"], [data-test="close-modal"]' - ); - if (closeBtn && closeBtn.offsetParent !== null) { - closeBtn.click(); - return 'closed'; - } - // Sometimes the modal has an X with aria-label - var ariaClose = document.querySelector('[aria-label="Close"]'); - if (ariaClose && ariaClose.offsetParent !== null) { - ariaClose.click(); - return 'aria-closed'; - } - return null; - })() - """) - if closed: - wait(1) - return closed - -# Strategy: extract as much as possible before the modal appears -# If the modal blocks results, dismiss it and try again -result = dismiss_glassdoor_login_modal() -if result: - wait(1) - # Re-run extraction after dismissal -``` - -If the modal is persistent and cannot be closed, switch to Indeed for the same search — it has more accessible public results. - ---- - -## Workflow 6: Stepstone (German) — job extraction - -Stepstone is server-rendered. Most data can be extracted with `http_get` for speed, or via `goto` + `js()` for dynamic content. - -```python -import json, re -from urllib.parse import quote_plus - -keyword = "Sachbearbeiter Einkauf" -city = "Regensburg" - -# Stepstone encodes keyword/city in the path -kw_path = keyword.replace(" ", "-") -city_path = city.replace(" ", "-") - -goto_url(f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html") -wait_for_load() -wait(2) -dismiss_cookie_banner() - -jobs = js(""" -(function() { - // Stepstone result cards - var cards = document.querySelectorAll( - 'article[data-at="job-item"], [data-genesis-element="JOB_CARD"], article.sc-fhzFiK' - ); - var out = []; - for (var i = 0; i < cards.length; i++) { - var c = cards[i]; - - // Title - var titleEl = c.querySelector('h2[data-at="job-item-title"] a, [data-at="job-title"], .listing__title a'); - var title = titleEl ? titleEl.innerText.trim() : ''; - var url = titleEl ? (titleEl.href || '') : ''; - - // Company - var compEl = c.querySelector('[data-at="job-item-company-name"], [data-at="company-name"], .listing__company'); - var company = compEl ? compEl.innerText.trim() : ''; - - // Location - var locEl = c.querySelector('[data-at="job-item-location"], .listing__location'); - var location = locEl ? locEl.innerText.trim() : ''; - - // Posting date - var dateEl = c.querySelector('[data-at="job-posting-date"], time, .listing__date'); - var posted = dateEl ? 
(dateEl.getAttribute('datetime') || dateEl.innerText.trim()) : ''; - - if (title) out.push({title, company, location, posted, url}); - } - return JSON.stringify(out); -})() -""") - -results = json.loads(jobs) -for r in results: - print(r) -``` - -### Stepstone pagination - -```python -import json - -all_jobs = [] -for page in range(1, 4): # pages 1-3 - if page == 1: - url = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}.html" - else: - url = f"https://www.stepstone.de/jobs/{kw_path}/in-{city_path}/page-{page}.html" - - goto_url(url) - wait_for_load() - wait(2) - - if page == 1: - dismiss_cookie_banner() - - batch_json = js(""" - (function() { - var cards = document.querySelectorAll('article[data-at="job-item"], [data-genesis-element="JOB_CARD"]'); - var out = []; - for (var i = 0; i < cards.length; i++) { - var c = cards[i]; - var titleEl = c.querySelector('[data-at="job-item-title"] a, [data-at="job-title"]'); - var compEl = c.querySelector('[data-at="job-item-company-name"]'); - var locEl = c.querySelector('[data-at="job-item-location"]'); - var dateEl = c.querySelector('time'); - out.push({ - title: titleEl ? titleEl.innerText.trim() : '', - company: compEl ? compEl.innerText.trim() : '', - location: locEl ? locEl.innerText.trim() : '', - posted: dateEl ? dateEl.getAttribute('datetime') || dateEl.innerText.trim() : '', - url: titleEl ? titleEl.href : '', - }); - } - return JSON.stringify(out.filter(j => j.title)); - })() - """) - - batch = json.loads(batch_json) - if not batch: - break - all_jobs.extend(batch) - -print(f"Stepstone: {len(all_jobs)} jobs collected") -``` - ---- - -## Indeed job key (jk) — direct URL construction - -Indeed search result links go through a tracking redirect. **Do not use those redirect URLs.** Instead, extract the `data-jk` attribute directly for the stable canonical URL. - -```python -# Correct approach: extract data-jk from the card -job_keys = js(""" -JSON.stringify( - Array.from(document.querySelectorAll('[data-jk]')) - .map(el => el.getAttribute('data-jk')) - .filter(jk => jk && jk.length > 0) - .filter((jk, i, arr) => arr.indexOf(jk) === i) // dedupe -) -""") -import json -jks = json.loads(job_keys) - -# Canonical job detail URL for any job key: -for jk in jks: - direct_url = f"https://www.indeed.com/viewjob?jk={jk}" - print(direct_url) -``` - -If you already have a redirect URL and need to extract the `jk` from it: - -```python -import re -def extract_jk(url: str) -> str | None: - m = re.search(r'[?&]jk=([a-f0-9]+)', url) - return m.group(1) if m else None -``` - ---- - -## Salary extraction and normalization - -Salary appears in different places and formats depending on the job and site. - -### Indeed salary patterns - -```python -import re - -def parse_indeed_salary(raw: str) -> dict: - """ - Parse Indeed salary strings like: - "$85,000 - $110,000 a year" - "Up to $65 an hour" - "$25 - $30 an hour" - "From $120,000 a year" - "Employer est.: $90,000 - $120,000 a year" - Returns: {low, high, period, source} - """ - if not raw: - return {"raw": raw, "low": None, "high": None, "period": None, "source": None} - - source = None - if "Employer est." in raw: - source = "employer" - raw = raw.replace("Employer est.:", "").strip() - elif "Glassdoor est." 
in raw: - source = "glassdoor" - raw = raw.replace("Glassdoor est.:", "").strip() - - raw_clean = raw.replace(",", "") - - # Period - period = None - if "a year" in raw or "per year" in raw or "/yr" in raw: - period = "year" - elif "an hour" in raw or "per hour" in raw or "/hr" in raw: - period = "hour" - elif "a month" in raw or "per month" in raw: - period = "month" - - # Range: two dollar amounts - range_m = re.findall(r'\$?([\d]+(?:\.\d+)?)', raw_clean) - low = float(range_m[0]) if len(range_m) >= 1 else None - high = float(range_m[1]) if len(range_m) >= 2 else low - - return {"raw": raw, "low": low, "high": high, "period": period, "source": source} - -# Examples -parse_indeed_salary("$85,000 - $110,000 a year") -# -> {"low": 85000.0, "high": 110000.0, "period": "year", "source": None} - -parse_indeed_salary("Employer est.: $90,000 - $120,000 a year") -# -> {"low": 90000.0, "high": 120000.0, "period": "year", "source": "employer"} - -parse_indeed_salary("Up to $65 an hour") -# -> {"low": 65.0, "high": 65.0, "period": "hour", "source": None} -``` - -### Glassdoor salary note - -Glassdoor shows two types of salary estimates: -- **"Employer est."** — the company provided a range in the job post -- **"Glassdoor est."** — Glassdoor estimated based on similar roles; shown with "(est.)" in the card - -Both are shown as text inside the card. Parse the same way as Indeed. - -If the salary is absent in the search result card, it is only available on the job detail page (requires a click through to the individual listing). - ---- - -## Date normalization ("3 days ago" → actual date) - -All three sites use relative timestamps. Convert to absolute dates when needed. - -```python -import re -from datetime import datetime, timedelta - -def parse_relative_date(text: str, reference_date: datetime = None) -> datetime | None: - """ - Convert relative job posting dates to datetime objects. - Handles: "Just posted", "Today", "1 day ago", "3 days ago", "30+ days ago" - """ - if reference_date is None: - reference_date = datetime.utcnow() - - text = text.strip().lower() - - if not text or text in ("", "unknown"): - return None - if text in ("just posted", "today", "active today"): - return reference_date - if "hour" in text: - m = re.search(r'(\d+)', text) - hours = int(m.group(1)) if m else 1 - return reference_date - timedelta(hours=hours) - if "day" in text: - m = re.search(r'(\d+)', text) - days = int(m.group(1)) if m else 1 - return reference_date - timedelta(days=days) - if "week" in text: - m = re.search(r'(\d+)', text) - weeks = int(m.group(1)) if m else 1 - return reference_date - timedelta(weeks=weeks) - if "month" in text: - m = re.search(r'(\d+)', text) - months = int(m.group(1)) if m else 1 - return reference_date - timedelta(days=months * 30) - if "30+" in text: - return reference_date - timedelta(days=30) - - return None # unparseable - -# Examples -parse_relative_date("3 days ago") # datetime ~3 days before now -parse_relative_date("Just posted") # datetime.utcnow() -parse_relative_date("30+ days ago") # datetime 30 days ago -``` - ---- - -## Workflow 7: Fast bulk extraction with `http_get` (no browser) - -For Indeed, the raw HTML of search results contains structured JSON in a `window.mosaic.providerData` script tag. This is faster and more reliable than DOM extraction. - -```python -import json, re -from urllib.parse import quote_plus - -def indeed_http_search(query: str, location: str = "", fromage: int = 0, start: int = 0) -> list[dict]: - """ - Extract Indeed jobs via HTTP (no browser). 
Parses the embedded JSON payload. - Returns up to ~15 jobs per call. - """ - params = f"q={quote_plus(query)}&l={quote_plus(location)}&start={start}" - if fromage: - params += f"&fromage={fromage}" - - html = http_get( - f"https://www.indeed.com/jobs?{params}", - headers={ - "Accept-Language": "en-US,en;q=0.9", - "Accept": "text/html,application/xhtml+xml", - } - ) - - # Check for CAPTCHA before parsing - if "captcha" in html.lower() or "robot check" in html.lower(): - return [] # fall back to browser-based extraction - - # Indeed embeds job data in window.mosaic.providerData["mosaic-provider-jobcards"] - m = re.search( - r'window\.mosaic\.providerData\["mosaic-provider-jobcards"\]\s*=\s*(\{.*?\});', - html, re.DOTALL - ) - if not m: - return [] - - try: - data = json.loads(m.group(1)) - except json.JSONDecodeError: - return [] - - results_list = ( - data - .get("metaData", {}) - .get("mosaicProviderJobCardsModel", {}) - .get("results", []) - ) - - jobs = [] - for r in results_list: - jk = r.get("jobkey", "") - jobs.append({ - "jk": jk, - "title": r.get("title", ""), - "company": r.get("company", ""), - "location": r.get("formattedLocation", ""), - "salary": r.get("salarySnippet", {}).get("text", ""), - "posted": r.get("formattedRelativeTime", ""), - "url": f"https://www.indeed.com/viewjob?jk={jk}", - "snippet": r.get("snippet", ""), # short description preview - }) - return jobs - -# Example — last 24h remote jobs -jobs = indeed_http_search("software engineer", "remote", fromage=1) -for j in jobs: - print(j["title"], "|", j["company"], "|", j["salary"]) -``` - -If `http_get` returns 0 results (CAPTCHA or structure change), fall back to the `goto` + `js()` browser workflow above. - ---- - -## Workflow 8: "Easy Apply" vs external application detection - -Some Indeed listings apply on Indeed directly ("Easy Apply") while others redirect to the company site. Detect which type before deciding what to do. - -```python -def get_application_type(jk: str) -> dict: - """Returns {type: 'easy_apply'|'external'|'unknown', external_url: str|None}""" - goto_url(f"https://www.indeed.com/viewjob?jk={jk}") - wait_for_load() - wait(2) - - return js(""" - (function() { - // "Apply now" button pointing to /applystart = indeed-hosted Easy Apply - var easyBtn = document.querySelector( - 'button[data-testid="applyButton"], [id="indeedApplyButton"], button[class*="IndeedApplyButton"]' - ); - // "Apply on company site" button - var extBtn = document.querySelector( - 'a[data-testid="applyButton"][href*="indeed.com/applystart"], a[href*="indeed.com/applystart"]' - ); - // External redirect — check the main CTA - var mainCta = document.querySelector('[data-testid="applyButton"]'); - var ctaHref = mainCta ? mainCta.href : ''; - - if (easyBtn && !ctaHref.includes('apply.indeed')) { - return {type: 'easy_apply', externalUrl: null}; - } - if (extBtn || (ctaHref && !ctaHref.includes('indeed.com/viewjob'))) { - return {type: 'external', externalUrl: ctaHref || null}; - } - return {type: 'unknown', externalUrl: null}; - })() - """) -``` - ---- - -## Bot detection and rate limiting - -Indeed and Glassdoor have active bot detection. Violating these limits leads to CAPTCHA walls, IP blocks, or silently degraded results (cards with empty fields). 
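-
-A small wrapper can apply these rules on every navigation. This is a sketch, not upstream code: it reuses the `INTER_PAGE_WAIT` constant and `is_captcha_page()` helper defined in the next two subsections, and `polite_goto` and the 10-second back-off are illustrative names and values to tune.
-
-```python
-def polite_goto(url: str, retry_wait: float = 10.0) -> bool:
-    """Navigate with the safe cadence; retry once if a CAPTCHA/block page appears."""
-    goto_url(url)
-    wait_for_load()
-    wait(INTER_PAGE_WAIT)      # cadence constant from "Safe request cadence" below
-    if is_captcha_page():      # helper from "CAPTCHA detection" below
-        capture_screenshot()   # keep evidence of the block page
-        wait(retry_wait)       # back off once (assumed value), then retry
-        goto_url(url)
-        wait_for_load()
-        wait(INTER_PAGE_WAIT)
-    return not is_captcha_page()
-```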
- -### Safe request cadence - -```python -# Minimum wait between page loads -INTER_PAGE_WAIT = 2.5 # seconds — don't go below 2 - -# Between job detail page fetches -INTER_DETAIL_WAIT = 3.0 # seconds - -# http_get concurrency limit -MAX_HTTP_CONCURRENT = 2 # never more than 2 at once for Indeed/Glassdoor -``` - -### CAPTCHA detection - -```python -def is_captcha_page() -> bool: - """Check if the current page is a CAPTCHA or block page.""" - url = page_info()["url"] - title = js("document.title") or "" - body_text = js("document.body ? document.body.innerText.substring(0, 500) : ''") or "" - - signals = [ - "captcha" in url.lower(), - "robot" in title.lower(), - "are you a human" in body_text.lower(), - "verify you are human" in body_text.lower(), - "unusual traffic" in body_text.lower(), - "indeed.com/error" in url, - "sorry" in title.lower() and "indeed" in url, - ] - return any(signals) - -# Use after every goto: -goto_url(some_url) -wait_for_load() -wait(2) -if is_captcha_page(): - capture_screenshot() - # Wait longer and retry once - wait(10) - goto_url(some_url) - wait_for_load() - wait(3) -``` - -### Glassdoor session hygiene - -Glassdoor's bot detection is more fingerprint-based. If results stop loading: - -1. Take a `capture_screenshot()` — confirm whether it is a login modal vs a block page -2. Dismiss any login modal first (`dismiss_glassdoor_login_modal()`) -3. If a block page appears, pause 30+ seconds before retrying -4. Switch to Indeed for the same query — results are similar and bot tolerance is higher - ---- - -## Filtering by date, job type, and salary - -### Indeed URL filter parameters - -```python -from urllib.parse import quote_plus - -def build_indeed_url( - query: str, - location: str = "", - fromage: int = 0, # days: 1=last 24h, 3=last 3 days, 7=last week - job_type: str = "", # "fulltime", "parttime", "contract", "internship", "temporary" - remote: bool = False, - start: int = 0, -) -> str: - base = f"https://www.indeed.com/jobs?q={quote_plus(query)}&l={quote_plus(location)}" - if fromage: - base += f"&fromage={fromage}" - if job_type: - base += f"&jt={job_type}" - if remote: - base += "&remotejob=032b3046-06a3-4876-8dfd-474eb5e7ed11" - if start: - base += f"&start={start}" - return base - -# Examples -url = build_indeed_url("backend engineer", "Austin, TX", fromage=7, job_type="fulltime") -url = build_indeed_url("data analyst", remote=True, fromage=1) -``` - ---- - -## Collecting N results across pages - -```python -import json -from urllib.parse import quote_plus - -def collect_indeed_jobs(query: str, location: str = "", max_results: int = 20, - fromage: int = 0, job_type: str = "") -> list[dict]: - """ - Collect up to max_results jobs from Indeed across multiple pages. - Waits between pages to avoid bot detection. 
- """ - all_jobs = [] - seen_jks = set() - page = 0 - - while len(all_jobs) < max_results: - start = page * 10 - url = build_indeed_url(query, location, fromage=fromage, job_type=job_type, start=start) - goto_url(url) - wait_for_load() - wait(2.5) - - if page == 0: - dismiss_cookie_banner() - - if is_captcha_page(): - print(f"CAPTCHA on page {page+1}, stopping") - break - - batch_json = js(""" - (function() { - var cards = document.querySelectorAll('[data-jk]'); - var out = []; - for (var i = 0; i < cards.length; i++) { - var c = cards[i]; - var jk = c.getAttribute('data-jk') || ''; - if (!jk) continue; - var titleEl = c.querySelector('h2.jobTitle span[title], [data-testid="job-title"]'); - var compEl = c.querySelector('[data-testid="company-name"], .companyName'); - var locEl = c.querySelector('[data-testid="text-location"], .companyLocation'); - var salEl = c.querySelector('[data-testid="attribute_snippet_testid"], .salary-snippet-container'); - var dateEl = c.querySelector('[data-testid="myJobsStateDate"], span.date'); - out.push({ - jk, - title: titleEl ? titleEl.innerText.trim() : '', - company: compEl ? compEl.innerText.trim() : '', - location: locEl ? locEl.innerText.trim() : '', - salary: salEl ? salEl.innerText.trim() : '', - posted: dateEl ? dateEl.innerText.trim() : '', - url: 'https://www.indeed.com/viewjob?jk=' + jk, - }); - } - return JSON.stringify(out.filter(j => j.title && j.jk)); - })() - """) - - batch = json.loads(batch_json) - if not batch: - break # no more results - - new_jobs = [j for j in batch if j["jk"] not in seen_jks] - seen_jks.update(j["jk"] for j in new_jobs) - all_jobs.extend(new_jobs) - page += 1 - - return all_jobs[:max_results] - -# Examples -jobs = collect_indeed_jobs("Python developer", "San Francisco", max_results=20) -jobs = collect_indeed_jobs("remote software engineer", fromage=1, max_results=10) -jobs = collect_indeed_jobs("machine learning engineer", max_results=30, fromage=7, job_type="fulltime") -``` - ---- - -## Gotchas - -- **`data-jk` is the job key, not a DOM id** — Always use `[data-jk]` to select cards, not `#job-...` ids which vary by page layout and A/B test variant. - -- **Indeed redirect links are NOT stable URLs** — Anchor `href` values in search results go through `https://www.indeed.com/rc/clk?...` tracking redirects which expire. Always extract `data-jk` from the card and construct `https://www.indeed.com/viewjob?jk={jk}` yourself. - -- **Salary is on the detail page, not the card** — Many listings show no salary in the search result card. If salary is required, fetch the individual `viewjob?jk=` page and extract it there. Budget `wait(3)` per detail page and do not fetch more than 5 detail pages per minute. - -- **"Employer est." vs "Glassdoor est."** — These are two distinct data signals. Employer estimates come from the job post itself; Glassdoor estimates are crowd-sourced. The distinction matters when reporting salary accuracy to users. - -- **Glassdoor login modal appears after 2-3 scrolls** — Extract all visible cards immediately on load before scrolling. If you need to load more results via scroll/infinite scroll, dismiss the modal first. - -- **Glassdoor public results are limited** — Without login, Glassdoor shows ~10-15 cards. If the task requires 30+ results, use Indeed instead (no login required, up to ~15 per page with full pagination). - -- **Stepstone uses path-based URL routing, not query params** — Spaces in keyword or city must be replaced with `-` for the path, not `%20` or `+`. 
`quote_plus()` is wrong for path segments. Use `.replace(" ", "-")`. - -- **Stepstone pagination is in the path** — `/page-2.html`, `/page-3.html` — not `?page=2`. There is no `&start=N` param as in Indeed. - -- **`http_get` for Glassdoor fails more often** — Glassdoor requires JS to render job cards. Use the browser path for Glassdoor. `http_get` only works reliably for Indeed and Stepstone where server-rendered HTML contains structured data. - -- **Indeed embeds JSON in a `', html, re.DOTALL) - for block in jsonld_raw: - # Strip CDATA wrapper that Letterboxd wraps around JSON-LD - cleaned = re.sub(r'/\*\s*.*?\*/', '', cleaned, flags=re.DOTALL) - try: - data = json.loads(cleaned.strip()) - except json.JSONDecodeError: - continue - if data.get('@type') != 'Movie': - continue - - result['title'] = data['name'] - result['year'] = data['releasedEvent'][0]['startDate'] if data.get('releasedEvent') else None - result['directors'] = [d['name'] for d in data.get('director', [])] - result['genres'] = data.get('genre', []) - result['countries'] = [c['name'] for c in data.get('countryOfOrigin', [])] - result['studios'] = [s['name'] for s in data.get('productionCompany', [])] - result['actors'] = [a['name'] for a in data.get('actors', [])] - result['poster_url'] = data.get('image') - result['url'] = data.get('url') - r = data.get('aggregateRating', {}) - result['rating'] = r.get('ratingValue') # float 0.0–5.0 - result['rating_count'] = r.get('ratingCount') # int, total ratings cast - result['review_count'] = r.get('reviewCount') # int, written reviews only - - # --- OG / meta tags (fast fallback, redundant) --- - og = lambda prop: next(iter(re.findall( - rf']+property="og:{prop}"[^>]+content="([^"]*)"', html)), None) - result['og_title'] = og('title') # includes year: "The Godfather (1972)" - result['synopsis'] = htmllib.unescape(og('description') or '') - result['og_image'] = og('image') # large 1200x675 crop - - # --- Film ID (internal numeric ID) --- - m = re.search(r'data-film-id="(\d+)"', html) - result['film_id'] = m.group(1) if m else None - - # --- Tagline --- - m = re.search(r'

          ([^<]+)

          ', html) - result['tagline'] = htmllib.unescape(m.group(1)) if m else None - - # --- Themes (from tab-genres section) --- - m = re.search(r'

          Themes

          .*?

          (.*?)

          ', html, re.DOTALL) - result['themes'] = re.findall(r'class="text-slug">([^<]+)
          ', m.group(1)) if m else [] - - # --- Languages --- - result['languages'] = re.findall(r'href="/films/language/[^/]+/"[^>]*>([^<]+)', html) - - # --- Fans count --- - m = re.search(r'class="accessory"[^>]*>\s*([\d,KkMm]+)\s*fans', html) - result['fans'] = m.group(1) if m else None # e.g. "133K" - - # --- Popular reviews (top 12 inline on the page) --- - result['reviews'] = [] - for vid, person, block in re.findall( - r'
          ]*data-viewing-id="(\d+)"[^>]*data-person="([^"]+)">(.*?)
          ', - html, re.DOTALL - ): - dm = re.search(r'([^<]+)', block) - tm = re.search(r'class="body-text -prose -reset[^"]*"[^>]*>(.*?)
        5. ', block, re.DOTALL) - lm = re.search(r'data-count="(\d+)"', block) - result['reviews'].append({ - 'viewing_id': vid, - 'username': person, - 'display_name': dm.group(1) if dm else person, - 'review': re.sub(r'<[^>]+>', '', tm.group(1)).strip() if tm else '', - 'likes': int(lm.group(1)) if lm else 0, - }) - - return result -``` - -### Verified output (2026-04-18) - -```python -data = extract_film_data('the-godfather') -# { -# 'title': 'The Godfather', -# 'year': '1972', -# 'directors': ['Francis Ford Coppola'], -# 'genres': ['Crime', 'Drama'], -# 'countries': ['USA'], -# 'studios': ['Paramount Pictures', 'Alfran Productions'], -# 'actors': ['Marlon Brando', 'Al Pacino', 'James Caan', ...], # full cast list -# 'rating': 4.52, -# 'rating_count': 2619662, -# 'review_count': 372579, -# 'fans': '133K', -# 'film_id': '51818', -# 'tagline': "An offer you can't refuse.", -# 'genres': ['Crime', 'Drama'], -# 'themes': ['Crime, drugs and gangsters', 'Gritty crime and ruthless gangsters', ...], -# 'languages': ['English', 'Latin', 'English', 'Italian'], # may have dupes; deduplicate -# 'og_title': 'The Godfather (1972)', -# 'synopsis': 'Spanning the years 1945 to 1955...', -# 'poster_url': 'https://a.ltrbxd.com/resized/film-poster/.../51818-the-godfather-0-230-0-345-crop.jpg...', -# 'og_image': 'https://a.ltrbxd.com/resized/sm/upload/.../the-godfather-1200-1200-675-675-crop-000000.jpg...', -# 'reviews': [ -# {'username': 'wizardchurch', 'display_name': 'Hannah', 'likes': 30944, -# 'review': 'haha they made that scene from zootopia into a movie'}, -# ... # 12 total -# ] -# } - -data = extract_film_data('parasite-2019') -# title: 'Parasite', year: '2019', rating: 4.53, rating_count: 5264520, review_count: 690652 -# fans: '175K', directors: ['Bong Joon Ho'], countries: ['South Korea'] - -data = extract_film_data('inception') -# title: 'Inception', year: '2010', rating: 4.23, rating_count: 3913620 -``` - ---- - -## Path 2: User profile via http_get - -Only the user root page `letterboxd.com/{username}/` is accessible. Sub-pages (`/films/`, `/diary/`, `/lists/`) return 403. - -```python -import re, html as htmllib -from helpers import http_get - -def extract_user_profile(username): - html = http_get(f"https://letterboxd.com/{username}/") - - # Display name - dm = re.search(r'class="displayname tooltip"[^>]*>([^<]+)', html) - - # Stats block (Films / This year / Lists / Following / Followers) - stats = re.findall( - r'(\d[\d,]*)' - r'([^<]+)', - html - ) - - # Favorites from OG description - od = re.search(r']+property="og:description"[^>]+content="([^"]*)"', html) - favorites = [] - if od: - fm = re.search(r'Favorites:\s*([^.]+)\.', od.group(1)) - if fm: - favorites = [f.strip() for f in fm.group(1).split(',')] - - # Film IDs of films shown on profile page (recent activity) - film_ids_on_page = list(set(re.findall(r'data-film-id="(\d+)"', html))) - - return { - 'username': username, - 'display_name': dm.group(1) if dm else None, - 'stats': {label.strip(): int(val.replace(',', '')) for val, label in stats}, - 'favorites': favorites, - 'film_ids_on_page': film_ids_on_page, - } -``` - -### Verified output - -```python -data = extract_user_profile('dave') -# { -# 'username': 'dave', -# 'display_name': 'Dave Vis', -# 'stats': {'Films': 2553, 'This year': 63, 'Lists': 155, 'Following': 77, 'Followers': 34512}, -# 'favorites': ['High and Low (1963)', 'Burning (2018)', 'My Neighbor Totoro (1988)', 'Mulholland Drive (2001)'], -# 'film_ids_on_page': ['51818', '47756', ...] 
# ~32 film IDs from recent activity blocks -# } -``` - ---- - -## Path 3: Global activity stream from /films/ - -`letterboxd.com/films/` returns the recent global activity feed — approximately 6 full viewing entries, plus many more film slugs from the UI. Use this to discover recently-logged films. - -```python -import re, html as htmllib -from helpers import http_get - -def extract_activity_stream(): - html = http_get("https://letterboxd.com/films/") - entries = [] - for owner, obj_id, block in re.findall( - r'class="production-viewing[^"]*"[^>]*data-owner="([^"]+)"[^>]*data-object-id="([^"]+)"[^>]*>(.*?)', - html, re.DOTALL - ): - film_m = re.search( - r'data-item-name="([^"]*)".*?data-item-slug="([^"]*)".*?data-film-id="(\d+)"', - block, re.DOTALL - ) - if film_m: - entries.append({ - 'owner': owner, - 'film_name': htmllib.unescape(film_m.group(1)), - 'film_slug': film_m.group(2), - 'film_id': film_m.group(3), - }) - return entries - -# Returns ~6 entries. Film names are in "Title (Year)" format. -# Example: [{'owner': 'sidduww', 'film_name': 'The Drama (2026)', -# 'film_slug': 'the-drama', 'film_id': '1205494'}, ...] -``` - ---- - -## Path 4: Browser for list pages and sub-pages (403 via http_get) - -These pages require the browser — use `goto_url()` + `wait_for_load()` + `wait(2)`: - -```python -from helpers import goto, wait_for_load, wait, js -import json - -# Popular films -goto_url("https://letterboxd.com/films/popular/") -wait_for_load() -wait(2) - -films = json.loads(js(""" -(function() { - var items = Array.from(document.querySelectorAll('li.film-list-entry, li[class*="poster-container"]')); - return JSON.stringify(items.slice(0, 30).map(function(el) { - var poster = el.querySelector('[data-item-slug]') || el.querySelector('[data-film-slug]'); - return { - name: poster ? (poster.dataset.itemName || poster.dataset.filmName) : null, - slug: poster ? (poster.dataset.itemSlug || poster.dataset.filmSlug) : null, - film_id: poster ? poster.dataset.filmId : null - }; - }).filter(function(x){ return x.slug; })); -})() -""")) - -# User watched films list (paginated, 72/page) -goto_url("https://letterboxd.com/dave/films/") -wait_for_load() -wait(2) - -films = json.loads(js(""" -(function() { - var items = Array.from(document.querySelectorAll('li[data-film-id]')); - return JSON.stringify(items.map(function(el) { - return { - film_id: el.dataset.filmId, - film_slug: el.dataset.targetLink ? el.dataset.targetLink.replace(/\\/film\\/|\\/$/g,'') : null, - rating: el.dataset.ownerRating || null - }; - })); -})() -""")) - -# User diary entries -goto_url("https://letterboxd.com/dave/diary/") -wait_for_load() -wait(2) - -# For paginated browsing, check next page link -next_page_url = js(""" -(function() { - var a = document.querySelector('a.next'); - return a ? a.href : null; -})() -""") -# Returns URL for next page or null. Load it with goto_url(next_page_url). -``` - ---- - -## Gotchas - -**JSON-LD is wrapped in CDATA comments** — `json.loads(block)` will fail without stripping the wrapper. Always strip `/* */` first: -```python -cleaned = re.sub(r'/\*\s*.*?\*/', '', cleaned, flags=re.DOTALL) -data = json.loads(cleaned.strip()) -``` - -**JSON-LD `name` is bare title, not "Title (Year)"** — `data['name']` returns `'Parasite'`, not `'Parasite (2019)'`. Year is in `data['releasedEvent'][0]['startDate']`. The OG `og:title` meta tag does include the year. - -**OG description contains HTML entities** — `og:description` and `tagline` use `'` etc. Always call `html.unescape()` on them. 
-
-**`languages` list can have duplicates** — e.g. Parasite returns `['Korean', 'English', 'German', 'Korean']`. Call `list(dict.fromkeys(result['languages']))` to deduplicate while preserving order.
-
-**Disambiguation slugs** — when two films share a title, Letterboxd appends the year to the slug: `parasite-2019` (Bong's film) vs `parasite` (1982 film). If your slug 404s, try appending `-{year}`.
-
-**403 pages** — `/film/{slug}/reviews/`, `/film/{slug}/ratings/`, `/film/{slug}/cast/`, `/film/{slug}/details/`, `/{username}/films/`, `/films/popular/`, `/films/by/rating/`, `/genre/{slug}/`, `/director/{slug}/`, `/actor/{slug}/` all return 403 to `http_get`. These require the browser.
-
-**CSI endpoints are 403** — Letterboxd loads the ratings histogram via `/csi/film/{slug}/rating-histogram/`, which returns 403 without a session cookie. Access the ratings distribution via the browser on `/film/{slug}/ratings/`.
-
-**`/csi/` and `/ajax/` endpoints need session cookies** — these populate the ratings histogram, friend activity, and popular-review sections after page load. Only the inline HTML data (top 12 popular reviews) is available via `http_get`.
-
-**Cloudflare Turnstile is present but passive** — the `configuration.cloudflare.turnstile` object is in the page JS, but it only activates on the login form. It does not block unauthenticated reads on public film/user pages.
-
-**The official API requires OAuth** — `api.letterboxd.com/api/v0/` returns 401 on all endpoints. Apply for API access at letterboxd.com/api-beta/ to get client credentials.
-
-**Fans count is abbreviated** — `'133K'`, `'175K'`. Parse with:
-```python
-def parse_abbrev(s):
-    s = s.strip().upper()
-    if s.endswith('K'): return int(float(s[:-1]) * 1000)
-    if s.endswith('M'): return int(float(s[:-1]) * 1000000)
-    return int(s.replace(',', ''))
-```
-
-**Film slug from unknown title** — Letterboxd has no public search API. Construct the slug by lowercasing the title and replacing spaces with hyphens, then `http_get` and check for a 403/404 vs a valid JSON-LD block.
diff --git a/packages/bcode-browser/harness/agent-workspace/domain-skills/linkedin/invitation-manager.md b/packages/bcode-browser/harness/agent-workspace/domain-skills/linkedin/invitation-manager.md
deleted file mode 100644
index 1d0d8bb66..000000000
--- a/packages/bcode-browser/harness/agent-workspace/domain-skills/linkedin/invitation-manager.md
+++ /dev/null
@@ -1,109 +0,0 @@
-# LinkedIn — Invitation Manager
-
-Accept or ignore pending connection invitations in bulk from
-`https://www.linkedin.com/mynetwork/invitation-manager/received//`.
-
-## URL filters
-
-The trailing slug pre-filters the received invitations. Observed slugs:
-
-- `PEOPLE_WITH_MUTUAL_CONNECTION` — people who share a mutual connection
-- `PEOPLE_WITH_MUTUAL_SCHOOL` — people who share a school
-- omit the slug (`.../received/`) for all pending invitations
-
-The filter chip at the top of the page mirrors the URL and also renders
-`All (N)`, `Mutual Connections (N)`, `Your School (N)` — the `(N)` is the
-authoritative remaining-count for the active filter and is what you loop on.
-
-## Button selectors
-
-Each pending-invitation card contains an Accept and an Ignore control.
-**The aria-label formats are different** for the two buttons — don't derive
-one from the other:
-
-- Accept: `aria-label = "Accept 's invitation"` (note: curly `’`, not ASCII `'`)
-- Ignore: `aria-label = "Ignore an invitation to connect from "`
-
-```python
-# Match either — both are unique per card
-accepts = js("Array.from(document.querySelectorAll('button, a')).filter(b => (b.getAttribute('aria-label')||'').startsWith('Accept ')).length")
-ignores = js("Array.from(document.querySelectorAll('button')).filter(b => (b.getAttribute('aria-label')||'').toLowerCase().startsWith('ignore')).length")
-```
-
-## Trap: "follows you" cards render Accept as ``, not `