Public Notes by reorx Tagged #crawler

pure.md - global cache between LLMs and the web pure.md

Reliably access web content in markdown format by simply prefixing any URL with `pure.md/`. Avoids bot detection, renders JavaScript-heavy websites, and converts HTML, PDFs, images, and more into pure markdown.

#crawler #web #extraction #markdown #api #ai #agent
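
The URL-prefixing scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the service's official client: the helper name `pure_md_url` is hypothetical, and the use of HTTPS on the `pure.md` host is an assumption taken from the link in these notes.

```python
def pure_md_url(target: str) -> str:
    """Return `target` prefixed with pure.md/, as the note describes.

    Hypothetical helper for illustration; it only builds the URL and
    performs no network request.
    """
    target = target.strip()
    if not target:
        raise ValueError("target URL must not be empty")
    # Assumption: the service is reached over HTTPS at the pure.md host.
    return "https://pure.md/" + target


if __name__ == "__main__":
    # Build the prefixed address for an example page.
    print(pure_md_url("https://example.com/docs"))
```

The resulting address can then be fetched with any HTTP client; rate limits and authentication, if any, are not covered by this sketch.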

Crawlspace - The centralized web crawling platform crawlspace.dev

Crawlspace is a centralized platform for developers to build and deploy web crawlers. Gather fresh data for your apps and agents while contributing to a platform-wide cache for crawler traffic.

via: https://pure.md/

#crawler #api #ai #agent

Apify: Full-stack web scraping and data extraction platform apify.com

apify/crawlee: Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works ... github.com

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation. - apify/crawlee

#nodejs #crawler #scraper #headless-browser #framework

GitHub - getmaxun/maxun: Free, open-source no-code web data extraction platform. Build custom robots to automate data scraping [In Beta] github.com

#ai #crawler #extraction #web

caolvchong-top/twitter_download: Twitter image and video crawler; one-click download github.com

Twitter image and video crawler; one-click download. Contribute to caolvchong-top/twitter_download development by creating an account on GitHub.

#twitter #crawler #python

Home - Firecrawl www.firecrawl.dev

Firecrawl crawls and converts any website into clean markdown.

#api #crawler #markdown #ai #llm #readability

JustAnotherArchivist/snscrape: A social networking service scraper in Python github.com

A social networking service scraper in Python. Contribute to JustAnotherArchivist/snscrape development by creating an account on GitHub.

#osint #crawler #twitter #telegram #python #scraper

mendableai/firecrawl: 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API. github.com

🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API. - mendableai/firecrawl

#crawler #api #ai #markdown #service

s0md3v/Photon: Incredibly fast crawler designed for OSINT. github.com

Incredibly fast crawler designed for OSINT. Contribute to s0md3v/Photon development by creating an account on GitHub.

#crawler #python #osint #archive

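NanmiCoder/MediaCrawler: Xiaohongshu note and comment crawler; Douyin video and comment crawler; Kuaishou video and comment crawler; Bilibili video and comment crawler; Weibo post and comment crawler github.com

Xiaohongshu note and comment crawler; Douyin video and comment crawler; Kuaishou video and comment crawler; Bilibili video and comment crawler; Weibo post and comment crawler - NanmiCoder/MediaCrawler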

#crawler #bilibili #xiaohongshu #python

Web Scraping in Python – The Complete Guide | Hacker News news.ycombinator.com

#crawler #guide #python

Web Scraping Proxies API for Developers proxiesapi.com

#api #crawler

Web Scraping in Python - The Complete Guide | ProxiesAPI proxiesapi.com

#crawler #guide #python #scraper #scraping #tutorial

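GitHub - xisuo67/XHS-Spider: Xiaohongshu data collection tool for batch downloading website images and video resources; a polished-looking data collection tool (batch download, video extraction, images, watermark removal, etc.) github.com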

MechanicalSoup/MechanicalSoup: A Python library for automating interaction with websites. github.com

gaojiuli/gain: Web crawling framework based on asyncio. github.com

Web crawling framework based on asyncio. Contribute to gaojiuli/gain development by creating an account on GitHub.

#python #crawler #asyncio

ultrafunkamsterdam/undetected-chromedriver: Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM) github.com

Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM) - ultrafunkamsterdam/undetected-chromedriver

#headless #chrome #crawler #anti-crawler

Extract web data at scale. scrapeninja.net

Scrape and Monitor Data from Any Website with No Code www.browse.ai

#ai #crawler #extraction #monitoring #web

Telegram groups - Telegram group search - TgSql.com www.tgsql.com

#channel #crawler #ranking #search #telegram

Reading (阅读, io.legado.app.release) - 3.21.080316 - Apps - Coolapk www.coolapk.com

#android #app #crawler #ebook #pub #reading #txt

HTTrack Website Copier - Free Software Offline Browser (GNU GPL) www.httrack.com

HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility. It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the 'mirrored' website in your browser, and you can browse the site from link to link, as if you were viewing it online....

#website #archiving #download #crawler #pub

Teleport -- Offline Browsing Webspider www.tenmax.com

Teleport Pro: The world's most widely used webspider. Fast, reliable, robust, comprehensive webspidering, Teleport Pro by Tennyson Maxwell Information Systems, Inc.

#website #archiving #crawler #pub
