PicoScraper is an easy-to-use, JSON-defined scraper.
> picoscraper my-scraper.json my-output-db.sqlite3
Define with JSON
- Simple: One JSON file for everything, both configuration and elements.
- CSS Selectors: Anything from a selector pasted straight from your browser inspector to complex selectors and pseudo-classes.
- Properties: Text, href, classes, ids, sources, custom attributes...
{
  "start_url": "file:url-list.txt",
  "random_sleep": 3,
  "rows_selector": "table[id*=\"post\"]",
  "columns_selector": [
    {
      "field": "thread",
      "css_selector": "a[href*=\"thread\"]",
      "default": "",
      "property": "href"
    },
    {
      "field": "date",
      "css_selector": "span:contains(\"AM\"),span:contains(\"PM\")",
      "default": "no-date-defined",
      "property": "text"
    }
  ]
}
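Selectors that contain double quotes have to be escaped inside JSON, which is easy to get wrong by hand. As a minimal sketch (not part of PicoScraper itself), the same definition can be built as a Python dict and serialized with the standard `json` module, which handles the escaping. The helper name `render_config` is hypothetical; the keys and values are copied from the example above.

```python
import json

# Same settings as the example definition, written as a Python dict.
# Single-quoted Python strings let the selectors keep their inner double quotes.
config = {
    "start_url": "file:url-list.txt",
    "random_sleep": 3,
    "rows_selector": 'table[id*="post"]',
    "columns_selector": [
        {
            "field": "thread",
            "css_selector": 'a[href*="thread"]',
            "default": "",
            "property": "href",
        },
        {
            "field": "date",
            "css_selector": 'span:contains("AM"),span:contains("PM")',
            "default": "no-date-defined",
            "property": "text",
        },
    ],
}


def render_config(cfg: dict) -> str:
    """Serialize the definition; json.dumps escapes the inner double quotes."""
    return json.dumps(cfg, indent=2)


if __name__ == "__main__":
    # Print the JSON text; save it as my-scraper.json to use it.
    print(render_config(config))
```

Saving the printed text as my-scraper.json gives a file that can be passed to the picoscraper command shown at the top.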
Multiple output choices
- Files: Excel & CSV.
- Databases: SQLite, and basically anything with an ODBC driver.
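The table layout PicoScraper writes to SQLite is not described here, so the sketch below makes no assumption about table or column names. It only lists whatever tables exist in an output database, using Python's standard `sqlite3` module. The function name `list_tables` is hypothetical.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite file, sorted by name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # File name taken from the command-line example at the top.
    print(list_tables("my-output-db.sqlite3"))
```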
Loading URLs can't get any lazier
Load from lists
"start_url": "file:url-list.txt"
Load from ranges
"start_url": "http://example.com/sub/[5-15:5]/page/parameter"
http://example.com/sub/5/page/parameter
http://example.com/sub/10/page/parameter
http://example.com/sub/15/page/parameter
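The two `start_url` forms above can be sketched in a few lines of Python. This is an illustration of the expansion, not PicoScraper's actual code. It assumes the placeholder reads `[start-end:step]` with an inclusive end (which matches the three URLs listed above), that only the first placeholder in a URL is expanded, and that a `file:` list holds one URL per line. The names `expand_range` and `resolve_start_urls` are hypothetical.

```python
import re
from pathlib import Path

FILE_PREFIX = "file:"
# Matches a "[start-end:step]" placeholder such as "[5-15:5]".
RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+):(\d+)\]")


def expand_range(url: str) -> list[str]:
    """Expand the first [start-end:step] placeholder; end is inclusive."""
    match = RANGE_PATTERN.search(url)
    if match is None:
        return [url]
    start, end, step = (int(group) for group in match.groups())
    if step <= 0:
        raise ValueError("step must be a positive integer")
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1, step)]


def resolve_start_urls(start_url: str) -> list[str]:
    """Turn a start_url value into the list of URLs to visit."""
    if start_url.startswith(FILE_PREFIX):
        # Assumption: one URL per line, blank lines ignored.
        path = Path(start_url[len(FILE_PREFIX):])
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    return expand_range(start_url)


if __name__ == "__main__":
    for url in resolve_start_urls("http://example.com/sub/[5-15:5]/page/parameter"):
        print(url)
```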
Crawl & Scrape
{
  "start_url": "http://example.com/",
  "crawl_levels": 5,
  "scrape_selector": "a[href*=\"thread_\"]",
  "random_sleep": 3,
  "rows_selector": "table[id*=\"post\"]",
  "columns_selector": []
}
PicoScraper gets all links from http://example.com/, uses every link that matches
a[href*="thread_"] as the starting URL for a scrape, saves the output, and repeats the process for the next 5 levels.
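The crawl described above can be sketched as a level-by-level traversal. This is only one reading of the description, not PicoScraper's implementation: how levels are counted, and whether off-site links are followed, are assumptions here. The selector `a[href*="thread_"]` is reduced to a substring test on the `href` value, and the page download is passed in as a `fetch` function so the sketch runs without network access. All names (`LinkCollector`, `crawl`, `fetch`) are hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url: str, crawl_levels: int, needle: str,
          fetch: Callable[[str], str]) -> list[str]:
    """Return the scrape targets found within crawl_levels levels of start_url.

    A link is a scrape target when its href contains `needle`, which mirrors
    the substring match of a[href*="..."].
    """
    seen = {start_url}
    frontier = [start_url]
    targets: list[str] = []
    for _ in range(crawl_levels):
        next_frontier: list[str] = []
        for page_url in frontier:
            parser = LinkCollector()
            parser.feed(fetch(page_url))
            for href in parser.hrefs:
                link = urljoin(page_url, href)  # resolve relative links
                if needle in href and link not in targets:
                    targets.append(link)
                if link not in seen:  # visit each page only once
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return targets
```

Each returned target would then be scraped with `rows_selector` and `columns_selector`. In real use, `fetch` could wrap `urllib.request.urlopen`, with a pause between requests to play the role of `random_sleep`.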
Demo VS Full License
| Category | Feature | Demo | Full |
|---|---|---|---|
| License | Price | Free, unlimited time | 10$/m or 100$/y |
| Basic | Row Limit | 10K | None |
| Basic | Pagination | Selector, Scroll | Selector, Scroll |
| Export | SQLite | ✔️ | ✔️ |
| Export | Excel | ✔️ | ✔️ |
| Export | CSV | ✔️ | ✔️ |
| Export | JSON | ✔️ | ✔️ |
| Export | ODBC | | ✔️ |
| Load | URL List | ✔️ | ✔️ |
| Load | Ranges | ✔️ | ✔️ |
| Load | Crawl & Scrape | | ✔️ |
| Advanced | JS Rendering | | ✔️ |
| Advanced | Rotating Proxies | | ✔️ |
| Advanced | Browser Fingerprinting | | ✔️ |
| Advanced | Captcha Solver Integration | | ✔️ |
| Advanced | User / Password Handling | | ✔️ |