<div class="postcontent">
Mei eu mollis albucius, ex nisl contentiones vix. Duo persius volutpat at, cu iuvaret epicuri mei. Duo posse pertinacia no, ex dolor contentiones mea. Nec omnium utamur dignissim ne. Mundi lucilius salutandi an sea, ne sea aeque iudico comprehensam. Populo delicatissimi ad pri. Ex vitae accusam vivendum pro.What's PicoScraper?
PicoScraper is a standalone, easy-to-use, JSON-defined scraper that exports to SQLite, Excel, CSV and ODBC.
It includes advanced features such as pagination handling, infinite scroll, proxies, API access and much more.
It's both a GUI and a command-line application, and it's available for Windows, macOS and Linux.
Installing PicoScraper
PicoScraper basics
PicoScraper uses BeautifulSoup (BS4) to handle parsing, so I strongly recommend you take a look at the SoupSieve documentation regarding the available CSS selectors. This one from W3Schools is also useful for reference.
If you're unfamiliar with CSS selectors, you can also pick them with the browser inspector in Firefox, Chrome, or any other browser.
The Learning by doing section gives a more detailed, step-by-step explanation of the whole process.
Rows and columns
Let's say we have the following page:
<div class="postinfo">
<a class="userid" href="#">John</a>
<span class="date">27th Dec 2020 8:00 AM</span>
<div class="postinfo">
<a class="userid" href="#">Alex</a>
<span class="date">28th Dec 6:00 PM</span>
<div class="postcontent">
His audiam deserunt in, eum ubique voluptatibus te. In reque dicta usu. Ne rebum dissentiet eam, vim omnis deseruisse id. Ullum deleniti vituperata at quo, insolens complectitur te eos, ea pri dico munere propriae. Vel ferri facilis ut, qui paulo ridens praesent ad. Possim alterum qui cu. Accusamus consulatu ius te, cu decore soleat appareat usu.In this situation each post constitutes a row in our database, and we could extract the user name, date, and post content. Let's define it in our example.json.
{
"start_url": "https://example.com/example-thread",
"random_sleep": 5,
"rows_selector": "div[id*=\"post\"]",
"columns_selector": [
{
"field": "userid",
"css_selector": ".userid",
"default": "",
"property": "href"
},
{
"field": "date",
"css_selector": ".date",
"default": "no-date-defined",
"property": "text"
},
{
"field": "content",
"css_selector": ".postcontent",
"default": "",
"property": "text"
}
]
}
So, why do we use div[id*=\"post\"] as a row selector? Because we can safely assume each post's id attribute follows a pattern like post_number-of-post, so this selector tells the scraper to treat every div whose id contains the word post as a row.
What about the user id settings? "field": "userid" simply names the column in our database, and "css_selector": ".userid" means the data is contained in an element with the class userid. Since we don't mind the cell being empty when there's no content in .userid, "default": "" is left empty.
Finally, we've chosen "property": "href" because we want the user profile URL. Why? Most websites won't display the user ID as text, but they do link to the user profile, and that link usually contains a unique identifier. Let's also say it's too much effort to test whether the website allows two users with the same username.
This would be the output:
| userid | date | content |
|---|---|---|
| https://example.com/user-54987 | 27th Dec 2020 8:00 AM | Mei eu mollis albucius, ex nisl contentiones vix. Duo persius volutpat at, cu iuvaret epicuri mei. [...] |
| https://example.com/user-49555 | 28th Dec 6:00 PM | His audiam deserunt in, eum ubique voluptatibus te. [...] |
General attributes
These are attributes or parameters that you set before defining your column selectors.
start_url
There are three main ways to point your scraper to a URL (a small configuration sketch follows the table).
| Method | Explanation |
|---|---|
| file:url-list.txt | Loads a list of starting URLs from a file |
| http://example.com/sub/[5-15:5]/page/parameter | Iterates over pages 5, 10 and 15 (from 5 to 15 in steps of 5) |
| http://example.com/sub/page/parameter | Loads a single starting URL |
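As a minimal sketch, here is how the page-range syntax from the table might appear as a start_url; the domain and path are placeholders, and the row selector is reused from the earlier example:
{
  "start_url": "http://example.com/sub/[5-15:5]/page/parameter",
  "random_sleep": 5,
  "rows_selector": "div[id*=\"post\"]",
  "columns_selector": []
}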
random_sleep
Random sleep spaces requests out at random intervals. This is useful for avoiding pattern blocking that looks for evenly spaced requests. The default value is 2.0 (read: the delay is random() * 2.0 seconds), but you can set 0.0 or any other value.
{
"start_url": "https://example.com/example-thread",
"random_sleep": 5,
"rows_selector": "div[id*=\"post\"]",
"columns_selector": []
}
row_link
Sometimes our target information is not at the level we're scraping. This attribute lets the scraper go one level deeper, that is, rows_selector > row_link.
{
"start_url": "https://example.com/example-thread",
"random_sleep": 5,
"rows_selector": "div[id*=\"post\"]",
"row_link": "a[title*=\"one_level_deeper\"]",
"next_link": "a[title*=\"Next Page\"]",
"columns_selector": []
}
next_link
Enables pagination. Just use a CSS selector pointing to the element that acts as the next-page link; PicoScraper will keep clicking that element until it disappears.
It also works when more than one element matches the selector on the same page, such as a next link at both the top and the bottom of the current page.
{
"start_url": "https://example.com/example-thread",
"random_sleep": 5,
"rows_selector": "div[id*=\"post\"]",
"next_link": "a[title*=\"Next Page\"]",
"columns_selector": []
}
workers
Workers allow several URLs to be scraped concurrently.
The default is 1, which means there is only one thread processing data.
{
"start_url": "https://example.com/example-thread",
"random_sleep": 5,
"workers": 2,
"rows_selector": "div[id*=\"post\"]",
"columns_selector": []
}
Keep in mind that multiple workers can stress your target site. Even from a selfish perspective, you want that site to work smoothly so your data extraction job runs without interruption.
Any load you apply to your target site is added to the site's normal load. Typically, site load is seasonal: a site's traffic spikes may happen at a different time, day, or month than when you are testing.
This means that even if your target site handles n workers smoothly right now, it may be stressed at another time, creating a data-loss scenario that you don't want.
Consider local load as well. A simple scraper should not be a problem, but if you use multiple workers together with pseudo-classes like contains(), it can stress your machine quickly.
rows_selector
This parameter selects the elements that contain the columns. Each of these elements will become a row in your database.
For:
<div class="row-class">
<div class="column1">Text1</div>
<div class="column2">Text2</div>
<div class="column3">Text3</div>
<div>irrelevant data</div>
</div>
<div class="row-class">
<div class="column1">Text4</div>
<div class="column2">Text5</div>
<div class="column3">Text6</div>
<div>irrelevant data</div>
</div>
You'll need:
{
"start_url": "https://example.com/example-thread",
"rows_selector": ".row-class",
"columns_selector": [
{
"field": "column1-field",
"css_selector": ".column1",
"default": "",
"property": "text"
},
{
"field": "column2-field",
"css_selector": ".column2",
"default": "",
"property": "text"
},
{
"field": "column3-field",
"css_selector": ".column3",
"default": "",
"property": "text"
}
]
}
Output would be:
| column1-field | column2-field | column3-field |
|---|---|---|
| Text1 | Text2 | Text3 |
| Text4 | Text5 | Text6 |
columns_selector
Defines the columns in your dataset. Its attributes/parameters are explained below.
{
"start_url": "https://example.com/example-thread",
"rows_selector": ".row-class",
"columns_selector": [
{
"field": "column1-field",
"css_selector": ".column1",
"default": "",
"property": "text"
},
{
"field": "column2-field",
"css_selector": ".column2",
"default": "",
"property": "text"
},
{
"field": "column3-field",
"css_selector": ".column3",
"default": "",
"property": "text"
}
]
}
⚠ Note: the last element in the list goes without a trailing comma!
Column attributes
field
This parameter designates the name of the column in your dataset. If the database or output file format you plan to use has naming limitations, take them into account here.
css_selector
This is the most important parameter for columns. Just place the selector where the information for your column is located.
If you're still not familiar with CSS selectors, spend some time reading the documentation linked above; they are explained in many places.
default
This parameter sets the value entered in the field when the selector does not return content.
It is useful for further processing of the data: perhaps you would rather record N/A or 0 than leave an empty field.
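As a small sketch, a column that falls back to N/A whenever its selector finds nothing could be defined like this (the price field and .price selector are made-up placeholders):
{
  "field": "price",
  "css_selector": ".price",
  "default": "N/A",
  "property": "text"
}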
property
The property parameter is pretty flexible. The most common values are text, which retrieves the content inside the selected tag, and href, which retrieves a link URL.
<div class="threadinfo" title="Ok so let's go through the cheapest options for new equipment that you'd be willing to use for training."></div>
<span data-ts="1618944790" class="ts visible" title="sent: 20:53"> 20:53</span>
Now, let's take a look at the code above. title could be used to retrieve the value of that attribute from any element.
Next, let's say the title and the text of the ts visible span are generated by JavaScript, and we want to avoid dealing with that. We could simply retrieve the Unix timestamp contained in the data-ts attribute by passing it as the value of the property parameter.
{
"field": "thread_info",
"css_selector": ".threadinfo",
"default": "",
"property": "title"
},
{
"field": "unix_date",
"css_selector": ".ts",
"default": "",
"property": "data-ts"
}
Export options
When you run picoscraper from the command line, in addition to the name of your configuration file you need to specify the name of the output file and its extension. The name can be anything; the extension tells PicoScraper what type of file you want.
picoscraper configuration.json output.type
| Value | Comment |
|---|---|
| (no output given) | Defaults to SQLite. |
| output.db | PicoScraper defaults to SQLite, but you may want to specify the name and/or directory of the output database. |
| output.json | You'll get both JSON and SQLite. |
| output.csv | You'll get both CSV and SQLite. |
| output.xlsx | You'll get both XLSX and SQLite. |
| ODBC Driver | Any database allowing for ODBC connection. Keep in mind you need privileges for creating tables and populating them. ⚠ Not implemented yet, check roadmap in the index page. |
What's SQLite? How do I use it?
If you're not familiar with SQLite: simply put, it is a database contained in a single file. You can just copy that file and paste it somewhere else.
There is no client-server architecture, as there is with MySQL (you may remember that database from software such as WordPress, forums, etc.).
You can explore your SQLite database with this software:
- SQLiteStudio (Windows)
- HeidiSQL (Windows)
- DBeaver (Windows, Mac, Linux)
- SQLite Browser (Windows, Mac, Linux) -- Recommended
- KNIME (Windows, Mac, Linux) -- Recommended
You can explore, edit and export your database with this software. It's also possible to use SQL to interact with it; there are plenty of SQL tutorials for SQLite online.
And why is it better than Excel/CSV?
In theory, Excel has a limit of 1,048,576 rows and 16,384 columns, but you will run into problems much earlier, as it slows down excessively with large datasets, especially on modest computers.
With CSV the problem can be similar if you open it with Excel, but even if you use software designed to deal with large datasets, you'll run into difficulties with human-written text.
CSV files separate fields with a delimiter (usually a comma) and wrap text fields in quotation marks, which is a problem if the text you are extracting itself contains quotation marks, commas, tabs, newlines and so on.
Databases solve both of these problems, and SQLite solves them in a particularly simple way, as it is just a file you can move around.
How do I use other databases? How do I export via ODBC?
Missing documentation?
Please contact us in our community.