totally reworked process argument tokenization, allowing both single and double quotes and removing some other limitations

Martin Rotter 2023-10-09 11:12:14 +02:00
parent 60bacf395a
commit edf696bae2
4 changed files with 183 additions and 57 deletions


@@ -19,8 +19,24 @@ However, if you choose `Script` option, then you cannot provide URL of your feed
 Any errors in your script must be written to [**error output** (stderr)](https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)).
-```{warning}
-As of RSS Guard 4.2.0, you cannot separate your arguments with `#`. If your argument contains spaces, then enclose it with DOUBLE quotes, for example `"my argument"`. DO NOT use SINGLE quotes to do that. If your path to executable contains backslashes as directory separators, make sure to escape them with another backslash.
+:::{warning}
+Quote each individual argument with double quotes `"arg"` or single quotes `'arg'` and separate all arguments with spaces. You have to escape some characters inside double-quoted argument, for example double quote itself like this `"arg with \"quoted\" part"`.
+Examples (one per line):
+```
+C:\\MyFolder\\My.exe "arg1" "arg2" "my \"quoted\" arg3" 'my "quoted" arg4'
+bash "%data%/scripts/download-feed.sh"
+%data%\jq.exe '{ version: "1.1", title: "Stars", items: map( . | .title=.full_name | .content_text=.description | .date_published=.pushed_at)}'
+```
+:::
+RSS Guard offers [placeholder](userdata.md#data-placeholder) `%data%` which is automatically replaced with full path to RSS Guard user data folder and you can use this placeholder anywhere in your script call line.
+```{attention}
+Working directory of process executing the script is set to point to RSS Guard [user data](userdata) folder.
 ```
 Format of post-process script execution line can be seen on picture below.
@@ -31,25 +47,7 @@ If everything goes well, script must return `0` as the process exit code, or a n
 Executable file must be always be specified, while arguments do not. Be very careful when quoting arguments. Tested examples of valid execution lines are:
-| Command | Explanation |
-| :--- | --- |
-| `bash -c "curl 'https://github.com/martinrotter.atom'"` | Download ATOM feed file using Bash and Curl. |
-| `Powershell Invoke-WebRequest "https://github.com/martinrotter.atom" \| Select-Object -ExpandProperty Content` | Download ATOM feed file with Powershell. |
-| `php tweeper.php -v 0 "https://twitter.com/NSACareers"` | Scrape Twitter RSS feed file with [Tweeper](https://git.ao2.it/tweeper.git). Tweeper is the utility that produces RSS feed from Twitter and other similar social platforms. |
-```{note}
-The above examples are cross-platform. You can use exactly the same command on Windows, Linux or macOS, if your operating system is properly configured.
-```
-RSS Guard offers [placeholder](userdata.md#data-placeholder) `%data%` which is automatically replaced with full path to RSS Guard user data folder, allowing you to make your configuration fully portable. You can, therefore, use something like this as a source script line: `bash %data%/scripts/download-feed.sh`.
-```{attention}
-Working directory of process executing the script is set to point to RSS Guard [user data](userdata) folder.
-```
-There are [examples of website scrapers](https://github.com/martinrotter/rssguard/tree/master/resources/scripts/scrapers). Most of them are written in Python 3, so their execution line is similar to `python script.py`. Make sure to examine each script for more information on how to use it.
-----
+## Dataflow
 After your source feed data is downloaded either via URL or custom script, you can optionally post-process it with one more custom script, which will take **raw source data as input**. It must produce valid feed data to standard output while printing all error messages to error output.
 Here is little flowchart explaining where and when scripts are used:
@@ -76,6 +74,10 @@ Typical post-processing filter might do things like CSS formatting, localization
 It's completely up to you if you decide to only use script as `Source` of the script or separate your custom functionality between `Source` script and `Post-process` script. Sometimes you might need different `Source` scripts for different online sources and the same `Post-process` script and vice versa.
+## Example Scrapers
+There are [examples of website scrapers](https://github.com/martinrotter/rssguard/tree/master/resources/scripts/scrapers). Most of them are written in Python 3, so their execution line is similar to `python "script.py"`. Make sure to examine each script for more information on how to use it.
+## 3rd-party Tools
 Third-party tools for scraping made to work with RSS Guard:
 * [CSS2RSS](https://github.com/Owyn/CSS2RSS) - can be used to scrape websites with CSS selectors.
 * [RSSGuardHelper](https://github.com/pipiscrew/RSSGuardHelper) - another CSS selectors helper.


@@ -175,45 +175,166 @@ QString TextFactory::capitalizeFirstLetter(const QString& sts) {
   }
 }
-QStringList TextFactory::tokenizeProcessArguments(QStringView command) {
+enum class TokenState {
+  // We are not inside argument, we are between arguments.
+  Normal,
+  // We have detected escape "\" character coming from double-quoted argument.
+  EscapedFromDoubleQuotes,
+  // We have detected escape "\" character coming from spaced argument.
+  EscapedFromSpaced,
+  // We are inside argument which was separated by spaces.
+  InsideArgSpaced,
+  // We are inside double-quoted argument.
+  InsideArgDoubleQuotes,
+  // We are inside argument, do not evaluate anything, just take it all
+  // as raw text.
+  InsideArgSingleQuotes
+};
+QStringList TextFactory::tokenizeProcessArguments(const QString& command) {
+  // Each argument containing spaces must be enclosed with single '' or double "" quotes.
+  // Some characters must be escaped with \ to keep their textual values as
+  // long as double-quoted argument is used.
+  if (command.isEmpty()) {
+    return {};
+  }
+  // We append space to end of command to make sure that
+  // ending space-separated argument is processed.
+  QString my_command = command + u' ';
+  TokenState state = TokenState::Normal;
   QStringList args;
-  QString tmp;
-  int quote_count = 0;
-  bool in_quote = false;
-  for (int i = 0; i < command.size(); ++i) {
-    if (command.at(i) == QL1C('"')) {
-      ++quote_count;
-      if (quote_count == 3) {
-        quote_count = 0;
-        tmp += command.at(i);
+  QString arg;
+  for (QChar chr : my_command) {
+    switch (state) {
+      case TokenState::Normal: {
+        switch (chr.unicode()) {
+          case u'"':
+            // We start double-quoted argument.
+            state = TokenState::InsideArgDoubleQuotes;
+            continue;
+          case u'\'':
+            // We start single-quoted argument.
+            state = TokenState::InsideArgSingleQuotes;
+            continue;
+          case u' ':
+            // Whitespace, just go on.
+            continue;
+          default:
+            // We found some actual text which marks
+            // beginning of argument, we assume spaced argument.
+            arg.append(chr);
+            state = TokenState::InsideArgSpaced;
+            continue;
+        }
+        break;
       }
-      continue;
-    }
-    if (quote_count) {
-      if (quote_count == 1) {
-        in_quote = !in_quote;
+      case TokenState::EscapedFromDoubleQuotes: {
+        // Previous character was "\".
+        arg.append(chr);
+        state = TokenState::InsideArgDoubleQuotes;
+        break;
       }
-      quote_count = 0;
-    }
-    if (!in_quote && command.at(i).isSpace()) {
-      if (!tmp.isEmpty()) {
-        args += tmp;
-        tmp.clear();
+      case TokenState::EscapedFromSpaced: {
+        // Previous character was "\".
+        arg.append(chr);
+        state = TokenState::InsideArgSpaced;
+        break;
+      }
+      case TokenState::InsideArgSpaced: {
+        switch (chr.unicode()) {
+          case u'\\':
+            // We found escape character.
+            state = TokenState::EscapedFromSpaced;
+            continue;
+          case u' ':
+            // We need to end this argument.
+            args.append(arg);
+            arg.clear();
+            state = TokenState::Normal;
+            continue;
+          default:
+            arg.append(chr);
+            break;
+        }
+        break;
+      }
+      case TokenState::InsideArgDoubleQuotes: {
+        switch (chr.unicode()) {
+          case u'\\':
+            // We found escape character.
+            state = TokenState::EscapedFromDoubleQuotes;
+            continue;
+          case u'"':
+            // We need to end this argument.
+            args.append(arg);
+            arg.clear();
+            state = TokenState::Normal;
+            continue;
+          default:
+            arg.append(chr);
+            break;
+        }
+        break;
+      }
+      case TokenState::InsideArgSingleQuotes: {
+        switch (chr.unicode()) {
+          case u'\'':
+            // We need to end this argument.
+            args.append(arg);
+            arg.clear();
+            state = TokenState::Normal;
+            continue;
+          default:
+            arg.append(chr);
+            break;
+        }
+        break;
       }
-    }
-    else {
-      tmp += command.at(i);
     }
   }
-  if (!tmp.isEmpty()) {
-    args += tmp;
+  switch (state) {
+    case TokenState::EscapedFromSpaced:
+    case TokenState::EscapedFromDoubleQuotes:
+      throw ApplicationException(QObject::tr("escape sequence not completed"));
+      break;
+    case TokenState::InsideArgDoubleQuotes:
+      throw ApplicationException(QObject::tr("closing \" is missing"));
+      break;
+    case TokenState::InsideArgSingleQuotes:
+      throw ApplicationException(QObject::tr("closing ' is missing"));
+      break;
+    default:
+      break;
   }
   return args;
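
To try the quoting rules from this hunk outside of Qt, the state machine can be approximated in standard C++. This is only a sketch under stated assumptions: `tokenize`, `State` and the use of `std::runtime_error` are hypothetical stand-ins for `TextFactory::tokenizeProcessArguments`, `TokenState` and `ApplicationException`, and any unfinished end state is reported as an error.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical standalone approximation of the tokenizer, not the RSS Guard code.
enum class State { Normal, Spaced, DoubleQuoted, SingleQuoted, EscapedSpaced, EscapedDoubleQuoted };

std::vector<std::string> tokenize(const std::string& command) {
  std::vector<std::string> args;
  std::string arg;
  State state = State::Normal;

  // Stores the finished argument and goes back to the "between arguments" state.
  auto finish = [&]() {
    args.push_back(arg);
    arg.clear();
    state = State::Normal;
  };

  // A trailing space makes sure the last space-separated argument is flushed.
  for (char chr : command + ' ') {
    switch (state) {
      case State::Normal:
        if (chr == '"') { state = State::DoubleQuoted; }
        else if (chr == '\'') { state = State::SingleQuoted; }
        else if (chr != ' ') { arg += chr; state = State::Spaced; }
        break;
      case State::Spaced:
        if (chr == '\\') { state = State::EscapedSpaced; }
        else if (chr == ' ') { finish(); }
        else { arg += chr; }
        break;
      case State::DoubleQuoted:
        if (chr == '\\') { state = State::EscapedDoubleQuoted; }
        else if (chr == '"') { finish(); }
        else { arg += chr; }
        break;
      case State::SingleQuoted:
        // No escapes here: everything up to the closing quote is raw text.
        if (chr == '\'') { finish(); }
        else { arg += chr; }
        break;
      case State::EscapedSpaced:
        arg += chr;
        state = State::Spaced;
        break;
      case State::EscapedDoubleQuoted:
        arg += chr;
        state = State::DoubleQuoted;
        break;
    }
  }

  if (state == State::DoubleQuoted) { throw std::runtime_error("closing \" is missing"); }
  if (state == State::SingleQuoted) { throw std::runtime_error("closing ' is missing"); }
  if (state != State::Normal) { throw std::runtime_error("escape sequence not completed"); }
  return args;
}
```

With this sketch, the first documented example line splits into five arguments, and the doubled backslashes in the executable path collapse to single ones. A single backslash in an unquoted argument is consumed as an escape character, so `a\b` becomes `ab`; inside single quotes a backslash is kept as-is.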


@@ -36,7 +36,7 @@ class TextFactory {
   static QString decrypt(const QString& text, quint64 key = 0);
   static QString newline();
   static QString capitalizeFirstLetter(const QString& sts);
-  static QStringList tokenizeProcessArguments(QStringView args);
+  static QStringList tokenizeProcessArguments(const QString& command);
   // Shortens input string according to given length limit.
   static QString shorten(const QString& input, int text_length_limit = TEXT_TITLE_LIMIT);


@@ -7,6 +7,7 @@
 #include "exceptions/networkexception.h"
 #include "exceptions/scriptexception.h"
 #include "miscellaneous/iconfactory.h"
+#include "miscellaneous/textfactory.h"
 #include "network-web/networkfactory.h"
 #include "services/abstract/category.h"
 #include "services/standard/definitions.h"
@@ -260,11 +261,12 @@ void StandardFeedDetails::onUrlChanged(const QString& new_url) {
     }
   }
   else if (sourceType() == StandardFeed::SourceType::Script) {
-    if (new_url.simplified().isEmpty()) {
-      m_ui.m_txtSource->setStatus(LineEditWithStatus::StatusType::Error, tr("The source is empty."));
+    try {
+      TextFactory::tokenizeProcessArguments(new_url);
+      m_ui.m_txtSource->setStatus(LineEditWithStatus::StatusType::Ok, tr("Source is ok."));
     }
-    else {
-      m_ui.m_txtSource->setStatus(LineEditWithStatus::StatusType::Ok, tr("The source is ok."));
+    catch (const ApplicationException& ex) {
+      m_ui.m_txtSource->setStatus(LineEditWithStatus::StatusType::Error, tr("Error: %1").arg(ex.message()));
     }
   }
   else {
@@ -273,11 +275,12 @@ void StandardFeedDetails::onUrlChanged(const QString& new_url) {
 }
 void StandardFeedDetails::onPostProcessScriptChanged(const QString& new_pp) {
-  if (QRegularExpression(QSL(SCRIPT_SOURCE_TYPE_REGEXP)).match(new_pp).hasMatch() || !new_pp.simplified().isEmpty()) {
+  try {
+    TextFactory::tokenizeProcessArguments(new_pp);
     m_ui.m_txtPostProcessScript->setStatus(LineEditWithStatus::StatusType::Ok, tr("Command is ok."));
   }
-  else {
-    m_ui.m_txtPostProcessScript->setStatus(LineEditWithStatus::StatusType::Ok, tr("Command is empty."));
+  catch (const ApplicationException& ex) {
+    m_ui.m_txtPostProcessScript->setStatus(LineEditWithStatus::StatusType::Error, tr("Error: %1").arg(ex.message()));
   }
 }
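
The validation pattern in the two handlers above (run the tokenizer, discard its result, and turn a thrown exception into an error status) can be sketched without Qt. Everything named here is an assumption for illustration: `ValidationStatus`, `Tokenizer` and `validateCommand` are hypothetical, and `std::exception` stands in for `ApplicationException`.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type standing in for the status and text shown in the line edit.
struct ValidationStatus {
  bool ok;
  std::string text;
};

// Any callable that splits a command line and throws on malformed input.
using Tokenizer = std::function<std::vector<std::string>(const std::string&)>;

ValidationStatus validateCommand(const std::string& command, const Tokenizer& tokenizer) {
  try {
    // The tokens are discarded; only success or failure matters here.
    tokenizer(command);
    return {true, "Command is ok."};
  }
  catch (const std::exception& ex) {
    // The tokenizer's message is surfaced to the user as-is.
    return {false, std::string("Error: ") + ex.what()};
  }
}
```

One consequence visible in the diff: the tokenizer returns an empty list for an empty command instead of throwing, so with this pattern an empty source or post-process line validates as ok.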