searx/searx/engines/digg.py

# SPDX-License-Identifier: AGPL-3.0-or-later
"""
 Digg (News, Social media)
"""
# pylint: disable=missing-function-docstring

from urllib.parse import urlencode
from datetime import datetime

from lxml import html
from searx.utils import eval_xpath, extract_text

# about
about = {
    "website": 'https://digg.com',
    "wikidata_id": 'Q270478',
    "official_api_documentation": None,
    "use_official_api": False,
    "require_api_key": False,
    "results": 'HTML',
}

# engine dependent config
categories = ['news', 'social media']
paging = True
base_url = 'https://digg.com'
results_per_page = 10

# search-url
search_url = base_url + (
    '/search'
    '?{query}'
    '&size={size}'
    '&offset={offset}'
)

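def request(query, params):
    # The search URL takes an item offset rather than a page number: page 1
    # maps to offset 1, page 2 to offset 11 (with results_per_page = 10).
    offset = (params['pageno'] - 1) * results_per_page + 1
    params['url'] = search_url.format(
        query=urlencode({'q': query}),
        size=results_per_page,
        offset=offset,
    )
    return params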

def response(resp):
    results = []

    dom = html.fromstring(resp.text)

    results_list = eval_xpath(dom, '//section[contains(@class, "search-results")]')

    for result in results_list:
        # Use './/' so each query is scoped to this section; a leading '//'
        # would search the whole document even when evaluated on `result`.
        titles = eval_xpath(result, './/article//header//h2')
        contents = eval_xpath(result, './/article//p')
        urls = eval_xpath(result, './/header/a/@href')
        published_dates = eval_xpath(result, './/article/div/div/time/@datetime')

        # The four lists are paired by position; zip() stops at the shortest one.
        for (title, content, url, published_date) in zip(titles, contents, urls, published_dates):
            results.append({
                'url': url,
                'publishedDate': datetime.strptime(published_date, '%Y-%m-%dT%H:%M:%SZ'),
                'title': extract_text(title),
                'content': extract_text(content),
            })

    return results