GoToSocial/internal/web/robots.go

// GoToSocial
// Copyright (C) GoToSocial Authors admin@gotosocial.org
// SPDX-License-Identifier: AGPL-3.0-or-later
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program.  If not, see <http://www.gnu.org/licenses/>.

package web

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

const (
	robotsPath          = "/robots.txt"
	robotsMetaAllowSome = "nofollow, noarchive, nositelinkssearchbox, max-image-preview:standard" // https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#robotsmeta
	robotsTxt           = `# GoToSocial robots.txt -- to edit, see internal/web/robots.go
# More info @ https://developers.google.com/search/docs/crawling-indexing/robots/intro

# Before we commence, a giant fuck you to ChatGPT in particular.
# https://platform.openai.com/docs/gptbot
User-agent: GPTBot
Disallow: /

# As of September 2023, GPTBot and ChatGPT-User are equivalent. But there's no telling
# when OpenAI might decide to change that, so block this one too.
User-agent: ChatGPT-User
Disallow: /

# And a giant fuck you to Google Bard and their other generative AI ventures too.
# https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
User-agent: Google-Extended
Disallow: /

# Block CommonCrawl. Used in training LLMs and specifically GPT-3.
# https://commoncrawl.org/faq
User-agent: CCBot
Disallow: /

# Block Omgili/Webz.io, a "Big Web Data" engine.
# https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
User-agent: Omgilibot
Disallow: /

# Block FacebookBot, because Meta.
# https://developers.facebook.com/docs/sharing/bot
User-agent: FacebookBot
Disallow: /

# Well-known.dev crawler. Indexes stuff under /.well-known.
# https://well-known.dev/about/
User-agent: WellKnownBot
Disallow: /

# Rules for everything else.
User-agent: *
Crawl-delay: 500

# API endpoints.
Disallow: /api/

# Auth/login endpoints.
Disallow: /auth/
Disallow: /oauth/
Disallow: /check_your_email
Disallow: /wait_for_approval
Disallow: /account_disabled

# Well-known endpoints.
Disallow: /.well-known/

# Fileserver/media.
Disallow: /fileserver/

# Fedi S2S API endpoints.
Disallow: /users/
Disallow: /emoji/

# Settings panels.
Disallow: /admin
Disallow: /user
Disallow: /settings/

# Domain blocklist.
Disallow: /about/suspended`
)

// robotsGETHandler serves a static robots.txt that blocks known AI crawlers
// outright and prevents crawling of the API, auth pages, settings pages, etc.
//
// More granular robots meta tags are then applied for web pages
// depending on user preferences (see internal/web).
func (m *Module) robotsGETHandler(c *gin.Context) {
	c.String(http.StatusOK, robotsTxt)
}