Onion Information

http://wgq3bd2kqoybhstp77i3wrzbfnsyd27wt34psaja4grqiezqircorkyd.onion/posts/2021/03/10/search-engines-with-own-indexes/

A look at search engines with their own indexes - Seirdy

A cursory review of all the non-metasearch, indexing search engines I have been able to find


Onion Content

Preface

This is a cursory review of all the indexing search engines I have been able to find.

The three dominant English search engines with their own indexes [note 1] are Google, Bing, and Yandex (GBY). Many alternatives to GBY exist, but almost none of them have their own results; instead, they just source their results from GBY.

With that in mind, I decided to test and catalog all the different indexing search engines I could find. I prioritized breadth over depth, and encourage readers to try the engines out themselves if they’d like more information.

This page is a “living document” that I plan on updating indefinitely. Check for updates once in a while if you find this page interesting. Feel free to send me suggestions, updates, and corrections; I’d especially appreciate help from those who speak languages besides English and can evaluate a non-English indexing search engine. Contact info is in the article footer.

I plan on updating the engines in the top two categories with more info comparing the structured/linked data the engines leverage (RDFa vocabularies, microdata, microformats, JSON-LD, etc.) to help authors determine which formats to use.

About the list

I discuss my motivation for making this page in the Rationale section. I primarily evaluated English-speaking search engines because that’s my primary language. With some difficulty, I could probably evaluate a Spanish one; however, I wasn’t able to find many Spanish-language engines powered by their own crawlers.

I mention details like “allows site submissions” and structured data support where I can, only to inform authors about their options, not as points in engines’ favor.

See the Methodology section at the bottom to learn how I evaluated each one.

General indexing search engines

Large indexes, good results

These are large engines that pass all my standard tests and more.

Google: The biggest index.
Allows submitting pages and sitemaps for crawling, and even supports WebSub to automate the process. Powers a few other engines: a former version of Startpage, possibly the most popular Google proxy (Startpage now uses Bing [note 2]); GMX Search, run by a popular German email provider; (discontinued) Runnaroo; Mullvad Leta; SAPO (Portuguese interface, can work with English results); DSearch; 13TABS; Zarebin (Persian, can return English results); and a host of other engines using Programmable Search Engine’s client-side scripts.

Bing: The runner-up. Allows submitting pages and sitemaps for crawling without login using the IndexNow API, sharing IndexNow page submissions with Yandex and Seznam. Its index powers many other engines: Yahoo (and its sibling engine, OneSearch); DuckDuckGo [note 3] (offers a Tor onion service, a JS-free version, and a TUI-browser-friendly “lite” version, making it a good way to use Bing anonymously); AOL; Qwant (partial) [note 4]; Ecosia; Ekoru; Privado; Findx; Disconnect Search [note 5]; PrivacyWall; Lilo; SearchScene; Peekier (not to be confused with Peekr, a metasearch engine with its own index); Oscobo; Million Short; Yippy search [note 6]; Lycos; Givero; Swisscows; Fireball; Netzzappen; You.com [note 7]. It also partially powers MetaGer by default; this can be turned off. At this point, I mostly stopped adding Bing-based search engines. There are just too many.

Yandex: Originally a Russian search engine, it now has an English version. Some Russian results bleed into its English site. It allows submitting pages and sitemaps for crawling using the IndexNow API, sharing IndexNow page submissions with Bing and Seznam. Powers: Epic Search (went paid-only as of June 2021); occasionally, DuckDuckGo’s link results instead of Bing (update: DuckDuckGo has “paused” its partnership with Yandex, confirmed in the Hearing on “Holding Big Tech Accountable: Legislation to Protect Online Users”); Petal, for Russian users only.

Mojeek: Seems privacy-oriented with a large index containing billions of pages.
Quality isn’t at GBY’s level, but it’s not bad either. If I had to use Mojeek as my default general search engine, I’d live. Partially powers eTools.ch. At this moment, I think that Mojeek is the best alternative to GBY for general search.

Google, Bing, and Yandex support structured data such as microformats1, microdata, RDFa, Open Graph markup, and JSON-LD. Yandex’s support for microformats1 is limited; for instance, it can parse h-card metadata for organizations but not people. Open Graph and Schema.org are the only supported vocabularies I’m aware of. Mojeek is evaluating structured data; it’s interested in Open Graph and Schema.org vocabularies.

Smaller indexes or less relevant results

These engines pass most of the tests listed in the “methodology” section. All of them seem relatively privacy-friendly. I wouldn’t recommend using these engines to find specific answers; they’re better for learning about a topic by finding interesting pages related to a set of keywords.

Stract: My favorite generalist engine on this page. Stract supports advanced ranking customization by allowing users to import “optics” files, like a better version of Brave’s “goggles” feature. Stract is fully open-source, with code released under an AGPL-3.0 license. The index isn’t massive, but it’s big enough to be a useful supplement to more major engines. Stract started with the Common Crawl index, but now uses its own crawler. Plans to add contextual ads and a subscription option for ad-free search. Discovered in my access logs.

Right Dao: Very fast, good results. Passes the tests fairly well. It plans on including query-based ads if/when its user base grows. [note 8] For the past few months, its index seems to have focused more on large, established sites rather than smaller, independent ones. It seems to be a bit lacking in more recent pages.

Alexandria: A pretty new “non-profit, ad free” engine, with freely-licensed code. Surprisingly good at finding recent pages.
Its index is built from the Common Crawl; it isn’t as big as Gigablast or Right Dao, but its ranking is great.

Yep: An ambitious engine from Ahrefs, an SEO/backlink-finder company, that “shares ad profit with creators and protects your privacy”. Most engines show results that include keywords from or related to the query; Yep also shows results linked by pages containing the query. In other words, not all results contain relevant keywords. This makes it excellent for less precise searches and discovery of “related sites”, especially with its index of hundreds of billions of pages. It’s far worse at finding very specific information or recent events for now, but it will probably improve. It was known as “FairSearch” before its official launch.

SeSe Engine: Although it’s a Chinese engine, its index seems to have a large-enough proportion of English content to fit here. The engine is open-source; see the SeSe back-end Python code and the SeSe-ui Vue-based front-end. It has surprisingly good results for such a low-budget project. Each result is annotated with detailed ranking metadata such as keyword relevance and backlink weight. Discovered in my access logs.

greppr: Its tagline is “Search the Internet with no filters, no tracking, no ads.” At the time of writing, it has over 3 million pages indexed. It’s surprisingly good at finding interesting new results for broad short-tail queries, if you’re willing to scroll far enough down the page. It appears to be good at finding recent pages.

Yep supports Open Graph and some JSON-LD at the moment. A look through the source code for Alexandria and Gigablast didn’t seem to reveal the use of any structured data. The surprising quality of results from SeSe and Right Dao seems influenced by the crawlers’ high-quality starting location: Wikipedia.

Smaller indexes, hit-and-miss

These engines fail badly at a few important tests.
Otherwise, they seem to work well enough for users who’d like some more serendipity in less-specific searches.

Peekr (formerly SvMetaSearch, not to be confused with Peekier): Originally a SearxNG metasearch engine that also included results from its own index, it’s since diverged. It now appears to return all results from its own growing ElasticSearch index. Open source, with an emphasis on self-hostability.

Infotiger: My favorite engine in this section. It offers advanced result filtering and sports a somewhat large index. It allows site submission for English and German pages. The fastest-improving engine in this section: I use it often to discover new sites, and look forward to the day it “graduates” to the previous section. Infotiger also has a Tor hidden service.

seekport: The interface is in German, but it supports searching in English just fine. The default language is selected by your locale. It’s really good considering its small index; it hasn’t heard of less common terms, but it’s able to find relevant results in other tests. It’s the second-fastest-improving engine in this section.

Exalead: Slow, quality is hit-and-miss. Its indexer claims to crawl the DMOZ directory, which has since shut down and been replaced by the Curlie directory. No relevant results for “Oppenheimer” and some other history-related queries. Allows submitting individual URLs for indexing, but requires solving a Google reCAPTCHA and entering an email address.

ExactSeek: Small index, disproportionately dominated by big sites. Failed multiple tests. Allows submitting individual URLs for crawling, but requires entering an email address and receiving a newsletter. Webmaster tools seem to heavily push for paid SEO options. It also powers SitesOnDisplay and Blog-search.com.

Burf.co: Very small index, but seems fine at ranking more relevant results higher. Allows site submission without any extra steps.

ChatNoir: An experimental engine by researchers that uses the Common Crawl index.
The engine is open source. See the announcement on the Common Crawl mailing list (Google Groups).

Secret Search Engine Labs: Very small index with very little SEO spam; it toes the line between a “search engine” and a “surf engine”. It’s best for reading about broad topics that would otherwise be dominated by SEO spam, thanks to its CashRank algorithm. Allows site submission.

Gabanza: A search engine from a hosting company. I found few details about the search engine itself, and the index was small, but it was suitable for discovering new pages related to short, broad queries.

Jambo: Docs, blog posts, etc. have not been updated since around 2006, but the engine continues to crawl and index new pages. Discovered in my access logs. Has a bias towards older content.

Fledgling engines

Results from these search engines don’t seem particularly relevant; indexes in this category tend to be small.

Yessle: Seems new; allows page submission by pasting a page into the search box. Index is really small, but it crawls new sites quickly. Claims to be private.

Bloopish: Ex...
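
For authors curious about the IndexNow submissions mentioned in the Bing and Yandex entries, here is a rough sketch of what building such a request might look like in Python. It is an illustration under stated assumptions, not an official client: the helper name build_indexnow_request, the example.com host, and the key value are all placeholders, and the sketch assumes the key file is hosted at the site root under the key's own name. It assumes the shared api.indexnow.org endpoint; the request is only constructed, never sent.

```python
import json
import urllib.request

# Assumption: the shared IndexNow endpoint. Individual engines also expose
# their own /indexnow paths, which are not used in this sketch.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls, endpoint=INDEXNOW_ENDPOINT):
    """Build (but do not send) a POST request announcing new or changed URLs.

    Hypothetical helper for illustration. `key` is a placeholder for a key
    the site owner generates and hosts as a text file on the same host.
    """
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file sits at the site root, named after the key.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


request = build_indexnow_request(
    host="example.com",
    key="0123456789abcdef",  # placeholder key, not a real one
    urls=["https://example.com/posts/new-article/"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending is left out on purpose; it would be: urllib.request.urlopen(request)
```

Since participating engines share IndexNow submissions with each other, a single request like this should in principle reach Bing, Yandex, and Seznam alike; check each engine's own documentation before relying on that.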
