
Design goals for a FOSS search engine:

- FOSS but centralized, don't bother with distributed crawling or federation or anything, at least not until we can prove that we can do it the easy way. Apply the SourceHut model.

- Don't blithely crawl the whole WWW. Tier 1 sites opt in to being crawled and are manually reviewed by a human to establish lack of shit. Backlinks from tier 1 sites are crawled to form a tier 2 graph. There is no tier 3. Searches prioritize tier 1 results. (De-emphasizes blogspam in favor of primary sources, entirely eliminates SEO gaming.)

- Attempt to index any data source (HTML, Gemtext, man pages, a CSV file) and return any results that either (1) the User-Agent can present, or (2) we have some code to render in a format the U-A can present, or (3) the user explicitly asked for.
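
A rough sketch of how the tier and format rules above might fit together. This is only an illustration: it reads "backlinks from tier 1 sites" as the sites that tier 1 sites link to, and every name in it (`TIER1`, `LINKS`, `RENDERERS`, `Hit`, the example domains and scores) is made up for the example.

```python
from dataclasses import dataclass

# Hypothetical data: sites that opted in and passed manual review.
TIER1 = {"example.org", "docs.example.net"}

# Hypothetical link graph: site -> sites it links to.
LINKS = {
    "example.org": {"blog.example.com", "docs.example.net"},
    "docs.example.net": {"man.example.io"},
    "blog.example.com": {"spam.example.biz"},
}

# Hypothetical converters: source media type -> types we can render it into.
RENDERERS = {
    "text/gemini": {"text/html"},
    "text/troff": {"text/html", "text/plain"},
}


@dataclass
class Hit:
    site: str
    media_type: str
    score: float  # relevance score from some ranking function (assumed)


def build_tiers(tier1, links):
    """Map each indexed site to its tier. Links are followed one hop only,
    so anything reachable solely through tier 2 is never indexed."""
    tier2 = set()
    for site in tier1:
        tier2 |= links.get(site, set())
    tiers = {site: 2 for site in tier2 - tier1}
    tiers.update({site: 1 for site in tier1})
    return tiers


def rank(hits, tiers):
    """Drop unindexed sites, then sort tier 1 first and by score within a tier."""
    indexed = [h for h in hits if h.site in tiers]
    return sorted(indexed, key=lambda h: (tiers[h.site], -h.score))


def presentable(hit, accepted, requested=frozenset()):
    """Apply rules (1)-(3): the U-A accepts it, we can render it, or it was asked for."""
    if hit.media_type in accepted:
        return True
    if RENDERERS.get(hit.media_type, set()) & accepted:
        return True
    return hit.media_type in requested


tiers = build_tiers(TIER1, LINKS)
hits = [
    Hit("blog.example.com", "text/html", 0.9),
    Hit("example.org", "text/html", 0.5),
    Hit("spam.example.biz", "text/html", 1.0),
    Hit("man.example.io", "text/troff", 0.7),
]
results = [h for h in rank(hits, tiers) if presentable(h, {"text/html"})]
print([h.site for h in results])
```

Note that the highest-scoring hit never shows up, because it is only linked from a tier 2 site and there is no tier 3.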

Problem: who's going to pay for it? Answer: probably SourceHut if we ever get to, say, 10x our current revenue.

"Why centralized?"

1. Decentralized search engines have already been shown not to work.

2. Centralized is more user-friendly and more likely to catch on with noobs.

3. Gives a central authority to hold accountable for things like curbing spam and abuse.

FOSS but centralized still brings some benefits:

- You can send a patch
- A third party with sufficient resources can stand up their own engine based on the code

And such a project would encourage embracing open standards, open data, and generally making it so that you can get out of it what you need from it, even if it is centralized.

Another bonus: ideally you could just stand up a little crawler which indexes a few topics you care about and make a domain-specific search engine pretty easily.

@qorg11 searx is a meta search engine, which does not count.

@qorg11 I might rename it though, Hidden Isle sounds much cooler

@baobab @qorg11 i don't think @sir would like the ugly interface littleisle has

@sir Not what I consider "proof" though ...

@ck by all means go make your decentralized search engine, good luck to you.

@sir To be fair, I agree that "decentralized" in the sense that yacy tried to apply it is a dead end. Too hard for ((non-)tech-)users to run and use.
Building it on a federated basis, with a large enough number of tech-savvy / idealistic volunteers and/or organizations running nodes (both crawlers and index) on the network, is an approach that I believe might work. I applied to @PrototypeFund last round to get initial funding, but no luck :(

@ck @PrototypeFund a federated search engine would not work very well. Some or all of these problems would apply:

A federated index would have to consult at least *most* of the instances for *every* search.

Searches would have to wait for responses from the rest of the federation.

Searches would "finish" immediately, Yacy style, and the results would re-order or jump around while waiting.

Many forms of abuse would be trivial to implement.

A novel sharding and crawling algorithm would need to be figured out to distribute searches democratically and be tolerant of entities leaving/entering/being cut off from a subset of the rest at any time.
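
To put rough numbers on the "consult most instances" and "wait for responses" points, here is a small sketch. It assumes a search fans out to every instance in parallel and cannot finish until some fraction of them have answered. The function name and the latency figures are invented for illustration, not measured from any real federation.

```python
import math


def time_to_quorum(latencies_ms, fraction):
    """Time until `fraction` of the instances have answered a parallel fan-out."""
    if not latencies_ms:
        raise ValueError("need at least one instance")
    needed = max(1, math.ceil(fraction * len(latencies_ms)))
    # With parallel requests, the wait is set by the slowest instance we still need.
    return sorted(latencies_ms)[needed - 1]


# Hypothetical per-instance response times in milliseconds; one instance is struggling.
latencies = [40, 55, 60, 80, 120, 300, 2500]

half = time_to_quorum(latencies, 0.5)
most = time_to_quorum(latencies, 0.9)
print(half, most)
```

Under these made-up numbers, waiting for half of the instances is cheap, but waiting for "most" of them means every search is held up by the slowest one.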

@sir Neat idea, something I think many of us have thought about once Google results began deteriorating in quality.

And the first thing I tried resulted in this... meta.stackexchange.com/questio

@sir I’ve been thinking about this recently too, was going to hack something together but haven’t had the time yet

@sir jivesearch was promising (can use Yandex but also has its own crawler based on colly) but the project has been dead for a while. It’s where I was going to start anyway

@sir doing my part to contribute to SourceHut revenue :). One of the two services I subscribe to that I really enjoy.

@sir so where is the project where I can contribute?

@sir I was just thinking about that this week lol. I came to the very same conclusion you did. While doing some research, I found this about Cloudflare and other big corps: gigablast.com/blog.html#anti
