Design goals for a FOSS search engine:
- FOSS but centralized, don't bother with distributed crawling or federation or anything, at least not until we can prove that we can do it the easy way. Apply the SourceHut model.
- Don't blithely crawl the whole WWW. Tier 1 sites opt-in to being crawled and are manually reviewed by a human to establish lack of shit. Backlinks from tier 1 sites are crawled to form a tier 2 graph. There is no tier 3. Searches prioritize tier 1 results. (De-emphasizes blogspam in favor of primary sources, entirely eliminates SEO gaming).
- Attempt to index any data source (HTML, Gemtext, man pages, a CSV file) and return any results that either (1) the User-Agent can present, or (2) we have some code to render in a format the U-A can present, or (3) the user explicitly asked for.
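A rough sketch of how the tiering and format rules above might combine at query time, in Python. Everything named here (Result, RENDERERS, presentable, rank, the example media types and URLs) is made up for illustration, and strict tier-then-score ordering is only one assumed reading of "prioritize tier 1 results".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    url: str
    tier: int         # 1 = opted in and human-reviewed, 2 = linked from a tier 1 site
    media_type: str   # e.g. "text/html", "text/gemini", "text/csv"
    score: float      # relevance score from the index, higher is better


# Hypothetical conversion table: source media type -> types we could render it as.
RENDERERS = {
    "text/gemini": {"text/html"},
    "text/csv": {"text/html"},
}


def presentable(result, accepted, requested=frozenset()):
    """True if the result passes rule (1), (2), or (3) from the list above."""
    if result.media_type in accepted:                        # (1) U-A can present it
        return True
    if RENDERERS.get(result.media_type, set()) & accepted:   # (2) we can convert it
        return True
    return result.media_type in requested                    # (3) user asked for it


def rank(results, accepted, requested=frozenset()):
    """Drop unpresentable results, then sort tier 1 ahead of tier 2.

    Assumption: tier always wins over relevance; score only orders
    results within the same tier.
    """
    shown = [r for r in results if presentable(r, accepted, requested)]
    return sorted(shown, key=lambda r: (r.tier, -r.score))
```

For a browser that only accepts text/html, a tier 1 Gemtext page would come back (rendered to HTML) ahead of a higher-scoring tier 2 blog post, and a man page in raw troff would be hidden unless the user asked for that format.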
Problem: who's going to pay for it? Answer: probably SourceHut if we ever get to, say, 10x our current revenue.
1. Decentralized search engines have already been shown not to work.
2. Centralized is more user-friendly and more likely to catch on with noobs.
3. Gives a central authority to hold accountable for things like curbing spam and abuse.
FOSS but centralized still brings some benefits:
- You can send a patch
- A third party with sufficient resources can stand up their own engine based on the code
And such a project would encourage embracing open standards, open data, and generally making it so that you can get out of it what you need from it, even if it is centralized.
@qorg11 searx is a meta search engine, which does not count.
@qorg11 yacy is decentralized and broken
@sir works for me
@qorg11 doesn't work for anyone else lol
@sir legwork.i2p uses yacy and works lol
@qorg11 I use a Hidden Isle, which is heavily based on searx (^_^) http://www.qpyyki5ivitrlzsw5qhhpahwakuuslegzkq5c6hhl6y6nld4s45yliqd.onion/
@qorg11 Meant to say Little Isle, rip
@qorg11 I might rename it though, Hidden Isle sounds much cooler
Do you have any sources for point 1?
@ck see yacy
@sir Not what I consider "proof" though ...
@ck by all means go make your decentralized search engine, good luck to you.
@sir To be fair, I agree that "decentralized" in the sense that yacy tried to apply it is a dead end. Too hard for ((non-)tech-)users to run and use.
In the sense of building it on a federated basis, having a large enough number of tech-savvy / idealistic volunteers and/or organizations running nodes, both crawlers and indexes, on the network is an approach that I believe might work. I applied to @PrototypeFund last round to get initial funding, but no luck :(
A federated index would have to consult at least *most* of the instances for *every* search.
Either searches would have to wait for responses from the rest of the federation, or they would "finish" immediately, Yacy style, and the results would re-order or jump around as late responses arrive.
Many forms of abuse would be trivial to implement.
A novel sharding and crawling algorithm would need to be figured out to distribute searches democratically and to tolerate entities leaving, entering, or being cut off from a subset of the rest at any time.
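To make the waiting problem concrete, here is a minimal sketch of a fan-out search with a deadline, in Python. It is not how yacy or any real federation works; federated_search, the instance callables, and the (score, url) result shape are all assumptions for illustration. It takes the "wait, but only so long" route: anything that has not answered by the deadline is simply left out.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def federated_search(query, instances, deadline=0.25, limit=10):
    """Ask every instance at once and keep whatever arrives before the deadline.

    `instances` maps a hypothetical instance name to a callable that takes
    the query and returns a list of (score, url) pairs.
    Returns (merged_results, sorted_names_of_instances_that_answered).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(instances)))
    futures = {pool.submit(search, query): name
               for name, search in instances.items()}
    done, _late = wait(futures, timeout=deadline)
    pool.shutdown(wait=False)  # do not block the caller on stragglers

    best = {}       # url -> best score seen across instances
    answered = []
    for fut in done:
        if fut.exception() is not None:
            continue  # a failing instance contributes nothing
        answered.append(futures[fut])
        for score, url in fut.result():
            if score > best.get(url, float("-inf")):
                best[url] = score

    merged = sorted(((s, u) for u, s in best.items()), reverse=True)
    return merged[:limit], sorted(answered)
```

Note what this does not solve: the result set depends on which instances happened to be fast, every query still hits every instance, and nothing stops an instance from returning inflated scores for its own spam.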
@sir Neat idea, something I think many of us have thought about once Google results began deteriorating in quality.
And the first thing I tried resulted in this... https://meta.stackexchange.com/questions/180656/how-to-access-the-sitemap-xml-file-of-stackoverflow-com
@sir I’ve been thinking about this recently too, was going to hack something together but haven't had the time yet
@sir jivesearch was promising (can use yandex but also has its own crawler based on colly) but the project has been dead for a while, it’s where I was going to start anyway
@sir doing my part to contribute to SourceHut revenue :). One of the two services I subscribe to that I really enjoy.
@sir so where is the project where I can contribute?