
We really need a FOSS search engine with (and this is important) its own in-house, FOSS crawler

@fakefred @penguin42 federating a search engine would be pretty difficult, but I would be interested in seeing some research around community ownership of the crawled data.

@sir @fakefred I think that storage and access are the challenge - and trust, if it's federated: you don't want all searches to get redirected to porn sites or other vendors' sites.

@sir a search engine that searches only (independent) blogs would be great too

@simon There's wiby.me/, which searches only non-commercial websites with little-to-no CSS/JavaScript. It's not open source though.

@sir For a crawler other than Google or Bing there's Mojeek (not open source, sorry) or Yacy. I'm not sure if there are any others... Maybe there should be?

@sir #searx is a FOSS search engine & #yacy is a FOSS crawler, and they work together.

@selea @sir I just know that I've encountered some searx instances that source indexes from a yacy instance running on the same host. I've not installed it myself.
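
A minimal sketch of the hookup described here: querying a YaCy peer's JSON search API, the same kind of endpoint a searx engine pulls from. The endpoint and parameter names follow YaCy's yacysearch.json interface as I understand it, but treat them, and the localhost peer address, as assumptions to check against your own instance.

```python
# Sketch: query a local YaCy peer's JSON search API (the endpoint a
# searx-style frontend would consume). Host and parameters are assumptions.
import requests

YACY_BASE = "http://localhost:8090"  # hypothetical local YaCy peer

def yacy_search(query, limit=10):
    resp = requests.get(
        f"{YACY_BASE}/yacysearch.json",
        params={"query": query, "maximumRecords": limit, "resource": "local"},
        timeout=10,
    )
    resp.raise_for_status()
    # The JSON payload mirrors YaCy's OpenSearch output: channels -> items.
    items = resp.json()["channels"][0]["items"]
    return [(item.get("title"), item.get("link")) for item in items]

if __name__ == "__main__":
    for title, link in yacy_search("federated search"):
        print(title, "->", link)
```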

@aktivismoEstasMiaLuo @selea @sir

yacy.everdot.org/ defaults to only sourcing the global + a private yacy network, and searx.everdot.org/ includes a private yacy network by default.

One major problem with using the global yacy network is that you have to decide a cut-off for how long you want to wait for global results and drop slower servers, because some take minutes to respond. That's just too slow. Also, a patch is needed to sort results; the default is first come, first shown.
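
As an illustration of those two fixes (not YaCy's actual code): a sketch that queries several peers concurrently, drops any peer that misses a hard cut-off, and sorts the merged results by a relevance score instead of arrival order. The peer URLs and the per-item "ranking" field are assumptions.

```python
# Illustrative sketch of a hard cut-off for slow peers plus result sorting,
# instead of first come, first shown. Peer URLs and the "ranking" field
# are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor, wait

import requests

PEERS = [
    "http://peer-a.example:8090",
    "http://peer-b.example:8090",
    "http://peer-c.example:8090",
]
CUTOFF_SECONDS = 5  # how long we're willing to wait for remote peers

def query_peer(base_url, query):
    resp = requests.get(
        f"{base_url}/yacysearch.json",
        params={"query": query, "maximumRecords": 20},
        timeout=CUTOFF_SECONDS,  # per-request timeout also bounds cleanup time
    )
    resp.raise_for_status()
    return resp.json()["channels"][0]["items"]

def federated_search(query):
    results = []
    with ThreadPoolExecutor(max_workers=len(PEERS)) as pool:
        futures = [pool.submit(query_peer, peer, query) for peer in PEERS]
        done, not_done = wait(futures, timeout=CUTOFF_SECONDS)
        for future in not_done:
            future.cancel()  # too slow: drop this peer's results
        for future in done:
            try:
                results.extend(future.result())
            except Exception:
                pass  # unreachable or erroring peer: skip it
    # Best results first, instead of whatever order the peers answered in.
    results.sort(key=lambda item: float(item.get("ranking", 0)), reverse=True)
    return results
```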

@sir let's hope spider/ask.moe will free us from this search engine prison. I've been using qwant, which seems to make similar promises to ddg, but it's also not FOSS, which is a shame.

@sir yacy is pretty cool in this respect. I love the idea of it, though I think the implementation is strange. I do use it to index all the blogs I subscribe to.

But I agree there's very little in this part of the FOSS world. Most things just leverage existing search engines (e.g. searx)

@sir The Gigablast search engine published their source code to a git repository a while back, but it definitely needs an overhaul.

@sir I was literally just working on this! My use case is that I've contributed a lot on GitHub and I want to download all of the repos I've worked on... but I can't get a list of them.

Currently fighting with their GraphQL API, but I'd kill for a "give me a list of all repos where a commit is authored by me" search query.
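
For reference, a hedged sketch of that query against GitHub's GraphQL API: contributionsCollection with commitContributionsByRepository lists repos you've committed to, though each call only covers roughly a one-year window, so older work needs repeated calls with from/to dates. The login and token handling below are illustrative.

```python
# Sketch: list repositories with commits authored by a given user via
# GitHub's GraphQL API. Covers about one year per call; the login and
# GITHUB_TOKEN environment variable are illustrative.
import os

import requests

QUERY = """
query($login: String!) {
  user(login: $login) {
    contributionsCollection {
      commitContributionsByRepository(maxRepositories: 100) {
        repository { nameWithOwner url }
      }
    }
  }
}
"""

def repos_with_my_commits(login, token):
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"login": login}},
        headers={"Authorization": f"bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    coll = resp.json()["data"]["user"]["contributionsCollection"]
    return [c["repository"]["nameWithOwner"]
            for c in coll["commitContributionsByRepository"]]

if __name__ == "__main__":
    for repo in repos_with_my_commits("christianbundy", os.environ["GITHUB_TOKEN"]):
        print(repo)
```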

@christianbundy that's not what I meant. I meant a FOSS search engine for searching the web at large

@sir oh! I haven't looked into those in a while, last I saw I think YaCy was state-of-the-art. If you find anything (or build anything) I'd be happy to test.

@_1751015 where can I play with a search engine powered by this data?

@cuniculus @_1751015 tbh I don't think a distributed search engine is the right approach

@sir @_1751015

Yeah, since it requires loads of storage and fat bandwidth

@cuniculus @sir YaCy has some niche applications that are interesting. Check the writing here and the comments:
susa.net/wordpress/2020/05/per
A personal index of curated URLs, plus eventually sharing the index - IMO it has advantages over a general-purpose search engine.
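
A toy sketch of that "personal index of curated URLs" idea: fetch a hand-picked list of pages, build a small inverted index, and query it locally. The URLs, the crude tag stripping, and the AND-only search are placeholders; a real setup would lean on YaCy or a proper indexer.

```python
# Toy personal index: crawl a curated URL list, build an inverted index,
# and answer AND-queries locally. All URLs are placeholders.
import re
from collections import defaultdict

import requests

CURATED_URLS = [
    "https://example.com/",  # replace with your own curated list
    "https://example.org/",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(urls):
    index = defaultdict(set)  # term -> set of URLs containing it
    for url in urls:
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        for term in set(tokenize(text)):
            index[term].add(url)
    return index

def search(index, query):
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())  # AND semantics
    return hits

if __name__ == "__main__":
    idx = build_index(CURATED_URLS)
    print(search(idx, "example domain"))
```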

@sir I don't have information about a search engine using the Common Crawl data. They have a compiled list with references to various small projects that use the data:
commoncrawl.org/the-data/examp
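
One low-effort way to poke at that data without running a crawler is the CDX index API at index.commoncrawl.org; a sketch is below. The crawl label is just an example and changes with every new crawl, so check the site for current ones.

```python
# Sketch: look up captures of a URL pattern in a Common Crawl CDX index.
# The crawl label is an example; newer crawls have newer labels.
import json

import requests

CRAWL = "CC-MAIN-2020-24"  # example crawl label

def common_crawl_lookup(url_pattern):
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": url_pattern, "output": "json"},
        timeout=30,
    )
    if resp.status_code == 404:
        return []  # no captures matched the pattern
    resp.raise_for_status()
    # One JSON object per line, each describing one capture.
    return [json.loads(line) for line in resp.text.splitlines() if line]

if __name__ == "__main__":
    for record in common_crawl_lookup("example.com/*"):
        print(record.get("timestamp"), record.get("url"))
```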

@sir Even if you have a FOSS search engine with a FOSS crawler like what's running on yacy.everdot.org/, you'll quickly run into performance and economic issues. Going FOSS won't automatically bring in advertising revenue, and that's what Google/Bing/etc. actually do: they are advertising agencies, not search engines. That's how they afford thousands of servers. There's free software, but there's no such thing as free hardware.

@sir Agreed. But, once again: maybe this is not so much a F(L)OSS issue as an issue of handling a large, potentially decentralized/distributed search index at runtime, keeping things available, stable, and performant 24x7. Maybe, finally, a situation where we see that our current focus on code and code licensing is important but not *all* it takes to have working technology available? 🙂
