You may be interested in this proposal in the Community Wishlist: Physical Wikimedia Commons media dumps (for backups, AI models, more metadata).
Talk:Wikimedia Enterprise/Flow
Whatever credibility signals you end up packaging for this purpose, it would be great to have a Beta widget that shows those somehow to readers of articles.
I recognize the raw data is out there -- lots of data that could inform credibility is! -- but it's not conveniently cached per page, so looking up daily readership, the ORES trust score of the latest edit, &c. each means a separate call via a separate interface with its own lag time. I don't want to do that for each article I visit, and even experienced readers generally won't know to.
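To illustrate the point above, here is a minimal sketch of what "separate calls via separate interfaces" looks like today: the public Wikimedia REST pageviews endpoint and the ORES scoring endpoint live on different hosts with different response shapes. The endpoints are the public ones as of this discussion; exact paths may change.

```python
# Sketch (not the Enterprise API): assembling per-page credibility signals
# today means hitting separate services, each with its own lag time.
import urllib.parse

PAGEVIEWS = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
             "per-article/{project}/all-access/all-agents/"
             "{title}/daily/{start}/{end}")
ORES = ("https://ores.wikimedia.org/v3/scores/{wiki}/"
        "?models=articlequality&revids={rev}")

def pageviews_url(project, title, start, end):
    """Daily readership for one article (one request, one lag time)."""
    return PAGEVIEWS.format(project=project,
                            title=urllib.parse.quote(title, safe=""),
                            start=start, end=end)

def ores_url(wiki, rev_id):
    """Quality score of the latest revision (a second, separate request)."""
    return ORES.format(wiki=wiki, rev=rev_id)

# Two hosts, two response formats -- nothing cached per page.
print(pageviews_url("en.wikipedia", "Ada Lovelace", "20210101", "20210131"))
print(ores_url("enwiki", 1000000000))
```

A single cached "credibility" view per page would collapse these round-trips into one lookup, which is the convenience being asked for.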
Hi @Sj - just noting that we've seen this comment and @RBrounley (WMF) will be getting back to you soon.
I like this idea @Sj and I think that as we combine this data together into a single view, this could potentially be a very strong signal for readers.
However, just to set expectations: I think Enterprise APIs will make this easy to create on the projects, but I'd want to talk with some of the teams closer to those actual project interfaces before actually creating this type of widget.
Initial versions of credibility signals will live in beta to give us some good cycles to really figure out what is an effective means to tell the story of "credibility" in raw data; it's something we're excited to start to unpack. But when it gets there, the projects are more than welcome to use what we have built, of course.
I would certainly trust your data viz expertise ;) And to be clear I'm only talking about 'beta' use for editors at first, as in the list of beta features.
I'd ideally like to see a progression something like this:
- Someone building "credibility signals" library makes (or commissions, or asks for help with) a gadget showcasing them --> the ideal audience would be people who care about making cred signals effective, and community devs/designers already building tools for editors. It's useful for people building the library to connect with the people telling credibility stories from imperfect data today, with almost no comprehensive data, on the projects. [people maintaining CiteUnseen, RSN + RS:P, other widgets].
- As you design and promote 'usable signals' to E users, have parallel conversations w/ community users
- This gets on the long tail of 'potential beta features' b/c of its obvious potential widespread benefits
Warmly!
Will there be publicly available Grafana dashboards showing how much the Enterprise API services are used? I think it would be interesting to understand the amount of data that is transferred through the API services.
Dear @Hogü-456 - the short answer is that there will not be such a product initially, but we hope to provide a simplified equivalent in the future.
One key factor is that the usage data would need to be 'aggregated' across all the customers/users - to suitably anonymise the individual uses - and therefore we would need a statistically significant number of customers. Equally, I mention 'simplified' as the Wikimedia Grafana dashboard is extremely detailed and specialised. I don't want to over-promise what might be technically or legally possible to do with publishing commercial clients' data (as I'm neither a lawyer nor a network engineer!).

Finally, as we're building this from scratch, the public dashboard concept is interesting but not critical infrastructure that is technologically required for us to launch. I actually think that such a tool would be useful both for general curiosity/transparency and also as a marketing tool: to show potential future customers the scale of the service already in operation! So, not an immediate technical priority, but something we should be investigating in the medium term.
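The aggregation constraint described above can be sketched in a few lines: only publish a usage metric once enough distinct customers contribute to it, so no single client's traffic is identifiable. This is purely illustrative - the threshold, field names, and function are invented here and do not reflect any actual Enterprise design.

```python
# Hypothetical sketch of publishing aggregated usage stats only when
# enough customers contribute to make the total anonymous.
def publishable_total(per_customer_bytes, min_customers=10):
    """Return total bytes served, or None if too few customers to anonymise.

    per_customer_bytes: dict mapping customer id -> bytes transferred.
    min_customers: invented threshold; a real one would need legal review.
    """
    if len(per_customer_bytes) < min_customers:
        # Publishing now would effectively expose individual clients.
        return None
    return sum(per_customer_bytes.values())

print(publishable_total({"a": 100, "b": 200}))               # None (too few)
print(publishable_total({f"c{i}": 10 for i in range(12)}))   # 120
```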
I hope this answers your question.
I don’t know if there would be a market for it, but I imagine some customers might want to receive certain subsets of data in their feed/dump, for example:
- just the lead paragraphs before the first headings
- only the structured (unparsed) wikitext from infoboxes (cf. DBpedia)
- delayed but cleaner feed, e.g. only revisions that have not been reverted or undone for x days
- only particular types of pages, e.g. English Wikipedia has articles, lists, disambiguation, and redirects all within the same namespace
- just the TOCs and hatnotes that describe structure rather than content
Or maybe nobody wants this, especially if our existing downloaders are already geared toward receiving big dumps and running their own parse-and-filter processes?
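The subset ideas above amount to predicates over a revision record. Here is a hypothetical sketch of two of them ("delayed but cleaner" and "only particular page types"); the field names (`namespace`, `is_redirect`, `reverted_at`, etc.) are invented for illustration and do not reflect any actual Enterprise schema.

```python
# Hypothetical feed filters over an imagined revision record shape.
from datetime import datetime, timedelta

def is_stable(rev, now, days=7):
    """'Delayed but cleaner' feed: keep revisions unreverted for `days`."""
    age = now - rev["saved_at"]
    return rev.get("reverted_at") is None and age >= timedelta(days=days)

def is_article(rev):
    """Only 'real' articles: mainspace, not a redirect or disambiguation."""
    return (rev["namespace"] == 0
            and not rev["is_redirect"]
            and not rev["is_disambiguation"])

now = datetime(2021, 6, 1)
feed = [
    {"title": "A", "namespace": 0, "is_redirect": False,
     "is_disambiguation": False, "saved_at": datetime(2021, 5, 1),
     "reverted_at": None},
    {"title": "B", "namespace": 0, "is_redirect": True,
     "is_disambiguation": False, "saved_at": datetime(2021, 5, 30),
     "reverted_at": datetime(2021, 5, 31)},
]
stable_articles = [r["title"] for r in feed
                   if is_article(r) and is_stable(r, now)]
print(stable_articles)  # ['A']
```

The question raised above still stands: whether it is cheaper to run filters like these server-side, or to leave existing downloaders to their own parse-and-filter pipelines.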
@Pelagic I think this is a really good point. In the long term we are definitely thinking about how we can provide parsed content like this to reusers (both internal and external). One of the aspects that we have identified in our very early community research interviews is that if we are to undertake this work, there comes with it a responsibility to democratize our data for organizations without the resources of the largest users. We should be about leveling the playing field, not reinforcing monopolies, and about helping to encourage a competitive and healthy internet. It's not just startups or alternatives to the internet giants that we are considering, but also universities and university researchers; archives and archivists; and non-profits like the Internet Archive. We are a long way off from that, but it's definitely within our sights for the future.
I hope everything you are going to develop will be free to use inside Wikimedia projects and tools.
In Telegram chat @Seddon (WMF) said "The team is also working with Wikimedia Technical Engagement to add free community support through cloud services by June 2021. In the interim, access can be provided on a per-request basis." I'll let Seddon elaborate.
Addendum: Found a link to that here: m:Wikimedia_Enterprise/Essay#Free_access_for_some_users
FYI - what @Fuzheado said here is correct. As an update - you can follow progress on this ticket
Why were external contractors using open source technologies that are not well supported by Wikimedia's stack/know-how, given that there are many open source alternatives that we can support much better? While Redis and Postgres are good options for projects starting from zero, and the only options for some uses (e.g. PostGIS), there are technologies the Foundation knows better, or has worked for years to eliminate from our infrastructure (Redis), mostly to decrease the proliferation of technologies that serve roughly the same use cases.
Why were the contractors not told to use alternative stacks that are well known and well supported by the Wikimedia employees, who will eventually have to support the stack anyway?
I know they are open source, and they are great tools. Using something like S3 instead of OpenStack Swift I understand, and it won't be as problematic to change if needed. But Postgres and Redis won't be easy to change in an existing application (as our own years of migration showed, T212129), and this goes completely against what the rest of the organization has been working toward for years, given there are obvious alternatives.
Are people aware that we will have to double our staff, services (monitoring, backups, support) and automation every time a new technology (no matter how good it is) is introduced? Were people in SRE, Security, Performance, etc. consulted about this?
If this is a small-sized project, then why was a new technology used, given any small sized one has lots of flexibility about underlying tech? If this is a large-sized project, why was a new technology used, given it will take lots of effort to migrate away to a supported tech?
Thanks for popping on here - you have good points and it’s definitely something we are considering as we expand the project further.
Redis/Postgres were chosen because we were starting from zero here and building this separately from the WMF stack on a different service. To add some transparency on the decision-making: the external contractors had proficiency in Redis/Postgres, which is why we went that direction - it was quick and efficient to start building. Representatives from SRE, Core Platform, Architecture, and Data Engineering have been in the loop and key members of our technical decision-making as part of our technical committee, which meets every six weeks. We also host office hours for WMF staff every two weeks, which you are encouraged to attend, and we have had plenty of good feedback there. As of now, this is still a prototype built around a potential business case that we're still exploring, and frankly we still maintain a lot of flexibility with what we have built - compared to the scale of MediaWiki, this is small potatoes still.
In terms of staffing, we don't anticipate doubling the staff to support this - the current plan is to add support staff as appropriate on Wikimedia Enterprise who work with WMF SRE in some capacity (we are still ironing out the exact details with leadership), as it is in the spirit of having this service remain separate from everything else but still in the Wikimedia orbit. But oversight is good and we've taken steps to get more of it - I personally welcome as much as possible. If you would like to join in, I'm happy to have your voice around our work; we can discuss more at the next office hours.
The documentation stated / I read between the lines that "we are using contractors for a first phase". Absolutely no issue with that. The worry is that, at some point, the contractors will go away and, as has happened many times in the past, the employees will have to handle the load. My concern is not "our stack is the best" (it is not! :-D), but "supporting a completely different parallel stack". Things like the alerting workflow, containers, backups, configuration management, tracking system, orchestration, security incident workflow, etc. that exist around the core development are often shared even between separate realms and teams such as wiki production, fundraising, analytics, cloud and office IT. This helps us work better and faster - even if it takes a bit more to start up - and there is always expertise somewhere to help you be productive.
Given "this is still a prototype", wouldn't it be wise to encourage them to use technologies similar to those of the rest of the organization (even if it is a separate organization) when possible, to reduce the overhead of technology proliferation and long-term maintenance costs? I am not asking them to use PHP, or to stop using S3, or to have a list of preapproved technologies - just to avoid technology and workflow overhead when possible, especially for things such as caching and relational databases, and especially the things I mentioned above (high-level workflows), before it is too late to change them. That is my only feedback.
In particular, both Redis and Postgres work nicely at small scale, but they tend not to scale well - operationally - at large scale (long term, with geographic redundancy and upgrade cycles).