Topic on Talk:Citoid/Archive 2

Confusing publisher and via parameters?

SMcCandlish (talkcontribs)

Someone mentioned to me that this tool is incorrectly outputting values like |publisher=Google Books, at least for en.wikipedia citation templates. If it's still doing that, it needs to be fixed ASAP to correctly use the |via= parameter for such intermediary distributors as Google Books, YouTube, Project Gutenberg, PubMed, JSTOR, etc. (if it hasn't been fixed in this regard already). I don't use VE, so I'm not even sure how to test this.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  04:55, 24 July 2016 (UTC)

Jc3s5h (talkcontribs)

Last I checked, Project Gutenberg retypes the text, they don't just scan it. So their books are new editions and they are the publisher. Citation templates don't provide any mechanism to show that an earlier edition was published by a different publisher.

SMcCandlish (talkcontribs)

Wikipedia would still not treat them as a publisher, and having tools automatically do so is misleading and wrong for our implementation of source citations. Project G. is a republisher, and that is what |via= is for, even if they did some hand cleanup of their OCR (and, yes, they do use OCR). It's no different from converting a book to PDF and then eBook format. That doesn't make you magically a new publisher, it just means you've done the work (including any after-automation cleanup) to format-shift something.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  18:50, 27 July 2016 (UTC)

Jc3s5h (talkcontribs)

I think the via parameter would imply that the republication is page-for-page and line-for-line identical to the original publication. Frequently in the past republications would be repaginated, so a passage that appeared on page 100 in the original might be on page 95 in the republication. I believe this was the case in the early days of Project G., although maybe not the more recent publications. Certainly an edition with different page numbering than the original should be treated as a new edition. If in doubt, the presumption should be it is a new edition, to avoid making a false claim about what page a passage occurs in the original (which the citing editor has never seen).

SMcCandlish (talkcontribs)

Well, it doesn't imply that. Electronic versions of documents are very often not "page-for-page and line-for-line identical" to the paper version, unless painstakingly made that way, usually in PDF form. If I write a book and release in PDF form through O'Reilly by special arrangement with them, and (within our licensing parameters) you use some tool to convert it to Kindle format, and this changes the layout in some ways, you don't get to claim to be my book's publisher. Doing so would actually reduce the apparent reliability of the source, since you're just some random person, not a well-known publisher. Per en:w:WP:SAYWHEREYOUGOTIT we do want a |via= parameter identifying that this is a copy from some intermediary source and not straight from the actual publisher.

Citing specific page numbers in e-documents is generally pointless unless they are in fact exact PDF scans; we have the |at= parameter to identify where in an electronic document the material can be found. E.g., I would use this to cite the online edition of the Chicago Manual of Style by section number, since it doesn't even have page numbers. Intelligent use of |at= allows people to find the same part in a paper edition, too.  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  22:31, 28 July 2016 (UTC)

Jc3s5h (talkcontribs)

SMcCandlish wrote: "If I write a book and release in PDF form through O'Reilly by special arrangement with them, and (within our licensing parameters) you use some tool to convert it to Kindle format, and this changes the layout in some ways, you don't get to claim to be my book's publisher."

Yes, I do. Whatever contractual agreements got put in place among you, me, and O'Reilly allows me to. Of course, if I'm violating copyright, my version shouldn't be cited at all. Or if it happens in the year 2200 and your copyright has expired, then I don't need anyone's permission to create a new edition.

Really no different than Bloomsbury being the original publisher of the Harry Potter books but Scholastic being the publisher for the English-language North American editions.

SMcCandlish (talkcontribs)

Taking a file and running a conversion program on it is nothing at all like Scholastic typesetting, designing a new cover for, creating new frontmatter for, printing, and distributing a NAm edition of a book originally published by Bloomsbury. I repeat: What you are talking about is nothing but format-shifting. It is no different from you posting a piece of digital art at DeviantArt, and me (pursuant to permissive licensing terms) putting a copy of it on my Facebook feed, which entails a new copy there, and a re-encoding, i.e. a format shift, and me and Facebook distributing the work to new people. Neither I nor Facebook become the publisher; DeviantArt remains the publisher, Facebook is the |via=. I suppose a philosophical argument can be made that they are two different kinds of publishing really, but who cares? The format-shifting and additional distribution isn't "publishing" for WP citation purposes.

This distinction is the very reason that the |via= parameter was created: to stop mis-attributing format-shifted and other repostings by random pseudo-publishers and content aggregators as the |publisher=, but retain the name of the actual publisher as such, and the name of the online distributor, so that people can find the work in the original form, not just on some possibly short-lived website, but can also use that website for convenience, and not be confused about the difference. For all we know, Google Books or Project Gutenberg could disappear tomorrow forever. The distinction is especially important for any entity that both reformats and distributes (|via=) material on behalf of external, traditional publishers, and also acts as the publisher itself, for new (generally amateur) content. Amazon is already doing this, and this kind of business model shift can happen at any time (e.g. HBO, Netflix, and Amazon are all publishers of original television and e-TV series, when formerly they were, respectively, a cable redistributor, a by-mail and later online stream redistributor, and an e-tailer, of previously published content). So, already, any such entity could appear as a |publisher= or a |via=, for different sources in the same article, and the distinction in each case would matter.

When it comes to historical sources, the original publisher information is also often of pertinent, even crucial, value, since significant differences can exist between the 1645 version of something from a London publisher, and a 1672 edition produced in Dublin, without any intermediary e-distributor like Project Gutenberg even being aware of it. Or – and this is telling – they often are aware of it, and so is Google Books, and take pains to note the actual publisher. Neither service claims to be the publisher of such works, and it is a weird form of original research for WP to insist that they are.

With that, I'm kind of tired of arguing round in circles on this stuff, and don't need to keep at it. We have separate parameters for these things for both a citation accuracy and utility reason (helping readers find and use sources) and a policy reason, en:w:WP:SAYWHEREYOUGOTIT, and neither the separation of these parameters nor the rationales for the separation are going to go away just because you don't see it the same way. I could even be totally wrong about every single thing I've said other than the last sentence and it wouldn't make any difference, since there's already a consensus to keep them separate, and it is not necessary for my analysis of why to be correct (though it is).  — SMcCandlish ¢ ≽ʌⱷ҅ʌ≼  10:40, 1 August 2016 (UTC)

David Eppstein (talkcontribs)

Perhaps the point has been missed here in the back-and-forth. Wikipedia citations use "publisher" to mean the original publisher of an edition of a work. Some of our information providers use the same keyword for a different meaning, the most recent content provider. We should not mix up these two meanings merely because they use the same keyword. If information providers are using "publisher" to mean something different than what we want it to mean, Citoid should not be blindly copying them.

SMcCandlish (talkcontribs)

Agreed, entirely. If the most recent content provider isn't the real publisher of the content, the former should be in the |via parameter. I don't know if there's a practical way to make Citoid aware of a big list of journal aggregators, news aggregators, book scanning sites, etc., to code them as |via instead of |publisher, but I hope so. If WP can maintain a URL blacklist that includes virtually all known URL redirectors (tinyurl.com, etc.), I would think that it could maintain a list of content aggregators (pseudo-republishers).
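A minimal sketch of the list-based approach suggested above, assuming a citation is represented as a plain dictionary of template parameters. This is not Citoid's actual code or data model: the function name `reassign_publisher`, the set `KNOWN_INTERMEDIARIES`, and the dictionary shape are all hypothetical, and the names in the set are simply the intermediaries mentioned in this thread (whether Project Gutenberg belongs on such a list is exactly what is disputed here).

```python
# Hypothetical list of intermediary distributors, taken from names
# mentioned in this thread. A real list would need to be maintained
# by the community, much like the URL blacklist.
KNOWN_INTERMEDIARIES = {
    "google books",
    "youtube",
    "project gutenberg",
    "pubmed",
    "jstor",
    "internet archive",
}


def reassign_publisher(citation: dict) -> dict:
    """Return a copy of `citation` in which a known intermediary that was
    reported as the publisher is moved to the `via` parameter.

    The citation is left unchanged if the publisher is not on the list,
    or if a `via` value is already present (nothing is overwritten).
    """
    result = dict(citation)  # shallow copy, so the caller's dict is untouched
    publisher = (result.get("publisher") or "").strip()
    if publisher.lower() in KNOWN_INTERMEDIARIES and not result.get("via"):
        result["via"] = publisher
        del result["publisher"]
    return result


# Example: metadata that names the scanning site as "publisher".
scraped = {"title": "An Example Book", "publisher": "Google Books"}
print(reassign_publisher(scraped))
```

A lookup like this only moves the intermediary out of |publisher; it cannot recover the name of the real publisher, which would still have to come from the source metadata or from the editor.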

Jc3s5h (talkcontribs)

It isn't quite as simple as original publisher vs. republisher. A republisher that simply copies images of the original publication and makes them available online should probably be named with the via parameter, or similar. But a publisher who re-typesets, and perhaps repaginates, an older work should be regarded as a full-fledged publisher. Some citation styles call for naming the original publisher in this situation, but the Wikipedia citation templates do not have a parameter for this purpose.

SMcCandlish (talkcontribs)

We already covered this above; one of the hazards of "necroposting" on a year-old thread. WP cites sources to help readers identify and find them and to help editors verify our content. We do not do so as a bibliographic database service; the purpose is not to track the history of a work. So, WP has no need of being able to identify a previous publisher's details. If you have a genuinely republished version with new typesetting and pagination, or even just a new foreword/introduction, this is the work you are citing, by that particular publisher. We don't care who published the first edition that had a different font, different page numbers, or no "50th anniversary" foreword or whatever. It's just not relevant.

[Conceptual aside: It's really no different from a quote being in a New York Times article, perhaps with an "[editorial tweak]" in it, and a reporter's introduction ("According to X. Y. Zounds in The Zounds Method,"). We cite the newspaper article we found the quote in, not the original primary source of the statement (unless we also have that, and have checked it, and it's appropriate to "double-up" the citation for some reason, e.g. because another source misquoted it and caused a controversy). A new edition, a real republication, of a work is a similar matter; the original material being included is essentially a giant quotation, may have been editorially altered in the course of republication, and may have new lead-in material, a big "Foreword" or "Introduction to the Nth Edition" version of a journalist prefacing a quoted statement from a speech or document.]

By contrast, |via is important, for actual WP purposes and in addition to |publisher, to use for cases of pseudo-republishing, i.e. redistribution or format-shifting, such as if you got something via a scanning site or a content aggregator:

  1. That intermediary is incidental and has no effect we care about on the content itself (e.g., we DGaF if it has an aggregator's watermark on it; that isn't substantive and does not constitute an "edition" or a new "publishing" for WP purposes).
  2. The URL or the entire aggregator itself might not be there tomorrow. I have no insider info on the budgets of Project Gutenberg, Internet Archive, Google Books, or the journal aggregators, but these things cost money to operate. We do know that at least the first two of these have had funding struggles in the past, and still publicly seek donations to keep them going. The latter two are things a profit-minded business entity could axe at any moment, or start paywalling, as a simple business decision. The only consequence of such a failure is a dead URL. The actual citation is to the original work and remains valid; the work still exists and can be found. The dead link info is removed from the citation; we do not remove from citations the names of actual publishers who have ceased operation.
  3. It may not be the most convenient or effective way for a particular reader to get the work. Examples: someone has taken a print-out of the WP article to a public library where all the Internet access kiosks, if there are any, are in use, but the library may have the original work on its shelves; or Internet access is costly and schlepping down a huge PDF is not practical, but looking at a paper copy obtained via inter-library loan is free; or a journal aggregator is not free for full text, which is accessible only for pay or at institutions with a subscription; and ... insert numerous other scenarios.

[Second conceptual aside: If I have a blog that I publish, and someone cites it, and the site goes down permanently, and it wasn't archived by the Wayback Machine or something equivalent, then that site is gone; i.e., it cannot be used by readers/editors for verification, ergo it is no longer a valid source citation. There is a big and clear difference between a conduit for a copy of a publication (e.g. Wayback.Archive.org) and the publication itself (McCandlishWorldNews.com or whatever). People seem to have unreasonable difficulty with the distinction, just "because Internet", i.e. because "a website is a website" in many minds; they're confusing the medium for the message, the delivery format for the content.]
