Michael Yoshitaka ERLEWINE mitcho@mitcho.com
LingBuzz is a manuscript and preprint repository for the field of linguistics and has become a valuable community resource. Citations of pre-publication manuscripts from LingBuzz occur with some regularity, often referring to the LingBuzz entry number. (More on LingBuzz) Unfortunately, the LingBuzz server is sometimes inaccessible, and in the past it has been completely down for extended periods of time. This archive is meant as a public "backup" of these files.
The archive is organized into folders corresponding to LingBuzz entry numbers and URLs. Each folder contains an index.html page, which is a recently downloaded copy of the entry's HTML page, together with each revision as a separate file (v1.pdf, v2.pdf, etc.). The archive does not contain any material that is not publicly available on the LingBuzz site.
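As a sketch of the layout described above, the files for a given entry and revision live at paths like the following (the entry number 2924 and revision 2 are made-up examples):

```shell
entry=2924   # hypothetical LingBuzz entry number, for illustration only
rev=2        # hypothetical revision number

# Each entry folder holds the saved entry page plus one PDF per revision.
echo "${entry}/index.html"
echo "${entry}/v${rev}.pdf"
```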
Note: GitHub will only show the first 1000 subdirectories, but you can construct the URL for an entry with a higher ID number directly.
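For instance, an entry folder beyond the 1000-subdirectory display limit can still be reached by building the URL by hand, using GitHub's standard tree/branch/path URL scheme and the archive branch. The USER/REPO path and the entry number below are placeholders, not real values:

```shell
repo="USER/REPO"   # placeholder: substitute this repository's actual owner/name
entry=4321         # hypothetical entry number past the 1000-folder display limit

# GitHub browse URL for that entry's folder on the archive branch.
echo "https://github.com/${repo}/tree/archive/${entry}"
```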
(Technical detail: This is the master branch. The archives are on the archive branch.)
--
If you are glad this exists, you might also like my LingBuzz RSS feed and Twitter account.
Will these files show up in search engines? No. Search engines (should) respect GitHub's robots.txt file, which disallows the crawling of files stored here. (Technical details: master branches on GitHub do get crawled by search engines, which is why I put the archive in a branch called archive.)
Can I download the whole archive at once? Yes. GitHub offers a ZIP download of the current archive, but be careful: this file will be huge. A better idea is to clone this repository locally, after which you can use Git to keep your local copy up to date efficiently.
I wrote a script called lingcrawl, which systematically reads LingBuzz and downloads any files missing from the local archive. I then push the files in my local archive up to this GitHub repository.
Right now, updating is manual. I will probably automate this soon, so that it happens perhaps once a week. (Overwhelming the LingBuzz server with frequent requests from my crawler would be counterproductive.)