The Ever-Expanding Job of Preserving the Internet's Backpages - Slashdot

2022-10-08 07:56:17 By : Ms. Cindy Kong

Also, "security" is not the issue. It's Facebook claiming "security" is the reason they can't make people's Facebook posts, already marked public, truly visible to the public.

Even before the walled gardens of Facebook we should face the reality that we are in dead trouble already, losing tens of thousands of important, historical documents.

Starting from files written in software for a computer that simply doesn't exist in a working state anywhere, anymore. Especially government mainframes. Then when personal computers got cheaper we ran into the hard fact that many companies cannot open or reproduce files using their own software, even within the same version numbers. Microsoft Office, I am looking at you especially! Then with timelocked licensing and license expiration it got worse: you cannot legally work with the information, and there is no one to purchase licenses from anymore. Also with poorly documented (if at all!) proprietary data formats, where actual implementations don't even match the company's internal documents. Then we have online license activation, and many of those services have been lost to the world. DRM is an archivist's nightmare. Just a nightmare. E.g. Microsoft's PlaysForSure. Add in giant cloud datasets: I doubt even Facebook, Apple, Google, or Microsoft have any way to replicate any of their experiments or setups, so what hope has a historian? And even now bitrot slowly destroys files saved over time unless unusual tech is used, like gold DVDs or a redundant, checksumming filesystem like ZFS, with a good backup strategy. Yesterday, I literally had to play some old presentations from Microsoft's Silverlight on a very old machine and record them with OBS to make them survive.

Welcome, one and all, to the new digital dark ages.

You should probably add GitHub to that list. There will be a tipping point eventually on that service. Free forever? Uh huh. Right. Sure.

Yup. I've visited a few repositories and found them deleted, with no forks.

Preservation is something I care deeply about, and digital data has been an apocalyptic nightmare since forever.

There are workarounds, if not fixes, for some of the examples you mentioned.

Starting from files written in software for a computer that simply doesn't exist in a working state anywhere, anymore. Especially government mainframes. Then when personal computers got cheaper we ran into the hard fact that many companies cannot open or reproduce files using their own software, even within the same version numbers. Microsoft Office, I am looking at you especially!

So long as they're not encrypted, it should be possible to extract the important stuff out of the file using nothing more low-level than a hex editor. LibreOffice, I think, can open most of the older MS Office documents, even if the formatting isn't 100% accurate.
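As a rough illustration of the "pull the text out by hand" approach, here is a minimal Python sketch that scans raw bytes for runs of printable ASCII, similar in spirit to the Unix "strings" tool. The function name extract_strings and the four-character minimum are assumptions made for this example. It only finds ASCII, so text stored in a 16-bit encoding would be missed.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII that is at least min_len bytes long."""
    # 0x20-0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # Made-up bytes standing in for an unreadable legacy document.
    sample = b"\x00\x01Quarterly report\x00\xff\x02ab\x00Budget 1997\x03"
    for text in extract_strings(sample):
        print(text)
```

For a real file you would pass in something like Path("old.doc").read_bytes() (the file name is hypothetical) and then sift through the output by eye, since formatting codes and other noise will be mixed in with the text.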

And even now bitrot slowly destroys files saved over time unless unusual tech is used, like gold DVDs or a redundant, checksumming filesystem like ZFS, with a good backup strategy.

How bit rot affects files depends on the type of file. In media files, the most important parts are the end and start of the file. Any bit rot in between would have to be fairly significant to affect the playability of the file. DRM may or may not mess with this.
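If you don't have a checksumming filesystem, a poor man's version of the same idea is to record a hash of every file and re-check it later. The sketch below is one way to do that in Python with SHA-256; the function names (sha256_of, build_manifest, find_changed) are made up for this example. It only detects that a file changed or vanished. It cannot repair anything and cannot tell bit rot from a deliberate edit, so it is a complement to backups, not a replacement.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries whose file is missing or no longer matches its hash."""
    changed = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            changed.append(rel)
    return changed
```

In practice you would save the manifest somewhere safe (for example as JSON next to the backup) when the archive is created, then run find_changed against it every so often and restore any flagged file from another copy.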

Spot on. A few years ago I had to deal with a large dump of documents in all sorts of formats. WordStar, Lotus Notes, some formats that I've forgotten, but mainly a wide variety of Microsoft Office documents. In the end I used LibreOffice (OpenOffice at the time) to bulk convert as much as possible to HTML so it could at least be mostly read. When LibreOffice couldn't open the file we had to try and at least extract the text using "strings". What was especially annoying about the exercise was that Mi

Whenever I try to access a "gone" webcomic, the Internet Archive turns out to have missed most or all of the pictures every single time it crawled the site.

So, an article about the difficulties of archiving when sites are walled off is itself walled off.

I'll double your irony: You can read it here [archive.ph].

It used to be the case that much knowledge and history was created on forums, bulletin boards, and the like, all of which were easily indexed by search engines. You can find a lot of wisdom of the ancients [xkcd.com] there, or in some cases unanswered questions.

Nowadays it seems like everyone's on closed ecosystems like Discord, Facebook groups, and the like. In some cases, like Twitter and Instagram, the content is still indexed by search engines and is still searchable to a degree. But in most cases, the walled garden blocks access either via robots.txt or by requiring authentication by a logged-in account on the site. You'll never see any Discord messages show up in a Google search result.
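For anyone who hasn't looked at how the robots.txt side of this works, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content and the chat.example URLs are invented for illustration and are not copied from any real site. Note that robots.txt is only a request that well-behaved crawlers honor; the login wall is a separate and much harder barrier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary chat site: every crawler is
# asked to stay out of the /channels/ tree, where the messages live.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /channels/
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://chat.example/about",
                "https://chat.example/channels/123"):
        print(url, crawler_allowed(EXAMPLE_ROBOTS, "ExampleArchiveBot", url))
```

With rules like these, a polite search or archive crawler skips everything under /channels/, which is why none of that content ever turns up in an index.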

Some Good Samaritans are exporting and preserving some Discord servers' logs for archival purposes, but those are the exception and not the rule. It's a real problem for historians. Summoning Salt [youtube.com], for instance, has had to dive into Discord archives and do a lot of 1:1 DMing to reconstruct the histories of many games' speedrunning records.

Is this the sad future we're destined for?

It used to be the case that much knowledge and history was created on forums, bulletin boards, and the like, all of which were easily indexed by search engines... Nowadays it seems like everyone's on closed ecosystems like Discord, Facebook groups, and the like.

Yes. It usually takes the money-grubbers a few decades to catch on to a new "opportunity". Then, a few decades after that, people in government start to see the light.

Maybe the Amazon specials. You can write 2TB but you're never gonna get it back :D

I think the Internet Archive is more deserving of a $250 2TB 2242 NVMe drive anyway... which is about the same size and better in almost every way.

2 terabytes of data. Colossal back then, you could fit it on a $50 thumb drive now.

That is currently not possible, AFAIK, unless you get those "1TB"/"2TB" drives that report a fake capacity, then fill up and/or fail at a much smaller number. It's a problem with SSDs, microSD cards and USB flash drives on Amazon that Amazon should fix.

Moreover, this is approaching the scenario of the old, old SF short story in which all the knowledge of the galaxy is crammed into a tiny, almost microscopic computer, and then someone loses it.
