What to do with the scraped threads?

Current scraping is finished.

931 UG threads: -
https://justpaste.it/e6uio

31 OG threads: -

https://justpaste.it/evlk9

@stevezur has also done excellent work and scraped a number of OG megathreads: -

https://og.discourse.group/t/i-have-scraped-a-number-of-threads-from-the-ug-happy-to-take-more-requests/143/191

If the threads are to become available either here or elsewhere, there is still a lot of work to do, but the most important thing is that the data has been captured.

My original plan was to host them here as part of the forums they belong to, but @Onikage is not keen on the idea, preferring to have them archived in an Oldground forum. @DY3 and others on the boxing forum want the old boxing thread resurrected. I am very keen on that idea.

This all rests on the assumption that it can be done. It would take a fair amount of work, as well as help and permission from @Onikage.

So I am throwing it out there. What do you guys want?

9 Likes

In!

1 Like

Who’s the techie guy doing the work? Yourself?

1 Like

While it would be nice to have those threads restored directly in here I think that an archive ground may make some sense. In the old threads a lot of users don’t even exist anymore, and that may cause some heartburn when trying to move posts with them over.

Some of the threads have already been re-made, so maybe an Archiveground is the best option.

Talking out of my ass and doing no research, if Discourse has an API to do the heavy lifting, it would be nice to pull them in here.

1 Like

The users not existing is going to be an issue regardless, because every post and thread needs an author; otherwise it will be non-conforming data. I think we can get around that by: -

  1. Automatically filling in as many from active users here as possible.
  2. Creating a single dummy user, e.g. “MissingThreadUser”, for all other threads/posts where the user doesn’t exist here.
  3. Placing at the bottom of the post “Archive post. Originally created by WhatEverTheirNameWasHere”.
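
To make the three steps above concrete, here is a rough Python sketch of how the author mapping might work. It is only an illustration, not anything tied to Discourse itself: `ArchivedPost` and `resolve_author` are made-up names, matching usernames case-insensitively is my assumption, and I have assumed the "Archive post" footer is only added when the original user is missing here.

```python
from dataclasses import dataclass

# Name of the single dummy account suggested in step 2.
PLACEHOLDER_USER = "MissingThreadUser"


@dataclass
class ArchivedPost:
    """Hypothetical shape of one scraped post."""
    original_author: str
    body: str


def resolve_author(post, active_users):
    """Return (author_to_post_as, body_to_store) for one archived post."""
    # Step 1: reuse the existing account when the name matches one here.
    # Case-insensitive matching is an assumption, not a known forum rule.
    lookup = {name.lower(): name for name in active_users}
    match = lookup.get(post.original_author.lower())
    if match is not None:
        return match, post.body

    # Steps 2 and 3: fall back to the dummy user and credit the
    # original author at the bottom of the post.
    footer = f"Archive post. Originally created by {post.original_author}"
    return PLACEHOLDER_USER, f"{post.body}\n\n{footer}"
```

Matching on name alone could attribute a post to the wrong person if someone registered an old username here, so a real import would probably want a manual check of the matches.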

I need to get an example Discourse DB to see its structure, as the data will need to be remodelled back into a DB-compliant format. I will also need to look into ID clashes etc.

3 Likes

I would be worried about the liability of posting content owned by Alta and giving them some kind of opening against us. I’m not a lawyer though.

2 Likes

@Onikage I think I am now leaning more towards Oldground/Archiveground but I would like the boxing thread resurrected.

@DY3, @FloppyDivac, @James_III, @SmilinLikeRashad, @MMApurist, @FerrisWheeler_CTT, @TexDeuce… what say you?

4 Likes

I have thought about this but I would assume it is publicly available content, created by their users?

1 Like

2 Likes

Am a fan of that idea

3 Likes

Yeah I think that is a really good idea.

2 Likes

I think it’s a great idea too. I’ve seen a lot of recreated Threads so far and the OldGround has a good ring to it.

3 Likes

I have all 140k+ json files and images from the “pics to troll sjw…”

It’s about 35GB.

I wrote scripts to parse the json and download the images. Usernames, dates, comments, and images. 99% of it could be reconstructed if desired.
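
For anyone curious what the parsing side of that could look like, here is a minimal sketch. The field names (`username`, `date`, `comment`, `images`) are purely assumed for illustration and are not the real layout of the scraped files; `ScrapedPost` and `parse_post` are hypothetical names too.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ScrapedPost:
    """The four pieces of data mentioned above, for one post."""
    username: str
    date: str
    comment: str
    image_urls: list = field(default_factory=list)


def parse_post(raw_json: str) -> ScrapedPost:
    """Parse one scraped JSON document into a ScrapedPost.

    The keys read here are an assumed schema, not the actual one.
    """
    data = json.loads(raw_json)
    return ScrapedPost(
        username=data["username"],
        date=data["date"],
        # Posts that are only an image may have no comment text.
        comment=data.get("comment", ""),
        image_urls=list(data.get("images", [])),
    )
```

The image URLs collected this way could then be fed to a separate download step.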

2 Likes