/sci/ - Soyence and Technology

I fucking love science!
File: seetherald2.png (66.99 KB, 206x255)

№24271

>try to use go's http library to access https://soyjak.party/soy/threads.json
>error 403, cloudflare has decided the request is le bad and a captcha must be solved first
>try to access it through curl, wget and python's requests library
>all run into the same issue
>have the idea to include my browser's user-agent in the request header
>this fixes it for curl, wget and python but go is still running into the 403 error
how do i get around this? i've tried adding other headers like accept-language and referer to no avail
it doesn't make sense that it's rejecting all GET requests sent through go's http client no matter what yet everything else just werks once i add a user-agent header
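
For reference, the user-agent step described above can be sketched in Python's standard library. This is only an illustration of attaching the header: the `BROWSER_UA` string and the `build_request` helper are made-up names for the example, and the request is built but never sent.

```python
import urllib.request

# Hypothetical browser-like User-Agent string (an assumption for this sketch);
# swap in whatever your own browser actually sends.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0"


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent}, method="GET")


if __name__ == "__main__":
    req = build_request("https://soyjak.party/soy/threads.json")
    # urllib stores header names capitalized, hence "User-agent" on lookup.
    print(req.get_method(), req.full_url, req.get_header("User-agent"))
```

Passing the returned object to `urllib.request.urlopen` would send it. Whether the response is a 403 still depends on the server side, as the rest of the thread shows.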

№24272

File: ClipboardImage.png (6.58 KB, 299x63)

i've had this problem in the past. it didn't involve the sharty, but the problem was essentially the same: cloudflare breaking my balls when i wanted to scrape stuff.
what i ended up doing was creating a Python microservice that i routed all my HTTP requests through, which used this package:
https://pypi.org/project/cloudscraper/
i have no idea what this thing does internally, but it does work.
in pic related you can see how i've used this thing in the microservice

№24273

>>24272
thanks, i'll give this a try

№24274

>http
just stop being a contrarian and use python

№24312

>>24273
did it work?

№24314

File: ClipboardImage.png (129.42 KB, 573x602)

>>24312
the library he linked works great, though i realised the microservice approach was overkill for what i'm doing, since after getting the json i'm just throwing the data i need from it straight into a sql database
so i've instead opted to take the advice of >>24274 and just write the scraper part in python since i can access the database it writes to with go code anyway
pic related, /soy/ post ids and urls selected from the database and sorted by date
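
A rough sketch of that kind of storage and query, using Python's built-in `sqlite3`. The thread doesn't say which database or schema is actually in use, so the `posts` table, its columns, and the helper names here are all assumptions for illustration.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Assumed row shape for this sketch: (post number, thread number, unix time, url).
Row = Tuple[int, int, int, str]


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) posts table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               no        INTEGER PRIMARY KEY,
               thread_no INTEGER NOT NULL,
               time      INTEGER NOT NULL,
               url       TEXT    NOT NULL
           )"""
    )
    return conn


def save_posts(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Insert scraped posts; re-scraping the same post overwrites the old row."""
    conn.executemany(
        "INSERT OR REPLACE INTO posts (no, thread_no, time, url) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def posts_by_date(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Return (post id, url) pairs, newest first."""
    return conn.execute(
        "SELECT no, url FROM posts ORDER BY time DESC, no DESC"
    ).fetchall()
```

Since the result is an ordinary database file when a real path is passed to `open_db`, a program written in another language (Go, in this thread) can read the same table with its own SQLite driver.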

№24317

>>24314
why are you scraping html files of threads when vichan already makes .json versions of them?

№24318

>>24317
that's not what i'm doing at all, i'm downloading threads.json, then downloading the json of every thread in that list
then i use that data to generate a link to every post on the board, even replies (you can't lookup replies through the json api, only threads)
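
A small sketch of the link-generation step. The URL layout is an assumption: it supposes thread pages live at `/<board>/thread/<thread_no>.html` and that a reply is addressed with a `#<post_no>` fragment. The `thread_dir` parameter is there because other vichan installs may use a different directory name.

```python
def post_url(
    board: str,
    thread_no: int,
    post_no: int,
    base: str = "https://soyjak.party",
    thread_dir: str = "thread",  # assumed directory name, adjust to the site
) -> str:
    """Build a link to a post, given the thread it lives in."""
    url = f"{base}/{board}/{thread_dir}/{thread_no}.html"
    if post_no != thread_no:
        # Replies are reached through an anchor on their thread's page.
        url += f"#{post_no}"
    return url


if __name__ == "__main__":
    print(post_url("sci", 24271, 24323))
```

The thread number has to come from somewhere, which is why the reply-to-thread mapping matters for replies.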

№24322


>>24314
nice, whatever works for you i guess
i went the microservice way because i dislike python enough that i didnt want to have to deal with it at all lel, but yea it was kinda overkill for my usecase too
are you the same guy who was building a 4stats for the sharty?
>>24318
you actually can look up the replies too. replace ".html" with ".json" and you'll see.

№24323

>>24322
>are you the same guy who was building a 4stats for the sharty?
yep

>you actually can look up the replies too. replace ".html" with ".json" and you'll see.

that's looking up a thread, which i'm already making heavy use of
the issue isn't getting a thread's replies, it's finding out which thread a reply was posted in
if you try to do that with a post number that's not the OP of a thread you'll get a 404 error because it's a reply and not a thread

threads.json contains the post number, page number and last modified timestamp of every thread on the board
[thread_number].json contains every post in a thread, including both the op and replies

the problem i'm working around by scraping the entire board and building a local index is that there's no API for finding posts by their post numbers that works for replies
the reason i want to look up post numbers and find their threads is that i also want to have a page like get watcher's most recent view, where you can see the most recent posts on the board, including sages
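
A minimal sketch of such a local index, mapping every post number (OP or reply) to its thread. The JSON shapes are assumptions based on the description above: `threads.json` is taken to be a list of pages, each with a `threads` list whose entries carry a `no` field, and a thread's JSON is taken to have a `posts` list whose entries also carry `no`. `fetch_thread` is a hypothetical callable that returns the parsed JSON for one thread, so the download method stays out of the picture.

```python
from typing import Callable, Dict, List


def thread_numbers(threads_json: List[dict]) -> List[int]:
    """Flatten the per-page listing into a plain list of thread numbers."""
    return [t["no"] for page in threads_json for t in page["threads"]]


def build_index(
    threads_json: List[dict],
    fetch_thread: Callable[[int], dict],
) -> Dict[int, int]:
    """Map every post number on the board to the thread it was posted in."""
    index: Dict[int, int] = {}
    for thread_no in thread_numbers(threads_json):
        for post in fetch_thread(thread_no)["posts"]:
            index[post["no"]] = thread_no
    return index
```

With the index built, finding a reply's thread is a dictionary lookup such as `index[24323]`, with no request to the site needed.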

№24326

>>24323
>there's no API for finding posts by their post numbers that works for replies
you can do https://soyjak.party/search.php?search=id%3A24323&board=sci
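
For a one-off lookup, a URL of that form can be assembled with `urllib.parse.urlencode`, which takes care of escaping the colon. The `search_url` helper is a made-up name for this sketch, and it only builds the string without sending anything.

```python
from urllib.parse import urlencode


def search_url(board: str, post_no: int, base: str = "https://soyjak.party") -> str:
    """Build a search.php URL that looks up a single post by its number."""
    query = urlencode({"search": f"id:{post_no}", "board": board})
    return f"{base}/search.php?{query}"


if __name__ == "__main__":
    print(search_url("sci", 24323))
```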

№24328

>>24326
oh cool, didn't realise you could use search.php to search by id
going to stick to my current method though because it's faster and i needed to build a local index to get PPH stats anyway
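
A sketch of how a PPH figure could be computed from a local index, assuming PPH means posts per hour and that each stored post has a unix timestamp. The `posts_per_hour` helper and its trailing-window definition are assumptions for this example, not necessarily how the stats site computes it.

```python
from typing import Iterable


def posts_per_hour(timestamps: Iterable[int], now: int, window_hours: float = 1.0) -> float:
    """Average posts per hour over the window that ends at `now`."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    cutoff = now - window_hours * 3600
    # Count posts made after the cutoff and not later than `now`.
    recent = sum(1 for t in timestamps if cutoff < t <= now)
    return recent / window_hours
```

Widening `window_hours` smooths the number out, and a one-hour window follows bursts more closely.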

№27223

bump

№27225

>>24326
I know this is a 3 month old post but scraping via search.php is an excellent way to get it removed.

№27232

File: 1711680566534x.gif (11.94 MB, 350x640)

>>>24326
>I know this is a 3 month old post but scraping via search.php is an excellent way to get it removed.

№27234

File: ClipboardImage(7).png (75.89 KB, 636x822)

>>>24326
>I know this is a 3 month old post but scraping via search.php is an excellent way to get it removed.

№27280


№27465


№27547

>>24314
is that the marge.moe operator?


