|
|
|
|
Now that my list of product IDs is in the millions and I've used about 40 GB of proxy bandwidth scraping maybe 50k pages from that data, I have to carefully weigh how much I want to spend on proxies (I've spent about $30) for an experiment that could end in a simple takedown notice that stops the method. Granted, I can always reuse and modify this data. But I guarantee that if you had a million-page site built directly around real ecommerce products, you would make good money if it stays up.
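To make the cost trade-off concrete, here is a rough back-of-the-envelope sketch using only the figures from the post (about 40 GB and about $30 for roughly 50k pages). It assumes the $30 paid for exactly that bandwidth and that usage scales linearly with page count. Neither is confirmed in the post, and the function name is made up for illustration.

```python
def estimate_proxy_cost(pages_done, gb_used, dollars_spent, pages_planned):
    """Linearly extrapolate bandwidth and cost from a finished partial run.

    Assumption: cost and bandwidth both scale linearly with page count.
    """
    gb_per_page = gb_used / pages_done        # average bandwidth per page
    dollars_per_gb = dollars_spent / gb_used  # effective proxy price
    gb_needed = gb_per_page * pages_planned
    return gb_needed, gb_needed * dollars_per_gb


# Figures from the post: ~50k pages, ~40 GB, ~$30; target of one million pages.
gb_needed, cost = estimate_proxy_cost(50_000, 40, 30, 1_000_000)
print(f"about {gb_needed:.0f} GB and ${cost:.0f} for one million pages")
```

Under those assumptions a page averages about 0.8 MB, so a million pages would need on the order of 800 GB, or roughly $600 at the same rate. Skipping the large images would lower that considerably.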
|
|
|
|
|
cauz |
May 21, 2017, 7:24 p.m. |
|
|
|
|
|
have more than half a million product URLs (which is really the hard part with Amazon; they make it extremely difficult for scrapers to crawl their entire site). After cleaning up this list and potentially trying to get even more products, I will continue to modify my PHP scraper, this time for use with Amazon. It rotates through proxies and user agents, and it has worked well on Google Maps, Yelp, and your university's student directories, so it should bypass Amazon's no problem. My scraper nowadays saves all the data into XML so I can import it through certain plugins, and that also gives me a super easy way to convert it to any form I need. Originally my scraper rotated through Tor proxies and saved all data directly into MySQL. Over time I created SQL files for importing, and now that WordPress is used so extensively and doesn't receive penalties in the search engines like it used to, I can just throw all the data in there and make as many copies and variations of the sites as I want. and make it loo...
This post is a comment.
|
|
|
|
Scraping Every Product on Amazon to Make a Million Page Affiliate Site
|
|
|
|
This list of 400k product IDs includes lots of copies of the same product with a different tracking number. I'm only getting maybe 30k off that list total. I was gonna scrape more after, so on my next run of ID gathering I'll find better ways to remove redundancy and save some money. I've used almost 30 GB of bandwidth through those proxies the past few days, but I also download huge high-res images too.
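One possible way to remove that redundancy before spending bandwidth is to reduce each entry to a canonical product ID and keep only the first URL seen for each ID. The sketch below assumes the list holds URLs where the ID is a 10-character token after `/dp/` or `/gp/product/`, and that tracking data sits in the rest of the path or the query string. The pattern, the function names, and the `example.com` URLs are illustrative assumptions, not the poster's actual data or method.

```python
import re
from urllib.parse import urlsplit

# Assumption: the product ID is a 10-character uppercase alphanumeric token
# that follows /dp/ or /gp/product/ in the URL path.
_ID_PATTERN = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:/|$)")


def extract_product_id(url):
    """Return the product ID found in the URL path, or None if absent."""
    match = _ID_PATTERN.search(urlsplit(url).path)
    return match.group(1) if match else None


def dedupe_urls(urls):
    """Map each distinct product ID to the first URL that contained it."""
    first_seen = {}
    for url in urls:
        product_id = extract_product_id(url)
        if product_id is not None and product_id not in first_seen:
            first_seen[product_id] = url
    return first_seen


urls = [
    "https://www.example.com/Some-Widget/dp/B000000001/ref=sr_1_1?tag=aaa",
    "https://www.example.com/dp/B000000001?tag=bbb",
    "https://www.example.com/gp/product/B000000002",
]
unique = dedupe_urls(urls)
print(len(unique), "unique products out of", len(urls), "URLs")
```

Running the deduplication pass first means each product is fetched only once, which is where the bandwidth savings would come from.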
|
|
|
|
I bet I could scrape the images using ScrapeBox with free proxies to save on costs. The only reason I used paid proxies for the data is that I want to be sure it's US data, so I get US results for each product ID. And they're more reliable.
This post is a comment.
|
|
|
|
Oh yes. Also, nowadays I run my scripts from a server or even my localhost through wget and remove the output I use for testing. Another reason I use XML and import into WordPress is that it can manage a database of that size way more efficiently than I can. I tried to make a million-page site a long time ago, and it would take forever to load the data I put in MySQL directly off the scraper.
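As a minimal sketch of the save-to-XML step, the snippet below serializes a list of product records with Python's standard `xml.etree.ElementTree`. The element names (`products`, `product`, `title`, `image`) and the record fields are hypothetical. The post does not say which schema the import plugins expect, and the original scraper is written in PHP.

```python
import xml.etree.ElementTree as ET


def products_to_xml(products):
    """Serialize product dicts to an XML string (hypothetical schema)."""
    root = ET.Element("products")
    for product in products:
        item = ET.SubElement(root, "product", id=product["id"])
        # ElementTree escapes characters such as & and < automatically.
        ET.SubElement(item, "title").text = product["title"]
        ET.SubElement(item, "image").text = product["image"]
    return ET.tostring(root, encoding="unicode")


records = [
    {"id": "B000000001", "title": "Widget & Co Starter Kit", "image": "widget.jpg"},
]
xml_text = products_to_xml(records)
print(xml_text)
```

Keeping the data in a neutral format like this is what makes it easy to convert later, whether into a plugin import file or into SQL.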
This post is a comment.
|
|
|
|
Maybe I should make some YouTube videos. I would make one about the real concerns of AI, one about basic data science and data analysis, and one that is an introduction to neural networks.
|
|
|
|
To perform a man-on-the-side attack, the NSA observes a target's Internet traffic using its global network of covert "accesses" to data as it flows over fiber optic cables or satellites. When the target visits a website that the NSA is able to exploit, the agency's surveillance sensors alert the TURBINE system, which then "shoots" data packets at the targeted computer's IP address within a fraction of a second.
In one man-on-the-side technique, codenamed QUANTUMHAND, the agency disguises itself as a fake Facebook server. When a target attempts to log in to the social media site, the NSA transmits malicious data packets that trick the target's computer into thinking they are being sent from the real Facebook. By concealing its malware within what looks like an ordinary Facebook page, the ...
This post is a comment.
|
|
|
|
Human CAPTCHA services cost money, don't they? Not a lot, but like, pennies. Reed is saying you could have your program generate a new model by taking the word for the thing they ask you to identify and scraping Google Images to get training data.
This post is a comment.
|
|
|
|
So I scraped 450k Amazon product URLs, and now I've finally finished writing my scraper and kicked it off with some fresh proxies. Downloading massive amounts of data and images from the big A hole.
|
|
|
|
one of my modules silently edits the registry for a systemwide web proxy. i'm looking for a good solution that might change amazon pub-ids, adsense ids, and most importantly redirect according to what site they are visiting. i don't care about banking, i'm already grabbing passwords. i want to be able to dynamically inject javascript into every page they visit etc
|
|