
There's no sure-fire way to make a human who is processing a document manually never leak the information they read.


Sure, but if there's a law forbidding you from exceeding a particular speed on the road, you can't break it and then argue "you can't be perfectly safe anyway".

The analogy here is: there are laws regarding confidentiality that probably were broken here.


I agree. But at the same time, insisting on 100% certainty/safety would mean doing nothing at all and sticking to the status quo forever. It boils down to cost-benefit calculations.

While I agree that it is unacceptable to use customer data without consent (as suggested by OP's post), I disagree with the implicit assumption behind the comment that I responded to:

Namely, the implicit assumption that human/biological intelligence/agents are somehow superior to artificial intelligence/agents when it comes to secrecy/confidentiality.

It boils down to the question of whether it is possible to create algorithms that outperform humans at tasks involving secrecy and confidentiality.

While I can't think of any reason why that should not be possible in general, I agree that the current SOTA of generative LLMs is not sufficient for that.

Is throwing lots and lots of data and RLHF training at an LLM enough to make the probability of customer data leaks small enough to be acceptable?

I don't know. But I don't trust MBAs who salivate with dollar signs in their eyes to know either. And I fear that their lack of technical understanding will lead to bad decisions. And I fear those might lead to scandals that make Gemini's weird biases in image generation pale in comparison.


> It boils down to cost-benefit-calculations.

Yes, the user bears the cost when their confidential data is leaked and the company derives the economic benefit of mishandling it, which is why this keeps happening.


I used to work with extremely sensitive data. My employer made it a point to hire people with memory disorders and intellectual disabilities to deal with raw data.

There was a young lady I had to reintroduce myself to every week or so. I think of her every so often.

I’m certain she doesn’t think of me.


Actually there is. But let us not go there.


"Search is all well and good when we are counting words, which is what data analytics and machine learning are really all about."

There are machine learning models which go far beyond counting words, for example see https://arxiv.org/abs/1502.01710


Parent links to Yann LeCun's "Text Understanding from Scratch" paper, from 2015, where the authors use a conv-net, originally built for image recognition, to do text categorisation.

The NN technique falls squarely into the "counting words" bracket, although this one is actually counting characters.

It is a great paper, with great results, but none of the models therein have an opinion on ISIS, an ability to converse, or anything the author of TFA calls cognition.
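
For intuition about what "counting characters" looks like in practice, here is a minimal numpy sketch of the input encoding and a first convolutional layer of such a character-level model; the character set, sizes, and function names are illustrative and not taken from the paper.

    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "  # illustrative, not the paper's exact set

    def quantize(text, max_len=128):
        # One-hot encode a string at the character level: one row per position,
        # one column per character in the alphabet.
        x = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
        for i, ch in enumerate(text.lower()[:max_len]):
            j = ALPHABET.find(ch)
            if j >= 0:
                x[i, j] = 1.0
        return x

    def conv1d_relu(x, filters):
        # x: (seq_len, alphabet); filters: (num_filters, width, alphabet).
        # Each filter responds to a short character n-gram, which is roughly
        # what "counting characters" amounts to here.
        num_f, width, _ = filters.shape
        out = np.zeros((x.shape[0] - width + 1, num_f), dtype=np.float32)
        for t in range(out.shape[0]):
            window = x[t:t + width]
            out[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
        return np.maximum(out, 0.0)  # ReLU

The real model stacks several such layers plus pooling and fully connected layers, but the input really is just characters, no word-level features at all.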


Does this also work for larger numbers of features? (like, 2000, as opposed to 6-9 in the demo)


Only if your dataset is really small. It only supports up to a few million points.


Once you have identified the most significant individuals in a field, you can use Google Scholar alerts to get notified when they publish something new (http://scholar.google.com/scholar_alerts?view_op=list_alerts). In the case of machine learning that would be Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng (this list is not exhaustive, of course).


You could try the following improvements to speed up neural network training:

- Resilient Propagation (RPROP), which significantly speeds up training for full-batch learning: http://davinci.fmph.uniba.sk/~uhliarik4/recognition/resource...

- RMSProp, introduced by Geoffrey Hinton, which also speeds up training but, unlike RPROP, can be used for mini-batch learning as well (a rough sketch of the update follows after this list): https://class.coursera.org/neuralnets-2012-001/lecture/67 (sign up to view the video)

Please consider more datasets when benchmarking methods:

- MNIST (70k 28x28 pixel images of handwritten digits): http://yann.lecun.com/exdb/mnist/ . There are several wrappers for Python on github.

- UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets.html
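
For what it's worth, here is a minimal numpy sketch of the RMSProp update mentioned above; the default hyperparameters are just common illustrative choices, not values prescribed in Hinton's lecture.

    import numpy as np

    def rmsprop_update(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
        # Keep a running average of the squared gradient per parameter.
        cache = decay * cache + (1.0 - decay) * grad ** 2
        # Scale the step by the root of that average, so each weight gets
        # an effectively individual learning rate.
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache

    # Usage: cache = np.zeros_like(w), then call once per (mini-)batch.

The per-parameter scaling is the point: weights with consistently large gradients take smaller steps, which is what makes it work well with mini-batches.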


Definitely a lot to read and improvements to make. I will probably do a more complete benchmark with more datasets in a later post.

Thanks for the suggestions.


You may be interested in this ICML 2006 paper, which empirically compared many standard algorithms across a combination of metrics and UCI datasets - http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icm...


Will this software focus solely on Random Forests? I had hoped to see e.g. Deep Convolutional Neural Networks as an option :(


Sometimes the underlying processes an application is designed for are too complex to be self-explanatory. Look at e.g. Photoshop: it would not work without tutorials.


Photoshop doesn't have overlay help such as this.

I'm not arguing against any help, I'm simply stating that if it's easy enough to explain with an overlay then it's probably possible to make the interface intuitive enough without it.


iPhoto on the iPad has overlays. Same sort of app, with a focus on usability, on a usability-focused device, but overlays are still needed...

