Just memorising data is simple: use a file on disk. The hard part is recognising data when it is slightly different from what the model 'memorised', and deciding which of the millions of
things it learned best fits the answer we desire.
> Most credible machine learning systems work well on unseen data, which by definition isn't memorizing.
Sorry, but no. ML models don't generalise well outside the training data, but they can interpolate inside. The question becomes very interesting in the case of GPT-3, which has had a huge corpus of text to train on, so it has probably seen 'everything'. GPT-3 is still memorising, but it is also learning to manipulate data, like software algorithms do.
> ML models don't generalise well outside the training data, but they can interpolate inside.
I'm unsure if you just misstated this or don't know, but this is wrong.
ML models don't generalise well on data outside the distribution of their training data. But that's an entirely different thing, and doesn't mean at all they are memorising data.
Imagine a model trained on the US unemployment rate up to 2020 being hit with the COVID-era rate. It wouldn't know what to do, but that doesn't mean it wouldn't work fine on a rate of 5.342% even if it had never seen that exact rate before.
This is a simplified example, but applies to everything.
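To make the interpolation-vs-extrapolation distinction concrete, here's a toy sketch along the lines of the unemployment example; the synthetic data, the 3-8% training range, and the choice of a random forest are all just assumptions for illustration:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Toy model of "some quantity driven by the unemployment rate".
    # Training rates span roughly 3-8%, standing in for pre-2020 data.
    rng = np.random.default_rng(0)
    rates = rng.uniform(3.0, 8.0, size=500)
    target = 0.5 * rates + rng.normal(0, 0.05, size=500)

    model = RandomForestRegressor(random_state=0).fit(rates.reshape(-1, 1), target)

    print(model.predict([[5.342]]))  # unseen but inside the training range: close to 2.67
    print(model.predict([[14.7]]))   # COVID-style rate, far outside: stuck near the training maximum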
GPT-3 generation of text does pull from memorised training data. There's a lot of stuff going on there, and amongst other things there has never really been a system that does textual generation well. It's also hugely overparameterised, so lots of potential for overfitting. I don't think it's a good example of a "good" AI system - it's very interesting, full of potential, but there are lots of issues.
It generalises to almost every face, and for the ones it doesn't, its failure mode is safe.
Or something like word embeddings. Works incredibly well, and most "failure" modes are around things like bias, where the behavior reflects the real world.
Or something like AlphaZero. Not only is every new game of Go it plays brand new, it learnt to play Chess from nothing but the rules and self-play, with no human games to memorise. That just isn't memorization.
And note that demos like that all involve looking at a small number of points. It is easy to reproduce plots like that, but if you try to increase the number of points the result breaks down completely.
It is a curve fitting problem: for a small enough set of points compared to the number of dimensions, you can find a matrix that projects a set of random points to an exactly specified set of points in the plane. If you relax the problem to something like "put colors on the left side, put smells on the right side" you will get better than random performance from that kind of model, but not that much better than random.
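A quick sketch of that curve-fitting point, with made-up dimensions (300-d vectors, 20 points):

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 300, 20                       # embedding dimension >> number of points
    X = rng.normal(size=(n, d))          # n random "word vectors"
    Y = rng.uniform(-1, 1, size=(n, 2))  # arbitrary 2-D positions we want them to land on

    # Least-squares projection matrix W (d x 2) mapping X onto Y.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(np.abs(X @ W - Y).max())       # ~1e-14: the fit is essentially exact

    # With n much larger than d the same trick no longer works and the error blows up.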
Word embeddings are a strategy that approaches an asymptote. Systems that are destined to low performance will perform better if you use a word embedding, but they throw away information up front that makes high performance impossible.
This is true, but I don't think you are using word embeddings the way most people use them.
The linear relationship between things like king/queen etc is a cute demo but not really useful or used in practice.
The real usefulness of word embeddings is that similar concepts are close to each other so they make a great representation for other models (vs something like TF-IDF). These days they have been mostly surpassed in terms of state of the art by full language models, but the point is that simple techniques like average embedding of words in sentences generalised really well to unseen data.
And if you add in subword embeddings they generalise to unseen words, too.
We could talk about how context lets language models do this even better, but I'm still back trying to persuade the OP that this isn't just memorisation and good ML models work well on unseen data!
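To make the "average embedding of words in a sentence" point concrete, a toy sketch; the tiny 4-d vectors are invented for illustration, where a real system would load pretrained embeddings (word2vec, GloVe, fastText, ...):

    import numpy as np

    emb = {
        "cheap":   np.array([0.9, 0.1, 0.0, 0.2]),
        "flights": np.array([0.1, 0.8, 0.3, 0.0]),
        "low":     np.array([0.8, 0.2, 0.1, 0.1]),
        "cost":    np.array([0.7, 0.1, 0.2, 0.3]),
        "airfare": np.array([0.2, 0.9, 0.2, 0.1]),
    }

    def sentence_vec(words):
        # Average the vectors of known words; unknown words are skipped.
        vecs = [emb[w] for w in words if w in emb]
        return np.mean(vecs, axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    a = sentence_vec("cheap flights".split())
    b = sentence_vec("low cost airfare".split())
    print(cosine(a, b))  # high similarity despite the sentences sharing no words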
It's not so straightforward to go from a word representation to a query, sentence, or document representation.
If you come from the tfidf direction you can first tune up BM25 or something based on the ks-divergence, then you can use a random matrix, LDA, or the deep-network autoencoder that I worked on that crushed conventional tfidf vectors to 50-d vectors.
(Like many things people want to apply word vectors to, you go from 50% accuracy here to 70%, but we know it because we tested it on TREC gov2)
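For the random-matrix route specifically, a rough sketch (the synthetic tf-idf matrix and the dimensions are stand-ins; the BM25 tuning and the autoencoder are not shown):

    import numpy as np

    rng = np.random.default_rng(0)
    n_docs, vocab, k = 1000, 5000, 50

    # Stand-in for a sparse tf-idf matrix: ~1% of entries non-zero.
    tfidf = rng.random((n_docs, vocab)) * (rng.random((n_docs, vocab)) < 0.01)

    # Random projection matrix crushing vocab-sized vectors down to 50-d.
    R = rng.normal(size=(vocab, k)) / np.sqrt(k)
    doc_vecs = tfidf @ R
    print(doc_vecs.shape)  # (1000, 50)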
Today I'm interested in systems that have an input-to-action orientation and there you have to be able to put together a story like: "these 10 messages are parsed correctly and not by accident" and that requires that certain 'king/queen' inferences be done correctly or alternately the system has paths to recover from missing an inference.
Often there is no path to go from "popular models in the new A.I." to "something that can serve customers off the leash" and that's the problem.
Now I do like subword embeddings, but that just points out the problem that there is no such thing as a "word".
Let me justify that.
You can split up English into words with something like "some text".split(), but it is not easy to do from audio. Speech is punctuated by silences, often in the middle of words (any time you make a stop consonant, as in "[st]op"), enough that separating words is equivalent to the whole speech-understanding problem.
We can turn words into subwords and mash them together with subwords to make words. (e.g. "Fourthmeal", "Juneteenth", "Nihilego")
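A toy greedy longest-match segmenter to illustrate; the subword inventory here is invented, where real systems learn one (BPE, WordPiece, etc.):

    SUBWORDS = {"fourth", "meal", "june", "teenth", "nihil", "ego", "un", "believ", "able"}

    def segment(word, vocab=SUBWORDS):
        word = word.lower()
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):   # try the longest piece first
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:                               # nothing matched: emit a single character
                pieces.append(word[i])
                i += 1
        return pieces

    print(segment("Fourthmeal"))    # ['fourth', 'meal']
    print(segment("Juneteenth"))    # ['june', 'teenth']
    print(segment("Nihilego"))      # ['nihil', 'ego']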
Also there are many cases you can replace a phrase with a word or a word with a phrase. Putting 'word' at the center of a model means the system is going to be in trouble w/ linguistic phenomena that happen 30% of the time.
Not sure what you think this shows, but there are a lot of reasons why the results they show don't really matter much - or at least might actually reflect approximately the accuracy a human would also achieve.
Their headline claim is that a "1 pixel change reduces accuracy by 30%". The test process for that number is this:
> We choose a random square within the original image and resize the square to be 224x224. The size and location of the square are chosen randomly according to the distribution described in (Szegedy et al., 2015). We then shift that square by one pixel diagonally to create a second image that differs from the first one by translation by a single pixel.
So... they are taking a random square, downsampling to 224, moving and then predicting on that subset of the original image, and measuring the performance against the accuracy of the original prediction.
What this seems to show is that "CNNs aren't as accurate at making predictions on subsets of an image as on the whole image". This is of course to be expected, and is exactly how a human would perform.
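Here's my reading of that protocol as a sketch (the square-size distribution is simplified, and the filename and size bounds are placeholders):

    import numpy as np
    from PIL import Image

    rng = np.random.default_rng(0)

    def crop_pair(img, size=224):
        w, h = img.size
        s = int(rng.integers(size, min(w, h) - 1))   # side of the random square (simplified distribution)
        x = int(rng.integers(0, w - s - 1))
        y = int(rng.integers(0, h - s - 1))
        a = img.crop((x, y, x + s, y + s)).resize((size, size))
        b = img.crop((x + 1, y + 1, x + 1 + s, y + 1 + s)).resize((size, size))  # 1-pixel diagonal shift
        return a, b

    # a, b = crop_pair(Image.open("photo.jpg"))  # feed both crops to the CNN and compare predictions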
I'm very familiar with the BlazeFace and FaceMesh models (which are related to this in Google's MediaPipe framework).
They have weaknesses - they aren't designed for running upside down, for example, so if they get data oriented that way they will tend to fail.
They aren't designed for liveness detection, so you can show them a printed picture of a face and they will detect it.
But they give you confidence scores etc., so if you give them something like a caricature of a face they will return confidence numbers indicating they aren't as sure of their predictions.
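For example, with MediaPipe's Python face-detection solution (which wraps BlazeFace) you can read those confidence scores directly; exact field names may differ between MediaPipe versions, and the image path is a placeholder:

    import cv2
    import mediapipe as mp

    mp_fd = mp.solutions.face_detection

    with mp_fd.FaceDetection(min_detection_confidence=0.5) as detector:
        img = cv2.imread("portrait.jpg")
        results = detector.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        if not results.detections:
            print("no face found (the safe failure mode)")
        else:
            for det in results.detections:
                # det.score is the model's confidence; treat low values with caution
                print("face detected with confidence", det.score[0])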
Actually, Google eventually wants users to find everything with predictive AI giving it to them before they search. That's not really a secret, they've announced more than once in the past that that is what they are increasingly working toward.
Is making ever-smaller scales possible a repeatable process, or does it depend on new findings in science that no one anticipated?
I'd really like to know whether the fab producers repeat some kind of procedure to create smaller chips, or whether a new prodigy with breakthrough findings has to be found every other year to keep up with Moore's Law.
Making transistors and connections is a bit like Minecraft: you need blocks to create structures and in this case the blocks are atoms of semiconductors.
It means that, leaving whatever physics considerations aside, the size of structures is limited by the size of the building blocks.
1 nm is about 5 silicon atoms, so structures cannot be physically shrunk much further. I suspect there are further physics limitations too (e.g. you may need an isolation channel to be wide enough to actually isolate, by preventing things like tunneling, etc.).
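Back-of-envelope check of that figure, using the silicon covalent radius (~0.11 nm) from standard tables:

    si_diameter_nm = 2 * 0.11      # silicon covalent diameter, roughly
    print(1.0 / si_diameter_nm)    # ~4.5 atoms across one nanometre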
Each additional step requires solving a lot of new engineering problems and usually some new physics too, and it's not a linear process: many techniques are proposed that become cost-effective only when previous techniques hit their size limits. Then a lot of research is done, most techniques never get to work at scale, and just a few pan out and become the next process.
I'm no expert. I think lithographic processes are very important.
That's why companies like ASML are so critical. The technology to work at these scales and wavelengths is very tricky.
From Wikipedia: "ASML manufactures extreme ultraviolet lithography machines that produce light in the 13.3-13.7 nm wavelength range. A high-energy laser is focused on microscopic droplets of molten tin to produce a plasma, which emits EUV light." (that's UV close to the X-ray range...)
I used to work for AWS and had to deep dive into IAM to build a feature.
Basically, every time you touch AWS your session is tagged with your credentials and has a unique ID. So everything downstream you touch has your session ID associated with it.
Now say somebody from Redshift wants to access the customer's data. They would then need access to the encryption key in KMS. The trail will be there, since KMS lives in the customer's account (you can audit your own access). And for production services, human actors cannot access these keys - only production credentials can. An engineer who can log into a prod host can in theory grab the temporary credentials there, but they expire in 15 minutes, so your trail will be rather visible. Also, access to a prod host has a high bar - only senior people can do it.
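As a sketch of what "audit your own access" looks like in practice, a customer can pull KMS events out of CloudTrail; this is just one way to do it, and the filter shown is illustrative:

    import boto3

    ct = boto3.client("cloudtrail")

    # Look up recent API calls made against KMS (e.g. Decrypt, GenerateDataKey).
    events = ct.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventSource", "AttributeValue": "kms.amazonaws.com"}],
        MaxResults=50,
    )
    for e in events["Events"]:
        print(e["EventTime"], e["EventName"], e.get("Username"))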
Now in theory somebody could coordinate with a malicious user on the KMS team - but the bar is high. Also, the actual master key never leaves KMS's premises, so the attack surface is very limited.
Of course there are some core teams like IAM and KMS where, if they become vulnerable, the whole thing falls apart. But that's a big stretch for those systems since they are the core of the business.
I think perhaps you misunderstand the architecture of KMS. KMS master keys are used to remotely decrypt the symmetric encryption keys for encrypted data that are stored alongside the encrypted data. KMS master keys don't ever leave the KMS servers themselves, and servers can't be accessed directly by anyone. AFAIK they don't have open ports except for handling production traffic and are hardened against opening a shell. An engineer on a different team with access to a host running a customer workload could potentially run off with a temporary customer credential being used by the customer workload, which they could then use to call KMS to decrypt encryption tokens for as long as the credential lasted. But they couldn't get at the KMS key itself or retain access past the expiration of the stolen credential, and all of the aforementioned audit logs would report all of the activity of the stolen credential.
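A minimal sketch of that envelope-encryption pattern with boto3; the key alias is a placeholder:

    import boto3

    kms = boto3.client("kms")

    # The master key never leaves KMS; we only ever get a data key derived under it.
    resp = kms.generate_data_key(KeyId="alias/my-app-key", KeySpec="AES_256")
    plaintext_key = resp["Plaintext"]     # used locally to encrypt the data, then discarded
    wrapped_key = resp["CiphertextBlob"]  # stored alongside the encrypted data

    # To read the data later, the wrapped key is sent back to KMS for decryption.
    # This call is what IAM/key policies gate and what shows up in CloudTrail.
    plaintext_key_again = kms.decrypt(CiphertextBlob=wrapped_key)["Plaintext"]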
I think you misunderstand my concern. What I'm missing in the above scenario is that a resource that should be 100% under the control of the customer and nobody else can be accessed by AWS personnel to open up a door that should be closed unless the customer permits access.
What the technical implications are is moot, the process that hands out these credentials should not be accessible to anybody but the customer. It implies that AWS personnel can impersonate customer representatives or processes run on behalf of those customers. That's a serious problem.
In all the years that I've been co-locating I do not remember a single instance where a representative of the hosting facilities that I've used gained access to our data or hardware without my very explicit permission.
As for audit logs: they are only as useful as those inspecting them, and more often than not are entirely passive until required for evidentiary purposes.
> It implies that AWS personnel can impersonate customer representatives or processes run on behalf of those customers. That's a serious problem.
Rather than being a serious problem, I think it's more of an obvious fact. AWS personnel build services that specifically exist to act on the customer's behalf with delegated credentials. Any time you configure a managed service to run with an IAM role, that service assumes the role and acts with the credentials granted to the role. AWS personnel have emergency access to the systems running their services, and by their very nature those services are in possession of customer credential sets for the IAM roles that the service is configured to use.
For example, a Lambda Function can be configured to run with a particular role. When the Lambda service goes to run the function, it fetches the role credentials from IAM and makes them available to the running Function. It could not be otherwise, because the purpose of a managed service like Lambda is to carry out actions on behalf of the customer. The role's credential set is as much a piece of data as the code of the function to be executed.
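The delegation pattern itself looks roughly like this (role ARN and session name are placeholders; Lambda does the equivalent for you under the hood):

    import boto3

    sts = boto3.client("sts")

    # The service assumes the customer's role and receives temporary credentials.
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/my-function-role",
        RoleSessionName="lambda-execution",
    )["Credentials"]

    # Those temporary credentials are themselves data the service holds while
    # running your code, and they expire automatically.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )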
But leaving all of this aside, of course AWS personnel can access any and all data you store in their systems. They are legally obligated to turn whatever you have stored over to the courts in response to a warrant. So not only could they gather up your data by this roundabout method of misappropriating credential sets, they must have a way to simply access all of the data directly in a way that doesn't appear in audit trails. I assume for simplicity that the IAM service simply has an endpoint accessible to the company's lawyers that will serve up forged customer credentials on demand.
I believe you're misunderstanding how KMS works and is exposed. You probably want to look at the concept of “KMS grants.” Those regulate which principals, including service principals, can use CMK material. The customer controls those grants. There are also substantial public docs, and more available on request, around the implementation, certification, and compliance of KMS infrastructure. If KMS is insufficient for your needs, CloudHSM is available for something even closer to a “hosted HSM” than a “key service.”
In short, IAM controls everything, there is no “back door” or universal admin access, and KMS is used to perform sensitive operations, NOT to hand secrets to arbitrary (internal or external) consumers.
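For reference, a grant is something the customer creates and revokes; the ARNs below are placeholders:

    import boto3

    kms = boto3.client("kms")

    # The customer decides which principal may use the CMK, and for which operations.
    grant = kms.create_grant(
        KeyId="arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
        GranteePrincipal="arn:aws:iam::123456789012:role/redshift-cluster-role",
        Operations=["Decrypt", "GenerateDataKey"],
    )
    print(grant["GrantId"])

    # Grants can be listed and revoked by the customer at any time:
    #   kms.list_grants(KeyId=...)
    #   kms.revoke_grant(KeyId=..., GrantId=...)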
Someone with the right access to the KMS service could, in theory, change a key policy to allow access to a bad guy - because someone has to have access to key policies, since customers lock themselves out of their keys all the time.
But no one can export the private key itself, and key policy changes are very heavily audited by AWS (and can be by the customer, too). This is all attested to by the third-party audits AWS receives.
Yes, they can. However, that will leave a trail in the KMS CloudTrail - unless they manage to exploit CloudTrail as well. That's a lot of barriers to bypass, especially because accessing all these services requires you to be in the correct permission group with a hardware MFA token.
Somebody can access the key hardware, but they can't extract the actual key out of it. However, I've never met anyone with that level of access - and AFAIK you have to go through various security clearances and approvals before such human intervention is permitted.
There's no such thing as perfect security - but KMS is as solid as any centralized key management I can see at the moment. And customers can run their own key server, managed in their own data center, as well.
Plus, if there is any legitimate concern about AWS having access to KMS keys (at this point the concern would be that they own the servers, and that's about it), you can roll out CloudHSM and import your own keys.
KMS is very clear about its usage and what it involves. It's obvious that with symmetric encryption AWS needs to know the key at some point so that it can decrypt the data.
However, as customers can't even export these keys and the whole system is based on using KMS to actually perform the decrypt operations, extracting them is a non-starter. It's a lot more secure than most infrastructure, which probably encrypts locally but is stored in a broom cupboard with a $10 lock.
> It's obvious that with symmetric encryption AWS needs to know the key at some point so that it can decrypt the data.
It's worth noting that even symmetric keys don't imply direct access to the secret itself. You can instead use the highly controlled secret material to derive less sensitive material - for example, a hash derived from a known input plus the secret. A third party can use this to prove that two other parties both have (or had) access to the shared secret, but the third party never needs to access the secret itself.
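A small sketch of that idea with a plain HMAC (the names and the challenge string are made up):

    import hashlib
    import hmac
    import secrets

    shared_secret = secrets.token_bytes(32)        # held by parties A and B only
    challenge = b"prove-you-know-it"               # public value chosen by the verifier

    tag_from_a = hmac.new(shared_secret, challenge, hashlib.sha256).digest()
    tag_from_b = hmac.new(shared_secret, challenge, hashlib.sha256).digest()

    # A third party can compare the tags to confirm A and B hold the same secret
    # without ever seeing the secret itself.
    print(hmac.compare_digest(tag_from_a, tag_from_b))  # True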
I can tell you generally how this works in Azure; I can't speak for AWS, but unless a customer is using BYOK for encryption of their data, I can't imagine how AWS couldn't be capable of accessing data, and even then I wouldn't guarantee they couldn't still get your data. In Azure (as of a couple of years ago), in order to access a customer's tenant it required VP approval, the support engineer was granted access for a specific amount of time, and typically only to specific services, all with the customer's knowledge beforehand. It may have changed since the last time I had to go through this process, which was restricted to blue badge employees. I have worked support cases since then and the support engineer would not even do a LogMeIn/WebEx etc. session, as they said they were not allowed to see the portal. But it may have been that they were not a blue badge and/or because the customer was a critical infrastructure customer.
In order for AWS to comply with law enforcement they must have some way of accessing data; that is NOT to say they do this for business purposes.
At the end of the day there's obviously nothing other than remotely storing your keys that will keep your data opaque. Even supposing that the IAM team doesn't have a way to forge a valid credential if they need to, the confirm/deny response of their service to authorization checks is the source-of-truth for whether a credential is valid, and they could update their service endpoint to affirm bad credentials if they wanted to. Presumably for law enforcement purposes they have a way to forge a credential that doesn't show up in audit logs.
Other than the data each service actually retains themselves (i.e. the Lambda service themselves store your Lambda Functions because they need to execute them) customer data is generally stored encrypted at rest with KMS keys belonging to the customer (or sometimes managed by the storage team). It wouldn't be possible to peer into unencrypted data without persuading the KMS API to authenticate your access to the key. Presumably this capability exists, because otherwise Amazon wouldn't be able to honor warrants for customer data, but the premise that KMS is handing out decryption tokens for customer data for the benefit of Amazon Retail's business analysts is pretty silly.
And of course, you're always vulnerable to someone with access to the physical host of an EC2 instance where your workload is running. Only GCP AFAIK offers an encrypted-in-processing compute service, and it's like a week old.
As I said below, this is something that they talk about like every freaking day. They talk about customers' data as the most important thing to take care of.
Basically, it is preferable to take a bullet in the head than to ever reveal or tamper with customers' data.
I cannot answer your question about who has access or not, but I'm telling you what the culture is when it comes to customers' data.
At the end of the day I was just another IC doing menial work, so probably not a good reference, but that was my experience.